"Yandex" — I am not sure how many people have heard of this company, but let me tell you a fact: it is a major player in the search space in Russia and the former Soviet Union. Yandex has launched several open source projects, and of these, the one that has created the most noise is called CatBoost.
Sounds interesting right? And yes, it has nothing to do with cats.
CatBoost is a machine learning library for gradient boosting on decision trees, and it shines when working with categorical data. The more we work with real-world datasets, the more categorical features we discover in them. CatBoost lets you create and build models without having to one-hot encode the data first. The library can also be used alongside other ML libraries such as Keras and TensorFlow.
But why would we use it? How does it perform compared to other gradient boosting techniques? And when should we use it? These are important questions, which I will attempt to answer.
Here are some of the great advantages of CatBoost:
- It is reported to offer fast training on both GPU and CPU.
- Because it uses symmetric (oblivious) trees, inference is fast.
- Its boosting scheme helps reduce overfitting and improves model quality.
- It has sophisticated support for categorical features.
- It is easy to use and supports both Python and R.
Its training speed is said to be about four times faster on larger datasets, and about two times faster on smaller ones, compared with other boosting methods.
CatBoost can perform very well in situations where the data changes frequently. Its boosting algorithm produces predictions in very little time, which makes for fast model serving.
Using CatBoost's parameters, we can tune the model to the situation at hand based on how we weight them. It also makes it easy to monitor the error function during training.
One of the cool things about CatBoost is its stability when changing hyperparameters, especially when it is used with large training sets. It gives good results with the default parameters, thereby saving time on parameter tuning.
When we consider all of these great features, available in one package, it is definitely a racer, in the true sense of the word, among gradient boosting techniques. Still, there is plenty of debate about its performance compared to XGBoost, LightGBM and others. So it is best to gain your own experience: which technique is best for you really depends on the problem you face and the dataset you have to work with.
To practice and learn more, here is how to install it.
In Python:
pip install catboost
In R:
devtools::install_github('catboost/catboost', subdir = 'catboost/R-package')
Do explore it and share your experience. One key feature I liked is the way it encodes categorical values automatically using various statistical methods.
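The core idea behind those statistical methods is target-based encoding computed in an ordered fashion. Here is a simplified, pure-Python illustration of that idea — not CatBoost's exact algorithm, and the prior and weight values are arbitrary — where each row's category is replaced by the smoothed target mean over only the rows seen before it, avoiding target leakage:

```python
def ordered_target_encode(categories, targets, prior=0.5, weight=1.0):
    """Replace each category with a smoothed mean of the target,
    computed using only earlier rows (a simplified 'ordered' statistic)."""
    sums, counts = {}, {}
    encoded = []
    for cat, t in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed running mean of the target for this category so far.
        encoded.append((s + prior * weight) / (c + weight))
        # Only now fold the current row in, so it never sees its own target.
        sums[cat] = s + t
        counts[cat] = c + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
enc = ordered_target_encode(cats, ys)
# enc -> [0.5, 0.5, 0.75, 0.8333..., 0.25]
```

The first occurrence of each category falls back to the prior, and later occurrences shift toward that category's observed target mean, which is the intuition behind CatBoost's ordered statistics.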
If you found this article interesting, why not review the other articles in our archive?