XGBoost, LightGBM, CatBoost -- trying to stand on the shoulders of giants
2022-06-26 03:22:00 【Not reassuring】
Preface
Recently I took part in iFLYTEK's telecom customer churn prediction challenge. The task is a typical binary classification problem, with AUC as the evaluation metric. I learned a lot from the official baseline, so here is a summary.
- baseline: https://mp.weixin.qq.com/s/nLgaGMJByOqRVWnm1UfB3g
- competition: https://challenge.xfyun.cn/topic/info?type=telecom-customer&ch=ds22-dw-zs01
The baseline provides a training strategy (KFold) and three boosting algorithms (XGBoost, LightGBM, CatBoost); this article focuses mainly on them.
KFold
In k-fold cross-validation, the initial sample is split into k subsamples. A single subsample is held out as data for validating the model, and the other k-1 subsamples are used for training. This is repeated k times so that every subsample is used for validation exactly once, and the k results are averaged (or combined in some other way) to yield a single estimate. With k=5, for example, each of the five folds serves as the validation set once.
The baseline essentially writes this idea out by hand. Anyone familiar with grid search (Grid Search) who has used the sklearn package will have noticed that the **GridSearchCV()** class takes a parameter cv, which specifies exactly this number of cross-validation folds. With this scheme, all of the data participates in both training and validation, which effectively guards against overfitting and underfitting, so the final result is more convincing. The baseline also mentions StratifiedKFold (although it does not use it). It is an enhanced version of KFold; the biggest difference is that StratifiedKFold splits the data according to the label y, preserving the proportion of each class, so every fold has the same class ratio as the whole data set. That makes it well suited to imbalanced data sets. In this competition the official training data contains 150,000 rows, of which 75,042 are positive samples, almost 1:1, an admirably balanced proportion.
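As a quick illustration of the cv parameter (the estimator and parameter grid below are made up for demonstration and are not from the baseline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=42)  # toy data

# cv=5 runs 5-fold cross-validation for every parameter combination
grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"learning_rate": [0.1, 0.03], "n_estimators": [100, 300]},
    scoring="roc_auc",  # the competition metric
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```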
Besides k-fold, there are also the Hold-Out method (split the data into two groups, one for training and one for validation), Double Cross-Validation (2-fold cross-validation), Leave-P-Out Cross-Validation (use P items from the original sample as validation data and the rest as training data, then repeat the process), and Shuffle Split (randomly select one portion of the data as the training set and another as the validation set; the training and validation proportions may sum to <=100%).
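Put together, the baseline's fold-based training strategy looks roughly like the sketch below. This is a minimal reconstruction, not the baseline's exact code; the variables x_train, y_train, and x_test are assumed to be pandas objects matching the snippets later in this post, and it assumes LightGBM >= 3.3 for the early-stopping callback:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
oof = np.zeros(len(x_train))       # out-of-fold predictions on the training set
test_pred = np.zeros(len(x_test))  # test predictions averaged over folds

for fold, (trn_idx, val_idx) in enumerate(folds.split(x_train, y_train)):
    d_trn = lgb.Dataset(x_train.iloc[trn_idx], label=y_train.iloc[trn_idx])
    d_val = lgb.Dataset(x_train.iloc[val_idx], label=y_train.iloc[val_idx])
    model = lgb.train(
        {"objective": "binary", "metric": "auc"},
        d_trn,
        num_boost_round=10000,
        valid_sets=[d_val],
        callbacks=[lgb.early_stopping(100)],  # stop when the fold's AUC stalls
    )
    oof[val_idx] = model.predict(x_train.iloc[val_idx])
    test_pred += model.predict(x_test) / folds.n_splits

print("OOF AUC:", roc_auc_score(y_train, oof))  # one single, more convincing estimate
```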
Boosting
There are three common ensemble learning frameworks: Bagging, Boosting, and Stacking. XGBoost, LightGBM, and CatBoost are all mainstream ensemble algorithms built on the Boosting framework. Plenty has been written about these three online, so I will not repeat it here.
XGBoost and LightGBM are very similar in their usage syntax; taking LightGBM as an example:
```python
# LightGBM model building and training
import lightgbm as lgb

d_train = lgb.Dataset(x_train, label=y_train)  # wrap the training data

params = {}
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'auc'
params['learning_rate'] = 0.003
# several parameters are omitted here

model = lgb.train(params, d_train, num_boost_round=50000)  # LightGBM model training
y_pred = model.predict(x_test)  # model prediction
```
CatBoost does away with the Dataset-creation step and is indeed closer to sklearn's familiar packages such as svm and tree:
```python
import catboost

params = {}
params['bootstrap_type'] = 'Bernoulli'  # subsampling scheme ('Bernoulli' is a bootstrap_type value in CatBoost)
params['depth'] = 5
params['learning_rate'] = 0.02
# several parameters are again omitted here

model = catboost.CatBoostRegressor(iterations=20000, **params)
model.fit(x_train, y_train)  # doesn't this feel familiar?
y_pred = model.predict(x_test)
```
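One side note: the task is binary classification scored by AUC, yet the snippet above uses CatBoostRegressor. Since AUC depends only on how the scores rank, regressing on the 0/1 label can still work; the more conventional alternative (my suggestion, not the baseline's code) would be CatBoostClassifier with predicted probabilities:

```python
# alternative sketch: a classifier instead of a regressor
clf = catboost.CatBoostClassifier(iterations=20000, depth=5,
                                  learning_rate=0.02, eval_metric='AUC')
clf.fit(x_train, y_train)
y_pred = clf.predict_proba(x_test)[:, 1]  # probability of the positive class
```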
When I got my hands on the data, I tried one-hot encoding the categorical features plus normalization of the numerical features, threw the result into LightGBM, and the AUC went down rather than up. Puzzled, I searched some more and discovered that feature engineering for machine learning has been discussed at length, for example: for categorical variables, don't just apply one-hot encoding right off the bat; and are there examples where performance suffers without data standardization (z-score standardization)? It turns out LightGBM is designed with optimizations for categorical variables, so what I did was truly a case of drawing legs on a snake. It can only be said that the better encapsulated a model is, the more there is worth digging into inside it; both feature engineering and parameter tuning deserve respect.
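To use that built-in handling, LightGBM lets you declare categorical columns directly instead of one-hot encoding them. A minimal sketch (the column names contract_type and payment_method are hypothetical; LightGBM expects categorical columns to be integer-coded or of pandas category dtype):

```python
import lightgbm as lgb

# hypothetical categorical columns in the churn data
cat_cols = ['contract_type', 'payment_method']
for c in cat_cols:
    x_train[c] = x_train[c].astype('category')
    x_test[c] = x_test[c].astype('category')

# no one-hot and no scaling: tree splits are invariant to monotonic rescaling anyway
d_train = lgb.Dataset(x_train, label=y_train, categorical_feature=cat_cols)
model = lgb.train({'objective': 'binary', 'metric': 'auc'}, d_train,
                  num_boost_round=1000)
y_pred = model.predict(x_test)
```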
Reference material
scikit-learn official documentation. sklearn.model_selection.cross_val_score.
jasonfreak. Ensemble learning with sklearn: theory.
Nay smiles, et al. LightGBM Chinese documentation.
A kind of tang Two flavors. LightGBM algorithm details.
Huang Bo's machine learning circle. Comparison and tuning of the XGBoost, LightGBM, and CatBoost algorithms.
Closing words
Finally, I would like to quote jasonfreak's signature: "A lazy person always wants to design smarter programs to avoid repetitive work." It must be said that, thanks to these "lazy people", more and more intelligent programs can be used by the general public at such a low threshold to solve all kinds of complicated problems.
My abilities are limited; I hope you will point out any shortcomings.