XGBoost, LightGBM, CatBoost -- trying to stand on the shoulders of giants
2022-06-26 03:22:00 【Not reassuring】
Preface
Recently I took part in iFLYTEK's telecom customer churn prediction challenge. The task is a typical binary classification problem, with AUC as the evaluation metric. I learned a lot from the official baseline, so here is a summary.
- baseline: https://mp.weixin.qq.com/s/nLgaGMJByOqRVWnm1UfB3g
- competition: https://challenge.xfyun.cn/topic/info?type=telecom-customer&ch=ds22-dw-zs01
The baseline provides a training strategy (KFold) and three boosting algorithms (XGBoost, LightGBM, CatBoost); this article focuses mainly on them.
KFold
In k-fold cross-validation, the initial sample is split into k subsamples. A single subsample is held out as data for validating the model, and the other k-1 subsamples are used for training. This is repeated k times so that every subsample is used for validation exactly once, and the k results are averaged (or combined in some other way) to yield a single estimate. With k=5, for example, each of the five folds serves as the validation set once.
The baseline essentially writes this idea out by hand. Anyone familiar with grid search (Grid Search) who has used the sklearn package will have noticed that the **GridSearchCV()** class takes a parameter cv, which specifies exactly this number of cross-validation folds. With this scheme, all of the data participates in both training and validation, which effectively guards against overfitting and underfitting, so the final result is more convincing. The baseline also mentions StratifiedKFold (although it does not use it). It is an enhanced version of KFold; the biggest difference is that StratifiedKFold splits the data according to the label y, preserving the proportion of each class, so every fold has the same class ratio as the whole data set. That makes it well suited to imbalanced data sets. In this competition the official training data contains 150,000 rows, of which 75,042 are positive samples, almost 1:1, an admirably balanced proportion.
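As a quick illustration of the cv parameter (the estimator and parameter grid below are made up for demonstration and are not from the baseline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=42)  # toy data

# cv=5 runs 5-fold cross-validation for every parameter combination
grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"learning_rate": [0.1, 0.03], "n_estimators": [100, 300]},
    scoring="roc_auc",  # the competition metric
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```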
Besides k-fold, there are also the Hold-Out method (split the data into two groups, one for training and one for validation), Double Cross-Validation (2-fold cross-validation), Leave-P-Out Cross-Validation (use P items from the original sample as validation data and the rest as training data, then repeat the process), and Shuffle Split (randomly select one portion of the data as the training set and another as the validation set; the training and validation proportions may sum to <=100%).
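Put together, the baseline's fold-based training strategy looks roughly like the sketch below. This is a minimal reconstruction, not the baseline's exact code; the variables x_train, y_train, and x_test are assumed to be pandas objects matching the snippets later in this post, and it assumes LightGBM >= 3.3 for the early-stopping callback:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
oof = np.zeros(len(x_train))       # out-of-fold predictions on the training set
test_pred = np.zeros(len(x_test))  # test predictions averaged over folds

for fold, (trn_idx, val_idx) in enumerate(folds.split(x_train, y_train)):
    d_trn = lgb.Dataset(x_train.iloc[trn_idx], label=y_train.iloc[trn_idx])
    d_val = lgb.Dataset(x_train.iloc[val_idx], label=y_train.iloc[val_idx])
    model = lgb.train(
        {"objective": "binary", "metric": "auc"},
        d_trn,
        num_boost_round=10000,
        valid_sets=[d_val],
        callbacks=[lgb.early_stopping(100)],  # stop when the fold's AUC stalls
    )
    oof[val_idx] = model.predict(x_train.iloc[val_idx])
    test_pred += model.predict(x_test) / folds.n_splits

print("OOF AUC:", roc_auc_score(y_train, oof))  # one single, more convincing estimate
```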
Boosting
There are three common ensemble learning frameworks: Bagging, Boosting, and Stacking. XGBoost, LightGBM, and CatBoost are all mainstream ensemble algorithms built on the Boosting framework. Plenty has been written about these three online, so I will not repeat it here.
XGBoost and LightGBM are very similar in their usage syntax; taking LightGBM as an example:
```python
# LightGBM model building and training
import lightgbm as lgb

d_train = lgb.Dataset(x_train, label=y_train)  # wrap the training data

params = {}
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'auc'
params['learning_rate'] = 0.003
# several parameters are omitted here

model = lgb.train(params, d_train, num_boost_round=50000)  # LightGBM model training
y_pred = model.predict(x_test)  # model prediction
```
CatBoost does away with the Dataset-creation step and is indeed closer to sklearn's familiar packages such as svm and tree:
```python
import catboost

params = {}
params['bootstrap_type'] = 'Bernoulli'  # subsampling scheme ('Bernoulli' is a bootstrap_type value in CatBoost)
params['depth'] = 5
params['learning_rate'] = 0.02
# several parameters are again omitted here

model = catboost.CatBoostRegressor(iterations=20000, **params)
model.fit(x_train, y_train)  # doesn't this feel familiar?
y_pred = model.predict(x_test)
```
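One side note: the task is binary classification scored by AUC, yet the snippet above uses CatBoostRegressor. Since AUC depends only on how the scores rank, regressing on the 0/1 label can still work; the more conventional alternative (my suggestion, not the baseline's code) would be CatBoostClassifier with predicted probabilities:

```python
# alternative sketch: a classifier instead of a regressor
clf = catboost.CatBoostClassifier(iterations=20000, depth=5,
                                  learning_rate=0.02, eval_metric='AUC')
clf.fit(x_train, y_train)
y_pred = clf.predict_proba(x_test)[:, 1]  # probability of the positive class
```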
When I got my hands on the data, I tried one-hot encoding the categorical features plus normalization of the numerical features, threw the result into LightGBM, and the AUC went down rather than up. Puzzled, I searched some more and discovered that feature engineering for machine learning has been discussed at length, for example: for categorical variables, don't just apply one-hot encoding right off the bat; and are there examples where performance suffers without data standardization (z-score standardization)? It turns out LightGBM is designed with optimizations for categorical variables, so what I did was truly a case of drawing legs on a snake. It can only be said that the better encapsulated a model is, the more there is worth digging into inside it; both feature engineering and parameter tuning deserve respect.
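To use that built-in handling, LightGBM lets you declare categorical columns directly instead of one-hot encoding them. A minimal sketch (the column names contract_type and payment_method are hypothetical; LightGBM expects categorical columns to be integer-coded or of pandas category dtype):

```python
import lightgbm as lgb

# hypothetical categorical columns in the churn data
cat_cols = ['contract_type', 'payment_method']
for c in cat_cols:
    x_train[c] = x_train[c].astype('category')
    x_test[c] = x_test[c].astype('category')

# no one-hot and no scaling: tree splits are invariant to monotonic rescaling anyway
d_train = lgb.Dataset(x_train, label=y_train, categorical_feature=cat_cols)
model = lgb.train({'objective': 'binary', 'metric': 'auc'}, d_train,
                  num_boost_round=1000)
y_pred = model.predict(x_test)
```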
Reference material
scikit-learn official documentation. sklearn.model_selection.cross_val_score.
jasonfreak. Ensemble learning with sklearn: theory.
Nay smiles, et al. LightGBM Chinese documentation.
A kind of tang Two flavors. LightGBM algorithm details.
Huang Bo's machine learning circle. Comparison and tuning of the XGBoost, LightGBM, and CatBoost algorithms.
Closing words
Finally, I would like to quote jasonfreak's signature: "A lazy person always wants to design smarter programs to avoid repetitive work." It must be said that, thanks to these "lazy people", more and more intelligent programs can be used by the general public at such a low threshold to solve all kinds of complicated problems.
My abilities are limited; I hope you will point out any shortcomings.