Mpai data science platform: random forest classification / regression parameter tuning explained
2022-06-25 12:05:00 【Halosec_ Wei】
Number of decision trees (n_estimators):
This is the number of trees in the forest, i.e., the number of base estimators. Its effect on the accuracy of the random forest model is monotonic: the larger the number of trees, the better the model tends to perform. However, every model has a ceiling; once the number of trees reaches a certain point, the accuracy of the random forest stops rising or begins to fluctuate. In addition, more trees mean more computation and memory, and training takes longer and longer. For this parameter we therefore want to strike a balance between training cost and model performance; the number of trees usually does not exceed 1000.
Value: [1, +∞)
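The parameter names in this article follow scikit-learn's random forest API, so the sketches below use scikit-learn as a stand-in. This is an assumption: on the Mpai platform the same values are set through its own interface, and the demo dataset (breast cancer) and the value grids are chosen purely for illustration. This first sketch sweeps n_estimators and prints the cross-validated accuracy, which typically stops improving after a certain number of trees:

```python
# Minimal scikit-learn sketch (assumed stand-in for the platform's n_estimators):
# sweep the number of trees and watch the cross-validated accuracy plateau.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

for n in [10, 50, 100, 200, 500]:
    rf = RandomForestClassifier(n_estimators=n, random_state=0, n_jobs=-1)
    score = cross_val_score(rf, X, y, cv=5).mean()   # 5-fold cross-validated accuracy
    print(f"n_estimators={n:4d}  mean CV accuracy={score:.4f}")
```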
Split criterion (criterion):
Regression: the criterion that measures the quality of a split in a regression tree; two options are supported: MSE (mean squared error) and MAE (mean absolute error) (the formulas are not repeated here).
Classification: the criterion by which the CART trees evaluate candidate features for a split; two options are supported: Gini impurity (gini) and information gain / entropy (entropy).
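A sketch of how the two criteria are selected in scikit-learn (assumed to mirror the platform's split-criterion option); note that recent scikit-learn versions spell the regression criteria "squared_error" / "absolute_error", while older ones used "mse" / "mae":

```python
# Choosing the split criterion (scikit-learn assumed).
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf_gini    = RandomForestClassifier(criterion="gini")       # Gini impurity
clf_entropy = RandomForestClassifier(criterion="entropy")    # information gain (entropy)

reg_mse = RandomForestRegressor(criterion="squared_error")   # MSE ("mse" in older versions)
reg_mae = RandomForestRegressor(criterion="absolute_error")  # MAE ("mae" in older versions)
```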
Maximum depth of decision tree (max_depth):
By default the depth of each tree is not limited when building the model. If the sample size is large and there are many features, it is recommended to limit the maximum depth; if the sample size or the number of features is small, the depth need not be limited. max_depth usually does not exceed 50.
Value: [1, +∞)
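A sketch comparing a few depth limits (scikit-learn assumed; the dataset and depth values are illustrative). A shrinking gap between training and test accuracy as max_depth is reduced is the usual sign that the limit is curbing overfitting:

```python
# Limiting tree depth and comparing train vs. test accuracy (scikit-learn assumed).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in [3, 5, 10, None]:                     # None = unlimited depth (the default)
    rf = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={str(depth):>4}  "
          f"train={rf.score(X_tr, y_tr):.3f}  test={rf.score(X_te, y_te):.3f}")
```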
Minimum number of samples required to split an internal node (min_samples_split):
An integer or a float, default 2. It specifies the minimum number of samples required to split an internal (non-leaf) node, and thus limits when a subtree may continue to divide: if a node holds fewer than min_samples_split samples, no further attempt is made to find the best splitting feature. If the sample size is small, this value can be left alone; if the sample size is very large, it is recommended to increase it (a combined sketch follows the min_samples_leaf entry below).
Value: [2, +∞)
The minimum number of samples required for each leaf node (min_samples_leaf):
This value limits the minimum number of samples a leaf node may hold: if a split would produce a leaf with fewer samples, that leaf is pruned together with its sibling. The default is 1. It can be given as an integer (the minimum sample count) or as a fraction of the total number of samples. If the sample size is small, this value can be left alone; if the sample size is very large, it is recommended to increase it.
Value: [1, +∞)
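Because min_samples_split and min_samples_leaf act together as pre-pruning controls, one combined sketch covers both (scikit-learn assumed; the grids are illustrative):

```python
# Searching min_samples_split and min_samples_leaf jointly with GridSearchCV
# (scikit-learn assumed); larger values prune the trees more aggressively.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```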
Number of features to consider when searching for the best split (max_features):
The number of features examined when choosing the best split cannot exceed this value. Given as an integer, it is the maximum feature count; given as a fraction, the limit is (number of training features) × fraction; with "auto", max_features = sqrt(n_features).
Value: (0, 1] as a fraction, or an integer feature count
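A sketch of the three ways max_features can be specified (scikit-learn assumed):

```python
# The three forms max_features accepts (scikit-learn assumed).
from sklearn.ensemble import RandomForestClassifier

rf_int  = RandomForestClassifier(max_features=8)       # integer: at most 8 features per split
rf_frac = RandomForestClassifier(max_features=0.5)     # float: 50% of the features per split
rf_sqrt = RandomForestClassifier(max_features="sqrt")  # sqrt(n_features); what "auto" means for classification
```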
Maximum number of leaf nodes (max_leaf_nodes):
Limiting the maximum number of leaf nodes helps prevent overfitting. The default is None, i.e., the number of leaves is not limited. If a limit is set, the algorithm builds the best tree it can within that number of leaves. With few features this value can be ignored, but with many features it is worth constraining; a concrete value can be found through cross-validation.
Value: an integer ≥ 2, or unlimited (None)
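A sketch that caps the number of leaves per tree and checks each setting by cross-validation, which is the procedure suggested above (scikit-learn assumed; the candidate values are illustrative):

```python
# Limiting max_leaf_nodes and scoring each setting by cross-validation
# (scikit-learn assumed).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

for leaves in [None, 8, 16, 32, 64]:        # None = unlimited leaves (the default)
    rf = RandomForestClassifier(max_leaf_nodes=leaves, random_state=0)
    acc = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_leaf_nodes={str(leaves):>4}  CV accuracy={acc:.4f}")
```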
Impurity threshold, by Gini impurity or information entropy (min_impurity_split):
This value limits the growth of the tree: if a node's impurity (Gini impurity, or mean squared error for regression) falls below this threshold, the node is not split any further and becomes a leaf. Changing the default value of 1e-7 is generally not recommended.
Value: (0, 1]
Sampling with replacement (bootstrap):
As the name suggests, this controls whether sampling with replacement is used when building each decision tree of the random forest. The default is True, i.e., bootstrap sampling is used.
Value: True / False
Out-of-bag estimation (oob_score):
Bagging builds each tree on a bootstrap sample, so the samples that were never drawn, i.e., the data not used to build that tree, form the out-of-bag set, and this set can be used to validate the model. For tuning, cross-validation could be used, but it is time-consuming and not really necessary for a random forest; using the out-of-bag data to evaluate the trees amounts to a simple form of cross-validation: cheap to compute, yet effective. The default is False.
Value: True / False
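A sketch that enables out-of-bag evaluation as described above (scikit-learn assumed); oob_score=True only makes sense together with bootstrap=True, and the OOB accuracy is read from the fitted model afterwards:

```python
# Out-of-bag evaluation: a nearly free estimate of generalization performance
# (scikit-learn assumed; requires bootstrap=True).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=200, bootstrap=True, oob_score=True,
                            random_state=0, n_jobs=-1)
rf.fit(X, y)
print("OOB accuracy:", round(rf.oob_score_, 4))
```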