
Mpai Data Science Platform: Random Forest Classification/Regression Parameter Tuning Explained

2022-06-25 12:05:00 Halosec_Wei

Number of decision trees (n_estimators):

This is the number of trees in the forest, i.e. the number of base estimators. The effect of this parameter on the accuracy of the random forest model is monotonic: the more decision trees, the better the model tends to perform. Correspondingly, though, every model has a decision boundary, and once the number of trees reaches a certain point the accuracy of the random forest stops rising or begins to fluctuate. Moreover, the more decision trees there are, the more computation and memory are required and the longer training takes. For this parameter we want to strike a balance between training cost and model performance; the number of decision trees usually does not exceed 1000.

 

Value: [1, +∞)
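
To illustrate the rise-then-plateau behavior described above, here is a minimal sketch using scikit-learn's RandomForestClassifier (assuming the Mpai platform follows the standard scikit-learn semantics; the data set and value grid are purely illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # Accuracy typically climbs with n_estimators, then flattens or fluctuates.
    for n in (10, 50, 100, 300, 500):
        score = cross_val_score(
            RandomForestClassifier(n_estimators=n, random_state=0), X, y, cv=5
        ).mean()
        print(f"n_estimators={n:4d}  mean CV accuracy={score:.4f}")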

 

Splitting criterion (criterion):

Regression: the metric that measures branch quality in a regression tree. Two criteria are supported: MSE and MAE (look up their formulas for details).

Classification: the criterion by which the CART tree evaluates splits on features. Two criteria are supported: the Gini index (gini) and information gain/entropy (entropy).
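
As a brief sketch of how the two task types select their criteria, assuming the platform mirrors scikit-learn's parameter names (note that recent scikit-learn versions spell MSE/MAE as "squared_error"/"absolute_error"):

    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    # Classification: Gini index (default) or information entropy.
    clf_gini = RandomForestClassifier(criterion="gini")
    clf_entropy = RandomForestClassifier(criterion="entropy")

    # Regression: MSE / MAE, spelled "squared_error" / "absolute_error"
    # in recent scikit-learn ("mse" / "mae" in older versions).
    reg_mse = RandomForestRegressor(criterion="squared_error")
    reg_mae = RandomForestRegressor(criterion="absolute_error")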

 

Maximum depth of decision tree (max_depth):

The default value means the decision tree does not limit the depth of its subtrees while building the optimal model. If the model has a large sample size and many features, limiting the maximum depth is recommended; if the samples or features are few, the maximum depth need not be limited. max_depth usually does not exceed 50.

Value: [1, +∞)
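
A minimal sketch of the two regimes described above, assuming scikit-learn semantics (the cap of 20 is illustrative):

    from sklearn.ensemble import RandomForestClassifier

    # Default: depth is unlimited (max_depth=None); each tree grows until
    # its leaves are pure or other stopping rules apply.
    rf_unlimited = RandomForestClassifier(max_depth=None)

    # Many samples and many features: cap the depth to curb overfitting.
    rf_capped = RandomForestClassifier(max_depth=20)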

 

Minimum number of samples required to split an internal node (min_samples_split):

An integer or a float, default 2. It specifies the minimum number of samples required to split an internal (non-leaf) node, thereby limiting when a subtree may continue to divide: if a node holds fewer than min_samples_split samples, no further attempt is made to select the best feature to split on. If the sample size is small, there is no need to worry about this value; if the sample size is very large, increasing it is recommended.

Value: [2, +∞)
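
A brief sketch, again assuming scikit-learn semantics (the value 50 is illustrative):

    from sklearn.ensemble import RandomForestClassifier

    # Default: any node with at least 2 samples may be split further.
    rf_default = RandomForestClassifier(min_samples_split=2)

    # Very large data set: demand more samples before a node may split,
    # which yields shallower, more regularized trees.
    rf_large = RandomForestClassifier(min_samples_split=50)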

 

The minimum number of samples required for each leaf node (min_samples_leaf): 

This value limits the minimum number of samples at a leaf node: if a leaf node ends up with fewer samples than this, it is pruned together with its sibling nodes. The default is 1; you may pass an integer for the minimum sample count, or a float for the minimum count as a fraction of the total sample size. If the sample size is small, there is no need to worry about this value; if the sample size is very large, increasing it is recommended.

Value: [1, +∞)
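
A short sketch of the integer and fractional forms, assuming scikit-learn semantics (both values are illustrative):

    from sklearn.ensemble import RandomForestClassifier

    # Integer: every leaf must contain at least 10 samples.
    rf_abs = RandomForestClassifier(min_samples_leaf=10)

    # Float: every leaf must hold at least 1% of the training samples.
    rf_frac = RandomForestClassifier(min_samples_leaf=0.01)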

 

Number of features to consider when searching for the best split of a node (max_features):

When selecting the best attribute, the number of features considered for the split cannot exceed this value. An integer gives the maximum number of features directly; a float gives that fraction of the training set's features; "auto" sets max_features = sqrt(n_features).

Value: (0, 1] as a fraction, or an integer in [1, n_features]
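
The three forms described above, sketched with scikit-learn (note that in recent scikit-learn the literal "auto" has been removed; "sqrt" reproduces its behavior for classifiers):

    from sklearn.ensemble import RandomForestClassifier

    rf_int = RandomForestClassifier(max_features=8)        # at most 8 features per split
    rf_frac = RandomForestClassifier(max_features=0.5)     # half of the features
    rf_sqrt = RandomForestClassifier(max_features="sqrt")  # sqrt(n_features), the old "auto" rule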

 

 

Maximum number of leaf nodes (max_leaf_nodes):

Limiting the maximum number of leaf nodes can prevent overfitting. The default is None, i.e. the number of leaf nodes is unlimited. If a limit is set, the algorithm builds the optimal decision tree within that maximum number of leaf nodes. If there are not many features, this value can be ignored, but if the features produce many splits it can be constrained; a specific value can be found through cross validation.

Value: [2, +∞) (integer), or None for no limit
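
Since the text suggests choosing the limit by cross validation, here is a minimal sketch using scikit-learn's GridSearchCV (the data set and grid are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)

    # Cross-validate a small grid of leaf-count limits (None = unlimited).
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"max_leaf_nodes": [None, 10, 30, 100]},
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_)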

 

Impurity threshold on information entropy or the Gini index (min_impurity_split):

This value limits the growth of the decision tree: if a node's impurity (based on the Gini index or the mean squared error) is below this threshold, the node generates no further child nodes and becomes a leaf node. It is generally not recommended to change the default value of 1e-7.

Value: (0, 1]
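
One caveat when reproducing this with scikit-learn: min_impurity_split was deprecated there and later removed, replaced by min_impurity_decrease. A sketch with the modern spelling:

    from sklearn.ensemble import RandomForestClassifier

    # Modern scikit-learn (>= 1.0): a node is split only if doing so
    # decreases the impurity by at least this amount.
    rf = RandomForestClassifier(min_impurity_decrease=1e-7)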

 

Sampling with replacement (bootstrap):

As the name suggests, this controls whether sampling with replacement is used when building each decision tree of the random forest. The default is True, i.e. a bootstrap (sampling with replacement) strategy is adopted.

Value: True / False
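
A two-line sketch of both settings, assuming scikit-learn semantics:

    from sklearn.ensemble import RandomForestClassifier

    # Default: each tree is trained on a bootstrap sample (with replacement).
    rf_bootstrap = RandomForestClassifier(bootstrap=True)

    # Disabled: every tree is trained on the whole training set.
    rf_full = RandomForestClassifier(bootstrap=False)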

 

Out-of-bag estimation (oob_score):

Bagging builds each tree model by random sampling with replacement, so the samples that are never drawn, i.e. the data that take no part in building a given tree, form the out-of-bag data set, which can be used to validate the model. For tuning the parameters of multiple models we know cross validation can be used, but it costs a lot of time, and for random forests it is not really necessary; so we use the out-of-bag data to validate the decision tree models, a simple form of cross validation that is cheap to compute yet effective. The default value is False.

 

Value: True / False
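
A minimal sketch of out-of-bag validation with scikit-learn (the data set is illustrative; oob_score=True requires bootstrap=True):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Each sample is scored only by the trees that never saw it,
    # giving a cheap built-in validation estimate.
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"out-of-bag accuracy: {rf.oob_score_:.4f}")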

 


Copyright notice: this article was created by [Halosec_Wei]. When reposting, please include a link to the original: https://yzsam.com/2022/02/202202200535108638.html
