
Mpai Data Science Platform: Random Forest Classification/Regression Parameter Tuning Explained

2022-06-25 12:05:00 Halosec_Wei

Number of decision trees (n_estimators):

This is the number of trees in the forest, i.e. the number of base estimators. The effect of this parameter on the accuracy of the random forest model is monotonic: the more decision trees, the better the model tends to perform. Correspondingly, though, every model has a decision boundary, and once the number of trees reaches a certain point the accuracy of the random forest stops rising or begins to fluctuate. Moreover, the more decision trees there are, the more computation and memory are required and the longer training takes. For this parameter we want to strike a balance between training cost and model performance; the number of decision trees usually does not exceed 1000.

 

Value: [1, +∞)
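
To illustrate the rise-then-plateau behavior described above, here is a minimal sketch using scikit-learn's RandomForestClassifier (assuming the Mpai platform follows the standard scikit-learn semantics; the data set and value grid are purely illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # Accuracy typically climbs with n_estimators, then flattens or fluctuates.
    for n in (10, 50, 100, 300, 500):
        score = cross_val_score(
            RandomForestClassifier(n_estimators=n, random_state=0), X, y, cv=5
        ).mean()
        print(f"n_estimators={n:4d}  mean CV accuracy={score:.4f}")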

 

Splitting criterion (criterion):

Regression: the metric that measures branch quality in a regression tree. Two criteria are supported: MSE and MAE (look up their formulas for details).

Classification: the criterion by which the CART tree evaluates splits on features. Two criteria are supported: the Gini index (gini) and information gain/entropy (entropy).
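
As a brief sketch of how the two task types select their criteria, assuming the platform mirrors scikit-learn's parameter names (note that recent scikit-learn versions spell MSE/MAE as "squared_error"/"absolute_error"):

    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    # Classification: Gini index (default) or information entropy.
    clf_gini = RandomForestClassifier(criterion="gini")
    clf_entropy = RandomForestClassifier(criterion="entropy")

    # Regression: MSE / MAE, spelled "squared_error" / "absolute_error"
    # in recent scikit-learn ("mse" / "mae" in older versions).
    reg_mse = RandomForestRegressor(criterion="squared_error")
    reg_mae = RandomForestRegressor(criterion="absolute_error")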

 

Maximum depth of decision tree (max_depth):

The default value means the decision tree does not limit the depth of its subtrees while building the optimal model. If the model has a large sample size and many features, limiting the maximum depth is recommended; if the samples or features are few, the maximum depth need not be limited. max_depth usually does not exceed 50.

Value: [1, +∞)
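
A minimal sketch of the two regimes described above, assuming scikit-learn semantics (the cap of 20 is illustrative):

    from sklearn.ensemble import RandomForestClassifier

    # Default: depth is unlimited (max_depth=None); each tree grows until
    # its leaves are pure or other stopping rules apply.
    rf_unlimited = RandomForestClassifier(max_depth=None)

    # Many samples and many features: cap the depth to curb overfitting.
    rf_capped = RandomForestClassifier(max_depth=20)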

 

Minimum number of samples required to split an internal node (min_samples_split):

An integer or a float, default 2. It specifies the minimum number of samples required to split an internal (non-leaf) node, thereby limiting when a subtree may continue to divide: if a node holds fewer than min_samples_split samples, no further attempt is made to select the best feature to split on. If the sample size is small, there is no need to worry about this value; if the sample size is very large, increasing it is recommended.

Value: [2, +∞)
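
A brief sketch, again assuming scikit-learn semantics (the value 50 is illustrative):

    from sklearn.ensemble import RandomForestClassifier

    # Default: any node with at least 2 samples may be split further.
    rf_default = RandomForestClassifier(min_samples_split=2)

    # Very large data set: demand more samples before a node may split,
    # which yields shallower, more regularized trees.
    rf_large = RandomForestClassifier(min_samples_split=50)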

 

The minimum number of samples required for each leaf node (min_samples_leaf): 

This value limits the minimum number of samples at a leaf node: if a leaf node ends up with fewer samples than this, it is pruned together with its sibling nodes. The default is 1; you may pass an integer for the minimum sample count, or a float for the minimum count as a fraction of the total sample size. If the sample size is small, there is no need to worry about this value; if the sample size is very large, increasing it is recommended.

Value: [1, +∞)
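
A short sketch of the integer and fractional forms, assuming scikit-learn semantics (both values are illustrative):

    from sklearn.ensemble import RandomForestClassifier

    # Integer: every leaf must contain at least 10 samples.
    rf_abs = RandomForestClassifier(min_samples_leaf=10)

    # Float: every leaf must hold at least 1% of the training samples.
    rf_frac = RandomForestClassifier(min_samples_leaf=0.01)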

 

Number of features to consider when searching for the best split of a node (max_features):

When selecting the best attribute, the number of features considered for the split cannot exceed this value. An integer gives the maximum number of features directly; a float gives that fraction of the training set's features; "auto" sets max_features = sqrt(n_features).

Value: (0, 1] as a fraction, or an integer in [1, n_features]
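
The three forms described above, sketched with scikit-learn (note that in recent scikit-learn the literal "auto" has been removed; "sqrt" reproduces its behavior for classifiers):

    from sklearn.ensemble import RandomForestClassifier

    rf_int = RandomForestClassifier(max_features=8)        # at most 8 features per split
    rf_frac = RandomForestClassifier(max_features=0.5)     # half of the features
    rf_sqrt = RandomForestClassifier(max_features="sqrt")  # sqrt(n_features), the old "auto" rule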

 

 

Maximum number of leaf nodes (max_leaf_nodes):

Limiting the maximum number of leaf nodes can prevent overfitting. The default is None, i.e. the number of leaf nodes is unlimited. If a limit is set, the algorithm builds the optimal decision tree within that maximum number of leaf nodes. If there are not many features, this value can be ignored, but if the features produce many splits it can be constrained; a specific value can be found through cross validation.

Value: [2, +∞) (integer), or None for no limit
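
Since the text suggests choosing the limit by cross validation, here is a minimal sketch using scikit-learn's GridSearchCV (the data set and grid are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)

    # Cross-validate a small grid of leaf-count limits (None = unlimited).
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"max_leaf_nodes": [None, 10, 30, 100]},
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_)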

 

Impurity threshold on information entropy or the Gini index (min_impurity_split):

This value limits the growth of the decision tree: if a node's impurity (based on the Gini index or the mean squared error) is below this threshold, the node generates no further child nodes and becomes a leaf node. It is generally not recommended to change the default value of 1e-7.

Value: (0, 1]
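
One caveat when reproducing this with scikit-learn: min_impurity_split was deprecated there and later removed, replaced by min_impurity_decrease. A sketch with the modern spelling:

    from sklearn.ensemble import RandomForestClassifier

    # Modern scikit-learn (>= 1.0): a node is split only if doing so
    # decreases the impurity by at least this amount.
    rf = RandomForestClassifier(min_impurity_decrease=1e-7)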

 

Sampling with replacement (bootstrap):

As the name suggests, this controls whether sampling with replacement is used when building each decision tree of the random forest. The default is True, i.e. a bootstrap (sampling with replacement) strategy is adopted.

Value: True / False
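
A two-line sketch of both settings, assuming scikit-learn semantics:

    from sklearn.ensemble import RandomForestClassifier

    # Default: each tree is trained on a bootstrap sample (with replacement).
    rf_bootstrap = RandomForestClassifier(bootstrap=True)

    # Disabled: every tree is trained on the whole training set.
    rf_full = RandomForestClassifier(bootstrap=False)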

 

Out-of-bag estimation (oob_score):

Bagging builds each tree model by random sampling with replacement, so the samples that are never drawn, i.e. the data that take no part in building a given tree, form the out-of-bag data set, which can be used to validate the model. For tuning the parameters of multiple models we know cross validation can be used, but it costs a lot of time, and for random forests it is not really necessary; so we use the out-of-bag data to validate the decision tree models, a simple form of cross validation that is cheap to compute yet effective. The default value is False.

 

Value: True / False
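
A minimal sketch of out-of-bag validation with scikit-learn (the data set is illustrative; oob_score=True requires bootstrap=True):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Each sample is scored only by the trees that never saw it,
    # giving a cheap built-in validation estimate.
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"out-of-bag accuracy: {rf.oob_score_:.4f}")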

 


Copyright notice: this article was created by [Halosec_Wei]. When reposting, please include a link to the original: https://yzsam.com/2022/02/202202200535108638.html
