Machine Learning (Zhou Zhihua), Chapter 4 (Decision Trees): Study Notes
2022-07-24 05:52:00 【Ml -- xiaoxiaobai】
Chapter 4: Decision Trees
The basic flow
Decision tree generation is essentially a recursive process:
def TreeGenerate(D, A):  # D: the training sample set, A: the attribute set
    generate node
    if all samples in D belong to the same class C:
        mark node as a class-C leaf node
        return
    if A is empty (all attributes have been used) or all samples in D take the same values on A:
        mark node as a leaf node, labelled with the class that has the most samples in D
        return
    select the optimal splitting attribute a_best from A
    for each value a of a_best:
        generate a branch from node; let D_divided be the subset of D whose value on a_best is a
        if D_divided is empty:
            mark this branch node as a leaf, labelled with the class that has the most samples in D
            (i.e., when a future sample takes value a on this attribute, fall back to the
             majority class of D; in the book this is described as using the prior class
             distribution of the parent node)
            return
        else:
            the branch node is TreeGenerate(D_divided, A \ {a_best})
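A minimal runnable sketch of this recursion in Python, under some assumptions not in the original notes: each sample is an (attribute-dict, label) pair, the split criterion is passed in as a callable choose_attr, and branches are created only for attribute values actually observed in D, so the empty-subset case above does not arise. The names tree_generate and majority_class are illustrative.

from collections import Counter

def majority_class(D):
    """Most frequent label in D; D is a list of (attrs: dict, label) pairs."""
    return Counter(label for _, label in D).most_common(1)[0][0]

def tree_generate(D, A, choose_attr):
    """Recursive tree construction. Returns a class label (leaf) or a dict
    mapping (attribute, value) pairs to subtrees. choose_attr(D, A) selects
    the splitting attribute, e.g. by information gain."""
    labels = {label for _, label in D}
    if len(labels) == 1:                                  # case (1): one class only
        return labels.pop()
    if not A or all(len({x[a] for x, _ in D}) == 1 for a in A):
        return majority_class(D)                          # case (2): indistinguishable
    a_best = choose_attr(D, A)                            # case (3): split further
    tree = {}
    for v in {x[a_best] for x, _ in D}:                   # observed values only
        D_v = [(x, y) for x, y in D if x[a_best] == v]
        tree[(a_best, v)] = tree_generate(D_v, [a for a in A if a != a_best], choose_attr)
    return tree

For example, tree_generate(D, A, choose_attr=lambda D, A: A[0]) naively splits on the first remaining attribute; a real criterion such as information gain is plugged in the same way.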
Split selection
Information gain
Information entropy
Information entropy measures the degree of "disorder" of a random variable. It is not quite the entropy from high-school chemistry, but the analogy with physical entropy makes it intuitive: the larger the entropy, the more "chaotic", "random" and "disordered" the set; the smaller the entropy, the more "ordered" and "tidy". The formula is:
$$\operatorname{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$
In code, the natural logarithm (base e) is often used instead of base 2; this only rescales the entropy by a constant factor and does not change which split is chosen.
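A small Python sketch of this formula (the helper name entropy is illustrative; later sketches in these notes reuse it):

import math
from collections import Counter

def entropy(labels, base=2):
    """Information entropy Ent(D) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(labels).values())

print(entropy(['yes', 'yes', 'no', 'no']))   # a 50/50 split gives 1.0 bit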
Conditional entropy
This is essentially a weighted information entropy; the formula is intuitive:
$$\frac{|D^v|}{|D|} \operatorname{Ent}(D^v)$$
where $D^v$ is a subset obtained by partitioning $D$ (i.e., by conditioning on a known value).
This gives the expression for information gain:
$$\operatorname{Gain}(D, a) = \operatorname{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} \operatorname{Ent}(D^v)$$
That is, the information entropy before the split minus the conditional (weighted) entropy after the split. This is also intuitive: the larger the difference, the more "ordered" the subsets produced by the split, which is exactly the result we want. This is the idea behind the ID3 algorithm (Iterative Dichotomiser 3).
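A Python sketch of information gain, reusing the entropy helper above and assuming the same (attribute-dict, label) sample representation as the earlier tree sketch:

from collections import defaultdict

def information_gain(D, a):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v) for a discrete attribute a."""
    groups = defaultdict(list)
    for x, y in D:
        groups[x[a]].append(y)
    cond_entropy = sum(len(g) / len(D) * entropy(g) for g in groups.values())
    return entropy([y for _, y in D]) - cond_entropy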
Gain ratio
One drawback of the information-gain criterion used by ID3 is that as the number of attribute values grows, the resulting conditional entropy tends to shrink (the more branches there are, the "tidier" each branch naturally looks). In the extreme case from the book, where every value of the attribute corresponds to exactly one sample, the conditional entropy is 0 and the information gain is maximal, yet such a split obviously cannot generalize well. To balance this, the entropy increase caused by the partition itself (having many branches is itself a form of disorder) is used to scale down the information gain:
$$\operatorname{Gain\_ratio}(D, a) = \frac{\operatorname{Gain}(D, a)}{\operatorname{IV}(a)}$$
where
$$\operatorname{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$
is called the intrinsic value (intrinsic value) of a. It is in fact the entropy induced by partitioning the set according to a: the more values a has, the larger IV(a) tends to be, which counteracts the inflated information gain. As a result, the gain ratio has a preference for attributes with fewer values; for this reason C4.5 does not use it directly, but first selects the candidate attributes whose information gain is above average and then picks from these the one with the highest gain ratio.
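A corresponding Python sketch, building on the information_gain helper above; the guard against a zero intrinsic value (an attribute with a single observed value) is an added assumption, not from the book:

import math
from collections import Counter

def intrinsic_value(D, a):
    """IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)."""
    n = len(D)
    counts = Counter(x[a] for x, _ in D)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(D, a):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a)."""
    iv = intrinsic_value(D, a)
    return information_gain(D, a) / iv if iv > 0 else 0.0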
Gini index
The Gini value is defined as the probability that two samples drawn at random from the data set belong to different classes:
$$\operatorname{Gini}(D) = \sum_{k=1}^{|\mathcal{Y}|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$
Intuitively, the higher the purity of the data set, the smaller the Gini value. Analogous to the relationship between information entropy and conditional entropy, the Gini index is a "weighted" version of the Gini value, defined as:
$$\operatorname{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \operatorname{Gini}(D^v)$$
The smaller the Gini index, the higher the purity of the split; this is the criterion used by the CART algorithm (Classification and Regression Tree).
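A Python sketch of both quantities, using the same (attribute-dict, label) representation as the earlier sketches:

from collections import Counter, defaultdict

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 over a sequence of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(D, a):
    """Gini_index(D, a) = sum_v |D^v|/|D| * Gini(D^v); CART splits on the attribute
    that minimizes this value."""
    groups = defaultdict(list)
    for x, y in D:
        groups[x[a]].append(y)
    return sum(len(g) / len(D) * gini(g) for g in groups.values())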
Pruning
Pruning is used to combat overfitting: it limits the number of leaf nodes generated and controls the "depth" of the tree.
Pre-pruning
While nodes are being generated, a validation set is used to measure the generalization ability / accuracy before and after splitting a node; if splitting would reduce validation accuracy, the node is kept as a leaf instead of being split.
Pre-pruning has its own problem: it cannot tell whether the drop in generalization ability caused by a split is only temporary. A split that looks unhelpful now might, after further splits below it, lead to a lower error rate. Pre-pruning therefore carries a risk of underfitting.
Post-pruning
First generate a complete decision tree, then examine, from the bottom up, the internal nodes whose children are leaves: check whether replacing such a node (and its leaves) with a single leaf improves accuracy on the validation set, and prune if it does.
Post-pruning generally carries a smaller risk of underfitting, but its training time cost is greater than that of generating the tree itself and of pre-pruning.
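Both strategies boil down to the same comparison on a held-out validation set, sketched below in Python; predict stands for any callable mapping a sample's attributes to a label, and the strict-improvement tie-breaking is a simplification rather than the book's exact rule:

def accuracy(predict, validation_set):
    """Validation accuracy of a predictor; validation_set is a list of (attrs, label)."""
    return sum(predict(x) == y for x, y in validation_set) / len(validation_set)

def split_is_worthwhile(leaf_predict, subtree_predict, validation_set):
    """Keep a split (pre-pruning) or keep a subtree (post-pruning) only if it beats
    collapsing the node to a majority-class leaf on the validation set."""
    return accuracy(subtree_predict, validation_set) > accuracy(leaf_predict, validation_set)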
Continuous and missing values
Processing of continuous values
This is essentially the selection of split points, as in the generation of a regression tree. The book states that a continuous attribute with n distinct values yields n-1 candidate split points; the author notes this is not strictly necessary, since one more could be added (i.e., a split that puts all samples on one side). In practice one can also avoid information gain and learn heuristically: for example, treat samples above a candidate split point as positive and those below as negative (or the other way round), then choose the split point and the direction of the inequality that minimize the error rate (see the code written earlier).
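A Python sketch of the book's bi-partition approach, reusing the entropy helper from above: candidate thresholds are the midpoints between adjacent distinct sorted values, and the one with the highest information gain is chosen.

def best_threshold(values, labels):
    """Choose the split point for a continuous attribute by information gain.
    Candidate thresholds are midpoints between adjacent distinct sorted values."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    base_entropy = entropy(ys)
    best_t, best_gain = None, -1.0
    for i in range(len(xs) - 1):
        if xs[i] == xs[i + 1]:
            continue                       # no threshold between equal values
        t = (xs[i] + xs[i + 1]) / 2
        left, right = ys[:i + 1], ys[i + 1:]
        gain = base_entropy \
             - len(left) / len(ys) * entropy(left) \
             - len(right) / len(ys) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain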
Processing of missing values
For missing values the book introduces the approach used by the C4.5 algorithm. The basic idea: when selecting the splitting attribute, compute the information gain on the subset of samples whose value for that attribute is not missing, and weight it by the (weighted) proportion of non-missing samples in the whole set; this balances the information gains of different attributes computed on different numbers of samples. After the best attribute is chosen, when partitioning into child nodes, a sample whose value for that attribute is missing is distributed across all child nodes, with its weight scaled by the proportion of non-missing samples that went into each child (so, for example, 0.1 of a sample may be assigned to a child node); these weights are retained and subdivided again later if the sample is also missing a value for a subsequent splitting attribute.
Specifically, define:
$$\rho = \frac{\sum_{\boldsymbol{x} \in \tilde{D}} w_{\boldsymbol{x}}}{\sum_{\boldsymbol{x} \in D} w_{\boldsymbol{x}}}$$
$$\tilde{p}_k = \frac{\sum_{\boldsymbol{x} \in \tilde{D}_k} w_{\boldsymbol{x}}}{\sum_{\boldsymbol{x} \in \tilde{D}} w_{\boldsymbol{x}}} \quad (1 \leqslant k \leqslant |\mathcal{Y}|)$$
$$\tilde{r}_v = \frac{\sum_{\boldsymbol{x} \in \tilde{D}^v} w_{\boldsymbol{x}}}{\sum_{\boldsymbol{x} \in \tilde{D}} w_{\boldsymbol{x}}} \quad (1 \leqslant v \leqslant V)$$
Here $w_{\boldsymbol{x}}$ is the weight of sample $\boldsymbol{x}$, initially 1, representing one copy of the sample; when a sample with a missing value is distributed across the branches of the splitting attribute, its weight is updated accordingly. The information gain then becomes:
$$\operatorname{Gain}(D, a) = \rho \times \operatorname{Gain}(\tilde{D}, a) = \rho \times \left( \operatorname{Ent}(\tilde{D}) - \sum_{v=1}^{V} \tilde{r}_v \operatorname{Ent}(\tilde{D}^v) \right)$$
where
$$\operatorname{Ent}(\tilde{D}) = -\sum_{k=1}^{|\mathcal{Y}|} \tilde{p}_k \log_2 \tilde{p}_k$$
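A Python sketch of this weighted gain; each sample is assumed to be an (attrs, label, weight) triple with attrs[a] is None marking a missing value (this representation is an assumption for illustration, not from the book):

import math
from collections import defaultdict

def weighted_entropy(pairs):
    """Entropy of a weighted label collection; pairs is a list of (label, weight)."""
    total = sum(w for _, w in pairs)
    by_class = defaultdict(float)
    for y, w in pairs:
        by_class[y] += w
    return -sum((w / total) * math.log2(w / total) for w in by_class.values() if w > 0)

def gain_with_missing(D, a):
    """C4.5-style information gain on attribute a with missing values:
    Gain(D, a) = rho * (Ent(D~) - sum_v r~_v * Ent(D~^v))."""
    total_w = sum(w for _, _, w in D)
    D_tilde = [(x, y, w) for x, y, w in D if x[a] is not None]
    w_tilde = sum(w for _, _, w in D_tilde)
    rho = w_tilde / total_w
    groups = defaultdict(list)
    for x, y, w in D_tilde:
        groups[x[a]].append((y, w))
    cond = sum(sum(w for _, w in g) / w_tilde * weighted_entropy(g)
               for g in groups.values())
    return rho * (weighted_entropy([(y, w) for _, y, w in D_tilde]) - cond)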
Multivariate decision trees
Also known as oblique decision trees (oblique decision tree). In essence, the splits on single attributes (whose boundaries are axis-parallel in feature space) are replaced by linear splits in feature space, i.e., each internal node uses a linear classifier over several attributes, as illustrated by the figure in the book.
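A minimal sketch of the difference between the two kinds of node test; the function names and the dict-based weights are illustrative only:

def axis_parallel_test(x, attr, threshold):
    """Ordinary (univariate) node test: compare a single attribute with a threshold."""
    return x[attr] <= threshold

def oblique_test(x, weights, threshold):
    """Multivariate node test: compare a linear combination of attributes with a
    threshold, i.e. a linear classifier whose boundary need not be axis-parallel."""
    return sum(w * x[attr] for attr, w in weights.items()) <= threshold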