[fundamentals of machine learning 01] blending, bagging and AdaBoost
2022-06-22 06:54:00 【chad_lee】
These are last year's study notes, recently reviewed and reorganized.
Aggregation Model
Suppose we have $T$ models $g_1, \cdots, g_T$. Common aggregation methods are:
- Select the best model on a validation set:
$$G(\mathbf{x})=g_{t_{*}}(\mathbf{x}) \text{ with } t_{*}=\operatorname{argmin}_{t \in\{1,2, \ldots, T\}} E_{\text{val}}\left(g_{t}^{-}\right)$$
- Uniformly average all models:
$$G(\mathbf{x})=\operatorname{sign}\left(\sum_{t=1}^{T} 1 \cdot g_{t}(\mathbf{x})\right)$$
- Take a weighted average of all models:
$$G(\mathbf{x})=\operatorname{sign}\left(\sum_{t=1}^{T} \alpha_{t} \cdot g_{t}(\mathbf{x})\right) \text{ with } \alpha_{t} \geq 0$$
- Combine the models conditionally on the input:
$$G(\mathbf{x})=\operatorname{sign}\left(\sum_{t=1}^{T} q_{t}(\mathbf{x}) \cdot g_{t}(\mathbf{x})\right) \text{ with } q_{t}(\mathbf{x}) \geq 0$$
These are all intuitive ideas; now let's look at them more closely.
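To make the four schemes concrete, here is a minimal sketch. The list `models` of trained classifiers (with `predict` returning labels in {-1, +1}) and all function names are assumptions for illustration, not part of the original notes.

```python
import numpy as np

# Assumed setup: `models` is a list of T trained binary classifiers whose
# .predict(X) returns labels in {-1, +1}.

def select_best(models, X_val, y_val, X):
    """Scheme 1: keep only the model with the lowest validation error."""
    errors = [np.mean(m.predict(X_val) != y_val) for m in models]
    return models[int(np.argmin(errors))].predict(X)

def uniform_vote(models, X):
    """Scheme 2: unweighted majority vote."""
    return np.sign(np.sum([m.predict(X) for m in models], axis=0))

def weighted_vote(models, alphas, X):
    """Scheme 3: vote weighted by fixed non-negative alpha_t."""
    return np.sign(np.sum([a * m.predict(X) for a, m in zip(alphas, models)], axis=0))

def conditional_vote(models, q_funcs, X):
    """Scheme 4: vote weighted by input-dependent, non-negative q_t(x)."""
    return np.sign(np.sum([q(X) * m.predict(X) for q, m in zip(q_funcs, models)], axis=0))
```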
Blending
Mean fusion (Uniform Blending)
In the classification problem:
$$G(\mathbf{x})=\operatorname{sign}\left(\sum_{t=1}^{T} g_{t}(\mathbf{x})\right)$$
In multi-class tasks, the majority vote decides:

In regression, the mean of all model outputs is taken.
When the $g_t$ give similar predictions, averaging leaves performance unchanged. When the $g_t$ are diverse (some predictions satisfy $g_{t}(\mathbf{x})>f(\mathbf{x})$, others $g_{t}(\mathbf{x})<f(\mathbf{x})$), the errors can cancel, and in the ideal case averaging recovers the target. Combining these two observations, diverse hypotheses make it easier for the fused model to outperform the individual ones.
It can be shown theoretically that uniform blending reduces the error:
$$\operatorname{avg}\left(E_{\text{out}}\left(g_{t}\right)\right)=\operatorname{avg}\left(\mathcal{E}\left(g_{t}-G\right)^{2}\right)+E_{\text{out}}(G)$$
From this equation we can see that unless all $g_t$ are identical, the error of the fused model $G$ is strictly smaller than the average error of the individual models (the gap is the variance term).
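The identity can be checked numerically. Below is a minimal sketch under assumed synthetic data (a noisy sine target fit by degree-3 polynomial regressors on random subsamples); the two printed numbers should coincide up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic target: f(x) = sin(x) plus noise; each g_t is a
# degree-3 polynomial fit on its own random subsample.
x_train = rng.uniform(0, 6, size=200)
y_train = np.sin(x_train) + rng.normal(0, 0.3, size=200)
x_test = np.linspace(0, 6, 500)
y_test = np.sin(x_test)

preds = []
for _ in range(10):
    idx = rng.choice(len(x_train), size=60, replace=False)
    coeffs = np.polyfit(x_train[idx], y_train[idx], deg=3)
    preds.append(np.polyval(coeffs, x_test))
preds = np.array(preds)            # shape (T, n_test)
G = preds.mean(axis=0)             # uniform blend

avg_err_gt = np.mean((preds - y_test) ** 2)   # avg_t E_out(g_t)
variance   = np.mean((preds - G) ** 2)        # avg_t E[(g_t - G)^2]
err_G      = np.mean((G - y_test) ** 2)       # E_out(G)

print(avg_err_gt, variance + err_G)           # the two values agree
```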
Linear fusion (Linear Blending)
Classification problem:
$$G(\mathbf{x})=\operatorname{sign}\left(\sum_{t=1}^{T} \alpha_{t} \cdot g_{t}(\mathbf{x})\right) \text{ with } \alpha_{t} \geq 0$$
Regression problem:
$$\min_{\alpha_{t} \geq 0} \frac{1}{N} \sum_{n=1}^{N}\left(y_{n}-\sum_{t=1}^{T} \alpha_{t} g_{t}\left(\mathbf{x}_{n}\right)\right)^{2}$$
The optimization problem can then be written generally as:
$$\min_{\alpha_{t} \geq 0} \frac{1}{N} \sum_{n=1}^{N} \operatorname{err}\left(y_{n}, \sum_{t=1}^{T} \alpha_{t} g_{t}\left(\mathbf{x}_{n}\right)\right)$$
Use the training set to obtain the $g_t$, but it is best to use a separate validation set to learn the $\alpha_t$.
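A minimal linear-blending sketch, assuming the models were already trained and that SciPy is available: `nnls` solves the non-negativity-constrained least-squares problem above on the validation predictions.

```python
import numpy as np
from scipy.optimize import nnls

# Assumed setup: `models` were already fit on the training set; we now learn
# the non-negative blending weights alpha_t on a separate validation set.
def fit_linear_blend(models, X_val, y_val):
    # Column t holds g_t's predictions on the validation set.
    Z = np.column_stack([m.predict(X_val) for m in models])
    alphas, _ = nnls(Z, y_val)        # min ||Z a - y||^2 subject to a >= 0
    return alphas

def predict_blend(models, alphas, X):
    Z = np.column_stack([m.predict(X) for m in models])
    return Z @ alphas                  # take sign(...) for classification
```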
Stacking

As shown in the figure above, the first stage runs $N$ base models (e.g., xgb1, lgb1) through $n$-fold cross-validation, and the resulting out-of-fold predictions ($N$ models $\times$ $n$ folds) are assembled into a feature matrix that serves as the input to the second-layer model. The original test data must likewise be mapped by the $N$ base models into an $N$-dimensional space.
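A minimal stacking sketch along the lines of the figure, assuming scikit-learn-style base models and a logistic regression as the (assumed) second-level model; `cross_val_predict` supplies the out-of-fold predictions used as meta-features.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

# Assumed setup: `base_models` is a list of N scikit-learn classifiers
# (e.g. xgb1, lgb1, ...); (X_train, y_train) and X_test are given.
def stack(base_models, X_train, y_train, X_test):
    # Out-of-fold predictions of each base model become one training feature.
    train_meta = np.column_stack([
        cross_val_predict(m, X_train, y_train, cv=5)   # n-fold CV per model
        for m in base_models
    ])
    # Each base model, refit on the full training set, maps the test set
    # into the same N-dimensional space.
    test_meta = np.column_stack([
        m.fit(X_train, y_train).predict(X_test)
        for m in base_models
    ])
    meta_model = LogisticRegression().fit(train_meta, y_train)
    return meta_model.predict(test_meta)
```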

Bagging
Blending first learns different $g_t$ and then combines them. Can we instead aggregate while we learn the $g_t$?
Ways of obtaining diverse $g_t$ include:
- diversity by different models
- diversity by different parameters: for example, different step sizes for gradient descent (GD)
- diversity by algorithmic randomness: for example, the learning algorithm itself is stochastic
- diversity by data randomness
Bootstrap Aggregation
Starting from the data side: if we could obtain many diverse data sets $\mathcal{D}_t$ with which to train the $g_t$, that would be enough. Resampling the data gives us simulated $\mathcal{D}_t$:
- From the original data set $\mathcal{D}$ of size $N$, sample with replacement $N'$ times to obtain the simulated data set $\tilde{\mathcal{D}}_{t}$; this step is the bootstrap operation.
- Obtain $g_{t}=\mathcal{A}\left(\tilde{\mathcal{D}}_{t}\right)$, then combine by uniform blending: $G=\operatorname{Uniform}\left(\left\{g_{t}\right\}\right)$.
The premises for bootstrap aggregation to work well are that the resampled data sets are diverse and that the base algorithm $\mathcal{A}$ is sensitive to randomness in the data.
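A minimal bootstrap-aggregation sketch under these assumptions, using a decision tree as the (assumed) base algorithm $\mathcal{A}$ and $N' = N$ resamples:

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, base=None, T=25, seed=0):
    """Train T models, each on a bootstrap resample of (X, y)."""
    base = base if base is not None else DecisionTreeClassifier()
    rng = np.random.default_rng(seed)
    N = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)        # N' = N draws with replacement
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Uniform (majority) vote; assumes labels in {-1, +1}."""
    return np.sign(np.sum([m.predict(X) for m in models], axis=0))
```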
Adaptive Boosting (AdaBoost) is in fact an aggregation algorithm that grows out of the bootstrap idea at the heart of bagging; it is introduced in detail below.
AdaBoost
The core idea of AdaBoost is to increase the weight of the samples that the current model misclassifies and to train the next weak classifier on this re-weighted data. Finally, all weak classifiers vote. A classic diagram:
A simple diagram of the algorithm flow:


Algorithm flow
For the binary classification problem, the algorithm can be summarized as follows:
Input: the sample set $D=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{N}, y_{N}\right)\right\}$, a weak learner algorithm, and the number of iterations $T$.
Output: the final strong learner $G(\mathbf{x})$.
- Initialize the sample weights:
$$\mathbf{u}^{(1)}=\left[\frac{1}{N}, \frac{1}{N}, \cdots, \frac{1}{N}\right]$$
- For $t = 1, 2, \ldots, T$:
(a) Train on the data with weights $\mathbf{u}^{(t)}$ to obtain the weak learner $g_t(\mathbf{x})$.
(b) Compute the weighted classification error rate of $g_t(\mathbf{x})$:
$$\epsilon_{t}=\frac{\sum_{n=1}^{N} u_{n}^{(t)}\left[y_{n} \neq g_{t}\left(\mathbf{x}_{n}\right)\right]}{\sum_{n=1}^{N} u_{n}^{(t)}}$$
(c) Update the sample weights: for misclassified samples, $u_{n}^{(t+1)} \leftarrow u_{n}^{(t)} \cdot \sqrt{\frac{1-\epsilon_{t}}{\epsilon_{t}}}$; for correctly classified samples, $u_{n}^{(t+1)} \leftarrow u_{n}^{(t)} / \sqrt{\frac{1-\epsilon_{t}}{\epsilon_{t}}}$. Since $\epsilon_{t} \leq 1/2$, the scaling factor is at least 1, so misclassified samples are up-weighted.
(d) Compute the coefficient of the weak classifier:
$$\alpha_{t}=\ln \sqrt{\frac{1-\epsilon_{t}}{\epsilon_{t}}}$$
- The final classifier is:
$$G(\mathbf{x})=\operatorname{sign}\left(\sum_{t=1}^{T} \alpha_{t} g_{t}(\mathbf{x})\right)$$
The weak classifier here can be a decision stump (a single-level decision tree), which has only three parameters: the chosen feature $i$, a threshold $\theta$, and a direction $s$ indicating which side of the threshold is classified as positive.
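Putting the steps together, here is a minimal AdaBoost sketch that follows the update rule above, using a depth-1 scikit-learn tree as the decision stump; labels are assumed to be in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=20):
    """Train T weighted decision stumps with the AdaBoost re-weighting rule."""
    N = len(X)
    u = np.full(N, 1.0 / N)                  # initial sample weights u^(1)
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=u)
        pred = stump.predict(X)
        wrong = (pred != y)
        eps = u[wrong].sum() / u.sum()       # weighted error rate epsilon_t
        if eps == 0 or eps >= 0.5:
            break                            # stump is perfect or no better than chance
        scale = np.sqrt((1 - eps) / eps)
        u[wrong] *= scale                    # up-weight misclassified samples
        u[~wrong] /= scale                   # down-weight correct samples
        stumps.append(stump)
        alphas.append(np.log(scale))         # alpha_t = ln sqrt((1-eps)/eps)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote of all weak classifiers."""
    score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(score)
```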