[Pumpkin Book ML] (Task 4) Mathematical Derivations in Neural Networks
2022-07-24 21:55:00 【Evening scenery at the top of the mountain】
Learning summary
1. Neuron Model and MLP Basics
Take a CTR (click-through rate) task as an example: the input signals to a neuron can be features such as "whether the user has clicked this item". When the cell body receives these signals, it makes a simple judgment and then outputs a signal through the axon; the size of the output signal represents the user's interest in the item.

The activation function in the figure above is the sigmoid activation function, defined as
$$f(z)=\frac{1}{1+\mathrm{e}^{-z}}$$
Its graph is the S-shaped curve shown in the figure. Its main properties:
- It maps an input signal from the domain $(-\infty,+\infty)$ to the range $(0,1)$ (in click-through-rate prediction and recommendation problems we often need to predict a probability between 0 and 1).
- The sigmoid function is differentiable everywhere, which makes the subsequent gradient-descent learning process convenient, so it has become a very commonly used activation function. Other popular choices include tanh and ReLU. A small numerical sketch of the function and its derivative follows this list.
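As a quick check, here is a minimal NumPy sketch (my own illustration, not from the original text) of the sigmoid function and its derivative $f'(z)=f(z)\bigl(1-f(z)\bigr)$, the form that reappears in the BP gradient terms later:

```python
import numpy as np

def sigmoid(z):
    """Map any real input to the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: f(z) * (1 - f(z)), reused later in back propagation."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-5.0, 0.0, 5.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  grad={sigmoid_grad(z):.4f}")
```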
1.1 How a BP Neural Network Learns
A network composed of multiple neurons has a much stronger ability to fit data. The figure below shows a simple neural network consisting of an input layer, a hidden layer with two neurons, and an output layer with a single neuron:
Each blue neuron has the same structure as the single neuron described above. The input to neurons h1 and h2 is the feature vector formed by x1 and x2, and the input to neuron o1 is the vector formed by the outputs of h1 and h2.
The key training procedures of a neural network are forward propagation (Forward Propagation) and back propagation (Back Propagation).
(1) Forward propagation
The purpose of forward propagation is to obtain the model's estimate for a given input under the current network parameters, i.e. what we usually call model inference. For example, suppose we want to predict a student's gender from their weight and height. Forward propagation takes the weight value 71 and the height value 178, passes them through the computations of neurons h1, h2 and o1, and produces a gender probability, say 0.87, which is the predicted probability that the student is male.
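A minimal sketch of this forward pass is given below. The weights are made-up values (the figure's actual weights are not given in the text), so the printed probability will not exactly match the 0.87 above; the point is only the flow input layer → hidden layer → output layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights (assumed values, not from the original figure).
w1, w2 = 0.01, 0.02    # x1, x2 -> h1
w3, w4 = 0.03, 0.01    # x1, x2 -> h2
w5, w6 = 0.8, 0.6      # h1, h2 -> o1

x1, x2 = 71.0, 178.0   # the student's weight and height

# Forward propagation: input layer -> hidden layer -> output layer.
h1 = sigmoid(w1 * x1 + w2 * x2)
h2 = sigmoid(w3 * x1 + w4 * x2)
o1 = sigmoid(w5 * h1 + w6 * h2)
print(f"predicted probability of being male: {o1:.2f}")
```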
(2) Loss function
If the student's true gender is male, the true probability is 1. According to the definition of the absolute-value error in formula (2), the loss of this prediction is |1 - 0.87| = 0.13.
$$l_{1}\left(y_{i}, \hat{y}_{i}\right)=\left|y_{i}-\hat{y}_{i}\right|$$
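As a one-line check (my own snippet, not from the text), the absolute-error loss of the prediction above:

```python
def l1_loss(y_true, y_pred):
    """Absolute-value error l1(y, y_hat) = |y - y_hat|."""
    return abs(y_true - y_pred)

print(f"{l1_loss(1.0, 0.87):.2f}")   # 0.13, the loss of the gender prediction above
```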
(3) Gradient descent
Having obtained the error (loss) between the predicted value and the true value, we use this error to guide the weight update so that the whole network predicts more accurately next time. The most common way to update the weights is gradient descent, which updates each weight using a partial derivative. For example, to update the weight w5 we first compute the partial derivative of the loss function with respect to w5, $\frac{\partial L_{o 1}}{\partial w_{5}}$. Mathematically, the gradient points in the direction in which the function grows fastest, so the opposite direction of the gradient is the direction in which the function decreases fastest; making the loss decrease as fast as possible is exactly the update direction we want for w5. We also introduce a hyperparameter α, which controls the strength of each gradient update and is called the learning rate. The gradient update formula can then be written as
$$w_{5}^{t+1}=w_{5}^{t}-\alpha \cdot \frac{\partial L_{o 1}}{\partial w_{5}}$$
where w5 can of course be replaced by any other parameter to be updated, and t denotes the update step.
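A minimal one-step sketch of this update, with assumed values; a squared-error loss is used here only so that the derivative has a simple closed form (the absolute error above would give just the sign of the residual):

```python
# One gradient-descent step on w5 (all values assumed for illustration).
alpha = 0.5                             # learning rate
w5, h1, o1, y = 0.8, 0.6, 0.87, 1.0     # current weight, h1 output, prediction, label

# With L = 0.5 * (y - o1)^2 and o1 = sigmoid(w5*h1 + w6*h2),
# dL/dw5 = -(y - o1) * o1 * (1 - o1) * h1   (sigmoid derivative = o1 * (1 - o1)).
grad_w5 = -(y - o1) * o1 * (1.0 - o1) * h1
w5 = w5 - alpha * grad_w5
print(f"updated w5: {w5:.4f}")
```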
For an output-layer neuron (o1 in the figure), we can directly use gradient descent to compute the gradients of its associated weights (the weights w5 and w6 in Figure 5) and update them. But for the parameters associated with a hidden-layer neuron (such as w1), how do we use the loss at the output layer to perform gradient descent?
The answer is to use the chain rule (Chain Rule) during differentiation. With the chain rule, the gradient can be propagated backward layer by layer. The gradient of the final loss with respect to the weight w1 is the partial derivative of the loss with respect to the output of neuron h1, multiplied by the partial derivative of the output of h1 with respect to w1. In other words, the final gradient is passed back layer by layer to "guide" the update of w1:
$$\frac{\partial L_{o 1}}{\partial w_{1}}=\frac{\partial L_{o 1}}{\partial h_{1}} \cdot \frac{\partial h_{1}}{\partial w_{1}}$$
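The sketch below applies this chain rule to the small 2-2-1 network, with assumed (scaled) inputs and weights and a squared-error loss; it computes ∂L/∂w1 as exactly the product of the two factors above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed scaled inputs, label, and weights (illustration only).
x1, x2, y = 0.71, 1.78, 1.0
w1, w2, w3, w4 = 0.1, 0.2, 0.3, 0.1   # input -> hidden
w5, w6 = 0.8, 0.6                     # hidden -> output

# Forward pass.
h1 = sigmoid(w1 * x1 + w2 * x2)
h2 = sigmoid(w3 * x1 + w4 * x2)
o1 = sigmoid(w5 * h1 + w6 * h2)
loss = 0.5 * (y - o1) ** 2

# Backward pass via the chain rule: dL/dw1 = (dL/dh1) * (dh1/dw1).
dL_dh1 = -(y - o1) * o1 * (1 - o1) * w5    # loss -> o1 -> output of h1
dh1_dw1 = h1 * (1 - h1) * x1               # output of h1 -> w1
dL_dw1 = dL_dh1 * dh1_dw1
print(f"loss={loss:.4f}  dL/dw1={dL_dw1:.6f}")
```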
1.2 Linearly Separable and Non-linearly Separable Problems
To solve problems that are not linearly separable, we need to use multiple layers of functional neurons. The figure below shows a two-layer perceptron that can solve the XOR problem:
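To make this concrete, here is a two-layer perceptron with hand-set weights and threshold (step) activations that realizes XOR; these particular weights are one possible choice for illustration, not necessarily the ones in the figure:

```python
def step(z):
    """Threshold (unit-step) activation used by the classic perceptron."""
    return int(z >= 0)

def xor(x1, x2):
    h1 = step(1 * x1 + 1 * x2 - 0.5)     # h1 = x1 OR x2
    h2 = step(-1 * x1 - 1 * x2 + 1.5)    # h2 = NOT (x1 AND x2)
    return step(1 * h1 + 1 * h2 - 1.5)   # output = h1 AND h2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor(a, b))     # prints the XOR truth table
```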
2. The Error Back Propagation (BP) Algorithm
- The idea of the BP algorithm: first propagate the error backward to the hidden-layer neurons, adjusting the connection weights from the hidden layer to the output layer and the thresholds of the output-layer neurons; then, based on the error of the hidden-layer neurons, adjust the connection weights from the input layer to the hidden layer and the thresholds of the hidden-layer neurons.
- The goal of the BP algorithm: minimize the cumulative error on the training set D:
$$E=\frac{1}{m} \sum_{k=1}^{m} E_{k}$$
The basic flow of the BP algorithm:
Input: training set $D=\left\{\left(x_{k}, y_{k}\right)\right\}_{k=1}^{m}$; learning rate $\eta$
Process:
(1) Randomly initialize all connection weights and thresholds in the network within the range $(0,1)$;
(2) repeat
(3) —— for all $\left(x_{k}, y_{k}\right) \in D$ do
(4) ———— compute the output $\hat{y}^{k}$ of the current sample from the current parameters, using $\hat{y}_{j}^{k}=f\left(\beta_{j}-\theta_{j}\right)$;
(5) ———— compute the gradient term of the output-layer neurons, $g_{j}=\hat{y}_{j}^{k}\left(1-\hat{y}_{j}^{k}\right)\left(y_{j}^{k}-\hat{y}_{j}^{k}\right)$;
(6) ———— compute the gradient term of the hidden-layer neurons, $e_{h}=b_{h}\left(1-b_{h}\right) \sum_{j=1}^{l} w_{h j} g_{j}$;
(7) ———— update the connection weights $w_{h j}, v_{i h}$ and the thresholds $\theta_{j}, \gamma_{h}$;
(8) —— end for
(9) until the stopping condition is reached
Output: a multilayer feedforward neural network with all connection weights and thresholds determined
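The loop below is a minimal NumPy sketch of this procedure for one hidden layer. The weight and threshold updates follow the standard rules implied by the gradient terms above ($\Delta w_{hj}=\eta g_j b_h$, $\Delta\theta_j=-\eta g_j$, $\Delta v_{ih}=\eta e_h x_i$, $\Delta\gamma_h=-\eta e_h$); the toy data, layer sizes, learning rate and epoch count are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((8, 2))                                    # 8 toy samples, 2 features
Y = (X.sum(axis=1) > 1.0).astype(float).reshape(-1, 1)    # toy binary labels

d, q, l = 2, 3, 1        # input, hidden, output layer sizes
eta = 0.5                # learning rate

# (1) initialize connection weights and thresholds in (0, 1)
v, gamma = rng.random((d, q)), rng.random(q)   # input -> hidden weights, hidden thresholds
w, theta = rng.random((q, l)), rng.random(l)   # hidden -> output weights, output thresholds

for epoch in range(1000):                      # (2)/(9) repeat until stopping condition
    for x, y in zip(X, Y):                     # (3) for all (x_k, y_k) in D
        b = sigmoid(x @ v - gamma)             #     hidden-layer outputs b_h
        y_hat = sigmoid(b @ w - theta)         # (4) current output
        g = y_hat * (1 - y_hat) * (y - y_hat)  # (5) output-layer gradient term
        e = b * (1 - b) * (w @ g)              # (6) hidden-layer gradient term
        w += eta * np.outer(b, g); theta -= eta * g     # (7) update w_hj, theta_j
        v += eta * np.outer(x, e); gamma -= eta * e     #     update v_ih, gamma_h

pred = sigmoid(sigmoid(X @ v - gamma) @ w - theta)
print("training mean squared error:", float(np.mean((pred - Y) ** 2)))
```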
Preventing overfitting:
- Early stopping: split the data into a training set and a validation set. The training set is used to compute gradients and update the weights and thresholds, while the validation set is used to estimate the error. If the training error keeps decreasing but the validation error rises, stop training and at the same time return the connection weights and thresholds that achieved the minimum validation error.
- Regularization: add a term to the loss function that describes the complexity of the network, such as the sum of squares of the connection weights and thresholds (a small sketch of this regularized error follows this list):
$$E=\lambda \frac{1}{m} \sum_{k=1}^{m} E_{k}+(1-\lambda) \sum_{i} w_{i}^{2},$$
where
- $E_{k}$ is the error on the $k$-th training sample,
- $w_{i}$ denotes the connection weights and thresholds,
- $\lambda \in(0,1)$ trades off the empirical error against the network complexity, and is usually estimated by cross-validation.
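A minimal sketch (toy arrays, my own illustration) of computing this regularized cumulative error:

```python
import numpy as np

def regularized_error(per_sample_errors, params, lam=0.5):
    """E = lam * (1/m) * sum(E_k) + (1 - lam) * sum(w_i^2)."""
    empirical = np.mean(per_sample_errors)     # (1/m) * sum of E_k
    complexity = np.sum(np.square(params))     # sum of squared weights and thresholds
    return lam * empirical + (1.0 - lam) * complexity

E_k = np.array([0.13, 0.05, 0.20])     # toy per-sample errors
w = np.array([0.8, 0.6, -0.3])         # toy connection weights and thresholds
print(f"{regularized_error(E_k, w, lam=0.5):.4f}")
```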
3. Global Minimum and Local Minimum
- Gradient-based search (gradient descent) is the most widely used parameter-optimization method. However, if the error function has multiple local minima, then a point where the gradient of the error function is 0 may only be a local minimum, and we cannot guarantee that the solution found is the global minimum.
- Strategies for trying to escape a local minimum:
- Simulated annealing: at each step, accept a result worse than the current solution with a certain probability (a small sketch of this acceptance rule follows this list).
- Stochastic gradient descent: add a random factor when computing the gradient, so that even at a local minimum the computed gradient may still be non-zero, giving a chance to escape the local minimum and keep searching for better parameters.
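Here is a minimal sketch of a simulated-annealing style acceptance rule (a Metropolis-type probability $e^{-\Delta/T}$, my own illustration rather than a formula from the text): a worse solution is accepted with a probability that shrinks as the temperature T is lowered.

```python
import math
import random

def accept(current_loss, new_loss, temperature):
    """Accept a worse solution with probability exp(-(new - current) / T)."""
    if new_loss <= current_loss:
        return True                    # always accept an improvement
    return random.random() < math.exp(-(new_loss - current_loss) / temperature)

random.seed(0)
print(accept(0.40, 0.43, temperature=1.0))    # usually True: may jump out of a local minimum
print(accept(0.40, 0.43, temperature=0.01))   # almost always False at low temperature
```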