Mathematical derivation from perceptron to feedforward neural network
2022-06-27 19:44:00 【Barnepokhev】
Perceptron
The perceptron is a classical linear model for binary classification. Its principle is simple and intuitive, it introduces the concept of learning, and it is one of the basic building blocks of neural networks.
Definition of the perceptron
Suppose the input space is $X \subseteq \Bbb R^n$ and the output space is $Y = \{+1, -1\}$. An input $x \in X$ is the feature vector of an instance and corresponds to a point in the input space; an output $y \in Y$ is the class of the instance. The function from the input space to the output space given by Eq. (1) is called the perceptron model:
$$f(x) = \mathrm{sign}(w \cdot x + b) \tag{1}$$
Here $w$ and $b$ are the parameters of the perceptron model: $w \in \Bbb R^n$ is called the weight or weight vector, $b \in \Bbb R$ is called the bias, $w \cdot x$ denotes the inner product of $w$ and $x$, and $\mathrm{sign}$ is the sign function:
$$\mathrm{sign}(x) = \begin{cases} +1, & x \ge 0 \\ -1, & x < 0 \end{cases} \tag{2}$$
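To make the definition concrete, here is a minimal sketch of the forward computation in Eqs. (1) and (2); the parameter values below are illustrative assumptions, not learned weights.

```python
import numpy as np

def sign(z):
    """Sign function of Eq. (2): +1 when z >= 0, -1 otherwise."""
    return 1 if z >= 0 else -1

def perceptron(x, w, b):
    """Perceptron model of Eq. (1): f(x) = sign(w . x + b)."""
    return sign(np.dot(w, x) + b)

# Illustrative parameters (assumed, not learned)
w = np.array([2.0, -1.0])
b = 0.5
print(perceptron(np.array([1.0, 1.0]), w, b))  # sign(2 - 1 + 0.5) = +1
```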
The geometric meaning of the perceptron
For a given sample set $T=\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$, the linear equation $w \cdot x + b = 0$ appearing in Eq. (1) corresponds to a hyperplane $S$ in the feature space $\Bbb R^n$, where $w$ is the normal vector of $S$ and $b$ is its intercept. If the feature space is linearly separable, this hyperplane divides it into exactly two parts (as shown in Figure 1), and the feature vectors in the two parts correspond to the positive and negative samples respectively. The hyperplane $S$ is therefore also called the separating hyperplane.
Loss function of the perceptron
To sum up, the main goal of the perceptron is to determine a hyperplane that completely separates the positive and negative samples of the data set. The hyperplane is determined by continually adjusting the parameters $w$ and $b$ during training, and $w$ and $b$ themselves are obtained by defining a loss function in $w$ and $b$ and minimizing it.
If the number of misclassified points is used as the loss function, the loss is not a continuously differentiable function of $w$ and $b$, which makes it hard to optimize in practice. Instead, the total distance from all misclassified points to the hyperplane $S$ can be used as the loss function. The distance from any point $x_0$ in the input space $\Bbb R^n$ to the hyperplane $S$ is:
$$\frac{1}{\|w\|}\left|w \cdot x_0 + b\right| \tag{3}$$
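As a quick numeric check, Eq. (3) can be evaluated directly; this is a sketch with made-up values for the hyperplane and the point.

```python
import numpy as np

def distance_to_hyperplane(x0, w, b):
    """Eq. (3): distance from the point x0 to the hyperplane w . x + b = 0."""
    return abs(np.dot(w, x0) + b) / np.linalg.norm(w)

# Illustrative values (assumed): the line x1 + x2 - 1 = 0 and the origin
print(distance_to_hyperplane(np.array([0.0, 0.0]), np.array([1.0, 1.0]), -1.0))
# -> 0.7071..., i.e. 1 / sqrt(2)
```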
When the label $y_i$ of a sample is $+1$, the classification is correct if $w \cdot x_i + b > 0$; when the label $y_i$ is $-1$, the classification is correct if $w \cdot x_i + b < 0$. The two cases can be combined as:
$$\begin{cases} y_i(w \cdot x_i + b) > 0, & \text{correctly classified} \\ y_i(w \cdot x_i + b) < 0, & \text{misclassified} \end{cases} \tag{4}$$
Therefore, the distance from a misclassified point $x_i$ to the hyperplane $S$ is:
$$-\frac{1}{\|w\|}\,y_i(w \cdot x_i + b) \tag{5}$$
Let $M$ be the set of points misclassified by the hyperplane $S$. The total distance from all misclassified points to $S$ is then:
$$-\frac{1}{\|w\|}\sum_{x_i \in M} y_i(w \cdot x_i + b) \tag{6}$$
Dropping the factor $\frac{1}{\|w\|}$, we obtain the loss function $L(w,b)$ used for perceptron training:
$$L(w,b) = -\sum_{x_i \in M} y_i(w \cdot x_i + b) \tag{7}$$
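Here is a minimal sketch of Eq. (7) in code, assuming the misclassification test of Eq. (4) with the boundary case $y_i(w \cdot x_i + b) = 0$ counted as misclassified; the toy data are made up for illustration.

```python
import numpy as np

def perceptron_loss(X, y, w, b):
    """Eq. (7): L(w, b) = -sum over misclassified points of y_i (w . x_i + b)."""
    margins = y * (X @ w + b)               # y_i (w . x_i + b) for every sample
    return -np.sum(margins[margins <= 0])   # Eq. (4): non-positive margin => in M

# Toy data (assumed): two positive samples, one negative
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
print(perceptron_loss(X, y, w=np.array([-1.0, 0.0]), b=0.0))  # 3.0 + 4.0 = 7.0
```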
Perceptron learning algorithm
As stated above, the learning goal of the perceptron is to find the parameters $w$ and $b$ that minimize the value of the loss function, i.e.:
$$\min_{w,b} L(w,b) = -\sum_{x_i \in M} y_i(w \cdot x_i + b) \tag{8}$$
Because $L(w,b)$ is a simple linear function of the variables $w$ and $b$, and because only the misclassified points enter the loss, there is no need to take derivatives over all points. The descent method adopted by the perceptron is therefore stochastic gradient descent (SGD).
In stochastic gradient descent, only one sample is selected at random in each iteration. If the sample is correctly classified under the current hyperplane, nothing is done; if it is misclassified, the gradient is computed for this misclassified sample $x_i$:
$$\nabla_w = \frac{\partial\left[-y_i(w \cdot x_i + b)\right]}{\partial w} = -y_i x_i \tag{9}$$
$$\nabla_b = \frac{\partial\left[-y_i(w \cdot x_i + b)\right]}{\partial b} = -y_i \tag{10}$$
Using these gradients, $w$ and $b$ are updated as follows:
$$w \leftarrow w + \eta\, y_i x_i \tag{11}$$
$$b \leftarrow b + \eta\, y_i \tag{12}$$
Here $\eta$ is the learning rate, which controls the speed of learning and is usually chosen between $0$ and $1$. By Novikoff's theorem, if the training data are linearly separable the loss can be driven to zero after a finite number of iterations; at that point the values of $w$ and $b$ are obtained, and with them the hyperplane $S$.
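Putting Eqs. (9)-(12) together, a minimal sketch of the stochastic-gradient training loop might look as follows; the data set, learning rate, and epoch limit are illustrative assumptions.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron SGD: on each misclassified sample apply
    w <- w + eta * y_i * x_i (Eq. 11) and b <- b + eta * y_i (Eq. 12)."""
    rng = np.random.default_rng(0)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for i in rng.permutation(len(X)):          # visit samples in random order
            if y[i] * (np.dot(w, X[i]) + b) <= 0:  # misclassified, per Eq. (4)
                w += eta * y[i] * X[i]
                b += eta * y[i]
                errors += 1
        if errors == 0:                            # loss has reached zero
            break
    return w, b

# Toy linearly separable data (assumed)
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
w, b = train_perceptron(X, y)
print(w, b)  # one separating hyperplane; the result depends on the visit order
```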
Defects and improvements of the perceptron model
The perceptron model is simple and intuitive for binary classification, but it has a major limitation: it can only handle linearly separable problems. Take the XOR problem from logic as an example: for inputs $\{A, B\}$, the output is $1$ when $A \ne B$ and $0$ when $A = B$. Taking $(0,0)$, $(0,1)$, $(1,0)$, $(1,1)$ as inputs, the outputs are shown in Table 1:
| A | B | A XOR B |
|---|---|---------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
The XOR problem looks very simple, but it is not linearly separable, so for the perceptron model it is a hard problem. Plotting the results of Table 1 in two-dimensional coordinates gives Figure 2.
As Figure 2 shows, the points with the same output value occupy the two diagonals, so no single straight line (hyperplane) can separate the two classes. However, using two perceptron models appears to solve the problem. As shown in Figure 2, the separating hyperplanes of the two perceptron models are the two solid lines in the figure. Let perceptron 1 be $f_1(x) = \mathrm{sign}(w_1 \cdot x + b_1)$ and perceptron 2 be $f_2(x) = \mathrm{sign}(w_2 \cdot x + b_2)$, and orient them so that for each perceptron the side of its line facing the band that contains the positive samples outputs $+1$, while the opposite side outputs $-1$.
Now add a third perceptron, perceptron 3, to combine the outputs of perceptrons 1 and 2. Let perceptron 3 be $f_3(x_3) = \mathrm{sign}(w_3 \cdot x_3 + b_3)$, where $w_3 = [1, 1]$, $x_3 = [f_1(x), f_2(x)]$, and $b_3 = -1$.
Clearly, when perceptron 3 outputs $+1$, the sample belongs to the positive class; when it outputs $-1$, the sample belongs to the negative class. The overall architecture of the three perceptrons is shown in Figure 3.
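A small sketch can verify this construction. The parameters of perceptrons 1 and 2 below are one possible choice of separating lines (an assumption, since they depend on Figure 2), while $w_3 = [1, 1]$ and $b_3 = -1$ follow the text.

```python
import numpy as np

def sign(z):
    return 1 if z >= 0 else -1

def perceptron(x, w, b):
    return sign(np.dot(w, x) + b)

# Perceptrons 1 and 2: one possible pair of separating lines (assumed)
w1, b1 = np.array([1.0, 1.0]), -0.5    # +1 on the upper-right side of x1 + x2 = 0.5
w2, b2 = np.array([-1.0, -1.0]), 1.5   # +1 on the lower-left side of x1 + x2 = 1.5
# Perceptron 3 combines their outputs, as in the text: w3 = [1, 1], b3 = -1
w3, b3 = np.array([1.0, 1.0]), -1.0

for A, B in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([A, B], dtype=float)
    x3 = np.array([perceptron(x, w1, b1), perceptron(x, w2, b2)], dtype=float)
    print(f"A={A}, B={B} -> {perceptron(x3, w3, b3):+d}")  # +1 exactly when A != B
```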
In fact, the XOR problem is already solved by combining the first two perceptrons; perceptron 3 is added only to unify the output. More generally, when a problem is more complicated, more perceptron models can be stacked. A single perceptron has limited capability, but once perceptrons are stacked in multiple layers, their expressive power increases greatly. In practice, to better handle linearly inseparable problems, the perceptron model evolved toward multiple layers, and this became the cornerstone of various neural networks. A structure in which the neurons in each layer are fully connected to the neurons in the next layer, with no intra-layer or cross-layer connections, is called a multilayer perceptron, or a multilayer feedforward neural network.