Understanding the mathematics behind deep neural networks (Mysteries of Neural Networks, Part I)
Original title: Deep Dive into Math Behind Deep Networks
Original author: Piotr Skalski
Original link: https://medium.com/towards-data-science/https-medium-com-piotr-skalski92-deep-dive-into-deep-networks-math-17660bc376ba
Today we have many high-level, specialized libraries and frameworks such as Keras, TensorFlow, and PyTorch, so we no longer need to constantly worry about the size of our weight matrices or remember the derivative formulas of the activation functions we decided to use. When we need to create a neural network, even one with a fairly complex structure, we can build it with just a few lines of code. This saves us hours of hunting for bugs and simplifies our work. However, knowing what happens inside a neural network is what helps us choose an appropriate architecture and tune the hyperparameters or the optimizer effectively.
Introduction
[Figure: an example dataset in which points of two classes are arranged in circles]
For example, we will solve the binary classification problem for the dataset shown in the figure above, in which the points belong to two classes arranged in circles. This layout is very inconvenient for traditional machine learning algorithms, but a small neural network can solve it well. To tackle the problem we will use the network structure shown in the figure below: five fully connected layers with different numbers of neurons. The hidden layers use the ReLU activation function and the output layer uses sigmoid. This is a very simple architecture, but it is sufficient for our problem.
Keras solution
First, let's show how to build this network with Keras, one of the most popular machine learning libraries.
from keras.models import Sequential
from keras.layers import Dense

# Five fully connected layers: ReLU in the hidden layers, sigmoid at the output
model = Sequential()
model.add(Dense(4, input_dim=2, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Binary cross-entropy loss with the Adam optimizer
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, verbose=0)
As mentioned in the introduction, a few lines of code are enough to create a model that classifies the points of our dataset with close to 100% accuracy. Our task boils down to providing hyperparameters for the chosen architecture: the number of layers, the number of neurons per layer, the activation functions, and the number of training epochs. The figure below shows what happens behind the scenes during training.
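The snippet above assumes that X_train and y_train are already defined. The original article does not show that step; purely as an illustration (this is our assumption, not the author's code), a two-circles dataset like the one pictured could be generated with scikit-learn's make_circles:

# Hypothetical data preparation step, not part of the original article
from sklearn.datasets import make_circles

# 1,000 points forming two concentric circles, with a little noise added
X_train, y_train = make_circles(n_samples=1000, noise=0.05, random_state=42)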
What are neural networks?
A neural network is a biologically inspired method of building computer programs that are able to learn and independently find connections in data. As the second figure above shows, a network is a collection of neurons arranged in layers and connected in a way that allows them to communicate with each other.
Single neuron
Each neuron receives a set of x values as input and computes a predicted value ŷ. The vector x actually contains the feature values of one of the m examples in the training set. What's more, each neuron has its own set of parameters, usually referred to as w (a column vector of weights) and b (a bias), which change during the learning process. In each iteration the neuron computes the dot product of the vector x with its current weights w and adds the bias; in short, ŷ = g(wᵀx + b). Finally, the result of this computation is passed through a nonlinear activation function g.
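Here is a minimal NumPy sketch of this computation (the variable names are ours, chosen to mirror the notation above):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A single neuron: dot product of inputs and weights, plus a bias,
# passed through a nonlinear activation g (here sigmoid)
x = np.array([0.5, -1.2])   # feature vector of one training example
w = np.array([0.7, 0.3])    # the neuron's weight vector
b = 0.1                     # the neuron's bias

z = np.dot(w, x) + b        # linear part
y_hat = sigmoid(z)          # prediction g(z)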
Single layer
Now let's zoom out a little and consider how calculations are performed for a whole layer of the network. We will use our knowledge from the single-neuron section and vectorize across the layer, combining those calculations into matrix equations. To unify the notation, the equations are written for an arbitrary layer denoted [l]. By the way, the subscript i marks the index of a neuron within that layer.
Another important note: when we wrote the equations for a single neuron, we used x and ŷ, which were respectively the column vector of features and the predicted value. When switching to the general notation for layers, we use the vector a, which denotes the output of the activation function of the corresponding layer. The vector x is therefore the activation of layer 0, the input layer. Each neuron in a layer performs a similar calculation according to the following formulas:
z_i^{[l]} = {w_i^{[l]}}^T a^{[l-1]} + b_i^{[l]}, \qquad a_i^{[l]} = g^{[l]}(z_i^{[l]})
For the sake of clarity, let's write these equations out for layer 2 as an example:
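The original image is not reproduced here; assuming, as in the Keras model above, that layer 2 has 6 neurons, the written-out equations look like this:

z_1^{[2]} = {w_1^{[2]}}^T a^{[1]} + b_1^{[2]}, \qquad a_1^{[2]} = g^{[2]}(z_1^{[2]})
z_2^{[2]} = {w_2^{[2]}}^T a^{[1]} + b_2^{[2]}, \qquad a_2^{[2]} = g^{[2]}(z_2^{[2]})
\vdots
z_6^{[2]} = {w_6^{[2]}}^T a^{[1]} + b_6^{[2]}, \qquad a_6^{[2]} = g^{[2]}(z_6^{[2]})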
As you can see, we have to perform very similar calculations for every neuron of the layer. Using a for-loop for this would not be very efficient, so to speed up the computation we vectorize. First, by stacking the horizontal (transposed) weight vectors w on top of each other, we build the matrix W. Similarly, we stack the biases of all neurons in the layer into the vertical vector b. Now nothing stops us from writing a single matrix equation that calculates all the neurons of the layer at once. Let's also write down the dimensions of the matrices and vectors used.
z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \qquad a^{[l]} = g^{[l]}(z^{[l]})

where, with n^{[l]} denoting the number of neurons in layer l, W^{[l]} has shape (n^{[l]}, n^{[l-1]}), and b^{[l]}, z^{[l]}, and a^{[l]} each have shape (n^{[l]}, 1).
Vectorizing across multiple examples
The equations we have written so far involve only a single example. In the learning process of a neural network you usually work with huge datasets, up to millions of entries, so the next step is to vectorize across multiple examples. Suppose our dataset has m entries, each with n_x features. First, we put the vertical vectors x, a, and z of each example side by side, creating the matrices X, A, and Z respectively. Then we rewrite the previous equations taking the newly created matrices into account. (This is what is usually called batch computation.)
Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}(Z^{[l]}), \qquad A^{[0]} = X

where X has shape (n_x, m), each matrix now has one column per example, and the bias b^{[l]} is broadcast across the m columns.
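A NumPy sketch of one vectorized layer (the function and variable names are ours; the shapes follow the equations above):

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def layer_forward(A_prev, W, b, g):
    # A_prev: (n_prev, m) activations of the previous layer, one column per example
    # W:      (n, n_prev) stacked (transposed) weight vectors of the layer's neurons
    # b:      (n, 1)      biases, broadcast across the m columns
    Z = W @ A_prev + b   # linear part for all neurons and all examples at once
    return g(Z)          # elementwise activation

# Example: layer 1 of the network above (2 input features -> 4 neurons)
m = 1000
X = np.random.randn(2, m)            # n_x = 2 features, m examples
W1 = np.random.randn(4, 2) * 0.1     # small random weights
b1 = np.zeros((4, 1))
A1 = layer_forward(X, W1, b1, relu)  # shape (4, m)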
What are activation functions and why do we need them?
Activation functions are one of the key elements of the neural network. Without them, our network would become a composition of linear functions, which is itself just a linear function; the expressiveness of our model would then be no better than logistic regression. The nonlinear element is what allows for greater flexibility and the creation of complex functions during the learning process. The activation function also has a significant impact on the speed of learning, which is one of the main criteria for choosing it. The figure below shows some commonly used activation functions. Currently, probably the most popular choice for hidden layers is ReLU. We still sometimes use sigmoid, especially in the output layer when we deal with binary classification and want the value returned by the model to lie in the range 0 to 1.
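Minimal NumPy definitions of the two activations used in our network, along with their derivatives, which we will need later for backpropagation (the helper names are ours):

import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def sigmoid_backward(dA, Z):
    s = sigmoid(Z)
    return dA * s * (1 - s)      # chain rule: dL/dZ = dL/dA * g'(Z)

def relu(Z):
    return np.maximum(0, Z)

def relu_backward(dA, Z):
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0               # ReLU passes gradient only where Z > 0
    return dZ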
Loss function
The basic source of information about the progress of the learning process is the value of the loss function. Generally speaking, the loss function is designed to show how far we are from the "ideal" solution. In our case we use binary cross-entropy, but depending on the problem you can apply different functions. The function we use is described by the following formula, and the change of its value during learning is visualized below: with each iteration the value of the loss function decreases and the accuracy increases.
J(W, b) = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right)

[Figure: loss value falling and accuracy rising over successive training iterations]
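A direct NumPy translation of this formula (a sketch; Y_hat and Y are assumed to be 1 x m row vectors of predictions and labels):

import numpy as np

def binary_cross_entropy(Y_hat, Y):
    # Y_hat: (1, m) predicted probabilities; Y: (1, m) true labels (0 or 1)
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m
    return float(np.squeeze(cost))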
How do neural networks learn?
The learning process of a neural network consists of changing the values of the parameters W and b so as to minimize the loss function. To achieve this goal we turn to calculus and use gradient descent to find the minimum of the function. In each iteration we calculate the values of the partial derivatives of the loss function with respect to each parameter of the network. For those less familiar with this kind of calculation: the derivative describes the slope of a function. Thanks to that, we know how to manipulate the variables in order to move downhill on the graph. To build an intuition for how gradient descent works (and to stop you from falling asleep again), I prepared a small visualization. You can see how, with each successive epoch, we head towards the minimum. In our neural network it works the same way: the gradient calculated in each iteration shows the direction in which we should move. The main difference is that in our example network we have many more parameters to manipulate. So how do we calculate such complex derivatives?
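To make the idea concrete, here is a toy gradient descent in one dimension; the function f(w) = (w - 3)^2 and all names are ours, chosen purely for illustration:

# Minimize f(w) = (w - 3)^2, whose minimum is at w = 3
# The derivative f'(w) = 2 * (w - 3) tells us which way is downhill
w = 0.0       # starting point
alpha = 0.1   # learning rate

for epoch in range(50):
    grad = 2 * (w - 3)   # slope of f at the current w
    w -= alpha * grad    # step against the gradient

print(w)  # close to 3.0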
Backpropagation
Backpropagation is an algorithm that allows us to calculate these very complicated gradients. The parameters of the neural network are adjusted according to the following formulas:
W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}, \qquad b^{[l]} = b^{[l]} - \alpha \, db^{[l]}
In the equations above, α represents the learning rate: a hyperparameter that lets you control the size of the adjustment performed. Choosing the learning rate is crucial. If we set it too low, our NN learns very slowly; if we set it too high, we keep overshooting and never reach the minimum. dW and db are the partial derivatives of the loss function with respect to W and b, calculated using the chain rule, and the sizes of dW and db are the same as those of W and b respectively. The figure below shows the sequence of operations within the network; we can clearly see how forward and backward propagation work together to optimize the loss function.
[Figure: forward and backward propagation working together across the layers of the network]
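As a sketch of where dW and db come from, here is the chain-rule gradient calculation for a single layer l, following the standard formulas for a fully connected layer (the original article presents them in a figure; the variable names here mirror the notation above):

import numpy as np

def layer_backward(dZ, A_prev, W):
    # dZ:     (n, m)      gradient of the loss w.r.t. the layer's linear output Z
    # A_prev: (n_prev, m) activations of the previous layer, saved from the forward pass
    # W:      (n, n_prev) the layer's weight matrix
    m = A_prev.shape[1]
    dW = (dZ @ A_prev.T) / m                     # same shape as W
    db = np.sum(dZ, axis=1, keepdims=True) / m   # same shape as b
    dA_prev = W.T @ dZ                           # gradient passed back to the previous layer
    return dW, db, dA_prev

# The parameters are then updated as in the formulas above:
# W = W - alpha * dW
# b = b - alpha * db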
Conclusion
When using NNs, it helps a lot to understand at least the basics of the process happening under the hood. I consider the things mentioned here the most important, but they are only the tip of the iceberg. I strongly recommend that you try to program such a small neural network yourself, without using an advanced framework, with nothing but NumPy.