Understanding the mathematics behind deep neural networks (Mysteries of Neural Networks, Part I)
Original title: Deep Dive into Math Behind Deep Networks
Original author: Piotr Skalski
Original link: https://medium.com/towards-data-science/https-medium-com-piotr-skalski92-deep-dive-into-deep-networks-math-17660bc376ba
Today we have many high-level, specialized libraries and frameworks such as Keras, TensorFlow, and PyTorch, so we no longer need to constantly worry about the size of our weight matrices or remember the derivative formulas of the activation functions we decided to use. When we need to create a neural network, even one with a fairly complex structure, we can build it with just a few lines of code. This saves us hours of hunting for bugs and simplifies our work. However, knowing what happens inside a neural network is what helps us choose an appropriate architecture and tune the hyperparameters or the optimizer effectively.
Introduction
[Figure: an example dataset in which points of two classes are arranged in circles]
For example, we will solve the binary classification problem for the dataset shown in the figure above, in which the points belong to two classes arranged in circles. This layout is very inconvenient for traditional machine learning algorithms, but a small neural network can solve it well. To tackle the problem we will use the network structure shown in the figure below: five fully connected layers with different numbers of neurons. The hidden layers use the ReLU activation function and the output layer uses sigmoid. This is a very simple architecture, but it is sufficient for our problem.
Keras solution
First, let's show how to build this network with Keras, one of the most popular machine learning libraries.
from keras.models import Sequential
from keras.layers import Dense

# Five fully connected layers: ReLU in the hidden layers, sigmoid at the output
model = Sequential()
model.add(Dense(4, input_dim=2, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Binary cross-entropy loss with the Adam optimizer
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, verbose=0)
As mentioned in the introduction, a few lines of code are enough to create a model that classifies the points of our dataset with close to 100% accuracy. Our task boils down to providing hyperparameters for the chosen architecture: the number of layers, the number of neurons per layer, the activation functions, and the number of training epochs. The figure below shows what happens behind the scenes during training.
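The snippet above assumes that X_train and y_train are already defined. The original article does not show that step; purely as an illustration (this is our assumption, not the author's code), a two-circles dataset like the one pictured could be generated with scikit-learn's make_circles:

# Hypothetical data preparation step, not part of the original article
from sklearn.datasets import make_circles

# 1,000 points forming two concentric circles, with a little noise added
X_train, y_train = make_circles(n_samples=1000, noise=0.05, random_state=42)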
What are neural networks?
A neural network is a biologically inspired method of building computer programs that are able to learn and independently find connections in data. As the second figure above shows, a network is a collection of neurons arranged in layers and connected in a way that allows them to communicate with each other.
Single neuron
Each neuron receives a set of x values as input and computes a predicted value ŷ. The vector x actually contains the feature values of one of the m examples in the training set. What's more, each neuron has its own set of parameters, usually referred to as w (a column vector of weights) and b (a bias), which change during the learning process. In each iteration the neuron computes the dot product of the vector x with its current weights w and adds the bias; in short, ŷ = g(wᵀx + b). Finally, the result of this computation is passed through a nonlinear activation function g.
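Here is a minimal NumPy sketch of this computation (the variable names are ours, chosen to mirror the notation above):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A single neuron: dot product of inputs and weights, plus a bias,
# passed through a nonlinear activation g (here sigmoid)
x = np.array([0.5, -1.2])   # feature vector of one training example
w = np.array([0.7, 0.3])    # the neuron's weight vector
b = 0.1                     # the neuron's bias

z = np.dot(w, x) + b        # linear part
y_hat = sigmoid(z)          # prediction g(z)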
Single layer
Now let's zoom out a little and consider how calculations are performed for a whole layer of the network. We will use our knowledge from the single-neuron section and vectorize across the layer, combining those calculations into matrix equations. To unify the notation, the equations are written for an arbitrary layer denoted [l]. By the way, the subscript i marks the index of a neuron within that layer.
Another important note: when we wrote the equations for a single neuron, we used x and ŷ, which were respectively the column vector of features and the predicted value. When switching to the general notation for layers, we use the vector a, which denotes the output of the activation function of the corresponding layer. The vector x is therefore the activation of layer 0, the input layer. Each neuron in a layer performs a similar calculation according to the following formulas:
z_i^{[l]} = {w_i^{[l]}}^T a^{[l-1]} + b_i^{[l]}, \qquad a_i^{[l]} = g^{[l]}(z_i^{[l]})
For the sake of clarity, let's write these equations out for layer 2 as an example:
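The original image is not reproduced here; assuming, as in the Keras model above, that layer 2 has 6 neurons, the written-out equations look like this:

z_1^{[2]} = {w_1^{[2]}}^T a^{[1]} + b_1^{[2]}, \qquad a_1^{[2]} = g^{[2]}(z_1^{[2]})
z_2^{[2]} = {w_2^{[2]}}^T a^{[1]} + b_2^{[2]}, \qquad a_2^{[2]} = g^{[2]}(z_2^{[2]})
\vdots
z_6^{[2]} = {w_6^{[2]}}^T a^{[1]} + b_6^{[2]}, \qquad a_6^{[2]} = g^{[2]}(z_6^{[2]})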
As you can see, we have to perform very similar calculations for every neuron of the layer. Using a for-loop for this would not be very efficient, so to speed up the computation we vectorize. First, by stacking the horizontal (transposed) weight vectors w on top of each other, we build the matrix W. Similarly, we stack the biases of all neurons in the layer into the vertical vector b. Now nothing stops us from writing a single matrix equation that calculates all the neurons of the layer at once. Let's also write down the dimensions of the matrices and vectors used.
z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \qquad a^{[l]} = g^{[l]}(z^{[l]})

where, with n^{[l]} denoting the number of neurons in layer l, W^{[l]} has shape (n^{[l]}, n^{[l-1]}), and b^{[l]}, z^{[l]}, and a^{[l]} each have shape (n^{[l]}, 1).
Vectorizing across multiple examples
The equations we have written so far involve only a single example. In the learning process of a neural network you usually work with huge datasets, up to millions of entries, so the next step is to vectorize across multiple examples. Suppose our dataset has m entries, each with n_x features. First, we put the vertical vectors x, a, and z of each example side by side, creating the matrices X, A, and Z respectively. Then we rewrite the previous equations taking the newly created matrices into account. (This is what is usually called batch computation.)
Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}(Z^{[l]}), \qquad A^{[0]} = X

where X has shape (n_x, m), each matrix now has one column per example, and the bias b^{[l]} is broadcast across the m columns.
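A NumPy sketch of one vectorized layer (the function and variable names are ours; the shapes follow the equations above):

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def layer_forward(A_prev, W, b, g):
    # A_prev: (n_prev, m) activations of the previous layer, one column per example
    # W:      (n, n_prev) stacked (transposed) weight vectors of the layer's neurons
    # b:      (n, 1)      biases, broadcast across the m columns
    Z = W @ A_prev + b   # linear part for all neurons and all examples at once
    return g(Z)          # elementwise activation

# Example: layer 1 of the network above (2 input features -> 4 neurons)
m = 1000
X = np.random.randn(2, m)            # n_x = 2 features, m examples
W1 = np.random.randn(4, 2) * 0.1     # small random weights
b1 = np.zeros((4, 1))
A1 = layer_forward(X, W1, b1, relu)  # shape (4, m)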
What are activation functions and why do we need them?
Activation functions are one of the key elements of the neural network. Without them, our network would become a composition of linear functions, which is itself just a linear function; the expressiveness of our model would then be no better than logistic regression. The nonlinear element is what allows for greater flexibility and the creation of complex functions during the learning process. The activation function also has a significant impact on the speed of learning, which is one of the main criteria for choosing it. The figure below shows some commonly used activation functions. Currently, probably the most popular choice for hidden layers is ReLU. We still sometimes use sigmoid, especially in the output layer when we deal with binary classification and want the value returned by the model to lie in the range 0 to 1.
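Minimal NumPy definitions of the two activations used in our network, along with their derivatives, which we will need later for backpropagation (the helper names are ours):

import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def sigmoid_backward(dA, Z):
    s = sigmoid(Z)
    return dA * s * (1 - s)      # chain rule: dL/dZ = dL/dA * g'(Z)

def relu(Z):
    return np.maximum(0, Z)

def relu_backward(dA, Z):
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0               # ReLU passes gradient only where Z > 0
    return dZ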
Loss function
The basic source of information about the progress of the learning process is the value of the loss function. Generally speaking, the loss function is designed to show how far we are from the "ideal" solution. In our case we use binary cross-entropy, but depending on the problem you can apply different functions. The function we use is described by the following formula, and the change of its value during learning is visualized below: with each iteration the value of the loss function decreases and the accuracy increases.
J(W, b) = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right)

[Figure: loss value falling and accuracy rising over successive training iterations]
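A direct NumPy translation of this formula (a sketch; Y_hat and Y are assumed to be 1 x m row vectors of predictions and labels):

import numpy as np

def binary_cross_entropy(Y_hat, Y):
    # Y_hat: (1, m) predicted probabilities; Y: (1, m) true labels (0 or 1)
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m
    return float(np.squeeze(cost))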
How do neural networks learn?
The learning process of a neural network consists of changing the values of the parameters W and b so as to minimize the loss function. To achieve this goal we turn to calculus and use gradient descent to find the minimum of the function. In each iteration we calculate the values of the partial derivatives of the loss function with respect to each parameter of the network. For those less familiar with this kind of calculation: the derivative describes the slope of a function. Thanks to that, we know how to manipulate the variables in order to move downhill on the graph. To build an intuition for how gradient descent works (and to stop you from falling asleep again), I prepared a small visualization. You can see how, with each successive epoch, we head towards the minimum. In our neural network it works the same way: the gradient calculated in each iteration shows the direction in which we should move. The main difference is that in our example network we have many more parameters to manipulate. So how do we calculate such complex derivatives?
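To make the idea concrete, here is a toy gradient descent in one dimension; the function f(w) = (w - 3)^2 and all names are ours, chosen purely for illustration:

# Minimize f(w) = (w - 3)^2, whose minimum is at w = 3
# The derivative f'(w) = 2 * (w - 3) tells us which way is downhill
w = 0.0       # starting point
alpha = 0.1   # learning rate

for epoch in range(50):
    grad = 2 * (w - 3)   # slope of f at the current w
    w -= alpha * grad    # step against the gradient

print(w)  # close to 3.0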
Backpropagation
Backpropagation is an algorithm that allows us to calculate these very complicated gradients. The parameters of the neural network are adjusted according to the following formulas:
W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}, \qquad b^{[l]} = b^{[l]} - \alpha \, db^{[l]}
In the equations above, α represents the learning rate: a hyperparameter that lets you control the size of the adjustment performed. Choosing the learning rate is crucial. If we set it too low, our NN learns very slowly; if we set it too high, we keep overshooting and never reach the minimum. dW and db are the partial derivatives of the loss function with respect to W and b, calculated using the chain rule, and the sizes of dW and db are the same as those of W and b respectively. The figure below shows the sequence of operations within the network; we can clearly see how forward and backward propagation work together to optimize the loss function.
[Figure: forward and backward propagation working together across the layers of the network]
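As a sketch of where dW and db come from, here is the chain-rule gradient calculation for a single layer l, following the standard formulas for a fully connected layer (the original article presents them in a figure; the variable names here mirror the notation above):

import numpy as np

def layer_backward(dZ, A_prev, W):
    # dZ:     (n, m)      gradient of the loss w.r.t. the layer's linear output Z
    # A_prev: (n_prev, m) activations of the previous layer, saved from the forward pass
    # W:      (n, n_prev) the layer's weight matrix
    m = A_prev.shape[1]
    dW = (dZ @ A_prev.T) / m                     # same shape as W
    db = np.sum(dZ, axis=1, keepdims=True) / m   # same shape as b
    dA_prev = W.T @ dZ                           # gradient passed back to the previous layer
    return dW, db, dA_prev

# The parameters are then updated as in the formulas above:
# W = W - alpha * dW
# b = b - alpha * db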
Conclusion
When using NNs, it helps a lot to understand at least the basics of the process happening under the hood. I consider the things mentioned here the most important, but they are only the tip of the iceberg. I strongly recommend that you try to program such a small neural network yourself, without using an advanced framework, with nothing but NumPy.