[deep learning] teach you to write "handwritten digit recognition neural network" hand in hand, without using any framework, pure numpy
2022-07-24 05:56:00 【Meow, meow, hammer, you cute】
Preface
I have always been interested in machine learning, but my university major has nothing to do with it, so I never got around (procrastination) to writing a neural network entirely without a framework. Fortunately there is a related elective this semester, so I am taking the opportunity to publish what I have learned.
This article assumes you know some Python, linear algebra, and calculus, and that you have a basic understanding of fully connected neural networks: you know that a fully connected network is essentially a nonlinear function that can approximate (almost) any function by adjusting its parameters (that is how I understand it; I am sure there are also functions it cannot describe exactly), and you know roughly what forward and backward propagation do.
If you have some idea of neural networks but have not thought them through carefully, I suggest first watching 3Blue1Brown's Deep Learning videos (available on Bilibili). After them you will have a much deeper understanding of fully connected networks. You may still not be able to write a complete one, because many people are unfamiliar with matrix derivatives and therefore cannot derive the gradient formulas used in back propagation.
Handwritten numeral recognition neural network design
Assuming you have watched the videos, or already understand the material reasonably well, we can start designing.
To keep the explanation simple and the first program clear, I will not use the MNIST data set. Instead I use the data set provided by the tutorial I studied. Each file holds the pixel data of one handwritten digit, essentially a binary image stored as text. The naming rule is digit_sequence number, i.e. label_seq.txt, so the label can be read straight from the file name. The data set itself is not the focus here, so the loading functions are given directly below. Click here to download the dataset.
# img2vector: convert a 32x32 text "image" into a 1x1024 vector
import numpy as np
from os import listdir

def img2vector(filename):
    returnVect = np.zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    fr.close()
    return returnVect

# Read the handwritten-digit txt data set
def handwritingData(dataPath):
    hwLabels = []
    FileList = listdir(dataPath)              # 1. list the directory contents
    m = len(FileList)
    dataMat = np.zeros((m, 1024))
    for i in range(m):
        # 2. parse the class label from the file name (label_seq.txt)
        fileNameStr = FileList[i]
        fileStr = fileNameStr.split('.')[0]   # strip the .txt extension
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        dataMat[i, :] = img2vector(dataPath + '/%s' % fileNameStr)
    return dataMat, hwLabels
Using the data set
Extract the data set into the working directory, like this:
..
└── Work Folder
    ├── Net.py
    ├── trainingDigits
    └── testDigits
# Read the training data
trainDataPath = "./trainingDigits"
trainMat, trainLabels = handwritingData(trainDataPath)
# trainMat.shape -> (1934, 1024); the label of trainMat[i] is trainLabels[i]
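As a quick sanity check (not part of the original tutorial code), you can reshape one row back into its 32x32 binary image and print it. The snippet below assumes trainMat and trainLabels were loaded as above; the index i and the '#'/'.' rendering are arbitrary choices of mine.

```python
# Minimal sanity check (assumption: trainMat/trainLabels were loaded as above).
i = 0                                  # index of the sample to inspect
img = trainMat[i].reshape(32, 32)      # back to a 32x32 binary image
print("label:", trainLabels[i])
for row in img.astype(int):
    # print '#' for ink pixels and '.' for background
    print("".join("#" if v else "." for v in row))
```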
Neural network design
(Figure: network structure, an input layer of 1024 units, two hidden layers of 16 units each, and an output layer of 10 units.)
The model is as shown in the figure above: an input layer, two hidden layers, and an output layer. The activation function is Sigmoid, and the loss function is the sum of squared errors (that is what I call it; I am not sure of its official name), shown below. For a first attempt I suggest the sum of squared errors: it is simple, easy to differentiate, and easy to learn with; improvements can be built on top of it later.
Cost: the loss value that needs to be computed.
$$Sigmoid(x) = \frac{1}{1+e^{-x}}, \qquad Sigmoid'(x) = Sigmoid(x)\big(1 - Sigmoid(x)\big)$$
Sum-of-squared-errors function:
$$Cost(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$$
If your math is not yet strong, I recommend avoiding the cross-entropy loss and the Softmax activation for now; their derivatives are more troublesome and get in the way of learning.
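If you want to convince yourself of these two formulas before moving on, a throwaway numerical check like the following (my own illustration, not part of the network code) compares the Sigmoid derivative identity against a finite difference and evaluates the cost for a sample prediction:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.7
eps = 1e-6
analytic = sigmoid(x) * (1 - sigmoid(x))                    # S'(x) = S(x)(1 - S(x))
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps) # central difference
print(analytic, numeric)                                    # the two values agree closely

y_hat, y = 0.8, 1.0
print(0.5 * (y_hat - y) ** 2)                               # Cost = 0.5 * (0.8 - 1.0)^2 = 0.02
```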
Mathematical model
Notation: $A^{(\text{layer})}$ denotes the quantity $A$ at the given layer, and $S$ denotes the Sigmoid function.
Dimensions:
$$\boldsymbol{x}_{1024\times 1} \;\big|\; \boldsymbol{W}^{(1)}_{16\times 1024} \;\big|\; \boldsymbol{b}^{(1)}_{16\times 1} \;\big|\; \boldsymbol{W}^{(2)}_{16\times 16} \;\big|\; \boldsymbol{b}^{(2)}_{16\times 1} \;\big|\; \boldsymbol{W}^{(3)}_{10\times 16} \;\big|\; \boldsymbol{b}^{(3)}_{10\times 1} \;\big|\; \hat{\boldsymbol{y}}_{10\times 1}$$
Forward propagation (general form):
$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$
$$\mathbf{a}^{(l)} = f\!\left(\mathbf{z}^{(l)}\right)$$
$$\hat{\mathbf{y}} = \mathbf{a}^{(L)} = f\!\left(\mathbf{z}^{(L)}\right)$$
$$\mathcal{L} = \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})$$
For this network, $\boldsymbol{x} = \boldsymbol{a}^{(0)}$ is the input image data, and:
$$\boldsymbol{z}^{(1)} = \boldsymbol{W}^{(1)}\boldsymbol{a}^{(0)} + \boldsymbol{b}^{(1)}, \qquad \boldsymbol{a}^{(1)} = S(\boldsymbol{z}^{(1)})$$
$$\boldsymbol{z}^{(2)} = \boldsymbol{W}^{(2)}\boldsymbol{a}^{(1)} + \boldsymbol{b}^{(2)}, \qquad \boldsymbol{a}^{(2)} = S(\boldsymbol{z}^{(2)})$$
$$\boldsymbol{z}^{(3)} = \boldsymbol{W}^{(3)}\boldsymbol{a}^{(2)} + \boldsymbol{b}^{(3)}, \qquad \hat{\boldsymbol{y}} = S(\boldsymbol{z}^{(3)})$$
$\hat{\boldsymbol{y}}$ is the prediction; the index of its largest entry is the predicted digit.
$$Loss = Cost(\hat{\boldsymbol{y}}, \boldsymbol{y})$$
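To see that the dimensions listed above line up, here is a disposable forward pass with random parameters. It is only a sketch; the real implementation appears later in the Program realization section.

```python
import numpy as np

S = lambda z: 1.0 / (1.0 + np.exp(-z))          # Sigmoid

x  = np.random.randn(1024, 1)                   # a^(0): input image vector
W1, b1 = np.random.randn(16, 1024), np.random.randn(16, 1)
W2, b2 = np.random.randn(16, 16),   np.random.randn(16, 1)
W3, b3 = np.random.randn(10, 16),   np.random.randn(10, 1)

a1 = S(W1 @ x  + b1)                            # shape (16, 1)
a2 = S(W2 @ a1 + b2)                            # shape (16, 1)
y_hat = S(W3 @ a2 + b3)                         # shape (10, 1)
print(y_hat.shape, np.argmax(y_hat))            # (10, 1) and the predicted digit
```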
Back propagation ($\odot$ is element-wise multiplication):
$\frac{\partial Loss}{\partial \hat{\boldsymbol{y}}}$ is taken element by element.
Output layer:
$$\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}}\mathcal{L} \odot S'\!\left(\mathbf{z}^{(L)}\right) = \frac{\partial Loss}{\partial \hat{\boldsymbol{y}}} \odot S'\!\left(\mathbf{z}^{(L)}\right) = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}}$$
Here this gives $\frac{\partial Loss}{\partial \hat{\boldsymbol{y}}} \odot S'\!\left(\mathbf{z}^{(3)}\right)$.
The other layers use the formulas:
$$\boldsymbol{\delta}^{(l)} = \left(\left(\mathbf{W}^{(l+1)}\right)^{\mathrm{T}} \boldsymbol{\delta}^{(l+1)}\right) \odot S'\!\left(\mathbf{z}^{(l)}\right), \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)}\left(\mathbf{a}^{(l-1)}\right)^{\mathrm{T}}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$
Parameter update:
$$\mathbf{W}^{(l)} = \mathbf{W}^{(l)} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}, \qquad \mathbf{b}^{(l)} = \mathbf{b}^{(l)} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}}$$
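A good way to gain confidence in these formulas is a finite-difference gradient check on a tiny random network. The sketch below is my own illustration (a two-layer toy instead of the full 1024-16-16-10 network); it compares the analytic gradient of one entry of $\mathbf{W}^{(1)}$ with a numerical estimate.

```python
import numpy as np

S  = lambda z: 1.0 / (1.0 + np.exp(-z))
dS = lambda z: S(z) * (1 - S(z))                     # S'(z) = S(z)(1 - S(z))

rng = np.random.default_rng(0)
x  = rng.standard_normal((8, 1))                     # a toy input instead of 1024 pixels
y  = np.zeros((4, 1)); y[1, 0] = 1.0                 # toy one-hot label
W1, b1 = rng.standard_normal((5, 8)), rng.standard_normal((5, 1))
W2, b2 = rng.standard_normal((4, 5)), rng.standard_normal((4, 1))

def loss(W1):
    a1 = S(W1 @ x + b1)
    y_hat = S(W2 @ a1 + b2)
    return 0.5 * np.square(y_hat - y).sum()          # sum-of-squared-errors cost

# analytic gradient via the back-propagation formulas above
z1 = W1 @ x + b1; a1 = S(z1)
z2 = W2 @ a1 + b2; y_hat = S(z2)
d2 = (y_hat - y) * dS(z2)                            # delta at the output layer
d1 = (W2.T @ d2) * dS(z1)                            # delta propagated one layer back
dW1 = d1 @ x.T                                       # dLoss/dW1 = delta1 * x^T

# numerical gradient of a single entry of W1
i, j, eps = 2, 3, 1e-6
Wp = W1.copy(); Wp[i, j] += eps
Wm = W1.copy(); Wm[i, j] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
print(dW1[i, j], numeric)                            # the two numbers should match closely
```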
About matrix derivation
If you just want to be able to work out the gradients of back propagation yourself, remembering the formulas above is enough. If you want to understand where they come from, you can refer to this article: Neural network derivation.
The derivative of a scalar $f$ with respect to a matrix $\boldsymbol{X}$ is defined as $\frac{\partial f}{\partial \boldsymbol{X}} = \left[\frac{\partial f}{\partial X_{ij}}\right]$.
Of course, you may not want to stop there. Some readers have never met matrix derivatives and get stuck deriving the back-propagation formulas. If you want to learn matrix differentiation properly, see the Matrix derivation references. There is also a simple trick: write out the equation for each element of the result, subscripts and all, like $y_i = \sum_j W_{ij}\, x_j$, and then write the partial derivative element by element, like $\frac{\partial y_i}{\partial W_{ij}} = x_j$. In other words, only the $i$-th row of $\mathbf{W}$, the entries $W_{ij}$ with $j$ running along the row, contributes to $y_i$, and that row contributes only to $y_i$.
Let $y_i = \sum_j W_{ij}\, x_j + b_i$ and treat every element $W_{ij}$ as a variable, so that $y_i$ is a multivariate function. Increasing $W_{ij}$ changes $y_i$ at the rate $\frac{\Delta y_i}{\Delta W_{ij}}$, and taking the limit gives $\frac{\partial y_i}{\partial W_{ij}} = x_j$. Because only the $i$-th row of $\boldsymbol{W}$ affects $y_i$, the gradient of $y_i$ with respect to $\boldsymbol{W}$ has a single nonzero row, and the transpose of that row is exactly $\boldsymbol{x}$ ($\boldsymbol{x}$ is treated as a constant when differentiating).
To make this concrete, suppose $i = 1$ and write $\boldsymbol{W}_{1,:}$ for the first row of $\boldsymbol{W}$:
$$\frac{\partial y_1}{\partial \boldsymbol{W}_{1,:}} = \begin{bmatrix} \dfrac{\partial y_1}{\partial W_{11}} & \dfrac{\partial y_1}{\partial W_{12}} & \cdots & \dfrac{\partial y_1}{\partial W_{1j}} \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & \cdots & x_j \end{bmatrix} = \boldsymbol{x}^{\mathrm{T}}$$
Of course, this argument is not a complete proof, but I do not think the rest is necessary (too much trouble). Note that this is the derivative of individual components with respect to the matrix: $\boldsymbol{y}$ has many components, but each is handled separately. Since $Loss$ is a function of $\boldsymbol{y}$, the chain rule gives:
$$\boldsymbol{y} = \boldsymbol{W}\boldsymbol{x}, \qquad Loss = C(\boldsymbol{y}), \qquad \frac{\partial Loss}{\partial \boldsymbol{W}} = \frac{\partial Loss}{\partial \boldsymbol{y}}\,\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{W}}$$
$\frac{\partial Loss}{\partial \boldsymbol{y}}$ is computed term by term, giving the vector $\boldsymbol{\delta}$; each $\frac{\partial y_i}{\partial W_{ij}}$ is then "moved into" the corresponding row, so that
$$\frac{\partial Loss}{\partial \boldsymbol{W}} = \frac{\partial Loss}{\partial \boldsymbol{y}}\,\boldsymbol{x}^{\mathrm{T}}$$
Complete expansion:
$$\frac{\partial Loss}{\partial \boldsymbol{W}} = \frac{\partial Loss}{\partial \boldsymbol{y}}\,\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{W}}
= \begin{bmatrix}
\frac{\partial Loss}{\partial y_1}\frac{\partial y_1}{\partial W_{11}} & \frac{\partial Loss}{\partial y_1}\frac{\partial y_1}{\partial W_{12}} & \cdots & \frac{\partial Loss}{\partial y_1}\frac{\partial y_1}{\partial W_{1j}} \\
\frac{\partial Loss}{\partial y_2}\frac{\partial y_2}{\partial W_{21}} & \frac{\partial Loss}{\partial y_2}\frac{\partial y_2}{\partial W_{22}} & \cdots & \frac{\partial Loss}{\partial y_2}\frac{\partial y_2}{\partial W_{2j}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial Loss}{\partial y_i}\frac{\partial y_i}{\partial W_{i1}} & \frac{\partial Loss}{\partial y_i}\frac{\partial y_i}{\partial W_{i2}} & \cdots & \frac{\partial Loss}{\partial y_i}\frac{\partial y_i}{\partial W_{ij}}
\end{bmatrix}
= \begin{bmatrix}
\frac{\partial Loss}{\partial y_1} \\
\frac{\partial Loss}{\partial y_2} \\
\vdots \\
\frac{\partial Loss}{\partial y_i}
\end{bmatrix}
\begin{bmatrix} x_1 & x_2 & \cdots & x_j \end{bmatrix}
= \frac{\partial Loss}{\partial \boldsymbol{y}}\,\boldsymbol{x}^{\mathrm{T}}$$
(the entry in row $k$, column $m$ is $\frac{\partial Loss}{\partial y_k}\,x_m$, which is exactly the outer product on the right.)
I cannot prove this rigorously, but the derivation above should convince you it is right. I still recommend reading the Matrix derivation references. Also note that everything above is the derivative of a scalar with respect to a matrix or a vector.
So now you can roughly see what the derivative looks like. I have forgotten the formal deduction, but I find it easier to reason about by starting from the gradient $\nabla$.
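The single-layer identity derived above, $\frac{\partial Loss}{\partial \boldsymbol{W}} = \frac{\partial Loss}{\partial \boldsymbol{y}}\,\boldsymbol{x}^{\mathrm{T}}$, can also be spot-checked numerically. This toy snippet is only an illustration (the quadratic loss and the target t are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
x = rng.standard_normal((4, 1))
t = rng.standard_normal((3, 1))                      # an arbitrary target

def loss(W):
    y = W @ x
    return 0.5 * np.square(y - t).sum()              # Loss = C(y)

dLdy = W @ x - t                                     # dLoss/dy, element by element
analytic = dLdy @ x.T                                # dLoss/dW = (dLoss/dy) x^T

# compare with a numerical derivative of one entry of W
i, j, eps = 1, 2, 1e-6
Wp = W.copy(); Wp[i, j] += eps
Wm = W.copy(); Wm[i, j] -= eps
print(analytic[i, j], (loss(Wp) - loss(Wm)) / (2 * eps))
```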
Program realization
The initialization here could be all ones or all zeros, but that works badly: convergence is very slow (why? I will look into it later). You can instead use NumPy's randn() and rand(): numpy.random.rand(d0, d1, …, dn) draws samples uniformly from [0, 1), while numpy.random.randn(d0, d1, …, dn) draws from the standard normal distribution. The two initializations differ somewhat in convergence speed and stability, which is discussed later.
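For reference, here is a quick illustration of how the two initializers mentioned above behave (not part of the network code):

```python
import numpy as np

u = np.random.rand(16, 1024)     # uniform samples in [0, 1)
n = np.random.randn(16, 1024)    # samples from the standard normal distribution
print(u.min(), u.max())           # always within [0, 1)
print(n.mean(), n.std())          # roughly 0 and 1
```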
Below is the core code of the neural network; see the appendix for the download address of the complete project. The variable names are matched to the symbols in the formulas above as closely as possible, which makes the code easier to follow. For example, y_hat is the network output $\hat{\boldsymbol{y}}$, dW1 is $\frac{\partial Loss}{\partial \mathbf{W}^{(1)}}$, alpha is the learning rate $\alpha$, SquareErrorSum is the cost function $Cost(\hat{\boldsymbol{y}}, \boldsymbol{y})$, and so on.
#!/usr/bin/python3
# coding:utf-8
# @Author: Lin Misaka
# @File: net.py
# @Data: 2020/11/30
# @IDE: PyCharm
import numpy as np
import matplotlib.pyplot as plt

# diff=True returns the derivative
def Sigmoid(x, diff=False):
    def sigmoid(x):  # the sigmoid function itself
        return 1 / (1 + np.exp(-x))
    def dsigmoid(x):  # derivative: S'(x) = S(x)(1 - S(x))
        f = sigmoid(x)
        return f * (1 - f)
    if diff:
        return dsigmoid(x)
    return sigmoid(x)

# diff=True returns the derivative with respect to y_hat
def SquareErrorSum(y_hat, y, diff=False):
    if diff:
        return y_hat - y
    return (np.square(y_hat - y) * 0.5).sum()

class Net():
    def __init__(self):
        # X: input column vector
        self.X = np.random.randn(1024, 1)
        self.W1 = np.random.randn(16, 1024)
        self.b1 = np.random.randn(16, 1)
        self.W2 = np.random.randn(16, 16)
        self.b2 = np.random.randn(16, 1)
        self.W3 = np.random.randn(10, 16)
        self.b3 = np.random.randn(10, 1)
        self.alpha = 0.01   # learning rate
        self.losslist = []  # recorded losses, for plotting

    def forward(self, X, y, activate):
        self.X = X
        self.z1 = np.dot(self.W1, self.X) + self.b1
        self.a1 = activate(self.z1)
        self.z2 = np.dot(self.W2, self.a1) + self.b2
        self.a2 = activate(self.z2)
        self.z3 = np.dot(self.W3, self.a2) + self.b3
        self.y_hat = activate(self.z3)
        Loss = SquareErrorSum(self.y_hat, y)
        return Loss, self.y_hat

    def backward(self, y, activate):
        # delta^(3) = dLoss/dy_hat * S'(z^(3)); delta^(l) = (W^(l+1)^T delta^(l+1)) * S'(z^(l))
        self.delta3 = activate(self.z3, True) * SquareErrorSum(self.y_hat, y, True)
        self.delta2 = activate(self.z2, True) * (np.dot(self.W3.T, self.delta3))
        self.delta1 = activate(self.z1, True) * (np.dot(self.W2.T, self.delta2))
        # dLoss/dW^(l) = delta^(l) a^(l-1)^T ; dLoss/db^(l) = delta^(l)
        dW3 = np.dot(self.delta3, self.a2.T)
        dW2 = np.dot(self.delta2, self.a1.T)
        dW1 = np.dot(self.delta1, self.X.T)
        db3 = self.delta3
        db2 = self.delta2
        db1 = self.delta1
        # update weights and biases
        self.W3 -= self.alpha * dW3
        self.W2 -= self.alpha * dW2
        self.W1 -= self.alpha * dW1
        self.b3 -= self.alpha * db3
        self.b2 -= self.alpha * db2
        self.b1 -= self.alpha * db1

    def setLearnrate(self, l):
        self.alpha = l

    def plotLosslist(self, loss, title):
        # minimal plotting helper; the complete project (see appendix) has a fancier version
        plt.plot(loss)
        plt.title(title)
        plt.xlabel("epoch")
        plt.ylabel("loss")
        plt.show()

    def train(self, trainMat, trainLabels, Epoch=5, batch=None):
        # batch is unused: the data set is small, so every epoch runs over all samples
        for epoch in range(Epoch):
            acc = 0.0
            acc_cnt = 0
            label = np.zeros((10, 1))  # reuse one 10x1 vector as the one-hot label, to save allocations
            for i in range(len(trainMat)):
                X = trainMat[i, :].reshape((1024, 1))  # build the input column vector
                labelidx = trainLabels[i]
                label[labelidx][0] = 1.0
                Loss, y_hat = self.forward(X, label, Sigmoid)  # forward propagation
                self.backward(label, Sigmoid)                  # back propagation
                label[labelidx][0] = 0.0                       # reset to the zero vector
                acc_cnt += int(trainLabels[i] == np.argmax(y_hat))
            acc = acc_cnt / len(trainMat)
            self.losslist.append(Loss)
            print("epoch:%d,loss:%02f,accuracy : %02f" % (epoch, Loss, acc))
        self.plotLosslist(self.losslist, "Loss")
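The listing above stops at the Net class; the complete project in the appendix contains the actual entry point. A minimal way to wire everything together might look like the sketch below. The paths, the Epoch value, and the test-set evaluation loop are my own assumptions, and handwritingData/img2vector from the data-loading section are assumed to be defined in (or imported into) the same file.

```python
if __name__ == "__main__":
    # Assumption: handwritingData from the data-loading section is available here
    trainMat, trainLabels = handwritingData("./trainingDigits")
    testMat, testLabels = handwritingData("./testDigits")

    net = Net()
    net.setLearnrate(0.01)
    net.train(trainMat, trainLabels, Epoch=100)      # the post trains for up to 1743 epochs

    # evaluate on the test set
    correct = 0
    for i in range(len(testMat)):
        X = testMat[i, :].reshape((1024, 1))
        label = np.zeros((10, 1)); label[testLabels[i]][0] = 1.0
        _, y_hat = net.forward(X, label, Sigmoid)
        correct += int(testLabels[i] == np.argmax(y_hat))
    print("test accuracy: %.4f" % (correct / len(testMat)))
```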
Running results
With a learning rate of 0.01, after 1743 passes over the full data set the accuracy reaches 98.7%. Some sampled records:
epoch:0,loss:0.483388,accuracy : 0.107032
epoch:1,loss:0.443391,accuracy : 0.217684
epoch:16,loss:0.415309,accuracy : 0.453981
epoch:88,loss:0.454633,accuracy : 0.834540
epoch:137,loss:0.491693,accuracy : 0.901758
epoch:810,loss:0.045566,accuracy : 0.977249
epoch:1420,loss:0.009614,accuracy : 0.985522
epoch:1743,loss:0.008058,accuracy : 0.987073
Loss curve (figure not reproduced here).
About weight initialization
Analysis of the convergence speed of different initialization methods, with the learning rate fixed at 0.01. The initialization is changed in Net.__init__; a sketch of the variants follows below.
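The original screenshots of the modified __init__ and the corresponding loss curves are not reproduced here. As a rough reconstruction of the kinds of variants being compared (my own guesses, not necessarily the exact code behind each figure), the experiment amounts to swapping the initializer used for the weights:

```python
import numpy as np

# Candidate initializations for, e.g., W1 (shape 16x1024); the same choice
# would be applied to W2, W3 and, optionally, the biases.
W1_randn = np.random.randn(16, 1024)   # standard normal (used in the code above)
W1_rand  = np.random.rand(16, 1024)    # uniform in [0, 1)
W1_zeros = np.zeros((16, 1024))        # all zeros
W1_ones  = np.ones((16, 1024))         # all ones
```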
Sometimes the loss temporarily increases; this appears to be random and depends on the initialization values and the learning rate. (The loss-curve figures from the original post are not reproduced here.)
Final result with the first initialization: epoch:99,loss:0.068638,accuracy : 0.903309
The second initialization performs very poorly:
epoch:99,loss:0.129023,accuracy : 0.307653
Because it performed poorly once, I was tempted to conclude that this initialization is simply bad, but that would be unscientific; after raising the learning rate to 0.1 the result improved:
epoch:99,loss:0.129075,accuracy : 0.307653
epoch:99,loss:0.003670,accuracy : 0.767322
The last initialization works very badly: the loss decreases almost linearly, but far too slowly:
epoch:0,loss:4.500000,accuracy : 0.103413
epoch:12,loss:4.500000,accuracy : 0.103413
epoch:13,loss:4.500000,accuracy : 0.103930
epoch:499,loss:4.500000,accuracy : 0.105481
Note: how the bias b is initialized has little effect on convergence; readers can try this for themselves.
Why do different initializations converge at different speeds?
To be continued ~
On the loss function
To be continued ~
About my feelings
More hidden neurons are not always better. I once trained a similar network with PyTorch, with the same input and output layers as this one; with 64 and 24 neurons in the two hidden layers it worked well, but when I increased the number of hidden neurons the results got worse, and the extra computation slowed each iteration down considerably.
From the experiments above it seems better to initialize the weights with samples from the standard normal distribution, and understanding why is exactly what makes this worth studying further. Also, how the bias b is initialized seems to have little effect on convergence; readers can try this themselves.
If you approach neural networks purely pragmatically, it may be enough to understand the algorithms of machine learning at a high level and learn to use a framework. But that only goes so far.
If you want to study these networks, to improve their accuracy, convergence speed, prediction speed, memory footprint, or complexity, and eventually to create new, general, composable structures, you need to understand them deeply rather than only touching the surface of a framework. And the learning process cannot be rushed: go step by step, and focus on practice, summarizing, and understanding.
On fooling neural networks and their vulnerability
Appendix
Article resources
Handwritten digit recognition fully connected neural network teaching demo: unzip digits.zip next to net.py; once the environment is installed it can be run.
git clone https://github.com/MisakaMikoto128/FCNNeuralNetworkDemoForHandwrittenDigits.git
Reference resources
[1] Neural network derivation
[2] Summary of common loss functions
[3] Matrix derivation (Part 1)
[4] Matrix derivation (Part 2)
[5] A beginner-friendly explanation of softmax
If any link is broken, let me know.
github:https://github.com/MisakaMikoto128
Personal website