2022 Tsinghua summer school notes L2_ 1 basic composition of neural network
2022-07-24 21:33:00 【The duck neck is gone】
2022 Tsinghua University large model cross Seminar
L2 Neural Network basics
1 The basic composition of neural network
1.1 Neuron
- A single neuron:

Take the dot product of the weight vector and the input vector to get a scalar, add the scalar bias b, and feed the result into a nonlinear activation function f to get the output.
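A minimal sketch of a single neuron, with sigmoid standing in for the activation f; the weights, input, and bias are illustrative values:

```python
import numpy as np

def neuron(x, w, b):
    """A single neuron: dot product of weight and input vectors,
    plus a scalar bias, passed through a nonlinear activation (sigmoid here)."""
    z = np.dot(w, x) + b             # scalar pre-activation
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation f

x = np.array([1.0, 2.0, 3.0])    # input vector
w = np.array([0.5, -0.2, 0.1])   # weight vector
b = 0.4                          # scalar bias
y = neuron(x, w, b)
print(y)
```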
1.2 Neural network
- Multiple neurons form a single-layer neural network:

With multiple neurons, the weights grow from a vector to a matrix (e.g. 3×3), and the bias b grows from a scalar to a vector (b1, b2, b3).
- Stack single-layer networks to get a multi-layer neural network:

Starting from the input, we can compute the result of each layer in turn: each layer's output is the previous layer's output passed through a linear transformation and an activation function.
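The layer-by-layer computation can be sketched as follows; the shapes, weights, and tanh activation are all illustrative assumptions:

```python
import numpy as np

def layer(x, W, b):
    """One layer: linear transformation followed by a nonlinear activation (tanh here)."""
    return np.tanh(W @ x + b)

# A 2-layer network on a 3-dimensional input; all values are illustrative.
x = np.array([1.0, 0.5, -1.0])
W1, b1 = np.ones((3, 3)) * 0.1, np.zeros(3)  # weights are now a 3x3 matrix, bias a vector
W2, b2 = np.ones((2, 3)) * 0.2, np.zeros(2)
h1 = layer(x, W1, b1)    # result of layer 1, computed from the input
h2 = layer(h1, W2, b2)   # result of layer 2, computed from layer 1's output
print(h2)
```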
1.3 Activation function
- Why use f? Why apply a nonlinear activation?

- As pictured, suppose the network contains only linear transformations. After two layers, we find that h2 can be obtained from the original input with just a single linear transformation.
- Therefore a single layer would have the same expressive power as many layers. To prevent the network from collapsing in this way, to increase its expressive power, and to fit more complex functions, we introduce nonlinearity.
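This collapse can be checked numerically: composing two purely linear layers is exactly one linear layer. A small sketch with random illustrative values:

```python
import numpy as np

# Without a nonlinearity, two linear layers collapse into one:
# W2 @ (W1 @ x + b1) + b2  ==  (W2 @ W1) @ x + (W2 @ b1 + b2)
rng = np.random.default_rng(0)
x = rng.standard_normal(3)
W1, b1 = rng.standard_normal((3, 3)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((3, 3)), rng.standard_normal(3)

h2_two_layers = W2 @ (W1 @ x + b1) + b2
h2_one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(h2_two_layers, h2_one_layer))  # True
```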
- Common nonlinear activation functions

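The original figure is not reproduced here; which functions it shows is not stated, so the following are typical examples of nonlinear activations, sketched minimally:

```python
import numpy as np

# Three widely used nonlinear activation functions.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # zeroes out negative values

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))
```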
1.4 Output layer
- The output layer is chosen according to the desired output form:
- Linear output
- Add a linear layer after the hidden layer and output its value directly.
- Suitable for regression problems.
- Sigmoid
- First compute a value with an ordinary linear layer, then apply the sigmoid activation to squash the output into the range 0-1.
- Suitable for binary classification problems.
- Softmax
- First apply a linear layer to the last hidden layer to get an output z, then substitute it into $y_{i}=\operatorname{softmax}(z)_{i}=\frac{\exp(z_{i})}{\sum_{j} \exp(z_{j})}$.
- Purpose: the exponential eliminates negative values of z, and the normalization makes the class outputs sum to 1, yielding a probability distribution over the classes.
- Typically used for multi-class classification problems.
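A minimal implementation of the softmax formula above; the max-subtraction is a standard numerical-stability trick, not part of the formula, and the input z is illustrative:

```python
import numpy as np

def softmax(z):
    """Softmax: exponentiate, then normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))  # subtracting the max avoids overflow; result is unchanged
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])
y = softmax(z)
print(y, y.sum())  # all entries positive, summing to 1
```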
2 Training
2.1 Training objectives:
- For regression problems: minimize the mean squared error.
- For classification problems: minimize the cross entropy.

If the correct answer is the first class, the cross entropy works out to 0.74; if the correct answer is the second class, it works out to 1.74; if the correct answer is the third class ……
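For a one-hot target, the cross entropy reduces to the negative log of the probability assigned to the correct class. A sketch with a hypothetical predicted distribution (not the values from the original figure):

```python
import numpy as np

def cross_entropy(p, correct_class):
    """Cross-entropy loss for a one-hot target: -log of the probability
    the model assigns to the correct class."""
    return -np.log(p[correct_class])

p = np.array([0.5, 0.3, 0.2])  # hypothetical predicted class probabilities
print(cross_entropy(p, 0))  # small loss: the correct class got high probability
print(cross_entropy(p, 2))  # larger loss: the correct class got low probability
```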
2.2 How to update
- The idea of gradient descent:

- We reduce the loss function a little at a time.
- At each step, compute the gradient of the loss function with respect to the parameters; this gives the direction in which the loss changes fastest. Since we want to minimize the loss, we step in the negative gradient direction.
- Gradient descent:
- For a single input (which can be viewed as a one-dimensional parameter), take the partial derivative.
- For n inputs, see the figure below; the result is the gradient matrix.
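The update rule can be sketched in one dimension with a hypothetical loss L(w) = (w − 3)², whose gradient is 2(w − 3); the learning rate and iteration count are illustrative:

```python
# Gradient descent on L(w) = (w - 3)**2.
def grad(w):
    return 2.0 * (w - 3.0)  # dL/dw

w, lr = 0.0, 0.1
for _ in range(100):
    w -= lr * grad(w)  # step against the gradient to reduce the loss
print(w)  # converges toward the minimizer w = 3
```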

- Tricks of gradient descent:
- Repeated differentiation via the chain rule
- The backpropagation algorithm
- Forward propagation follows the direction the edges point; the edges pass values along.

- To find the gradient of the final output with respect to an input value, we compute in the opposite direction.
- Taking one segment as an example, the calculation at a single node:

- Multiply the upstream gradient by the local gradient to get the downstream gradient; repeating this, we can keep propagating gradients further downstream.
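The single-node rule can be sketched as follows; the tanh node and all numbers are illustrative:

```python
import math

# Chain rule at a single node: for y = f(x) inside a larger graph,
# downstream gradient = upstream gradient * local gradient.
x = 0.5
y = math.tanh(x)               # forward pass through this node
upstream = 2.0                 # dL/dy, handed down from the node above
local = 1.0 - y * y            # dy/dx, the local gradient of tanh
downstream = upstream * local  # dL/dx, passed on to earlier nodes
print(downstream)
```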
3 Word representation: Word2Vec
3.1 Sliding window: a fixed-size window slides over the sentence
When the window reaches one end of the sentence, only the target word and the context on one side remain.
3.2 CBOW: predict the target from the context
Represent "never" and "late" as one-hot vectors, average the two vectors, project the averaged vector to the size of the vocabulary, and finally apply softmax to get a probability distribution.
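A minimal CBOW forward pass along these lines; the vocabulary size, embedding size, word indices, and random matrices are all illustrative assumptions:

```python
import numpy as np

V, D = 5, 3                   # vocabulary size, embedding size (made up)
W_in = np.random.rand(V, D)   # input embedding matrix
W_out = np.random.rand(D, V)  # output projection back to vocabulary size

never, late = np.eye(V)[0], np.eye(V)[1]  # one-hot vectors for the context words
context = (never + late) / 2.0            # average the context vectors
hidden = context @ W_in                   # project to the embedding
scores = hidden @ W_out                   # project to vocabulary size
probs = np.exp(scores - scores.max())
probs /= probs.sum()                      # softmax -> distribution over the vocabulary
print(probs)
```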
3.3 Skip-gram: predict the context from the target
- Because predicting several context words at once is too hard for the model, we decompose the task and predict them one at a time.
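A sketch of this decomposition, generating one (target, context word) training pair at a time; the toy sentence and window size are made up:

```python
# Skip-gram decomposes "predict the whole context" into one
# (target, context word) pair per context word.
sentence = ["it", "is", "never", "too", "late"]
window = 2
pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))
print(pairs[:4])
```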
3.4 Improvements:
3.4.1 Drawback: with a full softmax over a large vocabulary, running backpropagation and gradient descent becomes very slow.
3.4.2 Two ways to improve computational efficiency:
- Negative sampling
- Only a small subset of words is sampled, according to word frequency:
$P(w_{i})=\frac{f(w_{i})^{3/4}}{\sum_{j=1}^{V} f(w_{j})^{3/4}}$
- The exponent 3/4 is an empirical value, chosen to slightly raise the sampling frequency of low-frequency words.
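The sampling distribution above can be sketched as follows; the word frequencies are made up:

```python
import numpy as np

# Negative-sampling distribution: frequencies raised to the 3/4 power,
# then renormalized, which slightly boosts rarer words.
freq = np.array([100.0, 10.0, 1.0])  # hypothetical word frequencies
p_raw = freq / freq.sum()            # plain frequency distribution
p_neg = freq ** 0.75
p_neg /= p_neg.sum()                 # 3/4-power distribution
print(p_raw, p_neg)  # the rarest word's probability rises under the 3/4 power
```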
- Hierarchical softmax
3.4.3 Other tricks:
- Sub-sampling: balance common words against rare words
- Common words appear frequently but carry relatively little semantic information; rare words are the opposite.
- A word is discarded with probability $1-\sqrt{t / f(w)}$: the more frequently a word appears, the more likely it is to be removed.
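A sketch of the discard probability; the threshold t = 1e-5 follows common Word2Vec practice, and the frequencies are made up:

```python
import math

# Sub-sampling: a word with frequency f(w) is discarded with probability
# 1 - sqrt(t / f(w)); frequent words are dropped more often.
t = 1e-5  # threshold (a commonly used value; an assumption here)

def discard_prob(f):
    return max(0.0, 1.0 - math.sqrt(t / f))

print(discard_prob(1e-2))  # a common word: dropped most of the time
print(discard_prob(1e-5))  # a rare word at the threshold: never dropped
```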
- Soft sliding window
- Context words farther from the target should be given less weight.