Classic model – ResNet
2022-06-26 03:28:00 【On the right is my goddess】
Introduction
The advantage of a deep convolutional neural network lies in its many layers: each layer can capture different information, from low-level visual features up to high-level semantic features.
But is stacking so many layers always a good thing?
Obviously not. As the network gets deeper, gradients tend to explode or vanish.
The common remedies are careful weight initialization and adding BN (batch normalization) layers.
However, even when these tricks make the model converge, accuracy still degrades with depth. This is not overfitting, because both the training error and the test error increase, as shown in the figure below.
[Figure: training and test error of a shallower vs. a deeper plain network; both errors are higher for the deeper one]
A further thought: in theory, if a shallow network already performs well, a deeper network should perform no worse, because the extra layers could at least learn an identity mapping. In practice, however, plain SGD cannot find such a solution.
The paper therefore proposes a deep residual learning framework that guarantees performance does not degrade as depth increases; it amounts to explicitly constructing an identity mapping.
[Figure: a residual building block with a shortcut connection]
The core idea: suppose the desired output of the block is H(x). Instead of having the model learn H(x) directly, we make it learn F(x) = H(x) - x, so the final output is F(x) + x. We call F(x) the residual. Intuitively, the block does not learn how to produce H(x) from scratch; it learns the residual between what has already been computed and the target.
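To make this concrete, here is a minimal sketch of such a residual block in PyTorch (the framework is my choice; the article shows no code). It computes F(x) with two 3x3 convolutions and adds the input back, assuming input and output shapes match:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block: output = F(x) + x."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                               # the shortcut branch
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))            # F(x), the residual
        return self.relu(out + identity)           # H(x) = F(x) + x

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```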
The advantages: model complexity does not increase, and neither does the amount of computation, since the identity shortcut adds no parameters.
Experiments confirm that the plain version (without residual/shortcut connections) performs worse, and that with residual connections performance keeps improving as depth increases.
Deep Residual Learning
[Figure: the ResNet architecture table, comparing variants at different depths]
The figure above shows four versions of the ResNet architecture. Notice that the 50-layer and deeper versions are structured differently from the 18- and 34-layer ones. As the network deepens, we want to increase the number of channels, since more depth means the network can learn more; but to keep the parameter count under control, a bottleneck structure is used, in which 1x1 convolutions compress and then restore the channels.
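Below is a sketch of a bottleneck block under the same assumptions (PyTorch, illustrative channel numbers): a 1x1 convolution compresses the channels, the 3x3 convolution works in the compressed space, and a second 1x1 convolution restores four times as many channels:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck block: 1x1 compress -> 3x3 -> 1x1 restore."""

    expansion = 4  # output channels = mid_channels * 4

    def __init__(self, in_channels, mid_channels, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, mid_channels * self.expansion,
                               1, bias=False)
        self.bn3 = nn.BatchNorm2d(mid_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample  # projects the shortcut when shapes differ

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)
```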
The model uses BN layers and data augmentation to improve generalization, but not dropout, since it contains no fully connected layers (apart from the final classifier).
So how does the residual connection handle a mismatch between input and output shapes?
The first option is to pad the input with extra zeros so that the two shapes match;
The second option is to use a 1x1 convolution as a projection (see the sketch below).
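Continuing the PyTorch sketches above (and reusing the Bottleneck class), the second option looks like this; the channel counts and stride are illustrative:

```python
# Projection shortcut: a 1x1 convolution matches both the channel count
# and the spatial size of the main branch.
downsample = nn.Sequential(
    nn.Conv2d(64, 256, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(256),
)

block = Bottleneck(64, 64, stride=2, downsample=downsample)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)  # torch.Size([1, 256, 28, 28])
```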
Experiments
[Figure: training curves of the 18- and 34-layer networks, plain vs. residual]
This figure compares the 18- and 34-layer versions with and without residual connections. It shows that:
- Early in training, the training error is larger than the test error; this is an effect of data augmentation;
- Every sudden drop in the curves comes from a learning-rate drop. Nowadays the multiply-by-0.1 schedule is used less, because the timing is hard to get right: dropping too early leads to weak convergence later on (see the scheduler sketch after this list);
- With residual connections, convergence is faster and the final performance is better.
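As a side note to the learning-rate point above, here is a sketch of both schedules in PyTorch; the optimizer settings, milestones, and epoch count are illustrative, not taken from the article:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for the actual network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# Classic step schedule: multiply the learning rate by 0.1 at fixed epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60], gamma=0.1)

# An alternative that avoids hand-picking the drop epochs:
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)

for epoch in range(90):
    # ... one epoch of training goes here ...
    scheduler.step()
```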
Why does ResNet train fast?
Gradients vanish because, as the network deepens, the chain rule multiplies together many very small numbers, so gradient descent ends up subtracting a value close to 0. (Of course, near a local optimum the gradient can vanish even without a deep network.)
With ResNet, however, the gradient flowing through the shortcut is added on top of the gradient flowing through the deep path: the deep path's contribution may be small, but the shortcut's contribution stays relatively large, so mathematically the gradient does not vanish as easily.
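This argument can be written in one line. For a block with output y = x + F(x), the chain rule gives (my formalization of the reasoning above):

```latex
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( 1 + \frac{\partial F}{\partial x} \right)
```

The constant 1 carries the upstream gradient through the shortcut unchanged, so even when the ∂F/∂x term is tiny, the overall gradient is not multiplied down toward zero.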
The claim of "reduced model complexity" does not mean the network can no longer represent other functions; it means we can find a less complex model that fits the data. As the author argues, without residual connections the new layers could in theory learn an identity mapping (losing nothing), but in practice this cannot be done: nothing guides the network toward that solution, so it never reaches it. By adding the identity manually, it becomes easy to train what is effectively a simpler model to fit the data, which amounts to reducing the model's complexity. (An excerpt from here.)