
Must-read roundup! 10 papers you should read to understand image classification in the deep learning era

2022-06-25 20:38:00 SophiaCV

Preface

Catalog

Preface

1998: LeNet

2012: AlexNet

2014: VGG

2014: GoogLeNet

2015: Batch Normalization

2015: ResNet

2016: Xception

2017: MobileNet

2017: NASNet

2019: EfficientNet

Others

2014: SPPNet

2016: DenseNet

2017: SENet

2017: ShuffleNet

2018: Bag of Tricks

Conclusion

References


Computer vision is the discipline of converting images and videos into signals that machines can understand. With these signals, programmers can further control the behavior of a machine based on this high-level understanding. Among the many computer vision tasks, image classification is one of the most fundamental. Not only is it used in many real products, such as tagging in Google Photos and AI content moderation, it also opens the door to many more advanced visual tasks, such as object detection and video understanding. Because this field has changed so quickly since the breakthrough of deep learning, beginners often find it overwhelming to learn. Unlike typical software engineering subjects, there are not many books about image classification with DCNNs, and the best way to understand the field is to read the academic papers. But which papers should you read? Where do you start? In this article, I introduce 10 of the best papers for beginners to read. Through them, we can see how the field evolved and how researchers came up with new ideas based on previous results. Even if you have already worked in this area for a while, a large-scale recap like this can still be helpful.


1998: LeNet

Gradient-Based Learning Applied to Document Recognition

Figure from "Gradient-Based Learning Applied to Document Recognition"

LeNet was introduced in 1998 and laid the foundation for later image classification research using convolutional neural networks. Many classic CNN techniques, such as pooling layers, fully connected layers, padding, and activation layers, are used to extract features and classify the input. With a mean squared error loss function and 20 training epochs, the network reaches 99.05% accuracy on the MNIST test set. Even 20 years later, many state-of-the-art classification networks still broadly follow this pattern.
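For readers who prefer code, here is a minimal LeNet-5-style sketch in PyTorch (a modern re-rendering for illustration, not the original 1998 implementation; the tanh activations, average pooling, and 32x32 input follow the commonly cited configuration):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Minimal LeNet-5-style network: conv + pooling feature extractor, FC classifier."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),   # 32x32 -> 28x28
            nn.AvgPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),  # 14x14 -> 10x10
            nn.AvgPool2d(2),                             # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# A 32x32 grayscale input (MNIST padded from 28x28) produces 10 class scores.
logits = LeNet5()(torch.randn(1, 1, 32, 32))
print(logits.shape)  # torch.Size([1, 10])
```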

2012: AlexNet

ImageNet Classification with Deep Convolutional Neural Networks

Figure from "ImageNet Classification with Deep Convolutional Neural Networks"

Although LeNet achieved good results and showed the potential of CNNs, development in this area stagnated for a decade due to limited computing power and data. CNNs appeared to be able to solve only simple tasks such as digit recognition; for more complex features such as faces and objects, a HaarCascade or SIFT feature extractor paired with an SVM classifier was the preferred approach.

However, in the 2012 ImageNet Large Scale Visual Recognition Challenge, Alex Krizhevsky proposed a CNN-based solution and raised the top-5 accuracy on the ImageNet test set from 73.8% to 84.7%. Their approach inherits the multi-layer CNN idea from LeNet but greatly increases the size of the network. As the figure above shows, the input is now 224x224 instead of LeNet's 32x32, and many convolution kernels have 192 channels compared with LeNet's 6. Although the design changed little, the parameter count grew by hundreds of times, and so did the network's capacity to capture and represent complex features. For such large-scale training, Alex used two GTX 580 GPUs with 3GB of RAM each, which pioneered the trend of GPU training. The use of the ReLU nonlinearity also helped reduce the computational cost.

Besides bringing many more parameters to the network, AlexNet also used Dropout layers to address the overfitting that comes with a large network. Its local response normalization did not gain much popularity afterwards, but it inspired other important normalization techniques, such as BatchNorm, that tackle the gradient saturation problem. To sum up, AlexNet defined the de facto classification network framework for the next decade: a combination of convolution, ReLU activation, MaxPooling, and Dense layers.
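As a rough illustration of that Conv-ReLU-MaxPool-Dense pattern (a simplified sketch; the channel counts, pooling sizes, and single convolutional stage are placeholders rather than the exact AlexNet configuration):

```python
import torch.nn as nn

# One convolutional stage plus the Dense head, in the AlexNet spirit:
# Conv -> ReLU -> MaxPool for features, Dropout + Linear layers for classification.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.AdaptiveAvgPool2d((6, 6)),   # collapse the remaining spatial detail (simplification)
    nn.Flatten(),
    nn.Dropout(p=0.5),              # fight overfitting in the large Dense layers
    nn.Linear(64 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),
)
```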

2014: VGG

Very Deep Convolutional Networks for Large-Scale Image Recognition

Image from Quora: https://www.quora.com/What-is-the-VGG-neural-network

After the great success of using a CNN for visual recognition, the whole research community was blown away, and everyone started to investigate why this kind of neural network works so well. For example, in "Visualizing and Understanding Convolutional Networks", published in 2013, Matthew Zeiler discussed how a CNN captures features and visualized its intermediate representations. Suddenly, everyone started to realize that CNNs were the future of computer vision from 2014 onward. Among all the immediate followers, the VGG network from the Visual Geometry Group is the most eye-catching one. It reaches 93.2% top-5 accuracy and 76.3% top-1 accuracy on the ImageNet test set.

Following the design of AlexNet, the VGG network makes two major updates: 1) VGG is not only wider than AlexNet but also deeper. VGG-19 has 16 convolutional layers (19 weight layers in total), while AlexNet has only 5. 2) VGG shows that a few stacked small 3x3 convolution filters can replace AlexNet's single 7x7 or even 11x11 filters, achieving better performance at a lower computational cost, as the sketch below illustrates. Thanks to this elegant design, VGG also became the backbone of many pioneering networks for other computer vision tasks, such as FCN for semantic segmentation and Faster R-CNN for object detection.
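The sketch below (a minimal illustration, not the full VGG configuration) shows the key trade: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, and three cover a 7x7 field, with fewer parameters and an extra nonlinearity in between.

```python
import torch.nn as nn

def conv3x3_stack(in_ch: int, out_ch: int, depth: int) -> nn.Sequential:
    """Stack `depth` 3x3 convolutions; depth=2 matches a 5x5 receptive field, depth=3 a 7x7."""
    layers, ch = [], in_ch
    for _ in range(depth):
        layers += [nn.Conv2d(ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        ch = out_ch
    return nn.Sequential(*layers)

# Parameter count comparison for 256 -> 256 channels (ignoring biases):
# one 7x7 conv: 7*7*256*256 ~ 3.2M weights, three 3x3 convs: 3*(3*3*256*256) ~ 1.8M weights
block = conv3x3_stack(256, 256, depth=3)
```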

As networks go deeper, vanishing gradients from multi-layer back-propagation become a bigger problem. To work around it, VGG also discusses the importance of pre-training and weight initialization. This problem limits researchers from adding more layers; otherwise, the network becomes very hard to converge. Two years later, a better solution appeared.

2014: GoogLeNet

Going Deeper with Convolutions

Figure from "Going Deeper with Convolutions"

VGG has an elegant and easy-to-understand structure, but its performance was not the best among all the ImageNet 2014 finalists. GoogLeNet, also known as InceptionV1, won the final award. Like VGG, one of GoogLeNet's main contributions is pushing the limits of network depth with a 22-layer structure, which confirmed again that going deeper was indeed the right direction for improving accuracy.

Unlike VGG, GoogLeNet tries to tackle the computation and vanishing-gradient problems head-on, instead of proposing a workaround based on better pre-training schemes and weight initialization.


Bottleneck Inception module, from "Going Deeper with Convolutions"

First, it explores the idea of asymmetric network design using a module called Inception (see above). Ideally, the authors would like to use sparse convolutions or dense layers to improve feature efficiency, but modern hardware is not designed for that case. So they argue that sparsity at the level of network topology can still help fuse features while taking advantage of existing hardware capabilities.

Second, it borrows an idea from the paper "Network in Network" to address the high computational cost. A 1x1 convolution filter is introduced to reduce the feature dimensions before the heavy computation (such as a 5x5 convolution kernel). This structure is later called a "bottleneck" and is widely used in many subsequent networks, as sketched below. Similar to "Network in Network", GoogLeNet also uses an average pooling layer in place of the final fully connected layer to further reduce cost.
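A minimal sketch of one bottlenecked, multi-branch module in the Inception spirit (the channel counts are placeholders, not the paper's exact numbers): a cheap 1x1 convolution shrinks the channel dimension before the expensive 3x3 and 5x5 convolutions, and the parallel branches are concatenated at the end.

```python
import torch
import torch.nn as nn

class InceptionBranches(nn.Module):
    """Toy Inception-style module: 1x1 bottlenecks before the costly 3x3/5x5 convolutions."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                               # plain 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 32, 1), nn.ReLU(True),
                                nn.Conv2d(32, 64, 3, padding=1))        # 1x1 reduce -> 3x3
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(True),
                                nn.Conv2d(16, 32, 5, padding=2))        # 1x1 reduce -> 5x5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))                # pool -> 1x1 project

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

out = InceptionBranches(192)(torch.randn(1, 192, 28, 28))  # -> (1, 64+64+32+32, 28, 28)
```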

Third, to help gradients flow to deeper layers, GoogLeNet also applies supervision to some intermediate outputs, or auxiliary outputs. Because of its complexity, this design did not become very popular in image classification networks, but it is increasingly used in other areas of computer vision, such as the Hourglass network for pose estimation.

As a follow-up, the Google team wrote more papers for the Inception series. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" represents InceptionV2. "Rethinking the Inception Architecture for Computer Vision" from 2015 represents InceptionV3. And "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning" from 2016 represents InceptionV4. Each paper further improves upon the original Inception network and achieves better results.

2015: Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

The Inception network helped researchers reach superhuman accuracy on the ImageNet dataset. However, as a statistical learning method, a CNN is heavily constrained by the statistical nature of its specific training dataset. Therefore, to achieve higher accuracy, we usually need to pre-compute the mean and standard deviation of the whole dataset and use them to normalize the input first, so that most layer inputs in the network stay in a tight range, which translates into better activation responsiveness. This approximation is cumbersome and sometimes does not work at all for new network structures or new datasets, so deep learning models were still considered hard to train. To solve this problem, Sergey Ioffe and Christian Szegedy, the people behind GoogLeNet, decided to invent something smarter, called Batch Normalization.


Figure from "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift"

The idea of batch normalization is not hard: as long as we train for long enough, we can use the statistics of a series of mini-batches to approximate the statistics of the whole dataset. Moreover, instead of computing the statistics manually, we can introduce two learnable parameters, "scale" and "shift", and let the network learn how to normalize each layer by itself.

The figure above shows the process of computing batch-normalized values. We take the mean over the whole mini-batch and compute the variance, then use this mini-batch mean and variance to normalize the input. Finally, through the scale and shift parameters, the network learns to adjust the batch-normalized result to best fit the next layer, usually a ReLU. One caveat is that no mini-batch information is available during inference, so a workaround is to keep moving averages of the mean and variance during training and use these moving averages in the inference path. This small innovation was so influential that all later networks started using it immediately.
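A minimal sketch of the computation described above, written with plain tensor operations instead of nn.BatchNorm2d (gamma is the learnable scale, beta the learnable shift, and the running statistics stand in for the missing mini-batch at inference time):

```python
import torch

def batch_norm(x, gamma, beta, running_mean, running_var,
               training: bool, momentum: float = 0.1, eps: float = 1e-5):
    """Batch normalization over the channel dimension of an NCHW tensor."""
    if training:
        mean = x.mean(dim=(0, 2, 3))                 # mini-batch mean per channel
        var = x.var(dim=(0, 2, 3), unbiased=False)   # mini-batch variance per channel
        # Keep moving averages so inference can run without a mini-batch.
        running_mean.mul_(1 - momentum).add_(momentum * mean)
        running_var.mul_(1 - momentum).add_(momentum * var)
    else:
        mean, var = running_mean, running_var
    x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

x = torch.randn(8, 16, 14, 14)
gamma, beta = torch.ones(16), torch.zeros(16)
rm, rv = torch.zeros(16), torch.ones(16)
y = batch_norm(x, gamma, beta, rm, rv, training=True)
```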

2015: ResNet

Deep Residual Learning for Image Recognition

2015 may be the best year for computer vision in the past decade. We saw many great ideas appear, not only in image classification but also in all kinds of computer vision tasks such as object detection and semantic segmentation. 2015 belongs to a new network called ResNet, or residual network, proposed by a group of Chinese researchers from Microsoft Research Asia.


Figure from "Deep Residual Learning for Image Recognition"

As discussed for the VGG network, the biggest obstacle to going deeper is the vanishing gradient problem: when derivatives are back-propagated through deeper layers, they shrink and eventually reach a point that modern computer architectures cannot meaningfully represent. GoogLeNet tried to attack this with auxiliary supervision and an asymmetric Inception module, but it only alleviates the problem to a small extent. If we want to use 50 or even 100 layers, is there a better way to let gradients flow through the network? ResNet's answer is the residual module.

Residual module, from "Deep Residual Learning for Image Recognition"

ResNet adds an identity shortcut to the output, so each residual module can at the very least predict whatever the input is without getting lost. Even more importantly, instead of hoping that each layer fits the desired feature mapping directly, the residual module tries to learn the difference between the output and the input, which makes the task easier because the required information gain is smaller. Imagine you are studying math and, for every new problem, you are handed the solution to a similar problem; all you need to do is extend that solution and make it work. This is much easier than devising a brand-new solution for every problem you meet. Or, as Newton said, we can stand on the shoulders of giants, and the identity input is that giant for the residual module.
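A minimal sketch of a basic residual block (the two-layer "basic block" variant, with the bottleneck and downsampling details omitted): the layers predict a residual F(x), which is added back to the identity shortcut.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = ReLU(F(x) + x): the layers only learn the difference from the identity input."""
    def __init__(self, channels: int):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.residual(x) + x)   # identity shortcut carries x unchanged

y = BasicResidualBlock(64)(torch.randn(1, 64, 56, 56))  # same shape in, same shape out
```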

Besides identity mapping, ResNet also borrows the bottleneck and batch normalization from the Inception papers. Eventually, it successfully built a network with 152 convolutional layers and achieved 80.72% top-1 accuracy on ImageNet. The residual approach also became the default option for many later networks, such as Xception and DarkNet. Moreover, thanks to its simple and elegant design, it is still widely used in many production visual recognition systems today.

Following the hype around residual networks, more variants emerged. In "Identity Mappings in Deep Residual Networks", the original ResNet authors tried putting the activation before the residual module and got better results; this design is known as ResNetV2. Also, in the 2016 paper "Aggregated Residual Transformations for Deep Neural Networks", researchers proposed ResNeXt, which adds parallel branches to the residual module and aggregates the outputs of different transformations.

2016: Xception

Xception: Deep Learning with Depthwise Separable Convolutions

Figure from "Xception: Deep Learning with Depthwise Separable Convolutions"

With the release of ResNet, most of the low-hanging fruit in image classification seemed to have been taken. Researchers started to think about the internal mechanism behind a CNN's magic. Since cross-channel convolution usually introduces a large number of parameters, the Xception network chose to investigate this operation to get a full picture of its effect.

As its name suggests, Xception derives from the Inception network. In the Inception module, multiple branches of different transformations are aggregated to achieve topological sparsity. But why does this sparsity work? The author of Xception, who is also the author of the Keras framework, extended this idea to an extreme case in which one 3x3 convolution kernel corresponds to one output channel before the final concatenation. In that case, these parallel convolution kernels effectively form a new operation called a depthwise convolution.

Figure: depthwise convolution vs. depthwise separable convolution

As shown above, unlike a traditional convolution, which includes all channels in one computation, a depthwise convolution computes the convolution of each channel separately and then concatenates the outputs. This reduces feature exchange across channels, but it also removes many connections, resulting in a layer with far fewer parameters. However, this operation outputs the same number of channels as the input (or fewer, if two or more channels are merged). Therefore, once the channel outputs are merged, we need another regular 1x1 filter, or pointwise convolution, to increase or decrease the number of channels, just as a regular convolution would.
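A minimal sketch of a depthwise separable convolution in PyTorch: the `groups=in_ch` argument makes each 3x3 kernel see only its own channel, and the following 1x1 pointwise convolution mixes channels and sets the output width (the BatchNorm/ReLU placement follows common practice, not any single paper).

```python
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int) -> nn.Sequential:
    """3x3 depthwise conv (one filter per channel) followed by a 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),   # mixes channels, sets width
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

block = depthwise_separable(64, 128)
regular = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
params = lambda m: sum(p.numel() for p in m.parameters())
print(params(block), params(regular))  # roughly 8x fewer weights in the separable version
```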

This idea did not originally come from Xception, though. It was described in a paper called "Learning visual representations at scale" and used occasionally in InceptionV2. Xception goes a step further and replaces almost all convolutions with this new type. The experimental results were very good: it surpassed ResNet and InceptionV3 to become the new SOTA method for image classification, and it showed that the mapping of cross-channel correlations and spatial correlations in a CNN can be entirely decoupled. In addition, sharing the same merits as ResNet, Xception also has a simple and elegant design, so its ideas were used in many later works, such as MobileNet and DeepLabV3.

2017: MobileNet

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Xception achieved 79% top-1 accuracy and 94.5% top-5 accuracy on ImageNet, but these were improvements of only 0.8% and 0.4% over the previous SOTA, InceptionV3. The marginal gain of a new image classification network was becoming smaller and smaller, so researchers started to shift their attention to other areas. MobileNet pushed image classification forward significantly in resource-constrained environments.

MobileNet module, from "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications"

Similar to Xception, MobileNet uses the depthwise separable convolution module shown above and puts its emphasis on efficiency and fewer parameters.

Ratio of depthwise separable convolution to standard convolution, from "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications"

The numerator of the ratio above is the total cost (multiply-adds) of a depthwise separable convolution, and the denominator is the cost of a comparable regular convolution; the same ratio also holds for parameter counts, since the D_F squared factor cancels. Here D_K is the size of the convolution kernel, D_F is the size of the feature map, M is the number of input channels, and N is the number of output channels. Because the channel and spatial computations are separated, a multiplication turns into an addition, which is an order of magnitude smaller. As the ratio shows, the more output channels there are, the more computation this new convolution saves.
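Transcribed here from the MobileNet paper (using the same symbols as above), the ratio reads:

```latex
\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}
     {D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F}
  = \frac{1}{N} + \frac{1}{D_K^{2}}
```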

MobileNet's other contribution is the width and resolution multipliers. The MobileNet team wanted a standard way to shrink the model for mobile devices, and the most intuitive way is to reduce the number of input and output channels and the resolution of the input image. To control this behavior, a width multiplier alpha is applied to the channel counts, and a resolution multiplier rho is applied to the input resolution (which also affects the feature map size). The total computational cost can then be expressed by the following formula:

Formula from "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications"
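Reconstructed from the MobileNet paper (stated there as the computational cost, with the width multiplier alpha applied to the channel counts and the resolution multiplier rho to the feature map):

```latex
D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F
  + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F
```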

Although this change seems trivial in terms of innovation, it has great engineering value, because it is the first time researchers formulated a principled way to scale a network for different resource constraints. It also sums up the ultimate recipe for tuning a neural network: wider and higher-resolution inputs lead to better accuracy, while thinner and lower-resolution inputs lead to worse accuracy.

Later, in 2018 and 2019, the MobileNet team also released "MobileNetV2: Inverted Residuals and Linear Bottlenecks" and "Searching for MobileNetV3". MobileNetV2 uses an inverted residual bottleneck structure. MobileNetV3 uses neural architecture search to find the best architecture combination, which we cover next.

2017: NASNet

Learning Transferable Architectures for Scalable Image Recognition

Like image classification for resource-constrained environments, neural architecture search was another field that emerged around 2017. With ResNet, Inception, and Xception, it seemed we had reached the best network topologies that humans could understand and design, but what if there were a better, more complex combination far beyond human imagination? A 2016 paper, "Neural Architecture Search with Reinforcement Learning", proposed searching for the best combination within a predefined search space using reinforcement learning. As we know, reinforcement learning is a goal-oriented method that rewards a search agent for finding the best solution. Limited by computing power, however, that paper only discussed applications on the small-scale CIFAR dataset.

NASNet search space, from "Learning Transferable Architectures for Scalable Image Recognition"

To find the best structure for a large dataset like ImageNet, NASNet created a search space tailored to ImageNet. It designs a special search space so that the result of a search on CIFAR also works well on ImageNet. First, NASNet assumes that the hand-crafted modules commonly used in good networks, such as ResNet and Xception, are still useful during the search; so instead of searching over random connections and operations, it searches for combinations of these modules, which have already proven useful on ImageNet. Second, the actual search is still performed on the 32x32-resolution CIFAR dataset, so NASNet only searches for modules that are not affected by the input size. To make the second point work, NASNet predefines two types of module templates: Reduction and Normal.


Figure from "Learning Transferable Architectures for Scalable Image Recognition"

Although NASNet achieves better metrics than manually designed networks, it also has some drawbacks. The cost of searching for the best structure is very high; only big companies like Google and Facebook can afford it. Also, the final structure does not make much sense to humans, which makes it hard to maintain and improve in a production environment. Later, in 2018, "MnasNet: Platform-Aware Neural Architecture Search for Mobile" extended the NASNet idea further by limiting the search steps with a predefined chained-block structure. In addition, by defining a weighting factor, MnasNet provides a more systematic way to search for models under given resource constraints, rather than evaluating them based only on FLOPs.

2019: EfficientNet

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

In 2019, there seemed to be no more exciting ideas for supervised image classification with CNNs. A drastic change in network structure usually brought only a marginal gain in accuracy. Worse still, when the same network was applied to different datasets and tasks, the previously claimed tricks did not seem to transfer, which led to criticism that these improvements merely overfit the ImageNet dataset. On the other hand, there is one trick that never fails our expectations: using higher-resolution inputs, adding more channels to the convolution layers, and adding more layers. Although brute force, there seemed to be a principled way to scale a network on demand. MobileNetV1 proposed something like this in 2017, but the focus later shifted to better network design.


Figure from "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks"

After NASNet and MnasNet, researchers realized that, even with the help of a computer, architectural changes would not bring many more benefits, so they started to fall back on scaling the network. EfficientNet is built on this assumption. On one hand, it uses the optimal building block from MnasNet to ensure a solid foundation. On the other hand, it defines three parameters, alpha, beta, and rho, to control the depth, width, and resolution of the network. In this way, even without a large GPU pool to search for the best structure, engineers can rely on these principled parameters to tune the network to their different requirements. In the end, EfficientNet provides 8 variants with different widths, depths, and resolutions, achieving good performance for both small and large models. In other words, if you want high accuracy, use the 600x600, 66M-parameter EfficientNet-B7; if you want low latency and a smaller model, use the 224x224, 5.3M-parameter EfficientNet-B0. Problem solved.
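For concreteness, the compound scaling rule from the EfficientNet paper, where d, w, and r are the depth, width, and resolution multipliers (the paper writes the resolution coefficient as gamma rather than rho) and a single coefficient phi scales all three dimensions at once:

```latex
d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi},
\qquad \text{s.t.} \quad \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
\quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
```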

Others

If you have finished reading the 10 papers above, you should have a fairly good understanding of the history of image classification with CNNs. If you would like to keep learning in this area, here are some other interesting papers. They are well known in their respective sub-fields and have inspired many other researchers around the world.

2014: SPPNet

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

SPPNet borrows the idea of feature pyramids from traditional computer vision feature extraction. The pyramid forms a bag of features at different scales, so it can adapt to different input sizes and get rid of the fixed-size fully connected layer. This idea also inspired the ASPP module in DeepLab and FPN for object detection.
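A minimal sketch of the fixed-length pooling idea using adaptive pooling (the pyramid levels 4x4, 2x2, and 1x1 are illustrative choices, not necessarily the paper's exact configuration):

```python
import torch
import torch.nn as nn

def spatial_pyramid_pool(features: torch.Tensor, levels=(4, 2, 1)) -> torch.Tensor:
    """Pool a (N, C, H, W) feature map at several grid sizes and concatenate the results,
    producing a fixed-length vector regardless of the input H and W."""
    pooled = [nn.functional.adaptive_max_pool2d(features, level).flatten(1) for level in levels]
    return torch.cat(pooled, dim=1)   # length = C * (16 + 4 + 1) for the default levels

# Two different input resolutions yield identical output lengths:
print(spatial_pyramid_pool(torch.randn(1, 256, 13, 13)).shape)  # torch.Size([1, 5376])
print(spatial_pyramid_pool(torch.randn(1, 256, 20, 27)).shape)  # torch.Size([1, 5376])
```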

2016: DenseNet

Densely Connected Convolutional Networks

DenseNet from Cornell University extends the idea of ResNet further. It not only provides skip connections between layers, but also gives each layer skip connections from all of the previous layers.
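A minimal sketch of a dense block (the growth rate and layer count are placeholder values): each layer receives the concatenation of all previous feature maps rather than a single skip connection.

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Each 3x3 conv sees the concatenation of the input and every earlier layer's output."""
    def __init__(self, in_ch: int, growth: int = 32, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.BatchNorm2d(in_ch + i * growth), nn.ReLU(inplace=True),
                          nn.Conv2d(in_ch + i * growth, growth, 3, padding=1, bias=False))
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))   # skip from all previous layers
        return torch.cat(features, dim=1)

out = TinyDenseBlock(64)(torch.randn(1, 64, 28, 28))  # -> (1, 64 + 4*32, 28, 28)
```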

2017: SENet

Squeeze-and-Excitation Networks

The Xception network demonstrated that cross-channel correlations have little to do with spatial correlations. However, SENet, the champion of the last ImageNet competition, designed a "squeeze-and-excitation" block and tells a different story. The SE block first squeezes all channels into fewer channels with global pooling and a fully connected transformation, and then "excites" back to the original number of channels with another fully connected layer. In essence, the FC layers help the network learn attention over the input feature maps.
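A minimal sketch of a squeeze-and-excitation block (the reduction ratio of 16 follows the paper's default; everything else is a simplified rendering): global pooling squeezes each channel to a scalar, two FC layers produce per-channel attention weights, and the input is rescaled by them.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze (global average pool) -> excite (FC -> ReLU -> FC -> sigmoid) -> rescale channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))   # per-channel attention in [0, 1]
        return x * weights.view(n, c, 1, 1)     # reweight the input feature map

y = SEBlock(256)(torch.randn(1, 256, 14, 14))  # same shape, channel-wise attended
```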

2017: ShuffleNet

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

Building on top of MobileNetV2's inverted bottleneck module, ShuffleNet argues that the pointwise convolution in a depthwise separable convolution sacrifices accuracy in exchange for less computation. To make up for this, ShuffleNet adds an extra channel shuffle operation to ensure that the pointwise convolution is not always applied to the same "point". In ShuffleNetV2, this channel shuffling mechanism is further extended to the identity-mapping branch from ResNet, so part of the identity features is also used for shuffling.
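A minimal sketch of the channel shuffle operation (the group count below is a placeholder): reshaping to (groups, channels per group), transposing, and flattening back interleaves channels across groups, so the next grouped or pointwise convolution sees a mix of features.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups: (N, C, H, W) -> reshape -> transpose -> flatten."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap group and per-group dimensions
    return x.view(n, c, h, w)                  # flatten back to the original layout

x = torch.arange(8).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]
```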

2018: Bag of Tricks

Bag of Tricks for Image Classification with Convolutional Neural Networks

The "Bag of Tricks" paper focuses on common tricks used in the image classification area. It serves as a good reference when engineers need to improve benchmark performance. Interestingly, these tricks, such as mixup augmentation and the cosine learning rate schedule, can sometimes deliver larger improvements than a new network architecture.
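As a small illustration of one such trick, here is a cosine learning-rate schedule sketch (the base rate and epoch count are placeholders); PyTorch also ships an equivalent scheduler as torch.optim.lr_scheduler.CosineAnnealingLR.

```python
import math

def cosine_lr(epoch: int, total_epochs: int, base_lr: float = 0.1) -> float:
    """Decay the learning rate from base_lr to 0 along a half cosine curve."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

print([round(cosine_lr(e, 120), 4) for e in (0, 30, 60, 90, 120)])
# [0.1, 0.0854, 0.05, 0.0146, 0.0]
```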

Conclusion

With the release of EfficientNet, the ImageNet classification benchmark seems to be coming to an end. With existing deep learning approaches, there will never be a day when we reach 99.999% accuracy on ImageNet, unless another paradigm shift happens. Therefore, researchers are actively looking into novel areas such as self-supervised and semi-supervised learning for large-scale visual recognition. Meanwhile, with the existing methods, it has become a question for engineers and entrepreneurs to find real-world applications of this imperfect technology.

References

  • Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based Learning Applied to Document Recognition

  • Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks

  • Karen Simonyan, Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition

  • Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, Going Deeper with Convolutions

  • Sergey Ioffe, Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep Residual Learning for Image Recognition

  • François Chollet, Xception: Deep Learning with Depthwise Separable Convolutions

  • Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

  • Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le, Learning Transferable Architectures for Scalable Image Recognition

  • Mingxing Tan, Quoc V. Le, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

  • Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger, Densely Connected Convolutional Networks

  • Jie Hu, Li Shen, Samuel Albanie, Gang Sun, Enhua Wu, Squeeze-and-Excitation Networks

  • Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun, ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

  • Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, Mu Li, Bag of Tricks for Image Classification with Convolutional Neural Networks

  • https://towardsdatascience.com/10-papers-you-should-read-to-understand-image-classification-in-the-deep-learning-era-4b9d792f45a7


Copyright notice
This article was created by [SophiaCV]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202181342288401.html