
[Deep Learning Series] - A Visual Interpretation of Neural Networks

2022-06-25 20:38:00 SophiaCV

This is the third article in the deep learning series, originally published on the 【Computer Vision Alliance】 official account.

Deep learning series

【Deep Learning Series】 - Introduction to Deep Learning
【Deep Learning Series】 - A Visual Interpretation of Gradient Descent Algorithms (Momentum, AdaGrad, RMSProp, Adam)


Original article: https://medium.com/swlh/an-intuitive-visual-interpretability-for-convolutional-neural-networks-9630007c5857

Translation: the 【Computer Vision Alliance】 official account team


Recommended resources

Machine learning handwritten notes (https://github.com/Sophia-11/Machine-Learning-Notes)



The first convolutional neural network was the time-delay neural network (TDNN) proposed by Alexander Waibel in 1987 [5]. TDNN was a convolutional neural network applied to speech recognition: it took FFT-preprocessed speech signals as input, and its hidden layers consisted of two one-dimensional convolution kernels that extracted translation-invariant features in the frequency domain [6]. Before TDNN appeared, research on back-propagation (BP) had already achieved a breakthrough in artificial intelligence [7], so TDNN could be trained within the BP framework. In the original authors' comparative experiments, under the same conditions, TDNN outperformed the hidden Markov model (HMM), the mainstream speech recognition algorithm of the 1980s [6].

In 1988, Wei Zhang proposed the first two-dimensional convolutional neural network, the shift-invariant artificial neural network (SIANN), and applied it to the detection of medical images [1]. In 1989, Yann LeCun also built a convolutional neural network for computer vision problems [2], namely the original version of LeNet. LeNet contained two convolutional layers and two fully connected layers, with 60,000 learnable parameters in total, much larger than TDNN and SIANN, and its structure is very close to that of modern convolutional neural networks [4]. LeCun (1989) used stochastic gradient descent (SGD) after random initialization to learn the weights [2], a strategy that later deep learning research retained. In addition, LeCun (1989) used the word "convolution" for the first time when discussing the network structure [2], which is how the convolutional neural network got its name.

For a deep convolutional neural network, after many convolution and pooling operations, the last convolutional layer contains the richest spatial and semantic information. Each convolution unit in the network in effect acts as an object detector: it has the ability to localize objects, but the information it contains is hard for humans to understand and difficult to display visually.

In this article, we review the class activation map (CAM). CAM borrows an idea from the well-known paper "Network In Network" [9]: using global average pooling (GAP) in place of the fully connected layers.

The resulting CNN retains its powerful image processing and classification ability while also being able to localize the key parts of an image.

Convolutional layer (Convolution Layer)

A convolutional neural network (CNN) mainly extracts features layer by layer through its filters, going from local features to global ones, to perform image recognition and related tasks.

Suppose we need to process a single-channel grayscale image of 6x6 pixels. We represent it as a 2D matrix, as shown below:

[Image source: https://mc.ai/my-machine-learning-diary-day-68/]

The numbers in the picture are the pixel values at each location: the larger the pixel value, the brighter the color. The dividing line between the two shades in the middle of the image is the boundary we want to detect.

We can design a filter (also known as a kernel) to detect this boundary. The filter is then convolved with the input image to extract the edge information. The convolution operation on the image can be illustrated with the following animation:

[Animation source: https://mc.ai/my-machine-learning-diary-day-68/]

We slide this filter over the image. At each position it covers an area as large as itself; we multiply the corresponding elements and sum them up. After computing one region, we move to the next, and repeat until every area of the original image has been covered.

The output matrix is called a feature map. It is lighter in the middle and darker on both sides, reflecting the boundary in the middle of the original image.

[Image source: https://mc.ai/learning-to-perform-linear-filtering-using-natural-image-data/]
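To make the sliding-window computation concrete, here is a minimal NumPy sketch (an assumed implementation, not from the original article) reproducing the example above: a 6x6 image whose left half is bright (10) and right half is dark (0), convolved with a 3x3 vertical-edge filter.

```python
import numpy as np

# Assumed 6x6 grayscale image from the example: bright left half, dark right half.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# Assumed 3x3 vertical-edge filter (kernel).
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

def conv2d_valid(img, k):
    """Slide the kernel over the image, multiply element-wise, and sum ('valid' mode)."""
    kh, kw = k.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # 4x4 map: large values (30) only in the middle columns
```

The band of large values in the middle of the output is exactly the behavior described above: light in the middle, dark on both sides, marking the boundary.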

A convolutional layer consists mainly of two parts, the filters and the feature maps, and it is the first layer that data flows through in a CNN. During training, the CNN automatically adjusts the filter matrices; the more filters are learned, the more features are extracted.

The hyperparameters that typically need to be set are the number of filters, the filter size, and the stride.
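As an illustration only (the article is framework-agnostic; PyTorch is an assumed choice here), these three hyperparameters map directly onto a convolution layer's constructor:

```python
import torch.nn as nn

# Hypothetical settings for the three hyperparameters named above.
conv = nn.Conv2d(in_channels=1,   # single-channel grayscale input
                 out_channels=8,  # number of filters
                 kernel_size=3,   # filter size (3x3)
                 stride=1)        # step size
```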

Pooling layer (Pooling)

Pooling is also called spatial pooling or subsampling. Its main functions are to extract the dominant features of each region and to reduce the number of parameters, which helps prevent the model from overfitting.

The pooling layer has no parameters to learn. The hyperparameters to specify include the pooling type (common choices are max pooling and average pooling), the window size, and the stride. Usually we use max pooling with a (2, 2) window and a stride of 2, so after pooling the input's height and width are halved while the number of channels stays the same, as shown in the figure below:


Within each pooling window we take the maximum value, and by traversing the feature map in order we produce a new, smaller matrix. Similarly, we could take the average or the sum instead, but in general taking the maximum works relatively better.
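Here is a minimal NumPy sketch of 2x2 max pooling with stride 2 (the feature-map values are made up for illustration):

```python
import numpy as np

# Assumed 4x4 feature map; values are illustrative only.
feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [1, 8, 3, 4]], dtype=float)

def max_pool(fm, size=2, stride=2):
    """Take the maximum of each size x size window, moving by `stride`."""
    out_h = (fm.shape[0] - size) // stride + 1
    out_w = (fm.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fm[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

print(max_pool(feature_map))  # [[6. 4.] [8. 9.]] -- height and width halved
```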

After several rounds of convolution and pooling, we finally flatten the multidimensional data into a one-dimensional array and connect it to the fully connected layer.

[Animation source: https://gfycat.com/fr/smoggylittleflickertailsquirrel-machine-learning-neural-networks-mnist]

The fully connected layer's main job is to classify the image based on the features extracted by the convolutional and pooling layers.
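In PyTorch-style pseudocode (an assumed framework, with assumed shapes), the flatten-then-classify head looks like this:

```python
import torch.nn as nn

# A hypothetical classification head: flatten the pooled feature maps,
# then map them to class scores with a fully connected layer.
head = nn.Sequential(
    nn.Flatten(),              # e.g. (batch, 8, 7, 7) -> (batch, 392)
    nn.Linear(8 * 7 * 7, 10),  # 10 assumed output classes
)
```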

Fully convolutional networks such as GoogLeNet [10] avoid fully connected layers and use global average pooling (GAP) instead. This not only reduces the number of parameters and so helps avoid overfitting, but also produces feature maps that are associated with individual categories.

Global average pooling layer (Global Average Pooling)

For a long time, the fully connected head was the standard structure of CNN classification networks, usually ending with an activation function for classification. But fully connected layers carry a large number of parameters, which slows down training and makes overfitting easy.

In Network In Network [9], the concept of global average pooling was proposed as a replacement for the fully connected layer.

[Image source: http://www.programmersought.com/article/1768159517/]

The difference between global average pooling and local average pooling is the pooling window. Local average pooling averages over sub-regions of the feature map, while global average pooling averages over the entire feature map.

[Image source: https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/]

Using global average pooling instead of a fully connected layer greatly reduces the number of parameters.
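A minimal NumPy sketch of GAP, with an illustrative comparison of head sizes (all shapes and class counts are assumptions):

```python
import numpy as np

# Assumed stack of feature maps with shape (channels, height, width).
feature_maps = np.random.rand(4, 7, 7)

# GAP reduces each 7x7 feature map to a single number: its mean.
gap = feature_maps.mean(axis=(1, 2))  # shape: (4,)

# Head parameter counts for an assumed 10-class problem:
#   flatten + fully connected: 4 * 7 * 7 * 10 = 1960 weights
#   GAP + fully connected:     4 * 10         = 40 weights
print(gap.shape)  # (4,)
```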

Class activation map (Class Activation Map)

When global average pooling is used, the final convolutional layer is forced to generate the same number of feature maps as there are target categories. This gives each feature map a very clear meaning: it is a category confidence map [11].

[Image source: https://medium.com/@ahmdtaha/learning-deep-features-for-discriminative-localization-aa73e32e39b2]

As the diagram shows, after GAP we obtain the average value of each feature map of the last convolutional layer, and the output is their weighted sum. For each category c, each feature map k has a corresponding weight w_k^c.
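Concretely, in the notation of [11], with f_k(x, y) the activation of feature map k at location (x, y) and Z the number of spatial locations:

```latex
F_k = \frac{1}{Z} \sum_{x,y} f_k(x, y)        % GAP output of feature map k
S_c = \sum_k w_k^c \, F_k                     % score for class c
M_c(x, y) = \sum_k w_k^c \, f_k(x, y)         % class activation map for class c
```

So the class score S_c is just the spatial average of the class activation map M_c, which is why M_c directly highlights the image regions that drive the prediction.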

After training the CNN, we can produce a heat map to explain the classification result. For example, to explain the result for class c, we take all the weights corresponding to class c and compute the weighted sum of their corresponding feature maps. Because this result has the same size as the feature maps, we need to upsample it and overlay it on the original image, as shown below:

[Image source: https://medium.com/@ahmdtaha/learning-deep-features-for-discriminative-localization-aa73e32e39b2]

In this way, CAM shows, in the form of a heat map, which regions the model attends to when it predicts class c.
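A minimal sketch of the CAM computation (shapes, class count, and random values are placeholders; in practice the feature maps and weights come from a trained model):

```python
import numpy as np

num_maps, h, w = 8, 7, 7
feature_maps = np.random.rand(num_maps, h, w)  # f_k(x, y) from the last conv layer
fc_weights = np.random.rand(10, num_maps)      # w_k^c for 10 assumed classes

def class_activation_map(fmaps, weights, class_idx):
    """Weighted sum of feature maps: M_c(x, y) = sum_k w_k^c * f_k(x, y)."""
    cam = np.tensordot(weights[class_idx], fmaps, axes=1)  # shape: (h, w)
    cam -= cam.min()                   # shift so the minimum is 0
    return cam / (cam.max() + 1e-8)    # normalize to [0, 1] for display

cam = class_activation_map(feature_maps, fc_weights, class_idx=3)
# `cam` is h x w; upsample it to the input image size (e.g. with
# scipy.ndimage.zoom or cv2.resize) and overlay it as a heat map.
```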

[Image source: MultiCAM: Multi-class activation mapping for aircraft recognition in remote sensing images]

Conclusion

CAM already explains predictions quite well, but it has one drawback: it requires modifying the structure of the original model, which means the model has to be retrained. This greatly limits where it can be used. If a model is already deployed, or its training cost is very high, retraining is close to impossible.


References

  1. Zhang, W., 1988. Shift-invariant pattern recognition neural network and its optical architecture. In Proceedings of the Annual Conference of the Japan Society of Applied Physics.
  2. LeCun, Y. and Bengio, Y., 1995. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10).
  3. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W. and Jackel, L.D., 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), pp. 541–551.
  4. LeCun, Y., Kavukcuoglu, K. and Farabet, C., 2010. Convolutional networks and applications in vision. In ISCAS (Vol. 2010, pp. 253–256).
  5. Waibel, A., 1987. Phoneme recognition using time-delay neural networks. Meeting of the Institute of Electronics, Information and Communication Engineers (IEICE), Tokyo, Japan.
  6. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K. and Lang, K., 1989. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), pp. 328–339.
  7. Rumelhart, D.E., Hinton, G.E. and Williams, R.J., 1986. Learning representations by back-propagating errors. Nature, 323(6088), p. 533.
  8. LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), pp. 2278–2324.
  9. Lin, M., Chen, Q. and Yan, S., 2013. Network in network. arXiv preprint arXiv:1312.4400.
  10. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A., 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  11. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A., 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.