Youtu Tech Sharing | Research and Application of Weakly Supervised Object Localization at Tencent Youtu

2022-06-24 06:11:00 Youtu Laboratory

Computer vision gives AI its "eyes", and the emergence of deep learning has greatly increased what those eyes can compute: they can recognize and respond to the image features they see and extract the corresponding information. Object detection, an important part of image understanding, locates and classifies the objects in an image that contains multiple objects, determining their positions and sizes. It is one of the core problems in computer vision.

Fully supervised object detection relies on manual annotation, which is time-consuming, costly, and poorly suited to tasks that change or evolve. Weakly supervised learning is expected to solve these problems. noahpan, a senior researcher at Tencent Youtu Lab, gave a talk titled "Research and Application of Weakly Supervised Object Localization", sharing the lab's research progress, results, and related thoughts in this area.

01

From full supervision to weak supervision

Limitations of object localization

Weakly supervised object localization learns where the object is in an image using only image-level category labels. Compared with full supervision, it saves a great deal of annotation cost: annotating an image at the bounding-box level takes about 10 times as long as assigning an image-level classification label.

At present, weakly supervised object localization focuses on the setting where each image contains a single category. Other common solutions rely mainly on multiple-instance learning: they obtain region proposals offline, score or refine those proposals so that well-localized proposals receive high classification scores, and derive the final localization result from them.

Current object localization methods have two limitations:

First, local response: they can only locate the most discriminative local region of the object.

Second, loss of structure: there is no guarantee that the structure of the object, such as its edges and contour, is learned well.

Weakly supervised object localization is commonly evaluated on ImageNet and CUB-200-2011. Evaluation is done at two levels: bounding box and mask. At the bounding-box level, an instance counts as correctly localized only if two conditions hold: the IoU between the predicted box and the ground-truth (GT) box is greater than 0.5, and the classification is correct. The metric is the proportion of correct instances in the test or validation set. At the mask level, pixel-level IoU is considered, which measures localization accuracy more precisely.
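
The bounding-box criterion can be sketched in a few lines of Python. This is only an illustration: the box format `(x1, y1, x2, y2)` and the dictionary keys `pred_label`, `gt_label`, `pred_box`, and `gt_box` are assumptions made for the example, not part of any benchmark toolkit.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def localization_accuracy(samples, iou_threshold=0.5):
    """Fraction of samples with a correct class AND a box IoU above the threshold."""
    hits = 0
    for s in samples:
        class_ok = s["pred_label"] == s["gt_label"]
        box_ok = box_iou(s["pred_box"], s["gt_box"]) > iou_threshold
        hits += int(class_ok and box_ok)
    return hits / len(samples)
```

A sample with the right class but an IoU of 0.4, or with an IoU of 0.9 but the wrong class, both count as failures under this criterion.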

02

Five categories of weakly supervised object localization methods

The first category: image-level erasure

  • Image-level erasure. The two main works are HaS and CutMix. Regions of the input image are randomly erased while the network is still trained to classify the image correctly. This drives the network to activate a larger region. These methods are simple and direct.
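
A minimal sketch of grid-based random erasure, in the spirit of the image-level methods above, might look as follows. The function name `hide_patches`, the grid size, the hiding probability, and the fill value are illustrative assumptions and are not taken from the HaS or CutMix papers.

```python
import numpy as np


def hide_patches(image, grid=4, hide_prob=0.5, fill=0.0, rng=None):
    """Return a copy of `image` (H, W, C) with random grid cells replaced by `fill`."""
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    h, w = image.shape[:2]
    ys = np.linspace(0, h, grid + 1, dtype=int)  # row boundaries of the grid cells
    xs = np.linspace(0, w, grid + 1, dtype=int)  # column boundaries of the grid cells
    for i in range(grid):
        for j in range(grid):
            if rng.random() < hide_prob:  # hide this cell with probability hide_prob
                out[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = fill
    return out
```

During training the erased image is fed to the classifier with its original label, so the network cannot rely on a single discriminative patch.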

The second category: feature-level erasure

  • Feature-level erasure. Representative methods include ACoL, ADL, and MEIL. After the main classification branch produces an initial CAM, the strongly responding features are erased and the erased features are passed to a second classification branch. The two parallel branches are trained for classification at the same time, and the final localization result is obtained by fusing their CAMs.
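
The erase-then-fuse idea can be sketched as below. This is a simplified illustration, not the implementation of ACoL, ADL, or MEIL: the threshold value, the min-max normalization, and the use of an element-wise maximum for fusion are assumptions chosen for the example.

```python
import numpy as np


def erase_discriminative_features(features, cam, threshold=0.6):
    """Zero out spatial positions whose normalized CAM value exceeds `threshold`.

    features: (C, H, W) feature map; cam: (H, W) activation map of the target class.
    Returns the erased features and the (H, W) keep mask.
    """
    cam_norm = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    keep = (cam_norm <= threshold).astype(features.dtype)  # 0 where the response is strong
    return features * keep[None, :, :], keep


def fuse_cams(cam_a, cam_b):
    """Fuse the CAMs of the two branches with an element-wise maximum (one possible choice)."""
    return np.maximum(cam_a, cam_b)
```

The second branch only sees the features that survive the mask, so it has to find evidence outside the region the first branch already relies on.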

The third category: spatial constraints, which use spatial correlations to make the network activate a larger region

  • Spatial constraints. Representative methods include DANet, GCNet, and SLTNet. DANet regroups categories hierarchically to alleviate the local responses caused by different categories that look alike. It also increases the number of feature maps assigned to each class and constrains the similarity between them, which enlarges the overall category response region. GCNet approximates the object shape with one of three predefined shapes (rectangle, rotated rectangle, and ellipse) and then borrows an adversarial idea: the foreground region must be classified correctly while the background region must not be classifiable, which guides the network to learn an accurate object location. SLTNet has a motivation similar to DANet's: to alleviate the local-response problem caused by different categories with similar textures, it reduces the classification loss for such cases and thereby enlarges the response region of the object.

The fourth category: pixel-level correlation

  • Pixel-level correlation. Representative methods include SPG, I2C, and SPOL. They improve the completeness of the class activation map by computing pixel-level similarity in the features. Specifically, I2C uses stochastic consistency and global consistency to enlarge the response region of the object. SPOL fuses features from different layers of the network and uses the rich detail in shallow features to make the overall response more complete.

The fifth category: improvements to CAM

  • Improvements to CAM. The two main works are Rethinking CAM and Relevance CAM. Rethinking CAM sets a threshold at the GAP layer so that different channel features have comparable value ranges after aggregation and the corresponding class weights are similar, which alleviates the local-response problem caused by GAP. Relevance CAM uses Layer-wise Relevance Propagation (LRP) to compute the relevance of each layer to the target category, applies GAP to obtain the weight of each channel with respect to that category, and weights the channel features to produce a class activation map for any layer in the network. Relevance CAM also uses a limited LRP method that subtracts the relevance of non-target categories to obtain more accurate localization. Compared with CAM, its advantage is that it can visualize not only the last convolutional layer but also intermediate layers, yielding localization results at different layers.

03

Weakly supervised object localization based on target structure information:

methods and research results

Current object localization has two main problems: local response and the failure to preserve structural information. We believe that a trained model already contains fairly accurate localization information, and that methods need to be designed to extract it from the model. The key is to extract long-range feature similarity.

Tencent Youtu therefore proposed two solutions.

  • Scheme 1: on CNNs, use a method called high-order self-correlation to capture long-range feature similarity, addressing the fact that a CNN's local receptive field can only capture short-range feature similarity;
  • Scheme 2: based on the Transformer, use the global receptive field provided by the self-attention mechanism to extract long-range feature similarity.

Scheme 1: SPA (CVPR 2021)

Why does GAP cause the network to localize only local regions?

First, GAP cannot distinguish the background during feature aggregation, so it introduces background noise that harms classification. Second, the value range of each convolutional layer is currently unbounded, so the network can classify correctly with only an extremely high local response on the channel features of the corresponding category: even after GAP, the response for that class is still high enough for classification.
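
The second point can be checked numerically with a toy GAP classifier. The sketch below assumes a bias-free linear layer on top of GAP; the function names and the toy feature maps are made up for the illustration.

```python
import numpy as np


def class_activation_maps(features, fc_weights):
    """features: (K, H, W); fc_weights: (C, K). Returns CAMs of shape (C, H, W)."""
    return np.einsum("ck,khw->chw", fc_weights, features)


def gap_logits(features, fc_weights):
    """Logits of a GAP + linear classifier without bias."""
    pooled = features.mean(axis=(1, 2))  # (K,) global average pooling
    return fc_weights @ pooled           # (C,)


# One channel, one class: a single extreme peak versus a broad moderate response.
peaked = np.zeros((1, 4, 4))
peaked[0, 0, 0] = 16.0
broad = np.ones((1, 4, 4))
w = np.array([[1.0]])
print(gap_logits(peaked, w), gap_logits(broad, w))  # both logits equal 1.0
```

Because the logit is just the spatial mean of the CAM, one unbounded peak gives the classifier the same score as a response covering the whole object, so nothing in the classification loss pushes the network toward the complete region.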

We propose two remedies for this limitation.

First, constrain the value range of the features. With this constraint, if the network wants to classify correctly and obtain a larger pooled activation, it must activate more regions.

Second, use a relatively simple pseudo-label method based on variance. We compute, for each pixel, the variance of its responses across categories: if the variance is small, we treat the pixel as background, and if it is large, as foreground. This yields a simple pseudo mask.
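
A minimal sketch of the variance-based pseudo mask is shown below. The thresholding rule (a fixed fraction of the maximum variance) and the function name are assumptions for the example; the text above only states that low variance means background and high variance means foreground.

```python
import numpy as np


def variance_pseudo_mask(class_maps, ratio=0.5):
    """class_maps: (C, H, W) per-class responses. Returns a boolean (H, W) foreground mask."""
    var = class_maps.var(axis=0)     # variance across categories at each pixel
    return var > ratio * var.max()   # high variance -> foreground (threshold is an assumption)
```

The intuition is that a background pixel responds similarly to every category, while an object pixel responds strongly to a few categories and weakly to the rest.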

How can high-order similarity be extracted from the network?

The conventional way of measuring the correlation between two features, which we call first-order similarity, computes the distance between the two features directly. Because of the CNN's local receptive field, first-order similarity cannot measure long-range feature similarity accurately.

We therefore propose high-order self-correlation. Taking second-order similarity as an example, we look for a third feature point between the two feature points such that its similarity to each of the two distant points is high enough, and we use the product of its similarities to the two points as the distance between them. In the illustration, two feature vectors separated by an angle α+β have a first-order similarity of cos(α+β); with a bridge point between them at angles α and β, the second-order value is the product cos α · cos β, which under certain conditions is higher than cos(α+β). Because the bridge point is unknown, we traverse every point on the feature map other than the two points themselves as a candidate intermediate node and take the average as the second-order similarity.
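
The second-order computation described above can be sketched with a matrix product. This is a sketch under stated assumptions rather than the SPA implementation: cosine similarity is assumed as the first-order measure, the function names are hypothetical, and the diagonal is set to 1 by convention because a point has no distinct partner.

```python
import numpy as np


def first_order_similarity(features):
    """features: (N, D), one row per spatial position. Returns (N, N) cosine similarities."""
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    return unit @ unit.T


def second_order_similarity(sim):
    """Average of sim[i, k] * sim[k, j] over all bridge points k other than i and j.

    Requires at least 3 points. Diagonal entries are set to 1.0 (an assumption).
    """
    n = sim.shape[0]
    total = sim @ sim                   # sums over every k, including k = i and k = j
    diag = np.diag(sim)
    total = total - diag[:, None] * sim - sim * diag[None, :]  # drop the k = i and k = j terms
    out = total / (n - 2)
    np.fill_diagonal(out, 1.0)
    return out
```

For an `(C, H, W)` feature map, the rows would be obtained with something like `features.reshape(C, -1).T`, giving one `D = C` dimensional vector per spatial position.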

Based on high-order similarity, we first obtain the initial CAM and take its high-response region as the initial localization result. We compute the high-order similarity map for each pixel in the high-response region and average these maps to obtain the foreground similarity map. We then apply the same operation to the background region to obtain the corresponding background similarity map. Subtracting the background map from the foreground map gives the final localization result.
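
The foreground-minus-background step can be sketched as follows, given any pairwise similarity matrix over the spatial positions. The two thresholds used to pick foreground and background seeds from the CAM are illustrative assumptions, and the CAM is assumed not to be constant.

```python
import numpy as np


def localize_with_similarity(cam, sim, fg_threshold=0.7, bg_threshold=0.1):
    """cam: (H, W) initial activation map; sim: (H*W, H*W) pairwise similarity.

    Returns an (H, W) score map: positive values lean foreground, negative background.
    """
    h, w = cam.shape
    cam_norm = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    flat = cam_norm.reshape(-1)
    fg_seeds = flat >= fg_threshold         # high-response pixels of the initial CAM
    bg_seeds = flat <= bg_threshold         # low-response pixels, treated as background
    fg_map = sim[fg_seeds].mean(axis=0)     # average similarity to the foreground seeds
    bg_map = sim[bg_seeds].mean(axis=0)     # average similarity to the background seeds
    return (fg_map - bg_map).reshape(h, w)
```

Even when the initial CAM fires on a single discriminative pixel, every position that is similar to that seed receives a high foreground score, which is how the response spreads over the whole object.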

Scheme 2: TS-CAM (ICCV 2021)

Compared with CNNs, the Transformer has a global receptive field. Based on the analysis above, it is naturally suited to capturing the complete object response. However, its attention map carries no category information, so the response map for a given target category cannot be obtained directly.

To address this, we designed TS-CAM, a semantic-coupled attention map method. TS-CAM classifies each patch and obtains the image-level classification result through GAP, instead of classifying with a separate class token as Vision Transformer does. At test time, the per-patch classification results are rearranged into a map similar to a CAM in a CNN, which gives the response map for each category. This map is then multiplied by the global correlation extracted from the Transformer structure to obtain the category-aware activation map. Feature visualizations show that TS-CAM activates the object more completely.
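
The coupling step can be sketched as below. The sketch assumes that per-patch class scores and a class-agnostic attention vector over the patches are already available, and that patches are stored in row-major order; how the attention vector is extracted from the Transformer is not covered here, and the function and argument names are hypothetical.

```python
import numpy as np


def ts_cam(patch_logits, attention, grid_hw):
    """Couple per-patch class scores with a class-agnostic attention map.

    patch_logits: (N, C) class scores for each of the N patches.
    attention:    (N,) class-agnostic attention over the patches (assumed given).
    grid_hw:      (h, w) patch grid with h * w == N.
    """
    h, w = grid_hw
    n, c = patch_logits.shape
    image_logits = patch_logits.mean(axis=0)          # GAP over patches -> (C,)
    semantic_maps = patch_logits.T.reshape(c, h, w)   # one CAM-like map per category
    attention_map = attention.reshape(h, w)
    coupled = semantic_maps * attention_map[None]     # category-aware activation maps
    return image_logits, coupled
```

The semantic maps supply the category information that the attention map lacks, while the attention map supplies the long-range, object-level extent.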

04

Applications in image content moderation and other fields

Building on object localization, Youtu Lab has tried some simple applications.

First, completing annotations. In one setting, part of a dataset is annotated with both categories and bounding boxes while the rest has category labels only; weakly supervised localization produces bounding boxes for the category-only data, and semi-supervised training then improves the performance of the whole model. In another setting, only some instances in each image are annotated with categories and bounding boxes; weakly supervised object localization predicts the remaining objects to complete the annotations, which are then used to train the full detection network.

Second, image retrieval. For targets whose appearance varies greatly across viewpoints, matching generally requires local features with richer detail, and weakly supervised object localization can locate such local features well.

05

Reflections on weakly supervised object localization

Overall, the biggest challenge in weakly supervised object localization is how to resolve, or at least mitigate, the fundamental differences between classification and localization. To find a highly discriminative decision boundary, classification often yields only local responses, whereas localization needs to find the complete object region. We offer the following simple thoughts.

First, use different architectures, such as the Transformer and the MLP-based models that have recently attracted attention, and exploit their global receptive field to activate more regions.

Second, pre-training, with the aim of introducing prior knowledge. One can try large-scale pre-training and also introduce prior knowledge already learned about the target to improve localization.

Third, reconsider the relationship between features and classifiers. The main question is how to design an objective function compatible with both localization and classification, or how to improve GAP so that the structure of the object is preserved as much as possible during feature aggregation.

Fourth, relax the constraints. Weakly supervised object localization has inherent limitations: a CVPR 2020 paper argued that the task is itself an ill-posed problem that cannot be solved as posed. Can the conditions be relaxed? Tencent Youtu is currently exploring this, which we consider a valuable, meaningful, and promising direction.

Copyright notice
This article was created by [Youtu Laboratory]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2021/07/20210726162028706x.html