This paper proposes CAP, a fine-grained classification solution that uses a context-aware attention mechanism to help the model pick up subtle feature changes. Besides pixel-level attention, it also introduces region-level attention and a local feature encoding method, which sets it apart from previous visual schemes and makes it well worth a read.
Source: Xiaofei's Algorithm Engineering Notes official account
Paper: Context-aware Attentional Pooling (CAP) for Fine-grained Visual Classification

- Paper address: https://arxiv.org/abs/2101.06635
- Paper code: https://github.com/ArdhenduBehera/cap
Introduction
The paper observes that most strong fine-grained image recognition methods aid recognition by exploring local features of the target, yet these local parts are not annotated, so they have to be located with weakly supervised or unsupervised methods. Moreover, most methods rely on pre-trained detectors and fail to capture the relationship between the target and its local features. To describe the image content well, information from pixels to objects to the whole scene must be considered more carefully: the method should not only locate the parts and the target, but also describe their rich, complementary characteristics from multiple dimensions, so as to obtain a complete picture of the image and the target.
Starting from how a convolutional network should describe the target, the paper proposes the context-aware attentional pooling (CAP) module, which efficiently encodes the location and appearance information of local parts. The module takes the output features of a convolutional network as input and learns to adjust the importance of different regions in the feature map, yielding rich appearance and spatial characteristics of local regions for accurate classification.
The main contributions of this paper are as follows:
- Proposes CAP, a plug-in module for fine-grained image recognition that can be simply attached to various convolutional networks and brings a considerable improvement in fine-grained classification performance.
- Proposes context-aware attention features guided by regional features to capture subtle differences between targets/scenes.
- Proposes a learnable pooling operation that automatically selects the hidden states of the recurrent network to form spatial and appearance features.
- Evaluates the proposed algorithm on 8 fine-grained datasets and obtains SOTA results.
- Analyzes different backbone networks, extending the application scope of the CAP module.
Proposed Approach

The overall pipeline of the algorithm is shown in the figure above: it takes an image as input and outputs the specific subordinate category, and consists of 3 components (3 groups of parameters), whose composition is sketched after the list:
- Base CNN \(\mathcal{F}_b(\cdot;\theta_b)\)
- CAP module \(\mathcal{F}_c(\cdot;\theta_c)\)
- Classification module \(\mathcal{F}_d(\cdot;\theta_d)\)
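As a rough illustration of this composition (the module names below are placeholders, not the authors' code), the data flow is simply:

```python
# Hypothetical stand-ins for F_b, F_c, F_d; only the composition shown here is taken from the paper.
def predict(image, base_cnn, cap_module, classifier):
    x = base_cnn(image)          # F_b(.; theta_b): convolutional feature map, e.g. (B, C, H, W)
    h = cap_module(x)            # F_c(.; theta_c): context-aware attentional pooling
    logits = classifier(h)       # F_d(.; theta_d): learnable pooling + classification head
    return logits
```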
Context-aware attentional pooling (CAP)

The output feature of the convolutional network is defined as \(x=\mathcal{F}_b(I_n;\theta_b)\). The CAP module takes into account the contextual information of pixel-level features, small-region features, large-region features, and image-level features.
Pixel-level contextual information

Pixel-level contextual information mainly learns the pairwise correlation between pixels \(p(x_i|x_j;\theta_p)\): when computing the output at position \(j\), all other pixel features are aggregated according to their correlation. This is implemented directly with self-attention, using \(1\times 1\) convolutions for the feature transforms. This step operates directly on the output features of the backbone network, but it is not shown in the overall flow chart.
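A minimal PyTorch-style sketch of such pixel-level self-attention, assuming standard query/key/value transforms implemented as \(1\times 1\) convolutions (layer sizes and scaling are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class PixelSelfAttention(nn.Module):
    """Aggregate each pixel feature from all other pixels, weighted by learned correlation."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)  # 1x1 conv feature transforms
        self.key   = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                    # x: (B, C, H, W) backbone features
        b, c, h, w = x.shape
        q = self.query(x).flatten(2)         # (B, C, HW)
        k = self.key(x).flatten(2)           # (B, C, HW)
        v = self.value(x).flatten(2)         # (B, C, HW)
        attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)   # (B, HW, HW) pixel correlations
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)                # correlation-weighted aggregation
        return out
```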
Proposing integral regions
To learn contextual information more efficiently, the paper defines basic regions of different granularity levels on the feature map \(o\), where the granularity level is determined by the region size. Taking the smallest region \(r(i,j,\Delta_x,\Delta_y)\) at location \((i,j)\) as an example, a series of regions can be derived by enlarging its width and height: \(R=\{r(i,j,m\Delta_x,n\Delta_y)\}\), with \(i < i + m \Delta_x \le W\) and \(j < j + n \Delta_y \le H\). Generating similar region sets \(R\) at different locations yields the final region set \(\mathcal{R}=\{R\}\). \(\mathcal{R}\) covers regions of different aspect ratios at all locations, providing comprehensive contextual information and subtle features at different levels of the image.
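A minimal sketch of this enumeration, assuming unit step sizes \(\Delta_x=\Delta_y=1\) and no pruning of the region set (both are simplifications):

```python
def integral_regions(W, H, dx=1, dy=1):
    """Enumerate regions r(i, j, m*dx, n*dy) anchored at every (i, j) that still fit in the W x H feature map."""
    regions = []
    for i in range(W):
        for j in range(H):
            m = 1
            while i + m * dx <= W:
                n = 1
                while j + n * dy <= H:
                    regions.append((i, j, m * dx, n * dy))   # (x, y, width, height)
                    n += 1
                m += 1
    return regions

# A 7x7 feature map with unit steps yields (7*8/2)**2 = 784 regions of varying aspect ratios.
print(len(integral_regions(7, 7)))
```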
Bilinear pooling
Following the previous step, we obtain \(|\mathcal{R}|\) regions on the feature map, ranging in size from the smallest \(\Delta_x\times\Delta_y\times C\) to the largest \(W\times H\times C\). The goal is to represent regions of different sizes as fixed-size features, mainly via bilinear interpolation. Define \(T_{\psi}(y)\) as a coordinate transformation function, where \(y=(i,j)\) is a region coordinate with corresponding feature value \(R(y)\in \mathbb{R}^C\); the value of the transformed feature \(\tilde{R}\) at coordinate \(\tilde{y}\) is then computed by kernel-weighted sampling:
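A sketch of this in the standard bilinear-sampling form (assumed here to match the notation above; see the paper for the exact equation):

\[
\tilde{R}(\tilde{y}) = \sum_{y} R\big(T_{\psi}(y)\big)\, K\big(\tilde{y},\, T_{\psi}(y)\big)
\]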

\(R(T_{\psi}(y))\) is the sampling function and \(K(\cdot)\) is the kernel function. The most basic scheme is adopted here: map each target coordinate back to the original feature map, take the four nearest points, and weight them by distance. This finally yields the pooled fixed-size feature \(\bar{f}\) of shape \(w\times h\times C\).
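A compact PyTorch-style sketch of this pooling, using the library's built-in bilinear resize (`F.interpolate`) instead of hand-written four-point interpolation (tensor layout is assumed, not from the paper):

```python
import torch
import torch.nn.functional as F

def pool_regions(x, regions, out_w=7, out_h=7):
    """x: (C, W, H) feature map; regions: list of (i, j, width, height) from the enumeration above.
    Every region crop is resized to a fixed (out_w, out_h) grid with bilinear interpolation."""
    pooled = []
    for i, j, w, h in regions:
        crop = x[:, i:i + w, j:j + h].unsqueeze(0)            # (1, C, w, h)
        pooled.append(F.interpolate(crop, size=(out_w, out_h),
                                    mode='bilinear', align_corners=False))
    return torch.cat(pooled)                                  # (|R|, C, out_w, out_h)
```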
Context-aware attention

Here the paper uses a new attention mechanism to obtain contextual information: the output is weighted by the similarity between \(\bar{f}_r\) and the other features \(\bar{f}_{r^{'}}\) \((r, r^{'}\in \mathcal{R})\), which lets the model selectively focus on more relevant regions and thus produce more comprehensive contextual information. Given a query item \(q(\bar{f}_r)\) and a set of key items \(k(\bar{f}_{r^{'}})\), the output context vector \(c_r\) is:

The parameter matrices \(W_{\beta}\) and \(W_{\beta^{'}}\) transform the input features into the query and key items, \(W_{\alpha}\) performs a nonlinear combination, and \(b_{\alpha}\) and \(b_{\beta}\) are bias terms. The overall learnable parameters are \(\{W_{\beta},W_{\beta^{'}},W_{\alpha},b_{\alpha},b_{\beta}\}\in\theta_c\), and the attention item \(\alpha_{r,r^{'}}\) represents the similarity between the two features. In this way, the context vector \(c_r\) represents the contextual information implied by region \(\bar{f}_r\), obtained according to its relevance to all other regions; the overall computation is essentially similar to self-attention.
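As an illustration only, one plausible additive-attention formulation matching this description (the paper's exact equations may differ) is sketched below; region features are flattened to vectors first:

```python
import torch

def context_vectors(f_bar, W_beta, W_beta2, W_alpha, b_beta, b_alpha):
    """f_bar: (R, D) flattened pooled region features. Returns (R, D) context vectors c_r."""
    q = f_bar @ W_beta                       # query items  q(f_r),  (R, D')
    k = f_bar @ W_beta2                      # key items    k(f_r'), (R, D')
    # nonlinear combination of every (query, key) pair, projected to a scalar similarity score
    scores = torch.tanh(q.unsqueeze(1) + k.unsqueeze(0) + b_beta) @ W_alpha + b_alpha   # (R, R)
    alpha = torch.softmax(scores, dim=1)     # attention alpha[r, r'] over the other regions
    return alpha @ f_bar                     # c_r = sum_r' alpha[r, r'] * f_bar[r']
```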
Spatial structure encoding

The context vectors \(c=\{c_r|r=1,\cdots,|\mathcal{R}|\}\) describe the importance and characteristics of the regions. To further add structural information about the spatial arrangement, the paper converts the regional context vectors \(c\) into a region sequence (top to bottom, left to right) and feeds it into a recurrent neural network, using the hidden units \(h_r\in\mathbb{R}^n\) of the recurrent network to express the structural features.
The intermediate feature of region \(r\) can be expressed as \(h_r=\mathcal{F}_h(h_{r-1},f_r;\theta_h)\), where \(\mathcal{F}_h\) is an LSTM and \(\theta_h\in\theta_c\) contains the LSTM parameters. To improve generalization and reduce computation, the contextual feature \(f_r\in\mathbb{R}^{1\times C}\) is obtained by global average pooling of \(c_r\in\mathbb{R}^{w\times h\times C}\). Finally, the context feature sequence \(f=(f_1,f_2,\cdots,f_r,\cdots,f_{|\mathcal{R}|})\) yields the corresponding hidden state sequence \(h=(h_1,h_2,\cdots,h_r,\cdots,h_{|\mathcal{R}|})\), which is used later in the classification module.
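A minimal PyTorch-style sketch of this step, assuming a single-layer `nn.LSTM` and a batch of one (sizes are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

def encode_spatial_structure(c, lstm):
    """c: (|R|, C, w, h) region context vectors, ordered top-to-bottom, left-to-right.
    Global-average-pool each c_r to f_r, then run an LSTM over the region sequence."""
    f = c.mean(dim=(2, 3))              # (|R|, C): f_r = GAP(c_r)
    h_seq, _ = lstm(f.unsqueeze(0))     # (1, |R|, n): hidden states h_1 .. h_|R|
    return h_seq.squeeze(0)             # (|R|, n), consumed by the classification module

lstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True)   # hypothetical sizes
h = encode_spatial_structure(torch.randn(49, 256, 7, 7), lstm)
```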
Classification

To further guide the model to distinguish subtle changes, the paper proposes a learnable pooling operation that integrates feature information by grouping hidden states \(h_r\) with similar responses. Borrowing the idea of NetVLAD, a differentiable clustering method is used to transform the hidden-state responses: first compute the relevance of a hidden state to cluster \(k\), then add it, weighted by that relevance, to cluster \(k\)'s VLAD encoding:

Each cluster has its own learnable parameters \(W_i\) and \(b_i\). The whole idea is based on softmax: the hidden-state responses are assigned to different clusters with softmax weights. After obtaining the encoding vectors of all clusters, they are normalized with the learnable weights \(W_N\) and a softmax. The learnable parameters of the classification module \(\mathcal{F}_d\) are therefore \(\theta_d=\{W_i, b_i, W_N\}\).
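A rough sketch of this soft-assignment pooling and classification (simplified: weighted sums of hidden states rather than the full VLAD residual encoding, keeping only the parameter set \(\{W_i, b_i, W_N\}\) described above):

```python
import torch

def classify_hidden_states(h, W, b, W_N):
    """h: (R, n) hidden states; W: (K, n) and b: (K,) per-cluster parameters; W_N: (K*n, num_classes)."""
    assign = torch.softmax(h @ W.t() + b, dim=1)    # (R, K): softmax relevance of each h_r to each cluster
    encoding = assign.t() @ h                       # (K, n): cluster-wise weighted aggregation of h_r
    logits = encoding.flatten() @ W_N               # concatenate cluster encodings, project to classes
    return torch.softmax(logits, dim=0)             # normalized class scores
```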
Experiments and Discussion

Comparison of different methods on different datasets.

Comparison of accuracy with different backbone networks.

Visualization of the output features of different modules; figure (b) shows the features output by the backbone network after adding CAP.
Conclusion
This paper proposes CAP, a fine-grained classification solution that uses a context-aware attention mechanism to help the model find the subtle feature changes of the target. Besides the pixel-level attention mechanism, it also includes region-level attention and a local feature encoding method, which is quite different from previous visual schemes and well worth a look.
If this article helps you, please give it a like or a "Looking" ~
More content at the WeChat official account [Xiaofei's Algorithm Engineering Notes].
