Reproducing DETR and DAB-DETR with EasyCV: the right way to use object queries
2022-07-25 18:59:00 【51CTO】
DETR is a recent object detection framework and the first truly end-to-end detection algorithm. It removes hand-designed steps such as the RPN, anchors, and NMS: an image goes in and detection boxes come out. DETR's success is mainly due to the strong modeling capacity of the Transformer, together with Hungarian matching, which solves the problem of learning a one-to-one assignment between predicted boxes and ground-truth boxes.
Although DETR reaches accuracy comparable to Mask R-CNN, it has been criticized for needing 500 epochs of training, converging slowly, and performing poorly on small objects. A series of follow-up works addresses these problems. One of the most notable is Deformable DETR, which has become a must-know in detection today. Its contribution is not only extending deformable convolution to the Transformer. More importantly, it provides many useful techniques for training DETR-style detectors: a two-stage design modeled on Mask R-CNN, splitting the query embedding into content and reference points, extending DETR to multi-scale training, and predicting boxes with "look forward once". After Deformable DETR, the community seemed to have found the right way to use the DETR framework. In particular, the questions of what the object query actually means and how to use it better for detection produced a lot of valuable work, such as Anchor DETR and Conditional DETR. Among these, DAB-DETR is especially thorough. DAB-DETR treats the object query as two parts, content and reference points, where the reference points are represented explicitly as a four-dimensional xywh vector. The decoder predicts xywh residuals to iteratively refine the detection boxes, and the xywh vector is used to introduce positional attention, which helps DETR converge faster. This article reproduces DETR and DAB-DETR with EasyCV and explains in detail how to use object queries correctly to improve the performance of DETR-style detectors.
DETR
DETR uses a set loss function as the supervision signal for end-to-end training and predicts all objects at once. The set loss uses a bipartite matching algorithm to match the predicted (pred) objects to the ground-truth (gt) objects. Treating object detection directly as a set prediction problem simplifies the training pipeline and avoids hand-designed components such as anchors and NMS.
DETR makes two main contributions: the architecture and the set prediction loss.
1.Architecture

DETR first uses a CNN to embed the input image into a two-dimensional representation, then flattens it into a one-dimensional sequence and feeds it, together with the positional encoding, into the encoder. The decoder takes a small, fixed number of learned object queries (which can be understood as positional embeddings) and the encoder output as input. Finally, each output embedding of the decoder is passed to a shared feed-forward network (FFN), which predicts either a detection (class and bounding box) or the "no object" class.
1.1 Transformer

1.1.1 Encoder
The feature map output by the backbone is flattened into a one-dimensional representation, giving a $d \times HW$ feature map, which is combined with the positional encoding to form the encoder input. Each encoder layer consists of Multi-Head Self-Attention and an FFN. Unlike the standard Transformer encoder, and because the encoder is permutation-invariant, DETR adds the positional encoding at every Multi-Head Self-Attention layer to keep the detector sensitive to position.
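As a minimal sketch of the kind of fixed sine positional encoding that can be added at each attention layer, the following pure-Python functions encode a normalized (x, y) position as a vector. The function names, the feature size of 8 per coordinate, and the temperature of 10000 are assumptions chosen for illustration, not values read from the EasyCV code.

```python
import math


def sine_encode_scalar(value, num_feats=8, temperature=10000.0):
    """Encode one normalized coordinate in [0, 1] as interleaved sin/cos features."""
    feats = []
    for i in range(num_feats):
        # Each sin/cos pair (indices 2k and 2k + 1) shares one frequency.
        dim_t = temperature ** (2 * (i // 2) / num_feats)
        angle = value * 2 * math.pi / dim_t
        feats.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return feats


def sine_position_encoding(x, y, num_feats=8):
    """Concatenate the encodings of y and x into one vector of 2 * num_feats values."""
    return sine_encode_scalar(y, num_feats) + sine_encode_scalar(x, num_feats)


# Two positions on the same row: the y half is identical, the x half differs.
pe_a = sine_position_encoding(0.25, 0.50)
pe_b = sine_position_encoding(0.75, 0.50)
```

Because the encoding depends only on position, it can be recomputed or cached once and added to the queries and keys of every attention layer.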
1.1.2 Decoder
Because the decoder is also permutation-invariant, the $N$ object queries (which can be understood as learned positional embeddings for different objects) must differ from one another so that they produce embeddings for different objects. They are added at every Multi-Head Attention layer.
The $N$ object queries are transformed by the decoder into $N$ output embeddings, and each output embedding is then decoded independently by the FFN into one prediction containing a box and a class, giving $N$ predictions. Because both Self-Attention and Encoder-Decoder Attention are applied to these embeddings, the model can reason globally about the relations between objects. Unlike the standard Transformer decoder, each DETR decoder layer outputs the $N$ objects in parallel, whereas the standard Transformer decoder is autoregressive and outputs them serially, predicting only one element of the output sequence at a time.
1.1.3 FFN
The FFN consists of a 3-layer perceptron and a linear projection layer. The perceptron predicts the normalized center coordinates, height, and width of the box, and the linear layer predicts the class. DETR predicts a fixed-size set of N boxes, where N is usually larger than the actual number of objects (DETR defaults to 100, DAB-DETR uses 300), and an extra empty class indicates that a predicted box contains no object.
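The output format described above can be decoded as in the following sketch. It assumes boxes are normalized (cx, cy, w, h) and that the empty class is the last index of the class scores. The helper names `cxcywh_to_xyxy` and `decode_predictions` are hypothetical and are not part of EasyCV.

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to absolute (x1, y1, x2, y2) pixels."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )


def decode_predictions(class_probs, boxes, img_w, img_h, no_object_index):
    """Keep only predictions whose most likely class is not the 'no object' class."""
    detections = []
    for probs, box in zip(class_probs, boxes):
        label = max(range(len(probs)), key=probs.__getitem__)
        if label != no_object_index:
            detections.append((label, probs[label], cxcywh_to_xyxy(box, img_w, img_h)))
    return detections


# Two queries, two real classes plus the empty class at index 2.
class_probs = [[0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
boxes = [(0.5, 0.5, 0.2, 0.4), (0.3, 0.3, 0.1, 0.1)]
detections = decode_predictions(class_probs, boxes, img_w=640, img_h=480, no_object_index=2)
```

Only the first query survives here; the second is assigned to the empty class and dropped, which is how a fixed set of N predictions can represent a variable number of objects without NMS.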
2.Set prediction loss
The main difficulty in training DETR is scoring the predictions (class, position, number) against the gt. DETR's loss first produces an optimal bipartite matching between pred and gt (fixing a one-to-one correspondence) and then optimizes the loss over the matched pairs. Let $y$ denote the gt set and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of $N$ predictions. Assuming $N$ is larger than the number of objects in the image, $y$ can be regarded as a set of size $N$ padded with the empty class $\varnothing$ (no object). Searching over the permutations $\sigma \in \mathfrak{S}_N$ of $N$ elements, the permutation with the lowest loss is the optimal bipartite matching:

$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$$

where $\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$ is the pairwise matching loss between the gt element $y_i$ and the prediction with index $\sigma(i)$. The bipartite matching is computed with the Hungarian algorithm. The matching loss takes into account the accuracy of both the predicted class and the predicted box. Each gt element $i$ can be written as $y_i = (c_i, b_i)$, where $c_i$ is the class label (possibly the empty class $\varnothing$) and $b_i$ is the gt box. For the prediction with index $\sigma(i)$, the predicted probability of class $c_i$ is written $\hat{p}_{\sigma(i)}(c_i)$ and the predicted box is written $\hat{b}_{\sigma(i)}$.
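To make the search over permutations concrete, here is a brute-force sketch that enumerates every assignment and keeps the cheapest one. It is only practical for tiny N; real implementations use the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`). The cost used here, the negative class probability plus an L1 box distance, is a simplified assumption that omits the generalized IoU term, and the names `match_cost` and `best_matching` are hypothetical.

```python
from itertools import permutations

NO_OBJECT = None  # stands in for the empty class used to pad the gt set


def match_cost(gt, pred):
    """Pairwise cost: -p(c_i) plus L1 box distance; padded gt entries cost nothing."""
    gt_class, gt_box = gt
    if gt_class is NO_OBJECT:
        return 0.0
    probs, pred_box = pred
    l1 = sum(abs(a - b) for a, b in zip(gt_box, pred_box))
    return -probs[gt_class] + l1


def best_matching(gts, preds):
    """Return sigma with minimal total cost, where sigma[i] is the prediction for gt i."""
    n = len(preds)
    padded = list(gts) + [(NO_OBJECT, None)] * (n - len(gts))
    return min(
        permutations(range(n)),
        key=lambda sigma: sum(match_cost(padded[i], preds[sigma[i]]) for i in range(n)),
    )


# Two gt objects (class, cxcywh box) and three predictions (class probs, cxcywh box).
gts = [(0, (0.5, 0.5, 0.2, 0.2)), (1, (0.2, 0.2, 0.1, 0.1))]
preds = [
    ([0.1, 0.8], (0.21, 0.19, 0.1, 0.1)),
    ([0.3, 0.3], (0.9, 0.9, 0.05, 0.05)),
    ([0.9, 0.05], (0.5, 0.52, 0.2, 0.2)),
]
sigma = best_matching(gts, preds)
```

In this toy case the first gt object is matched to the third prediction, the second gt object to the first prediction, and the remaining prediction is left paired with the padded "no object" entry.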
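The first step finds the one-to-one matching between pred and gt. The second step computes the Hungarian loss over the matched pairs, where $\hat{\sigma}$ is the optimal matching from the first step:

$$\mathcal{L}_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}} \mathcal{L}_{box}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$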

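where $\mathcal{L}_{box}$ combines the L1 loss and the generalized IoU loss:

$$\mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{iou} \mathcal{L}_{iou}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{L1} \lVert b_i - \hat{b}_{\sigma(i)} \rVert_1$$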
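The box loss can be sketched in a few lines of plain Python. The example below assumes normalized (cx, cy, w, h) boxes with positive area, takes the IoU loss to be 1 minus the generalized IoU, and uses weights of 2 for the IoU term and 5 for the L1 term. These weights are assumptions for illustration; check the configuration file you train with for the actual values.

```python
def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) to corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def generalized_iou(box_a, box_b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest box that encloses both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def box_loss(gt_box, pred_box, lambda_iou=2.0, lambda_l1=5.0):
    """lambda_iou * (1 - GIoU) + lambda_l1 * L1 distance on normalized cxcywh boxes."""
    l1 = sum(abs(g - p) for g, p in zip(gt_box, pred_box))
    giou = generalized_iou(cxcywh_to_xyxy(gt_box), cxcywh_to_xyxy(pred_box))
    return lambda_iou * (1.0 - giou) + lambda_l1 * l1


perfect = box_loss((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2))
shifted = box_loss((0.5, 0.5, 0.2, 0.2), (0.6, 0.5, 0.2, 0.2))
```

The L1 term alone scales with box size, so large and small boxes with similar relative errors are penalized differently; the scale-invariant generalized IoU term compensates for that.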

DAB-DETR
DAB-DETR treats the object query as two parts, content and reference points, where the reference points are represented explicitly as a four-dimensional xywh vector. The decoder predicts xywh residuals to iteratively refine the detection boxes, and the xywh vector is used to introduce positional attention, which helps DETR converge faster.


Before DAB-DETR, several works explored in depth how to set the reference points. Conditional DETR learns xy reference points from 256-dimensional learnable vectors and then injects the positional information into the transformer decoder. Anchor DETR treats the reference point as xy, maps it to a 256-dimensional vector by learning, injects the positional information into the transformer decoder, and obtains xy by layer-by-layer iteration. Deformable DETR learns an xywh reference anchor from 256-dimensional learnable vectors and obtains the detection box by layer-by-layer iteration. DAB-DETR is more thorough and combines the strengths of all of these: it learns a 256-dimensional vector from xywh, injects the positional information into the transformer decoder, and obtains the detection box by layer-by-layer iteration. The way to use reference points is thus becoming clear: represent them explicitly as xywh, map them to a 256-dimensional vector to introduce positional information, let each transformer decoder layer learn a residual on xywh, and obtain the final detection box by accumulating the residuals layer by layer.
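The layer-by-layer refinement can be sketched as follows. The sketch assumes the residual is added in inverse-sigmoid (logit) space so that the refined anchor stays inside (0, 1), which is a common choice in Deformable DETR-style code. The residual values are made up for illustration, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p, eps=1e-5):
    """Logit of p, clamped away from 0 and 1 for numerical stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def refine_anchor(anchor, delta):
    """Apply one decoder layer's predicted residual to a normalized xywh anchor."""
    return tuple(sigmoid(inverse_sigmoid(a) + d) for a, d in zip(anchor, delta))


def iterative_refinement(anchor, deltas_per_layer):
    """Apply the per-layer residuals in order and record the anchor after each layer."""
    history = []
    for delta in deltas_per_layer:
        anchor = refine_anchor(anchor, delta)
        history.append(anchor)
    return history


# Hypothetical residuals from three decoder layers for a single query.
deltas = [(0.4, -0.2, 0.3, 0.1), (0.1, 0.0, 0.1, 0.0), (0.0, 0.0, 0.0, 0.0)]
boxes = iterative_refinement((0.5, 0.5, 0.1, 0.1), deltas)
```

Each layer only has to correct the previous layer's box, so the last entry of `boxes` is the final detection box for that query.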

In addition, to make full use of the more explicit xywh representation of the reference points, DAB-DETR introduces Width & Height-Modulated Multi-Head Cross-Attention. Put simply, it brings the xywh position into cross-attention to obtain positional attention. This greatly speeds up decoder convergence: the original DETR effectively has to learn positional attention over the whole image, whereas DAB-DETR can attend directly to the key positions. This is also why Deformable DETR converges faster. In essence, sampling sparse but more relevant positions speeds up decoder convergence.
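The following sketch illustrates the idea of width and height modulation on the positional part of the attention logit only. It assumes the x and y similarities between the anchor center and a key position are computed from sine embeddings and rescaled by `w_ref / w` and `h_ref / h`, where `(w, h)` is the anchor size and `(w_ref, h_ref)` is a reference size that the model would predict from the content query. Here the reference size is a fixed made-up value, and all function names are hypothetical.

```python
import math


def sine_embed(value, num_feats=16, temperature=10000.0):
    """Interleaved sin/cos embedding of one normalized coordinate."""
    feats = []
    for i in range(num_feats):
        angle = value * 2 * math.pi / temperature ** (2 * (i // 2) / num_feats)
        feats.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return feats


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def modulated_position_score(anchor, key_xy, ref_wh, num_feats=16):
    """Positional attention logit between an xywh anchor and a key position."""
    x, y, w, h = anchor
    key_x, key_y = key_xy
    w_ref, h_ref = ref_wh
    sim_x = dot(sine_embed(x, num_feats), sine_embed(key_x, num_feats))
    sim_y = dot(sine_embed(y, num_feats), sine_embed(key_y, num_feats))
    # Dividing by the anchor size flattens the attention along that axis.
    return (sim_x * w_ref / w + sim_y * h_ref / h) / math.sqrt(2 * num_feats)


def attention_gap(anchor_w):
    """Logit difference between a key at the anchor center and one shifted in x."""
    anchor = (0.5, 0.5, anchor_w, 0.2)
    near = modulated_position_score(anchor, (0.5, 0.5), ref_wh=(0.2, 0.2))
    far = modulated_position_score(anchor, (0.8, 0.5), ref_wh=(0.2, 0.2))
    return near - far


narrow_gap = attention_gap(0.1)
wide_gap = attention_gap(0.4)
```

A narrow anchor yields a larger gap between the center and the shifted key than a wide anchor, so after the softmax its attention is more concentrated along x, while a wide anchor spreads attention over a wider region.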
Reproduction results

Tutorial
Next, we use a practical example to show how to train DAB-DETR with EasyCV. You can also follow the link for the detailed steps.
1. Install dependencies
If you are running in a local development environment, refer to the link to set up the environment. If you use PAI-DSW for the experiment, there is no need to install the dependencies, because the PAI-DSW docker image already has the environment built in.
2. Data preparation
You can download the COCO2017 dataset, or use the sample COCO data that we provide.
The data/coco directory is organized as follows:
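A small sanity check before training can catch a misplaced dataset early. The sketch below assumes the standard COCO2017 file and directory names (`annotations/instances_train2017.json`, `annotations/instances_val2017.json`, `train2017`, `val2017`) under `data/coco`. These names are an assumption and should be adjusted if your copy or your configuration file uses different ones.

```python
from pathlib import Path

# Assumed layout (standard COCO2017 names); adjust to match your data.
EXPECTED_ENTRIES = [
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
    "train2017",
    "val2017",
]


def missing_coco_entries(root):
    """Return the expected files or directories that do not exist under root."""
    root = Path(root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    missing = missing_coco_entries("data/coco")
    print("missing:", missing if missing else "none")
```

An empty result means every expected entry was found; anything listed is a path to fix before starting a training run.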
3. Model training and evaluation
Take vitdet-base as an example. In EasyCV, configuration files control the model parameters, the data input and augmentation methods, and the training strategy. An experiment can be set up for training just by modifying the parameters in the configuration file. You can download the sample configuration file directly.
Check the EasyCV installation location
Run the training command
Run the evaluation command
Reference
Code implementation :
DETR https://github.com/alibaba/EasyCV/tree/master/easycv/models/detection/detectors/detr
DAB-DETR https://github.com/alibaba/EasyCV/tree/master/easycv/models/detection/detectors/dab_detr
Previous EasyCV posts
Reproducing ViTDet with EasyCV: single-scale features surpass FPN https://zhuanlan.zhihu.com/p/528733299
Introduction to the MAE self-supervised algorithm and its reproduction with EasyCV https://zhuanlan.zhihu.com/p/515859470
EasyCV open source | An out-of-the-box visual self-supervised learning + Transformer algorithm library https://zhuanlan.zhihu.com/p/505219993