DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution

2022-07-23 16:48:00 【TJMtaotao】

Abstract

Many modern object detectors achieve outstanding performance by using the mechanism of looking and thinking twice. In this paper, we explore this mechanism in the backbone design for object detection. At the macro level, we propose the Recursive Feature Pyramid, which incorporates extra feedback connections from the Feature Pyramid Network into the bottom-up backbone layers. At the micro level, we propose Switchable Atrous Convolution, which convolves the features with different atrous rates and gathers the results using switch functions. Combining them results in DetectoRS, which significantly improves the performance of object detection. On COCO test-dev, DetectoRS achieves state-of-the-art 54.7% box AP for object detection, 47.1% mask AP for instance segmentation, and 49.6% PQ for panoptic segmentation. Code: https://github.com/joe-siyuan-qiao/DetectoRS

1. Introduction

To detect objects, human visual perception selectively enhances and suppresses neuron activation by passing high-level semantic information through feedback connections [2,19,20]. Inspired by the human visual system, the mechanism of looking and thinking twice has been instantiated in computer vision and has demonstrated outstanding performance [5,6,58]. Many popular two-stage object detectors, such as Faster R-CNN [58], first output object proposals and then extract regional features based on those proposals to detect objects. Following the same direction, Cascade R-CNN [5] develops a multi-stage detector in which subsequent detector heads are trained with more selective examples. The success of this design philosophy motivates us to explore it in the neural network backbone design for object detection. In particular, we deploy the mechanism at both the macro and micro levels. The resulting DetectoRS significantly improves the performance of the state-of-the-art object detector HTC [7] while maintaining a similar inference speed, as shown in Table 1.

At the macro level, our proposed Recursive Feature Pyramid (RFP) builds on top of the Feature Pyramid Network (FPN) [44] by incorporating extra feedback connections from the FPN layers into the bottom-up backbone layers, as shown in Fig. 1a. Unrolling the recursive structure into a sequential implementation, we obtain a backbone for object detectors that looks at the image twice or more. Similar to the cascaded detector heads in Cascade R-CNN, our RFP recursively enhances FPN to generate increasingly powerful representations. Resembling deeply-supervised networks [36], the feedback connections bring features that directly receive gradients from the detector heads back to the low levels of the bottom-up backbone, which speeds up training and boosts performance. Our proposed RFP implements a sequential design of looking and thinking twice: the bottom-up backbone and FPN are run multiple times, with their output features depending on those from the previous steps.

At the micro level, we propose Switchable Atrous Convolution (SAC), which convolves the same input feature with different atrous rates [11,30,53] and gathers the results using switch functions. Fig. 1b illustrates the concept of SAC. The switch functions are spatially dependent, that is, each location of the feature map may have different switches to control the outputs of SAC. To use SAC in the detector, we convert all the standard 3x3 convolutional layers in the bottom-up backbone to SAC, which improves detector performance by a large margin. Some previous methods adopt conditional convolution, for example [39,74], which also combines the results of different convolutions into a single output. Unlike those methods, whose architectures must be trained from scratch, SAC provides a mechanism to easily convert pretrained standard convolutional networks (for example, ImageNet-pretrained [59] checkpoints). In addition, SAC uses a new weight locking mechanism in which the weights of the different atrous convolutions are the same except for a trainable difference.
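As a rough illustration of the idea, the pure-Python sketch below applies the same 3x3 kernel at atrous rate 1 and at a larger rate with weights w + delta_w, then blends the two outputs per position with a switch value in [0, 1]. The function names, the larger rate of 3, the blending form s * small + (1 - s) * large, and the hand-written switch map are all assumptions made for illustration; this section does not describe how the switch itself is computed.

```python
def atrous_conv3x3(x, w, rate):
    """3x3 convolution on a 2D list with zero padding and a given atrous rate."""
    h, wd = len(x), len(x[0])
    out = [[0.0] * wd for _ in range(h)]
    for i in range(h):
        for j in range(wd):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di * rate, j + dj * rate
                    if 0 <= ii < h and 0 <= jj < wd:  # zero padding outside
                        acc += w[di + 1][dj + 1] * x[ii][jj]
            out[i][j] = acc
    return out


def switchable_atrous_conv(x, w, delta_w, switch, large_rate=3):
    """Blend two atrous convolutions per position (illustrative only).

    switch[i][j] weights the rate-1 branch and (1 - switch[i][j]) weights
    the large-rate branch. Weight locking: the second branch reuses w and
    adds only a trainable difference delta_w.
    """
    w_large = [[w[a][b] + delta_w[a][b] for b in range(3)] for a in range(3)]
    y_small = atrous_conv3x3(x, w, 1)
    y_large = atrous_conv3x3(x, w_large, large_rate)
    return [
        [s * a + (1.0 - s) * b for s, a, b in zip(srow, arow, brow)]
        for srow, arow, brow in zip(switch, y_small, y_large)
    ]


# Toy usage with a hypothetical 7x7 input and an averaging kernel.
x = [[float(i * 7 + j) for j in range(7)] for i in range(7)]
w = [[1.0 / 9.0] * 3 for _ in range(3)]
zero_delta = [[0.0] * 3 for _ in range(3)]
all_ones = [[1.0] * 7 for _ in range(7)]  # switch fully selects the rate-1 branch
y = switchable_atrous_conv(x, w, zero_delta, all_ones)
```

With the switch fixed at 1 everywhere and delta_w at zero, the output equals the plain rate-1 convolution, which is what makes it possible to start from a pretrained standard convolution.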

Combining the proposed RFP and SAC results in our DetectoRS. To demonstrate its effectiveness, we incorporate DetectoRS into the state-of-the-art HTC [7] on the challenging COCO dataset [47]. On COCO test-dev, we report box AP for object detection [22], mask AP for instance segmentation [26], and PQ for panoptic segmentation [34]. DetectoRS with ResNet-50 [28] as the backbone improves HTC [7] by 7.7% box AP and 5.9% mask AP. Furthermore, equipping DetectoRS with ResNeXt-101-32x4d [71] achieves state-of-the-art 54.7% box AP and 47.1% mask AP. Together with stuff predictions from DeepLabv3+ [14] with Wide-ResNet-41 [10] as the backbone, DetectoRS sets a new record of 49.6% PQ for panoptic segmentation.

2. Related Works

Object detection. There are two main categories of object detection methods: one-stage methods, such as [45,50,56,60,80,81], and multi-stage methods, such as [5,7,9,25,27,58]. Multi-stage detectors are usually more flexible and more accurate than one-stage detectors, but also more complex. In this paper, we use the multi-stage detector HTC [7] as our baseline and show comparisons with both categories of detectors.

Multi-scale features. Our Recursive Feature Pyramid is based on the Feature Pyramid Network (FPN) [44], an effective object detection system that exploits multi-scale features. Previously, many object detectors directly used the multi-scale features extracted from the backbone [4,50], whereas FPN incorporates a top-down path to sequentially combine features at different scales. PANet [49] adds another bottom-up path on top of FPN. STDL [82] proposes a scale-transfer module to exploit cross-scale features. G-FRNet [1] adds feedback with gating units. NAS-FPN [24] and Auto-FPN [73] use neural architecture search [87] to find the optimal FPN structure. EfficientDet [66] proposes to repeat a simple BiFPN layer. Unlike these methods, our proposed Recursive Feature Pyramid goes through the bottom-up backbone repeatedly to enrich the representation power of FPN. In addition, we incorporate Atrous Spatial Pyramid Pooling (ASPP) [13,14] into FPN to enrich the features, similar to the mini-DeepLab design in Seamless [55].

Recursive convolutional networks. Many recursive methods have been proposed to address different types of computer vision problems, such as [32,42,65]. Recently, CBNet [51] proposed a recursive method for object detection that cascades multiple backbones and feeds their output features into FPN. By contrast, our RFP performs the recursive computation with the ASPP-enriched FPN included, along with effective fusion modules.

Conditional convolution. Conditional convolutional networks adopt dynamic kernels, widths, or depths, for example [16,39,43,48,74,77]. Unlike them, our proposed Switchable Atrous Convolution (SAC) provides an effective mechanism for converting standard convolutions into conditional convolutions without changing any pretrained model. SAC is therefore a plug-and-play module for many pretrained backbones. In addition, SAC uses global context information and a new weight locking mechanism to make it more effective.

3. Recursive feature pyramid

3.1 Feature pyramid network

where x_0 is the input image and f_{S+1} = 0. An object detector built on FPN uses the features f_i for its detection computations.
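The boundary conditions x_0 (the input image) and f_{S+1} = 0 fit the usual FPN recurrence, in which each bottom-up stage computes x_i = B_i(x_{i-1}) and each top-down step computes f_i = F_i(f_{i+1}, x_i). That recurrence is assumed here, not taken from the text above. The sketch below traces it on plain numbers; the stage and fusion functions are hypothetical stand-ins, not the operations used in the paper.

```python
def fpn_forward(x0, stages, top_down):
    """Run S bottom-up stages, then S top-down steps.

    stages[i - 1] plays the role of B_i and top_down[i - 1] the role of F_i
    (assumed notation). Returns [f_1, ..., f_S].
    """
    num_stages = len(stages)
    xs = [x0]
    for stage in stages:                  # x_i = B_i(x_{i-1})
        xs.append(stage(xs[-1]))
    f_next = 0                            # boundary condition f_{S+1} = 0
    fs = [None] * num_stages
    for i in range(num_stages, 0, -1):    # f_i = F_i(f_{i+1}, x_i)
        fs[i - 1] = top_down[i - 1](f_next, xs[i])
        f_next = fs[i - 1]
    return fs


# Hypothetical toy: every stage doubles its input, every top-down step adds.
stages = [lambda x: 2 * x] * 3
top_down = [lambda f, x: f + x] * 3
features = fpn_forward(1, stages, top_down)
print(features)  # [14, 12, 8]
```

Here the stage outputs are 2, 4, 8, and the top-down pass accumulates them from the deepest stage upward, so f_3 = 8, f_2 = 12, and f_1 = 14.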

3.2 Recursive feature pyramid

We modify the ResNet [28] backbone B so that it accepts both x and R(f) as input. ResNet has four stages, each composed of several similar blocks. We change only the first block of each stage, as shown in Fig. 3. This block computes a 3-layer feature and adds it to a feature computed by a shortcut. To use the feature R(f), we add another convolutional layer with its kernel size set to 1. The weight of this layer is initialized to 0, so that it has no real effect when the weights are loaded from a pretrained checkpoint.
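The effect of the zero initialization can be checked on a toy example. The sketch below assumes the 1x1-projected feedback is simply added to the block's usual output; the helper names and tensor shapes are hypothetical, and features are stored as nested lists indexed [channel][row][column].

```python
def conv1x1(feat, weight):
    """1x1 convolution: feat is [C_in][H][W], weight is [C_out][C_in]."""
    c_in = len(feat)
    h, w = len(feat[0]), len(feat[0][0])
    return [
        [[sum(weight[o][c] * feat[c][i][j] for c in range(c_in))
          for j in range(w)] for i in range(h)]
        for o in range(len(weight))
    ]


def add_features(a, b):
    """Element-wise sum of two [C][H][W] features of equal shape."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def first_block_with_feedback(block_out, rf, feedback_weight):
    """Add the 1x1-projected feedback R(f) to the block's usual output."""
    return add_features(block_out, conv1x1(rf, feedback_weight))


# Hypothetical shapes: block output has 2 channels, feedback R(f) has 3.
block_out = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
rf = [[[9.0, 9.0], [9.0, 9.0]] for _ in range(3)]
zero_weight = [[0.0] * 3 for _ in range(2)]   # zero-initialized 1x1 kernel

out = first_block_with_feedback(block_out, rf, zero_weight)
assert out == block_out  # feedback contributes nothing at initialization
```

Because the projection starts at zero, the modified block reproduces the pretrained block exactly at the start of training, and the feedback path only begins to matter once its weights are updated.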

3.3. ASPP as the Connecting Module

We do not place a convolutional layer after the concatenated features, because here R does not generate the final output used in dense prediction tasks. Note that each of the four branches yields a feature whose number of channels is 1/4 that of the input feature, so concatenating them produces a feature of the same size as the input feature of R. In Sec. 5, we show the performance of RFP with and without the ASPP module.
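The channel bookkeeping can be sketched as follows. Each branch here is a hypothetical stand-in (channel-group averaging with a scale factor), not a real ASPP branch; the only point is that four branches of C/4 channels concatenate back to C channels with unchanged spatial size.

```python
def toy_branch(feat, out_channels, scale):
    """Hypothetical stand-in for one branch.

    Maps a [C][H][W] feature to [out_channels][H][W] by averaging
    consecutive groups of input channels and scaling the result.
    """
    group = len(feat) // out_channels
    h, w = len(feat[0]), len(feat[0][0])
    return [
        [[scale * sum(feat[o * group + g][i][j] for g in range(group)) / group
          for j in range(w)] for i in range(h)]
        for o in range(out_channels)
    ]


def connecting_module(feat, num_branches=4):
    """Run the branches and concatenate their outputs along the channel axis."""
    channels = len(feat)
    if channels % num_branches != 0:
        raise ValueError("channel count must be divisible by the branch count")
    branches = [
        toy_branch(feat, channels // num_branches, scale=k + 1)
        for k in range(num_branches)
    ]
    return [channel for branch in branches for channel in branch]


# Toy input: 8 channels of size 2x2, so each branch outputs 2 channels.
feat = [[[float(c)] * 2 for _ in range(2)] for c in range(8)]
out = connecting_module(feat)
assert len(out) == len(feat)  # 4 branches x 2 channels = 8 channels
```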

3.4 Output update by the fusion module