
(CVPR-2022)BiCnet

2022-07-23 22:47:00 【Gu daochangsheng】

Following [30, 42], we decompose the video network into spatial cues and temporal relations. With the efficient BiCnet fully mining the spatial cues, we build a Temporal Kernel Selection (TKS) block to jointly model short-term and long-term temporal relations. Because temporal relations at different scales differ in importance from one sequence to another (as shown in Figure 2), TKS combines multi-scale temporal relations dynamically, that is, it assigns different weights to different temporal scales according to the input sequence.

Figure 2: Short-term and long-term temporal relations differ in importance across sequences. (a) A partially occluded sequence: long-term temporal cues are needed to alleviate the occlusion. (b) A sequence of a fast-moving pedestrian: short-term temporal cues are needed to model the detailed motion patterns.

Specifically, TKS takes a sequence of consecutive frame features as input, where each element is the feature map of one frame, and performs three operations on it: Partition, Select and Excite.

Partition operation. Because person detection algorithms are imperfect, adjacent video frames are not well aligned, which can make temporal convolution ineffective in video reID [9]. Following [34], we use a partition strategy to alleviate this spatial misalignment. Specifically, given the video feature map, we divide each frame into a grid of spatial regions, average-pool each region, and build a region-level video feature map.
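
The Partition step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `partition`, the grid sizes `h` and `w`, the `(T, H, W, C)` layout, and the requirement that `H` and `W` divide evenly are all assumptions made for the example.

```python
import numpy as np

def partition(feats, h, w):
    """Average-pool each frame's H x W map into an h x w grid of regions.

    feats: array of shape (T, H, W, C).
    Returns an array of shape (T, h, w, C).
    Assumption of this sketch: H is divisible by h and W by w.
    """
    T, H, W, C = feats.shape
    if H % h or W % w:
        raise ValueError("H and W must be divisible by h and w in this sketch")
    # Split each spatial axis into (region index, pixel index inside the region),
    # then average over the pixels inside each region.
    blocks = feats.reshape(T, h, H // h, w, W // w, C)
    return blocks.mean(axis=(2, 4))

# Toy input: 2 frames, 4 x 4 spatial map, 1 channel.
feats = np.arange(2 * 4 * 4 * 1, dtype=float).reshape(2, 4, 4, 1)
regions = partition(feats, h=2, w=2)
print(regions.shape)
```

Pooling over regions means a small spatial shift between adjacent frames usually stays inside the same region, which is why the temporal convolution that follows is less sensitive to misalignment.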

Select operation. As shown in Figure 4, given the region-level feature map, we run several parallel paths, where each path F^(i) is a 1D temporal convolution [30] with its own kernel size. To further improve efficiency, the temporal convolution with a large kernel is replaced by a dilated convolution with a smaller kernel and a matching dilation rate. The basic idea of the Select operation is to use global information from all temporal paths to determine the weight assigned to each path. Specifically, we first fuse the outputs of all paths by element-wise summation, then apply global average pooling to obtain a global feature.
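
The parallel paths and the global feature can be illustrated with a small NumPy sketch. Several details here are assumptions for the sake of a runnable example, not facts from the text: each path uses a 3-tap kernel with dilation `i` for path `i`, the convolution is depthwise (one filter per channel) rather than a full channel-mixing convolution, and zero padding keeps the number of frames unchanged. The helper names are hypothetical.

```python
import numpy as np

def dilated_temporal_conv(x, kernel, dilation):
    """Depthwise 1D convolution along the time axis, zero-padded to keep T fixed.

    x: (T, h, w, C); kernel: (3, C), one 3-tap filter per channel (assumed).
    """
    T = x.shape[0]
    pad = np.zeros((dilation,) + x.shape[1:], dtype=x.dtype)
    padded = np.concatenate([pad, x, pad], axis=0)
    out = np.zeros_like(x)
    for k in range(3):
        # Tap k reads the frame at offset (k - 1) * dilation from the current frame.
        out += padded[k * dilation : k * dilation + T] * kernel[k]
    return out

def global_feature(x, kernels):
    """Run one path per kernel (dilation 1, 2, ...), sum the paths, then pool."""
    outputs = [dilated_temporal_conv(x, k, i) for i, k in enumerate(kernels, start=1)]
    fused = np.sum(outputs, axis=0)            # element-wise summation over paths
    return outputs, fused.mean(axis=(0, 1, 2))  # average over time and space -> (C,)

x = np.ones((4, 1, 1, 2))                       # 4 frames, 1 x 1 regions, 2 channels
outputs, u = global_feature(x, [np.ones((3, 2)), np.ones((3, 2))])
print(u.shape)
```

A 3-tap kernel with dilation `i` covers the same temporal span as a dense kernel of size `2 * i + 1` while using only three weights per channel, which is the efficiency argument behind dilated paths.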

Here the global average pooling is taken along the temporal and spatial dimensions. The channel-wise selection weights are then obtained from this global embedding, using a separate set of transformation parameters for each path.

The aggregated feature map is then obtained by applying the selection weights to the outputs of the different temporal kernels, after a reshape operation makes each weight vector size-compatible with the output of its path.
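
The three steps just described can be written compactly as follows. This is a reconstruction from the prose only: every symbol name (K paths with outputs Y^(i), global feature u, transformation parameters W^(i), selection weights α^(i), aggregated map Z, reshape Ψ) is an assumption, and the softmax normalisation across paths is one natural reading of "selection weights" rather than a confirmed form.

```latex
\begin{aligned}
u &= \mathrm{GAP}\Big(\sum_{i=1}^{K} Y^{(i)}\Big), \\
\alpha^{(i)} &= \frac{\exp\big(W^{(i)} u\big)}{\sum_{j=1}^{K} \exp\big(W^{(j)} u\big)}, \\
Z &= \sum_{i=1}^{K} Y^{(i)} \odot \Psi\big(\alpha^{(i)}\big).
\end{aligned}
```

Under this reading the exponentials and the division are applied element-wise, so for every channel the weights over the K paths sum to one.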

It is worth noting that, rather than using scale-level weights, which provide only coarse fusion, we use channel-wise weights (Equation 7) for the fusion. This design yields finer-grained fusion in which each feature channel can be adjusted. In addition, the weights are computed dynamically from the input video, which is crucial for reID, where different sequences may have different dominant temporal scales.
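
A minimal NumPy sketch of channel-wise selection is given here. It assumes that the weights are produced by a softmax across paths applied separately to every channel, and that each path has its own `(C, C)` transformation matrix; the function name `select` and these shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def select(outputs, weights):
    """Fuse K path outputs with channel-wise, input-dependent weights.

    outputs: list of K arrays of shape (T, h, w, C).
    weights: list of K matrices of shape (C, C), one per path (assumed form).
    """
    u = np.sum(outputs, axis=0).mean(axis=(0, 1, 2))   # global feature, (C,)
    logits = np.stack([W @ u for W in weights])        # (K, C)
    logits -= logits.max(axis=0, keepdims=True)        # for numerical stability
    alpha = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    # Reshape each (C,) weight vector so it broadcasts over time and space.
    z = sum(y * a.reshape(1, 1, 1, -1) for y, a in zip(outputs, alpha))
    return z, alpha

outputs = [np.ones((2, 1, 1, 3)), 3 * np.ones((2, 1, 1, 3))]
weights = [np.zeros((3, 3)), np.zeros((3, 3))]          # equal logits for both paths
z, alpha = select(outputs, weights)
print(alpha.shape, z.shape)
```

Because `alpha` has one row per path and one column per channel, one channel can favour the short-term path while another favours the long-term path. A scale-level scheme would instead use a single scalar per path.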

Excite operation. The Excite operation uses the aggregated feature map to modulate the input feature map. To produce the final feature map, a nearest-neighbor upsampler upsamples the aggregated map so that it matches the spatial resolution of the input. The TKS block preserves the input size, so it can be inserted into BiCnet to extract effective spatio-temporal features.
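
The Excite step can be sketched as nearest-neighbor upsampling followed by a combination with the input. The text does not preserve the exact formula, so the residual addition used here is an assumption, as are the function name `excite` and the integer upsampling factors.

```python
import numpy as np

def excite(z, feats):
    """Upsample the region-level map z and combine it with the input feats.

    z: (T, h, w, C); feats: (T, H, W, C), with H and W multiples of h and w.
    Assumption of this sketch: the modulation is a residual addition.
    """
    T, H, W, C = feats.shape
    h, w = z.shape[1:3]
    # Repeating each region value over its block of pixels is nearest-neighbor
    # upsampling when the scale factors are integers.
    up = np.repeat(np.repeat(z, H // h, axis=1), W // w, axis=2)
    return feats + up

z = np.array([1.0, 2.0, 3.0, 4.0]).reshape(1, 2, 2, 1)
feats = np.zeros((1, 4, 4, 1))
out = excite(z, feats)
print(out.shape)
```

The output has the same shape as `feats`, which is what lets a block of this kind be dropped between existing layers without changing the surrounding network.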

Copyright notice
This article was written by [Gu daochangsheng]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/204/202207231731162948.html
