
[Multimodal] HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval, ICCV 2021

2022-07-25 12:00:00 【chad_ lee】

《HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval》ICCV 2021

This is joint work by Kuaishou (Kwai) and Peking University on the video-text retrieval task, that is, aligning videos with text. It has been used in various Kuaishou scenarios.

Video-text alignment methods


Existing video-text alignment methods fall into three types:

Two-stream: text and visual information first pass through an independent Vision Transformer and Text Transformer, and are then fused in a multimodal Transformer. Representative methods include ViLBERT and LXMERT.
Single-stream: text and visual information are fused by a single multimodal Transformer only. Representative methods include VisualBERT and Unicoder-VL.
Dual-stream: text and visual information pass only through an independent Vision Transformer and Text Transformer, with no fusion module. Representative methods include COOT and T2VLAD.

The third type, the dual-tower design, clearly has the lowest time cost, because the video and text embeddings are computed independently of each other. This paper also adopts the dual-tower structure to meet the needs of large-scale video-text retrieval.
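The cost advantage of the dual-tower layout can be illustrated with a minimal sketch: video embeddings are computed once offline, and a text query then needs only one encoder pass plus dot products. The function names, video ids, and toy 3-dimensional vectors below are assumptions for illustration and are not taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_videos(text_emb, video_index):
    """Rank pre-computed video embeddings by cosine similarity to a text embedding.

    video_index maps a (hypothetical) video id to its embedding.
    """
    q = l2_normalize(text_emb)
    scores = {
        vid: sum(a * b for a, b in zip(q, l2_normalize(emb)))
        for vid, emb in video_index.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings standing in for the outputs of the two towers.
video_index = {
    "video_a": [1.0, 0.0, 0.0],
    "video_b": [0.0, 1.0, 0.0],
    "video_c": [0.7, 0.7, 0.0],
}
ranking = rank_videos([0.9, 0.1, 0.0], video_index)
```

In a two-stream or single-stream design, every (text, video) pair would instead have to pass through the fusion Transformer at query time, which is what makes those designs expensive for large candidate sets.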

The paper has two main innovations: 1. Representations are aligned not only at the last layer but also at the first layer. 2. The momentum update mechanism of MoCo is introduced and applied to contrastive-learning matching.

The second point is more involved. Each tower also has a momentum-updated counterpart, so there are 4 models in total (a four-tower model). Together with the contrastive losses at two levels, each video-text pair requires 4 pairwise losses to be computed.

Model


First, all the encoders are Transformers.

For a video-text pair, the text is fed into the Query Text Encoder and the Key Text Encoder. Frames are extracted from the video, flattened into a sequence, and fed into the Query Video Encoder and the Key Video Encoder. Each output is a pooling over all token embeddings.

So 4 encoders receive input. The two models in each Query-Key pair receive the same input, and each Key model is updated by momentum from its Query model.
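The momentum update follows the MoCo rule k <- m * k + (1 - m) * q, applied to every parameter of the Key model. Below is a minimal sketch on flat lists of floats; the function name, the toy parameters, and the coefficient values are assumptions for illustration, not the paper's settings.

```python
def momentum_update(query_params, key_params, m=0.999):
    """Return key parameters after one MoCo-style update: k <- m*k + (1-m)*q."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

# Toy "parameters": flat lists of floats standing in for encoder weights.
query_params = [1.0, 2.0, 3.0]
key_params = [0.0, 0.0, 0.0]

# With a fixed query model, the key model drifts slowly towards it.
for _ in range(3):
    key_params = momentum_update(query_params, key_params, m=0.9)
```

Only the Query models receive gradients; a coefficient close to 1 keeps the Key models changing slowly, so the keys stored from earlier batches stay roughly consistent with the current Key encoder.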

The Key models also maintain separate negative-sample queues for text and video. There is a contrastive loss with video as the query and text as the key, and another with text as the query and video as the key.
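One direction of this matching can be sketched with the standard InfoNCE loss used in MoCo, where the negatives come from a fixed-size queue of earlier keys. This is a minimal sketch under assumptions: the temperature, queue size, helper names, and toy 2-dimensional embeddings are illustrative and are not the paper's values.

```python
import math
from collections import deque

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query: negative log-softmax of the positive logit."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, neg) / temperature for neg in negative_keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[0]

# Fixed-size FIFO queues of key embeddings; the oldest entries drop out automatically.
text_queue = deque(maxlen=4)
video_queue = deque(maxlen=4)
text_queue.extend([[0.0, 1.0], [-1.0, 0.0]])
video_queue.extend([[0.0, -1.0], [-1.0, 0.0]])

# Toy unit-length embeddings for one matching video-text pair.
video_q, video_k = [1.0, 0.0], [1.0, 0.0]
text_q, text_k = [0.8, 0.6], [0.8, 0.6]

loss_v2t = info_nce(video_q, text_k, list(text_queue))   # video as query, text as key
loss_t2v = info_nce(text_q, video_k, list(video_queue))  # text as query, video as key

# After computing the losses, enqueue the current keys as future negatives.
text_queue.append(text_k)
video_queue.append(video_k)
```

Because the queue decouples the number of negatives from the batch size, each query can be contrasted against many more negatives than a single batch provides.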


These losses are computed at both the bottom layer and the top layer, which doubles the count, so there are 4 losses in total.
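The count of 4 comes from 2 levels times 2 retrieval directions, which can be enumerated explicitly. The sketch below is illustrative only: the dictionary layout, the level names "bottom" and "top", the use of separate queues per level, the InfoNCE form of each term, and the unweighted sum are all assumptions, not details confirmed by the paper.

```python
import math

def contrastive_loss(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE-style loss for one query (assumed form of each of the 4 terms)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, neg) / temperature for neg in negative_keys]
    max_logit = max(logits)
    return max_logit + math.log(sum(math.exp(x - max_logit) for x in logits)) - logits[0]

def four_losses(sample, queues):
    """Compute one loss per (level, direction) combination."""
    losses = {}
    for level in ("bottom", "top"):
        emb, neg = sample[level], queues[level]
        losses[(level, "video->text")] = contrastive_loss(emb["video_q"], emb["text_k"], neg["text"])
        losses[(level, "text->video")] = contrastive_loss(emb["text_q"], emb["video_k"], neg["video"])
    return losses

# Toy embeddings: the same values are reused at both levels purely for brevity.
level_emb = {"video_q": [1.0, 0.0], "video_k": [1.0, 0.0],
             "text_q": [0.8, 0.6], "text_k": [0.8, 0.6]}
level_neg = {"text": [[0.0, 1.0]], "video": [[0.0, -1.0]]}
sample = {"bottom": level_emb, "top": level_emb}
queues = {"bottom": level_neg, "top": level_neg}

losses = four_losses(sample, queues)
total = sum(losses.values())  # equal weighting is an assumption
```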

Experiment

Original site

Copyright notice
This article was written by [chad_ lee]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/206/202207251110592093.html
