[Multimodal] HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval (ICCV 2021)
2022-07-25 12:00:00 【chad_ lee】
《HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval》ICCV 2021
This is joint work from Kwai and Peking University on the video-text retrieval task, i.e. aligning videos with text. It has been used in various Kwai scenarios.
Video-text alignment methods

Existing video-text alignment methods fall into three types:
- Two-stream: text and visual information first pass through an independent Vision Transformer and Text Transformer, and are then fused in a multimodal Transformer. Representative methods include ViLBERT and LXMERT.
- Single-stream: text and visual information are fused by a single multimodal Transformer only. Representative methods include VisualBERT and Unicoder-VL.
- Dual-stream (two-tower): text and visual information pass only through an independent Vision Transformer and Text Transformer, with no fusion module. Representative methods include COOT and T2VLAD.
The third, two-tower type clearly has the lowest time cost, because the two modalities are encoded independently. This paper also adopts the two-tower structure to meet the needs of large-scale video-text retrieval.
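To illustrate why the two-tower structure is cheap at retrieval time, here is a minimal sketch, assuming the video embeddings have already been computed offline and that cosine similarity is used for ranking. The function names `l2_normalize` and `rank_videos` are hypothetical and not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_videos(text_emb, video_embs):
    """Return video indices sorted by cosine similarity to the text query.

    video_embs is assumed to be precomputed by the video tower, so only
    the text tower has to run when a query arrives.
    """
    query = l2_normalize(text_emb)
    scored = []
    for idx, video_emb in enumerate(video_embs):
        key = l2_normalize(video_emb)
        score = sum(a * b for a, b in zip(query, key))
        scored.append((score, idx))
    # Highest similarity first.
    scored.sort(key=lambda pair: -pair[0])
    return [idx for _, idx in scored]


# Toy 2-dimensional embeddings, for illustration only.
video_bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = rank_videos([0.9, 0.1], video_bank)
```

With a fusion-based model, every query-video pair would have to go through the multimodal Transformer together; here the per-query cost is one text encoding plus dot products.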
There are two main innovations in this paper: 1. Representations are aligned not only at the last layer but also at the first layer. 2. The momentum update mechanism of MoCo is introduced and applied to contrastive-learning matching.
The second point is more involved: each tower also has a momentum-updated counterpart, so there are 4 models in total (a four-tower model). Combined with the contrastive learning loss at two levels, each paired sample requires 4 pairwise losses to be computed.
Model

First, all the encoders are Transformers.
For a video-text pair, the text is fed into the Query Text Encoder and the Key Text Encoder. Frames are extracted from the video, flattened into a sequence, and fed into the Query Video Encoder and the Key Video Encoder. Each output is a pooling over all token embeddings.
So 4 encoders receive input. Within each Query-Key pair the two models take the same input, and the Key model is obtained by a momentum update of the Query model.
The Key models also each maintain a separate negative sample queue, one for text and one for video. One contrastive loss uses video as the query and text as the key; the other uses text as the query and video as the key.
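The momentum update and the negative sample queue can be sketched as follows. This is a minimal illustration of the standard MoCo-style rule, assuming parameters are flat lists of floats; the momentum value `m=0.999` is the common MoCo default and not a value confirmed by this post, and the names `momentum_update` and `NegativeQueue` are hypothetical.

```python
from collections import deque


def momentum_update(query_params, key_params, m=0.999):
    """MoCo-style update: key <- m * key + (1 - m) * query.

    The key encoder receives no gradients; it slowly tracks the query encoder.
    m=0.999 is an assumed default, not taken from the post.
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


class NegativeQueue:
    """Fixed-size FIFO of key embeddings used as negatives (hypothetical helper)."""

    def __init__(self, max_size):
        # deque with maxlen drops the oldest entries automatically.
        self._items = deque(maxlen=max_size)

    def enqueue(self, embeddings):
        self._items.extend(embeddings)

    def items(self):
        return list(self._items)


# One queue per modality, as described above.
text_queue = NegativeQueue(max_size=4)
video_queue = NegativeQueue(max_size=4)
text_queue.enqueue([[0.1, 0.9], [0.8, 0.2]])
video_queue.enqueue([[0.3, 0.7], [0.6, 0.4]])
```

In a real training loop the update would be applied to every parameter tensor of the Key Text Encoder and Key Video Encoder after each optimizer step on the Query encoders.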

The loss is then computed at both the bottom and the top layers, which doubles the count, so there are 4 losses in total.
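How the 4 losses add up can be sketched with a plain InfoNCE loss. This is an illustrative sketch, not the authors' implementation: the level names `"low"` and `"high"`, the temperature `0.07`, the unweighted sum, and the function names are all assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE: negative log-softmax of the positive among positive + negatives."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, neg) / temperature for neg in negative_keys]
    # Log-sum-exp with the max subtracted for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


def four_tower_loss(q_video, q_text, k_video, k_text,
                    video_queue, text_queue, temperature=0.07):
    """Sum of 2 directions x 2 levels = 4 contrastive losses.

    Every argument is a dict keyed by level ("low", "high"); these keys
    are hypothetical names for the bottom-layer and top-layer embeddings.
    """
    total = 0.0
    for level in ("low", "high"):
        # Video as query, text as key.
        total += info_nce(q_video[level], k_text[level],
                          text_queue[level], temperature)
        # Text as query, video as key.
        total += info_nce(q_text[level], k_video[level],
                          video_queue[level], temperature)
    return total
```

The query embeddings come from the Query encoders, while the positive keys and the queued negatives come from the momentum-updated Key encoders.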

Experiments
