ICML 2022 | ByteDance AI Lab proposes X-VLM, a multimodal model that learns multi-grained alignment of vision and language
2022-06-25 03:41:00 [QbitAI]
Foreword
Vision-language pre-training improves the performance of many downstream vision-language tasks, for example image-text retrieval and image-based question answering or reasoning. Some readers may ask: beyond pushing benchmark numbers ever higher on open academic tasks with larger models, more data, and training tricks, what practical use does a multimodal pre-trained model have?
To answer this, the ByteDance AI Lab Research team proposed X-VLM, which is the first to learn multi-grained alignment between vision and language. Experiments show that this pre-training approach is highly efficient: the model does not need to be very large, nor does it need massive pre-training data. With only 216M parameters, X-VLM achieves excellent performance on a wide range of multimodal tasks, for example image-text retrieval, image-based question answering or reasoning, visual grounding, and image captioning. X-VLM already surpasses many models commonly used in industry in ByteDance's real application scenarios; it has been deployed online and serves products such as Toutiao (Today's Headlines). The paper has been accepted by ICML 2022.
Paper: https://arxiv.org/abs/2111.08276
Code: https://github.com/zengyan-97/X-VLM
For example, because X-VLM learns multi-grained vision-language alignment, it can generate more accurate captions that describe the objects in an image and the relations between them, and this capability has been applied in a ByteDance public-welfare project. Mr. Zhao, who is visually impaired, often uses Toutiao to follow current affairs and news, and he has long had one wish: "to 'see' all the information content just like everyone else." More than two-thirds of Toutiao articles contain images. To help visually impaired users read them, the Toutiao app recently applied X-VLM's generation capability to automatically recognize images and describe them.
To let them "see" every picture, we made a small improvement
In addition, X-VLM's understanding and generation abilities are also used in the automatic homework-checking feature of the Dali Smart Learning Lamp. The figure below shows a complete-the-phrase question and the model's predictions:
The Dali Smart Learning Lamp's automatic homework-checking feature has been well received by parents, and this capability is still being continuously optimized.
Research background
Existing multimodal pre-training models can be roughly divided into two categories:
1) Methods that rely on an object detector to extract object-level features (for example: vehicles, people, trees, backpacks) to represent an image. Such methods can learn object-level vision-language alignment, as shown in Figure 1(a). They either use a pre-trained object detector directly or fold the object-detection step into multimodal pre-training;
2) Methods that encode the whole image with a ResNet or a Vision Transformer and only learn alignment between the image and the text, as shown in Figure 1(b).
Both approaches have problems. First, detection-based methods recognize all possible objects in an image, some of which are irrelevant to the paired text. Moreover, the object-level visual features extracted this way may lose the information between objects (which can be regarded as contextual information). In addition, such methods can only recognize a limited set of objects, and it is hard to predefine an appropriate set of object categories. The second approach is simple and direct, but it struggles to learn fine-grained vision-language alignment, e.g., object-level alignment. Previous work has confirmed that such fine-grained alignment is very helpful for visual reasoning and visual grounding tasks.
In fact, the following public data are available for multimodal pre-training: 1) images and their captions; 2) region annotations, e.g., the text "man crossing the street" in Figure 1 is associated with a specific region of the image, although previous work simply aligned region annotations with the whole image; 3) object labels, e.g., "backpack", which previous work used to train object detectors.
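As a concrete illustration, a single image in this unified view can carry all three kinds of supervision at once. The layout below is only a sketch; the field names are hypothetical and do not correspond to any actual dataset format.

```python
# Hypothetical layout of one pre-training example combining the three kinds of
# supervision described above (field names are illustrative only).
example = {
    "image": "000123.jpg",
    # 1) image-level caption paired with the whole picture
    "caption": "a man with a backpack is crossing a busy street",
    # 2) region annotation: a text span paired with a bounding box
    #    given as (cx, cy, w, h), normalized to [0, 1]
    "regions": [
        {"text": "man crossing the street", "bbox": [0.45, 0.60, 0.30, 0.55]},
    ],
    # 3) object label: a category name paired with a bounding box
    "objects": [
        {"label": "backpack", "bbox": [0.52, 0.48, 0.10, 0.14]},
    ],
}
```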
Departing from previous practice, the authors propose X-VLM, which uses all of the above data in a unified way to learn multi-grained vision-language alignment efficiently; it avoids the expensive object-detection step and is not limited to learning image-level or object-level alignment. Concretely, the authors propose using the Vision Transformer's patch embeddings to flexibly represent visual concepts of various granularities, as shown in Figure 1(c): for example, the visual concept "backpack" consists of 2 patches, while the visual concept "man crossing the street" consists of more patches.
X-VLM's recipe for learning multi-grained vision-language alignment is therefore:
1) Use patch embeddings to flexibly represent visual concepts of various granularities, then directly pull together visual concepts of different granularities and their corresponding texts; this part is optimized with the commonly used contrastive loss, matching loss, and MLM loss;
2) Furthermore, given different texts for the same image, require the model to predict the coordinates of the visual concept at the corresponding granularity, optimized with a bounding-box coordinate regression loss and an intersection-over-union loss. Experiments show that this pre-training approach is very efficient: the model does not need to be very large, nor does it need massive pre-training data, yet X-VLM performs excellently on downstream multimodal understanding and generation tasks.
Method
X-VLM consists of an image encoder, a text encoder, and a cross-modal encoder.
The left side of Figure 2 shows how a visual concept (which can be an object, a region, or the whole image) is encoded. The image encoder is based on a Vision Transformer and encodes the input image as patches. Then, given any bounding box, the representations of all patches inside the box are averaged to obtain a global representation of the region. This global representation and all the original patch representations inside the box are then arranged into a sequence in their original order, which serves as the representation of the visual concept corresponding to that bounding box. In this way the encodings of the image itself (I) and of the visual concepts inside it (V1, V2, V3) are obtained. The texts corresponding to the visual concepts, such as image captions, region descriptions, or object labels, are each encoded by the text encoder.
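A minimal sketch of this region encoding, assuming a ViT that outputs patch embeddings laid out on the image grid; the function and variable names below are ours for illustration, not taken from the released code.

```python
import torch

def encode_visual_concept(patch_grid: torch.Tensor, box: tuple) -> torch.Tensor:
    """Encode a visual concept from ViT patch embeddings (illustrative sketch).

    patch_grid: (H, W, D) patch embeddings on the image grid.
    box: (x1, y1, x2, y2) bounding box in patch-grid coordinates.
    Returns a (1 + N, D) sequence: the averaged region token followed by the
    in-box patch embeddings in their original order.
    """
    x1, y1, x2, y2 = box
    region_patches = patch_grid[y1:y2, x1:x2, :].reshape(-1, patch_grid.shape[-1])
    region_token = region_patches.mean(dim=0, keepdim=True)  # global representation
    return torch.cat([region_token, region_patches], dim=0)

# Example: a 14x14 grid of 768-d patch embeddings; a concept covering 2 patches
patches = torch.randn(14, 14, 768)
concept = encode_visual_concept(patches, (3, 5, 4, 7))  # shape: (3, 768)
```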
X-VLM adopts a common model structure; the difference lies in the pre-training method. The authors optimize it with the following two types of losses:
First, given the same image and different texts, e.g., T (text), T1 (text1), T2 (text2), T3 (text3), the model is required to predict the bounding box of the corresponding visual concept in the image:
x^j_cls is the output vector of the cross-modal encoder at the [CLS] position. The Sigmoid function normalizes the predicted bounding box. The corresponding ground truth b^j consists of the normalized x-coordinate of the box center, the y-coordinate of the center, the width, and the height. This loss is the sum of the bounding-box coordinate regression loss (L1) and the generalized intersection-over-union loss (GIoU). The authors argue that asking the model to locate the visual concept corresponding to each of several texts on the same image lets it learn multi-grained vision-language alignment more effectively. This loss is also used here for the first time in multimodal pre-training.
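The equation itself appears as an image in the original article and is not reproduced above. Based on the description, the box-prediction loss presumably has roughly the following form (a sketch; the prediction head on top of x^j_cls, written as an MLP here, is an assumption):

```latex
\hat{b}^{\,j} = \mathrm{Sigmoid}\big(\mathrm{MLP}(x^{j}_{\mathrm{cls}})\big), \qquad
b^{\,j} = (cx,\ cy,\ w,\ h)

\mathcal{L}_{\mathrm{bbox}} = \mathbb{E}\Big[\,\big\lVert \hat{b}^{\,j} - b^{\,j} \big\rVert_{1} + \mathcal{L}_{\mathrm{GIoU}}\big(\hat{b}^{\,j},\ b^{\,j}\big)\Big]
```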
Second, use patch embeddings to flexibly represent visual concepts of various granularities, and directly optimize the model to pull together texts and visual concepts of different granularities, covering object-text, region-text, and image-text alignment. The authors use three loss functions that are common in multimodal pre-training, namely:
1) Contrastive learning loss:
y^{v2t}, y^{t2v} ∈ R^{bsz×bsz} are the ground-truth similarity matrices, with 1 on the diagonal and 0 elsewhere. p^{v2t}, p^{t2v} ∈ R^{bsz×bsz} are the similarities computed by the model from the outputs of the text encoder and the image encoder.
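The contrastive-loss equation is likewise an image in the original. Given these definitions, it is presumably the standard symmetric cross-entropy over the in-batch similarity matrices, roughly:

```latex
\mathcal{L}_{\mathrm{cl}} = \tfrac{1}{2}\,\mathbb{E}\Big[\mathrm{H}\big(y^{v2t},\ p^{v2t}\big) + \mathrm{H}\big(y^{t2v},\ p^{t2v}\big)\Big]
```

where H denotes cross-entropy applied row-wise over the batch.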
2) Matching loss:
p^match is computed from the cross-modal encoder and predicts whether the given (visual concept, text) pair matches (in other words, a 0/1 classification). For each positive example, the authors sample a pair of negatives.
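The matching-loss equation is also omitted above; it is presumably a binary cross-entropy between the 0/1 label and the model's prediction:

```latex
\mathcal{L}_{\mathrm{match}} = \mathbb{E}\Big[\mathrm{H}\big(y^{\mathrm{match}},\ p^{\mathrm{match}}\big)\Big], \qquad y^{\mathrm{match}} \in \{0, 1\}
```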
3) Masked Language Modeling (MLM) loss:
T̂ denotes the text in which some tokens have been randomly replaced with [MASK], and p^j(V, T̂) is the probability distribution over the vocabulary computed by the cross-modal encoder from its output vector at the position of token t^j.
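Following the same pattern, the MLM loss is presumably the cross-entropy between the one-hot distribution of the masked token and the model's predicted vocabulary distribution:

```latex
\mathcal{L}_{\mathrm{mlm}} = \mathbb{E}\Big[\mathrm{H}\big(y^{\,j},\ p^{\,j}(V,\ \hat{T})\big)\Big]
```

where y^j is the one-hot vocabulary distribution of the original token t^j.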
Experiments
The authors experiment with a medium-sized model on the 4M- and 16M-image datasets commonly used in multimodal pre-training, as shown in the table below:
Here, (# Ann) is the total number of region annotations and object labels. As can be seen, some datasets have no image captions, e.g., Visual Genome (VG), while others have no annotations, e.g., CC-3M/12M.
Table 2 shows performance on image-text retrieval (MSCOCO and Flickr30K). Even though previous methods were pre-trained on larger amounts of in-house data or used larger models, X-VLM trained on the 4M-image dataset already surpasses them.
Table 3 shows performance on visual question answering and reasoning (VQA 2.0 and NLVR2), visual grounding (RefCOCO+), and image captioning (COCO Caption). For a fair comparison, X-VLM follows the fine-tuning method of previous work without additional tuning. Combining Tables 2 and 3, it can be seen that, compared with previous methods, X-VLM supports more kinds of downstream tasks and achieves very good performance on these common vision-language tasks.
Summary and discussion
In this paper, the authors propose X-VLM to learn multi-grained vision-language alignment; it avoids the expensive object-detection step and is not limited to learning image-level or object-level alignment. The keys to X-VLM's success are:
1) flexibly representing visual concepts of various granularities with patch embeddings, then directly pulling together visual concepts of different granularities and their corresponding texts;
2) furthermore, given different texts for the same image, requiring the model to predict the coordinates of the corresponding visual concepts. Experiments show that this pre-training approach is very efficient.
In the experiments, using the commonly used 4M and 16M datasets, X-VLM with only 216M parameters in total surpasses larger models and models pre-trained on much more data, performing excellently on downstream multimodal understanding and generation tasks. Moreover, ByteDance engineers have applied X-VLM in real business scenarios, for example describing image content for visually impaired users and automatically checking pupils' homework. In fact, X-VLM is also very good at fine-grained retrieval, visual grounding, and similar tasks.
The code of X-VLM is now open source, and readers are welcome to fine-tune it on their own tasks and try it out.
* This article is published with the authorization of QbitAI; the views expressed are the author's alone.
— End —