ICML 2022 | ByteDance AI Lab proposes a multimodal model, X-VLM, that learns multi-granularity alignment of vision and language
2022-06-25 03:41:00 【QbitAI】
Foreword
Vision-language pre-training improves the performance of many downstream vision-language tasks, such as image-text retrieval and image-based question answering or reasoning. A question often comes up: besides using larger models, more data, and training tricks to push benchmark numbers very high on open academic tasks, what practical applications does a multimodal pre-trained model have?

To answer this, the ByteDance AI Lab research team proposed X-VLM, the first model designed to learn multi-granularity alignment between vision and language. Experiments show that this pre-training method is very efficient: the model does not need to be very large and the pre-training data does not need to be massive. With only 216M parameters, X-VLM achieves excellent performance on a wide range of multimodal tasks, such as image-text retrieval, image-based question answering or reasoning, visual grounding, and image caption generation. X-VLM already outperforms many models commonly used in industry in ByteDance's real application scenarios; it has been deployed online and serves products such as Toutiao. The paper has been accepted by ICML 2022.

Paper: https://arxiv.org/abs/2111.08276
Code: https://github.com/zengyan-97/X-VLM
For example, because X-VLM learns multi-granularity vision-language alignment, it can generate more accurate captions that describe the objects in an image and the relationships between them. This capability has been applied to a ByteDance public-welfare project. Mr. Zhao, who is visually impaired, often uses Toutiao to follow current affairs and news, and he has long had one wish: "I hope to 'see' all the information content, just like everyone else." More than two-thirds of Toutiao articles contain images. To help visually impaired users read images, the Toutiao app recently applied X-VLM's generation capability to automatically recognize images and describe them.

To let them "see" every picture, we made a small improvement
In addition, X-VLM's understanding and generation capabilities are also used in the automatic homework-correction feature of the Dali smart learning lamp. The figure below shows a phrase-completion question and the results predicted by the model:

The smart learning lamp's automatic correction feature has been widely praised by parents, and this capability is still being continuously optimized.

Research background

Existing multimodal pre-trained models can be roughly divided into two categories:
1) Methods that rely on an object detector to extract object-based features (for example: vehicles, people, trees, backpacks) to represent an image. These methods can learn object-level vision-language alignment, as shown in Figure 1(a). They either directly use a pre-trained object detector or incorporate the object detection process into multimodal pre-training;
2) Methods that use a ResNet or Vision Transformer to encode the whole image and only learn the alignment between the image and the text, as shown in Figure 1(b).
Both approaches have problems. First, detection-based methods recognize all possible objects in an image, some of which are irrelevant to the paired text. In addition, the object-centric visual features extracted this way may lose the information between objects (which can be regarded as a kind of contextual information). Moreover, such methods can only recognize a limited set of objects, and it is difficult to predefine appropriate object categories. The second approach is simple and direct, but it struggles to learn fine-grained vision-language alignment, such as object-level alignment. Previous work has confirmed that this fine-grained alignment is very helpful for tasks such as visual reasoning and visual grounding.
In fact, for multimodal pre-training, the following public data are available to the model: 1) images and image captions; 2) region annotations, for example, the text "man crossing the street" in Figure 1 is associated with a specific region of the image (previous work, however, roughly aligned region annotations with the whole image); 3) object labels, such as "backpack", which previous work used to train object detectors.
Different from previous practice, the authors propose X-VLM, which uses the above data in a unified way to efficiently learn multi-granularity vision-language alignment. It avoids the costly object detection process and is not limited to learning only image-level or object-level alignment. Specifically, the authors propose using the patch embeddings of the Vision Transformer to flexibly represent visual concepts of various granularities, as shown in Figure 1(c): for example, the visual concept "backpack" consists of 2 patches, while the visual concept "man crossing the street" consists of more patches.
Therefore, the key to how X-VLM learns multi-granularity vision-language alignment is:
1) Using patch embeddings to flexibly represent visual concepts of various granularities, then directly pulling together visual concepts of different granularities and their corresponding texts. This process is optimized with the commonly used contrastive learning loss, matching loss, and MLM loss;
2) Furthermore, within the same image, given different texts, the model is required to predict the coordinates of the visual concept at the corresponding granularity, optimized with a bounding-box coordinate regression loss and an intersection-over-union loss. Experiments show that this pre-training method is very efficient: the model does not need to be very large and the pre-training data does not need to be massive, yet X-VLM performs excellently on downstream multimodal understanding and generation tasks.
Method

X-VLM consists of an image encoder, a text encoder, and a cross-modal encoder.
The left side of Figure 2 shows the encoding process for a visual concept (which can be an object, a region, or the whole image): the image encoder, based on the Vision Transformer, splits the input image into patches and encodes them. Then, given any bounding box, the global representation of the region is obtained flexibly by averaging the representations of all patches inside the box. This global representation and the in-box patch representations, arranged in their original order, form the sequence that represents the visual concept corresponding to the bounding box. In this way, the encodings of the image itself (I) and of the visual concepts within it (V1, V2, V3) are obtained. The texts corresponding to the visual concepts, such as image captions, region descriptions, or object labels, are encoded one by one by the text encoder.
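The aggregation step described above can be sketched roughly as follows (a minimal PyTorch illustration; the tensor layout, function name, and patch-grid coordinates are assumptions for exposition, not the official implementation):

```python
import torch

def concept_representation(patch_embeds, box, grid_size):
    """Represent a visual concept (object / region / whole image) from ViT patch embeddings.

    patch_embeds: (grid_size * grid_size, D) patch embeddings in row-major order.
    box:          (x1, y1, x2, y2) bounding box in patch-grid coordinates, inclusive.
    Returns a (M + 1, D) sequence: the averaged "global" token of the region,
    followed by the in-box patch embeddings in their original order.
    """
    d = patch_embeds.size(-1)
    grid = patch_embeds.view(grid_size, grid_size, d)
    x1, y1, x2, y2 = box
    in_box = grid[y1:y2 + 1, x1:x2 + 1, :].reshape(-1, d)   # patches covered by the box
    global_token = in_box.mean(dim=0, keepdim=True)          # region-level summary (average)
    return torch.cat([global_token, in_box], dim=0)

# Example: a "backpack" covering 2 patches vs. the whole image on a 16x16 patch grid.
patches = torch.randn(16 * 16, 768)
backpack = concept_representation(patches, (3, 5, 4, 5), 16)       # -> (3, 768)
whole_image = concept_representation(patches, (0, 0, 15, 15), 16)  # -> (257, 768)
```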
X-VLM adopts a common model structure; the difference lies in the pre-training method. The authors optimize with the following two kinds of losses:
First, within the same image, given different texts, for example T (text), T1 (text1), T2 (text2), T3 (text3), the model is required to predict the bounding box of the corresponding visual concept in the image:
$$\hat{b}(V, T) = \mathrm{Sigmoid}\big(\mathrm{FFN}(x_{\mathrm{cls}})\big)$$

$$\mathcal{L}_{\mathrm{bbox}} = \mathcal{L}_{\mathrm{GIoU}}\big(b, \hat{b}(V, T)\big) + \big\lVert b - \hat{b}(V, T) \big\rVert_{1}$$
Here $x_{\mathrm{cls}}$ is the output vector of the cross-modal encoder at the [CLS] position, and the Sigmoid function normalizes the predicted bounding box. The ground truth $b$ consists, in order, of the normalized center x-coordinate, center y-coordinate, width, and height. This loss is the sum of the bounding-box coordinate regression loss (L1) and the generalized intersection-over-union loss (GIoU). The authors argue that requiring the model to predict the corresponding visual concept for different texts within the same image lets it learn multi-granularity vision-language alignment more effectively. This is also the first time such a loss has been used in multimodal pre-training.
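A hedged sketch of this objective (the FFN head, variable names, and the use of torchvision's box utilities are illustrative assumptions; the description above only specifies an L1 term plus a GIoU term on normalized (cx, cy, w, h) boxes):

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_convert, generalized_box_iou

def bbox_loss(cls_output, target_cxcywh, ffn):
    """Bounding-box prediction loss: L1 regression plus GIoU.

    cls_output:     (B, D) [CLS] vectors from the cross-modal encoder.
    target_cxcywh:  (B, 4) ground-truth boxes, normalized (cx, cy, w, h) in [0, 1].
    ffn:            small head mapping D -> 4.
    """
    pred = torch.sigmoid(ffn(cls_output))                        # normalized (cx, cy, w, h)
    l1 = F.l1_loss(pred, target_cxcywh, reduction="mean")
    giou = generalized_box_iou(box_convert(pred, "cxcywh", "xyxy"),
                               box_convert(target_cxcywh, "cxcywh", "xyxy"))
    return l1 + (1.0 - torch.diag(giou)).mean()                  # GIoU loss = 1 - GIoU
```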
Second, patch embeddings are used to flexibly represent visual concepts of various granularities, and the model is directly optimized to pull together texts and visual concepts of different granularities, covering object-text, region-text, and image-text alignment. The authors use the three losses commonly employed in multimodal pre-training, which are, in order:
1) Contrastive learning loss:
$$\mathcal{L}_{\mathrm{cl}} = \tfrac{1}{2}\big[ H(y^{v2t}, p^{v2t}) + H(y^{t2v}, p^{t2v}) \big]$$
Here $H$ denotes cross-entropy. $y^{v2t}, y^{t2v} \in \mathbb{R}^{bsz \times bsz}$ are the ground-truth similarities: diagonal entries are 1 and all other entries are 0. $p^{v2t}, p^{t2v} \in \mathbb{R}^{bsz \times bsz}$ are the similarities computed by the model from the outputs of the text encoder and the image encoder.
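A minimal in-batch version of this loss might look like the following sketch (feature shapes and the temperature value are assumptions; the diagonal of the similarity matrix corresponds to the matched pairs, as described above):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(visual_feats, text_feats, temperature=0.07):
    """In-batch vision-text contrastive loss; the i-th visual concept matches the i-th text.

    visual_feats, text_feats: (bsz, D) features from the image and text encoders.
    """
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits_v2t = v @ t.T / temperature                   # (bsz, bsz) similarity matrix
    logits_t2v = t @ v.T / temperature
    targets = torch.arange(v.size(0), device=v.device)   # ground truth: diagonal entries
    return 0.5 * (F.cross_entropy(logits_v2t, targets) +
                  F.cross_entropy(logits_t2v, targets))
```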
2) Matching loss:
$$\mathcal{L}_{\mathrm{match}} = H\big(y^{\mathrm{match}}, p^{\mathrm{match}}\big)$$
$p^{\mathrm{match}}$ is computed from the cross-modal encoder output and predicts whether a given (visual concept, text) pair matches (in other words, a 0/1 classification). For each positive pair, the authors sample one negative pair.
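The matching loss can be sketched as a binary classification on the cross-modal [CLS] output (the head name and the way negatives are passed in are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def matching_loss(match_head, cls_pos, cls_neg):
    """0/1 matching loss over (visual concept, text) pairs.

    match_head: linear head mapping D -> 2 logits.
    cls_pos:    (B, D) cross-modal [CLS] vectors for matched pairs.
    cls_neg:    (B, D) cross-modal [CLS] vectors for sampled negative pairs.
    """
    logits = match_head(torch.cat([cls_pos, cls_neg], dim=0))        # (2B, 2)
    labels = torch.cat([torch.ones(cls_pos.size(0), dtype=torch.long),
                        torch.zeros(cls_neg.size(0), dtype=torch.long)]).to(logits.device)
    return F.cross_entropy(logits, labels)
```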
3) Masked Language Modeling loss:
$$\mathcal{L}_{\mathrm{mlm}} = H\big(y_{j}, p_{j}(V, \hat{T})\big)$$
Some words in $\hat{T}$ have been randomly replaced with [MASK]; $p_{j}(V, \hat{T})$ is the probability distribution over the vocabulary computed by the cross-modal encoder from its output vector at the position of word $t_j$.
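This is the standard masked-token cross-entropy; a brief sketch (assuming the common convention of marking unmasked positions with -100 so they are ignored):

```python
import torch.nn.functional as F

def mlm_loss(token_logits, labels):
    """Masked language modeling loss on the cross-modal encoder outputs.

    token_logits: (B, L, vocab_size) per-token distributions over the vocabulary.
    labels:       (B, L) original token ids at [MASK]ed positions, -100 elsewhere.
    """
    return F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```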
Experiments
The authors conduct experiments with medium-sized models on the 4M and 16M image datasets commonly used in multimodal pre-training, as shown in the table below:

Here, the annotation count (# Ann) is the sum of region annotations and object annotations. It can be seen that some datasets have no image captions, such as Visual Genome (VG), while some datasets have no region or object annotations, such as CC-3M/12M.

Table 2 shows the performance on image-text retrieval tasks (MSCOCO and Flickr30K). Even though previous methods were pre-trained on larger amounts of in-house data or use larger models, X-VLM trained on the 4M image dataset can already surpass them.

Table 3 shows model performance on visual question answering and reasoning (VQA2.0 and NLVR2), visual grounding (RefCOCO+), and image caption generation (COCO Caption). For a fair comparison, X-VLM follows the fine-tuning methods of previous work without additional adjustments. Combining Tables 2 and 3, it can be seen that, compared with previous methods, X-VLM supports more kinds of downstream tasks and achieves very good performance on these common vision-language tasks.
Summary and discussion
In this paper, the authors propose X-VLM to learn multi-granularity vision-language alignment, which avoids the costly object detection process and is not limited to learning only image-level or object-level alignment. The keys to X-VLM's success are:
1) flexibly representing visual concepts of various granularities based on patch embeddings, then directly pulling together visual concepts of different granularities and their corresponding texts;
2) furthermore, within the same image, given different texts, requiring the model to predict the coordinates of the corresponding visual concepts. Experiments show that this pre-training method is very efficient.
In the experiments, the authors use the commonly used 4M and 16M data to train X-VLM with a total of 216M parameters; it surpasses larger-scale models and models pre-trained on huge amounts of data, performing excellently on downstream multimodal understanding and generation tasks. Moreover, ByteDance engineers have applied X-VLM to real business scenarios, for example, describing image content for visually impaired users and automatically correcting pupils' homework. In fact, X-VLM is also very good at fine-grained retrieval, visual grounding, and similar tasks.
At present, the code of X-VLM is open source; you are welcome to fine-tune it on your own tasks and try it out.

* This article is published with authorization from QbitAI; the views expressed belong solely to the author.
— End —
