
ICML 2022 | ByteDance AI Lab proposes X-VLM, a multimodal model that learns multi-granularity vision-language alignment

2022-06-25 03:41:00 QbitAI

Foreword

Vision-language pre-training improves the performance of many downstream vision-language tasks, such as image-text retrieval and image-based question answering or reasoning. Some readers have asked: beyond pushing up benchmark numbers on open academic tasks with larger models, more data, and training tricks, what practical uses does a multimodal pre-trained model have?


To answer this, the ByteDance AI Lab research team proposed X-VLM, which for the first time proposes learning multi-granularity alignment between vision and language. Experiments show that this pre-training approach is very efficient: the model does not need to be particularly large, and the pre-training data does not need to be massive. With only 216M parameters, X-VLM achieves excellent performance on a wide range of multimodal tasks, such as image-text retrieval, image-based question answering and reasoning, visual grounding, and image captioning. In ByteDance's real application scenarios, X-VLM already surpasses many models commonly used in industry; it has been deployed online and serves products such as Toutiao (Today's Headlines). The paper has been accepted by ICML 2022.


Paper: https://arxiv.org/abs/2111.08276
Code: https://github.com/zengyan-97/X-VLM

For example, because X-VLM learns multi-granularity vision-language alignment, it can generate more accurate captions that describe the objects in an image and the relations between them. This capability has been applied in a ByteDance accessibility project. Mr. Zhao, who is visually impaired, often uses Toutiao to follow current affairs and news, and has long hoped to "see" all information content just like everyone else. More than two-thirds of Toutiao articles contain images. To help visually impaired users read those images, the Toutiao app recently adopted X-VLM's generation capability to automatically recognize images and describe them.


"To let them 'see' every picture, we made a small improvement."

In addition, X-VLM's understanding and generation abilities are also used in the automatic grading feature of the Dali smart learning lamp. The figure below shows a phrase-completion question together with the answer predicted by the model:

[Figure: a phrase-completion question and the model's predicted answer]

The Dali smart learning lamp's automatic grading feature has been widely praised by parents, and this capability is still being continuously improved.


Research background

[Figure 1: (a) methods based on object detectors; (b) methods that encode the whole image; (c) X-VLM, which represents visual concepts of different granularities with patch embeddings]

Existing multimodal pre-trained models can be roughly divided into two categories:

1) Methods that rely on an object detector to extract object-centric features (e.g., vehicle, person, tree, backpack) to represent an image. These methods can learn object-level vision-language alignment, as shown in Figure 1(a). They either use a pre-trained object detector directly, or fold the detection process into multimodal pre-training;

2) Methods that encode the whole image with a ResNet or a Vision Transformer and only learn alignment between the image and the text, as shown in Figure 1(b).

Both approaches have problems. First, detection-based methods recognize all possible objects in an image, some of which are irrelevant to the paired text. Moreover, the object-centric visual features they extract may lose the information between objects (which can be regarded as contextual information). In addition, these methods can only recognize a limited set of objects, and it is hard to predefine a suitable set of object categories. The second approach is simple and direct, but it struggles to learn fine-grained vision-language alignment, such as object-level alignment. Previous work has confirmed that this fine-grained alignment is very helpful for tasks such as visual reasoning and visual grounding.

In fact, for multimodal pre-training, the following public data are available to the model: 1) images and their captions; 2) region annotations, e.g., in Figure 1 the text "man crossing the street" is associated with a specific region of the image (previous work, however, simply aligned region annotations with the whole image); 3) object labels, e.g., "backpack", which previous work used to train object detectors.
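To make these three kinds of supervision concrete, here is a hypothetical training sample that combines them; the field names, file name, and coordinate values are purely illustrative assumptions, not the released data format:

```python
# A hypothetical sample mixing the three sources of supervision (illustrative only).
sample = {
    "image": "street_scene.jpg",                                    # the raw image
    "caption": "A man with a backpack is crossing a busy street.",  # image-level text
    "regions": [  # region annotations: text paired with a normalized (cx, cy, w, h) box
        {"text": "man crossing the street", "box": [0.45, 0.55, 0.30, 0.70]},
    ],
    "objects": [  # object labels, also paired with boxes
        {"label": "backpack", "box": [0.52, 0.40, 0.10, 0.15]},
    ],
}
```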

Unlike previous approaches, in this paper the authors propose X-VLM, which uses all of the above data in a unified way to efficiently learn multi-granularity vision-language alignment. It avoids the expensive object detection process and is not limited to learning image-level or object-level alignment. Concretely, the authors propose using the Vision Transformer's patch embeddings to flexibly represent visual concepts of various granularities, as shown in Figure 1(c): for example, the visual concept "backpack" consists of 2 patches, while the visual concept "man crossing the street" consists of more patches.

Therefore, the key to how X-VLM learns multi-granularity vision-language alignment is:

1) using patch embeddings to flexibly represent visual concepts of various granularities, then directly pulling together visual concepts of different granularities and their corresponding texts; this is optimized with the commonly used contrastive loss, matching loss, and MLM loss;

2) furthermore, within the same image and given different texts, requiring the model to predict the bounding box of the visual concept at the corresponding granularity, optimized with a box-coordinate regression loss and an intersection-over-union loss. Experiments show that this pre-training approach is very efficient: the model does not need to be particularly large, nor does it need massive pre-training data, and X-VLM performs excellently on downstream multimodal understanding/generation tasks.

Method

[Figure 2: the X-VLM architecture; the left side shows how visual concepts (an object, a region, or the whole image) are encoded]

X-VLM consists of an image encoder, a text encoder, and a cross-modal encoder.

The left side of Figure 2 shows how a visual concept (which may be an object, a region, or the whole image) is encoded. The image encoder, based on a Vision Transformer, splits the input image into patches and encodes them. Then, given any bounding box, the representations of all patches inside the box are averaged to obtain a global representation of that region. This global representation is placed in front of all the in-box patch representations, arranged in their original order, to form the representation of the visual concept corresponding to the bounding box. In this way the model obtains encodings of the image itself (I) and of the visual concepts within it (V1, V2, V3). The texts corresponding to visual concepts, such as image captions, region descriptions, or object labels, are each encoded by the text encoder.
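To make this encoding step concrete, here is a minimal PyTorch sketch of how a bounding box could be turned into a visual-concept representation from ViT patch embeddings, i.e., the average of the in-box patch vectors followed by those vectors in their original order. The grid size, tensor shapes, and the helper name concept_representation are illustrative assumptions, not the released X-VLM code.

```python
import torch

def concept_representation(patch_embeds, box, grid=14):
    """patch_embeds: (grid*grid, dim) ViT patch embeddings of one image, row-major.
    box: (x1, y1, x2, y2) in normalized [0, 1] image coordinates.
    Returns (1 + n_in_box, dim): the averaged region vector followed by the
    in-box patch embeddings in their original order."""
    x1, y1, x2, y2 = box
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    cx = (xs.float() + 0.5) / grid          # patch-center x coordinates
    cy = (ys.float() + 0.5) / grid          # patch-center y coordinates
    in_box = ((cx >= x1) & (cx <= x2) & (cy >= y1) & (cy <= y2)).flatten()
    selected = patch_embeds[in_box]                  # patches inside the box
    region_vec = selected.mean(dim=0, keepdim=True)  # average = global region vector
    return torch.cat([region_vec, selected], dim=0)

# Usage: the whole image is simply the box (0, 0, 1, 1).
patch_embeds = torch.randn(14 * 14, 768)
region = concept_representation(patch_embeds, (0.3, 0.2, 0.7, 0.9))
image_repr = concept_representation(patch_embeds, (0.0, 0.0, 1.0, 1.0))
```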

X-VLM adopts a common model structure; the difference lies in how it is pre-trained. The authors optimize the model with the following two kinds of losses.

First, within the same image and given different texts, e.g., T (text), T1 (text1), T2 (text2), T3 (text3), the model is asked to predict the bounding box of the corresponding visual concept in the image:

b_hat^j = Sigmoid(MLP(x^j_cls))

L_bbox = ||b^j − b_hat^j||_1 + L_GIoU(b^j, b_hat^j)

x^j_cls is the output vector at the [CLS] position of the cross-modal encoder, and the Sigmoid function normalizes the predicted bounding box. The ground-truth box b^j = (cx, cy, w, h) consists of the normalized center x-coordinate, center y-coordinate, width, and height. The loss is then the sum of the box-coordinate regression loss (L1) and the intersection-over-union loss (GIoU). The authors argue that asking the model to predict the corresponding visual concept for different texts within the same image lets it learn multi-granularity vision-language alignment more effectively. This loss is also used in multimodal pre-training for the first time.
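The following is a minimal PyTorch sketch of a bounding-box prediction loss of this form, assuming a two-layer MLP head and boxes in normalized (cx, cy, w, h) format; the class and helper names (BoxHead, giou_loss, bbox_loss) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

def box_cxcywh_to_xyxy(b):
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

def giou_loss(pred, target):
    """Generalized IoU loss for per-row box pairs in (x1, y1, x2, y2) format."""
    inter_lt = torch.max(pred[:, :2], target[:, :2])
    inter_rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (inter_rb - inter_lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).clamp(min=0).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).clamp(min=0).prod(dim=1)
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-6)
    hull = (torch.max(pred[:, 2:], target[:, 2:]) -
            torch.min(pred[:, :2], target[:, :2])).clamp(min=0).prod(dim=1)
    giou = iou - (hull - union) / hull.clamp(min=1e-6)
    return (1 - giou).mean()

class BoxHead(nn.Module):
    """Predicts a normalized (cx, cy, w, h) box from the cross-modal [CLS] vector."""
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, x_cls):
        return self.mlp(x_cls).sigmoid()  # Sigmoid keeps coordinates within [0, 1]

def bbox_loss(pred_box, gt_box):
    l1 = (pred_box - gt_box).abs().sum(dim=-1).mean()
    return l1 + giou_loss(box_cxcywh_to_xyxy(pred_box), box_cxcywh_to_xyxy(gt_box))

# usage: one (visual concept, text) pair per row of the batch
head = BoxHead()
pred = head(torch.randn(8, 768))          # hypothetical [CLS] vectors
loss = bbox_loss(pred, torch.rand(8, 4))  # random ground-truth boxes, shapes only
```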

Second, patch embeddings are used to flexibly represent visual concepts of various granularities, and the model is directly optimized to pull together texts and visual concepts of different granularities, covering object-, region-, and image-level alignment with text. The authors use three losses that are common in multimodal pre-training, namely:

1) Contrastive learning loss:

L_cl = 1/2 · [ CE(y^v2t, p^v2t) + CE(y^t2v, p^t2v) ], where CE denotes cross-entropy.

y^v2t, y^t2v ∈ R^(bsz×bsz) are the ground-truth similarities, with 1 on the diagonal and 0 elsewhere.

p^v2t, p^t2v ∈ R^(bsz×bsz) are the similarities computed by the model from the outputs of the text encoder and the image encoder.
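A minimal sketch of an in-batch contrastive loss with this form is shown below; the temperature value and the L2 normalization of the features are illustrative choices rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(visual_feats, text_feats, temperature=0.07):
    """In-batch contrastive loss; row i of each tensor is an aligned (concept, text) pair."""
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits_v2t = v @ t.t() / temperature                 # (bsz, bsz) similarity scores
    logits_t2v = logits_v2t.t()
    targets = torch.arange(v.size(0), device=v.device)   # diagonal = matching pairs
    return 0.5 * (F.cross_entropy(logits_v2t, targets) +
                  F.cross_entropy(logits_t2v, targets))
```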

2) Matching loss:

L_match = CE(y^match, p^match)

p^match is computed from the cross-modal encoder's output and predicts whether a given (visual concept, text) pair matches (in other words, a 0/1 classification). For each positive example, the authors sample a pair of negative examples.
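Below is a hedged sketch of such a matching loss with one similarity-weighted hard negative text per positive pair; cross_encoder is a hypothetical callable that returns two-class match logits, and the hard-negative sampling strategy shown here is an assumption rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def matching_loss(cross_encoder, concept_feats, text_feats, sim):
    """For each positive (visual concept, text) pair (row i of each tensor),
    sample one hard negative text weighted by the similarity matrix `sim`,
    then classify pairs as match (1) / non-match (0)."""
    bsz = sim.size(0)
    weights = F.softmax(sim, dim=1).clone()
    weights.fill_diagonal_(0)                            # never pick the positive itself
    neg_idx = torch.multinomial(weights, 1).squeeze(1)   # one hard negative per row
    pos_logits = cross_encoder(concept_feats, text_feats)            # (bsz, 2)
    neg_logits = cross_encoder(concept_feats, text_feats[neg_idx])   # (bsz, 2)
    logits = torch.cat([pos_logits, neg_logits], dim=0)
    labels = torch.cat([torch.ones(bsz), torch.zeros(bsz)]).long().to(logits.device)
    return F.cross_entropy(logits, labels)
```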

3) Masked Language Modeling loss:

L_mlm = CE(y_j, p_j(V, T̂)), averaged over the masked positions j.

Some words in T̂ have been randomly replaced with [MASK]; p_j(V, T̂) is the probability distribution over the vocabulary computed by the cross-modal encoder from its output vector at the position of word t_j.
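A minimal sketch of the MLM objective follows: a fraction of the tokens is replaced with [MASK] and only those positions contribute to the cross-entropy. The 15% masking ratio and the hypothetical cross_encoder signature are assumptions.

```python
import torch
import torch.nn.functional as F

def mlm_loss(cross_encoder, visual_feats, token_ids, mask_token_id, vocab_size, p=0.15):
    """Randomly replace a fraction of tokens with [MASK] and ask the cross-modal
    encoder to recover them, conditioned on the visual concept."""
    masked = token_ids.clone()
    mask = torch.rand_like(token_ids, dtype=torch.float) < p
    masked[mask] = mask_token_id
    logits = cross_encoder(visual_feats, masked)          # (bsz, len, vocab_size)
    labels = token_ids.masked_fill(~mask, -100)           # only masked positions count
    return F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```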

Experiments

The authors experiment with the 4M and 16M image datasets commonly used for medium-scale multimodal pre-training, as shown in the table below:

[Table 1: statistics of the pre-training datasets]

Here, the annotation count (# Ann) is the total number of region annotations and object labels. As the table shows, some datasets have no image captions, e.g., Visual Genome (VG), while others have no annotations, e.g., CC-3M/12M.

[Table 2: image-text retrieval results on MSCOCO and Flickr30K]

Table 2 shows performance on image-text retrieval (MSCOCO and Flickr30K). Even though previous methods were pre-trained on larger amounts of in-house data or use larger models, X-VLM trained on the 4M image dataset can already surpass them.

[Table 3: results on visual reasoning (VQA 2.0, NLVR2), visual grounding (RefCOCO+), and image captioning (COCO Caption)]

Table 3 shows model performance on visual reasoning (VQA 2.0 and NLVR2), visual grounding (RefCOCO+), and image captioning (COCO Caption). For a fair comparison, X-VLM follows the fine-tuning procedures of previous work, with no additional tuning. Combining Tables 2 and 3, we can see that, compared with previous methods, X-VLM supports more kinds of downstream tasks and achieves very good performance on these common vision-language tasks.

Summary and discussion

In this paper, the authors propose X-VLM to learn multi-granularity vision-language alignment. It avoids the expensive object detection process and is not limited to learning image-level or object-level alignment. The keys to X-VLM's success are:

1) flexibly representing visual concepts of various granularities with patch embeddings, then directly pulling together visual concepts of different granularities and their corresponding texts;

2) furthermore, within the same image and given different texts, requiring the model to predict the coordinates of the corresponding visual concepts. Experiments show that this pre-training approach is very efficient.

In the experiments, the authors use the commonly used 4M and 16M datasets to train X-VLM with a total of 216M parameters. It surpasses larger models as well as models pre-trained on massive data, performing excellently on downstream multimodal understanding/generation tasks. Moreover, ByteDance engineers have applied X-VLM to real business scenarios, for example, describing image content for visually impaired users and automatically grading primary school students' homework. In fact, X-VLM is also very good at tasks such as fine-grained retrieval and visual grounding.

The X-VLM code is now open source; you are welcome to fine-tune it on your own tasks.


* This article is published with authorization from QbitAI; the views expressed are solely the author's.

—  End  —
