
ICML 2022 | ByteDance AI Lab proposes X-VLM, a multimodal model that learns multi-granularity vision-language alignment

2022-06-25 03:41:00 QbitAI

Foreword

Vision-language pre-training improves the performance of many downstream vision-language tasks, such as image-text retrieval and image-based question answering or reasoning. Some readers have asked: beyond pushing up benchmark numbers on open academic tasks with larger models, more data, and training tricks, what practical uses does a multimodal pre-trained model have?


To answer this, the ByteDance AI Lab research team proposed X-VLM, which for the first time proposes learning multi-granularity alignment between vision and language. Experiments show that this pre-training approach is very efficient: the model does not need to be particularly large, and the pre-training data does not need to be massive. With only 216M parameters, X-VLM achieves excellent performance on a wide range of multimodal tasks, such as image-text retrieval, image-based question answering and reasoning, visual grounding, and image captioning. In ByteDance's real application scenarios, X-VLM already surpasses many models commonly used in industry; it has been deployed online and serves products such as Toutiao (Today's Headlines). The paper has been accepted by ICML 2022.


Paper: https://arxiv.org/abs/2111.08276
Code: https://github.com/zengyan-97/X-VLM

For example, because X-VLM learns multi-granularity vision-language alignment, it can generate more accurate captions that describe the objects in an image and the relations between them. This capability has been applied in a ByteDance accessibility project. Mr. Zhao, who is visually impaired, often uses Toutiao to follow current affairs and news, and has long hoped to "see" all information content just like everyone else. More than two-thirds of Toutiao articles contain images. To help visually impaired users read those images, the Toutiao app recently adopted X-VLM's generation capability to automatically recognize images and describe them.


"To let them 'see' every picture, we made a small improvement."

In addition, X-VLM's understanding and generation abilities are also used in the automatic grading feature of the Dali smart learning lamp. The figure below shows a phrase-completion question together with the answer predicted by the model:

[Figure: a phrase-completion question and the model's predicted answer]

The Dali smart learning lamp's automatic grading feature has been widely praised by parents, and this capability is still being continuously improved.


Research background

[Figure 1: (a) methods based on object detectors; (b) methods that encode the whole image; (c) X-VLM, which represents visual concepts of different granularities with patch embeddings]

Existing multimodal pre-trained models can be roughly divided into two categories:

1) Methods that rely on an object detector to extract object-centric features (e.g., vehicle, person, tree, backpack) to represent an image. These methods can learn object-level vision-language alignment, as shown in Figure 1(a). They either use a pre-trained object detector directly, or fold the detection process into multimodal pre-training;

2) Methods that encode the whole image with a ResNet or a Vision Transformer and only learn alignment between the image and the text, as shown in Figure 1(b).

Both approaches have problems. First, detection-based methods recognize all possible objects in an image, some of which are irrelevant to the paired text. Moreover, the object-centric visual features they extract may lose the information between objects (which can be regarded as contextual information). In addition, these methods can only recognize a limited set of objects, and it is hard to predefine a suitable set of object categories. The second approach is simple and direct, but it struggles to learn fine-grained vision-language alignment, such as object-level alignment. Previous work has confirmed that this fine-grained alignment is very helpful for tasks such as visual reasoning and visual grounding.

In fact, for multimodal pre-training, the following public data are available to the model: 1) images and their captions; 2) region annotations, e.g., in Figure 1 the text "man crossing the street" is associated with a specific region of the image (previous work, however, simply aligned region annotations with the whole image); 3) object labels, e.g., "backpack", which previous work used to train object detectors.
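To make these three kinds of supervision concrete, here is a hypothetical training sample that combines them; the field names, file name, and coordinate values are purely illustrative assumptions, not the released data format:

```python
# A hypothetical sample mixing the three sources of supervision (illustrative only).
sample = {
    "image": "street_scene.jpg",                                    # the raw image
    "caption": "A man with a backpack is crossing a busy street.",  # image-level text
    "regions": [  # region annotations: text paired with a normalized (cx, cy, w, h) box
        {"text": "man crossing the street", "box": [0.45, 0.55, 0.30, 0.70]},
    ],
    "objects": [  # object labels, also paired with boxes
        {"label": "backpack", "box": [0.52, 0.40, 0.10, 0.15]},
    ],
}
```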

Unlike previous approaches, in this paper the authors propose X-VLM, which uses all of the above data in a unified way to efficiently learn multi-granularity vision-language alignment. It avoids the expensive object detection process and is not limited to learning image-level or object-level alignment. Concretely, the authors propose using the Vision Transformer's patch embeddings to flexibly represent visual concepts of various granularities, as shown in Figure 1(c): for example, the visual concept "backpack" consists of 2 patches, while the visual concept "man crossing the street" consists of more patches.

Therefore, the key to how X-VLM learns multi-granularity vision-language alignment is:

1) using patch embeddings to flexibly represent visual concepts of various granularities, then directly pulling together visual concepts of different granularities and their corresponding texts; this is optimized with the commonly used contrastive loss, matching loss, and MLM loss;

2) furthermore, within the same image and given different texts, requiring the model to predict the bounding box of the visual concept at the corresponding granularity, optimized with a box-coordinate regression loss and an intersection-over-union loss. Experiments show that this pre-training approach is very efficient: the model does not need to be particularly large, nor does it need massive pre-training data, and X-VLM performs excellently on downstream multimodal understanding/generation tasks.

Method

[Figure 2: the X-VLM architecture; the left side shows how visual concepts (an object, a region, or the whole image) are encoded]

X-VLM consists of an image encoder, a text encoder, and a cross-modal encoder.

The left side of Figure 2 shows how a visual concept (which may be an object, a region, or the whole image) is encoded. The image encoder, based on a Vision Transformer, splits the input image into patches and encodes them. Then, given any bounding box, the representations of all patches inside the box are averaged to obtain a global representation of that region. This global representation is placed in front of all the in-box patch representations, arranged in their original order, to form the representation of the visual concept corresponding to the bounding box. In this way the model obtains encodings of the image itself (I) and of the visual concepts within it (V1, V2, V3). The texts corresponding to visual concepts, such as image captions, region descriptions, or object labels, are each encoded by the text encoder.
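To make this encoding step concrete, here is a minimal PyTorch sketch of how a bounding box could be turned into a visual-concept representation from ViT patch embeddings, i.e., the average of the in-box patch vectors followed by those vectors in their original order. The grid size, tensor shapes, and the helper name concept_representation are illustrative assumptions, not the released X-VLM code.

```python
import torch

def concept_representation(patch_embeds, box, grid=14):
    """patch_embeds: (grid*grid, dim) ViT patch embeddings of one image, row-major.
    box: (x1, y1, x2, y2) in normalized [0, 1] image coordinates.
    Returns (1 + n_in_box, dim): the averaged region vector followed by the
    in-box patch embeddings in their original order."""
    x1, y1, x2, y2 = box
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    cx = (xs.float() + 0.5) / grid          # patch-center x coordinates
    cy = (ys.float() + 0.5) / grid          # patch-center y coordinates
    in_box = ((cx >= x1) & (cx <= x2) & (cy >= y1) & (cy <= y2)).flatten()
    selected = patch_embeds[in_box]                  # patches inside the box
    region_vec = selected.mean(dim=0, keepdim=True)  # average = global region vector
    return torch.cat([region_vec, selected], dim=0)

# Usage: the whole image is simply the box (0, 0, 1, 1).
patch_embeds = torch.randn(14 * 14, 768)
region = concept_representation(patch_embeds, (0.3, 0.2, 0.7, 0.9))
image_repr = concept_representation(patch_embeds, (0.0, 0.0, 1.0, 1.0))
```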

X-VLM adopts a common model structure; the difference lies in how it is pre-trained. The authors optimize the model with the following two kinds of losses.

First, within the same image and given different texts, e.g., T (text), T1 (text1), T2 (text2), T3 (text3), the model is asked to predict the bounding box of the corresponding visual concept in the image:

b_hat^j = Sigmoid(MLP(x^j_cls))

L_bbox = ||b^j − b_hat^j||_1 + L_GIoU(b^j, b_hat^j)

x^j_cls is the output vector at the [CLS] position of the cross-modal encoder, and the Sigmoid function normalizes the predicted bounding box. The ground-truth box b^j = (cx, cy, w, h) consists of the normalized center x-coordinate, center y-coordinate, width, and height. The loss is then the sum of the box-coordinate regression loss (L1) and the intersection-over-union loss (GIoU). The authors argue that asking the model to predict the corresponding visual concept for different texts within the same image lets it learn multi-granularity vision-language alignment more effectively. This loss is also used in multimodal pre-training for the first time.
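The following is a minimal PyTorch sketch of a bounding-box prediction loss of this form, assuming a two-layer MLP head and boxes in normalized (cx, cy, w, h) format; the class and helper names (BoxHead, giou_loss, bbox_loss) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

def box_cxcywh_to_xyxy(b):
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

def giou_loss(pred, target):
    """Generalized IoU loss for per-row box pairs in (x1, y1, x2, y2) format."""
    inter_lt = torch.max(pred[:, :2], target[:, :2])
    inter_rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (inter_rb - inter_lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).clamp(min=0).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).clamp(min=0).prod(dim=1)
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-6)
    hull = (torch.max(pred[:, 2:], target[:, 2:]) -
            torch.min(pred[:, :2], target[:, :2])).clamp(min=0).prod(dim=1)
    giou = iou - (hull - union) / hull.clamp(min=1e-6)
    return (1 - giou).mean()

class BoxHead(nn.Module):
    """Predicts a normalized (cx, cy, w, h) box from the cross-modal [CLS] vector."""
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, x_cls):
        return self.mlp(x_cls).sigmoid()  # Sigmoid keeps coordinates within [0, 1]

def bbox_loss(pred_box, gt_box):
    l1 = (pred_box - gt_box).abs().sum(dim=-1).mean()
    return l1 + giou_loss(box_cxcywh_to_xyxy(pred_box), box_cxcywh_to_xyxy(gt_box))

# usage: one (visual concept, text) pair per row of the batch
head = BoxHead()
pred = head(torch.randn(8, 768))          # hypothetical [CLS] vectors
loss = bbox_loss(pred, torch.rand(8, 4))  # random ground-truth boxes, shapes only
```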

Second, patch embeddings are used to flexibly represent visual concepts of various granularities, and the model is directly optimized to pull together texts and visual concepts of different granularities, covering object-, region-, and image-level alignment with text. The authors use three losses that are common in multimodal pre-training, namely:

1) Contrastive learning loss:

L_cl = 1/2 · [ CE(y^v2t, p^v2t) + CE(y^t2v, p^t2v) ], where CE denotes cross-entropy.

y^v2t, y^t2v ∈ R^(bsz×bsz) are the ground-truth similarities, with 1 on the diagonal and 0 elsewhere.

p^v2t, p^t2v ∈ R^(bsz×bsz) are the similarities computed by the model from the outputs of the text encoder and the image encoder.
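A minimal sketch of an in-batch contrastive loss with this form is shown below; the temperature value and the L2 normalization of the features are illustrative choices rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(visual_feats, text_feats, temperature=0.07):
    """In-batch contrastive loss; row i of each tensor is an aligned (concept, text) pair."""
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits_v2t = v @ t.t() / temperature                 # (bsz, bsz) similarity scores
    logits_t2v = logits_v2t.t()
    targets = torch.arange(v.size(0), device=v.device)   # diagonal = matching pairs
    return 0.5 * (F.cross_entropy(logits_v2t, targets) +
                  F.cross_entropy(logits_t2v, targets))
```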

2) Matching loss:

L_match = CE(y^match, p^match)

p^match is computed from the cross-modal encoder's output and predicts whether a given (visual concept, text) pair matches (in other words, a 0/1 classification). For each positive example, the authors sample a pair of negative examples.
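Below is a hedged sketch of such a matching loss with one similarity-weighted hard negative text per positive pair; cross_encoder is a hypothetical callable that returns two-class match logits, and the hard-negative sampling strategy shown here is an assumption rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def matching_loss(cross_encoder, concept_feats, text_feats, sim):
    """For each positive (visual concept, text) pair (row i of each tensor),
    sample one hard negative text weighted by the similarity matrix `sim`,
    then classify pairs as match (1) / non-match (0)."""
    bsz = sim.size(0)
    weights = F.softmax(sim, dim=1).clone()
    weights.fill_diagonal_(0)                            # never pick the positive itself
    neg_idx = torch.multinomial(weights, 1).squeeze(1)   # one hard negative per row
    pos_logits = cross_encoder(concept_feats, text_feats)            # (bsz, 2)
    neg_logits = cross_encoder(concept_feats, text_feats[neg_idx])   # (bsz, 2)
    logits = torch.cat([pos_logits, neg_logits], dim=0)
    labels = torch.cat([torch.ones(bsz), torch.zeros(bsz)]).long().to(logits.device)
    return F.cross_entropy(logits, labels)
```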

3) Masked Language Modeling loss:

L_mlm = CE(y_j, p_j(V, T̂)), averaged over the masked positions j.

Some words in T̂ have been randomly replaced with [MASK]; p_j(V, T̂) is the probability distribution over the vocabulary computed by the cross-modal encoder from its output vector at the position of word t_j.
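A minimal sketch of the MLM objective follows: a fraction of the tokens is replaced with [MASK] and only those positions contribute to the cross-entropy. The 15% masking ratio and the hypothetical cross_encoder signature are assumptions.

```python
import torch
import torch.nn.functional as F

def mlm_loss(cross_encoder, visual_feats, token_ids, mask_token_id, vocab_size, p=0.15):
    """Randomly replace a fraction of tokens with [MASK] and ask the cross-modal
    encoder to recover them, conditioned on the visual concept."""
    masked = token_ids.clone()
    mask = torch.rand_like(token_ids, dtype=torch.float) < p
    masked[mask] = mask_token_id
    logits = cross_encoder(visual_feats, masked)          # (bsz, len, vocab_size)
    labels = token_ids.masked_fill(~mask, -100)           # only masked positions count
    return F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```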

Experiments

The authors experiment with the 4M and 16M image datasets commonly used for medium-scale multimodal pre-training, as shown in the table below:

[Table 1: statistics of the pre-training datasets]

Here, the annotation count (# Ann) is the total number of region annotations and object labels. As the table shows, some datasets have no image captions, e.g., Visual Genome (VG), while others have no annotations, e.g., CC-3M/12M.

[Table 2: image-text retrieval results on MSCOCO and Flickr30K]

Table 2 shows performance on image-text retrieval (MSCOCO and Flickr30K). Even though previous methods were pre-trained on larger amounts of in-house data or use larger models, X-VLM trained on the 4M image dataset can already surpass them.

[Table 3: results on visual reasoning (VQA 2.0, NLVR2), visual grounding (RefCOCO+), and image captioning (COCO Caption)]

Table 3 shows model performance on visual reasoning (VQA 2.0 and NLVR2), visual grounding (RefCOCO+), and image captioning (COCO Caption). For a fair comparison, X-VLM follows the fine-tuning procedures of previous work, with no additional tuning. Combining Tables 2 and 3, we can see that, compared with previous methods, X-VLM supports more kinds of downstream tasks and achieves very good performance on these common vision-language tasks.

Summary and discussion

In this paper, the authors propose X-VLM to learn multi-granularity vision-language alignment. It avoids the expensive object detection process and is not limited to learning image-level or object-level alignment. The keys to X-VLM's success are:

1) flexibly representing visual concepts of various granularities with patch embeddings, then directly pulling together visual concepts of different granularities and their corresponding texts;

2) furthermore, within the same image and given different texts, requiring the model to predict the coordinates of the corresponding visual concepts. Experiments show that this pre-training approach is very efficient.

In the experiments, the authors use the commonly used 4M and 16M datasets to train X-VLM with a total of 216M parameters. It surpasses larger models as well as models pre-trained on massive data, performing excellently on downstream multimodal understanding/generation tasks. Moreover, ByteDance engineers have applied X-VLM to real business scenarios, for example, describing image content for visually impaired users and automatically grading primary school students' homework. In fact, X-VLM is also very good at tasks such as fine-grained retrieval and visual grounding.

The X-VLM code is now open source; you are welcome to fine-tune it on your own tasks.


* This article is published with authorization from QbitAI; the views expressed are solely the author's.

—  End  —
