
[Multimodal] UNIMO

2022-06-23 04:35:00 【joyce_ peng】

I. UNIMO

1. Advantage: the training data includes text-only corpora, image-only collections, and image-text pairs, so training is not limited to image-text pairs.


2. Strategies and model

(1) Text rewriting: To strengthen the semantic alignment between images and text at multiple granularities, the text description of an image is rewritten at the sentence, phrase, and word levels.
At the sentence level, back translation is used to obtain multiple positive texts for an image. Back translation means translating a sentence into several other languages with a machine translation model and then translating it back, which yields differently worded sentences that preserve the original meaning.
In addition, exploiting the discrete-symbol nature of natural language, TF-IDF similarity retrieval is used to find sentences that overlap heavily with the caption in wording but differ in meaning; these serve as hard negative samples for the image.
At the phrase and word levels, the text is first parsed into a scene graph, and then objects, attributes, relations, and their combinations are randomly replaced to obtain hard negatives at these two granularities.
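
The TF-IDF retrieval step for sentence-level hard negatives can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for the example, and the sketch only ranks candidates by lexical overlap. It does not check that a retrieved sentence actually differs in meaning from the caption.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercased whitespace tokenization; a real pipeline would use a proper tokenizer.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vectors.append({
            t: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_hard_negatives(caption, corpus, k=2):
    """Return the k corpus sentences most similar to `caption` by TF-IDF cosine,
    skipping exact duplicates of the caption."""
    vectors = tfidf_vectors([caption] + corpus)
    query, candidates = vectors[0], vectors[1:]
    scored = [
        (cosine(query, vec), sent)
        for vec, sent in zip(candidates, corpus)
        if sent != caption
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:k]]


caption = "a man rides a horse on the beach"
corpus = [
    "a man rides a bike on the beach",
    "a woman feeds a horse on the farm",
    "stock prices fell sharply on monday",
]
top = retrieve_hard_negatives(caption, corpus, k=1)
```

Here the top candidate shares most of its words with the caption but describes a different scene, which is the kind of sentence the text describes as a hard negative.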

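As an illustration of the phrase- and word-level rewriting, the sketch below replaces one scene-graph node in a caption with another word of the same type. The scene graph is assumed to come from an external parser and is passed in as `(token_index, node_type)` pairs; the `VOCAB` table and the function name are hypothetical and exist only for this example.

```python
import random

# Hypothetical replacement vocabularies; a real system would build these
# from the scene graphs of the whole caption corpus.
VOCAB = {
    "object": ["dog", "cat", "horse", "car"],
    "attribute": ["black", "white", "small", "red"],
    "relation": ["on", "under", "beside", "behind"],
}


def make_hard_negative(tokens, scene_graph, rng):
    """Replace one randomly chosen scene-graph node in the caption.

    tokens: the caption as a list of words.
    scene_graph: list of (token_index, node_type) pairs, assumed to be
                 produced by a scene-graph parser (not implemented here).
    rng: a random.Random instance, passed in for reproducibility.
    """
    index, node_type = rng.choice(scene_graph)
    # Pick a replacement of the same type that differs from the original word.
    choices = [w for w in VOCAB[node_type] if w != tokens[index]]
    rewritten = list(tokens)
    rewritten[index] = rng.choice(choices)
    return rewritten


tokens = "a black dog sits on a sofa".split()
scene_graph = [(1, "attribute"), (2, "object"), (4, "relation")]
negative = make_hard_negative(tokens, scene_graph, random.Random(0))
```

Replacing a single node gives a word-level negative; applying the function several times to the same caption would approximate the combined replacements mentioned in the text.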

(2) Image/text retrieval (Image and Text Retrieval): To incorporate more single-modal knowledge into cross-modal learning, the image-text pairs are further enriched with background knowledge retrieved from large-scale single-modal data. Each retrieved item forms a weakly correlated pair with the other modality of the image-text pair and takes part in contrastive learning.
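
To make the role of positive and negative pairs concrete, here is a minimal sketch of an InfoNCE-style contrastive loss with several positives per image. It is a generic formulation chosen for illustration and is not claimed to match the exact loss used in UNIMO; the function name, the temperature value, and the assumption that scores are cosine similarities in [-1, 1] are all choices made for this example.

```python
import math


def contrastive_loss(pos_scores, neg_scores, temperature=0.1):
    """InfoNCE-style loss with multiple positives.

    pos_scores: similarity scores between an image and its positive texts
                (e.g. the original caption, back-translations, retrieved text).
    neg_scores: similarity scores between the image and negative texts
                (e.g. rewritten hard negatives).
    """
    if not pos_scores:
        raise ValueError("at least one positive score is required")
    logits_pos = [s / temperature for s in pos_scores]
    logits_all = logits_pos + [s / temperature for s in neg_scores]
    # Subtract the largest logit before exponentiating for numerical stability.
    m = max(logits_all)
    numerator = sum(math.exp(x - m) for x in logits_pos)
    denominator = sum(math.exp(x - m) for x in logits_all)
    return -math.log(numerator / denominator)


easy = contrastive_loss([0.9], [0.1])    # negative is far from the positive
hard = contrastive_loss([0.9], [0.85])   # negative is almost as similar
```

The loss is larger when a negative scores almost as high as the positive, which is why hard negatives give a stronger training signal than randomly sampled ones.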

(3) Visual and textual learning

3. Experiments

For pre-training data, the text corpora include Wikipedia, BookCorpus, OpenWebText, and similar corpora; the image data consists of 300K images crawled from the Internet; the multimodal image-text pair data includes COCO Caption, Visual Genome, Conceptual Caption, and SBU Caption.
Downstream tasks include multimodal tasks such as visual question answering, image caption generation, and visual inference, as well as text tasks such as text classification, text summarization, and question generation.
The results on multimodal tasks are strong: UNIMO reaches state-of-the-art (SOTA) results on all major tasks, with a particularly large advantage on retrieval tasks. The case studies also suggest that UNIMO is better at accurately understanding and capturing details.
Table 1: Multimodal downstream task evaluation results. Table 2: Single-modal downstream task evaluation results.
As shown in Table 1, the authors compare UNIMO with the multimodal pre-trained models ViLBERT, VLP, UNITER, Oscar, Villa, and ERNIE-ViL, and UNIMO achieves the best results overall. As shown in Table 2, on language understanding and generation tasks UNIMO performs better than or on par with pre-trained models such as BERT, RoBERTa, XLNet, and UniLM. UNIMO thus achieves the best results on multimodal tasks while also performing well on single-modal tasks, which supports the advantage of the unified-modal architecture.