[Multimodal] UNIMO
2022-06-23 04:35:00 【joyce_ peng】
I. UNIMO
1. Advantage: the training data covers text-only corpora, image-only collections, and image-text pairs, so pre-training is not limited to image-text pairs.

2. Strategies and models
(1) Text rewriting (Text Rewriting): To strengthen semantic alignment between images and text at multiple granularities, the caption of an image is rewritten at the sentence level, the phrase level, and the word level.
At the sentence level, back translation (Back Translation) is used to obtain multiple positive texts for an image: a machine translation model translates the sentence into several other languages and then back, which yields differently worded sentences that keep the original meaning.
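The round trip described above can be sketched as follows. This is a minimal illustration, not UNIMO's pipeline: `back_translate` is a hypothetical helper, and `toy_translate` is a lookup-table stand-in for a real machine translation model, which the post does not name.

```python
# Toy stand-in for a machine translation model (assumption: a real system
# would be called here instead of this lookup table).
_TOY_TABLE = {
    ("a man rides a bike", "en", "de"): "ein Mann fährt Fahrrad",
    ("ein Mann fährt Fahrrad", "de", "en"): "a man is riding a bicycle",
    ("a man rides a bike", "en", "fr"): "un homme fait du vélo",
    ("un homme fait du vélo", "fr", "en"): "a man rides a bicycle",
}


def toy_translate(text, src_lang, tgt_lang):
    """Return the table entry, or the input unchanged if it is unknown."""
    return _TOY_TABLE.get((text, src_lang, tgt_lang), text)


def back_translate(sentence, pivot_langs, translate, src_lang="en"):
    """Collect paraphrases by translating to each pivot language and back."""
    paraphrases = []
    for lang in pivot_langs:
        forward = translate(sentence, src_lang, lang)
        back = translate(forward, lang, src_lang)
        # Keep only new wordings; identical round trips add nothing.
        if back != sentence and back not in paraphrases:
            paraphrases.append(back)
    return paraphrases


positives = back_translate("a man rides a bike", ["de", "fr"], toy_translate)
```

Each returned sentence would be paired with the original image as an additional positive example.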
In addition, exploiting the discrete-symbol nature of natural language, TF-IDF similarity retrieval finds sentences that share many words with the caption but differ in meaning. These serve as hard negatives for the image.
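A minimal sketch of TF-IDF retrieval for hard negatives, using only the standard library. The function names, the whitespace tokenizer, and the smoothed IDF formula are assumptions made for illustration; the post does not say how UNIMO implements the retrieval. Note that the sketch only ranks by lexical overlap and skips exact duplicates; it does not verify that the meaning differs.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in docs."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(text, idf):
    tf = Counter(tokenize(text))
    return {t: c * idf.get(t, 0.0) for t, c in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_hard_negatives(caption, corpus, k=2):
    """Return the k corpus sentences most similar to caption under TF-IDF."""
    idf = build_idf(corpus + [caption])
    query = tfidf_vector(caption, idf)
    scored = [
        (cosine(query, tfidf_vector(s, idf)), s)
        for s in corpus
        if tokenize(s) != tokenize(caption)  # skip exact duplicates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


corpus = [
    "a dog chases a ball on the grass",
    "a dog sleeps on the grass",
    "stock prices fell sharply today",
]
negatives = retrieve_hard_negatives("a dog chases a cat on the grass", corpus)
```

In this toy corpus the sentence about the ball ranks first: it overlaps heavily with the caption yet describes a different scene, which is what makes it a hard negative.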
At the phrase and word levels, the text is first parsed into a scene graph. Objects (object), attributes (attribute), relations (relation), and combinations of them are then randomly replaced, which yields hard negatives at these two granularities.
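The replacement step can be sketched as below. Everything here is an illustrative assumption: the scene graph is given as a plain dictionary (the parser itself is not shown), every element is a single word, and `make_hard_negative` and the replacement vocabulary are hypothetical names. Only one element is replaced per call, whereas the post also mentions replacing combinations.

```python
import random

KINDS = ("objects", "attributes", "relations")


def make_hard_negative(caption, scene_graph, vocab, rng):
    """Replace one scene-graph element in caption with a different word.

    Returns the rewritten caption and a (kind, old, new) record.
    """
    # Every element that has at least one possible replacement.
    options = [
        (kind, old)
        for kind in KINDS
        for old in scene_graph.get(kind, [])
        if any(w != old for w in vocab.get(kind, []))
    ]
    if not options:
        raise ValueError("nothing in the scene graph can be replaced")
    kind, old = rng.choice(options)
    new = rng.choice([w for w in vocab[kind] if w != old])
    words = [new if w == old else w for w in caption.split()]
    return " ".join(words), (kind, old, new)


caption = "a brown dog sits on a bench"
scene_graph = {  # assumed output of a scene-graph parser
    "objects": ["dog", "bench"],
    "attributes": ["brown"],
    "relations": ["on"],
}
vocab = {  # hypothetical replacement vocabulary
    "objects": ["dog", "bench", "cat", "car"],
    "attributes": ["brown", "white", "red"],
    "relations": ["on", "under", "behind"],
}
negative, change = make_hard_negative(caption, scene_graph, vocab, random.Random(0))
```

Because only one word changes, the negative stays close to the original caption on the surface while no longer matching the image.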

(2) Image / text retrieval (Image and Text Retrieval): To bring more single-modal knowledge into cross-modal learning, each image-text pair is further enriched with background knowledge retrieved from large-scale single-modal data. Each retrieved item forms a weakly correlated pair with the other modality of the original image-text pair and joins the contrastive learning.
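To show how rewritten positives, hard negatives, and retrieved pairs can enter one objective, here is a generic InfoNCE-style loss with multiple positives. It is a sketch, not UNIMO's exact loss: the post gives no formula, so the temperature value, the function names, and the choice to count retrieved weakly correlated pairs on the positive side are all assumptions.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def contrastive_loss(pos_scores, neg_scores, temperature=0.1):
    """-log( sum over positives of exp(s/t) / sum over all pairs of exp(s/t) ).

    pos_scores: similarity scores of matched image-text pairs (here assumed
                to include rewritten positives and retrieved pairs).
    neg_scores: similarity scores of negative pairs (e.g. hard negatives).
    """
    if not pos_scores:
        raise ValueError("at least one positive score is required")
    pos = [s / temperature for s in pos_scores]
    neg = [s / temperature for s in neg_scores]
    return logsumexp(pos + neg) - logsumexp(pos)


# Positives score higher than negatives: small loss.
good = contrastive_loss([0.9, 0.8], [0.1, 0.0])
# Negatives score higher than positives: large loss.
bad = contrastive_loss([0.2, 0.1], [0.9, 0.8])
```

The loss falls as positive pairs score higher than negative pairs, which is the pressure that pulls matched images and texts together in the shared space.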
(3) Visual and textual learning 

3. Experiments
Pre-training data: the text corpora include Wikipedia, BookCorpus, and OpenWebText, among others; the image data consists of 300K images crawled from the Internet; the multimodal image-text pairs come from COCO Caption, Visual Genome, Conceptual Caption, and SBU Caption.
Downstream tasks cover multimodal tasks such as visual question answering, image caption generation, and visual inference, as well as text tasks such as text classification, text summarization, and question generation.
Results on the multimodal tasks are strong: UNIMO reaches SOTA on all the major tasks, with an especially large margin on retrieval. Judging from the case studies, UNIMO does perform better at accurately understanding and capturing details.
Table 1: Multimodal downstream task evaluation results. Table 2: Single-modal downstream task evaluation results.
As shown in Table 1, the authors compare UNIMO with multimodal pre-training models such as ViLBERT, VLP, UNITER, Oscar, Villa, and ERNIE-ViL, and UNIMO achieves the best results overall. As shown in Table 2, on language understanding and generation tasks UNIMO performs better than or on par with pre-training models such as BERT, RoBERTa, XLNet, and UniLM. UNIMO thus achieves the best results on multimodal tasks while also doing well on single-modal tasks, which supports the value of the unified-modal architecture.
reference: https://mp.weixin.qq.com/s/7NYe59gKu6-js32tfy4xBw