YouTech Share | The application of Tencent Youtu's multimodal image-text content recognition and localization in content security
2022-06-24 05:13:00 【Youtu Laboratory】
Today, with the development and innovation of digital technology, deep learning is used ever more widely in computer vision and appears in everyday work and life, for example in face recognition and in the classification and detection of objects. These applications rely on a single modality, vision, but the real world is not limited to vision: hearing and language are also important parts of it, and a single modality may not be enough to judge the nature of things correctly.
Against this background, more and more research has turned to multimodality. However, the early idea of multimodal research was simply how to better fuse multiple models and finally achieve a 1+1>2 effect. This approach not only depends on the amount of data; because features of different modalities have mismatched dimensions, reasonable semantic fusion cannot be achieved, the model's results are greatly degraded, and a 1+1<2 outcome may even appear.
To address these problems, Tencent Youtu Lab researcher xavierzwlin gave a talk on the theme of "multimodal image-text content recognition and localization", combining the lab's research progress and achievements on multimodal tasks with its practical experience in content security, and explained the technical principles and internal logic behind them.
01
Research progress of multimodal tasks
Multimodality refers to processing the information that an object conveys through multiple forms. Simply put, a dog expresses itself through its appearance, its bark, touch, and other modalities; likewise, the videos people watch every day convey information through pictures, subtitles, sound, and even bullet comments. This is the multimodal form.
In widely used scenarios such as recognizing image advertisements, meme images, and fuzzy user requirements on the Internet, machine learning that can only handle a single modality cannot effectively recognize the text, figures, background watermarks, and other modalities present in a single picture. Multimodal algorithms are needed to solve such problems.
There are currently many kinds of multimodal tasks, for example the following four categories:
01
Recognition tasks: by recognizing the scenes and text in a picture, determine the information the picture intends to convey;
02
Retrieval tasks: by recognizing the different descriptions in a passage of text, search for and select the matching target;
03
Image Caption: by recognizing the various features in a picture (background, action, expression, state, etc.), output a correct description of the picture;
04
VQA: given a question, recognize the relevant content in the picture and output the correct answer.
Abstracting these applications into concrete problems, they can be classified into the following categories:
01
Representation learning: divided into joint representation and coordinated representation. Joint representation maps features of different modalities into the same feature space; coordinated representation maps features of different modalities into different spaces while imposing constraints between the modalities (a minimal code sketch follows this list);
02
Align: align the mutually related elements across two modalities;
03
Fusion: merge multiple modalities in the same shared space into a new representation;
04
Translation: transform one modality into another modality with a corresponding relation;
05
Co-learning: transfer knowledge learned from data in one modality to another modality.
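To make the representation-learning distinction concrete, here is a minimal PyTorch sketch. The dimensions, layer choices, and cosine constraint are illustrative assumptions rather than any particular published model: joint representation projects both modalities into one shared space, while coordinated representation keeps a projection per modality and ties the two together with a similarity constraint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointRepresentation(nn.Module):
    """Map image and text features into one shared space (joint representation)."""
    def __init__(self, img_dim=2048, txt_dim=768, joint_dim=512):
        super().__init__()
        self.proj = nn.Linear(img_dim + txt_dim, joint_dim)

    def forward(self, img_feat, txt_feat):
        # Concatenate the two modalities, then project into a single space.
        return self.proj(torch.cat([img_feat, txt_feat], dim=-1))

class CoordinatedRepresentation(nn.Module):
    """Keep a separate space per modality and constrain them to agree (coordinated representation)."""
    def __init__(self, img_dim=2048, txt_dim=768, out_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, out_dim)
        self.txt_proj = nn.Linear(txt_dim, out_dim)

    def forward(self, img_feat, txt_feat):
        img_emb = F.normalize(self.img_proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feat), dim=-1)
        # The "constraint": matched pairs should have high cosine similarity.
        constraint_loss = 1 - (img_emb * txt_emb).sum(dim=-1).mean()
        return img_emb, txt_emb, constraint_loss
```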
02
Transformer-based multimodal pre-training models
The simplest Transformer-based multimodal pre-training model is VisualBERT, which migrates the BERT model from the NLP field. A brief introduction to BERT first: it builds a Transformer infrastructure on self-attention and designs a series of pre-training tasks so that the BERT model can learn in a self-supervised way without labeled data. Models obtained this way have strong generalization ability, and the pre-trained model can then be fine-tuned on downstream tasks.
BERT's self-supervised learning is achieved in two ways: the NSP loss and the masked language model. The NSP loss judges whether the two input segments are logically consecutive, and thereby the logic of the whole content; the masked language model predicts hidden intermediate words from the surrounding words.
In multimodal (image-text) pre-training, the NSP loss can treat matched image-text pairs as positive samples and randomly combined images and texts as negative samples, so that self-supervised learning captures the logical association between images and texts and judges whether image and text content match. When using the masked language model, the text features are serialized and some words are masked, so that BERT itself predicts what the hidden text is.
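As a rough illustration of how these two signals can be built from unlabeled image-text pairs, here is a minimal sketch; the `[MASK]` token id, the mask probability, and the way negatives are drawn are assumptions made only for illustration, not the exact recipe of any of the models discussed.

```python
import torch

def build_itm_pairs(img_feats, txt_tokens):
    """Image-Text Match: matched pairs are positives; randomly shuffled texts give negatives."""
    batch = img_feats.size(0)
    pos_labels = torch.ones(batch, dtype=torch.long)    # image i with its own text i
    perm = torch.randperm(batch)                         # (a real pipeline would avoid accidental matches)
    neg_txt = txt_tokens[perm]
    neg_labels = torch.zeros(batch, dtype=torch.long)    # image i with a random text
    imgs = torch.cat([img_feats, img_feats], dim=0)
    txts = torch.cat([txt_tokens, neg_txt], dim=0)
    labels = torch.cat([pos_labels, neg_labels], dim=0)
    return imgs, txts, labels

def mask_tokens(txt_tokens, mask_id=103, mask_prob=0.15):
    """Masked Language Model: hide random tokens; the model must recover them."""
    labels = txt_tokens.clone()
    mask = torch.rand(txt_tokens.shape) < mask_prob
    labels[~mask] = -100            # ignore unmasked positions in the loss
    masked = txt_tokens.clone()
    masked[mask] = mask_id          # replace the chosen tokens with [MASK]
    return masked, labels
```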
Many works have made a series of improvements to VisualBERT, mainly in task design and model structure. LXMERT (EMNLP 2019) covers both directions:
In terms of model structure, the authors propose two independent Transformers that extract image and text features separately, followed by a cross-modal Transformer that fuses the image and text features, solving the problem that features of different modalities are hard to fuse;
In terms of tasks, the authors propose two improvements. First, mask part of the image content and predict the hidden content from the other features in the picture and from the text description; second, recognize question-answer data in the picture and answer the questions posed in the text.
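A minimal sketch of this two-stream idea, with layer counts and dimensions chosen arbitrarily for illustration: each modality is first encoded by its own Transformer, then cross-attention lets text attend to image regions and image regions attend to text.

```python
import torch
import torch.nn as nn

class TwoStreamCrossModal(nn.Module):
    def __init__(self, dim=768, heads=8, layers=2):
        super().__init__()
        enc_layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.txt_encoder = nn.TransformerEncoder(enc_layer(), num_layers=layers)  # text-only stream
        self.img_encoder = nn.TransformerEncoder(enc_layer(), num_layers=layers)  # image-only stream
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)        # text attends to image
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)        # image attends to text

    def forward(self, txt_emb, img_emb):
        txt = self.txt_encoder(txt_emb)      # (B, T, dim) text features
        img = self.img_encoder(img_emb)      # (B, R, dim) image region features
        # Cross-modal fusion: each modality queries the other.
        txt_fused, _ = self.txt2img(txt, img, img)
        img_fused, _ = self.img2txt(img, txt, txt)
        return txt_fused, img_fused
```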
03
Pre-training-based multimodal content security recognition
Tencent Youtu optimizes the model structure, the task design, and model acceleration, and uses pre-training to perform multimodal content security recognition.
Data processing: the text content is extracted via OCR, converted into corresponding tokens, and fed into the text Transformer; the image content is processed by a CNN that extracts all the region features contained in the image, forming a feature sequence covering the whole image that is fed into the local-image Transformer. In addition, to prevent overfitting during pre-training, local features can be separated from the convolutional feature map according to their spatial positions and sampled randomly.
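A minimal sketch of the "separate local features by spatial position and sample them randomly" step described above; the ResNet-50 backbone and the number of sampled positions are assumptions made only for illustration.

```python
import torch
import torchvision

def sample_local_features(images, num_samples=36):
    """Flatten a CNN feature map over spatial positions and randomly sample local features."""
    backbone = torchvision.models.resnet50(weights=None)
    # Drop the pooling/fc head to keep the spatial feature map (B, 2048, H, W).
    extractor = torch.nn.Sequential(*list(backbone.children())[:-2])
    fmap = extractor(images)                              # (B, C, H, W)
    b, c, h, w = fmap.shape
    locals_ = fmap.flatten(2).transpose(1, 2)             # (B, H*W, C): one feature per spatial position
    idx = torch.randperm(h * w)[:num_samples]             # random subset of spatial positions
    return locals_[:, idx, :]                             # (B, num_samples, C)
```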
Feature extraction and fusion: a staged, hierarchical fusion is adopted. First, the text content and the local image features are fused at a shallow level, forming a cross-modal text + local-image Transformer module; then this cross-modal module is deeply fused with the global features of the image. The final result is a feature-fusion network composed of single-modality Transformers together with shallow and deep cross-modal Transformers.
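A minimal sketch of this staged, hierarchical fusion, with module sizes chosen only for illustration: text tokens and local image features are fused first in a shallow module, and the global image feature is brought in afterwards for deeper fusion.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.shallow = nn.TransformerEncoder(layer(), num_layers=2)  # text + local-image fusion
        self.deep = nn.TransformerEncoder(layer(), num_layers=4)     # adds the global image feature

    def forward(self, txt_tokens, local_img, global_img):
        # Stage 1: shallow cross-modal fusion of text tokens and local image regions.
        shallow_out = self.shallow(torch.cat([txt_tokens, local_img], dim=1))
        # Stage 2: deep fusion with the global image feature (one extra token per image).
        deep_in = torch.cat([shallow_out, global_img.unsqueeze(1)], dim=1)
        return self.deep(deep_in)
```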
Optimizing the design of the pre-training tasks:
Current pre-training tasks fall mainly into two types: Image-Text Match and Masked Language Model. With Image-Text Match, each picture can only be matched with one piece of text per sample, so training efficiency is low; the Masked Language Model is a strong task that dominates pre-training, which may leave the final model with weak recognition ability on the image modality.
To solve these problems, Tencent Youtu added a similarity task (similarity loss) to the pre-training tasks.
First, the text features produced by the text Transformer are pooled to obtain a complete text feature; second, when the CNN extracts image features, its feature maps are globally pooled to obtain a complete image feature; finally, the similarity between the text feature and the image feature is computed, which increases the efficiency of text-image matching.
The advantage is that, during training, every picture fed into BERT is scored against all text features, moving from one picture corresponding to one passage of text to one picture corresponding to multiple passages of text, which greatly improves training efficiency. At the same time, the similarity task significantly improves the relevance between images and texts and provides a better learning signal for the later features. Third, the feature maps and the image feature receive direct supervision from the text feature, so the CNN and other image modules can also be fully trained.
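One common way to realize such a similarity task is an in-batch contrastive loss, where each image is scored against every text in the batch; the sketch below follows that pattern, with the temperature value and the symmetric cross-entropy form being assumptions rather than the lab's exact loss.

```python
import torch
import torch.nn.functional as F

def similarity_loss(text_feature, image_feature, temperature=0.07):
    """In-batch image-text similarity: each image is matched against all texts in the batch."""
    t = F.normalize(text_feature, dim=-1)     # (B, D) pooled text features
    v = F.normalize(image_feature, dim=-1)    # (B, D) globally pooled image features
    logits = v @ t.t() / temperature          # (B, B): image i scored against every text j
    targets = torch.arange(v.size(0), device=v.device)  # the diagonal pairs are the true matches
    # Symmetric cross-entropy over the image->text and text->image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```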
When optimizing downstream tasks, besides the existing text and image models, some existing feature models can also be added to training to improve the overall effect. If there is a concern that adding too many models will slow down the pre-training task, the feature model can be miniaturized with an ALBERT-style model, compressed by knowledge distillation (KD), or trained with LayerDrop, which randomly skips some layers in the Transformer module.
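A minimal sketch of the LayerDrop idea mentioned above, randomly skipping Transformer layers during training; the drop probability and layer configuration are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LayerDropEncoder(nn.Module):
    """Stack of Transformer layers where each layer may be skipped at random during training."""
    def __init__(self, dim=768, heads=8, num_layers=12, drop_prob=0.2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.drop_prob = drop_prob

    def forward(self, x):
        for layer in self.layers:
            # During training, skip this layer with probability drop_prob; at inference, keep all layers.
            if self.training and torch.rand(1).item() < self.drop_prob:
                continue
            x = layer(x)
        return x
```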
After this series of optimizations, the capability and efficiency of Tencent Youtu's multimodal pre-training model have improved significantly:
First, compared with the single-modality approach, the recall rate of multimodal content security recognition has increased by 30%;
Second, model miniaturization has significantly improved the running speed and training efficiency of the whole model, a 60% improvement over the original BERT model;
Finally, the pre-trained model can also be used for pure image tasks. The CNN module fully trained by the similarity task was extracted and tested on a pure image detection task, and the experimental results were clearly better than those of a model trained on ImageNet.
04
The latest research on weakly supervised image captioning and grounding
Tencent Youtu has also made some cutting-edge explorations in Grounded Image Caption, a multimodal translation task; the related work has been accepted by MM2021. Unlike Image Caption, Grounded Image Caption must not only give a text description of the picture content, but also locate in the image the objects mentioned in the text description.
The existing implementation is a two-layer LSTM architecture consisting of an attention layer and a language-prediction layer.
First, Faster R-CNN extracts the image ROI features, and the attention LSTM layer's hidden state is used over those ROI features to find the image region most strongly correlated with the hidden state;
Then the selected image region is fed into the language-prediction LSTM layer, which predicts the probability of the word corresponding to that region.
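A minimal sketch of such a two-layer LSTM captioner, in the spirit of the attention-LSTM-plus-language-LSTM design described above; all dimensions, the word-embedding setup, and the single-step interface are assumptions for illustration, not the published model.

```python
import torch
import torch.nn as nn

class AttentionCaptioner(nn.Module):
    """Attention LSTM selects image regions; the language LSTM predicts the next word."""
    def __init__(self, roi_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_lstm = nn.LSTMCell(embed_dim + roi_dim + hidden_dim, hidden_dim)
        self.att_score = nn.Linear(roi_dim + hidden_dim, 1)
        self.lang_lstm = nn.LSTMCell(roi_dim + hidden_dim, hidden_dim)
        self.word_head = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, roi_feats, state_att, state_lang):
        # roi_feats: (B, N, roi_dim) Faster R-CNN region features; states are (h, c) tuples.
        mean_roi = roi_feats.mean(dim=1)
        h_att, c_att = self.att_lstm(
            torch.cat([self.embed(word_ids), mean_roi, state_lang[0]], dim=-1), state_att)
        # Attention over regions: score each ROI against the attention LSTM's hidden state.
        h_exp = h_att.unsqueeze(1).expand(-1, roi_feats.size(1), -1)
        weights = torch.softmax(
            self.att_score(torch.cat([roi_feats, h_exp], dim=-1)).squeeze(-1), dim=-1)
        attended = (weights.unsqueeze(-1) * roi_feats).sum(dim=1)  # the region being "looked at"
        # Language LSTM predicts the next word from the attended region.
        h_lang, c_lang = self.lang_lstm(torch.cat([attended, h_att], dim=-1), state_lang)
        return self.word_head(h_lang), weights, (h_att, c_att), (h_lang, c_lang)
```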
But because the scheme itself is weakly supervised, with no complete supervision signal, the existing scheme may produce boxes that are too large, too small, or offset, leading to low localization accuracy and output that is inaccurate or unrecognizable. The solution is actually fairly simple: instead of a single attention LSTM layer over the ROI features, use multiple parallel attention LSTM layers to predict multiple local regions, and finally fuse the multiple results into one complete region, making localization more accurate. Note that more attention LSTM layers are not always better; too many of them easily produce noisy data.
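A minimal sketch of the multi-branch idea, assuming Faster R-CNN ROI features, their boxes, and an LSTM hidden state are already available; for brevity each branch here is a simple attention scorer rather than a full LSTM, and fusing the branch boxes by taking their union is also an assumption made only for illustration.

```python
import torch
import torch.nn as nn

class MultiBranchRegionAttention(nn.Module):
    """Several parallel attention branches each pick a region; their boxes are fused into one."""
    def __init__(self, roi_dim=2048, hidden_dim=512, num_branches=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(roi_dim + hidden_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1))
            for _ in range(num_branches)
        )

    def forward(self, roi_feats, roi_boxes, h):
        # roi_feats: (B, N, roi_dim), roi_boxes: (B, N, 4) as (x1, y1, x2, y2), h: (B, hidden_dim)
        h_exp = h.unsqueeze(1).expand(-1, roi_feats.size(1), -1)
        picked = []
        for branch in self.branches:
            scores = branch(torch.cat([roi_feats, h_exp], dim=-1)).squeeze(-1)  # (B, N)
            idx = scores.argmax(dim=-1)                                          # region chosen by this branch
            picked.append(roi_boxes[torch.arange(roi_boxes.size(0)), idx])       # (B, 4)
        boxes = torch.stack(picked, dim=1)                                       # (B, num_branches, 4)
        # Fuse the branch predictions into one box by taking their union.
        x1y1 = boxes[..., :2].min(dim=1).values
        x2y2 = boxes[..., 2:].max(dim=1).values
        return torch.cat([x1y1, x2y2], dim=-1)                                   # fused (B, 4) box
```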