
Youtu Tech Share | Applying Tencent Youtu's multimodal image-text content recognition and localization to content security

2022-06-24 05:13:00 Youtu Laboratory

Today, with the development of digital technology, deep learning is used ever more widely in computer vision and appears in everyday scenarios such as face recognition and object classification and detection. These applications are built on a single modality, vision. But the real world is not limited to vision: hearing and language are also essential parts of it, and a single modality may not be enough to judge the nature of things correctly.

Against this background, more and more research has turned to multimodality. Early multimodal work asked how to better fuse multiple modalities and ultimately achieve a 1+1>2 effect. But this approach not only depends on the amount of data; features of different dimensions from different models may also fail to match, preventing reasonable semantic fusion, sharply degrading the model's results, and possibly even producing 1+1<2.

To address these problems, Tencent Youtu Lab researcher xavierzwlin gave a talk on the theme of "multimodal image-text content recognition and localization", combining Youtu Lab's research progress on multimodal tasks with its achievements and practical experience in content security to explain the technical principles and the logic behind them.

01

Research progress on multimodal tasks

Multimodality means processing the information an object conveys through multiple forms. Simply put, a dog expresses itself through its appearance, its bark, touch, and other modalities; likewise, the videos people watch every day convey information through pictures, subtitles, sound, and even bullet comments. These are multimodal forms.

In widespread scenarios such as recognizing image ads on the internet, meme images, and vague user queries, machine learning that handles only a single modality cannot effectively recognize the text, figures, background watermarks, and other modalities present in a single image. Multimodal algorithms are needed to solve such problems.

There are currently many kinds of multimodal tasks, for example the following four categories:

01

Recognition tasks: recognize the scenes and words in an image to distinguish the information the image is meant to convey;

02

Retrieval tasks: identify the different descriptions in a passage of text and select the matching target by search;

03

Image Caption: recognize the various features of an image (background, actions, expressions, states, etc.) and output a correct description of it;

04

VQA: recognize the relevant content in an image in light of the question posed, and output the correct answer.

Abstracting these concrete applications into research problems yields the following categories:

01

Representation learning: this divides into joint representation and coordinated representation. Joint representation maps the features of different modalities into the same feature space; coordinated representation maps them into separate spaces while enforcing constraints between the modalities (see the sketch after this list);

02

Align: align the mutually related elements of two modalities;

03

Fusion: merge multiple modalities in the same shared space into a new modality;

04

Translation: transform one modality into another that corresponds to it;

05

Co-learning: transfer knowledge from the data of one modality to another modality.
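As a rough illustration of the first category, here is a minimal PyTorch sketch contrasting joint and coordinated representations; all dimensions and module names are illustrative assumptions, not taken from the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointRepresentation(nn.Module):
    """Joint: concatenate modalities and project into one shared space."""
    def __init__(self, img_dim=2048, txt_dim=768, joint_dim=512):
        super().__init__()
        self.proj = nn.Linear(img_dim + txt_dim, joint_dim)

    def forward(self, img_feat, txt_feat):
        return self.proj(torch.cat([img_feat, txt_feat], dim=-1))

class CoordinatedRepresentation(nn.Module):
    """Coordinated: separate spaces, constrained here by cosine similarity."""
    def __init__(self, img_dim=2048, txt_dim=768, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)

    def forward(self, img_feat, txt_feat):
        img_emb = F.normalize(self.img_proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feat), dim=-1)
        # The cross-modal constraint: matched pairs should score high.
        similarity = (img_emb * txt_emb).sum(dim=-1)
        return img_emb, txt_emb, similarity
```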

02

Transformer-based multimodal pre-training models

The simplest Transformer-based multimodal pre-training model is VisualBERT, which migrates the BERT model from the NLP field. A word on BERT first: it builds a Transformer infrastructure on self-attention and designs a series of pre-training tasks that let the model learn in a self-supervised way from unlabeled data. Models obtained this way generalize strongly, and the pre-trained model can then be fine-tuned on downstream tasks.

BERT's self-supervised learning is realized in two ways: an NSP (next-sentence prediction) loss and a masked language model. The NSP loss judges whether two input segments are logically consecutive, assessing the coherence of the whole input; the masked language model predicts hidden words from the surrounding words.

In multimodal (image-text) pre-training, the NSP-style loss can take matched image-text pairs as positive samples and randomly combined pairs as negatives, so that self-supervised learning captures the association between an image and a piece of text. When using the masked language model, the text features are serialized and some words are hidden, so that BERT itself predicts what the hidden text is.
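A minimal sketch of how these two objectives can be prepared on image-text data follows; the 15% mask rate and the -100 ignore label follow the common BERT recipe, while MASK_ID, the helper names, and the random pairing scheme are illustrative assumptions.

```python
import random
import torch

MASK_ID = 103  # [MASK] token id in the standard BERT vocabulary

def make_itm_pairs(images, texts):
    """Image-Text Match: aligned (image, text) pairs are positives;
    pairing each image with a randomly drawn text yields negatives."""
    positives = [(img, txt, 1) for img, txt in zip(images, texts)]
    shuffled = random.sample(texts, len(texts))
    negatives = [(img, txt, 0) for img, txt in zip(images, shuffled)]
    return positives + negatives

def mask_tokens(token_ids, mask_prob=0.15):
    """Masked language model: hide a fraction of the text tokens and keep
    the originals as labels; the model must predict them from the
    surrounding words (and, in the multimodal case, the image features)."""
    token_ids = token_ids.clone()
    labels = torch.full_like(token_ids, -100)   # -100 is ignored by cross-entropy
    mask = torch.rand(token_ids.shape) < mask_prob
    labels[mask] = token_ids[mask]
    token_ids[mask] = MASK_ID
    return token_ids, labels
```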

Much follow-up work improves on VisualBERT, mainly along two directions: task improvements and model-structure improvements. LXMERT (EMNLP 2019) covers both:

On model structure, the authors propose two independent Transformers that extract image and text features separately, followed by a cross-modal Transformer that fuses the two, addressing the difficulty of fusing modalities whose model features differ;

On tasks, the authors propose two improvements. First, hide part of the image content and predict what is hidden from the remaining image features and the text description; second, use image question-answer data to answer the questions posed in the text.
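Below is an illustrative PyTorch sketch of the two-stream structure described above, not LXMERT's actual implementation; the dimensions, depths, and single cross-attention pass are simplifying assumptions.

```python
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Separate encoders per modality, then cross-modal attention fusion."""
    def __init__(self, dim=768, heads=12, depth=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.txt_encoder = nn.TransformerEncoder(make_layer(), depth)
        self.img_encoder = nn.TransformerEncoder(make_layer(), depth)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt_tokens, img_regions):
        txt = self.txt_encoder(txt_tokens)    # (B, L_txt, dim)
        img = self.img_encoder(img_regions)   # (B, L_img, dim)
        # Cross-modal attention: queries from one modality, keys/values from the other.
        txt_fused, _ = self.txt2img(txt, img, img)
        img_fused, _ = self.img2txt(img, txt, txt)
        return txt_fused, img_fused
```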

03

Multimodal content security recognition with pre-training

Tencent Youtu optimizes the model structure, task design, and model acceleration, and uses pre-training to perform multimodal content security recognition.

Data processing: the text content is extracted by OCR and converted into the corresponding tokens, which are fed into the text Transformer. The image content is extracted by a CNN, which pulls out all the region features the image contains to form a feature sequence covering the whole image; this is fed into the local-image Transformer. In addition, to prevent overfitting during pre-training, the convolutional feature map can be split into local features by spatial position and sampled randomly.
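A minimal sketch of this preprocessing, with ocr_engine, tokenizer, and cnn as hypothetical stand-ins for the real components; only the random spatial sampling of local features is spelled out.

```python
import torch

def extract_local_features(feature_map, num_samples=16):
    """Split a CNN feature map (C, H, W) by spatial position and randomly
    sample a subset of the resulting local feature vectors."""
    C, H, W = feature_map.shape
    locals_per_pos = feature_map.reshape(C, H * W).t()   # (H*W, C): one vector per position
    idx = torch.randperm(H * W)[:num_samples]
    return locals_per_pos[idx]                           # (num_samples, C)

# Hypothetical wiring of the full pipeline (ocr_engine, tokenizer, cnn are
# assumed stand-ins, not real APIs):
#   text_tokens = tokenizer(ocr_engine(image))           # -> text Transformer
#   feature_map = cnn(image)                             # (C, H, W)
#   local_feats = extract_local_features(feature_map)    # -> local-image Transformer
```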

Feature extraction and fusion: a staged, hierarchical fusion is adopted. First the text content and the local image features are fused at a shallow level, forming a cross-modal text + local-image Transformer module; this module is then deeply fused with the global image features. The result is a feature-fusion network built from single-modality Transformers plus shallow and deep cross-modal Transformers.
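An illustrative sketch of this staged fusion; the layer counts and dimensions are assumptions, not Youtu's published configuration.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Stage 1: shallow text + local-image fusion; stage 2: deep fusion
    of that result with the global image features."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        stack = lambda n: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), n)
        self.shallow = stack(2)   # cross-modal text + local-image module
        self.deep = stack(6)      # fuse with global image features

    def forward(self, txt, local_img, global_img):
        stage1 = self.shallow(torch.cat([txt, local_img], dim=1))
        return self.deep(torch.cat([stage1, global_img], dim=1))
```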

Optimizing the form of the pre-training tasks:

Current pre-training tasks fall mainly into two types: Image-Text Match and Masked Language Model. In Image-Text Match, each image can be matched with only one piece of text per sampling step, which makes training inefficient; the Masked Language Model is a strong task that dominates pre-training, which can leave the final model with weak recognition of the image modality.

To address these problems, Tencent Youtu added a similarity task (similarity loss) to the pre-training tasks.

First, the text features gathered by the text Transformer are pooled into a complete text feature. Second, while the CNN extracts image features, its feature maps are globally pooled into a complete image feature. Finally, computing the similarity between the text feature and the image feature raises the efficiency of text-image matching.
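This construction resembles an in-batch contrastive loss. The exact formulation is not given in the talk, so the following is a hedged sketch in that style; the symmetric cross-entropy and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def similarity_loss(text_feats, image_feats, temperature=0.07):
    """text_feats, image_feats: (B, D) pooled per-sample features."""
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(image_feats, dim=-1)
    logits = v @ t.t() / temperature                     # (B, B): image i vs text j
    targets = torch.arange(len(v), device=v.device)      # matches on the diagonal
    # Symmetric cross-entropy: each image against all texts, and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```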

The benefit is that during training, for every input image BERT computes the similarity between that image and all text features in the batch, moving from one image matched against one piece of text to one image matched against many, which greatly improves training efficiency. The similarity task also markedly strengthens the learned association between images and texts, giving later features a better learning environment. Third, the feature maps and the image feature receive direct supervision from the text feature, so the CNN and the other image modules are also trained fully.

When optimizing downstream tasks, besides the existing text and image models, existing feature models can also be added to the training to improve the overall effect. If there is a concern that adding too many models will slow the pre-training task, the feature model can be miniaturized in the manner of ALBERT, distilled by knowledge distillation (KD), or thinned with LayerDrop, which randomly skips some layers of the Transformer module.
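A minimal sketch of the LayerDrop idea on a plain PyTorch encoder stack; the drop probability and wiring are illustrative.

```python
import torch
import torch.nn as nn

class LayerDropEncoder(nn.Module):
    """A Transformer stack that randomly skips layers during training."""
    def __init__(self, dim=768, heads=12, depth=12, p_drop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(depth))
        self.p_drop = p_drop

    def forward(self, x):
        for layer in self.layers:
            if self.training and torch.rand(1).item() < self.p_drop:
                continue          # skip this layer for this step
            x = layer(x)
        return x
```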

After this series of optimizations, the capability and efficiency of Tencent Youtu's multimodal pre-training model improved markedly:

First, compared with a single modality, the recall rate of multimodal content security recognition rose by 30%;

Second, model miniaturization significantly improved the running speed and training efficiency of the whole model, a 60% improvement over the original BERT model;

Finally, the pre-trained model can also be used on pure-image tasks. The CNN module fully trained by the similarity task was extracted and tested on a pure image-detection task, and the experimental results clearly beat a model trained on ImageNet.

04

Latest research on weakly supervised image captioning and grounding

Tencent Youtu has also made some frontier explorations in Grounded Image Caption, a multimodal translation task; the related work was accepted by MM2021. Unlike Image Caption, Grounded Image Caption must not only produce a text description of the image content but also locate in the image the objects mentioned in that description.

The existing implementation uses a two-layer LSTM architecture: an attention layer plus a language-prediction layer.

First, Faster R-CNN extracts the image's ROI features; an attention LSTM layer then produces a hidden state over those ROI features and uses it to pick out the image region most strongly correlated with the hidden state;

The selected image region is then fed into the language-prediction LSTM layer, which predicts the probability of the word corresponding to that region.

But because the scheme is weakly supervised, with no complete supervision signal, the existing approach can produce boxes that are too large, too small, or offset, which lowers localization accuracy and makes the output inaccurate or unusable. The fix is relatively simple: instead of a single attention LSTM layer picking a region from the ROI features, several parallel attention LSTM branches each predict a local region, and the multiple results are fused into one complete region, making localization more accurate. Note that more attention LSTM branches are not necessarily better: too many tend to generate noisy data.
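A hedged sketch of the multi-branch attention idea; since the paper's exact fusion mechanism is not detailed here, the sketch simply takes the union of the boxes chosen by each branch, and all module names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class MultiAttentionGrounding(nn.Module):
    """Several parallel attention branches each pick a local region from the
    ROI features; the chosen boxes are fused into one final region."""
    def __init__(self, roi_dim=2048, hid_dim=512, num_branches=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Linear(roi_dim + hid_dim, 1) for _ in range(num_branches))

    def forward(self, roi_feats, roi_boxes, hidden):
        # roi_feats: (N, roi_dim); roi_boxes: (N, 4) as (x1, y1, x2, y2);
        # hidden: (hid_dim,) the attention-LSTM hidden state.
        h = hidden.unsqueeze(0).expand(roi_feats.size(0), -1)
        picked = []
        for branch in self.branches:
            scores = branch(torch.cat([roi_feats, h], dim=-1)).squeeze(-1)
            attn = torch.softmax(scores, dim=0)
            picked.append(roi_boxes[attn.argmax()])   # region chosen by this branch
        boxes = torch.stack(picked)
        # Fuse the branch proposals into one complete region (box union).
        return torch.cat([boxes[:, :2].min(dim=0).values,
                          boxes[:, 2:].max(dim=0).values])
```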

