
2020: MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering

2022-06-27 03:16:00 weixin_42653000

Abstract

Evaluation on out-of-distribution (OOD) test samples has become an important indicator of generalization. In this paper, we propose MUTANT, a training paradigm that exposes the model to perceptually similar but semantically different mutations of the input in order to improve generalization, for example on the VQA-CP challenge. In this paradigm, the model is trained with a consistency-constrained objective to understand the effect of semantic changes in the input on the output. Unlike existing methods for VQA-CP, MUTANT does not rely on knowledge about the nature of the train and test answer distributions. MUTANT achieves a 10.57% improvement on VQA-CP; our work opens a path toward OOD generalization through semantic input mutations.

1. Introduction

Every dataset contains biases, and inductive bias is a necessary condition for machine learning algorithms to work. However, bias has a component that is useful for generalization (positive bias) and a component that arises from spurious correlations (negative bias). We use "positive bias" for the correlations necessary to perform a task -- e.g., the answer to a "What sport is ..." question is correlated with the name of some sport. We use "negative bias" for spurious correlations that may be learned from the data -- e.g., the answer to "What sport is ..." is "tennis". The goal of OOD generalization is to reduce negative bias while learning to perform the task. By contrast, LMH removes all biases by penalizing examples that can be answered without looking at the image.

We propose a method that addresses the OOD generalization problem by emphasizing positive bias and reducing negative bias. Our method mutates the input so that VQA models are exposed to perceptually similar yet semantically different samples. The intuition is to implicitly let the model understand the critical changes in the input that cause the answer to change. As shown in Figure 1, mutations of the image and of the question lead to changes in the answer; neither mutation changes the input significantly, and the type of reasoning required to answer the question is unchanged.

We propose a type-exposure framework that teaches the model that, although language priors may exist in the training data, other sports can also answer such questions, thereby reducing negative bias. This contrasts with approaches that focus on reducing language bias via data augmentation (e.g., CSS). Our approach uses a pairwise training protocol to ensure consistency between the answer predictions for the original sample and the mutant sample. Our model includes a projection layer that learns to project the cross-modal features and the ground-truth answer onto a manifold, and uses noise contrastive estimation to minimize the distance between the two vectors.

Our contributions are as follows: (1) We introduce the mutant paradigm for training VQA models, together with a sample-generation mechanism based on semantic transformations of the input image or question, in order to achieve OOD generalization. (2) In addition to the conventional classification task, we introduce a new training objective that predicts the correct answer using projections of the cross-modal features and of the answer embeddings onto a shared projection manifold. (3) Our pairwise consistency loss acts as a regularizer that pushes the distance between the ground-truth answer vectors to be close to the distance between the pair of answer vectors predicted from the original and mutant inputs. (4) Extensive experiments and analyses demonstrate the advantages of our method on the VQA-CP dataset, setting a new state of the art of 69.52%, an improvement of 10.57%.

2. MUTANT

We formulate open-domain VQA as a multi-class classification problem.

2.1 The concept of mutations

Three types of transformation create mutant inputs: addition, deletion, and substitution. For image mutations, these correspond to adding or removing objects, or changing object attributes such as color, texture, and lighting conditions. Question mutations can be obtained by adding a negation word (no, not, etc.), masking critical words, or replacing the target word with an antonym. Thus, for each sample in a VQA dataset, we can obtain a mutant sample and use it for training.

2.2 MUTANT training

Our method for training with mutant samples relies on three key concepts that supplement the conventional VQA classification task.

(1) Answer projection: Traditional VQA models learn a policy that optimizes the standard classification task with softmax cross-entropy.

Treating QA as a classification task is popular because the answer vocabulary follows a long-tailed distribution in the dataset. But when making a decision, such a model does not take the meaning of the answer into account; instead it learns correlations between the one-hot vector of the answer class and the input features. Thus, to answer "What is the color of the banana", the model learns a strong correlation between question features and the "yellow" answer class, without encoding any notion of bananas being yellow or green. This key shortcoming hurts the generalizability of these models when they are tested on raw green or overripe black bananas.

To mitigate this, in addition to the classification task we propose a training objective that operates in the answer-embedding space. The key idea is to map inputs and outputs onto a shared manifold and establish a similarity metric on that manifold. We train a projection layer that learns to project features and answers onto this manifold, as shown in Figure 2. Noise contrastive estimation is then used as a loss function to minimize the distance between the projection of the cross-modal feature z and the projection of the GloVe vector v of the ground-truth answer a, where z_feat = f_proj(z) and z_a = f_proj(glove(a)). This similarity is measured not between the ground-truth answer and the prediction, but between the projection of the input features and the projection of the answer, so as to incorporate context into the answering task.
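A minimal NumPy sketch of this idea (the projection here is a random linear map and the loss is a generic softmax-style NCE over candidate answers; all names and dimensions are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_PROJ = 8, 4
W = rng.normal(size=(D_IN, D_PROJ))  # stand-in for the learned projection layer f_proj

def f_proj(x):
    """Project a vector onto the shared manifold and L2-normalize it."""
    p = x @ W
    return p / np.linalg.norm(p)

def nce_loss(z, glove_answers, true_idx):
    """Softmax over cosine similarities between the projected cross-modal
    feature and each projected answer embedding; the other candidate
    answers serve as negatives."""
    z_feat = f_proj(z)
    sims = np.array([z_feat @ f_proj(g) for g in glove_answers])
    log_probs = sims - np.log(np.exp(sims).sum())
    return -log_probs[true_idx]

z = rng.normal(size=D_IN)             # cross-modal feature for one sample
answers = rng.normal(size=(5, D_IN))  # stand-ins for GloVe vectors of 5 candidate answers
loss = nce_loss(z, answers, true_idx=2)
```

Minimizing this loss pulls the feature projection toward the ground-truth answer projection on the shared manifold and away from the other candidates.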

(2) Type exposure: The aim is to remove negative bias. We teach the model to identify question types and to learn which answers are valid for a particular type of question, regardless of how frequently they appear in the dataset. For example, a "How many" question can be answered by any number. We call this type exposure because it signals to the model that, although a strong correlation between a question and one answer may exist, other answers are also valid for that question type. Our type exposure uses a feed-forward network to predict the answer type and creates a binary mask over the candidate answers corresponding to that type.
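The masking step can be illustrated as follows (the tiny vocabulary and type-to-answer map are ours, purely for illustration; in the paper the type is predicted by a feed-forward network rather than given):

```python
# Toy type-exposure gate: a predicted question type selects a binary mask
# over the answer vocabulary, so only answers valid for that type keep
# their scores.

answer_vocab = ["yes", "no", "1", "2", "tennis", "red"]
type_to_answers = {
    "yes/no": {"yes", "no"},
    "number": {"1", "2"},
    "sport": {"tennis"},
    "color": {"red"},
}

def type_mask(pred_type):
    """Binary mask: 1.0 for answers valid under the predicted type."""
    valid = type_to_answers[pred_type]
    return [1.0 if a in valid else 0.0 for a in answer_vocab]

def masked_scores(scores, pred_type):
    """Gate raw answer scores with the type mask."""
    return [s * m for s, m in zip(scores, type_mask(pred_type))]

scores = [0.9, 0.05, 0.2, 0.1, 0.7, 0.3]  # raw answer scores
gated = masked_scores(scores, "number")   # only numeric answers survive
```

This makes every in-type answer a live candidate regardless of its training frequency, which is exactly the mechanism intended to counter negative language priors.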

(3) Pairwise consistency: The final component of MUTANT is pairwise consistency. We train the model on both the original and the mutant sample, using a loss function that encourages the distance between the two predicted answer vectors to be close to the distance between the two ground-truth answer vectors.

This pairwise consistency is designed as a regularizer that captures the notion of the semantic shift in answer space caused by the mutation. For example, consider the image mutation in Figure 3, which changes the ground-truth answer from "2" to "1". This change in answer space should be reflected by the predictor.
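A sketch of one plausible formulation of this regularizer (the exact distance and penalty used in the paper may differ; this is the "distance of distances" reading of the description above):

```python
import numpy as np

def pairwise_consistency(pred_orig, pred_mut, gt_orig, gt_mut):
    """Penalize the gap between (a) the distance separating the two
    predicted answer vectors and (b) the distance separating the two
    ground-truth answer vectors for an (original, mutant) pair."""
    d_pred = np.linalg.norm(pred_orig - pred_mut)
    d_gt = np.linalg.norm(gt_orig - gt_mut)
    return abs(d_pred - d_gt)
```

The loss is zero when the model's answer representations shift by exactly as much as the ground truth shifts under the mutation, and grows as the predicted shift over- or under-shoots.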

3. Generating Input Mutations for VQA

A mutation is a transformation of semantic entities in the image or the question that reliably yields a new answer. Our mutation process is automatic and uses no knowledge about the test set when creating new samples.

3.1 Image mutation

We first identify the critical objects that cause the answer to change, then delete instances of these objects or change their colors.

(1) Removing object instances: An object is critical to a question if it (or a synonym) is mentioned in the question; otherwise it is non-critical. For each image, we obtain multiple masked images by deleting the pixels within an instance's boundary. These masked images are fed into a GAN-based inpainting network, which makes the mutant images realistic and also prevents the model from picking up cues from the mask shape.

(2) Color transformation: We take samples that ask about the color of an object in the image and change the color of the critical object by pixel-level color inversion in RGB space; the ground-truth answer is replaced with the new color of the critical object. To obtain an object with a new color, we use no knowledge about the colors of objects in the world. In some cases the new object color may not correspond to a real-world scene, which forces the model to actually recognize the color rather than answer from language priors.
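Pixel-level RGB inversion over an object region can be sketched in a few lines (the instance mask would come from the dataset's segmentation annotations; here it is just a boolean array we construct):

```python
import numpy as np

def invert_region(image, mask):
    """Invert RGB values (v -> 255 - v) only where the instance mask is True,
    leaving the rest of the image untouched."""
    out = image.copy()
    out[mask] = 255 - out[mask]
    return out

# A 1x2-pixel toy "image": one masked pixel, one unmasked pixel.
img = np.array([[[10, 20, 30], [100, 100, 100]]], dtype=np.uint8)
mask = np.array([[True, False]])
mutated = invert_region(img, mask)
```

A yellow region (high R, high G, low B) inverts to a bluish one, so the ground-truth color answer changes deterministically with the pixels.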

3.2 Question mutation

We use three question mutations, as shown in Table 1. We first identify the critical object, then apply template-based question operators. The first operator is negation of yes-no questions, through a template-based process that adds "no" or "not" before the verb, demonstrative, or noun phrase. The second substitutes an antonym or an adversarial object word for the keyword. The third masks words in the question, thereby introducing ambiguity. Questions whose new answer cannot be determined with certainty are annotated with a broad class label (color, location, fruit instead of red, library, apple); we still expect the model to identify this broad answer category given the partially masked input. For mutations of non-critical objects or words, the answer remains unchanged.
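Toy versions of the three template-based operators (the word lists and insertion rules are ours and far simpler than the paper's templates):

```python
# Three illustrative question mutations: negation, antonym swap, masking.

ANTONYMS = {"left": "right", "large": "small", "open": "closed"}

def negate_yes_no(q):
    """Insert 'not' after the leading auxiliary of a yes/no question."""
    words = q.split()
    return " ".join([words[0], "not"] + words[1:])

def swap_antonym(q):
    """Replace any word that has a listed antonym."""
    return " ".join(ANTONYMS.get(w, w) for w in q.split())

def mask_word(q, word):
    """Mask a target word to introduce ambiguity."""
    return " ".join("[MASK]" if w == word else w for w in q.split())
```

For a non-critical word, none of these operators change the answer; for a critical one, the new answer follows from the template (e.g., a negated yes becomes no).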

3.3 Mutation statistics

Mutant samples are generated from the VQA-CP-v2 training set. For each original sample we generate on average 1.5 mutant samples, yielding 679k samples in total. Table 2 shows the distribution of the generated mutations by mutation type. Adding mutant samples does not change the distribution of samples over question types.

4. Experiments

4.1 Setup

(1) Datasets: We train and evaluate our models on VQA-CP-v2 and on the VQA-v2 validation set.

4.2 Baseline models

We compare our method against GVQA, RUBI, SCR, LMH, and CSS as baselines, since most of these methods use UpDn as the backbone. We study UpDn within the mutant paradigm, with LXMERT as a strong transformer-based cross-modal feature extractor pre-trained on tasks such as masked language modeling and cross-modal matching. LXMERT represents the recent trend of using BERT-like pre-trained models fine-tuned on multiple downstream vision-and-language tasks.

4.3 Results on VQA-CP-v2 and VQA-v2

We add our mutant method to the UpDn and LXMERT models. On VQA-CP, our approach improves LXMERT by 23.29%, beating the previous best result by 10.57%, with improvements in every answer category. We use negation as the mutation operation for yes-no questions; although such questions do not exist in the test set, our model exploits this mutation and improves greatly on yes-no questions. The mutant method improves UpDn by 21.98% overall. AReG, RUBI, SCR, LMH, and CSS all add debiasing techniques on top of UpDn. This shows that our debiasing method improves two SOTA models and outperforms all of the above baselines, whereas prior work only modified UpDn.

When training and evaluating on the balanced VQA-v2, our method performs best among approaches designed for OOD generalization, and among the baselines it comes closest to the SOTA established by LXMERT, i.e., a setting tuned specifically for the balanced dataset.

(1) Results on VQA-v2 without retraining: In addition, we take our best model trained on VQA-CP and evaluate it on the VQA test-standard set without retraining on VQA-v2 data. The aim here is to assess whether a model trained on the biased data (VQA-CP) plus mutant data can transfer to VQA-v2 with its i.i.d. train-test split. Our overall accuracy is 67.63%, with 88.56% on yes-no questions, 50.76% on number questions, and 54.56% on other questions. This is better than all existing VQA-CP models trained explicitly on VQA-v2 (reported in Table 3), demonstrating the generality of our method.

4.4 Analysis

(1) Effect of training with mutant samples: We measure the effect of augmenting the training data with mutant samples on UpDn and LXMERT; results are in Table 4. Overall performance improves, with a significant jump on yes-no and counting questions. UpDn especially benefits from mutant samples on number questions (a 23.94% improvement).

We also compare the final model trained with only image mutations or only question mutations. Both are worse than training with both mutation types; question mutations are better than image mutations overall, while image mutations are better on number questions.

(2) Ablation study: We assess the impact of each component -- Answer Projection, Type Exposure, and Pairwise Consistency -- as shown in Table 5. Introducing the answer projection significantly improves yes-no performance, type exposure improves performance on other questions, and the pairwise consistency loss significantly improves performance on number and yes-no questions.

There is a small difference between the original and the mutant sample, and the model needs to understand this difference, which in turn enables it to reason about the question and predict the new answer. For example, the pairwise consistency loss lets the model learn the correlation between a missing object and a change in the answer, thereby improving the counting ability of our VQA models. Similarly, pairwise consistency lets the model change its answer when a critical object is deleted, improving performance on yes-no questions.

(3) Effect of LMH debiasing on MUTANT: We compare our model trained with and without the explicit debiasing method LMH. LMH implements an ensemble learning strategy that combines a main model with a biased model trained only on the question. As shown in Table 6, performance drops when LMH is used in combination with MUTANT. This may be because, during debiasing, LMH attenuates the useful positive biases introduced by the mutants.

6. Discussion and Conclusion

In this paper, we propose a method for training VQA models with input mutations, aiming at generalization to out-of-distribution data. Our novel answer projection module is trained to minimize the distance between the answer projection and the input projection, complementing the typical VQA classification task. Our type exposure module lets the network treat all valid answers for each question type as equally plausible candidates, thereby avoiding negative answer language priors. Coupled with pairwise consistency, these modules achieve state-of-the-art accuracy on the VQA-CP-v2 dataset and reduce the gap with model performance on VQA-v2.

We distinguish our work from robust learning methods that use random adversarial perturbations. In contrast, we treat input mutations as structured perturbations that lead to semantic changes in the input space and structured changes in the output space. We envision that the concept of input mutation can be extended to other vision-and-language tasks for robustness. Concurrent work in image classification shows that well-designed input perturbations or manipulated inputs can facilitate generalization and lead to improved performance.


Copyright notice
This article was created by [weixin_42653000]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/178/202206270305069509.html