2021:Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering

2022-06-27 03:15:00 【weixin_ forty-two million six hundred and fifty-three thousand 】

Abstract

Visual question answering (VQA) requires a deep semantic and linguistic understanding of the question and the ability to relate it to the objects in the image. It calls for multimodal reasoning that spans computer vision and natural language processing. We propose Graphhopper, which approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques. Specifically, our method performs context-driven, sequential reasoning based on the scene entities and their semantic and spatial relationships. As a first step, we derive a scene graph that describes the objects in the image together with their attributes and mutual relationships. A reinforcement learning agent is then trained to navigate the extracted scene graph autonomously, in a multi-hop manner, to generate reasoning paths, which are the basis for deriving answers. We conduct experiments on the GQA dataset using both manually curated and automatically generated scene graphs. The results show that, with manually curated scene graphs, the method keeps up with human performance. Moreover, Graphhopper significantly outperforms another state-of-the-art scene graph reasoning model on both manually curated and automatically generated scene graphs.

1. Introduction

VQA datasets contain language priors: on some challenging reasoning tasks, algorithms exploit this prior knowledge as a shortcut instead of reasoning properly. The GQA dataset was proposed to address this problem. It is better suited to evaluating reasoning ability than other real-world datasets because its images and questions are carefully filtered to make the data less prone to bias.

Many VQA methods are agnostic to the explicit relational structure of the objects in the presented scene and rely on neural network architectures that process regional image features separately. These methods lack explicit reasoning ability. Our goal is to combine state-of-the-art VQA techniques with recent advances in statistical relational learning on knowledge graphs (KGs). A knowledge graph is a collection of factual statements that provides a human-readable, structured representation of knowledge about the real world. Inspired by multi-hop reasoning methods on KGs, we propose Graphhopper, a new method that models the VQA task as a path-finding problem on a scene graph.

In detail: given an image, we consider its scene graph and train a reinforcement learning agent to perform a policy-guided random walk on the scene graph until a conclusive reasoning path is obtained. In contrast to purely embedding-based approaches, our method provides an explicit reasoning chain that leads to the derived answer. In summary, our main contributions are as follows:

(1) Graphhopper is the first VQA method that employs reinforcement learning for multi-hop reasoning on scene graphs.
(2) Experiments on the GQA dataset demonstrate the compositional and interpretable nature of our method.
(3) To analyze reasoning ability, we consider manually curated (ground truth) scene graphs. This setting isolates the noise associated with the visual perception task and focuses solely on the language understanding and reasoning tasks. In this setting, we show that our method achieves human-like performance.
(4) On both manually curated and automatically generated scene graphs, we show that Graphhopper outperforms the Neural State Machine (NSM), a state-of-the-art scene graph reasoning model that operates in a setting similar to Graphhopper's.

2. Related Work

3. Method

The VQA task is framed as a scene graph traversal problem. Starting from a node, an agent sequentially samples transitions to neighboring nodes until it reaches the answer node. Each transition added to the existing path extends the reasoning chain. Before describing the agent's decision problem, we introduce the notation.

Notation: A scene graph is a directed multigraph in which each node corresponds to a scene entity, that is, either an object associated with a bounding box or an attribute of an object. Each scene entity has a type that corresponds to its predicted object or attribute label. Formally, a scene graph is a set of ordered triples (s, p, o).

Environment: The state space S of the agent is given by E × Q, where E is the set of nodes of the scene graph SG and Q denotes the set of all questions (an individual question is also written Q below). The state at time t consists of the entity e_t at which the agent is currently located and the question Q, so S_t = (e_t, Q). The set of actions available from state S_t, denoted A_{S_t}, contains all outgoing edges of node e_t together with their corresponding object nodes. For every node of the scene graph we add a self-loop labeled NO_OP. These self-loops allow the agent to stay at its current location once it has reached the answer node. In addition, inverse relations are introduced so that the agent can move freely in either direction between two nodes.

The environment evolves deterministically by updating its state according to the previous action. Formally, the transition function maps the current state S_t = (e_t, Q) and a chosen action, that is, an outgoing edge together with its object node e, to the next state (e, Q).
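The environment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `_inv` suffix for inverse relations, and the toy triples are all hypothetical.

```python
from collections import defaultdict

NO_OP = "NO_OP"


def build_action_space(triples):
    """Map each node to its admissible actions, given (s, p, o) triples.

    Every triple contributes a forward edge and an inverse edge, and every
    node receives a NO_OP self-loop so the agent can stay in place.
    """
    actions = defaultdict(set)
    nodes = set()
    for s, p, o in triples:
        nodes.update((s, o))
        actions[s].add((p, o))
        # Inverse relation; the "_inv" naming is an assumption for this sketch.
        actions[o].add((p + "_inv", s))
    for node in nodes:
        actions[node].add((NO_OP, node))
    return {node: sorted(edges) for node, edges in actions.items()}


def transition(state, action):
    """Deterministic transition: move to the target node, keep the question."""
    _entity, question = state
    _relation, target = action
    return (target, question)


# Toy scene graph (hypothetical data).
triples = [("man", "holding", "cup"), ("cup", "on", "table")]
actions = build_action_space(triples)
state = ("man", "What is the man holding?")
next_state = transition(state, ("holding", "cup"))
```

Because inverse edges are stored explicitly, the agent can step from `cup` back to `man` without any special-case logic in `transition`.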

Auxiliary nodes: Auxiliary nodes are included because they ease the agent's walk or help frame the QA task as a goal-oriented walk on the scene graph. These nodes are used during graph traversal at run time but are ignored at compile time, for example when computing node embeddings. For instance, we add to every scene graph a hub node that is connected to all other nodes, and the agent starts each scene graph traversal from this globally connected hub. For binary questions, we add YES and NO nodes to the scene entity that corresponds to the agent's final location, so the agent can then transition to either the YES or the NO node.
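As a rough sketch of how such auxiliary nodes could be attached to an action map, consider the following. The relation names `hub_to` and `answer`, the node name `HUB`, and both helper functions are assumptions made for illustration only.

```python
def add_hub(actions, hub="HUB", relation="hub_to"):
    """Return a copy of the action map with a hub node linked to every node."""
    augmented = {node: list(edges) for node, edges in actions.items()}
    augmented[hub] = [(relation, node) for node in sorted(actions)]
    return augmented


def add_yes_no(actions, final_entity, relation="answer"):
    """Return a copy in which the agent's final entity can reach YES and NO."""
    augmented = {node: list(edges) for node, edges in actions.items()}
    augmented.setdefault(final_entity, []).extend(
        [(relation, "YES"), (relation, "NO")]
    )
    return augmented


# Toy action map (hypothetical data).
actions = {"man": [("holding", "cup")], "cup": [("holding_inv", "man")]}
with_hub = add_hub(actions)
binary_graph = add_yes_no(with_hub, final_entity="cup")
```

Both helpers copy the map rather than mutate it, which mirrors the point that auxiliary nodes exist only for traversal and should not leak into the graph used for computing embeddings.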

Question and scene graph processing: The words in Q are initialized with GloVe embeddings of dimension d = 300. Similarly, entities and relations are initialized with the embeddings of their type labels. Within the scene graph, node embeddings are passed through a multilayer graph attention network (GAT), which extends the idea of graph convolutional networks with a self-attention mechanism. A GAT mimics the convolution operator on regular grids: an entity embedding is formed by aggregating node features from its neighbors. Because relations and inverse relations between nodes let context flow in both directions through the GAT, the resulting embeddings are context-aware, so nodes of the same type but with different graph neighborhoods can be distinguished. To produce an embedding of the question Q, we first apply a Transformer and then a mean pooling operation.
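To illustrate why attention-based neighbor aggregation separates nodes of the same type, here is a simplified, dependency-free sketch of one attention step in the spirit of a GAT layer. It is not the model from the paper: the learned linear projection is omitted, a single head is used, and the attention vector `attn` and all feature values are made up.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_aggregate(features, neighbors, attn):
    """One simplified attention step over each node's neighborhood.

    features:  dict node -> list of floats (length d)
    neighbors: dict node -> list of neighbor nodes
    attn:      list of floats (length 2 * d), scores the pair [h_i || h_j]
    """
    out = {}
    for i, h_i in features.items():
        nbrs = neighbors.get(i, []) + [i]  # include a self-connection
        scores = [
            leaky_relu(sum(a * x for a, x in zip(attn, h_i + features[j])))
            for j in nbrs
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        out[i] = [
            sum(w * features[j][k] for w, j in zip(weights, nbrs))
            for k in range(len(h_i))
        ]
    return out


# Two "cup" nodes share the same initial (type) feature but differ in context.
features = {
    "cup_1": [1.0, 0.0],
    "cup_2": [1.0, 0.0],
    "table": [0.0, 1.0],
    "man": [1.0, 1.0],
}
neighbors = {
    "cup_1": ["table"],
    "table": ["cup_1"],
    "cup_2": ["man"],
    "man": ["cup_2"],
}
embeddings = attention_aggregate(features, neighbors, attn=[0.1, 0.2, 0.3, 0.4])
```

After one step, `cup_1` and `cup_2` no longer share an embedding, because each has absorbed features from a different neighbor.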

Finally, since we added auxiliary YES and NO nodes, we train a feedforward neural network to classify questions as query-type or binary-type. The network consists of two fully connected layers with a ReLU activation on the intermediate output. We find that query questions and binary questions are easy to tell apart, so we ignore question classification errors in what follows.

Policy: We keep track of the agent's history, which is encoded by a multilayer LSTM (Equation 1). The LSTM input a_{t-1} is the embedding of the previous action, that is, of the traversed edge and the node it leads to. The policy is a history-dependent distribution over the admissible actions (Equation 2).
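The idea of a history-dependent distribution over admissible actions can be sketched as follows. This is an illustration under assumptions, not the paper's exact parameterization: `context` stands in for some learned projection of the LSTM history encoding and the question embedding, and scoring actions by a dot product is a common choice in path-finding agents that the text here does not confirm.

```python
import math
import random


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def action_distribution(context, action_embeddings):
    """Score each admissible action against the context vector, then normalize."""
    scores = [
        sum(c * a for c, a in zip(context, embedding))
        for embedding in action_embeddings
    ]
    return softmax(scores)


def sample_action(probs, rng):
    """Draw the index of the next action from the categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical context vector and embeddings for three admissible actions.
context = [1.0, 0.0]
action_embeddings = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = action_distribution(context, action_embeddings)
chosen = sample_action(probs, random.Random(0))
```

Only the actions available at the current node are scored, so the distribution changes shape from step to step as the agent moves through the graph.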

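Rewards and optimization: After T transitions have been sampled, the agent receives a terminal reward that depends on whether the node it has reached is the correct answer (Equation 3).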

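We use REINFORCE to maximize the expected reward, so the agent's maximization problem is an expectation of the terminal reward over the training questions and over the action sequences sampled from the policy (Equation 4).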
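For orientation, a standard way to write a terminal reward, the resulting objective, and the REINFORCE gradient estimate is given below. This is a sketch consistent with the description in this section, not a transcription of the paper's equations. The symbols e_a (answer node), b (baseline), N (number of rollouts), \lambda (entropy weight), and the calligraphic \mathcal{T} for the set of training questions are notational assumptions.

```latex
% Terminal reward: 1 if the final node e_T is the answer node e_a, else 0
R(e_T) = \mathbb{1}\{\, e_T = e_a \,\}

% Objective: expected terminal reward over questions and sampled action sequences
\theta^{*} = \arg\max_{\theta} \;
  \mathbb{E}_{Q \sim \mathcal{T}} \;
  \mathbb{E}_{A_0, \dots, A_{T-1} \sim \pi_\theta}
  \left[ R(e_T) \right]

% REINFORCE estimate with a baseline b and an entropy bonus weighted by \lambda
\nabla_\theta J(\theta) \approx
  \frac{1}{N} \sum_{n=1}^{N}
  \left( R^{(n)} - b \right)
  \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\!\left( A_t^{(n)} \mid H_t^{(n)} \right)
  + \lambda \, \nabla_\theta \mathcal{H}(\pi_\theta)
```

Subtracting the baseline b leaves the gradient estimate unbiased while reducing its variance, and the entropy term discourages the policy from collapsing onto a single path too early.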

Here, T denotes the set of training questions. During training, the first expectation in Equation (4) is replaced by the empirical average over the training set, and the second expectation is approximated by the empirical mean over multiple rollouts. We also use a moving-average baseline to reduce variance, and an entropy regularization term with a weighting parameter to encourage exploration. At inference time we do not sample paths. Instead, we run a beam search of width 20 over the transition probabilities given by Equation (2).
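The inference step can be illustrated with a small beam search over a toy graph. This is a sketch under assumptions: `log_prob` is a hypothetical stand-in for the trained policy, and the graph and probabilities are invented. Only the default beam width of 20 comes from the text.

```python
import math


def beam_search(start, actions, log_prob, num_hops, beam_width=20):
    """Keep the beam_width most probable paths after every hop.

    A path alternates nodes and relations: [node, relation, node, ...].
    log_prob(path, action) returns the log-probability of taking `action`
    after `path`.
    """
    beams = [(0.0, [start])]
    for _ in range(num_hops):
        candidates = []
        for score, path in beams:
            for relation, target in actions.get(path[-1], []):
                step = log_prob(path, (relation, target))
                candidates.append((score + step, path + [relation, target]))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Toy graph and toy policy (hypothetical data).
actions = {
    "HUB": [("hub_to", "man"), ("hub_to", "cup")],
    "man": [("holding", "cup"), ("NO_OP", "man")],
    "cup": [("NO_OP", "cup"), ("holding_inv", "man")],
}
probs = {
    ("HUB", "man"): 0.7, ("HUB", "cup"): 0.3,
    ("man", "cup"): 0.9, ("man", "man"): 0.1,
    ("cup", "cup"): 0.8, ("cup", "man"): 0.2,
}


def toy_log_prob(path, action):
    return math.log(probs[(path[-1], action[1])])


beams = beam_search("HUB", actions, toy_log_prob, num_hops=2, beam_width=2)
best_score, best_path = beams[0]
```

In this toy run both surviving paths end at `cup`, one by way of `man` and one by staying put with NO_OP, which shows how self-loops let paths of different effective lengths compete in the same beam.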

4. Datasets and Experimental Settings

4.1 Datasets

GQA is better suited to evaluating a model's reasoning and compositional abilities in a realistic setting. It contains 113K images and about 1.2M questions, split roughly 80%/10%/10% into training, validation, and test sets. The overall vocabulary comprises 3,097 words, including 1,702 object classes, 310 relationships, and 610 object attributes. For the generated scene graphs we use a pruned version of GQA. We first use the manually curated scene graphs to analyze reasoning and language understanding. We then evaluate on scene graphs generated on the pruned GQA dataset, which shows how our model performs on noisy data. To generate scene graphs we use the state-of-the-art Relation Transformer Network (RTN), with DetectoRS for object detection. All experiments are conducted on the test-dev split.

Question types: The questions are designed to evaluate visual verification, relational reasoning, spatial reasoning, comparison, and logical reasoning. They can be categorized by either structural or semantic criteria.

4.2 Experimental setup

4.3 Performance Metrics

In addition to accuracy, we report further metrics: consistency (answers should not contradict one another), validity (the answer lies within the scope of the question; for example, red is a valid answer to a question about an object's color), and plausibility (the answer should be reasonable; for example, red is a plausible color for an apple, whereas blue is not).

5. Results and Discussion

Our architecture involves multiple components, so it is important to analyze the performance of the different modules and processing steps individually. We therefore first present experimental results on the manually curated ground-truth scene graphs provided with the GQA dataset, and compare Graphhopper with NSM and with human performance. This setting lets us isolate the noise introduced by the visual perception component and quantify the reasoning ability of our method. We then present results on generated scene graphs.

In addition, we observe that the auxiliary nodes help the agent act effectively. Starting from the hub node performs better than starting from an arbitrary node, because moving forward and backtracking are easier from the hub.

Replicating NSM: NSM also performs VQA by reasoning over scene graphs, so we use it as a baseline. Its reasoning method, however, differs from ours. To compare the reasoning abilities of both methods on the same generated scene graphs, we attempted to replicate NSM using the parameters and implementation details available for it.

5.1 Results on Manually Curated Scene Graphs

Table 1 compares our method with human performance and with NSM on manually curated scene graphs. Graphhopper outperforms NSM on all performance metrics, especially on open questions. Graphhopper also slightly exceeds human accuracy on binary questions. On the other hand, humans outperform Graphhopper in consistency, validity, and plausibility, where they consistently reach high values. Overall, these results can be seen as evidence of Graphhopper's reasoning ability and establish an upper bound on its performance.

5.2 Results on Automatically Generated Graphs

Although scene graph generation is not the focus of this work, it is one of the main challenges, since creating a good scene graph on GQA is difficult for the following reasons: (1) no open-source code exists for scene graph generation and object detection on GQA; (2) compared with existing scene graph datasets, GQA has a large number of instances and an imbalanced class distribution, which cause a significant drop in accuracy; (3) modern object detection frameworks lack attribute prediction models.

In this work we address all of these challenges, because the performance of our model depends directly on the quality of the scene graph.

Scene graph generation: We select two state-of-the-art networks: RTN for scene graph generation and DetectoRS for object detection. RTN's Transformer-based architecture for contextual scene graph embedding is the most closely related to our architecture. To keep Graphhopper generic with respect to the scene graph generator, we do not use RTN's contextual embeddings and rely instead on the contextualization provided by the GAT.

GQA pruning: GQA has a large number of classes whose distribution is highly skewed, which significantly reduces the accuracy of the object detection and scene graph generation tasks. We therefore prune the dataset to the top 800 classes, 170 relationships, and 200 attributes. This pruning reduces the vocabulary by more than 60% while still covering more than 96% of the combined answers.
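The trade-off behind this kind of pruning, keeping only the most frequent labels while checking how much of the data they still cover, can be sketched as follows. The function and the toy label list are hypothetical. Note also that this sketch measures coverage over label instances, whereas the figure quoted above refers to coverage of answers.

```python
from collections import Counter


def prune_vocabulary(labels, keep):
    """Keep the `keep` most frequent labels.

    Returns the retained label set and the fraction of instances it covers.
    """
    counts = Counter(labels)
    kept = [label for label, _ in counts.most_common(keep)]
    covered = sum(counts[label] for label in kept)
    return set(kept), covered / len(labels)


# Toy, skewed label distribution (hypothetical data).
labels = ["man"] * 5 + ["cup"] * 3 + ["table"] + ["kite"]
kept, coverage = prune_vocabulary(labels, keep=2)
```

With a skewed distribution like this one, dropping half of the label types removes only a fifth of the instances, which is the effect the pruning relies on.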

Attribute prediction: A shortcoming of existing scene graph generation and object detection networks is that they cannot predict the attributes of the detected objects. We therefore add an attribute predictor for GQA that operates on RTN's contextual object embeddings.

We train the object detector and the scene graph generator on GQA with their default parameters. This increases the coverage of all instances that appear in the training questions (such as objects, attributes, and relationships) from 52% to 77%.

Table 2 shows Graphhopper's performance in two settings: first, on graphs generated by our own pipeline, which predicts classes, attributes, and relationships; second, using only the relationships predicted by RTN (with ground-truth objects and attributes). On generated graphs, Graphhopper consistently outperforms NSM. In the "pr", or predicted relationship, setting it scores higher, because the graph contains no false predictions from the object detector. These results indicate that Graphhopper has superior reasoning ability both on fully generated graphs and on graphs in which only the relationships between objects are generated.

5.3 Discussion of reasoning ability

Figure 4 breaks the results down by question type, with the 5 semantic types (left) and the 5 structural types (middle), and reports performance by length of the reasoning path (right). We also show the model's performance under the three scene graph settings considered in this work. Figure 4a shows performance on manually curated scene graphs, which describes the performance in an ideal environment. Figure 4b shows performance when only the relationships between objects are predicted, a setting that reflects the combined performance of Graphhopper and the scene graph generator. Finally, Figure 4c shows the combined performance of the object detector, the scene graph generator, and Graphhopper. First, we find that Graphhopper achieves high accuracy on all question types in every setting. Moreover, Graphhopper's performance does not suffer when answering a question requires many reasoning steps. We conjecture that, although high-complexity questions are harder to answer, the agent can extract the specific information that identifies the correct target node thanks to the proper contextualization of the embeddings (for example, via the GAT and the Transformer). The good performance on these high-complexity questions can be seen as evidence that Graphhopper effectively translates a question into transitions on the scene graph until the correct answer is reached.

Examples of reasoning paths: Figure 3 shows three examples of scene graph traversals by Graphhopper that lead to the correct answer. As these examples show, sequential reasoning over explicit scene graph entities makes the reasoning process easier to understand. When a prediction is wrong, the extracted path may offer insight into Graphhopper's mechanics and facilitate debugging.

6. Conclusion

We proposed Graphhopper, a new method for visual question answering that combines existing techniques from KG reasoning, computer vision, and natural language processing. Specifically, an agent is trained to extract conclusive reasoning paths from scene graphs. To analyze the reasoning ability of our method, we conducted a rigorous experimental study on both manually curated and generated scene graphs. On manually curated scene graphs, we showed that Graphhopper reaches human performance. Furthermore, on our automatically generated scene graphs, Graphhopper outperforms another state-of-the-art scene graph reasoning model on all performance metrics considered. In future work, we plan to combine scene graphs with commonsense knowledge graphs to further improve the reasoning ability of Graphhopper.

Copyright notice
This article was written by [weixin_ forty-two million six hundred and fifty-three thousand ]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/178/202206270305069187.html