2021:Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering
2022-06-27 03:15:00 【weixin_ forty-two million six hundred and fifty-three thousand 】
Abstract
Visual question answering (VQA) requires a deep semantic and linguistic understanding of the question and the ability to relate it to the objects in the image. It therefore calls for multimodal reasoning that spans computer vision and natural language processing. We propose Graphhopper, a method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques. Specifically, our method performs context-driven, sequential reasoning based on the scene entities and their semantic and spatial relationships. As a first step, we derive a scene graph that describes the objects in the image together with their attributes and mutual relationships. A reinforcement learning agent is then trained to navigate the extracted scene graph autonomously in a multi-hop manner, producing reasoning paths that form the basis for deriving the answer. We conduct experiments on the GQA dataset using both manually curated and automatically generated scene graphs. The results show that Graphhopper keeps up with human performance on manually curated scene graphs, and that it significantly outperforms another state-of-the-art scene graph reasoning model on both manually curated and automatically generated scene graphs.
1. Introduction

VQA datasets contain language priors: on some challenging reasoning tasks, algorithms exploit this prior knowledge as a shortcut rather than reasoning properly. The GQA dataset was proposed to address this problem. It is better suited for evaluating reasoning ability than other real-world datasets, because its images and questions are carefully filtered to make the data less prone to bias.
Many VQA methods are agnostic to the explicit relational structure of the objects in the presented scene and rely on neural network architectures that process regional image features separately. Such methods lack explicit reasoning abilities. Our goal is to combine recent VQA techniques with recent research advances in statistical relational learning on knowledge graphs (KGs). A knowledge graph is a collection of factual statements that provides a human-readable, structured representation of knowledge about the real world. Inspired by multi-hop reasoning methods on KGs, we propose Graphhopper, a new method that models the VQA task as a path-finding problem on a scene graph.
In detail, given an image, we consider its scene graph and train a reinforcement learning agent to perform a policy-guided random walk on the scene graph until a conclusive reasoning path is obtained. In contrast to purely embedding-based approaches, our method provides an explicit reasoning chain that leads to the derived answer. In summary, our main contributions are as follows:
(1) Graphhopper is the first VQA method that uses reinforcement learning for multi-hop reasoning on scene graphs.
(2) Experiments on the GQA dataset demonstrate the compositional and interpretable nature of our method.
(3) To analyze the reasoning capabilities, we consider manually curated (ground truth) scene graphs. This setting isolates the noise associated with the visual perception task and focuses solely on language understanding and reasoning. In this setting we show that our method achieves performance close to that of humans.
(4) On both manually curated and automatically generated scene graphs, Graphhopper outperforms the Neural State Machine (NSM), a state-of-the-art scene graph reasoning model that operates in a setting similar to Graphhopper's.
2. Related work
3. Method

The VQA task is framed as a scene graph traversal problem. Starting from a node, an agent sequentially samples transitions to neighboring nodes until it reaches the answer node. Each transition added to the existing path extends the reasoning chain by one step. Before describing the agent's decision problem, we introduce the notation.
Notation: A scene graph is a directed multigraph in which each node corresponds to a scene entity, i.e. an object associated with a bounding box or an attribute of such an object. Each scene entity has a type that corresponds to the predicted object or attribute label. A scene graph is a set of ordered triples (s,p,o), each consisting of a subject, a predicate, and an object.
Environment: The agent's state space S is given by E×Q, where E is the set of nodes of the scene graph SG and Q denotes the set of all questions. The state at time t consists of the entity et at which the agent is currently located and the question Q, so St=(et,Q). The set of actions available in state St is denoted ASt and contains all outgoing edges of node et together with their corresponding object nodes. We add a self-loop labeled NO_OP to every node of the scene graph; these self-loops allow the agent to stay at its current location once it has reached the answer node. In addition, we include inverse relations, which allow the agent to move freely in either direction between two nodes.
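The environment evolves deterministically by updating its state according to the previous action. Formally, if the agent takes the action (r,et+1) in state St=(et,Q), i.e. follows the edge r to the node et+1, the transition function returns the next state St+1=(et+1,Q).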
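The environment described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `SceneGraphEnv`, the `_inv` suffix for inverse relations, and the toy triples are all assumptions made for the example.

```python
from collections import defaultdict

NO_OP = "NO_OP"


class SceneGraphEnv:
    """Toy deterministic environment over a scene graph of (s, p, o) triples."""

    def __init__(self, triples):
        self.edges = defaultdict(list)  # node -> list of (relation, target)
        nodes = set()
        for s, p, o in triples:
            self.edges[s].append((p, o))
            # Inverse relation so the agent can move in either direction.
            self.edges[o].append((p + "_inv", s))
            nodes.update((s, o))
        for node in sorted(nodes):
            # NO_OP self-loop lets the agent stay once it reaches the answer.
            self.edges[node].append((NO_OP, node))

    def actions(self, state):
        """Admissible actions: all outgoing edges of the current entity."""
        entity, _question = state
        return list(self.edges[entity])

    def step(self, state, action):
        """Deterministic transition (e_t, Q) -> (e_{t+1}, Q)."""
        if action not in self.actions(state):
            raise ValueError(f"inadmissible action {action!r}")
        _relation, next_entity = action
        _entity, question = state
        return (next_entity, question)


env = SceneGraphEnv([("man", "wearing", "hat"), ("hat", "color", "red")])
state = ("man", "What color is the hat the man is wearing?")
state = env.step(state, ("wearing", "hat"))
state = env.step(state, ("color", "red"))
```

The question is carried along unchanged in every state, so only the entity component moves; the two hops here end on the node `red`.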
Auxiliary nodes: The rationale for including auxiliary nodes is that they facilitate the agent's walk or help to frame the QA task as a goal-oriented walk on the scene graph. These nodes are included during graph traversal at run time but ignored at compile time, for example when computing node embeddings. For instance, we add to each scene graph a hub node that is connected to all other nodes, and the agent starts its traversal from this globally connected hub. For binary questions, we add YES and NO nodes to the scene entities that correspond to the agent's final position, so that the agent can then transition to either the YES or the NO node.
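As a rough illustration of the auxiliary nodes, the sketch below augments a list of triples with a hub node and, for binary questions, YES and NO nodes. The function name and the relation labels `hub_to` and `answer` are hypothetical. Attaching YES and NO to every entity is a simplification made here, because the agent's final position is not known in advance.

```python
HUB, YES, NO = "HUB", "YES", "NO"


def add_auxiliary_nodes(triples, is_binary=False):
    """Return the triples plus hub edges and, optionally, YES/NO edges."""
    entities = sorted({n for s, _p, o in triples for n in (s, o)})
    # The hub is connected to every entity and serves as the start node.
    extra = [(HUB, "hub_to", e) for e in entities]
    if is_binary:
        for e in entities:
            # Simplification: every entity can reach both answer nodes.
            extra.append((e, "answer", YES))
            extra.append((e, "answer", NO))
    return list(triples) + extra


query_graph = add_auxiliary_nodes([("man", "wearing", "hat")])
binary_graph = add_auxiliary_nodes([("man", "wearing", "hat")], is_binary=True)
```

Because these edges are added only to the traversal graph, the node embeddings can still be computed on the original triples.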
Question and scene graph processing: The words in Q are initialized with GloVe embeddings of dimension d=300. Similarly, entities and relations are initialized with the embeddings of their class labels. Nodes in the scene graph are then embedded with a multilayer graph attention network (GAT), which extends the idea of graph convolutional networks with a self-attention mechanism. GATs mimic the convolution operator on regular grids in that an entity embedding is formed by aggregating the features of its neighboring nodes. Relations and inverse relations between nodes allow context to flow through the GAT in both directions. The resulting embeddings are therefore context-aware, so that nodes of the same type can be distinguished when their graph neighborhoods differ. To produce the embedding of the question Q, we first apply a Transformer and then a mean pooling operation.
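The neighborhood aggregation performed by a graph attention layer can be sketched as follows. This is a deliberately simplified, single-head version in pure Python: it omits the learned linear projection and the output nonlinearity of a real GAT layer, and the function names and toy features are assumptions for illustration only.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_aggregate(features, neighbors, attn):
    """One simplified, single-head graph-attention step.

    features:  dict node -> list[float] of length d
    neighbors: dict node -> list of neighboring nodes
    attn:      list[float] of length 2*d, the attention vector
    """
    updated = {}
    for node, h_i in features.items():
        group = [node] + list(neighbors.get(node, []))
        # Unnormalised score for each neighbor j: LeakyReLU(attn . [h_i ; h_j]).
        scores = [
            leaky_relu(sum(a * x for a, x in zip(attn, h_i + features[j])))
            for j in group
        ]
        # Softmax over the neighborhood (shifted by the max for stability).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # New embedding: attention-weighted sum of the neighborhood features.
        updated[node] = [
            sum(w * features[j][k] for w, j in zip(weights, group))
            for k in range(len(h_i))
        ]
    return updated


feats = {"hat": [1.0, 0.0], "red": [0.0, 1.0]}
nbrs = {"hat": ["red"], "red": ["hat"]}  # relation plus inverse relation
out = attention_aggregate(feats, nbrs, attn=[0.0, 0.0, 0.0, 0.0])
```

With an all-zero attention vector every neighbor receives the same weight, so each node's new embedding is the plain average over itself and its neighbors. A trained attention vector would instead weight neighbors unequally, which is what makes two nodes of the same type end up with different embeddings when their neighborhoods differ.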
Finally, because we have added auxiliary YES and NO nodes, we train a feedforward neural network to classify questions as query-type or binary-type. This network consists of two fully connected layers with a ReLU activation on the intermediate output. We find that query and binary questions are easy to tell apart, so question classification errors can be neglected.
Policy: The history of the agent, i.e. the sequence of actions taken so far, is encoded by a multilayer LSTM (Equation 1). Its input at-1 is the embedding of the previous action, that is, of the traversed edge and the node it leads to. The history-dependent distribution over the admissible actions is given by Equation 2.
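One form consistent with this description is written out below. It is an assumption rather than the exact equations: the two-layer feedforward scorer, the weight matrices $W_1$ and $W_2$, and the matrix $\mathbf{A}_t$, whose rows are taken to be the embeddings of the admissible actions in $A_{S_t}$, are names introduced here for illustration.

```latex
\begin{align}
h_t &= \mathrm{LSTM}\left(h_{t-1},\, a_{t-1}\right), \tag{1}\\
d_t &= \mathrm{softmax}\left(\mathbf{A}_t \, W_2\, \mathrm{ReLU}\left(W_1 \left[h_t ; Q\right]\right)\right),
\qquad A_t \sim \mathrm{Categorical}(d_t). \tag{2}
\end{align}
```

Under this reading, $[h_t ; Q]$ is the concatenation of the encoded history and the question embedding, and $d_t$ assigns one probability to each admissible action, from which the next action $A_t$ is sampled.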
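Rewards and optimization: After sampling T transitions, a terminal reward is assigned (Equation 3): the agent receives a reward of 1 if the entity at its final position is the answer to the question, and 0 otherwise.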
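REINFORCE is used to maximize the expected reward. The agent's maximization problem (Equation 4) is therefore to find the policy parameters that maximize the expected terminal reward, where the expectation is taken over the training questions and over the action sequences sampled from the policy.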
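The training signal for a single rollout can be sketched as a surrogate loss whose gradient is the REINFORCE estimate. This is a minimal illustration under stated assumptions: the function and class names, the decay of 0.9, and the entropy weight `beta` are hypothetical, and in practice the loss would be built from differentiable tensors in an autodiff framework rather than from plain floats. The baseline and entropy terms correspond to the variance-reduction and exploration devices described in the following paragraph.

```python
import math


def terminal_reward(final_entity, answer):
    """1 if the walk ends on the answer node, else 0."""
    return 1.0 if final_entity == answer else 0.0


class MovingAverageBaseline:
    """Exponential moving average of past rewards, used to reduce variance."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.value = 0.0

    def update(self, reward):
        self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return self.value


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reinforce_loss(step_probs, chosen, reward, baseline, beta=0.01):
    """Surrogate loss for one rollout.

    step_probs: per-step action distributions (lists of probabilities)
    chosen:     index of the sampled action at each step
    """
    log_prob = sum(math.log(p[a]) for p, a in zip(step_probs, chosen))
    bonus = sum(entropy(p) for p in step_probs)
    # Minimising this raises the log-probability of rollouts whose reward
    # exceeds the baseline, and rewards high-entropy (exploratory) policies.
    return -(reward - baseline) * log_prob - beta * bonus


reward = terminal_reward("red", "red")
loss = reinforce_loss([[0.5, 0.5], [1.0]], [0, 0], reward, baseline=0.0, beta=0.0)
```

Averaging this loss over several rollouts per question and over the questions in a batch gives the empirical approximation of the two expectations.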

Here, T denotes the set of training questions. During training, the first expectation in Equation 4 is replaced by the empirical average over the training set, and the second expectation is approximated by the empirical mean over multiple rollouts. We also use a moving-average baseline to reduce the variance, and entropy regularization, controlled by a weighting parameter, to encourage exploration. At inference time we do not sample paths; instead we run a beam search of width 20 based on the transition probabilities given by Equation (2).
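The inference procedure can be illustrated with a small beam search over paths. This is a sketch under assumptions: `policy` is a hypothetical callable that maps the path so far to a list of (next node, probability) pairs, standing in for the learned action distribution, and the toy transition table is invented for the example.

```python
import math


def beam_search(policy, start, steps, width=20):
    """Return up to `width` most probable paths of `steps` hops from `start`.

    policy: callable path -> list of (next_node, probability)
    """
    beams = [(0.0, [start])]  # (cumulative log-probability, visited nodes)
    for _ in range(steps):
        candidates = []
        for log_p, path in beams:
            for nxt, prob in policy(path):
                if prob > 0.0:
                    candidates.append((log_p + math.log(prob), path + [nxt]))
        if not candidates:
            break
        # Keep only the `width` highest-scoring partial paths.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:width]
    return beams


table = {
    "hub": [("man", 0.7), ("dog", 0.3)],
    "man": [("hat", 0.9), ("man", 0.1)],
    "dog": [("dog", 1.0)],
    "hat": [("red", 0.7), ("hat", 0.3)],
}


def toy_policy(path):
    # Toy stand-in: depends only on the last node; unknown nodes self-loop.
    return table.get(path[-1], [(path[-1], 1.0)])


best = beam_search(toy_policy, "hub", steps=3, width=2)
```

The final node of the highest-scoring path is taken as the predicted answer. Passing the whole path to `policy` reflects that the real action distribution depends on the agent's history, not only on its current node.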
4. Datasets and experimental setup
4.1 Dataset
GQA is better suited than other real-world datasets for evaluating the reasoning and compositional abilities of a model in a realistic setting. It contains 113K images and about 1.2M questions, split roughly 80%/10%/10% into training, validation, and test sets. The overall vocabulary comprises 3097 words, including 1702 object classes, 310 relations, and 610 object attributes. For the generated scene graphs we use a pruned version of GQA. We first use the manually curated scene graphs to analyze reasoning and language understanding. We then evaluate performance on scene graphs generated on the pruned GQA dataset, which shows how our model performs on noisy data. Scene graphs are generated with the state-of-the-art Relation Transformer Network (RTN), with DetectoRS for object detection. All experiments are conducted on test-dev.
Question types: The questions are designed to evaluate visual verification, relational reasoning, spatial reasoning, comparison, and logical reasoning. They can be categorized according to either structural or semantic criteria.
4.2 Experimental setup
4.3 Performance metrics
In addition to accuracy, we report further metrics: consistency (answers should not contradict one another), validity (the answer lies within the scope of the question; for example, red is a valid answer when asked about the color of an object), and plausibility (the answer should be reasonable; for example, red is a plausible color for an apple, whereas blue is not).
5. Results and discussion
Our architecture involves multiple components, so it is important to be able to analyze the performance of the individual modules and processing steps separately. We therefore first report experimental results on the manually curated ground truth scene graphs provided with the GQA dataset and compare Graphhopper with NSM and with human performance. This setting lets us isolate the noise of the visual perception component and quantify the reasoning capabilities of our method. We then present results on generated scene graphs.
In addition, we observe that the auxiliary nodes help the agent perform effectively. Starting from the hub node works better than starting from an arbitrary node, because it is easier to move forward and to backtrack from the hub.
Replicating NSM: NSM also performs VQA by reasoning over scene graphs, and we use it as the baseline method for comparison. Its reasoning approach, however, differs from ours. To compare the reasoning abilities of both methods on the same generated scene graphs, we attempt to replicate NSM using the parameters and implementation details that are available for it.
5.1 Results on manually curated scene graphs

Table 1 compares our method with human performance and with NSM on the manually curated scene graphs. Graphhopper outperforms NSM on all performance metrics, most notably on open questions. Graphhopper is also slightly better than humans in accuracy on binary questions. On the other hand, humans outperform Graphhopper in consistency, validity, and plausibility, where they consistently reach high values. Overall, these results can be regarded as evidence of Graphhopper's reasoning capabilities and establish an upper bound on its performance.
5.2 Results on automatically generated scene graphs
Scene graph generation is not the focus of this work, but generating good scene graphs for GQA is one of the main challenges, for the following reasons: (1) no open-source code for scene graph generation and object detection on GQA is available; (2) compared with existing scene graph datasets, the large number of instances and the uneven class distribution in GQA lead to a significant drop in accuracy; (3) modern object detection frameworks lack attribute prediction models.
We address all of these challenges in this work, because the performance of our model depends directly on the quality of the scene graphs.
Scene graph generation: We select two state-of-the-art networks, RTN for scene graph generation and DetectoRS for object detection. The contextual scene graph embeddings produced by RTN's Transformer-based architecture are closely related to our own architecture. However, to keep Graphhopper generic with respect to the scene graph generator, we do not use RTN's contextual embeddings and rely on the contextualization provided by the GAT instead.
GQA pruning: GQA contains a large number of classes whose distribution is highly skewed, which significantly reduces accuracy on the object detection and scene graph generation tasks. We therefore keep only the top 800 classes, 170 relations, and 200 attributes. This pruning reduces the vocabulary by more than 60% while still covering more than 96% of the combined answers.
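The pruning step can be sketched as a frequency cut-off. This is a minimal illustration that assumes "top" means most frequent; the function name and the toy label list are hypothetical, and the coverage computed here is over label instances rather than over answers.

```python
from collections import Counter


def prune_vocabulary(labels, keep):
    """Keep the `keep` most frequent labels and report instance coverage."""
    counts = Counter(labels)
    kept = {label for label, _count in counts.most_common(keep)}
    total = sum(counts.values())
    covered = sum(counts[label] for label in kept)
    coverage = covered / total if total else 0.0
    return kept, coverage


labels = ["man"] * 5 + ["dog"] * 3 + ["kite"] + ["zebra"]
kept, coverage = prune_vocabulary(labels, keep=2)
```

On a skewed distribution like this toy one, dropping half of the distinct labels still leaves most instances covered, which is the effect the pruning relies on. The same function would be applied separately to object classes, relations, and attributes with their respective cut-offs.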
Attribute prediction: One drawback of existing scene graph generators and object detectors is that they cannot predict the attributes of the detected objects. We therefore add an attribute predictor to our GQA pipeline, which takes the contextual object embeddings from RTN as its input.
We train the object detector and the scene graph generator on GQA with their default parameters. This increases the coverage of all instances that appear in the training questions (such as objects, attributes, and relations) from 52% to 77%.

Table 2 shows the performance of Graphhopper in two settings: first, using graphs generated by our own pipeline, with predicted classes, attributes, and relations; second, using only the relations predicted by RTN (with ground truth objects and attributes). On the generated graphs, Graphhopper consistently outperforms NSM. It scores even higher in the "pr" (predicted relations) setting, because these graphs do not contain any false predictions from the object detector. These results indicate that Graphhopper has superior reasoning abilities both on fully generated graphs and on graphs in which only the relations between objects are generated.
5.3 Discussion of reasoning ability

Figure 4 breaks the results down by question type, with five semantic types (left) and five structural types (middle), and also reports performance as a function of the length of the reasoning path (right). It further shows the performance of the model in the three scene graph settings considered in this work. Figure 4a shows the performance on manually curated scene graphs, which reflects the performance in an ideal environment. Figure 4b shows the performance when only the relations between objects are predicted, which reflects the combined performance of Graphhopper and the scene graph generator. Finally, Figure 4c shows the combined performance of the object detector, the scene graph generator, and Graphhopper. First, we find that Graphhopper achieves high accuracy on all question types in every setting. Furthermore, Graphhopper's performance does not suffer when answering a question requires many reasoning steps. We conjecture that although such high-complexity questions are harder to answer, the contextualized embeddings (for example, from the GAT and the Transformer) allow the agent to extract the specific information that identifies the correct target node. The good performance on these high-complexity questions can be seen as evidence that Graphhopper can effectively translate a question into a sequence of transitions on the scene graph until the correct answer is reached.

Examples of reasoning paths: Figure 3 shows three examples of scene graph traversals by Graphhopper that lead to the correct answer. As these examples show, reasoning sequentially over explicit scene graph entities makes the reasoning process easier to follow. When a prediction is wrong, the extracted path may offer insight into the mechanics of Graphhopper and facilitate debugging.
6. Conclusion
We have proposed Graphhopper, a new method for visual question answering that combines existing KG reasoning techniques with computer vision and natural language processing. Specifically, an agent is trained to extract conclusive reasoning paths from scene graphs. To analyze the reasoning capabilities of our method, we conducted a rigorous experimental study on both manually curated and generated scene graphs. On manually curated scene graphs, Graphhopper reaches human performance. Moreover, on our automatically generated scene graphs, Graphhopper outperforms another state-of-the-art scene graph reasoning model on all considered performance metrics. In future work, we plan to combine scene graphs with common sense knowledge graphs to further improve the reasoning abilities of Graphhopper.