【AI4Code】《GraphCodeBERT: Pre-Training Code Representations With DataFlow》 ICLR 2021
2022-07-25 12:40:00 【chad_lee】

In recent years, pre-trained models for programming languages have developed rapidly, improving related tasks such as code search, code completion, and code summarization. However, existing pre-trained models treat a code snippet as a token sequence and neglect the structure of the code.
This paper proposes GraphCodeBERT. Instead of the syntax-level AST, it uses the data flow of the code to represent source code information. The data flow of code is a graph: a node represents a variable, and an edge represents a dependency between variables (where-the-value-comes-from). No AST is needed; the data flow graph is not as complicated as the AST, so it does not bring in unnecessarily deep structural information.
The downstream tasks in this paper are natural language code search, clone detection, code translation, and code refinement (bug fixing).
Data Flow Graph
A data flow is a graph: nodes are variables, and an edge indicates where the value of each variable comes from.
**Why build a graph?** For the same source code, the AST differs depending on the abstract grammar used, but the data flow of the code is invariant. The data flow graph therefore provides important semantic information.

For example, in v = max_value − min_value, programmers do not always name variables meaningfully, so to understand the semantics of the variable v you can look at where its value comes from in the data flow: max_value and min_value. The data flow graph can also distinguish the different semantics of the same variable at different execution stages. In the figure, x3, x7, x9, and x11 are all the same token x, but their semantic information differs, so training on a flat token sequence is not suitable.
The data flow graph is constructed as shown above. For a piece of code $C=\{c_1, c_2, \ldots, c_n\}$, first parse it with a compiler front end (Tree-sitter) into an AST, which contains the syntactic information of the code segment, and take the leaf nodes of the AST as the variable sequence $V=\{v_1, v_2, \ldots, v_k\}$. Each variable becomes a node, and a directed edge $\varepsilon=\langle v_i, v_j\rangle$ states that the value of variable $v_j$ depends on the value of variable $v_i$. For example, for the code x = expr, x depends on all variables in the expression to the right of the equals sign, so the data flow graph has an edge from a to x, signifying that x depends on a. The set of directed edges is $E=\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_l\}$, and the data flow graph of code $C$ is $\mathcal{G}(C)=(V, E)$.
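To make the construction concrete, here is a minimal sketch that builds a where-the-value-comes-from graph for straight-line Python assignments. It uses Python's built-in ast module in place of Tree-sitter, and the function build_data_flow is made up for illustration; the real extraction handles far more syntax (calls, loops, tuple targets, and so on).

```python
import ast
from collections import defaultdict

def build_data_flow(code: str):
    """Build a simple where-the-value-comes-from graph for straight-line code.

    Nodes are (name, occurrence_index) pairs; an edge (u, v) means the
    value of v comes from u. Only simple assignments are handled here.
    """
    tree = ast.parse(code)
    counter = defaultdict(int)   # occurrences seen per variable name
    last_def = {}                # name -> node of its latest definition
    nodes, edges = [], []

    def new_node(name):
        counter[name] += 1
        node = (name, counter[name])
        nodes.append(node)
        return node

    for stmt in tree.body:
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            # every variable read on the RHS is a source of the target's value
            sources = [last_def[n.id] for n in ast.walk(stmt.value)
                       if isinstance(n, ast.Name) and n.id in last_def]
            target = new_node(stmt.targets[0].id)
            edges.extend((src, target) for src in sources)
            last_def[stmt.targets[0].id] = target
    return nodes, edges

nodes, edges = build_data_flow("a = 1\nb = 2\nx = a + b\nx = x + 1")
print(edges)  # [(('a',1),('x',1)), (('b',1),('x',1)), (('x',1),('x',2))]
```

Note how the two occurrences of x become distinct nodes ('x', 1) and ('x', 2), matching the x3/x7/x9/x11 distinction described above.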
Model
The model architecture is a standard BERT; the structural hyperparameters are not discussed in detail here. The only difference is a graph-guided mask in the attention module, derived from $\mathcal{G}(C)=(V, E)$ (the graph structure information has to be injected somewhere).

Input and output
There are three input sequences: the code snippet $C=\{c_1, c_2, \ldots, c_n\}$, the comment text of this code $W=\{w_1, w_2, \ldots, w_m\}$, and the variable node sequence $V=\{v_1, v_2, \ldots, v_k\}$. The input $X$ is the concatenation of the three: $X=\{[CLS], W, [SEP], C, [SEP], V\}$.
The output is a vector representation of each token, which is used for the various pre-training tasks.
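A minimal sketch of the concatenation, assuming BERT-style special tokens (the helper build_input is hypothetical):

```python
def build_input(comment_tokens, code_tokens, variable_tokens):
    """Concatenate X = [CLS], W, [SEP], C, [SEP], V as described above."""
    return (["[CLS]"] + comment_tokens + ["[SEP]"]
            + code_tokens + ["[SEP]"] + variable_tokens)

# e.g. a comment, the code "return x", and its data flow node x11:
X = build_input(["return", "the", "value"], ["return", "x"], ["x11"])
```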
Graph-Guided Masked Attention
The main design change is to BERT's multi-head attention. The output of each head is:
$$
\begin{gathered}
head_i = \operatorname{softmax}\left(\frac{Q_i \cdot K_i^{T}}{\sqrt{d_k}} + M\right) \cdot V_i \\
\hat{G}^{n} = \left[head_1; \ldots; head_u\right] \cdot W_n^{O}
\end{gathered}
$$
where $M$ is the graph-guided masked attention matrix (the feature that distinguishes GraphCodeBERT from BERT), of dimension $|X| \times |X|$. $M$ does two things: (1) if variable $v_i$ and variable $v_j$ have no edge in the data flow graph ($\langle v_j, v_i\rangle \notin E$), the softmax bias $M_{ij}$ is negative infinity, i.e., variable $i$ is not allowed to attend to variable $j$; if there is an edge, the entry is 0 and $i$ may attend to $j$. (2) If the variable node $v_i$ was identified from the code token $c_j$, then $i$ and $j$ are allowed to attend to each other; otherwise the entry is again negative infinity.
$$
M_{ij}=\begin{cases}
0 & \text{if } q_i \in \{[CLS], [SEP]\} \text{ or } q_i, k_j \in W \cup C \text{ or } \langle q_i, k_j\rangle \in E \cup E' \\
-\infty & \text{otherwise}
\end{cases}
$$
- The values of the white parts are 0.
- Orange part: if the code token $c_i$ corresponds to the variable $v_j$, for example the token x in return x corresponds to x11, then $M_{c_i v_j}=0$; other tokens (including the other occurrences of x) have no correspondence with x11 and are set to $-\infty$.
- Blue part: if variable $v_i$ and variable $v_j$ are connected by a data flow edge, then $M_{v_i v_j}=0$; otherwise the entry is negative infinity.
This mirrors the connection between multi-head attention and graph convolution: attention is only allowed between nodes connected by an edge.
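Below is a minimal sketch of constructing $M$, assuming the input layout $X = [CLS], W, [SEP], C, [SEP], V$. The function name and the exact treatment of special tokens and node self-attention are assumptions; the released GraphCodeBERT code handles more cases.

```python
import torch

def graph_guided_mask(m, n, k, dfg_edges, var_to_code):
    """Bias matrix M (0 = may attend, -inf = blocked) for
    X = [CLS], W (m comment tokens), [SEP], C (n code tokens), [SEP], V (k nodes).

    dfg_edges:   (i, j) pairs meaning the value of v_j comes from v_i.
    var_to_code: (i, j) pairs meaning node v_i was identified from code token c_j.
    """
    total = 1 + m + 1 + n + 1 + k
    text_end = 1 + m + 1 + n + 1                 # end of [CLS] W [SEP] C [SEP]
    vpos = lambda i: text_end + i                # position of variable node v_i
    cpos = lambda j: 1 + m + 1 + j               # position of code token c_j

    M = torch.full((total, total), float("-inf"))
    M[:text_end, :text_end] = 0.0                # q_i, k_j in W ∪ C (and separators)
    for p in (0, 1 + m, text_end - 1):           # [CLS]/[SEP] queries attend anywhere
        M[p, :] = 0.0
    for i, j in dfg_edges:                       # blue part: data flow edges
        M[vpos(j), vpos(i)] = 0.0
    for i, j in var_to_code:                     # orange part: node <-> its code token
        M[vpos(i), cpos(j)] = 0.0
        M[cpos(j), vpos(i)] = 0.0
    return M

# used as the additive bias above: softmax(Q @ K.T / sqrt(d_k) + M) @ V
```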
Pre-training Tasks
There are three pre-training tasks: masked language modeling (MLM), edge prediction, and node alignment.
Masked Language Modeling
MLM is applied only to the code sequence and the comment text sequence.
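A sketch of selecting MLM targets over the text prefix only, assuming the standard BERT masking rate of 15% (the helper mask_for_mlm is hypothetical):

```python
import random

def mask_for_mlm(tokens, text_len, mask_rate=0.15, mask_token="[MASK]"):
    """Mask random positions in the comment+code prefix (W and C) only;
    the variable nodes V at the tail of X are left untouched."""
    tokens = list(tokens)
    labels = [None] * len(tokens)            # prediction targets
    for pos in random.sample(range(text_len), max(1, int(text_len * mask_rate))):
        if tokens[pos] in ("[CLS]", "[SEP]"):
            continue                         # never mask special tokens
        labels[pos] = tokens[pos]
        tokens[pos] = mask_token
    return tokens, labels
```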
Edge Prediction
Edge prediction on the data flow graph teaches the model the "where-the-value-comes-from" information; it corresponds to the blue part of the architecture diagram. Randomly sample 20% of the nodes of the data flow graph, denoted $V_s$, and mask the edges incident to these nodes by setting their entries in the attention mask matrix to negative infinity. BERT's output is then fed into a binary cross-entropy (BCE) loss to predict the masked edges:
$$
loss_{EdgePred} = -\sum_{e_{ij} \in E_c}\left[\delta(e_{ij} \in E_{mask}) \log p_{e_{ij}} + \left(1-\delta(e_{ij} \in E_{mask})\right) \log\left(1-p_{e_{ij}}\right)\right]
$$
Here $\delta(e_{ij} \in E_{mask})$ is 1 if $\langle v_i, v_j\rangle \in E$ and 0 otherwise, i.e., it is the BCE label, and $p_{e_{ij}}$ is computed from the inner product of the BERT output embeddings of the two nodes. Negative sampling is also used here.
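A minimal sketch of this loss, scoring a candidate edge by the sigmoid of the inner product of its two node embeddings as described above (the function name and the way negatives are supplied are assumptions):

```python
import torch
import torch.nn.functional as F

def edge_prediction_loss(node_emb, pos_pairs, neg_pairs):
    """BCE edge-prediction loss from node embeddings.

    node_emb:  (k, d) output embeddings of the k variable nodes.
    pos_pairs: (i, j) index pairs for masked edges that exist in the graph.
    neg_pairs: (i, j) index pairs sampled as negatives (no edge).
    """
    pairs = pos_pairs + neg_pairs
    labels = torch.tensor([1.0] * len(pos_pairs) + [0.0] * len(neg_pairs))
    i_idx = torch.tensor([i for i, _ in pairs])
    j_idx = torch.tensor([j for _, j in pairs])
    logits = (node_emb[i_idx] * node_emb[j_idx]).sum(dim=-1)  # inner products
    return F.binary_cross_entropy_with_logits(logits, labels)

emb = torch.randn(5, 768)
loss = edge_prediction_loss(emb, pos_pairs=[(0, 2), (1, 2)], neg_pairs=[(3, 4)])
```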
Node Alignment

The goal is to align the representation of the code sequence with the representation of the data flow graph. For instance, the code sequence contains four identical tokens x, but in the data flow graph x11 should correspond only to the x of the last statement, return x.
Based on this idea, the concrete approach is to first mask the edge between x and x11 in the mask matrix $M$ (change the entry from 0 to $-\infty$), then do BCE binary classification on BERT's output; the negative samples are the other occurrences of the token x in the code sequence:
$$
loss_{NodeAlign} = -\sum_{e_{ij} \in E_c'}\left[\delta(e_{ij} \in E_{mask}') \log p_{e_{ij}} + \left(1-\delta(e_{ij} \in E_{mask}')\right) \log\left(1-p_{e_{ij}}\right)\right]
$$
Experiments
Four downstream tasks: code search, code clone detection, code translation, and bug fixing.
NATURAL LANGUAGE CODE SEARCH
Code search takes a natural language query as input and requires finding the most semantically relevant code among a set of candidates. The dataset is CodeSearchNet, with the first paragraph of each function's documentation used as the query. GraphCodeBERT separately encodes the query and the code sequence plus data flow graph, and the [CLS] representations are used to compute similarity. Fine-tuning uses this two-tower (twin-tower) setup.
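A minimal two-tower scoring sketch using the public microsoft/graphcodebert-base checkpoint on Hugging Face; for brevity it feeds plain text only and omits the data flow input that the full model also consumes:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")

def cls_embed(text):
    """Encode text and return its [CLS] vector."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    return model(**inputs).last_hidden_state[:, 0]

query = cls_embed("return the maximum value in a list")
code = cls_embed("def find_max(xs):\n    return max(xs)")
score = torch.nn.functional.cosine_similarity(query, code)
print(score.item())
```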

Code Clone Detection
Given two code snippets, measure their similarity; the BigCloneBench dataset is used. The input is the code fragments with their data flow graphs, and again the [CLS] representation is used.

Code Translation
Code translation means translating one programming language into another; its purpose is to migrate legacy software from one platform language to another. The datasets are open-source projects such as Lucene and POI, which have both Java and C# implementations. The model takes Java (C#) code as input and outputs the corresponding C# (Java) code.
The approach is to use the pre-trained GraphCodeBERT as the encoder, attach a randomly initialized decoder, and then fine-tune.
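A sketch of this setup with Hugging Face transformers, pairing the pre-trained encoder with a randomly initialized decoder of the same architecture (the paper's actual decoder size and training details may differ):

```python
from transformers import (AutoConfig, AutoModel, EncoderDecoderModel,
                          RobertaForCausalLM)

# pre-trained GraphCodeBERT as the encoder
encoder = AutoModel.from_pretrained("microsoft/graphcodebert-base")

# randomly initialized decoder with cross-attention to the encoder
dec_cfg = AutoConfig.from_pretrained("microsoft/graphcodebert-base")
dec_cfg.is_decoder = True
dec_cfg.add_cross_attention = True
decoder = RobertaForCausalLM(dec_cfg)

model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
# fine-tune on (Java, C#) pairs as a standard sequence-to-sequence task
```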

Code Refinement
Bug fixing: the input is Java code containing a bug, and the output is the corrected Java code. The process is similar to code translation.
