【AI4Code】《GraphCodeBERT: Pre-Training Code Representations With DataFlow》 ICLR 2021
2022-07-25 12:40:00 【chad_lee】

In recent years, pre-trained models for programming languages have developed rapidly, improving related tasks such as code search, code completion, and code summarization. However, existing pre-trained models treat a code snippet as a token sequence and neglect the structure of the code.
This paper proposes GraphCodeBERT. Instead of the syntax-level AST, it uses the data flow of the code to represent source code information. The data flow of code is a graph: a node represents a variable, and an edge represents a dependency between variables (where-the-value-comes-from). No AST is needed; the data flow graph is not as complicated as the AST, so it does not bring in unnecessarily deep structural information.
The downstream tasks in this paper are natural language code search, clone detection, code translation, and code refinement (bug fixing).
Data Flow Graph
A data flow is a graph: nodes are variables, and an edge indicates where the value of each variable comes from.
**Why build a graph?** For the same source code, the AST differs depending on the abstract grammar used, but the data flow of the code is invariant. The data flow graph therefore provides important semantic information.

For example, in v = max_value − min_value, programmers do not always name variables meaningfully, so to understand the semantics of the variable v you can look at where its value comes from in the data flow: max_value and min_value. The data flow graph can also distinguish the different semantics of the same variable at different execution stages. In the figure, x3, x7, x9, and x11 are all the same token x, but their semantic information differs, so training on a flat token sequence is not suitable.
The data flow graph is constructed as shown above. For a piece of code $C=\{c_1, c_2, \ldots, c_n\}$, first parse it with a compiler front end (Tree-sitter) into an AST, which contains the syntactic information of the code segment, and take the leaf nodes of the AST as the variable sequence $V=\{v_1, v_2, \ldots, v_k\}$. Each variable becomes a node, and a directed edge $\varepsilon=\langle v_i, v_j\rangle$ states that the value of variable $v_j$ depends on the value of variable $v_i$. For example, for the code x = expr, x depends on all variables in the expression to the right of the equals sign, so the data flow graph has an edge from a to x, signifying that x depends on a. The set of directed edges is $E=\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_l\}$, and the data flow graph of code $C$ is $\mathcal{G}(C)=(V, E)$.
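To make the construction concrete, here is a minimal sketch that builds a where-the-value-comes-from graph for straight-line Python assignments. It uses Python's built-in ast module in place of Tree-sitter, and the function build_data_flow is made up for illustration; the real extraction handles far more syntax (calls, loops, tuple targets, and so on).

```python
import ast
from collections import defaultdict

def build_data_flow(code: str):
    """Build a simple where-the-value-comes-from graph for straight-line code.

    Nodes are (name, occurrence_index) pairs; an edge (u, v) means the
    value of v comes from u. Only simple assignments are handled here.
    """
    tree = ast.parse(code)
    counter = defaultdict(int)   # occurrences seen per variable name
    last_def = {}                # name -> node of its latest definition
    nodes, edges = [], []

    def new_node(name):
        counter[name] += 1
        node = (name, counter[name])
        nodes.append(node)
        return node

    for stmt in tree.body:
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            # every variable read on the RHS is a source of the target's value
            sources = [last_def[n.id] for n in ast.walk(stmt.value)
                       if isinstance(n, ast.Name) and n.id in last_def]
            target = new_node(stmt.targets[0].id)
            edges.extend((src, target) for src in sources)
            last_def[stmt.targets[0].id] = target
    return nodes, edges

nodes, edges = build_data_flow("a = 1\nb = 2\nx = a + b\nx = x + 1")
print(edges)  # [(('a',1),('x',1)), (('b',1),('x',1)), (('x',1),('x',2))]
```

Note how the two occurrences of x become distinct nodes ('x', 1) and ('x', 2), matching the x3/x7/x9/x11 distinction described above.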
Model
The model architecture is a standard BERT; the structural hyperparameters are not discussed in detail here. The only difference is a graph-guided mask in the attention module, derived from $\mathcal{G}(C)=(V, E)$ (the graph structure information has to be injected somewhere).

Input and output
There are three input sequences: the code snippet $C=\{c_1, c_2, \ldots, c_n\}$, the comment text of this code $W=\{w_1, w_2, \ldots, w_m\}$, and the variable node sequence $V=\{v_1, v_2, \ldots, v_k\}$. The input $X$ is the concatenation of the three: $X=\{[CLS], W, [SEP], C, [SEP], V\}$.
The output is a vector representation of each token, which is used for the various pre-training tasks.
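A minimal sketch of the concatenation, assuming BERT-style special tokens (the helper build_input is hypothetical):

```python
def build_input(comment_tokens, code_tokens, variable_tokens):
    """Concatenate X = [CLS], W, [SEP], C, [SEP], V as described above."""
    return (["[CLS]"] + comment_tokens + ["[SEP]"]
            + code_tokens + ["[SEP]"] + variable_tokens)

# e.g. a comment, the code "return x", and its data flow node x11:
X = build_input(["return", "the", "value"], ["return", "x"], ["x11"])
```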
Graph-Guided Masked Attention
The main design change is to BERT's multi-head attention. The output of each head is:
$$
\begin{gathered}
head_i = \operatorname{softmax}\left(\frac{Q_i \cdot K_i^{T}}{\sqrt{d_k}} + M\right) \cdot V_i \\
\hat{G}^{n} = \left[head_1; \ldots; head_u\right] \cdot W_n^{O}
\end{gathered}
$$
where $M$ is the graph-guided masked attention matrix (the feature that distinguishes GraphCodeBERT from BERT), of dimension $|X| \times |X|$. $M$ does two things: (1) if variable $v_i$ and variable $v_j$ have no edge in the data flow graph ($\langle v_j, v_i\rangle \notin E$), the softmax bias $M_{ij}$ is negative infinity, i.e., variable $i$ is not allowed to attend to variable $j$; if there is an edge, the entry is 0 and $i$ may attend to $j$. (2) If the variable node $v_i$ was identified from the code token $c_j$, then $i$ and $j$ are allowed to attend to each other; otherwise the entry is again negative infinity.
$$
M_{ij}=\begin{cases}
0 & \text{if } q_i \in \{[CLS], [SEP]\} \text{ or } q_i, k_j \in W \cup C \text{ or } \langle q_i, k_j\rangle \in E \cup E' \\
-\infty & \text{otherwise}
\end{cases}
$$
- The values of the white parts are 0.
- Orange part: if the code token $c_i$ corresponds to the variable $v_j$, for example the token x in return x corresponds to x11, then $M_{c_i v_j}=0$; other tokens (including the other occurrences of x) have no correspondence with x11 and are set to $-\infty$.
- Blue part: if variable $v_i$ and variable $v_j$ are connected by a data flow edge, then $M_{v_i v_j}=0$; otherwise the entry is negative infinity.
This mirrors the connection between multi-head attention and graph convolution: attention is only allowed between nodes connected by an edge.
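Below is a minimal sketch of constructing $M$, assuming the input layout $X = [CLS], W, [SEP], C, [SEP], V$. The function name and the exact treatment of special tokens and node self-attention are assumptions; the released GraphCodeBERT code handles more cases.

```python
import torch

def graph_guided_mask(m, n, k, dfg_edges, var_to_code):
    """Bias matrix M (0 = may attend, -inf = blocked) for
    X = [CLS], W (m comment tokens), [SEP], C (n code tokens), [SEP], V (k nodes).

    dfg_edges:   (i, j) pairs meaning the value of v_j comes from v_i.
    var_to_code: (i, j) pairs meaning node v_i was identified from code token c_j.
    """
    total = 1 + m + 1 + n + 1 + k
    text_end = 1 + m + 1 + n + 1                 # end of [CLS] W [SEP] C [SEP]
    vpos = lambda i: text_end + i                # position of variable node v_i
    cpos = lambda j: 1 + m + 1 + j               # position of code token c_j

    M = torch.full((total, total), float("-inf"))
    M[:text_end, :text_end] = 0.0                # q_i, k_j in W ∪ C (and separators)
    for p in (0, 1 + m, text_end - 1):           # [CLS]/[SEP] queries attend anywhere
        M[p, :] = 0.0
    for i, j in dfg_edges:                       # blue part: data flow edges
        M[vpos(j), vpos(i)] = 0.0
    for i, j in var_to_code:                     # orange part: node <-> its code token
        M[vpos(i), cpos(j)] = 0.0
        M[cpos(j), vpos(i)] = 0.0
    return M

# used as the additive bias above: softmax(Q @ K.T / sqrt(d_k) + M) @ V
```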
Pre-training Tasks
There are three pre-training tasks: masked language modeling (MLM), edge prediction, and node alignment.
Masked Language Modeling
MLM is applied only to the code sequence and the comment text sequence.
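A sketch of selecting MLM targets over the text prefix only, assuming the standard BERT masking rate of 15% (the helper mask_for_mlm is hypothetical):

```python
import random

def mask_for_mlm(tokens, text_len, mask_rate=0.15, mask_token="[MASK]"):
    """Mask random positions in the comment+code prefix (W and C) only;
    the variable nodes V at the tail of X are left untouched."""
    tokens = list(tokens)
    labels = [None] * len(tokens)            # prediction targets
    for pos in random.sample(range(text_len), max(1, int(text_len * mask_rate))):
        if tokens[pos] in ("[CLS]", "[SEP]"):
            continue                         # never mask special tokens
        labels[pos] = tokens[pos]
        tokens[pos] = mask_token
    return tokens, labels
```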
Edge Prediction
Edge prediction on the data flow graph teaches the model the "where-the-value-comes-from" information; it corresponds to the blue part of the architecture diagram. Randomly sample 20% of the nodes of the data flow graph, denoted $V_s$, and mask the edges incident to these nodes by setting their entries in the attention mask matrix to negative infinity. BERT's output is then fed into a binary cross-entropy (BCE) loss to predict the masked edges:
$$
loss_{EdgePred} = -\sum_{e_{ij} \in E_c}\left[\delta(e_{ij} \in E_{mask}) \log p_{e_{ij}} + \left(1-\delta(e_{ij} \in E_{mask})\right) \log\left(1-p_{e_{ij}}\right)\right]
$$
Here $\delta(e_{ij} \in E_{mask})$ is 1 if $\langle v_i, v_j\rangle \in E$ and 0 otherwise, i.e., it is the BCE label, and $p_{e_{ij}}$ is computed from the inner product of the BERT output embeddings of the two nodes. Negative sampling is also used here.
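A minimal sketch of this loss, scoring a candidate edge by the sigmoid of the inner product of its two node embeddings as described above (the function name and the way negatives are supplied are assumptions):

```python
import torch
import torch.nn.functional as F

def edge_prediction_loss(node_emb, pos_pairs, neg_pairs):
    """BCE edge-prediction loss from node embeddings.

    node_emb:  (k, d) output embeddings of the k variable nodes.
    pos_pairs: (i, j) index pairs for masked edges that exist in the graph.
    neg_pairs: (i, j) index pairs sampled as negatives (no edge).
    """
    pairs = pos_pairs + neg_pairs
    labels = torch.tensor([1.0] * len(pos_pairs) + [0.0] * len(neg_pairs))
    i_idx = torch.tensor([i for i, _ in pairs])
    j_idx = torch.tensor([j for _, j in pairs])
    logits = (node_emb[i_idx] * node_emb[j_idx]).sum(dim=-1)  # inner products
    return F.binary_cross_entropy_with_logits(logits, labels)

emb = torch.randn(5, 768)
loss = edge_prediction_loss(emb, pos_pairs=[(0, 2), (1, 2)], neg_pairs=[(3, 4)])
```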
Node Alignment

The goal is to align the representation of the code sequence with the representation of the data flow graph. For instance, the code sequence contains four identical tokens x, but in the data flow graph x11 should correspond only to the x of the last statement, return x.
Based on this idea, the concrete approach is to first mask the edge between x and x11 in the mask matrix $M$ (change the entry from 0 to $-\infty$), then do BCE binary classification on BERT's output; the negative samples are the other occurrences of the token x in the code sequence:
$$
loss_{NodeAlign} = -\sum_{e_{ij} \in E_c'}\left[\delta(e_{ij} \in E_{mask}') \log p_{e_{ij}} + \left(1-\delta(e_{ij} \in E_{mask}')\right) \log\left(1-p_{e_{ij}}\right)\right]
$$
Experiments
Four downstream tasks: code search, code clone detection, code translation, and bug fixing.
NATURAL LANGUAGE CODE SEARCH
Code search takes a natural language query as input and requires finding the most semantically relevant code among a set of candidates. The dataset is CodeSearchNet, with the first paragraph of each function's documentation used as the query. GraphCodeBERT separately encodes the query and the code sequence plus data flow graph, and the [CLS] representations are used to compute similarity. Fine-tuning uses this two-tower (twin-tower) setup.
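A minimal two-tower scoring sketch using the public microsoft/graphcodebert-base checkpoint on Hugging Face; for brevity it feeds plain text only and omits the data flow input that the full model also consumes:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")

def cls_embed(text):
    """Encode text and return its [CLS] vector."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    return model(**inputs).last_hidden_state[:, 0]

query = cls_embed("return the maximum value in a list")
code = cls_embed("def find_max(xs):\n    return max(xs)")
score = torch.nn.functional.cosine_similarity(query, code)
print(score.item())
```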

Code Clone Detection
Given two code snippets, measure their similarity; the BigCloneBench dataset is used. The input is the code fragments with their data flow graphs, and again the [CLS] representation is used.

Code Translation
Code translation means translating one programming language into another; its purpose is to migrate legacy software from one platform language to another. The datasets are open-source projects such as Lucene and POI, which have both Java and C# implementations. The model takes Java (C#) code as input and outputs the corresponding C# (Java) code.
The approach is to use the pre-trained GraphCodeBERT as the encoder, attach a randomly initialized decoder, and then fine-tune.
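A sketch of this setup with Hugging Face transformers, pairing the pre-trained encoder with a randomly initialized decoder of the same architecture (the paper's actual decoder size and training details may differ):

```python
from transformers import (AutoConfig, AutoModel, EncoderDecoderModel,
                          RobertaForCausalLM)

# pre-trained GraphCodeBERT as the encoder
encoder = AutoModel.from_pretrained("microsoft/graphcodebert-base")

# randomly initialized decoder with cross-attention to the encoder
dec_cfg = AutoConfig.from_pretrained("microsoft/graphcodebert-base")
dec_cfg.is_decoder = True
dec_cfg.add_cross_attention = True
decoder = RobertaForCausalLM(dec_cfg)

model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
# fine-tune on (Java, C#) pairs as a standard sequence-to-sequence task
```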

Code Refinement
Bug fixing: the input is Java code containing a bug, and the output is the corrected Java code. The process is similar to code translation.
