
【AI4Code】《InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees》ICSE'21

2022-07-25 12:39:00 【chad_ lee】

《InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees》 ICSE 2021

The paper focuses on self-supervised training of code representations, without using any labels.

It proposes InferCode, which uses self-supervised learning to learn code representations from the abstract syntax tree (AST) of the code. The training labels are constructed by predicting subtrees of the AST, and these subtrees can be generated automatically while the AST is built. The approach rests on one assumption: similar code snippets have ASTs that contain similar subtrees. For example, two bubble sort implementations share a large number of similar code structures:

[Figure: two bubble sort implementations whose ASTs contain similar subtrees]

The unsupervised downstream tasks of InferCode are code clustering, clone detection, and cross-language code search. The supervised transfer-learning tasks are code classification and code recommendation.

Method

Overview

InferCode is similar to Doc2vec: it treats an AST as a document and each subtree as a word in that document. Given a set of ASTs $\{T_{1}, T_{2}, \ldots, T_{n}\}$ and, for each $T_i$, the set of all its subtrees $\{\ldots, T_{ij}, \ldots\}$, both $T_i$ and $T_{ij}$ are represented as $D$-dimensional vectors. Treating $T_i$ as the context in which subtree $T_{ij}$ appears, training maximizes the log-likelihood $\sum_{j} \log \Pr\left(T_{ij} \mid T_{i}\right)$. The steps of InferCode are therefore:

1. Preprocess the dataset: generate an AST for each piece of code and identify its set of subtrees. The subtree sets of all code pieces are accumulated into a subtree corpus.
2. Encode each AST with TBCNN to produce a code vector, which is used to predict subtrees.
3. Use the trained encoder for downstream tasks.

Identifying subtrees

Traverse the AST and take every node that meets the specified condition as the root of a subtree. The condition is that the node type is one of (expr_stmt, decl_stmt, expr, condition). In addition, some key nodes such as if, for, and while are also considered: each of these nodes on its own is treated as a subtree.

The selection therefore favors relatively small subtrees. The authors argue that such fine-grained subtrees occur more often across code fragments. Coarse-grained subtrees, such as whole if, while, and for statement blocks, are too large. Treated as single words, they appear infrequently in the code base, so the encoder struggles to learn meaningful representations from them directly. Moreover, two large subtrees with different syntax do not necessarily differ in function, and the encoder struggles to learn that similarity.

The paper gives an example of the subtrees identified in the bubble sort code:
[Figure: subtrees identified in the bubble sort code]
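The subtree selection described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the `serialize` helper, and the toy AST are hypothetical, and only the node type names come from the text above.

```python
from dataclasses import dataclass, field
from typing import List

# Node types used as subtree roots (taken from the text above).
ROOT_TYPES = {"expr_stmt", "decl_stmt", "expr", "condition"}
# Key nodes that are kept as single-node subtrees (taken from the text above).
SINGLE_NODE_TYPES = {"if", "for", "while"}


@dataclass
class Node:
    """Hypothetical minimal AST node: a type name plus child nodes."""
    type: str
    children: List["Node"] = field(default_factory=list)


def serialize(node: Node) -> str:
    """Render a subtree as a nested string so it can act as a vocabulary key."""
    if not node.children:
        return node.type
    return node.type + "(" + ",".join(serialize(c) for c in node.children) + ")"


def identify_subtrees(root: Node) -> List[str]:
    """Walk the AST in pre-order and collect the selected subtrees."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.type in ROOT_TYPES:
            found.append(serialize(node))      # whole subtree rooted at this node
        elif node.type in SINGLE_NODE_TYPES:
            found.append(node.type)            # the node alone, without its block
        stack.extend(reversed(node.children))  # keep left-to-right visiting order
    return found


# Toy AST for something like: while (cond) { x = y; }
tree = Node("while", [
    Node("condition", [Node("expr", [Node("name")])]),
    Node("block", [
        Node("expr_stmt", [
            Node("expr", [Node("name"), Node("operator"), Node("name")]),
        ]),
    ]),
])
subtrees = identify_subtrees(tree)
```

In this toy tree the `while` node contributes only itself, while the small `condition`, `expr`, and `expr_stmt` nodes each contribute the full subtree below them. Collecting these strings over a whole dataset would give the subtree corpus mentioned in the overview.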

Learning code representations

First, the multi-way AST is converted into a binary tree, and then a Tree-based CNN (TBCNN) is used as the encoder.

The initial embedding of each node is obtained by passing its type embedding and token embedding through a linear layer.

The convolution layers of TBCNN then learn the information of each node.

TBCNN (AAAI'16)

The idea of TBCNN is similar to GCN. TBCNN designs three convolution kernels for tree-structured data, $\mathbf{W}^{t}, \mathbf{W}^{l}, \mathbf{W}^{r} \in \mathbb{R}^{D \times D}$, named "top", "left", and "right". A convolution window of depth $d$ over the AST contains $K=2^{d}-1$ nodes $\left[\mathbf{x}_{1}, \ldots, \mathbf{x}_{K}\right]$, and the output of the window is:
$\mathbf{y}=\tanh \left(\sum_{i=1}^{K}\left[\eta_{i}^{t} \mathbf{W}^{t}+\eta_{i}^{l} \mathbf{W}^{l}+\eta_{i}^{r} \mathbf{W}^{r}\right] \mathbf{x}_{i}+\mathbf{b}\right)$
That is, for every node in the window, the convolution weight matrix is a different linear combination of the three kernels, with coefficients $\eta_{i}^{t}, \eta_{i}^{l}, \eta_{i}^{r}$ determined by the depth and position of the node in the window. This produces the effect of a tree convolution.
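The window formula above can be sketched in plain Python. The coefficients $\eta$ are passed in as inputs because this text does not say how they are computed; the example values below (a parent weighted purely by "top", its two children purely by "left" and "right") and the function name `tree_conv_window` are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def tree_conv_window(
    xs: Sequence[Vector],
    etas: Sequence[Tuple[float, float, float]],
    w_t: Matrix,
    w_l: Matrix,
    w_r: Matrix,
    b: Vector,
) -> Vector:
    """y = tanh(sum_i [eta_t*W_t + eta_l*W_l + eta_r*W_r] x_i + b)."""
    acc = list(b)
    for x, (eta_t, eta_l, eta_r) in zip(xs, etas):
        top, left, right = matvec(w_t, x), matvec(w_l, x), matvec(w_r, x)
        for k in range(len(acc)):
            acc[k] += eta_t * top[k] + eta_l * left[k] + eta_r * right[k]
    return [math.tanh(a) for a in acc]


# Depth-2 window (K = 2^2 - 1 = 3 nodes) with D = 2: a parent and two children.
w_t = [[1.0, 0.0], [0.0, 1.0]]
w_l = [[0.5, 0.0], [0.0, 0.5]]
w_r = [[-0.5, 0.0], [0.0, -0.5]]
xs = [[1.0, 0.0], [0.0, 2.0], [0.0, 4.0]]                   # parent, left, right
etas = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]  # illustrative values
y = tree_conv_window(xs, etas, w_t, w_l, w_r, b=[0.0, 0.0])
```

Because each node mixes the same three matrices with its own coefficients, the number of parameters stays fixed regardless of where a node sits in the window.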

After convolution, each node has a feature vector. InferCode uses attention instead of max pooling to aggregate the information of all nodes into a vector representation of the code fragment. This is implemented with a learnable attention vector:

[Figure: attention aggregation formula]

The final $\vec{v}$ is the vector representation of the code.
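The aggregation formula itself is not reproduced in this text, so the following is only a sketch of one common form of attention pooling with a single learnable vector: each node is scored by its dot product with the attention vector `a`, the scores are normalized with a softmax, and $\vec{v}$ is the weighted sum of node vectors. Whether InferCode uses exactly this scoring function is an assumption, and the name `attention_pool` is hypothetical.

```python
import math
from typing import List


def attention_pool(nodes: List[List[float]], a: List[float]) -> List[float]:
    """Aggregate node vectors into one code vector using attention weights."""
    # Score each node by its dot product with the attention vector (assumed form).
    scores = [sum(h_k * a_k for h_k, a_k in zip(h, a)) for h in nodes]
    # Softmax over the scores; subtracting the max keeps exp() stable.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of node vectors.
    dim = len(a)
    return [sum(alpha * h[k] for alpha, h in zip(alphas, nodes)) for k in range(dim)]


nodes = [[1.0, 0.0], [0.0, 1.0]]
v_uniform = attention_pool(nodes, a=[0.0, 0.0])   # zero scores give equal weights
v_focused = attention_pool(nodes, a=[10.0, 0.0])  # first node dominates
```

Unlike max pooling, every node contributes to the result, and the contribution of each node is learned through `a`.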

Predicting subtrees

Recall that a number of subtrees were extracted during data preprocessing. These subtrees form a vocabulary $L$ (its size in the source code is 100,000). Each subtree is assigned a learnable vector representation, and a softmax then gives the probability that a subtree appears in the code fragment:
$\text{for } l_{i} \in L: \quad q\left(l_{i}\right)=\frac{\exp \left(\vec{v}^{T} \cdot \mathbf{W}_{i}^{\text{subtrees}}\right)}{\sum_{l_{j} \in L} \exp \left(\vec{v}^{T} \cdot \mathbf{W}_{j}^{\text{subtrees}}\right)}$
In essence, each subtree is treated as a label. By predicting whether a subtree label is present in the code, training checks whether the vector representation learned for the code contains the information of that subtree. Put differently, the learned code representation is a combination of subtree information.
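The softmax over the subtree vocabulary, and the loss that goes with the log-likelihood objective from the overview, can be sketched as follows. The function names are hypothetical, and summing the negative log-probabilities of the subtrees present in a snippet is one straightforward reading of that objective rather than a confirmed detail of the released code.

```python
import math
from typing import List, Sequence


def subtree_probabilities(v: List[float], w_subtrees: List[List[float]]) -> List[float]:
    """Softmax over dot products between the code vector and each subtree vector."""
    logits = [sum(v_k * w_k for v_k, w_k in zip(v, row)) for row in w_subtrees]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(
    v: List[float], w_subtrees: List[List[float]], present: Sequence[int]
) -> float:
    """Loss for one snippet: minus the sum of log q(l_i) over its subtrees."""
    q = subtree_probabilities(v, w_subtrees)
    return -sum(math.log(q[i]) for i in present)


# Toy setup: D = 2 and a vocabulary of |L| = 3 subtrees.
v = [1.0, 0.0]
w_subtrees = [[2.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
q = subtree_probabilities(v, w_subtrees)
loss = negative_log_likelihood(v, w_subtrees, present=[0])
```

Minimizing this loss over all snippets is the same as maximizing $\sum_{j} \log \Pr(T_{ij} \mid T_{i})$. With a vocabulary of 100,000 subtrees a full softmax is costly, so a practical implementation would likely use a sampled approximation; that choice is not covered by this text.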

The trainable parameters of InferCode are therefore $\mathbf{W}^{\text{type}}, \mathbf{W}^{\text{token}}, \mathbf{W}^{t}, \mathbf{W}^{l}, \mathbf{W}^{r} \in \mathbb{R}^{D \times D}$, $a \in \mathbb{R}^{D}$, and $\mathbf{W}^{\text{subtrees}} \in \mathbb{R}^{|L| \times D}$, so the model remains lightweight.

Evaluation

There are three unsupervised tasks (code clustering, clone detection, and cross-language code search) and two supervised tasks (code classification and code recommendation).

Code clustering

Two datasets are used: the OJ dataset, which contains 52,000 C code snippets belonging to 104 classes, and the Sorting Algorithm dataset, which has 10 classes of Java sorting code with 1,000 code snippets per class.

Clustering quality is evaluated with the Adjusted Rand Index. Let C denote the ground-truth classification and K the clustering result. Define a as the number of instance pairs that are in the same class in C and in the same cluster in K. Define b as the number of instance pairs that are in different classes in C and in different clusters in K. The Rand Index (RI) over $n$ samples is defined as:
$RI=\frac{a+b}{\binom{n}{2}}$
That is, RI is the fraction of all sample pairs on which the clustering result agrees with the ground truth. The closer RI is to 1, the better. However, the Rand Index does not guarantee that a random partition scores close to 0. The Adjusted Rand Index (ARI) is therefore proposed:
$ARI=\frac{RI-E[RI]}{\max (RI)-E[RI]}$
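Both indices can be computed by counting pairs. The sketch below uses the standard pair-counting form of the Adjusted Rand Index, which is assumed to be the variant meant here. The function name `rand_indices` is hypothetical, and the input must contain at least two samples.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence, Tuple


def rand_indices(truth: Sequence[Hashable], pred: Sequence[Hashable]) -> Tuple[float, float]:
    """Return (Rand Index, Adjusted Rand Index) for two labelings of n >= 2 samples."""
    total_pairs = comb(len(truth), 2)
    # Pairs grouped together in both labelings, in C only, and in K only.
    same_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())

    a = same_both                                          # together in C and in K
    b = total_pairs - same_truth - same_pred + same_both   # apart in C and in K
    ri = (a + b) / total_pairs

    expected = same_truth * same_pred / total_pairs        # chance-level agreement
    max_index = (same_truth + same_pred) / 2
    if max_index == expected:
        ari = 1.0  # degenerate case, e.g. both labelings put everything together
    else:
        ari = (same_both - expected) / (max_index - expected)
    return ri, ari


# Four samples: the clustering splits the second true class into two clusters.
ri, ari = rand_indices([0, 0, 1, 1], [0, 0, 1, 2])
# A clustering that matches the truth up to renaming of the cluster ids.
ri_perfect, ari_perfect = rand_indices([0, 0, 1, 1], [1, 1, 0, 0])
```

In the first example five of the six pairs agree, so RI is 5/6, while ARI is 4/7 after correcting for chance. Cluster ids carry no meaning, so the renamed labeling in the second example scores 1 on both indices.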

Clone detection

50,000 clone code pairs and 50,000 non-clone code pairs were constructed as positive and negative samples.

Cross language code search

About 3,000 code samples implementing specific algorithms in Java, C, C++, and C# were collected from Rosetta Code, and 5,000 code files per language were randomly sampled from GitHub.

Code classification

A supervised task: a classifier is added on top of InferCode and fine-tuned.

Code recommendation

A supervised task: a softmax layer is added on top of InferCode and fine-tuned.

Choice of pretext-task labels

Besides subtrees, code tokens or method names can also be used as the prediction labels of the pre-training task.

Copyright notice
This article was written by [chad_ lee]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/206/202207251110593753.html