【AI4Code】《CoSQA: 20,000+ Web Queries for Code Search and Question Answering》 ACL 2021
2022-07-25 12:40:00 [chad_lee]
Similar to CLIP, the paper builds a binary NL-PL dataset of query-code pairs and then trains with CLIP-style two-modality alignment, adding contrastive learning on top via two data-augmentation methods. Both towers of the siamese encoder are CodeBERT.
The CoSQA dataset
The goal is code search that works like web image search: given a natural-language query describing a need, return a code implementation that satisfies it (today such a search usually returns blog posts instead). The paper puts considerable effort into constructing such a dataset, which looks roughly as shown.
The annotation also records finer-grained cases: code that partially satisfies the query, completely satisfies it, satisfies less than 50% of it, is merely related to the query, and so on.

Model

The model's input is a sequence of the form [CLS] xxxxxxxx [SEP]. The model is a siamese network: the query and the code are encoded with the same CodeBERT, and the model's output is the representation of [CLS].
\mathbf{q}_{i}=\mathrm{CodeBERT}\left(q_{i}\right), \quad \mathbf{c}_{i}=\mathrm{CodeBERT}\left(c_{i}\right)
Rather than simply computing similarity as the inner product of q and c, the model uses an MLP to compute the match between the two. The MLP's output is a vector, not a similarity score:
\mathbf{r}^{(i, i)}=\tanh \left(\mathbf{W}_{1} \cdot\left[\mathbf{q}_{i}, \mathbf{c}_{i}, \mathbf{q}_{i}-\mathbf{c}_{i}, \mathbf{q}_{i} \odot \mathbf{c}_{i}\right]\right)
A further single-layer network computes the similarity between them:
s^{(i, i)}=\operatorname{sigmoid}\left(\mathbf{W}_{2} \cdot \mathbf{r}^{(i, i)}\right)
Training then uses a BCE loss:
\mathcal{L}_{b}=-\left[y_{i} \cdot \log s^{(i, i)}+\left(1-y_{i}\right) \log \left(1-s^{(i, i)}\right)\right]
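The matching head and BCE loss above can be sketched in PyTorch. This is a minimal sketch: the CodeBERT encoders that would produce q and c are replaced by random stand-in vectors, and the layer sizes and class name are illustrative assumptions, not from the paper.

```python
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    """Scores a (query, code) pair: r = tanh(W1·[q, c, q-c, q⊙c]), s = sigmoid(W2·r)."""
    def __init__(self, dim=768, hidden=768):
        super().__init__()
        self.W1 = nn.Linear(4 * dim, hidden, bias=False)
        self.W2 = nn.Linear(hidden, 1, bias=False)

    def forward(self, q, c):
        # q, c: [batch, dim] -- the [CLS] representations from CodeBERT
        r = torch.tanh(self.W1(torch.cat([q, c, q - c, q * c], dim=-1)))
        return torch.sigmoid(self.W2(r)).squeeze(-1)  # s(i, i) in (0, 1)

head = MatchHead(dim=768)
q = torch.randn(4, 768)   # stand-ins for encoded queries
c = torch.randn(4, 768)   # stand-ins for encoded code snippets
s = head(q, c)
y = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = nn.functional.binary_cross_entropy(s, y)  # the BCE loss L_b
```

In a full implementation, `q` and `c` would come from the shared CodeBERT encoder rather than `torch.randn`.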
Contrastive learning
Besides the BCE loss, two augmentations are used: In-Batch Augmentation (IBA) and Query-Rewritten Augmentation (QRA).
IBA Loss
For each query, the codes of the other examples in the current batch are taken as negative samples, so each query gets multiple code negatives:
\mathcal{L}_{ib}=-\frac{1}{n-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} \log \left(1-s^{(i, j)}\right)
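Given pairwise scores from the matching head, the IBA loss can be sketched as follows. This assumes `s` is an n×n matrix where `s[i, j]` scores query i against code j; the function name and the numerical clamp are my additions.

```python
import torch

def iba_loss(s: torch.Tensor) -> torch.Tensor:
    """In-batch negatives: for each query i, every other code j != i in the
    batch is a negative, so s(i, j) is pushed toward 0."""
    n = s.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool)          # mask out the positives s(i, i)
    per_pair = -torch.log((1 - s).clamp_min(1e-8))      # -log(1 - s(i, j))
    return per_pair[off_diag].view(n, n - 1).mean(dim=1)  # average over n-1 negatives per query

scores = torch.rand(3, 3) * 0.9   # toy score matrix in [0, 0.9)
per_query = iba_loss(scores)      # one loss value per query in the batch
```

When all off-diagonal scores are 0 the loss is 0, and it grows as in-batch negatives score closer to 1, matching the formula above.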
QRA Loss
Because web queries are usually short and not guaranteed to be grammatical, each query in a query-code pair labeled 1 is rewritten and modified in several ways: randomly deleting a word, randomly switching the positions of two words, and randomly copying a word.
This gives each code multiple query positives. QRA is applied on top of the IBA loss:
\mathcal{L}_{qr}=\mathcal{L}_{b}^{\prime}+\mathcal{L}_{ib}^{\prime}
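The three query rewrites (applied only to positive pairs) can be sketched as simple word-level operations. The function names and the use of Python's `random` module are my assumptions; the paper only specifies the three operations themselves.

```python
import random

def delete_word(words):
    """Randomly delete one word (the query keeps at least one word)."""
    i = random.randrange(len(words))
    return words[:i] + words[i + 1:]

def switch_words(words):
    """Randomly swap the positions of two words."""
    out = list(words)
    i, j = random.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out

def copy_word(words):
    """Randomly duplicate one word in place."""
    i = random.randrange(len(words))
    return words[:i + 1] + [words[i]] + words[i + 1:]

query = "how to read a json file in python".split()
rewrites = [rewrite(query) for rewrite in (delete_word, switch_words, copy_word)]
```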
Experiments
The paper's code contrastive learning method (CoCLR) is a training method; the experiments continue training from a pretrained CodeBERT.
There are two tasks: Code Question Answering, whose test set is split directly from the training set, and code search.
CodeBERT+CoSQA explicitly aligns the two languages on top of BERT and yields a certain improvement, but the contrastive-learning augmentations work best, and the most useful one is in-batch negative sampling. Among the query rewrites, switching word order helps the most, which is also the most intuitive result: swapping two words usually does not hinder reading comprehension.