【AI4Code】《CoSQA: 20,000+ Web Queries for Code Search and Question Answering》 ACL 2021
2022-07-25 12:40:00 [chad_lee]
Similar to CLIP, the paper builds a binary NL-PL dataset of query-code pairs and then trains with CLIP-style two-modality alignment, adding contrastive learning on top via two data-augmentation methods. Both towers of the siamese encoder are CodeBERT.
The CoSQA dataset
The goal is code search that works like web image search: given a natural-language query describing a need, return a code implementation that satisfies it (today such a search usually returns blog posts instead). The paper puts considerable effort into constructing such a dataset, which looks roughly as shown.
The annotation also records finer-grained cases: code that partially satisfies the query, completely satisfies it, satisfies less than 50% of it, is merely related to the query, and so on.

Model

The model's input is a sequence of the form [CLS] xxxxxxxx [SEP]. The model is a siamese network: the query and the code are encoded with the same CodeBERT, and the model's output is the representation of [CLS].
\mathbf{q}_{i}=\mathrm{CodeBERT}\left(q_{i}\right), \quad \mathbf{c}_{i}=\mathrm{CodeBERT}\left(c_{i}\right)
Rather than simply computing similarity as the inner product of q and c, the model uses an MLP to compute the match between the two. The MLP's output is a vector, not a similarity score:
\mathbf{r}^{(i, i)}=\tanh \left(\mathbf{W}_{1} \cdot\left[\mathbf{q}_{i}, \mathbf{c}_{i}, \mathbf{q}_{i}-\mathbf{c}_{i}, \mathbf{q}_{i} \odot \mathbf{c}_{i}\right]\right)
A further single-layer network computes the similarity between them:
s^{(i, i)}=\operatorname{sigmoid}\left(\mathbf{W}_{2} \cdot \mathbf{r}^{(i, i)}\right)
Training then uses a BCE loss:
\mathcal{L}_{b}=-\left[y_{i} \cdot \log s^{(i, i)}+\left(1-y_{i}\right) \log \left(1-s^{(i, i)}\right)\right]
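The matching head and BCE loss above can be sketched in PyTorch. This is a minimal sketch: the CodeBERT encoders that would produce q and c are replaced by random stand-in vectors, and the layer sizes and class name are illustrative assumptions, not from the paper.

```python
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    """Scores a (query, code) pair: r = tanh(W1·[q, c, q-c, q⊙c]), s = sigmoid(W2·r)."""
    def __init__(self, dim=768, hidden=768):
        super().__init__()
        self.W1 = nn.Linear(4 * dim, hidden, bias=False)
        self.W2 = nn.Linear(hidden, 1, bias=False)

    def forward(self, q, c):
        # q, c: [batch, dim] -- the [CLS] representations from CodeBERT
        r = torch.tanh(self.W1(torch.cat([q, c, q - c, q * c], dim=-1)))
        return torch.sigmoid(self.W2(r)).squeeze(-1)  # s(i, i) in (0, 1)

head = MatchHead(dim=768)
q = torch.randn(4, 768)   # stand-ins for encoded queries
c = torch.randn(4, 768)   # stand-ins for encoded code snippets
s = head(q, c)
y = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = nn.functional.binary_cross_entropy(s, y)  # the BCE loss L_b
```

In a full implementation, `q` and `c` would come from the shared CodeBERT encoder rather than `torch.randn`.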
Contrastive learning
Besides the BCE loss, two augmentations are used: In-Batch Augmentation (IBA) and Query-Rewritten Augmentation (QRA).
IBA Loss
For each query, the codes of the other examples in the current batch are taken as negative samples, so each query gets multiple code negatives:
\mathcal{L}_{ib}=-\frac{1}{n-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} \log \left(1-s^{(i, j)}\right)
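Given pairwise scores from the matching head, the IBA loss can be sketched as follows. This assumes `s` is an n×n matrix where `s[i, j]` scores query i against code j; the function name and the numerical clamp are my additions.

```python
import torch

def iba_loss(s: torch.Tensor) -> torch.Tensor:
    """In-batch negatives: for each query i, every other code j != i in the
    batch is a negative, so s(i, j) is pushed toward 0."""
    n = s.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool)          # mask out the positives s(i, i)
    per_pair = -torch.log((1 - s).clamp_min(1e-8))      # -log(1 - s(i, j))
    return per_pair[off_diag].view(n, n - 1).mean(dim=1)  # average over n-1 negatives per query

scores = torch.rand(3, 3) * 0.9   # toy score matrix in [0, 0.9)
per_query = iba_loss(scores)      # one loss value per query in the batch
```

When all off-diagonal scores are 0 the loss is 0, and it grows as in-batch negatives score closer to 1, matching the formula above.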
QRA Loss
Because web queries are usually short and not guaranteed to be grammatical, each query in a query-code pair labeled 1 is rewritten and modified in several ways: randomly deleting a word, randomly switching the positions of two words, and randomly copying a word.
This gives each code multiple query positives. QRA is applied on top of the IBA loss:
\mathcal{L}_{qr}=\mathcal{L}_{b}^{\prime}+\mathcal{L}_{ib}^{\prime}
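The three query rewrites (applied only to positive pairs) can be sketched as simple word-level operations. The function names and the use of Python's `random` module are my assumptions; the paper only specifies the three operations themselves.

```python
import random

def delete_word(words):
    """Randomly delete one word (the query keeps at least one word)."""
    i = random.randrange(len(words))
    return words[:i] + words[i + 1:]

def switch_words(words):
    """Randomly swap the positions of two words."""
    out = list(words)
    i, j = random.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out

def copy_word(words):
    """Randomly duplicate one word in place."""
    i = random.randrange(len(words))
    return words[:i + 1] + [words[i]] + words[i + 1:]

query = "how to read a json file in python".split()
rewrites = [rewrite(query) for rewrite in (delete_word, switch_words, copy_word)]
```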
Experiments
The paper's code contrastive learning method (CoCLR) is a training method; the experiments continue training from a pretrained CodeBERT.
There are two tasks: Code Question Answering, whose test set is split directly from the training set, and code search.
CodeBERT+CoSQA explicitly aligns the two languages on top of BERT and yields a certain improvement, but the contrastive-learning augmentations work best, and the most useful one is in-batch negative sampling. Among the query rewrites, switching word order helps the most, which is also the most intuitive result: swapping two words usually does not hinder reading comprehension.