ACL 2022 | Zero-shot multilingual extractive text summarization based on neural label search
2022-06-26 17:05:00 【PaperWeekly】
Author | Machine Heart editorial team
Source | Machine Heart (机器之心)
This work addresses zero-shot multilingual summarization for French, German, Spanish, Russian, Turkish and other languages, and significantly improves over the baseline models on the multilingual summarization dataset MLSUM.
Extractive text summarization has achieved strong performance in English, mainly thanks to large-scale pre-trained language models and abundant annotated corpora. For other, lower-resource languages, however, large-scale annotated data remains hard to obtain.
The Institute of Information Engineering, Chinese Academy of Sciences, and Microsoft Research Asia jointly propose a zero-shot multilingual extractive text summarization model. The approach takes an extractive summarization model trained on English and applies it directly to documents in other low-resource languages; to counter the monolingual label bias that arises in the multilingual zero-shot setting, the authors propose a multilingual label (Multilingual Label) annotation algorithm and a neural label search model (Neural Label Search for Summarization, NLSSum).
Experiments show that NLSSum surpasses the baseline models in every language of the multilingual summarization dataset MLSUM. On the Russian (Ru) subset, the zero-shot model's performance approaches that of a model trained on the full supervised data.
The study is published as a long paper at the ACL 2022 main conference.
Paper title:
Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization
Paper link:
https://aclanthology.org/2022.acl-long.42.pdf
Introduction
With the development of BERT in natural language processing, models pre-trained on large-scale unlabeled data have attracted wide attention.
In recent years, a large body of work has trained on unlabeled corpora in many languages, producing pre-trained models that support multiple languages. Such multilingual pre-trained text models, for example mBERT, XLM and XLMR, achieve good performance on cross-lingual downstream tasks, and they also do well on zero-shot multilingual tasks. Notably, the zero-shot performance of XLMR on the XNLI dataset already reaches the fine-tuned level of other models. This provides the basis for exploring zero-shot multilingual summarization.
In monolingual extractive text summarization, a dataset usually contains only the source documents and human-written abstracts, so a greedy sentence-labeling algorithm is used to assign an extraction label to each sentence in the source text. This algorithm, however, is a monolingual annotation method and leads to monolingual label bias, so it still needs optimization for multilingual tasks. The table below illustrates the problem.
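For concreteness, here is a minimal sketch of such a greedy labeling procedure in Python. The `rouge_score` helper (any callable returning a ROUGE-style F-score between a candidate and the reference summary) and the cap of three selected sentences are assumptions for illustration, not the paper's exact settings.

```python
# Greedy sentence labeling: pick source sentences one at a time,
# keeping each sentence only if it raises the ROUGE score of the
# selected set against the human-written summary.
from typing import Callable, List

def greedy_labels(doc_sents: List[str], summary: str,
                  rouge_score: Callable[[str, str], float],
                  max_sents: int = 3) -> List[int]:
    """Return a 0/1 label for each sentence in doc_sents."""
    selected: List[int] = []
    best = 0.0
    while len(selected) < max_sents:
        gain, pick = 0.0, None
        for i, sent in enumerate(doc_sents):
            if i in selected:
                continue
            cand = " ".join(doc_sents[j] for j in sorted(selected + [i]))
            score = rouge_score(cand, summary)
            if score - best > gain:   # keep the sentence with the largest gain
                gain, pick = score - best, i
        if pick is None:              # no sentence improves ROUGE further
            break
        selected.append(pick)
        best += gain
    return [1 if i in selected else 0 for i in range(len(doc_sents))]
```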
▲ Table 1. Monolingual label bias in the multilingual zero-shot setting
The example in Table 1 is drawn from CNN/DM, by far the most common dataset in the summarization field. CNN/DM is an English dataset: the upper half of the example shows a sentence from the original English document together with the human-written English summary; the lower half shows the same document and summary translated into German with Marian, Microsoft's open-source industrial translation model. The English sentence has high similarity with the human-written summary and therefore receives a high ROUGE score.
For the document and summary translated into German, however, the similarity between the two is low, and the corresponding ROUGE score drops as well. In this situation, a multilingual summarization model trained directly with labels annotated in the English-language environment is not optimal in other languages.
The example shows that the same sentence can receive different labels in different language environments; that is, the current greedy labeling method cannot meet the needs of zero-shot multilingual text summarization.
To solve this monolingual label bias problem in zero-shot multilingual extractive summarization, we propose a multilingual labeling algorithm. On top of the original monolingual labels, several additional sets of cross-lingually interacting sentence labels are constructed on the CNN/DM dataset using machine translation and bilingual dictionaries. To make full use of these label sets in supervising the summarization model, we design a neural label search model (NLSSum).
In NLSSum, hierarchical weights are computed for these label sets at the sentence level (Sentence-Level) and the set level (Set-Level). During training, the sentence-level and set-level weight predictors are trained jointly with the summary extractor on the English labeled corpus. At inference time on other languages, only the summary extractor is used.
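To make the inference side concrete, below is a minimal sketch of zero-shot extraction. The `extractor` callable, which maps a document's sentences to one relevance score each, is a placeholder for the trained XLMR-based extractor; its internals are omitted here.

```python
# Zero-shot inference sketch: the trained extractor scores each sentence
# of a target-language document, and the top-k sentences form the summary.
from typing import Callable, List

def extract_summary(sentences: List[str],
                    extractor: Callable[[List[str]], List[float]],
                    k: int = 3) -> List[str]:
    scores = extractor(sentences)
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep document order
```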
Method Overview
To address the monolingual label bias in zero-shot multilingual summarization tasks, we propose a neural label search model that uses a neural network to search for weights over the multilingual labels; the weighted labels then supervise the extractor. The procedure consists of the following five steps:
1. Multilingual data augmentation: the original English documents are translated, and bilingual dictionaries are applied, to reduce the gap to the target language (a word-replacement sketch follows this list);
2. Multilingual labels: the summarization model is ultimately supervised by multilingual labels, which comprise 4 label sets, each annotated with a different strategy;
3. Neural label search: hierarchical weight predictors are designed for the different label sets, at the sentence level (Sentence-Level) and the set level (Set-Level); the weighted labels supervise the summarization model;
4. Fine-tuning: the neural extractive summarization model is fine-tuned with the augmented documents and the weighted-average multilingual labels;
5. Zero-shot multilingual summary extraction: the model trained on English annotated data extracts summary sentences directly from low-resource-language documents.
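As referenced in step 1, here is a minimal sketch of bilingual-dictionary word replacement, the augmentation the final model ultimately adopts (see the experiments below). The plain `{english_word: target_word}` dictionary format and the lowercased lookup are assumptions made for this sketch.

```python
# Word-replacement data augmentation: swap each English token for its
# target-language counterpart when the bilingual dictionary has an
# entry, leaving unknown tokens unchanged. Lookup is on the lowercased
# token, so capitalization is not preserved in this toy version.
from typing import Dict, List

def word_replace(tokens: List[str], en2tgt: Dict[str, str]) -> List[str]:
    return [en2tgt.get(tok.lower(), tok) for tok in tokens]

# Example: English -> German replacement with a toy dictionary.
toy_dict = {"the": "die", "police": "polizei", "arrested": "verhaftete"}
print(word_replace("The police arrested him".split(), toy_dict))
# ['die', 'polizei', 'verhaftete', 'him']
```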
▲ Figure 1: Multilingual labels
As shown in Figure 1, four sets of multilingual labels (Ua, Ub, Uc, Ud) are constructed from the original English document D and the human-written summary s. The construction is as follows:
1. Label set Ua: define Ua = GetPosLabel(D, s) as the set of sentences extracted as the summary by the greedy algorithm from the document D and the human-written summary s, where GetPosLabel returns the indices of the sentences labeled 1. Using (D, s) yields summary sentences for the English document; this result is not optimal for other languages, so three more label sets are designed.
2. Label set Ub: the original English document and the human-written summary are both translated into the target language with the machine translation model MarianMT, denoted D_MT and s_MT, and Ub = GetPosLabel(D_MT, s_MT) gives the index set of summary sentences on the translated document. Borrowing the machine translation model in this way amounts to expressing the original English semantics with the syntactic structure of the target language, so the extracted summary sentences reflect the bias that the target language's syntax imposes on the summary information.
3. Label set Uc: the original English document is first machine-translated into the target language, giving D_MT; the English summary is then converted into the target language with a bilingual dictionary, replacing every word in the summary, giving s_WR; Uc = GetPosLabel(D_MT, s_WR) then gives the summary sentence indices from the interaction of translation and word replacement. Machine translation swaps in the target language's syntactic structure for the document, while bilingual-dictionary replacement preserves the source language's syntactic structure and keeps the summary in the same language as the document, so this set captures the interaction between the target and source languages in extracting summary sentences.
4. Label set Ud: the document remains the original English D; the summary is first machine-translated into the target language, then converted back to English by bilingual-dictionary word replacement, denoted s′. Finally, Ud = GetPosLabel(D, s′) gives the label set. Here the document stays unchanged while the summary carries the syntactic structure of the target language, again capturing the interaction between the target and source languages in extracting summary sentences.
Note that when using GetPosLabel(D, S), D and S must be in the same language, because the greedy labeling algorithm essentially matches at the word level. There are many other ways to construct multilingual labels; we selected a few representative ones. The machine translation model and bilingual-dictionary replacement used in these constructions may introduce additional errors, which is why appropriate weights must be learned for the label sets. A schematic sketch of the four constructions follows.
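In the sketch below, the helpers are passed in as callables: `get_pos_label` is the greedy labeler sketched earlier, `translate_mt` stands in for the MarianMT model, and `word_replace` for the bilingual-dictionary substitution above (here assumed to pick the right dictionary from a language code). All signatures are simplifications for illustration.

```python
# Schematic construction of the four label sets (Ua, Ub, Uc, Ud).
from typing import Callable, List, Tuple

Labels = List[int]

def build_label_sets(
    doc_en: List[str], summary_en: str, tgt: str,
    get_pos_label: Callable[[List[str], str], Labels],
    translate_mt: Callable[[str, str], str],
    word_replace: Callable[[str, str], str],
) -> Tuple[Labels, Labels, Labels, Labels]:
    # Ua: English document + English summary.
    u_a = get_pos_label(doc_en, summary_en)
    # Ub: both sides machine-translated into the target language.
    doc_mt = [translate_mt(s, tgt) for s in doc_en]
    sum_mt = translate_mt(summary_en, tgt)
    u_b = get_pos_label(doc_mt, sum_mt)
    # Uc: MT document + summary converted by dictionary word replacement
    # (keeps source-language syntax, target-language words).
    sum_wr = word_replace(summary_en, tgt)
    u_c = get_pos_label(doc_mt, sum_wr)
    # Ud: original English document + summary machine-translated to the
    # target language and then word-replaced back into English.
    sum_back = word_replace(sum_mt, "en")
    u_d = get_pos_label(doc_en, sum_back)
    return u_a, u_b, u_c, u_d
```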
As shown in Figure 2, given the multilingual label sets (Ua, Ub, Uc, Ud), a neural label search model assigns weights to the different label sets. The weights have two parts, sentence level (Sentence-Level) and set level (Set-Level), and for each we define a weight predictor: Transformer_alpha for sentence-level weights and Transformer_beta for set-level weights.
▲ Figure 2: Multilingual neural label search summarization model
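Below is a minimal PyTorch-style sketch of how sentence-level and set-level weights could combine the four label sets into one soft supervision signal. The linear predictors here merely stand in for the paper's Transformer_alpha and Transformer_beta; the architecture details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LabelSearch(nn.Module):
    """Combine 4 label sets with sentence-level (alpha) and set-level
    (beta) weights. sent_repr: [batch, n_sents, hidden] sentence
    representations; labels: [batch, 4, n_sents] binary labels Ua..Ud."""
    def __init__(self, hidden: int, n_sets: int = 4):
        super().__init__()
        # Stand-ins for Transformer_alpha / Transformer_beta.
        self.alpha = nn.Linear(hidden, n_sets)  # sentence-level weights
        self.beta = nn.Linear(hidden, n_sets)   # set-level weights

    def forward(self, sent_repr: torch.Tensor,
                labels: torch.Tensor) -> torch.Tensor:
        a = torch.softmax(self.alpha(sent_repr), dim=-1)             # [B, S, 4]
        b = torch.softmax(self.beta(sent_repr.mean(dim=1)), dim=-1)  # [B, 4]
        w = a * b.unsqueeze(1)                    # combine both levels
        w = w / w.sum(dim=-1, keepdim=True)       # renormalize per sentence
        # Weighted average over the 4 label sets -> soft label per sentence.
        soft_labels = (w * labels.float().transpose(1, 2)).sum(dim=-1)  # [B, S]
        return soft_labels
```

The resulting soft labels can then supervise the sentence extractor with, for example, a binary cross-entropy loss; at test time only the extractor is kept, matching the zero-shot setup described above.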
Experimental Results
NLSSum assigns different weights to the label sets in the multilingual labels (Multilingual Label) through neural search, finally obtaining a weighted-average label. This final label is used to train the summarization model on the English dataset. Compared with monolingual labels, multilingual labels carry more cross-lingual semantic and syntactic information, so the model improves substantially over the baselines.
As shown in Table 2 below, the experiments use the CNN/DM and MLSUM datasets. MLSUM is the first large-scale multilingual text summarization dataset: it contains 1.5 million documents and summaries crawled from news websites, covering five languages: French (Fr), German (De), Spanish (Es), Russian (Ru) and Turkish (Tr). MLSUM is used at inference time to verify the zero-shot cross-lingual transfer ability of the multilingual models; the training phase uses the CNN/DM English dataset, the most common dataset in the text summarization field.
▲ Table 2: ROUGE results on the MLSUM dataset
Here we compare the ROUGE results of each baseline model on the MLSUM dataset. The table is divided into three parts:
The first part shows simple baselines such as Oracle and Lead;
The second part shows baselines based on supervised learning, where (TrainAll) means training on the datasets of all languages and (TrainOne) means training separately on each language's dataset;
The third part shows unsupervised results, where all models are trained only on the English dataset.
From the second part it is easy to see that under supervised learning, abstractive summarization is more appropriate than extractive summarization. In the third part, the extractive baseline XLMRSum outperforms the abstractive model MARGE, which shows that extraction is more appropriate in the unsupervised setting.
In addition, when machine translation and bilingual-dictionary word replacement are used to augment the original documents (baselines XLMRSum-MT and XLMRSum-WR), XLMRSum-MT degrades performance while XLMRSum-WR improves it; the final model therefore adopts bilingual-dictionary word replacement for data augmentation.
Our NLSSum model likewise has two configurations: NLSSum-Sep word-replaces CNN/DM into the corresponding single target language and fine-tunes on it; NLSSum word-replaces CNN/DM into all target languages and fine-tunes on the replaced data of all languages together.
The final results show that NLSSum, trained across all languages, is better. From the table we can draw the following conclusions:
Input-side data augmentation based on translation models introduces errors, so translation models should be avoided on the input side; in contrast, bilingual-dictionary word replacement is a good augmentation method;
Label construction does not involve the model input, so machine translation models can be used to assist label generation.
As shown in Figure 3, visualization analysis lets us further study how important information is distributed across languages. Important information in English documents is concentrated toward the front, while in other languages it is more dispersed; this is an important reason why multilingual labels improve model performance.
▲ Figure 3: Distribution of summary sentences across languages
Future research will focus on: 1. finding a more reasonable multilingual sentence-level labeling algorithm; 2. studying how to improve summarization results for low-resource languages without degrading results on the English corpus.