
ACL 2022 | Zero-Shot Multilingual Extractive Text Summarization Based on Neural Label Search

2022-06-26 17:05:00 PaperWeekly


Author | Machine Heart editorial team

Source | Machine Heart (机器之心)

This work targets zero-shot multilingual summarization tasks in languages such as French, German, Spanish, Russian, and Turkish, and significantly improves over the baseline models on the multilingual summarization dataset MLSUM.

Text summarization has achieved strong performance in English, mainly thanks to large-scale pre-trained language models and abundant annotated corpora. For other, lower-resource languages, however, large-scale annotated data remains hard to obtain.

The Institute of Information Engineering of the Chinese Academy of Sciences and Microsoft Research Asia jointly propose a zero-shot multilingual extractive text summarization model. The method applies an extractive summarization model trained on English directly to documents in other low-resource languages. To counter the monolingual label bias that arises in the multilingual zero-shot setting, the authors propose a multilingual label (Multilingual Label) annotation algorithm and a neural label search model (Neural Label Search for Summarization, NLSSum).

Experimental results show that NLSSum outperforms the baseline models in every language of the multilingual summarization dataset MLSUM. On the Russian (Ru) dataset in particular, the zero-shot model comes close to a model trained on the full supervised data.

The work was published as a long paper at the ACL 2022 main conference.


Paper title:

Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization

Paper link:

https://aclanthology.org/2022.acl-long.42.pdf


Introduction

With the development of BERT in natural language processing, models pre-trained on large-scale unlabeled data have attracted wide attention.

In recent years, a large body of work has pre-trained on unlabeled corpora in multiple languages, yielding pre-trained models that support many languages, such as mBERT, XLM, and XLMR. These multilingual pre-trained models achieve good performance on cross-lingual downstream tasks, including zero-shot ones. Notably, the zero-shot performance of the XLMR model on the XNLI dataset already reaches the fine-tuned level of other models. This provides a foundation for exploring zero-shot multilingual tasks.

In monolingual extractive text summarization, a dataset usually contains only source documents and human-written abstractive summaries, so a greedy sentence-labeling algorithm is needed to assign an extraction label to every sentence in the source document. This labeling procedure, however, is designed for a single language: it introduces a monolingual label bias and still needs optimization for multilingual tasks. The table below illustrates the problem.
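For concreteness, here is a minimal Python sketch of this kind of greedy labeling: sentences are added one at a time as long as they improve the ROUGE score against the human-written summary, and the selected indices receive label 1. The simplified `rouge_n` below is a stand-in for a real ROUGE implementation, and the stopping criterion is an assumption for illustration, not the paper's exact procedure.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """Simplified ROUGE-N recall: matched n-grams / reference n-grams."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total

def greedy_label(doc_sents, summary, max_sents=3):
    """GetPosLabel-style greedy labeling: repeatedly add the sentence that
    most improves ROUGE-1 + ROUGE-2 against the reference summary; the
    returned indices are the sentences that receive label 1."""
    selected, best_score = [], 0.0
    for _ in range(max_sents):
        best_idx = None
        for i in range(len(doc_sents)):
            if i in selected:
                continue
            cand = [tok for j in selected + [i] for tok in doc_sents[j]]
            score = rouge_n(cand, summary, 1) + rouge_n(cand, summary, 2)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:   # no remaining sentence improves the score
            break
        selected.append(best_idx)
    return sorted(selected)
```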


▲ Table 1: Monolingual label bias in the multilingual zero-shot setting

The example in Table 1 above is drawn from CNN/DM, by far the most widely used dataset in summarization research. CNN/DM is an English dataset: the upper half of the example shows a sentence from the original English document together with the human-written English summary, while the lower half shows the same document and summary translated into German with Marian, Microsoft's open-source industrial machine translation model. The English sentence in the example is highly similar to the human-written summary and therefore receives a high ROUGE score.

For the sentence and summary translated into German, however, the similarity between the two is much lower, and the corresponding ROUGE score drops as well. In such cases, a multilingual text summarization model trained directly on labels annotated in the English-language environment is not optimal in other languages' environments.

This example shows that the same sentence can receive deviating labels in different language environments; that is, the current greedy labeling method cannot meet the needs of zero-shot multilingual text summarization.

To solve this monolingual label bias problem in zero-shot multilingual extractive text summarization, we propose a multilingual labeling algorithm. On top of the original monolingual labels, several additional sets of sentence labels with cross-lingual interaction are constructed on the CNN/DM dataset using machine translation and bilingual dictionaries. A neural label search model (NLSSum) is then designed to make full use of these label sets when supervising the summarization model.

In the NLSSum model, hierarchical weights are computed for these label sets at the sentence level (Sentence-Level) and the set level (Set-Level). During training, the sentence-level and set-level weight predictors are trained on the English labeled corpus jointly with the summary extractor. At inference time, only the summary extractor is used to extract summaries in the other languages.


Technical Overview

To address the monolingual label bias in zero-shot multilingual summarization tasks, we propose a neural label search model: a neural network searches for the weights of the multilingual labels, and the weighted labels are then used to supervise the extractor. The procedure consists of the following five steps:

• Multilingual data augmentation: translate the original English documents and apply bilingual-dictionary word replacement to reduce the deviation from the target language;

• Multilingual labels: the summarization model is ultimately supervised by multilingual labels, which comprise 4 label sets, each annotated with a different strategy;

• Neural label search: design hierarchical weight predictors for the different label sets, at the sentence level (Sentence-Level) and the set level (Set-Level); the weighted labels then supervise the summarization model;

• Fine-tuning: fine-tune the neural summary extraction model with the augmented document data and the weighted-average multilingual labels;

• Zero-shot multilingual summary extraction: the model trained on English annotated data extracts summary sentences directly from low-resource-language documents.


▲ Figure 1: Multilingual labels

As shown in Figure 1 above, four sets of multilingual labels (Ua, Ub, Uc, Ud) are designed on the original English document D and the human-written summary S. They are constructed as follows:

1. Label set Ua: define Ua = GetPosLabel(D, S) as the set of sentences extracted as the summary from document D and human-written summary S by the greedy algorithm, where GetPosLabel returns the indices of the sentences labeled 1. Using (D, S) yields summary sentences from the English document only; since this result is not optimal for other languages, three further label sets are designed.

2. Label set Ub: first translate both the original English document and the human-written summary into the target language with the machine translation model MarianMT, denoted D_MT and S_MT, then use Ub = GetPosLabel(D_MT, S_MT) to obtain the index set of summary sentences on the translated document. Going through machine translation amounts to expressing the original English semantics in the syntactic structure of the target language, so the extracted sentences reflect the bias that the target language's syntax imposes on summary-relevant information.

3. Label set Uc: first machine-translate the original English document into the target language, D_MT; then replace the human-written English summary into the target language with a bilingual dictionary, S_WR (every word in the summary is replaced); finally use Uc = GetPosLabel(D_MT, S_WR) to obtain the summary sentence indices produced by the interaction of translation and word replacement. Machine translation substitutes the target language's syntactic structure on the document side, while dictionary replacement preserves the source language's syntactic structure yet keeps the summary in the same language as the document, so this captures the interaction between the target and source languages in extracting summary sentences.

4. Label set Ud: the document remains the original English D; the summary is first machine-translated into the target language and then converted back to English by bilingual-dictionary word replacement, denoted S′. Finally, Ud = GetPosLabel(D, S′) yields this set of summary sentence labels. Here the original document is unchanged while the summary carries the target language's syntactic structure, so the interaction between the target and source languages in extracting summary sentences is obtained once more.

Note that when using GetPosLabel(D, S), D and S must be expressed in the same language, because the greedy labeling algorithm essentially matches at the word level. Many other constructions of multilingual labels are possible; these four are representative choices. Since the machine translation model and the bilingual-dictionary replacement used here may introduce additional errors, appropriate weights must be learned for these label sets.
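Putting the four constructions together, here is a minimal sketch that reuses the `greedy_label` function from earlier. `translate` stands in for the MarianMT model and the two dictionaries for bilingual resources such as MUSE; the word-by-word replacement policy is an assumption for illustration, not the authors' released code.

```python
def word_replace(tokens, bilingual_dict):
    """Bilingual-dictionary word replacement; tokens without an entry are
    kept unchanged (an assumed policy, for illustration)."""
    return [bilingual_dict.get(tok.lower(), tok) for tok in tokens]

def build_label_sets(doc_sents, summary, translate, en2tgt, tgt2en):
    """Construct Ua, Ub, Uc, Ud for one (document, summary) pair.

    doc_sents: tokenized English sentences; summary: tokenized English summary.
    translate: callable mapping a token list into the target language
               (standing in for MarianMT); en2tgt / tgt2en: bilingual dicts.
    """
    doc_mt = [translate(sent) for sent in doc_sents]   # D_MT
    summary_mt = translate(summary)                    # S_MT
    summary_wr = word_replace(summary, en2tgt)         # S_WR
    summary_back = word_replace(summary_mt, tgt2en)    # S'

    ua = greedy_label(doc_sents, summary)       # English doc + English summary
    ub = greedy_label(doc_mt, summary_mt)       # translated doc + translated summary
    uc = greedy_label(doc_mt, summary_wr)       # translated doc + word-replaced summary
    ud = greedy_label(doc_sents, summary_back)  # English doc + back-converted summary
    return ua, ub, uc, ud
```

Note how each call keeps the document and summary in the same language, as the GetPosLabel constraint above requires.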

As shown in Figure 2 below, given the multilingual label sets (Ua, Ub, Uc, Ud), a neural label search model is designed to assign weights to the different label sets. The weights consist of two parts, the sentence level (Sentence-Level) and the set level (Set-Level), and for these two levels we define two weight predictors: a sentence-level predictor Transformer_alpha and a set-level predictor Transformer_beta.
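The following PyTorch sketch illustrates the weighting idea: sentence-level weights from a transformer encoder and set-level weights over the four label sets are combined into one soft label per sentence, which then supervises the extractor through a binary cross-entropy loss. The layer sizes and the reduction of Transformer_beta to a vector of learned logits are simplifications for brevity, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralLabelSearch(nn.Module):
    """Sketch: turn 4 binary label sets into one soft label per sentence."""
    def __init__(self, d_model=768, n_sets=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.alpha_encoder = nn.TransformerEncoder(layer, num_layers=2)  # Transformer_alpha
        self.alpha_head = nn.Linear(d_model, n_sets)
        # Transformer_beta is reduced to learned per-set logits to keep the
        # sketch short; the paper uses a second predictor network.
        self.beta_logits = nn.Parameter(torch.zeros(n_sets))

    def forward(self, sent_reprs, label_sets):
        # sent_reprs: (batch, n_sents, d_model) sentence representations
        # label_sets: (batch, n_sents, n_sets) float 0/1 labels Ua..Ud
        h = self.alpha_encoder(sent_reprs)
        alpha = torch.softmax(self.alpha_head(h), dim=-1)  # sentence-level weights
        beta = torch.softmax(self.beta_logits, dim=-1)     # set-level weights
        weights = alpha * beta                             # combine both levels
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return (weights * label_sets).sum(dim=-1)          # soft label per sentence

# Training sketch: the soft labels supervise the extractor's per-sentence
# scores (the extractor is assumed to be XLMR-based):
#   soft = label_search(sent_reprs, label_sets)
#   loss = F.binary_cross_entropy_with_logits(extractor_logits, soft)
```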


▲ Figure 2: The multilingual neural label search summarization model (NLSSum)


Experimental Results

NLSSum assigns different weights to the label sets of the multilingual labels through neural search and finally obtains the weighted-average label, which is used to train the summarization model on the English dataset. Compared with monolingual labels, multilingual labels carry more cross-lingual semantic and syntactic information, so the model improves substantially over the baselines.

As shown in Table 2 below, the experiments use the CNN/DM and MLSUM datasets. MLSUM is the first large-scale multilingual text summarization dataset; it contains 1.5 million documents and summaries crawled from news websites, covering five languages: French (Fr), German (De), Spanish (Es), Russian (Ru), and Turkish (Tr). MLSUM is used at inference/test time to verify the zero-shot cross-lingual transfer ability of the multilingual models, while training uses CNN/DM, the most common English dataset in text summarization.
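For readers who want to inspect the data, MLSUM is distributed, for example, through the Hugging Face hub with one configuration per language. The config and field names below assume the hub copy of the dataset; recent versions of `datasets` may also require `trust_remote_code=True` for script-based datasets like this one.

```python
from datasets import load_dataset

# One config per language: "de", "es", "fr", "ru", "tu" (Turkish).
mlsum_de = load_dataset("mlsum", "de", split="test")
sample = mlsum_de[0]
print(sample["text"][:200])   # source news article
print(sample["summary"])      # human-written summary
```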


▲ Table 2: ROUGE results on the MLSUM dataset

Table 2 compares the ROUGE results of the baseline models on the MLSUM dataset. The table is divided into three parts.

• The first part shows simple baselines such as Oracle and Lead;

• The second part shows baselines based on supervised learning, where (TrainAll) means training on the datasets of all languages and (TrainOne) means training separately on each language's dataset;

• The third part shows the unsupervised results; all of these models are trained only on the English dataset.

From the second part it is easy to see that under supervised learning, abstractive summarization is more suitable than extractive summarization. In the third part, the extractive baseline XLMRSum outperforms the abstractive model MARGE, which indicates that extraction is the more suitable approach in the unsupervised setting.

In addition, when the original documents are augmented with machine translation and with bilingual-dictionary word replacement (baselines XLMRSum-MT and XLMRSum-WR), XLMRSum-MT degrades performance while XLMRSum-WR improves it; the final model therefore adopts bilingual-dictionary word replacement for data augmentation.
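A minimal sketch of this kind of input-side dictionary replacement; the toy dictionary, the replacement rate, and the random sampling policy are illustrative assumptions rather than the paper's exact recipe.

```python
import random

def word_replace_augment(tokens, bilingual_dict, rate=0.5, seed=0):
    """Replace a random subset of tokens with their dictionary translation;
    tokens without an entry are kept. The 50% rate is an assumption."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        target = bilingual_dict.get(tok.lower())
        out.append(target if target is not None and rng.random() < rate else tok)
    return out

# Toy English -> German dictionary for demonstration:
toy_dict = {"the": "die", "city": "Stadt", "government": "Regierung"}
print(word_replace_augment("The city government announced the plan".split(),
                           toy_dict, rate=1.0))
# -> ['die', 'Stadt', 'Regierung', 'announced', 'die', 'plan']
```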

Our NLSSum model likewise has two configurations: NLSSum-Sep replaces words in CNN/DM with a single target language and fine-tunes on that replaced data alone; NLSSum replaces words in CNN/DM with each of the target languages and fine-tunes on the replaced data of all languages together.

The final results show that NLSSum trained on all languages is better. The table supports the following conclusions:

• Input-side data augmentation based on a translation model introduces errors, so translation models should be avoided on the input side; conversely, bilingual-dictionary word replacement is a good augmentation method;

• Label construction does not touch the model input, so a machine translation model can be used to assist label generation.

As shown in Figure 3 below, visualization analysis offers a further look at how important information is distributed across languages: in English the important information is relatively concentrated, whereas in the other languages it is more scattered. This is an important reason why multilingual labels can improve model performance.


▲ Figure 3: Distribution of summary sentences across different languages

Future research will focus on: 1. finding a more reasonable multilingual sentence-level labeling algorithm; 2. studying how to improve summarization results for low-resource languages without degrading results on the English corpus.

