ACL 2022 | Zero-shot multilingual extractive text summarization based on neural label search
2022-06-26 17:05:00 【PaperWeekly】
Author | Machine Heart editorial team
Source | Machine Heart (机器之心)
This work addresses zero-shot multilingual summarization for French, German, Spanish, Russian, Turkish and other languages, and significantly improves over the baseline models on the multilingual summarization dataset MLSUM.
Extractive text summarization has achieved strong performance in English, mainly thanks to large-scale pre-trained language models and abundant annotated corpora. For other, lower-resource languages, however, large-scale annotated data remains hard to obtain.
The Institute of Information Engineering, Chinese Academy of Sciences, and Microsoft Research Asia jointly propose a zero-shot multilingual extractive text summarization model. The approach takes an extractive summarization model trained on English and applies it directly to documents in other low-resource languages; to counter the monolingual label bias that arises in the multilingual zero-shot setting, the authors propose a multilingual label (Multilingual Label) annotation algorithm and a neural label search model (Neural Label Search for Summarization, NLSSum).
Experiments show that NLSSum surpasses the baseline models in every language of the multilingual summarization dataset MLSUM. On the Russian (Ru) subset, the zero-shot model's performance approaches that of a model trained on the full supervised data.
The study is published as a long paper at the ACL 2022 main conference.
Paper title:
Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization
Paper link:
https://aclanthology.org/2022.acl-long.42.pdf
Introduction
With the development of BERT in natural language processing, models pre-trained on large-scale unlabeled data have attracted wide attention.
In recent years, a large body of work has trained on unlabeled corpora in many languages, producing pre-trained models that support multiple languages. Such multilingual pre-trained text models, for example mBERT, XLM and XLMR, achieve good performance on cross-lingual downstream tasks, and they also do well on zero-shot multilingual tasks. Notably, the zero-shot performance of XLMR on the XNLI dataset already reaches the fine-tuned level of other models. This provides the basis for exploring zero-shot multilingual summarization.
In monolingual extractive text summarization, a dataset usually contains only the source documents and human-written abstracts, so a greedy sentence-labeling algorithm is used to assign an extraction label to each sentence in the source text. This algorithm, however, is a monolingual annotation method and leads to monolingual label bias, so it still needs optimization for multilingual tasks. The table below illustrates the problem.
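For concreteness, here is a minimal sketch of such a greedy labeling procedure in Python. The `rouge_score` helper (any callable returning a ROUGE-style F-score between a candidate and the reference summary) and the cap of three selected sentences are assumptions for illustration, not the paper's exact settings.

```python
# Greedy sentence labeling: pick source sentences one at a time,
# keeping each sentence only if it raises the ROUGE score of the
# selected set against the human-written summary.
from typing import Callable, List

def greedy_labels(doc_sents: List[str], summary: str,
                  rouge_score: Callable[[str, str], float],
                  max_sents: int = 3) -> List[int]:
    """Return a 0/1 label for each sentence in doc_sents."""
    selected: List[int] = []
    best = 0.0
    while len(selected) < max_sents:
        gain, pick = 0.0, None
        for i, sent in enumerate(doc_sents):
            if i in selected:
                continue
            cand = " ".join(doc_sents[j] for j in sorted(selected + [i]))
            score = rouge_score(cand, summary)
            if score - best > gain:   # keep the sentence with the largest gain
                gain, pick = score - best, i
        if pick is None:              # no sentence improves ROUGE further
            break
        selected.append(pick)
        best += gain
    return [1 if i in selected else 0 for i in range(len(doc_sents))]
```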
▲ Table 1. Monolingual label bias in the multilingual zero-shot setting
The example in Table 1 is drawn from CNN/DM, by far the most common dataset in the summarization field. CNN/DM is an English dataset: the upper half of the example shows a sentence from the original English document together with the human-written English summary; the lower half shows the same document and summary translated into German with Marian, Microsoft's open-source industrial translation model. The English sentence has high similarity with the human-written summary and therefore receives a high ROUGE score.
For the document and summary translated into German, however, the similarity between the two is low, and the corresponding ROUGE score drops as well. In this situation, a multilingual summarization model trained directly with labels annotated in the English-language environment is not optimal in other languages.
The example shows that the same sentence can receive different labels in different language environments; that is, the current greedy labeling method cannot meet the needs of zero-shot multilingual text summarization.
To solve this monolingual label bias problem in zero-shot multilingual extractive summarization, we propose a multilingual labeling algorithm. On top of the original monolingual labels, several additional sets of cross-lingually interacting sentence labels are constructed on the CNN/DM dataset using machine translation and bilingual dictionaries. To make full use of these label sets in supervising the summarization model, we design a neural label search model (NLSSum).
In NLSSum, hierarchical weights are computed for these label sets at the sentence level (Sentence-Level) and the set level (Set-Level). During training, the sentence-level and set-level weight predictors are trained jointly with the summary extractor on the English labeled corpus. At inference time on other languages, only the summary extractor is used.
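To make the inference side concrete, below is a minimal sketch of zero-shot extraction. The `extractor` callable, which maps a document's sentences to one relevance score each, is a placeholder for the trained XLMR-based extractor; its internals are omitted here.

```python
# Zero-shot inference sketch: the trained extractor scores each sentence
# of a target-language document, and the top-k sentences form the summary.
from typing import Callable, List

def extract_summary(sentences: List[str],
                    extractor: Callable[[List[str]], List[float]],
                    k: int = 3) -> List[str]:
    scores = extractor(sentences)
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep document order
```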
Method Overview
To address the monolingual label bias in zero-shot multilingual summarization tasks, we propose a neural label search model that uses a neural network to search for weights over the multilingual labels; the weighted labels then supervise the extractor. The procedure consists of the following five steps:
1. Multilingual data augmentation: the original English documents are translated, and bilingual dictionaries are applied, to reduce the gap to the target language (a word-replacement sketch follows this list);
2. Multilingual labels: the summarization model is ultimately supervised by multilingual labels, which comprise 4 label sets, each annotated with a different strategy;
3. Neural label search: hierarchical weight predictors are designed for the different label sets, at the sentence level (Sentence-Level) and the set level (Set-Level); the weighted labels supervise the summarization model;
4. Fine-tuning: the neural extractive summarization model is fine-tuned with the augmented documents and the weighted-average multilingual labels;
5. Zero-shot multilingual summary extraction: the model trained on English annotated data extracts summary sentences directly from low-resource-language documents.
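As referenced in step 1, here is a minimal sketch of bilingual-dictionary word replacement, the augmentation the final model ultimately adopts (see the experiments below). The plain `{english_word: target_word}` dictionary format and the lowercased lookup are assumptions made for this sketch.

```python
# Word-replacement data augmentation: swap each English token for its
# target-language counterpart when the bilingual dictionary has an
# entry, leaving unknown tokens unchanged. Lookup is on the lowercased
# token, so capitalization is not preserved in this toy version.
from typing import Dict, List

def word_replace(tokens: List[str], en2tgt: Dict[str, str]) -> List[str]:
    return [en2tgt.get(tok.lower(), tok) for tok in tokens]

# Example: English -> German replacement with a toy dictionary.
toy_dict = {"the": "die", "police": "polizei", "arrested": "verhaftete"}
print(word_replace("The police arrested him".split(), toy_dict))
# ['die', 'polizei', 'verhaftete', 'him']
```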
▲ Figure 1: Multilingual labels
As shown in Figure 1, four sets of multilingual labels (Ua, Ub, Uc, Ud) are constructed from the original English document D and the human-written summary s. The construction is as follows:
1. Label set Ua: define Ua = GetPosLabel(D, s) as the set of sentences extracted as the summary by the greedy algorithm from the document D and the human-written summary s, where GetPosLabel returns the indices of the sentences labeled 1. Using (D, s) yields summary sentences for the English document; this result is not optimal for other languages, so three more label sets are designed.
2. Label set Ub: the original English document and the human-written summary are both translated into the target language with the machine translation model MarianMT, denoted D_MT and s_MT, and Ub = GetPosLabel(D_MT, s_MT) gives the index set of summary sentences on the translated document. Borrowing the machine translation model in this way amounts to expressing the original English semantics with the syntactic structure of the target language, so the extracted summary sentences reflect the bias that the target language's syntax imposes on the summary information.
3. Label set Uc: the original English document is first machine-translated into the target language, giving D_MT; the English summary is then converted into the target language with a bilingual dictionary, replacing every word in the summary, giving s_WR; Uc = GetPosLabel(D_MT, s_WR) then gives the summary sentence indices from the interaction of translation and word replacement. Machine translation swaps in the target language's syntactic structure for the document, while bilingual-dictionary replacement preserves the source language's syntactic structure and keeps the summary in the same language as the document, so this set captures the interaction between the target and source languages in extracting summary sentences.
4. Label set Ud: the document remains the original English D; the summary is first machine-translated into the target language, then converted back to English by bilingual-dictionary word replacement, denoted s′. Finally, Ud = GetPosLabel(D, s′) gives the label set. Here the document stays unchanged while the summary carries the syntactic structure of the target language, again capturing the interaction between the target and source languages in extracting summary sentences.
Note that when using GetPosLabel(D, S), D and S must be in the same language, because the greedy labeling algorithm essentially matches at the word level. There are many other ways to construct multilingual labels; we selected a few representative ones. The machine translation model and bilingual-dictionary replacement used in these constructions may introduce additional errors, which is why appropriate weights must be learned for the label sets. A schematic sketch of the four constructions follows.
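In the sketch below, the helpers are passed in as callables: `get_pos_label` is the greedy labeler sketched earlier, `translate_mt` stands in for the MarianMT model, and `word_replace` for the bilingual-dictionary substitution above (here assumed to pick the right dictionary from a language code). All signatures are simplifications for illustration.

```python
# Schematic construction of the four label sets (Ua, Ub, Uc, Ud).
from typing import Callable, List, Tuple

Labels = List[int]

def build_label_sets(
    doc_en: List[str], summary_en: str, tgt: str,
    get_pos_label: Callable[[List[str], str], Labels],
    translate_mt: Callable[[str, str], str],
    word_replace: Callable[[str, str], str],
) -> Tuple[Labels, Labels, Labels, Labels]:
    # Ua: English document + English summary.
    u_a = get_pos_label(doc_en, summary_en)
    # Ub: both sides machine-translated into the target language.
    doc_mt = [translate_mt(s, tgt) for s in doc_en]
    sum_mt = translate_mt(summary_en, tgt)
    u_b = get_pos_label(doc_mt, sum_mt)
    # Uc: MT document + summary converted by dictionary word replacement
    # (keeps source-language syntax, target-language words).
    sum_wr = word_replace(summary_en, tgt)
    u_c = get_pos_label(doc_mt, sum_wr)
    # Ud: original English document + summary machine-translated to the
    # target language and then word-replaced back into English.
    sum_back = word_replace(sum_mt, "en")
    u_d = get_pos_label(doc_en, sum_back)
    return u_a, u_b, u_c, u_d
```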
As shown in Figure 2, given the multilingual label sets (Ua, Ub, Uc, Ud), a neural label search model assigns weights to the different label sets. The weights have two parts, sentence level (Sentence-Level) and set level (Set-Level), and for each we define a weight predictor: Transformer_alpha for sentence-level weights and Transformer_beta for set-level weights.
▲ Figure 2: Multilingual neural label search summarization model
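Below is a minimal PyTorch-style sketch of how sentence-level and set-level weights could combine the four label sets into one soft supervision signal. The linear predictors here merely stand in for the paper's Transformer_alpha and Transformer_beta; the architecture details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LabelSearch(nn.Module):
    """Combine 4 label sets with sentence-level (alpha) and set-level
    (beta) weights. sent_repr: [batch, n_sents, hidden] sentence
    representations; labels: [batch, 4, n_sents] binary labels Ua..Ud."""
    def __init__(self, hidden: int, n_sets: int = 4):
        super().__init__()
        # Stand-ins for Transformer_alpha / Transformer_beta.
        self.alpha = nn.Linear(hidden, n_sets)  # sentence-level weights
        self.beta = nn.Linear(hidden, n_sets)   # set-level weights

    def forward(self, sent_repr: torch.Tensor,
                labels: torch.Tensor) -> torch.Tensor:
        a = torch.softmax(self.alpha(sent_repr), dim=-1)             # [B, S, 4]
        b = torch.softmax(self.beta(sent_repr.mean(dim=1)), dim=-1)  # [B, 4]
        w = a * b.unsqueeze(1)                    # combine both levels
        w = w / w.sum(dim=-1, keepdim=True)       # renormalize per sentence
        # Weighted average over the 4 label sets -> soft label per sentence.
        soft_labels = (w * labels.float().transpose(1, 2)).sum(dim=-1)  # [B, S]
        return soft_labels
```

The resulting soft labels can then supervise the sentence extractor with, for example, a binary cross-entropy loss; at test time only the extractor is kept, matching the zero-shot setup described above.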
Experimental Results
NLSSum assigns different weights to the label sets in the multilingual labels (Multilingual Label) through neural search, finally obtaining a weighted-average label. This final label is used to train the summarization model on the English dataset. Compared with monolingual labels, multilingual labels carry more cross-lingual semantic and syntactic information, so the model improves substantially over the baselines.
As shown in Table 2 below, the experiments use the CNN/DM and MLSUM datasets. MLSUM is the first large-scale multilingual text summarization dataset: it contains 1.5 million documents and summaries crawled from news websites, covering five languages: French (Fr), German (De), Spanish (Es), Russian (Ru) and Turkish (Tr). MLSUM is used at inference time to verify the zero-shot cross-lingual transfer ability of the multilingual models; the training phase uses the CNN/DM English dataset, the most common dataset in the text summarization field.
▲ Table 2: ROUGE results on the MLSUM dataset
Here we compare the ROUGE results of each baseline model on the MLSUM dataset. The table is divided into three parts:
The first part shows simple baselines such as Oracle and Lead;
The second part shows baselines based on supervised learning, where (TrainAll) means training on the datasets of all languages and (TrainOne) means training separately on each language's dataset;
The third part shows unsupervised results, where all models are trained only on the English dataset.
From the second part it is easy to see that under supervised learning, abstractive summarization is more appropriate than extractive summarization. In the third part, the extractive baseline XLMRSum outperforms the abstractive model MARGE, which shows that extraction is more appropriate in the unsupervised setting.
In addition, when machine translation and bilingual-dictionary word replacement are used to augment the original documents (baselines XLMRSum-MT and XLMRSum-WR), XLMRSum-MT degrades performance while XLMRSum-WR improves it; the final model therefore adopts bilingual-dictionary word replacement for data augmentation.
Our NLSSum model likewise has two configurations: NLSSum-Sep word-replaces CNN/DM into the corresponding single target language and fine-tunes on it; NLSSum word-replaces CNN/DM into all target languages and fine-tunes on the replaced data of all languages together.
The final results show that NLSSum, trained across all languages, is better. From the table we can draw the following conclusions:
Input-side data augmentation based on translation models introduces errors, so translation models should be avoided on the input side; in contrast, bilingual-dictionary word replacement is a good augmentation method;
Label construction does not involve the model input, so machine translation models can be used to assist label generation.
As shown in Figure 3, visualization analysis lets us further study how important information is distributed across languages. Important information in English documents is concentrated toward the front, while in other languages it is more dispersed; this is an important reason why multilingual labels improve model performance.
▲ Figure 3: Distribution of summary sentences across languages
Future research will focus on: 1. finding a more reasonable multilingual sentence-level labeling algorithm; 2. studying how to improve summarization results for low-resource languages without degrading results on the English corpus.