How to make iterative semi-supervised training practical in ASR training? RTC Dev Meetup
2022-06-26 03:49:00 [Agora]
Preface
「Voice Processing」is a very important scenario in the field of real-time interaction. In the「RTC Dev Meetup | Technical Practice and Application of Voice Processing in Real-Time Interaction」event launched by Agora, technical experts from Microsoft Research Asia, Agora, and Sumi Technology shared their work around this topic.
This article is based on the content shared at the event by Li Tian, NLP technical director at Sumi Technology. Follow the official account「Agora Developer」and reply with the keyword「DM0428」to download the related PPT materials.
01 The necessity of semi-supervised training in the ASR field
Although general-purpose ASR already achieves very high character accuracy, specific scenarios (games, private chat, group chat, live-streaming anchors) still suffer from a scenario-mismatch problem: applying general-purpose ASR to these domains is relatively difficult, mainly because of the following issues.
1. Scarcity of labeled resources
It is difficult to obtain annotations for the corresponding scenario; the large number of labeled samples a business scenario needs usually cannot be obtained quickly. Even when raw samples are easy to collect, obtaining annotated samples is still very difficult because the labeling cost is very high. When starting a project or deciding a product direction, you find that the data problem has to be solved before the ASR task itself. In the old phoneme/text-split approach the required data volume was small, but end-to-end technology is now commonly used, and it typically takes on the order of 1,000 hours of data just to get started. Whether you label the data yourself or rely on well-known data vendors, that cost is hard to accept before the product has even launched.
2. Instability of annotation quality
In wake-word, Siri-style interaction, and similar scenarios, the user knows that the back end will transcribe their speech. But in most business scenarios, ASR transcription is imperceptible to the people speaking.
For example, when communicating with Siri, if Siri does not hear the speaker clearly, the speaker will try again and articulate more clearly. At the real business level, however, the customer usually does not know that ASR transcription is happening at the back end, for example on a live-streaming platform. The transcription may only serve content-moderation requirements, so it is impossible to tell the anchor that their speech is being transcribed and that they need to enunciate more clearly. Unclear enunciation and broken syntax make the annotation quality very unstable.
So how do we solve these problems when labeling? Our business covers a large number of similar social scenarios across the whole Internet and faces a wide variety of data and domain-specific terms, so obtaining such annotations is very difficult and it is hard to guarantee labeling quality. However, unlabeled data from the same source as the target scenario can be obtained easily, so we believe a semi-supervised scheme is an ideal choice.
If you have worked on NLP or CV, you probably already have a clear definition of semi-supervision in mind. In the ASR field, especially for end-to-end systems, it is currently divided into two types: self-training and pre-training. Other approaches are less common, or have not yet achieved a good landing in ASR.
The self-training line mainly revolves around the well-known pseudo-labeling, and its core scheme is based on the logic of consistency regularization. In theory, a pseudo label is a noisy version of the true label; training the model on pseudo labels together with true labels is itself a noise-robust training process that lets the model learn step by step. Pre-training is simpler. If you come from NLP you will know it well: the goal is to train a representation suited to the target domain, using tasks that revolve around reconstructing the representation's meaning or content and require no extra labels. Pre-training tasks can therefore be built on data that has no labels or human transcriptions, and the human-transcribed data of the target scenario is then used to train the ASR task.
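To make the self-training loop concrete, here is a minimal sketch of the pseudo-labeling cycle just described. The functions train, transcribe, and filter_by_confidence are hypothetical placeholders for a real ASR training and decoding pipeline, not functions from any particular framework.

```python
# Minimal sketch of the pseudo-labeling (self-training) loop described above.
# `train`, `transcribe`, and `filter_by_confidence` are hypothetical placeholders.

def self_training(labeled, unlabeled, rounds=5):
    model = train(labeled)                        # supervised cold start on true labels
    for _ in range(rounds):
        pseudo = [(x, transcribe(model, x)) for x in unlabeled]
        pseudo = filter_by_confidence(pseudo)     # drop the worst pseudo labels
        # Pseudo labels are a noisy version of the true labels; mixing them with
        # the labeled set is effectively noise-robust (consistency) training.
        model = train(labeled + pseudo)
    return model
```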
02 The development of semi-supervised training in the ASR field
1. Self-training
Generally speaking, self-training came from CV. Since pseudo-labeling was first proposed in the 2013 ICML workshop paper Pseudo-Label, various new systems have emerged. The 2014 paper Learning with Pseudo-Ensembles (the first such system) merged pseudo labels with model ensembling; the 2016 paper Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning argued that pseudo labels should themselves be generated from different perturbations of the same model; and the 2017 paper Mean teachers are better role models: Weight-averaged consistency targets focused on how to generate higher-quality labels, using model (weight) averaging to obtain a better teacher model and thus guarantee pseudo-label quality.
As early as those 2014 and 2016 papers, ideas from what is now the popular contrastive-learning line in CV were already present, and many of the formulas and arguments are almost identical. One could say the development of technology moves in historical cycles.
2. Pre-training
Pre-training is mainly concentrated in NLP. Of course, CV also has systems such as the ladder network that contain the concept of pre-training, but the field where pre-training has developed best is still NLP. The core reason is that the underlying feature of NLP is the character, a very discrete system that is hard to compare with the dense data input of CV.
Looking at this lineage, NLP has gone through years of development: from N-gram-based features in 1994, to NN-based systems, to RNN and LSTM language models born of later NN architecture designs; then ELMo emerged, and the transformer architecture appeared around 2017-2018. Now, whether it is BERT or GPT, pre-training has been fully verified on all kinds of downstream NLP tasks.
3. Semi-supervised development in the ASR field
Generally speaking, ASR's own history can be split into two eras:
① The phoneme/text-split era: in many cases people still use Kaldi as the business-level underlying ASR technology. The semi-supervised training logic of that scheme is to train the acoustic model into a general phoneme model, and then let a downstream language model or rescoring model output the text required by the specific business, thereby achieving a partially semi-supervised effect. In terms of process it is closer to transfer learning. But after Alex Graves completed his doctoral work on CTC around 2013, end-to-end systems began to emerge gradually. Two years later, the EESEN team reapplied CTC at the phoneme level, briefly bringing the phoneme/text-split system back.
② The end-to-end era: the rise of LAS (Listen, Attend and Spell) systems and of CTC/LAS + LM hybrid systems allowed end-to-end models to begin surpassing Kaldi and the traditional phoneme/text-split architectures in effectiveness, data efficiency, model quality, and inference speed, and the industry started stepping into the end-to-end era. The chronological order is CTC, Deep Speech, Listen, Attend and Spell, and hybrid CTC/attention.
After 2017, with Watanabe's proposal of the CTC/attention hybrid and the release of the ESPnet framework, the end-to-end system was refined enough to be applied to various industrial businesses. It provides a decoding framework whose combinations are as flexible as lattices: the design based on hypothesis paths gives subsequent shallow fusion a more flexible integration scheme. If you have ever used ESPnet, you will have seen that the whole hypothesis-path design is very flexible, and various technical schemes can be introduced to jointly score or rescore the paths.
Because phonemes and other basic units are no longer used, CTC and Seq2Seq training costs are very high, and actual annotated data is hard to obtain, the end-to-end system's dependence on data gradually became the core bottleneck for its deployment. If you worked on ASR at a large company in the early days, especially 2015-2016, the practical rule of thumb was: consider end-to-end only once you have more than 1,000 hours of data.
Therefore, how to constrain the data requirements of end-to-end systems became the later question (starting from 2019-2020) of optimizing end-to-end models and solving the end-to-end deployment problem, and it is also the core consideration of academia and industry. Since then, pre-training and self-training for ASR began to step onto the stage of history. Related research existed before, but its influence was small, until 2019 and 2020, when Facebook AI published two very promising papers showing that these two directions could be industrialized, and people began to pay attention.
wav2vec: Unsupervised pre-training for speech recognition is Facebook's pre-training-based technical scheme. Its principle is very close to word2vec: a negative-sampling technique is used to train a future-timestep representation-prediction task. Because the trained representation can serve as features for any downstream audio task, this system has become a very important audio technology foundation used by many large companies in the industry.
Self-training for end-to-end speech recognition is from Jacob Kahn's team at Facebook AI and aims at a comprehensive analysis of the practical effect of pseudo-labeling for ASR. It gave strong baselines for pseudo-labeling on several core English ASR datasets, and was the first systematic exposition of the core problems that pseudo-labeling needs to solve to land in the ASR field.
4. Pre-training vs. self-training in ASR
In 2020, as the number of customers gradually increased and more and more scenarios were covered, we faced the need to build a separate ASR for certain specific scenarios in order to obtain a better model than competing products. Simply using the phoneme/text architecture and swapping language models could not achieve the desired effect across various domains. At the same time, building a separate end-to-end ASR for each scenario is hard to accept from the standpoint of data annotation. So we began to consider whether to choose pre-training or self-training.
Originally we considered adopting systems similar to those of other large companies, such as the pre-training approach wav2vec. But in our repeated attempts, the actual operation cost of wav2vec was very high: the downstream post-pretraining in the target domain plus the pre-training itself take a very long time, which stretches the model iteration cycle. Crucially, during the pre-training + post-pretraining stage there is no ASR model output at all, which is hard to accept for new business scenarios that require rapid iteration.
Given this contradiction, we ultimately preferred the self-training technical scheme for our business, because with self-training every trained model can be evaluated: use it first, then optimize. That is a business-friendly property.
5. The recent development track of self-training in the ASR field
After anchoring on self-training as the goal, we have been researching and following this field since 2020. We found that the main players who have done it quite well are Facebook, Google, and Mitsubishi; others, such as the long-established ASR company Nuance and some universities, also publish improvement schemes or studies for specific problems. In 2020, their main research directions were as follows.
(1) 2020
Facebook:
SELF-TRAINING FOR END-TO-END SPEECH RECOGNITION,
END-TO-END ASR: FROM SUPERVISED TO SEMI-SUPERVISED LEARNING WITH MODERN ARCHITECTURES,
ITERATIVE PSEUDO-LABELING FOR SPEECH RECOGNITION
The research thread is: strong baselines and studies of simple pseudo-labeling in the CTC framework; the effect of simple pseudo-labeling in the CTC/attention hybrid architecture; and the study of multi-round iterative pseudo-labeling systems.
Google:
Because Google's iterative pseudo-labeling work has a very strong technical background in CV, they immediately presented a multi-round iterative pseudo-label + model-ensemble scheme, Noisy Student Training (NST), and took that year's SOTA on Librispeech 100 + 860. Of course, iterative training has many pitfalls, in particular the explosion in the number of data experiments caused by multiple rounds of iteration; this is addressed explicitly in our scheme.
Mitsubishi:
In the iterative pattern, a teacher model is first trained and then multiple rounds of pseudo-labeling training are carried out; every pseudo-labeling round requires relabeling the data internally, and so many rounds make training very cumbersome. Starting from 2021, therefore, we have gradually seen on-the-fly approaches in various fields, for example Mitsubishi's MPL (evolved from mean teacher). However, on-the-fly means labels must be generated in real time, and the quality of ASR label generation is directly tied to the computational cost of decoding. Plain CTC greedy search is fast, but the quality of its transcriptions is poor; the more common shallow-fusion schemes, in which multiple models jointly score the decoded transcription, are basically impossible to run in real time during training. So, generally speaking, the final effect of the on-the-fly mode is actually not as good as the iterative mode.
Others:
Salesforce staged a small "renaissance", using pseudo-label training again on the EESEN framework, with labels generated by CTC greedy search. Nuance, the long-established ASR technology vendor, used FixMatch to interpret the theoretical essence of semi-supervision, namely consistency training.
(2) 2021
Mitsubishi:
Because of the defects of the on-the-fly mode, the advanced MPL that Mitsubishi published in 2021 returned to the iterative pattern. They split the teacher-model training from the subsequent on-the-fly training process, and at the same time switched to the Conformer architecture, which is more robust for audio. In the end it surpassed Google's NST scheme and currently ranks second.
Facebook:
In 2021 Facebook AI used a cache mechanism: another process decodes synchronously while the model trains; once the decode cache is full, training switches to joint training on the cached data plus the labeled data; after N steps the cache is emptied and decoding starts again. So although Facebook AI calls it an on-the-fly mode, in essence it still has the concept of rounds. With a 36-layer transformer it obtained the current SOTA on Librispeech 100 + 860, which can even rival ESPnet's direct supervised training on Librispeech 960.
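As a rough, hedged sketch of that cache idea, the loop below shows the shape of the mechanism; the function names and constants are purely illustrative and are not Facebook's implementation.

```python
# Rough sketch of the cache-based scheme described above: pseudo labels are decoded
# into a cache while training consumes the cache together with labeled data, and the
# cache is emptied every `refresh_every` steps so labels get regenerated by the newer
# model. Names and constants are illustrative only.

def cache_training(model, labeled_loader, unlabeled_pool,
                   cache_size=1000, refresh_every=10000, total_steps=100000):
    cache = []
    for step in range(1, total_steps + 1):
        # In the real setup this decoding runs in a separate, synchronized process.
        while len(cache) < cache_size:
            x = unlabeled_pool.sample()
            cache.append((x, decode(model, x)))
        batch = mix_batch(labeled_loader, cache)   # joint labeled + cached batch
        train_step(model, batch)
        if step % refresh_every == 0:
            cache.clear()                          # forces re-decoding next iteration
    return model
```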
03 The problems our semi-supervised solution must solve
1. Iterative or on-the-fly
Based on our effect requirements and the conclusions of current academia and industry, our technical direction was finally anchored on the iterative pattern.
2. The problems of the iterative pattern
But iterative-pattern training is very cumbersome, because the pseudo-label data must be regenerated after each round of training, and according to Google's and Facebook's experience, achieving good results requires multiple iterations.
So each iteration faces three problems. First, how to generate high-quality pseudo-label data? This is essentially the simplest problem: we have all kinds of decoding algorithms, and we use whichever works best. Second, how to filter out high-quality pseudo-label data? Since we do not know which labels are correct, there will always be problems no matter how high the quality is, so we need to study how to reduce the proportion of problematic labels. Third, the biggest problem in the whole iterative pattern: how to balance labeled and unlabeled data.
Google's NST system needs five iterations, and the ratio of labeled to unlabeled data differs in each round: roughly 2:7 in the second round and 1:3 in the third. On Librispeech 100 + 860, a labeled:unlabeled ratio around 1:3 was verified to be reasonable, but the ratio differs across task lines. Facebook's experimental results on Librispeech + LibriVox show the ratio needs to be 1:10 or above. This makes the experimental cost of the final business deployment huge. For example, with five rounds, each round of training requires multiple data experiments with different ratios, then the model is selected and evaluated by decoding, and then the next round again runs multiple data experiments with different ratios, iterated over five rounds. Because ASR training is expensive, the pace of each round is very painful.
In addition, with a limited amount of labels, how do we cold-start the model? The initial labeled training data is generally very scarce; for example, the initial labeled data in the iterative pattern is usually only about 1/10 of the available data, so how to perform the cold start becomes a core problem.
04 The improved NIPL solution
Based on these questions we developed our own solution, published as Improved Noisy Iterative Pseudo-Labeling for Semi-supervised Speech Recognition. Here is a brief preview of the scheme.
1. Model framework
Since 2020 we no longer use the Kaldi system; instead we switched to a self-developed framework similar to ESPnet. For the model framework, both the shared encoder in front of CTC and the LAS decoder use transformers. The left side of Figure 1 is the diagram from Watanabe's CTC/attention hybrid paper, and the right side introduces our model framework. For the model parameters, the sub-layer of the shared encoder is currently a 2-layer (3×3, 512-channel) CNN with stride 2; this may differ slightly from the frontend in ESPnet but is basically the same. The transformer encoder currently uses 12 layers, 8 heads, 512 dimensions, and an FFN of 2048, similar to most transformer-based acoustic models. In addition, the attention decoder uses 6 transformer layers with the same parameter configuration as the encoder.
For the language model, we added an extra 6-layer transformer language model whose other parameters are the same as BERT: 12 heads, 768 dims, FFN of 3072.
■ Figure 1
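The parameters above can be summarized roughly as follows; the dictionary keys are illustrative and do not correspond to the real configuration keys of the self-developed framework.

```python
# Rough summary of the hyper-parameters quoted above; keys are illustrative only.
model_config = {
    "frontend": {"type": "conv2d", "layers": 2, "kernel": (3, 3),
                 "channels": 512, "stride": 2},
    "encoder":  {"type": "transformer", "layers": 12, "heads": 8,
                 "d_model": 512, "ffn_dim": 2048},   # shared CTC/LAS encoder
    "decoder":  {"type": "transformer", "layers": 6, "heads": 8,
                 "d_model": 512, "ffn_dim": 2048},   # attention (LAS) decoder
    "lm":       {"type": "transformer", "layers": 6, "heads": 12,
                 "d_model": 768, "ffn_dim": 3072},   # BERT-like sizing
}
```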
2. Other general settings
Our experimental data is Librispeech 100 + 860, with the 100 hours as labeled data and the 860 hours as unlabeled data. The LM data is the Librispeech training transcripts plus the official 800W text corpus. Our acoustic features are 100-dim Fbank + 3-dim pitch. To reduce the number of text labels we use BPE, shrinking the word inventory to 7,002 pieces, which reduces the final output layer and speeds up CTC training.
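Building such a BPE vocabulary is typically done with sentencepiece; the following is a minimal sketch assuming the training transcripts have been dumped to a plain-text file (the file path is hypothetical).

```python
# Sketch: train a 7,002-piece BPE model with sentencepiece, assuming the training
# transcripts have been dumped one utterance per line.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",   # hypothetical path
    model_prefix="bpe7002",
    vocab_size=7002,
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="bpe7002.model")
print(sp.encode("semi supervised training", out_type=str))
```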
For the training configuration, the learning-rate schedule is similar to the transformer (Noam) schedule but with one difference: near the tail of the decay we decay to the final stable value 5,000 steps early and then hold it there for a while. This is directly related to the model-stability techniques described below: keeping training stable during that period so that model averaging can keep up.
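A minimal sketch of that schedule on top of the usual Noam warmup; the constants (model dimension, warmup steps, hold point) are assumptions for illustration, not the exact values used.

```python
# Sketch of the schedule described above: standard Noam-style warmup/decay, except the
# decay is frozen at `hold_from` (reached ~5,000 steps early) so the learning rate then
# sits at a stable value and model averaging sees a steady training period.
def learning_rate(step, d_model=512, warmup=25000, hold_from=95000):
    step = max(1, min(step, hold_from))   # clamp: stop decaying early, then hold flat
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```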
3. How to generate pseudo labels on unlabeled data
The decoding algorithms currently common in the industry that give relatively high quality are the shallow fusion and deep fusion families. We use shallow fusion, merging the acoustic model's CTC and LAS scores with the LM for the search, with a beam size of 50. The general process is almost the same as ESPnet, but we make two small changes (a toy scoring sketch follows the two changes below):
The first is that we use CTC greedy search to judge the end of a sentence, whereas ESPnet does not do this and has its own end-detect algorithm.
The second is that we do not prune paths aggressively; instead we keep as many paths as possible.
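As a toy illustration of the shallow-fusion scoring referenced above, each partial hypothesis is scored by interpolating the CTC prefix score, the attention-decoder score, and the external LM score; the interpolation weights below are placeholders, not the production values.

```python
# Toy illustration of shallow-fusion scoring for one partial hypothesis during beam
# search (beam size 50). Weights are placeholder values.
def joint_score(ctc_logp, att_logp, lm_logp, ctc_weight=0.3, lm_weight=0.5):
    return (ctc_weight * ctc_logp
            + (1.0 - ctc_weight) * att_logp
            + lm_weight * lm_logp)
```

Each expanded hypothesis is re-ranked by this joint score, and as many paths as possible are kept rather than pruned early.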
4. How to select high-quality pseudo-label data for the next round of semi-supervised training
When pseudo labels are generated, the quality of much of the data is not flattering, especially in early training, for example in the first or second round of NST or iterative labeling, when the model's WER on the Librispeech dev and test sets may still be close to 9 or 10 points or more.
In this case, Google and Facebook use a rough sort and take a percentile: similar to the hypothesis score in ESPnet, the probability is accumulated during decoding, the scores are ranked from small to large, and the top 90% are taken. But the confidence can drop off a cliff: for example, the probability distribution of the first 85% of the data is very similar, and then somewhere between 85% and 95% the probability suddenly shows a very large gap, dropping by several points. To deal with this we use a distribution test to select samples: we first assume the scores follow a Gaussian distribution, and then keep only the data inside the two-sided 90% or 95% confidence interval of that Gaussian for training. Note that a two-sided 90%/95% confidence interval does not mean retaining 90% or 95% of the data; it means keeping the data inside the confidence interval under the Gaussian assumption, which is usually less than directly retaining 90%.
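A minimal sketch of that distribution-test filtering, assuming each pseudo-labeled utterance carries a length-normalized log-probability score from decoding:

```python
# Sketch: keep only pseudo-labeled utterances whose decoding score falls inside the
# two-sided 90% (or 95%) confidence interval of a Gaussian fitted to all scores.
# `scores` is assumed to be a per-utterance, length-normalized log probability.
import numpy as np
from scipy import stats

def filter_by_gaussian_ci(utts, scores, confidence=0.90):
    scores = np.asarray(scores)
    mu, sigma = scores.mean(), scores.std()
    lo, hi = stats.norm.interval(confidence, loc=mu, scale=sigma)
    return [u for u, s in zip(utts, scores) if lo <= s <= hi]
```

As noted above, this keeps whatever fraction of the data lies inside the interval under the Gaussian assumption, which is usually less than a flat 90% cut.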
5. How to balance the ratio of labeled to unlabeled data so the model overfits to neither
How to balance the ratio of labeled to unlabeled data is the biggest problem in multi-round iterative semi-supervised training. None of the previous studies showed how to choose the ratio; they only give the approximate proportion for their own task. Facebook works on Librispeech 960 + LibriVox with a ratio between 1:10 and 1:54; Google works on Librispeech 100 + 860 with a ratio around 1:3.
These observations cannot guide the actual ratio selection in production. For example, a live-streaming ASR might start from 100 hours of labeled data, while a lot of unlabeled data from the same source can be obtained easily. But in what proportion should the unlabeled and labeled data be mixed so that the model does not overfit to the unlabeled data, and so that training stays stable and the effect improves? Answering that requires endless data experiments. Of course, if the company has enough machine resources the experiments can be run, but most of the time people do not have as many machines as Google or Facebook and cannot simply brute-force the search.
So how can each business line get guidance here? We carried out detailed experiments and qualitative and quantitative analysis on Librispeech 100/860 and obtained a guideline that is currently very accurate and can tell you how to choose the data balance. Let us make a hypothesis here, which is directly related to why pseudo-label semi-supervised training works. We believe that when training with pseudo labels, because labeled and unlabeled data are mixed together and we do not know which pseudo labels are correct, model training should be made as "conservative" as possible on certain characteristics, so that it does not overfit to wrong or tail data. At the same time a certain diversity of samples must be kept, because a completely conservative model will fall into what it believes is the optimum at the data level and step into a local optimum; multiple rounds of iterative training will exacerbate this process, leading to over-training and overfitting.
To identify where to be conservative and where to keep diversity, we divide the data into three portrait dimensions: the first is audio length, the second is text/pieces length, and the third is the distribution of the labels themselves. The problem then becomes: on which dimensions should training stay as conservative as possible, and on which dimensions should sample diversity be kept as large as possible? Based on this we ran a large-scale experiment. Each time new pseudo labels are generated, we build multiple training-sample candidates (alternative sets) according to different ratios; each candidate is one batch of training data. Before each round of training, we compare every candidate with the previous round's training data along these three dimensions and rank all candidates: the 1:2 candidate gets a ranking against the previous round in the three dimensions, the 1:4 candidate gets one, as do the 1:5 and 1:6 candidates, and so on.
For the evaluation and ranking scheme: because frame length and pieces length are single-dimensional statistics, we use the KS test for them. The label distribution itself is multi-dimensional, so we normalize the term frequencies and use the Euclidean distance to evaluate the distribution difference between the current round's data and the previous round's data, and rank each candidate.
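A sketch of how one candidate mix might be compared against the previous round's data on the three portrait dimensions; the field names are illustrative, and the piece-frequency vectors are assumed to be aligned over the same BPE vocabulary.

```python
# Sketch: score one candidate mix against the previous round's training data on the
# three portrait dimensions described above. Field names are illustrative only.
import numpy as np
from scipy.stats import ks_2samp

def portrait_distances(cand, prev):
    frame_ks = ks_2samp(cand["frame_lens"], prev["frame_lens"]).statistic  # audio length
    piece_ks = ks_2samp(cand["piece_lens"], prev["piece_lens"]).statistic  # text length
    tf_c = cand["piece_tf"] / cand["piece_tf"].sum()   # normalized term frequencies
    tf_p = prev["piece_tf"] / prev["piece_tf"].sum()
    label_dist = np.linalg.norm(tf_c - tf_p)           # Euclidean distance on labels
    return frame_ks, piece_ks, label_dist
```

Each candidate is then ranked on these three numbers.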
After a lot of experiments we found a very clear rule: a smaller difference in the label (pieces) distribution itself, together with larger differences in the frame-length distribution and the pieces-length distribution, generally leads to a better model in the new round. This logic can be described as a general paradigm, as shown in Figure 2.
■ Figure 2
6. Training tricks to ensure the model does not overfit to wrong pseudo labels
This is a key point we found for the whole system, and it has two dimensions. The first is the data dimension: we add SpecAugment and SpecAugment++ to make the data more generalized. At the model level, similar to MPL, we generate pseudo labels both online and offline, choosing the online results in early rounds and the offline results later; generally from about the fifth round onward the offline results are stably better than the online ones. In addition, we also ramp up dropout, gradually raising it from 0.1 to 0.3, because pseudo-label training carries a great risk of overfitting; beyond roughly 0.4 there is no further gain.
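As a rough illustration of keeping an "offline" copy of the model alongside the "online" one (MPL-style momentum averaging), the sketch below shows one common way to do it; this is an assumption about the mechanism, not the exact implementation from the talk.

```python
# Rough illustration of maintaining an "offline" (momentum / EMA) model next to the
# "online" training model, in the spirit of MPL. Assumed mechanism, for illustration.
import copy
import torch

def make_offline(online_model):
    offline = copy.deepcopy(online_model)
    for p in offline.parameters():
        p.requires_grad_(False)          # the offline copy is never trained directly
    return offline

@torch.no_grad()
def ema_update(offline, online, decay=0.999):
    for p_off, p_on in zip(offline.parameters(), online.parameters()):
        p_off.mul_(decay).add_(p_on, alpha=1.0 - decay)
```

Early rounds label with the online model; from roughly the fifth round onward the offline model's labels are consistently better, so the switch is made.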
7. With a limited number of labeled samples, how to run supervised cold-start training for the best effect
We also use two-stage training. Starting the first stage with dropout 0.1 for 30 epochs and then switching to a second stage with dropout 0.3 for 100 epochs gives the best effect; the detailed experimental results are shown in Figure 3. This also illustrates a point: during cold start you should first run a few epochs with a smaller dropout to fit the target quickly, then raise the dropout to a more generalization-friendly configuration and train for more rounds to get the best model. This cold-start mode can basically match the cold-start results of the model in Google's NST system.
■ Figure 3
Finally, the overall effect of the improved NIPL. As of the Interspeech 2022 submission deadline, on Librispeech 100 + 860 only two teams do better than us. The first is Mitsubishi's MPL, whose Conformer reaches 3.8%/8.2%; but with the architecture held to the same transformer, Mitsubishi only reaches 4.8%/10.1% versus our 3.93%/9.59%. The other is Facebook's slimIPL, whose 36-layer transformer reaches 3.8%/7.5% without any language model, and 2.7%/5.2% with a language model and rescoring. That effect already goes beyond our expectations: the ESPnet supervised training result on the full Librispeech 960 is 96.96, i.e. about 3.04% WER, which means Facebook can reach 2.7%/5.2% with only the 100 hours of labels and without the 860 hours of labels.
05 Q & A
1. How does the effect compare in terms of WER?
Our test-clean is 3.93 and test-other is 9.59, but we then continued NIPL training into rounds 7 and 8, and test-other can still be reduced: test-clean stays at 3.93, while test-other has so far dropped to about 9.3. Mitsubishi's Conformer is 3.8%/8.2%, lower than our 3.93, but their transformer is 4.8%/10.1%. Facebook's slimIPL is 3.8%/7.5%; we are a little incredulous about slimIPL, the result is slightly scary. So we should be third in the world, better than the NST paper Google published in 2020.
2. Some background on using CTC
When CTC first appeared, it was hard to optimize in training and demanding about data volume, so CTC usage at that time involved all kinds of odd tricks. The EESEN work mentioned above used CTC to train phonemes and then still attached a WFST just like everyone else; because the number of phonemes is much smaller than the number of words, this greatly reduced the difficulty of CTC training and let it compete in some settings with MMI, LFMMI, and the like. Training a bare CTC end-to-end ASR directly can have a very high data cost.
If you had asked this question in 2020, I would have recommended trying the EESEN project for a new business. But it is now 2022, and the industrial use of CTC has changed greatly. Watanabe's paper shows that the CTC/LAS hybrid system can work very well, and its data requirements are no longer as demanding as pure CTC, because the LAS branch brings many optimization techniques that help training. So CTC + LAS is currently a fairly standard scheme. If you do not have your own ASR training platform, I suggest trying ESPnet or WeNet; if streaming recognition is the core business demand, WeNet is the first choice.
Activity Notice
「RTC Dev Meetup - Hangzhou」will focus on big-front-end technology, with technical experts from Agora, Ant Group, and Hikvision sharing business architectures and cross-platform practices in the real-time interaction field in the big-front-end era.
Better to act than just think about it: scan the QR code or click here to sign up!