
Yitu's challenges in voice processing for real-time audio and video -- RTC Dev Meetup

2022-06-22 08:19:00 Acoustic network

Preface

Voice processing is a very important scenario in real-time interaction. At the Agora-hosted "RTC Dev Meetup: Technical Practice and Application of Voice Processing in Real-Time Interaction", technical experts from Baidu, Huanyu Technology, and Yitu shared their work on this topic.

This article is based on the talk given at the event by Zhou Yuanjian, technical director of Yitu AI SaaS. Follow the official account "Agora Developers" and reply with the keyword "DM0428" to download the slides from the event.


Yitu is a provider of AI infrastructure and AI solutions. Its AI capabilities are fairly broad, covering images, video, speech, and natural language processing. In addition to AI algorithms, it can also provide AI computing power.

With that background on Yitu, let me talk about the challenges of audio content moderation in live streaming scenarios.

01 The business process of live content moderation

■ Figure 1

Figure 1 shows the business process of content moderation in a live streaming scenario.

The basic process is as follows. The streamer starts a live broadcast, and the stream is pushed to the platform. The platform sends a moderation request to a moderation vendor (for example, Yitu). The vendor pulls the stream from the given address, decodes it, analyzes it in real time to find violating content, and returns the results to the customer through a callback. After receiving the data, the customer generally carries out a second, manual review. If a violation is confirmed, it is handled in the back end, for example by stopping the live stream or deleting the account.
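The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Violation` record, the `moderate_stream` function, the toy detector, and the placeholder stream address are all hypothetical names, and the list of segments stands in for a stream that has already been pulled and decoded.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class Violation:
    stream_url: str
    offset_s: float  # where in the stream the segment starts, in seconds
    label: str       # hypothetical violation label, e.g. "abuse"


def moderate_stream(
    stream_url: str,
    segments: Iterable[Tuple[float, str]],
    detect: Callable[[str], Optional[str]],
    callback: Callable[[Violation], None],
) -> int:
    """Analyze decoded segments in order and report each violation via callback.

    `segments` stands in for the pulled and decoded stream; each item is
    (offset in seconds, segment content). Returns the number of violations.
    """
    reported = 0
    for offset_s, content in segments:
        label = detect(content)
        if label is not None:
            callback(Violation(stream_url, offset_s, label))
            reported += 1
    return reported


# Usage: the platform collects callbacks into a queue for manual review.
review_queue = []
count = moderate_stream(
    "rtmp://example.invalid/live/room1",  # placeholder address
    [(0.0, "hello everyone"), (5.0, "some banned phrase")],
    lambda text: "abuse" if "banned" in text else None,  # toy detector
    review_queue.append,
)
```

Everything that lands in `review_queue` would then go to the secondary manual review described above before any action is taken on the stream.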

02 Live audio moderation algorithm modules

Looking inside the system, the algorithm modules can be divided into three categories, as shown in Figure 2. The first is basic speech recognition (ASR). The second is text classification, which is mainly used to judge what violating content the recognized text contains. The third is nonverbal recognition, which covers violating content that is not expressed in words.

■ Figure 2

2.1 Speech recognition (ASR): technical difficulties

First, let me introduce the challenges encountered in ASR.

Overall, there are two main challenges. The first is interference from strong background sound. Speech on the internet is usually accompanied by background music or game audio, the environment is generally noisy, and several people may even be talking at once. Compared with ordinary scenarios, these characteristics greatly increase the difficulty of speech recognition.

The second is the recognition of specialized terms. Some violating words rarely appear in everyday life, so without special optimization a speech recognizer tends to map their syllables to more common words, and the violating words are missed.

2.1.1 Optimizing performance under strong background sound

So how do we deal with these problems? For strong background interference, we tried many approaches, and in summary the most effective one is to attack the problem from the data side.

There are two main data-side optimizations. The first is to build a more sophisticated ambient sound simulator based on the business scenario and use it for data augmentation. This approach has been proven in other fields; for example, Tesla's Autopilot models use similar techniques during training to improve performance.

Yitu built its simulator along several dimensions: speech production simulation, room simulation, sound pickup simulation, channel simulation, and others. Parameters can be adjusted within each dimension, such as the number of speakers, speaking rate and intonation, background sound, the position and direction of the sound source, aphonia (loss-of-voice) effects, and reverberation. In total there are probably hundreds of adjustable parameters. The simulator enriches the originally fairly simple training data and brings it closer to the specific scenario, which yields a good performance improvement.
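As a small illustration of simulator-style data augmentation, the sketch below covers just one of those dimensions: mixing a background sound into clean speech at a randomly drawn signal-to-noise ratio. It is not Yitu's simulator. The function names, the parameter ranges, and the synthetic sine and noise signals are assumptions made for the example.

```python
import math
import random


def power(signal):
    """Mean power of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the speech-to-noise ratio is `snr_db`."""
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def sample_params(rng):
    """Draw one random simulator configuration (hypothetical ranges)."""
    return {
        "snr_db": rng.uniform(0.0, 20.0),  # loudness of the background sound
        "speed": rng.uniform(0.9, 1.1),    # speaking-rate factor (not applied here)
    }


rng = random.Random(0)
# Synthetic stand-ins: a 220 Hz tone as "speech", Gaussian noise as "background".
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
params = sample_params(rng)
augmented = mix_at_snr(speech, noise, params["snr_db"])
```

Drawing a fresh configuration for every training utterance is what turns a small, clean corpus into many varied examples; a full simulator would sample room, pickup, and channel parameters in the same way.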

The other improvement is training with hard example mining. Normal model training uses both positive and negative examples. With large amounts of data, there will always be some positive examples that look similar to negative ones; such data are usually called hard examples. Online hard example mining repeatedly adds these hard examples back into training while the model is being trained. It is similar to a student's notebook of mistakes: recording the problems you have not mastered and revisiting them improves your grades.

Training on hard examples in this way lets the model learn more of the details that are difficult to distinguish and gives a good performance improvement. With the techniques above, the model also achieves good performance on data with strong background sound.
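One common way to implement the "mistake notebook" idea is to rank the examples in each batch by their loss and feed the highest-loss ones back into a later batch. The sketch below shows that selection step only; the function names, the `keep_ratio` parameter, and the made-up loss values are assumptions, and the source does not say which selection rule Yitu uses.

```python
def select_hard_examples(losses, keep_ratio=0.25):
    """Return indices of the highest-loss examples, hardest first."""
    k = max(1, int(len(losses) * keep_ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return ranked[:k]


def build_next_batch(fresh_ids, hard_pool, n_hard):
    """Mix up to `n_hard` remembered hard examples into the next batch."""
    return list(fresh_ids) + list(hard_pool)[:n_hard]


# One step of the loop with made-up per-example losses.
batch_ids = ["a", "b", "c", "d"]
losses = [0.1, 2.0, 0.3, 1.5]
hard_pool = [batch_ids[i] for i in select_hard_examples(losses, keep_ratio=0.5)]
next_batch = build_next_batch(["e", "f"], hard_pool, n_hard=2)
```

Here examples "b" and "d" have the largest losses, so they are the ones carried over and seen again alongside the fresh examples "e" and "f".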

2.1.2 Recognizing specialized terms

The other challenge mentioned earlier is the recognition of specialized terms. Figure 3 gives an example: it shows the transcript of a piece of audio. If you have never heard the term "knock bubbles" before, you will most likely not understand what the passage means, and "bubbles" may well sound like "terrible" (the two words sound alike in the original Chinese).

■ Figure 3

To address this, we experimented and found two methods to be more effective. The first is to increase the weight of the loss on specialized terms during model training; in other words, a mistake on such a term incurs a higher penalty. Taking the example above, if getting an ordinary word wrong normally costs 1 point, getting "knock bubbles" wrong is set to cost 2 points. This makes the model work harder to avoid recognition errors on specialized terms.

The second method is to adjust the range of candidate words searched in the lexicon during decoding. As shown in Figure 4, a speech recognition algorithm first recognizes individual phonemes from the speech spectrum and then converts the phonemes into possible text.
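The "1 point versus 2 points" idea corresponds to a weighted loss. Below is a minimal sketch, assuming a per-token negative log-likelihood loss and a weight of 2.0 for specialized terms; the function name, the token list, and the probabilities are hypothetical, and the actual loss used in Yitu's ASR model is not described in the talk.

```python
import math


def weighted_token_loss(correct_probs, tokens, special_terms, special_weight=2.0):
    """Sum of per-token negative log-likelihoods, with extra weight on special terms.

    correct_probs[i] is the probability the model gave to the correct token i.
    """
    total = 0.0
    for prob, token in zip(correct_probs, tokens):
        weight = special_weight if token in special_terms else 1.0
        total += weight * -math.log(prob)
    return total


tokens = ["let's", "knock bubbles"]
probs = [0.5, 0.5]  # made-up model confidences for the correct tokens
plain = weighted_token_loss(probs, tokens, special_terms=set())
weighted = weighted_token_loss(probs, tokens, special_terms={"knock bubbles"})
```

With equal confidence on both tokens, the specialized term contributes twice as much to `weighted` as the ordinary word does, so gradient updates push harder on getting that term right.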

■ Figure 4

To optimize for specialized terms, more candidate words can be considered when converting a sequence of phonemes into text. In the earlier example, if "bubbles" is not in the list of candidate words, it is impossible to recognize "knock bubbles" correctly no matter what.

The idea is fairly intuitive, but implementing it introduces a new problem: the amount of computation grows significantly, roughly quadratically. In a non-real-time business scenario the extra computation may not matter much, but in live streaming it can lead to longer latency.

This matters a great deal when live streaming is sensitive to latency, so the speed problem has to be solved. Generally speaking, good live moderation works at the level of seconds, and the loosest requirement is at the level of minutes. Yitu's acceleration scheme is to determine the search scope for candidate words dynamically, which goes back to the business scenario: content moderation does not require every sentence to be recognized accurately. What matters is recognizing the violating words accurately, and this can be exploited for optimization.

Specifically, when the preceding phonemes suggest that a violating word may be present, the search scope for decoding the subsequent candidate words is expanded. This way low-frequency violating words are not missed, while computation that has no effect on the final business result is avoided. The overall amount of computation drops substantially, which keeps the service real-time.
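A rough sketch of the dynamic search scope follows, assuming the decoder exposes the most recent phonemes and that the "search scope" can be modeled as a beam width. The prefix-matching rule, the function names, the widths 4 and 32, and the placeholder phoneme symbols are all illustrative assumptions rather than details from the talk.

```python
def build_prefixes(sensitive_phoneme_seqs):
    """Collect every non-empty proper prefix of each sensitive word's phonemes."""
    prefixes = set()
    for seq in sensitive_phoneme_seqs:
        for n in range(1, len(seq)):
            prefixes.add(tuple(seq[:n]))
    return prefixes


def choose_beam_width(recent_phonemes, prefixes, narrow=4, wide=32):
    """Use the wide search only when the recent phonemes could start a sensitive word."""
    for n in range(1, len(recent_phonemes) + 1):
        if tuple(recent_phonemes[-n:]) in prefixes:
            return wide
    return narrow


# Placeholder phoneme symbols, not a real phoneme inventory.
prefixes = build_prefixes([("k", "a", "t")])
width_hit = choose_beam_width(["x", "k", "a"], prefixes)   # tail matches ("k", "a")
width_miss = choose_beam_width(["x", "y"], prefixes)       # nothing matches
```

Most of the time the decoder stays on the narrow, cheap search, and it pays for the wide search only in the few frames where a violating word could be starting.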

2.2 Nonverbal recognition

In live streaming, the demand for nonverbal recognition focuses mainly on voiceprint recognition of key individuals, sensitive audio detection, language classification, and result fusion.

2.2.1 Sensitive audio detection

First, sensitive audio detection. Its task is to identify whether a piece of audio contains ASMR or similar content. There are two technical difficulties. The first is that sensitive content is short and of variable length. To evade moderation, streamers may mix sensitive sounds in with normal speech, so the sensitive sound generally lasts only briefly and is well concealed. The second is that the concentration of violations in the data is low. This means the false positive rate must be low in order to keep the cost of manual review down, and a high recall must be maintained at the same time, which places higher demands on the robustness of the algorithm.

The first difficulty, that sensitive content is relatively short, is addressed mainly at the level of the network architecture, as shown in Figure 5.
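A quick calculation shows why low violation concentration forces a low false positive rate. The numbers below are hypothetical (one violating segment per thousand, 90% recall) and are chosen only to show the effect.

```python
def review_precision(prevalence, recall, false_positive_rate):
    """Fraction of flagged items that are real violations."""
    true_pos = prevalence * recall
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


# Hypothetical numbers: 1 in 1000 segments is a violation, recall is 90%.
at_1_percent_fpr = review_precision(0.001, 0.9, 0.01)
at_01_percent_fpr = review_precision(0.001, 0.9, 0.001)
```

Under these assumptions, a 1% false positive rate means only about 8% of the flagged segments are real violations, so reviewers spend most of their time on false alarms. Cutting the false positive rate to 0.1% raises that share to about 47%.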

■ Figure 5

Usually a detection algorithm treats a piece of data as a whole. When the violating content is short, the sound of the other, normal content drowns out the abnormal, violating signal, and recall drops.

The usual way to avoid this is to cut the data into smaller pieces. This does avoid interference from normal sound, but it also loses the original context of the audio, which leads to false positives. After many attempts and investigations, Yitu adopted the attention mechanism to solve this problem.

Over the past few years, attention has achieved good results not only in machine translation but also in text, images, speech, and other areas. Put simply, given a sequence of data, the model first works out which positions in the sequence are important and then pays more attention to the data at those positions.

In this scenario, when a piece of audio arrives, the attention mechanism keeps the complete information while determining which parts are more likely to be sensitive sounds, so that more attention is allocated to them and the performance of the algorithm improves overall.

The other challenge is achieving a low false positive rate and high recall when the concentration of violations is low. Our solution is to improve performance through pre-training with transfer learning, as shown in Figure 6. Transfer learning is widely used in many fields. We start from another model that has already been trained well and do additional training toward the model we want, ending up with a better model. It is like standing on the shoulders of giants.
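The effect can be illustrated with attention pooling over audio frames: a softmax turns per-frame relevance scores into weights, and the clip-level feature is the weighted sum of the frame features. In a real model the scores come from a learned layer; here they are supplied by hand, and the one-dimensional frame features are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, scores):
    """Weighted sum of frame features, with weights given by softmax(scores)."""
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Four frames; only the third carries the "sensitive" feature.
frames = [[0.0], [0.0], [1.0], [0.0]]
scores = [0.0, 0.0, 5.0, 0.0]  # hand-set stand-in for learned relevance scores
pooled, weights = attention_pool(frames, scores)
mean_pooled = sum(f[0] for f in frames) / len(frames)
```

Plain averaging dilutes the single sensitive frame to 0.25, while attention pooling keeps its contribution close to 1. All frames still take part in the computation, so the context is not thrown away as it is when the audio is cut into pieces.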

■ Figure 6

Yitu has previously achieved good results in voiceprint competitions in China and abroad. Sensitive audio is in fact related to voiceprint, and voiceprint recognition is an algorithmic task of the same type, so it was natural for us to consider transferring this advantage to sensitive audio detection.

As shown in Figure 7, a characteristic of Yitu's voiceprint model is that it learns invariance to channel, environment, and so on, which makes it robust across a variety of channel environments. We use our own voiceprint model as the initialization for the sensitive audio detection model. The detection model thus inherits the characteristics of the voiceprint model and is robust across a variety of channel environments.

■ Figure 7
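Initializing one model from another usually means copying the shared layers and leaving the task-specific output layer freshly initialized. The sketch below does this with plain dictionaries of parameter lists. The layer names, the `classifier.` prefix, and the values are hypothetical; a real system would use its framework's own checkpoint loading.

```python
def init_from_pretrained(new_params, pretrained_params, skip_prefixes=("classifier.",)):
    """Copy matching parameters from a pretrained model into a new model.

    Parameters are plain lists keyed by name. Layers whose names start with a
    prefix in `skip_prefixes` keep their fresh initialization.
    """
    initialized = {name: list(values) for name, values in new_params.items()}
    copied = []
    for name, values in pretrained_params.items():
        same_shape = name in initialized and len(values) == len(initialized[name])
        if same_shape and not name.startswith(skip_prefixes):
            initialized[name] = list(values)
            copied.append(name)
    return initialized, copied


# Hypothetical parameters: a shared encoder and task-specific classifiers.
voiceprint = {"encoder.w": [0.3, -0.1], "classifier.w": [0.9, 0.8, 0.7]}
detector = {"encoder.w": [0.0, 0.0], "classifier.w": [0.0, 0.0]}
detector_init, copied = init_from_pretrained(detector, voiceprint)
```

The encoder weights carry over what the voiceprint model learned about channels and environments, while the classifier is trained from scratch on the sensitive audio labels.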

2.2.2 Language classification

The task of language classification is to determine which languages are contained in the input audio. Generally speaking, in live streaming it is risky for the platform when a streamer speaks in a language other than Chinese. For example, streamers who specialize in English teaching on TikTok do not dare to speak English continuously; if they do so for, say, one or two minutes, they soon receive a violation warning from the platform.

With language classification, this risk is greatly reduced for the platform, which can quickly find the risky live rooms. If the platform's moderation team understands the streamer's language, it can watch closely for violating content. If the team does not understand it, the simplest option is to close the live room so that the platform avoids the risk.
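Once a classifier labels each audio segment with a language, finding risky rooms can be as simple as tracking the longest continuous stretch of speech outside the allowed languages. The sketch below assumes segments arrive as (duration in seconds, language code) pairs and uses a 60-second threshold; both the format and the threshold are illustrative assumptions, not platform rules.

```python
def longest_foreign_run(segments, allowed=("zh",)):
    """Longest continuous duration (seconds) spoken outside the allowed languages."""
    longest = 0.0
    current = 0.0
    for duration_s, lang in segments:
        if lang in allowed:
            current = 0.0
        else:
            current += duration_s
            longest = max(longest, current)
    return longest


def should_flag(segments, threshold_s=60.0):
    """Flag the room for review once the run reaches the threshold."""
    return longest_foreign_run(segments) >= threshold_s


# Hypothetical classifier output for one live room.
room = [(30.0, "en"), (40.0, "en"), (10.0, "zh"), (20.0, "en")]
flagged = should_flag(room)
```

In this example the first two segments form a 70-second run of English, so the room is flagged and can be routed to a reviewer who understands the language.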

There are three main challenges in language classification :

The first is that data with a low signal-to-noise ratio is prone to false positives or misses. The causes may include environmental noise, reverberation and echo, far-field pickup distortion, and channel distortion. Interference from background music or live-stream sound effects makes language classification harder still.

The second challenge is that the large number of languages makes training difficult. There may be thousands of languages in the world, so collecting and labeling data is very hard, and it is difficult for us to obtain large amounts of high-quality training data.

The third challenge is that traditional algorithms have general limitations. If a person speaks several languages, voiceprint information alone may not be enough to decide. In scenarios such as classifying singing, the model easily fits to the background music, which leads to poor generalization. And when the speech segment is short, it may be hard to extract accurate pronunciation features.

These problems are similar to the challenges described earlier, so they are not analyzed again here. As shown in Figure 8, they can be addressed through data augmentation, improvements to the network architecture, pre-training, and similar means. Yitu's online customers are already using the language classification function, and judging from production use, the overall accuracy is good.

■ Figure 8

About the Agora Cloud Market

The Agora Cloud Market is a one-stop real-time interaction solution launched by Agora. By integrating the capabilities of technology partners, it gives developers a one-stop development experience that covers selection, price comparison, integration, account opening, and purchase of real-time interaction modules. It helps developers quickly add all kinds of RTE functions and bring applications to market quickly, saving 95% of the time needed to integrate RTE functions.

Yitu Real-Time Speech Transcription (Chinese) is now available on the Agora Cloud Market. It provides streaming speech recognition based on real-time speech transcription, supports Mandarin Chinese, and is compatible with multiple accents. It returns transcription results while the audio is still being received, so that you can obtain and use the text in real time.

Original site

Copyright notice
This article was written by [Acoustic network]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/173/202206220816120152.html