Yitu's challenges in voice processing for real-time audio and video -- RTC Dev Meetup
2022-06-22 08:19:00 【Agora】
Preface
Voice processing is a very important scenario in real-time interaction. At the Agora event 「RTC Dev Meetup: Technical practice and application of voice processing in real-time interaction」, technical experts from Baidu, Huanyu Technology and Yitu shared their work on this topic.
This article is based on the talk given at the event by Zhou Yuanjian, technical director of Yitu AI SaaS. To download the slides from the event, follow the Agora developer official account and reply with the keyword 「DM0428」.

Yitu is a provider of AI infrastructure and AI solutions. Its AI capabilities are fairly broad, covering images, video, speech, natural language processing and more. Beyond AI algorithms, it can also provide AI computing power.
With that background on Yitu in place, let me talk about the challenges of audio content moderation in live streaming.
01 The business process of live content moderation

■ Figure 1
Figure 1 shows the business process of content moderation in a live streaming scenario.
The basic process is as follows. The streamer starts a live broadcast and the stream is pushed to the platform. The platform sends a moderation request to a vendor (for example, Yitu), which pulls the stream by its address, decodes it, analyzes it in real time to find violating content, and returns the results to the customer through a callback. After receiving the data, the customer usually carries out a second, manual review. If a violation is confirmed, it is handled in the back end, for example by stopping the live broadcast or closing the account.
02 Algorithm modules for live audio moderation
Opening up the algorithm modules inside the system, as shown in Figure 2, there are three categories. The first is basic speech recognition (ASR). The second is text classification, which judges what violating content the recognized text contains. The third is nonverbal recognition, which catches violating content that is not expressed in words.

■ Figure 2
2.1 Technical difficulties in speech recognition (ASR)
First, the challenges encountered in ASR.
Overall there are two main challenges. The first is interference from strong background sound. In Internet voice scenarios, speech is usually accompanied by background music or game audio, the environment is generally noisy, and several people may even be talking at once. Compared with ordinary scenarios, these characteristics make speech recognition much harder.
The second is recognizing specific proprietary terms. Some violating words rarely appear in everyday life, so without special optimization a speech recognizer tends to map their syllables to more common words, and the violating words are missed.
2.1.1 Optimizing performance under strong background sound
So how do we deal with these problems? For strong background interference we tried many approaches, and the most effective turned out to be working on the data.
There are two main data-side optimizations. The first is to build a more refined ambient sound simulator based on the business scenario and use it for data augmentation. This approach has proven itself in other fields; for example, Tesla's autopilot models use similar techniques during training to improve performance.
Yitu built its simulator along several dimensions: speech production simulation, room simulation, sound pickup simulation, channel simulation and so on. Each dimension exposes adjustable parameters, such as the number of speakers, speaking rate and intonation, background sound, the position and direction of the sound source, aphonia effects and reverberation. In total there are probably hundreds of adjustable parameters. The simulator enriches what was originally fairly uniform training data and brings it closer to the target scenario, which yields a good performance improvement.
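One small piece of such a simulator can be sketched as mixing a background clip into clean speech at a randomly chosen signal-to-noise ratio. This is only an illustration under stated assumptions: the function names, the SNR range and the use of plain Python lists for samples are all hypothetical, and a real simulator as described above would also model rooms, pickup and channels.

```python
import math
import random

def rms(signal):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the speech-to-noise ratio equals snr_db."""
    if len(noise) < len(speech):
        # Loop the noise clip so it covers the whole utterance.
        noise = noise * (len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0:
        return list(speech)
    # Scale the noise so that rms(speech) / rms(scaled noise) matches the SNR.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]

def random_augment(speech, noise_bank, rng, snr_range=(0.0, 20.0)):
    """Pick a random background clip and a random SNR (hypothetical range)."""
    noise = rng.choice(noise_bank)
    snr_db = rng.uniform(*snr_range)
    return mix_at_snr(speech, noise, snr_db), snr_db
```

Calling `random_augment` once per training utterance, with a bank of music and game-audio clips, produces a different noisy variant each epoch from the same clean recording.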
The other improvement is training with hard example mining. Normal model training uses both positive and negative examples. With large amounts of data, there will always be some positive examples that look very similar to negative ones; such data are usually called hard cases. Online hard example mining repeatedly adds these hard cases back into training while the model is being trained. It is like a notebook of wrong answers: by recording the problems you have not mastered and revisiting them, you improve your grades.
By training on hard cases in this way, the model learns details that are not easy to distinguish and gains a good performance improvement. With the techniques above, the model achieves good performance even on data with strong background sound.
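A minimal sketch of one common variant of online hard example mining: within each batch, only the samples with the highest loss contribute to the update. The function names and the `keep_ratio` value are assumptions made for illustration; the source does not say which variant Yitu uses.

```python
def select_hard_examples(losses, keep_ratio=0.25):
    """Return the indices of the highest-loss samples in a batch."""
    keep = max(1, int(len(losses) * keep_ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:keep])

def hard_mined_loss(losses, keep_ratio=0.25):
    """Average the loss over the hard examples only."""
    hard = select_hard_examples(losses, keep_ratio)
    return sum(losses[i] for i in hard) / len(hard)
```

Easy samples, which the model already handles, stop diluting the gradient, so each step spends its effort on the cases recorded in the "wrong answer notebook".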
2.1.2 Recognizing specific proprietary terms
The other challenge mentioned earlier is recognizing proprietary terms. Figure 3 shows an example: the transcript of a piece of audio. If you have never heard the term “knock bubbles” before, you most likely cannot work out what the passage means. It is quite possible that “bubbles” will be heard as “terrible” (the two sound alike in Chinese).

■ Figure 3
For this problem we found two methods to be effective. The first is to increase the loss weight of proprietary terms during model training; that is, the model is penalized more heavily when it gets them wrong. Taking the example above: normally a misrecognized word costs 1 point, but if “knock bubbles” is misrecognized it costs 2 points. This way the model works harder to avoid errors on proprietary terms.
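The "1 point versus 2 points" idea can be sketched as a weighted negative log-likelihood. This is a simplification: it assumes a per-token cross-entropy loss, whereas production ASR systems often use sequence-level losses, and the token names, the dictionary representation of the model output and `key_weight=2.0` are all hypothetical.

```python
import math

def weighted_token_loss(probs, targets, key_tokens, key_weight=2.0):
    """Sum of per-token negative log-likelihoods, with key terms weighted more.

    probs:   one dict per position, mapping token -> predicted probability
    targets: the reference token at each position
    """
    total = 0.0
    for dist, target in zip(probs, targets):
        weight = key_weight if target in key_tokens else 1.0
        # Clamp the probability to avoid log(0) for tokens the model ruled out.
        total += -weight * math.log(max(dist.get(target, 0.0), 1e-12))
    return total
```

With the same predicted probability, an error on a token in `key_tokens` contributes twice as much loss as an error on an ordinary token, so gradient descent prioritizes getting those terms right.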
The second method is to adjust the range of candidate words searched in the lexicon during decoding. As shown in Figure 4, a speech recognition algorithm first recognizes individual phonemes from the speech spectrum and then converts the phonemes into possible text.

■ Figure 4
To optimize for proprietary terms, more candidate words can be considered when converting a sequence of phonemes into text. In the previous example, if “bubbles” is not in the candidate list, there is no way to recognize “knock bubbles” correctly.
The idea is intuitive, but implementing it introduces a new problem: the amount of computation increases significantly, roughly quadratically. In a non-real-time business scenario the extra computation may not matter much, but in live streaming it can lead to longer delays.
This matters a great deal when the live broadcast is latency sensitive, so the speed problem has to be solved. Generally, good live moderation responds within seconds, and the loosest requirement is within minutes. Yitu's acceleration scheme is to determine the candidate search scope dynamically. Going back to the business scenario: content moderation does not require every sentence to be recognized accurately. The key is to recognize the violating words accurately, and that can be exploited for optimization.
Specifically, when the preceding phonemes suggest that a violating word may be present, the search scope is widened for decoding the candidates that follow. This way low-frequency violating words are not missed, while computation that has no effect on the final business result is avoided. The overall amount of computation drops substantially and the service stays real-time.
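The dynamic search scope can be sketched as a per-step choice of candidate width: narrow by default, wide only while the most recent phonemes match the beginning of a watch-list term. Everything here is an assumption for illustration, including the phoneme symbols, the widths 5 and 50, and the `min_match` threshold; a real decoder would apply the width inside its beam search.

```python
def risky_prefix(phonemes, watch_list, min_match=2):
    """True if the tail of the phoneme sequence matches the start of any
    watch-list entry over at least min_match phonemes."""
    for entry in watch_list:
        for k in range(min_match, len(entry) + 1):
            if len(phonemes) >= k and list(phonemes[-k:]) == list(entry[:k]):
                return True
    return False

def candidate_width(phonemes, watch_list, narrow=5, wide=50, min_match=2):
    """Number of candidate words to keep at the current decoding step."""
    return wide if risky_prefix(phonemes, watch_list, min_match) else narrow

def widths_per_step(phonemes, watch_list, narrow=5, wide=50):
    """Width chosen after each phoneme is consumed, for inspection."""
    return [candidate_width(phonemes[:t], watch_list, narrow, wide)
            for t in range(1, len(phonemes) + 1)]
```

In this sketch the wide search is paid for only on the few steps around a suspected term, which is the source of the savings compared with widening the search everywhere.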
2.2 Nonverbal recognition
In live streaming, the demand for nonverbal recognition centers on voiceprint recognition of important people, sensitive audio detection, language classification and result fusion.
2.2.1 Sensitive audio detection
First, sensitive audio detection, which identifies whether a piece of audio contains ASMR or similar content. There are two technical difficulties. The first is that sensitive content is short and of variable length. During a live broadcast, publishers trying to evade moderation may mix sensitive sounds in with normal speech, so the sensitive sound is usually brief and well concealed. The second is that the violation concentration in the data is low, which means false positives must be kept low to reduce the cost of manual review. Maintaining high recall while keeping false positives low places higher demands on the robustness of the algorithm.
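Why low violation concentration forces a low false positive rate can be seen with a short calculation. The numbers used in the note after the code are hypothetical and chosen only to illustrate the effect; the source gives no actual rates.

```python
def precision(base_rate, recall, false_positive_rate):
    """Fraction of flagged clips that are real violations."""
    true_positives = base_rate * recall
    false_positives = (1.0 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)
```

If 0.1% of clips are violations and recall is 90%, a 1% false positive rate gives a precision of about 8%, so roughly 11 out of 12 clips sent to manual review are clean. Cutting the false positive rate to 0.1% raises precision to about 47%.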
The short duration of sensitive content is addressed mainly at the level of the algorithm network, as shown in Figure 5.

■ Figure 5
Usually a detection algorithm treats a piece of data as a whole. When the violating content is short, the sound signal of the normal content drowns out the abnormal, violating signal, and recall drops.
The usual way to avoid this is to cut the data into smaller pieces. That does avoid interference from normal sound, but it also loses the audio's original context, which leads to false positives. After many experiments and investigation, Yitu adopted the attention mechanism to solve this problem.
Over years of development, attention has achieved good results not only in machine translation but also in text, images, speech and other areas. Put simply, given a sequence of data, it first works out which positions in the sequence are important and then pays more attention to the data at those positions.
Applied to this scenario: when a piece of audio arrives, the attention mechanism keeps the complete information while determining which parts are more likely to be sensitive sounds, so more recognition attention is allocated there and the algorithm's overall performance improves.
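The contrast between treating the clip as a whole and attending to the important positions can be sketched with attention pooling over frame features. This is a generic illustration, not Yitu's network: the frame vectors are made up, and the `query` vector, which would be learned in practice, is fixed by hand here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, query):
    """Weight each frame by its similarity to the query, then sum.

    frames: list of feature vectors, one per audio frame
    query:  vector of the same dimension (learned in a real model)
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights

def mean_pool(frames):
    """Baseline: every frame counts equally."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]
```

With one salient frame among many ordinary ones, mean pooling dilutes the salient feature in proportion to clip length, while attention pooling keeps every frame in view and still concentrates most of the weight on the salient one.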
The other challenge is achieving low false positives and high recall at low violation concentration. Our solution is to improve performance with transfer learning and pre-training, as shown in Figure 6. Transfer learning is widely used across many fields. Starting from other models that are already well trained, we do additional training toward the model we want and end up with a better model. In effect, we continue the work while standing on the shoulders of giants.

■ Figure 6
Yitu has previously achieved good results in voiceprint competitions at home and abroad. Sensitive audio is closely related to voiceprints, and voiceprint recognition is an algorithm task of the same type, so it was natural to consider transferring this advantage to sensitive audio detection.
As shown in Figure 7, Yitu's voiceprint model learns invariance to channel, environment and similar factors, which makes the algorithm robust across many channel environments. We use our own voiceprint model to initialize the sensitive audio detection model, so the detection model inherits these characteristics and remains robust across a variety of channel environments.
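Initializing one model from another can be sketched as copying every parameter whose name and size match, while leaving the task-specific output layer at its fresh initialization. The parameter names, the `classifier.` prefix and the flat-list representation of weights are assumptions for illustration only.

```python
def init_from_pretrained(pretrained, target, skip_prefixes=("classifier.",)):
    """Copy matching parameters from pretrained into target.

    Both arguments map parameter names to flat lists of floats.
    Returns the names that were copied.
    """
    copied = []
    for name, value in pretrained.items():
        if name.startswith(tuple(skip_prefixes)):
            continue  # the output layer is specific to the source task
        if name in target and len(target[name]) == len(value):
            target[name] = list(value)
            copied.append(name)
    return copied
```

After this step the detection model would be fine-tuned on sensitive audio data, starting from feature layers that already encode the voiceprint model's invariances.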

■ Figure 7
2.2.2 Language classification
The task of language classification is to determine which languages the input audio contains. Generally speaking, in live streaming it is risky for the platform when a streamer speaks in a language other than Chinese. For example, streamers who teach English on Tiktok do not dare to speak English continuously. If they do so for, say, one or two minutes, they soon receive a violation warning from the platform.
With language classification, this risk to the platform is greatly reduced. The platform can quickly find the risky live rooms. If the platform's moderation team understands the streamer's language, it can watch carefully for violating content. If the team does not, the simplest option is to close the live room, and the platform avoids the risk.
There are three main challenges in language classification:
The first is that data with a low signal-to-noise ratio is prone to false positives and missed detections. The causes include environmental noise, reverberation and echo, far-field pickup distortion and channel distortion. Interference from background music or live sound effects makes language classification harder still.
The second is that the sheer number of languages makes training difficult. There may be thousands of languages in the world, so collecting and labeling data is very hard, and it is difficult to obtain large amounts of high-quality training data.
The third is that traditional algorithms have inherent limitations. If a person speaks several languages, voiceprint information alone may not be enough to decide. In scenarios such as singing, the model easily fits to the background music, which leads to poor generalization. And when a speech segment is short, it may be hard to extract accurate pronunciation features.
These problems are similar to the challenges described earlier, so they are not analyzed in detail here. As shown in Figure 8, they can be addressed with data augmentation, algorithm network improvements, pre-training and similar means. Yitu's online customers are already using the language classification function, and judging from production use, the overall accuracy is good.

■ Figure 8
About the Agora cloud marketplace
The Agora cloud marketplace is a one-stop real-time interaction solution launched by Agora. By integrating the capabilities of technology partners, it gives developers a one-stop development experience that covers the selection, price comparison, integration, account opening and purchase of real-time interaction modules. It helps developers quickly add all kinds of RTE functions, bring applications to market quickly, and save 95% of the time spent integrating RTE functions.
Yitu's real-time speech transcription (Chinese) is now available on the Agora cloud marketplace. It provides streaming speech recognition based on real-time transcription, supports Mandarin Chinese and is compatible with multiple accents. It returns transcription results while audio data is still being received, so you can obtain and use the text in real time.