Dialog+: Audio dialogue enhancement technology based on deep learning

2022-06-25 22:13:00 User 1324186

Source: IBC2021. Speaker: Matteo Torcoli. Compiled by: Chen Ziyu.

Researchers have found that audiences today often struggle to make out the dialogue in broadcast audio. To offer listeners a personalized sound balance, this talk presents Dialog+, a deep-learning-based scheme that adjusts the relative level of dialogue and ambient sound. Its effectiveness was evaluated through an online survey and field broadcast tests.

Contents

  • Summary of problems
  • Dialog+
  • WDR online survey
  • Field broadcast test
  • Summary

Summary of problems

The main problem this work addresses is how to balance the level of the dialogue against the level of the other background components in broadcast audio. The right balance is highly personal: individual preference, the listening environment, hearing ability, and many other factors all affect it, so no single mix can satisfy every listener at once. Traditional broadcasters such as WDR frequently receive negative feedback that the dialogue is hard to understand.

Next Generation Audio (NGA) systems such as MPEG-H Audio offer a very good solution to this problem. They allow the audio to be personalized on the end device, so users can choose for themselves how dialogue and ambient sound are balanced to suit different listening conditions. NGA has strong application potential and has already been adopted in mainstream broadcast and streaming standards, for example DVB, ATSC, TTA, and SBTVD.

To provide a personalized balance, the core problem is how to separate the dialogue component from the background component of an audio signal, so that the balance between the two can be handled flexibly at the production, distribution, and reception stages. Separating dialogue is a special case of a broader problem: how to separate the various components of a piece of audio and obtain the metadata (attribute data) for each of them. The authors propose Dialog+ to accurately extract, from an audio clip in which several components are already mixed together, the components and corresponding metadata that NGA requires, and thereby offer end users a better audio balance.

Dialog+

Dialog+ builds on recent advances in deep learning. For robustness and better performance, it is trained on real-world broadcast content, mostly from WDR and BR. The authors curated this data carefully, selecting the material most useful for training and optimizing the algorithm.

The following figure shows the Dialog+ processing framework. The first step is to separate the unknown sound sources. The input stereo mix is converted to the frequency domain with a short-time Fourier transform (STFT), and a deep convolutional network then predicts the separated dialogue and ambient sound from this frequency-domain representation. The authors argue that a deep convolutional architecture is well suited to pulling apart components with different characteristics, and they report that it outperforms other, more complex network structures.

Dialog+ processing framework
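The separation step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the network's output is applied as a time-frequency mask on the STFT of each channel; the talk summary does not say how the network's prediction is applied. The function `toy_dialogue_mask` is a hypothetical stand-in for the trained network (a fixed 300 to 3400 Hz band), and the frame size, hop size, and test tones are likewise assumptions.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """STFT of a 1-D signal with a Hann window; returns (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def toy_dialogue_mask(spec, sample_rate, n_fft=512):
    """HYPOTHETICAL stand-in for the trained network.

    A real system would predict the mask (or the sources) from the
    spectrogram; here every bin between 300 Hz and 3400 Hz is simply
    labelled "dialogue" so that the example is runnable.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    band = ((freqs >= 300.0) & (freqs <= 3400.0)).astype(float)
    return np.broadcast_to(band, spec.shape)

def separate(stereo, sample_rate, n_fft=512, hop=256):
    """Split each channel's spectrogram into dialogue and background parts."""
    dialogue, background = [], []
    for channel in stereo:
        spec = stft(channel, n_fft, hop)
        mask = toy_dialogue_mask(spec, sample_rate, n_fft)
        dialogue.append(mask * spec)             # estimated dialogue
        background.append((1.0 - mask) * spec)   # everything else
    return np.stack(dialogue), np.stack(background)

# Synthetic stereo mix: a 1 kHz tone (inside the toy "dialogue" band)
# plus a quieter 6 kHz tone (outside it), identical in both channels.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
mono = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)
mix = np.stack([mono, mono])

dialogue_spec, background_spec = separate(mix, sample_rate)
```

Because the two masks sum to one, the two estimates add back up to the original spectrogram. An inverse STFT would then return each component to the time domain for remixing.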

Dialog+ consists of two parts: automatic separation of dialogue and ambient sound, and automatic remixing of the separated audio. It can make the dialogue stand out while limiting the unnatural listening impression that lowering the ambient sound can cause. Once the dialogue and background components have been predicted, an equalizer adjusts the frequency response of both, and the two are combined into a new mix that differs from the original input. There are two remixing modes: global and time-varying. Global remixing lowers the background relative to the dialogue and keeps the two at a constant relative level. Time-varying remixing adjusts the relative level of ambient sound and dialogue automatically over time, depending on what the signal contains. Its benefit is that when there is no dialogue, the ambient sound is not reduced and the atmosphere it creates is preserved; when dialogue is detected, the ambient sound is lowered smoothly so that the dialogue stands out. The two modes can also be combined for a better balance.
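The difference between the two remixing modes can be sketched as follows. This is a simplified illustration and not the authors' implementation: the 6 dB attenuation, the RMS-threshold dialogue detector, the 50 ms smoothing window, and all function names are assumptions chosen only to make the idea concrete. The equalization step is omitted.

```python
import numpy as np

def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)

def global_remix(dialogue, background, attenuation_db=6.0):
    """Lower the background by a constant amount for the whole signal."""
    return dialogue + db_to_gain(-attenuation_db) * background

def time_varying_remix(dialogue, background, sample_rate,
                       attenuation_db=6.0, threshold=0.01, smooth_ms=50.0):
    """Lower the background only while dialogue is active.

    Dialogue activity is estimated from a short-term RMS envelope of the
    separated dialogue (an assumed, deliberately simple detector). The
    on/off decision is smoothed so that the gain ramps instead of jumping.
    """
    win = max(1, int(sample_rate * smooth_ms / 1000.0))
    kernel = np.ones(win) / win
    power = np.maximum(np.convolve(dialogue ** 2, kernel, mode="same"), 0.0)
    active = (np.sqrt(power) > threshold).astype(float)
    activity = np.convolve(active, kernel, mode="same")   # 0..1, smoothed
    # Gain is 1 without dialogue and drops to the attenuated value with it.
    gain = 1.0 - activity * (1.0 - db_to_gain(-attenuation_db))
    return dialogue + gain * background, gain

# Toy signals: silence then a 200 Hz "dialogue" tone, over a constant bed.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
dialogue = np.where(t >= 0.5, np.sin(2 * np.pi * 200 * t), 0.0)
background = 0.1 * np.ones_like(t)

global_mix = global_remix(dialogue, background)
varying_mix, gain = time_varying_remix(dialogue, background, sample_rate)
```

In the first half of the toy signal, where there is no dialogue, the time-varying mode leaves the background untouched, while the global mode attenuates it throughout. Combining the two would amount to applying a small constant attenuation plus an additional dialogue-dependent one.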

Finally, the remixed audio and its corresponding metadata are generated automatically. The result, in which the dialogue of the original audio is emphasized, can be used directly in NGA or, after rendering, delivered over traditional channel-based broadcast.

WDR online survey

WDR provides its audience with accessible broadcasting services every day and collects their feedback and suggestions. The WDR online survey aimed to better understand and address the concerns broadcasters face, and to evaluate Dialog+ in terms of user acceptance and satisfaction. Participants were given three video clips, each in two versions: the unprocessed original and a version processed with Dialog+. To make the results more objective, the three clips cover different kinds of scenes that often draw complaints about dialogue being hard to hear. After watching all of the videos, participants reported their impressions and opinions in an online questionnaire.

More than 2,000 people took part, about 80% of them aged 41 to 80. The following figure shows how often participants have difficulty following the dialogue when watching video. About 68% of all participants said they have this problem often or very often; among those older than 60, the figure is about 90%. The researchers found that understanding dialogue becomes harder with age, which shows that a single mix cannot serve audiences of all ages. A mix that suits younger viewers may be hard for older viewers to understand, while a mix that meets older viewers' needs may seem dull to younger viewers, because too much emphasis on dialogue destroys the atmosphere created by the ambient sound.

How often participants had trouble following the dialogue while watching video

The main question asked was whether participants would want to switch the sound balance to Dialog+ mode. The results show that most viewers would, and even those who never or rarely have trouble following the dialogue tend to switch to Dialog+. The second question asked which sound balance mode they preferred: about 46% of participants preferred the Dialog+ mode, and older listeners were more inclined to use Dialog+.

Participants' tendency to switch to Dialog+

Field broadcast test

Based on the results of the WDR online survey, the researchers carried out field broadcast tests in two ways.

  • WDR field test over DVB and streaming: the test ran for two days in December 2020 on a German TV channel. Viewers could select the Dialog+ sound balance mode from the video options.
  • BR field test over HbbTV2: HbbTV2 delivers the regular video and audio over DVB while additional audio versions are supplied over the network. The researchers added two extra Dialog+ versions, one with emphasized dialogue and one with more strongly emphasized dialogue, so that viewers could choose how prominent the dialogue should be according to their own preference.

Summary

Audiences today often struggle to make out the dialogue in broadcast audio. The researchers surveyed more than 2,000 viewers and found that the problem grows with age. Existing broadcasting practice, however, struggles to offer a sound balance personalized enough for audiences of different ages to hear the dialogue clearly. To address this, the authors propose Dialog+, a deep-learning method that first separates the ambient sound and the dialogue from the original audio and then recombines the enhanced dialogue with the ambient sound to produce a mix in which the dialogue stands out. Across the online survey and field broadcast tests, about 83% of listeners preferred to switch to Dialog+ mode, which supports the effectiveness of the scheme.

Finally, here is the video of the talk:

http://mpvideo.qpic.cn/0bc3eeaaaaaauiacbtpnuzqvaiodaaqqaaaa.f10002.mp4?dis_k=299af3f9e691bca560aafddf872d6f5f&dis_t=1645151068&vid=wxv_2237041039578710020&format_id=10002&support_redirect=0&mmversion=false

Copyright notice
This article was written by [User 1324186]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202181125168415.html