Technical practice and development trends of all-in-one video conferencing devices

2022-06-25 10:59:00 Advanced audio and video development

  Author | Waylon, algorithm expert at the DingTalk Hummingbird Audio Lab

As hybrid work becomes the norm, the efficiency of remote communication and collaboration matters a great deal. Yet remote meetings still suffer from many problems that hinder communication, such as meeting rooms that lack sound pickup and playback equipment, incompatible software and hardware, and speech that cannot be heard clearly because of far-field pickup. These problems wear down participants' patience, hurt meeting outcomes, and gradually drain a team's enthusiasm for discussion.

For this reason, Microsoft and Zoom abroad, as well as DingTalk and Tencent Meeting in China, are all building their own hardware terminal ecosystems, hoping to use hardware such as microphones, all-in-one audio and video devices, and conference boards to solve the pickup problem in hybrid online and offline work. Even so, one of the most common complaints in offline meetings is still that people cannot be heard clearly, or cannot be heard at all. The key to solving it is far-field pickup.

In fact, since the 1980s far-field pickup has been a pain point for industry and a hard problem for academia. The difficulty comes mainly from three audio problems: reverberation, noise, and echo. Of these, removing reverberation has been listed as “one of the ten unsolved engineering problems of our time”.

At present the industry has no mature, mass-producible solution. Against this background, the DingTalk Hummingbird Audio Lab developed a differential microphone array algorithm and was the first to achieve 10-meter far-field pickup with a single unit, in the F2 all-in-one video conferencing device. The solution can also be split into modules and shared with hardware manufacturers to improve the audio pickup or video capabilities of their devices.

1 What technical difficulties must far-field pickup overcome?

People in the audio and video industry often say “no video, we talk; no audio, we walk”. In other words, audio matters more than video in audio and video conferencing, yet audio has always been the weak point.

In medium and large meeting scenarios, such as business meetings and reporting meetings, the physical distances inside the meeting room attenuate the sound energy.

To address this, the mainstream products previously on the market were mostly split systems that pick up sound through several microphones deployed on the conference table. An all-in-one video conferencing device instead has to achieve far-field pickup with a single unit, overcoming long-distance transmission, reverberation, noise, and echo, so that participants can hear and be heard, express themselves, and communicate fully in every meeting.

1、 Long-distance transmission

When talking in a large conference room, people often cannot hear the other side clearly and keep repeating “Hello? Hello?” to check. Sometimes someone has to walk over to the equipment to confirm that the connection is working.

In this scenario the communication link is usually fine. The problem is caused by the limited pickup quality of the equipment and by the distance between the speaker and the device. Sound energy is inversely proportional to the square of the propagation distance: relative to the energy picked up at 1 meter, it decays to 1/16 at 4 meters and to 1/100 at 10 meters. This physical attenuation causes some components of the target speech to vanish from the spectrum. As a result, once the talker is far away, the target signal in the raw microphone signal is masked by noise sources that are closer to the microphone.
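The figures quoted above follow from the inverse-square law. The short Python sketch below reproduces them under an idealized free-field assumption (a point source with no reflections); the function names `relative_energy` and `attenuation_db` are illustrative and not taken from the original article.

```python
import math

def relative_energy(distance_m: float, reference_m: float = 1.0) -> float:
    """Energy at distance_m relative to reference_m (free-field inverse-square law)."""
    if distance_m <= 0 or reference_m <= 0:
        raise ValueError("distances must be positive")
    return (reference_m / distance_m) ** 2

def attenuation_db(distance_m: float, reference_m: float = 1.0) -> float:
    """Level drop in dB relative to the reference distance."""
    if distance_m <= 0 or reference_m <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(distance_m / reference_m)

for d in (1, 4, 10):
    print(f"{d:>2} m: energy ratio {relative_energy(d):.4f}, "
          f"{attenuation_db(d):.1f} dB below the 1 m level")
```

In a real room, reflections add energy back, so the measured drop is usually smaller than this free-field estimate, but the extra energy arrives as reverberation rather than as clean direct speech.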

2、 Reverberation

In meetings, the other party's voice sometimes sounds muddy, as if it were coming from a distant valley. That is the effect of reverberation.

Reverberation arises in enclosed spaces. The sound arriving at the receiver has traveled along multiple paths created by reflections off the walls. These reflections are divided into low-order and high-order reflections, which form early reverberation and late reverberation respectively. Reverberation has two distinct perceptual effects on listeners:

  • Box effect: the sound seems to come from all directions, as if the listener were “inside a box”, and it sounds muddy and uncomfortable.

  • Distant talker effect: the sound seems to come from far away, even farther than the actual distance.
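To make the distinction between the direct path, early reverberation, and late reverberation concrete, here is a toy Python sketch that builds a synthetic room impulse response and applies it to a dry signal by convolution. Everything in it is an assumption made for illustration: the delays, gains, 50 ms early/late boundary, and the names `synthetic_rir` and `reverberate` do not come from the article, and a measured room response would look different.

```python
import numpy as np

def synthetic_rir(fs=16000, rt60=0.6, seed=0):
    """Toy room impulse response: direct path + early reflections + late tail."""
    rng = np.random.default_rng(seed)
    n = round(fs * rt60)
    rir = np.zeros(n)
    rir[round(fs * 0.005)] = 1.0  # direct path, arriving after 5 ms
    # Early reverberation: a few sparse low-order reflections (delay in s, gain).
    for delay_s, gain in ((0.012, 0.6), (0.019, 0.45), (0.027, 0.35)):
        rir[round(fs * delay_s)] += gain
    # Late reverberation: dense noise-like tail whose amplitude envelope
    # is 60 dB down at t = rt60.
    start = round(fs * 0.05)
    t = np.arange(start, n) / fs
    rir[start:] += 0.2 * rng.standard_normal(n - start) * 10.0 ** (-3.0 * t / rt60)
    return rir

def reverberate(dry, rir):
    """Multi-path propagation modeled as convolution with the impulse response."""
    return np.convolve(dry, rir)

rir = synthetic_rir()
dry = np.random.default_rng(1).standard_normal(16000)  # 1 s stand-in for speech
wet = reverberate(dry, rir)
print(len(dry), len(rir), len(wet))
```

The late tail is what smears one syllable into the next and produces the muddy, distant impression described above; the sparse early taps mostly color the sound.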

2 The DingTalk Hummingbird Audio Lab's exploration and application of far-field pickup

In far-field pickup and far-field voice interaction, microphone array technology has played an indispensable role in recent years.

The lab's microphone array technology is the first in the industry to combine the acoustic characteristics of the microphones with the strengths of differential beamforming theory. It significantly raises the white noise gain of the differential beam in the low-frequency band, which makes low-frequency speech pickup much more robust and markedly improves the speech quality of the F2's far-field pickup.

The F2's microphone array technology consists mainly of differential beamforming and a multichannel dereverberation algorithm.

1、 Differential directional microphone array beamforming

dd79cc095058a93dd8737c4a91e66d8e.png

Beamforming originated in sensor arrays for radar antennas, as shown in the figure above. In communications, beamforming lets a base station extend its signal coverage. Likewise, beamforming based on a microphone array forms a spatial filter: it creates a pickup beam in the direct-path direction of the target sound and recovers the target-direction speech, which would otherwise be buried in interfering signals, without distortion.

Differential microphone arrays (DMA), also known as differential beamforming, have physical properties that make them especially well suited to speech signal processing. They have become a research hotspot in signal processing in recent years and are widely used in industry.

For differential microphone arrays, the DingTalk Hummingbird Audio Lab was the first in the industry to jointly optimize microphone acoustic characteristics and differential beamforming theory, proposing its self-developed differential directional microphone array. It clearly improves a long-standing pain point of the technology, the robustness of low-frequency speech pickup: the white noise gain of the differential beam in the low-frequency band improves by 20 dB.
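To see why low-frequency white noise gain (WNG) is the pain point, consider the textbook case of a first-order differential (cardioid) beamformer built from two omnidirectional microphones. The Python sketch below is only an illustration of the baseline problem, not the lab's differential directional microphone array; the 2 cm spacing, the endfire geometry, and all function names are assumptions chosen for this example. A plain delay-and-sum beamformer is included for comparison.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def steering(freq_hz, spacing_m, theta_rad):
    """Steering vector of a 2-mic endfire array for a far-field plane wave."""
    phase = 2 * np.pi * freq_hz * (spacing_m / C) * np.cos(theta_rad)
    return np.array([1.0, np.exp(-1j * phase)])

def cardioid_dma_weights(freq_hz, spacing_m):
    """First-order DMA: unit response at 0 degrees, null at 180 degrees."""
    D = np.vstack([steering(freq_hz, spacing_m, 0.0),
                   steering(freq_hz, spacing_m, np.pi)])
    # Each row enforces h^H d = sum(conj(h) * d), so solve for conj(h).
    h_conj = np.linalg.solve(D, np.array([1.0, 0.0], dtype=complex))
    return np.conj(h_conj)

def delay_and_sum_weights(freq_hz, spacing_m):
    """Delay-and-sum weights steered to 0 degrees."""
    return steering(freq_hz, spacing_m, 0.0) / 2.0

def wng_db(h, d_look):
    """White noise gain: |h^H d|^2 / (h^H h), in dB."""
    num = np.abs(np.vdot(h, d_look)) ** 2  # np.vdot conjugates its first argument
    den = np.real(np.vdot(h, h))
    return 10.0 * np.log10(num / den)

spacing = 0.02  # 2 cm, hypothetical
for f in (100, 300, 1000, 3000):
    d0 = steering(f, spacing, 0.0)
    print(f"{f:>5} Hz  DMA WNG {wng_db(cardioid_dma_weights(f, spacing), d0):6.1f} dB"
          f"   delay-and-sum WNG {wng_db(delay_and_sum_weights(f, spacing), d0):4.1f} dB")
```

For this two-microphone case the differential WNG works out to 2·sin²(ωd/c), so it collapses as frequency falls and the beamformer amplifies microphone self-noise at low frequencies. That is the robustness problem the paragraph above says the lab's design improves.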

The lab's research has been published as a series of papers at international speech conferences such as INTERSPEECH and ICASSP and has passed peer review (see the paper list at the end of this article). Independent tests show that its far-field pickup performance leads the industry both in objective tests (speech recognition accuracy) and in subjective tests (sound quality evaluation):

Far-field speech recognition accuracy is 7 to 9 percentage points higher than that of benchmark competitors, and the sound quality and clarity surpass those of all the well-known international brands available on the market. The DingTalk F2 all-in-one audio and video device is another product built on this theory.

2、 Multichannel dereverberation

Most current speech dereverberation algorithms fall into three categories: spectral enhancement, indirect inverse filtering, and direct inverse filtering.

  • Spectral enhancement usually applies a real-valued or complex-valued mask to the spectrum of the reverberant speech, treating reverberation as noise to be suppressed. Its performance is limited and it introduces some distortion, because reverberation is not additive noise.

  • Indirect inverse filtering usually requires the transfer functions between the sound source and the receivers. In theory it can remove reverberation perfectly, but in practice these transfer functions are not available.

  • Direct inverse filtering predicts the reverberation from the microphone array signals themselves rather than from the transfer functions, which makes it suitable for practical use. The most widely used direct inverse filtering method in industry is based on multichannel linear prediction (MCLP).
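The core idea of MCLP can be sketched in a few lines: within one STFT frequency bin, predict the late reverberation in the current frame from delayed past frames of all microphones, then subtract the prediction. The Python sketch below is a deliberately simplified, unweighted least-squares version for illustration. It is not the lab's algorithm, and practical methods add refinements such as iterative variance weighting. The function name, the `delay` and `taps` values, and the synthetic recursive "reverberation" in the demo are all assumptions.

```python
import numpy as np

def delayed_linear_prediction(X, delay=3, taps=2, ref=0):
    """Unweighted delayed linear prediction for ONE frequency bin.

    X: complex array of shape (channels, frames).
    Returns the reference channel minus the predicted late reverberation.
    """
    X = np.asarray(X, dtype=complex)
    n_ch, n_frames = X.shape
    # Row t of A holds frames t-delay, ..., t-delay-taps+1 of every channel
    # (zeros where no history exists yet).
    A = np.zeros((n_frames, n_ch * taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        if shift >= n_frames:
            break
        A[shift:, k * n_ch:(k + 1) * n_ch] = X[:, :n_frames - shift].T
    target = X[ref]
    g, *_ = np.linalg.lstsq(A, target, rcond=None)  # prediction filter
    return target - A @ g

# Synthetic demo: each channel is a white source plus a recursive echo 3 frames back.
rng = np.random.default_rng(0)
T = 4000
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)
x = np.zeros((2, T), dtype=complex)
for t in range(T):
    x[0, t] = s[t] + (0.5 * x[0, t - 3] if t >= 3 else 0.0)
    x[1, t] = 0.8 * s[t] + (0.4 * x[1, t - 3] if t >= 3 else 0.0)

y = delayed_linear_prediction(x, delay=3, taps=2, ref=0)
err = np.linalg.norm(y - s) / np.linalg.norm(s)
print(f"relative error to the dry source: {err:.3f}")
```

The prediction delay is what keeps the filter from cancelling the direct sound and early reflections: only energy that arrives at least `delay` frames later is treated as predictable reverberation. Note that the regressor matrix grows with the number of microphones and taps, which hints at the complexity issue discussed next.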

The lab has continued to study MCLP-based algorithms, reproduced the latest research results, and, on the F2, solved many practical problems of MCLP: high computational complexity when there are many microphones, performance degradation when there are few, and numerical blow-up of the filter. This has largely resulted in its own robust multichannel dereverberation algorithm with low complexity and high performance.

3 How will the video conferencing hardware industry develop?

What is the essence of video conferencing hardware? It is to make collaboration more efficient among multiple people who are in different places at the same time. At first, email and the telephone were enough for remote interaction. As technology advanced, people began to seek a more immersive real-time audio and video experience. Hardware provides more professional pickup microphones, HD cameras, and a rich set of interfaces, and an integrated software and hardware solution gives meetings a higher quality guarantee.

We believe that, as the industry matures, video conferencing hardware will develop in two directions: high integration and intelligence. High integration balances performance, aesthetics, and ease of use, and will become an important criterion for future enterprise products. Intelligence is the general trend across the software and hardware industries: technology makes pickup more accurate and noise reduction smarter, so that audio and video hardware can better serve all kinds of work and everyday scenarios.

The DingTalk F2 is the first all-in-one video conferencing device in China to deliver a high-definition audio and video experience at 10 meters from a single unit. Building on breakthroughs in software and hardware algorithms, AI, and engineering design, it offers clear single-unit pickup at 10 meters, smart directing (close-ups of the speaker), a two-person split-screen layout, and 4K image quality. It meets the needs of hybrid online and offline meetings and greatly improves efficiency and immersion in medium and large meeting scenarios.

Before a product goes to market it must be used or tested at a certain scale, and the DingTalk F2 is no exception. The DingTalk Meeting Rooms product team took our audio scientists through meeting rooms all across Alibaba Group to record test data in rooms of different sizes and structures, so as to improve the product's robustness.

Alibaba has a culture of inviting customer companies to co-create products. To further verify that the F2 fits users' needs and scenarios, the team often asks to sit in on meetings in customers' conference rooms, observing whether people use the device as originally designed, whether they run into problems, and whether new needs emerge.

On the technical side, for challenging scenarios we may next consider adding features such as directional pickup and an intelligent sound screen. For example, when the device is used in a noisy environment, turning on the intelligent sound screen lets the voice of the target speaker in a specific zone be picked up more clearly, so that participants can communicate more easily in complex acoustic environments.

In enterprises, perhaps 80% of meetings are offline and 20% are online. We have been exploring how to digitize offline meetings, for example by producing role-based meeting minutes, which relies on technologies such as sound source localization and voiceprint recognition.

The F2 is positioned as a hardware carrier, a container. Through cooperation models such as audio modules, audio and video modules, board-level modules, and whole-device integration, we will open up DingTalk's audio and video products, technologies, and algorithms to hardware manufacturers, helping partners build meeting experiences that combine software with hardware and online with offline.

With breakthroughs in audio and video technologies such as far-field pickup and intelligent noise reduction, integrated hardware and software products, and an open digital platform, DingTalk can help users better digitize online and offline meetings and turn them into enterprise assets.

Appendix: papers related to the self-developed microphone array published at international conferences by the DingTalk Hummingbird Audio Lab:

1. Weilong Huang, Jinwei Feng, “Minimum-Norm Differential Beamforming for Linear Array with Directional Microphones”, Interspeech 2021.

2. Weilong Huang, Jinwei Feng, “Differential Beamforming for Uniform Circular Array with Directional Microphones”, Interspeech 2020.

3. Cheng Xue, Weilong Huang, Weiguang Chen, Jinwei Feng, “Real-time Multi-channel Speech Enhancement Based on Neural Network Masking with Attention Model”, Interspeech 2021.

4. ShiLiang Zhang, Siqi Zheng, Weilong Huang, Ming Lei, Hongbin Suo, Jinwei Feng, Zhijie Yan, “Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings”, Interspeech 2021.

5. Siqi Zheng, Weilong Huang, Xianliang Wang, Hongbin Suo, Jinwei Feng, Zhijie Yan, “A real-time speaker diarization system based on spatial spectrum”, ICASSP 2021.

6. Weiguang Chen (intern), Cheng Xue (intern), Xionghu Zhong, “Cramér-Rao Lower Bound for DOA Estimation with an Array of Directional Microphones in Reverberant Environments”, Interspeech 2021.

7. Fan Yu, .., Weilong Huang, etc., “M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge”, ICASSP 2022.

8. Fan Yu, .., Weilong Huang, etc., “Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge”, ICASSP 2022.

9. Pengyu Wang, Feifei Xiong, Zhongfu Ye, Jinwei Feng, “Joint Estimation of Direction-Of-Arrival and Distance for Arrays with Directional Sensors Based on Sparse Bayesian Learning”, accepted for publication at Interspeech 2022.

10. Feifei Xiong, Weiguang Chen, Pengyu Wang, Xiaofei Li, Jinwei Feng, “Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation”, accepted for publication at Interspeech 2022.
