Speech enhancement - spectrum mapping

2022-06-28 06:24:00 【Salute=】

Contents

1. Introduction
2. Mapping-based speech enhancement
- 2.1 Spectrum mapping system model
3. Experimental analysis
4. References

1. Introduction

The main goal of speech enhancement is to extract the clean speech signal from a noisy speech signal; it is widely used in automatic speech recognition and hearing aids. Deep-learning-based speech enhancement methods fall into two categories: 1) mapping-based methods; 2) mask-based methods.

2. Mapping-based speech enhancement

Depending on the domain in which processing takes place (time domain / frequency domain), mapping-based speech enhancement methods can be divided into two categories:
1) Spectrum-mapping methods: a neural network learns the mapping from the spectrum of the noisy speech signal to the spectrum of the clean speech signal.
2) End-to-end methods: a neural network learns the mapping from the time-domain waveform of the noisy speech signal to the time-domain waveform of the clean speech signal.

2.1 Spectrum mapping system model

The spectrum mapping system model is shown in the figure below.

The specific process of speech feature extraction and time-domain reconstruction is as follows.
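In equation form, one common formulation of these two steps is sketched below. It is an assumption for illustration, not taken from the original figure: $y$ denotes the noisy waveform, $\hat{\mathbf{L}}$ the network's estimate of the clean log-amplitude spectrum, and the noisy phase is reused for reconstruction.

$$\mathbf{L}_{y}=\log\left|\operatorname{STFT}(y)\right|, \qquad \hat{x}=\operatorname{iSTFT}\left(\exp(\hat{\mathbf{L}})\, e^{j\angle\operatorname{STFT}(y)}\right)$$

Only the amplitude is enhanced in this formulation; the phase of the noisy signal passes through unchanged.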

Training phase:
1) Input: the input feature used in this experiment is the log-amplitude spectrum of the noisy speech signal. Note that the frame expansion technique of [1] is adopted: for example, when 5 frames of log-amplitude spectrum are fed to the network, the network output is the predicted 3rd (center) frame, as shown in the figure below.

2) Label: the log-amplitude spectrum of the clean speech signal. For example, when 5 frames of noisy log-amplitude spectrum are input, the label is the clean log-amplitude spectrum of the 3rd frame, which is the frame the network predicts.
3) Loss function: the MSE loss, $L_{\text{Loss}}=\|\hat{\mathbf{L}}-\mathbf{L}\|_{2}^{2}$, where $\hat{\mathbf{L}}$ is the predicted log-amplitude spectrum and $\mathbf{L}$ is the clean log-amplitude spectrum.
Note: normalizing the input log-amplitude spectrum accelerates network convergence; in this experiment, a BN (batch normalization) layer normalizes the input features.
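The frame expansion and loss described above can be sketched in NumPy as follows. This is a minimal illustration, not the author's code: the function names `expand_frames` and `mse_loss` are hypothetical, edge frames are padded by repetition (an assumption, since the source does not say how edges are handled), and the loss is averaged over elements, which differs from the squared norm above only by a constant factor.

```python
import numpy as np

def expand_frames(log_mag, n_expand):
    """Stack every frame with its n_expand neighbours on each side.

    log_mag: array of shape (n_frames, n_bins).
    Returns an array of shape (n_frames, (2 * n_expand + 1) * n_bins).
    Edge frames are padded by repeating the first/last frame (assumption).
    """
    padded = np.pad(log_mag, ((n_expand, n_expand), (0, 0)), mode="edge")
    n_frames = log_mag.shape[0]
    # windows[i][t] is the frame at offset (i - n_expand) from frame t
    windows = [padded[i:i + n_frames] for i in range(2 * n_expand + 1)]
    return np.concatenate(windows, axis=1)

def mse_loss(pred, target):
    """Mean squared error between predicted and clean log-amplitude frames."""
    return np.mean((pred - target) ** 2)

# Toy example: 6 frames, 2 frequency bins, 5-frame context (n_expand = 2)
noisy = np.arange(12, dtype=float).reshape(6, 2)
inputs = expand_frames(noisy, n_expand=2)   # shape (6, 10)
```

Each row of `inputs` holds the context window for one frame, and the matching training label is the clean log-amplitude spectrum of that same center frame.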

3. Experimental analysis

3.1 Experimental data set and parameter setting

Data set:
- Clean speech for training: all clean speech signals in DR1 of TIMIT-TRAIN.
- Clean speech for testing: the first 10 clean speech signals in DR1 of TIMIT-TEST.
- SNRs of the synthesized noisy speech (dB): [-5, 0, 5, 10].
- Noise sources used to synthesize the noisy speech: 3 noise types from NOISEX-92, ['babble', 'destroyerengine', 'factory1'].
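Synthesizing noisy speech at a target SNR usually amounts to rescaling the noise before adding it to the clean signal. The sketch below shows one way to do this; `mix_at_snr` is a hypothetical helper, and it assumes the noise recording is at least as long as the clean utterance and simply takes its first samples.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` after scaling it to the requested SNR (dB)."""
    noise = noise[:len(clean)]              # assumes len(noise) >= len(clean)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose scale so that clean_power / (scale**2 * noise_power) = 10**(snr_db/10)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Toy signals standing in for a TIMIT utterance and a NOISEX-92 recording
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(32000)
noisy_by_snr = {snr: mix_at_snr(clean, noise, snr) for snr in (-5, 0, 5, 10)}
```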
Parameter settings: short-time Fourier transform length N_fft = 512, window length win_length = 512, hop length hop_length = 128, window function 'hamming'; training parameters: epoch = 30, lr = 1e-4, batch_size = 16.
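With these STFT settings, feature extraction and time-domain reconstruction might look like the NumPy sketch below. It is an illustration under stated assumptions rather than the author's implementation: the helper names are hypothetical, the symmetric `np.hamming` window is used, no padding is applied at the signal edges, and the noisy phase is reused when inverting the enhanced spectrum.

```python
import numpy as np

N_FFT = 512                    # STFT length and window length
HOP = 128                      # hop length
WINDOW = np.hamming(N_FFT)     # symmetric Hamming window (assumption)

def stft(x):
    """One-sided STFT without edge padding; returns (n_frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """Weighted overlap-add inverse, normalised by the summed squared window."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    length = N_FFT + HOP * (len(spec) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    return out / norm

def log_amplitude(spec, eps=1e-8):
    """Log-amplitude spectrum; eps avoids log(0)."""
    return np.log(np.abs(spec) + eps)

def reconstruct(enhanced_log_amp, noisy_spec):
    """Combine an enhanced log-amplitude spectrum with the noisy phase."""
    phase = np.angle(noisy_spec)
    return istft(np.exp(enhanced_log_amp) * np.exp(1j * phase))

# Round trip on a toy signal whose length fits a whole number of hops
x = np.random.default_rng(0).standard_normal(N_FFT + HOP * 20)
spec = stft(x)
x_rec = reconstruct(log_amplitude(spec), spec)
```

In a real pipeline the first argument of `reconstruct` would be the network output for the noisy utterance; passing the unmodified log-amplitude spectrum, as here, only checks that analysis and synthesis are consistent.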

3.2 Experimental results

3.2.1 Frame expansion parameter (n_expand=3)

With the frame expansion parameter $n\_expand=3$, the number of frames input to the network is $2*n\_expand+1=7$. The PESQ scores and STOI values for $n\_expand=3$ are as follows.
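As a quick check of the arithmetic, the snippet below computes the context length and the resulting input size for several frame expansion values. It assumes the context frames are concatenated into a single vector and that the one-sided spectrum of a 512-point STFT is used (257 bins); the source does not state the network's input layout, and `input_size` is a hypothetical helper.

```python
N_FFT = 512                      # STFT length from the parameter settings
n_bins = N_FFT // 2 + 1          # one-sided spectrum: 257 frequency bins

def input_size(n_expand, n_bins=n_bins):
    """Number of input values when 2 * n_expand + 1 frames are concatenated."""
    return (2 * n_expand + 1) * n_bins

for n_expand in (1, 3, 5, 7):
    print(n_expand, 2 * n_expand + 1, input_size(n_expand))
# 1 3 771
# 3 7 1799
# 5 11 2827
# 7 15 3855
```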

3.2.2 Different frame expansion parameters (n_expand=1, 3, 5, 7)

The influence of the frame expansion parameter on the performance of spectrum-mapping speech enhancement is examined below:
(1) The PESQ and STOI values at each SNR for n_expand=1, 3, 5, 7 are shown in the figure below.