[Paper Study] VQMIVC
2022-06-25 08:11:00 【FallenDarkStar】
《VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion》 — Paper Study
Abstract
One-shot voice conversion (VC) between arbitrary speakers can be effectively achieved through speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes content information to leak into the speaker representation and degrades VC performance. To alleviate this problem, we employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as a correlation metric during training; by reducing the interdependence between the content, speaker, and pitch representations in an unsupervised manner, proper disentanglement is achieved. Experimental results show that the proposed method is effective in learning disentangled speech representations: it retains the content and intonation of the source speech while capturing the characteristics of the target speaker. Compared with state-of-the-art one-shot VC systems, the proposed approach achieves higher speech naturalness and speaker similarity.
Keywords: vector quantization, mutual information, unsupervised disentanglement, one-shot, voice conversion
1 Introduction
Voice conversion (VC) is a technique for modifying the paralinguistic factors of speech so that an utterance from the source speaker sounds as if it were produced by the target speaker. Paralinguistic factors include speaker identity (《An overview of voice conversion systems, Speech Communication》), prosody (《Transformation of speaker characteristics for voice conversion》), accent (《Non-native speech conversion with consistencyaware recursive network and generative adversarial network》), and so on. In this paper, we focus on the one-shot scenario, i.e., converting speaker identity between arbitrary speakers when only a single utterance of the target speaker is given as reference (《Voice conversion across arbitrary speakers based on a single target-speaker utterance》, 《Oneshot voice conversion with global speaker embeddings》).
Previous work based on speech representation disentanglement (SRD) (《Autovc: Zero-shot voice style transfer with only autoencoder loss》, 《One-shot voice conversion by separating speaker and content representations with instance normalization》, 《Vqvc+: One-shot voice conversion by vector quantization and u-net architecture》) attempts to solve one-shot VC by decomposing speech into speaker and content representations, and then converts speaker identity by replacing the source-speaker representation with that of the target speaker. However, the degree of SRD is hard to measure. Moreover, previous methods impose no correlation constraint between the speaker and content representations, which allows content information to leak into the speaker representation and degrades VC performance. To alleviate these problems, this paper proposes a vector-quantization- and mutual-information-based VC (VQMIVC) method: mutual information (MI) measures the dependencies between different representations and can be integrated effectively into training to achieve SRD in an unsupervised manner. Concretely, we first decompose speech into three factors, content, speaker, and pitch, and then propose a VC system consisting of four parts: (1) a content encoder using vector quantization with contrastive predictive coding (VQCPC) (《Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge》, 《vq-wav2vec: Selfsupervised learning of discrete speech representations》), which extracts frame-level content representations from acoustic features; (2) a speaker encoder, which takes acoustic features and produces a single fixed-dimensional vector as the speaker representation; (3) a pitch extractor, which computes the utterance-level normalized fundamental frequency (F0) as the pitch representation; and (4) a decoder that maps the content, speaker, and pitch representations back to acoustic features. During training, the VC system is optimized by minimizing the VQCPC, reconstruction, and MI losses: VQCPC exploits the local structure of speech, while MI reduces the interdependence between the different speech representations. During inference, one-shot VC is achieved by replacing the source speaker representation with a target speaker representation extracted from a single target utterance. The main contribution of this work is the combination of VQCPC and MI for SRD, which requires no supervision such as text transcriptions or speaker labels. We conduct extensive experiments and analyze the importance of MI in depth, showing that enhanced SRD significantly alleviates the information leakage problem.
2 Related work
The performance of VC training depends heavily on the availability of speech data from the target speaker (《Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory》, 《Voice conversion using partial least squares regression》, 《Inca algorithm for training voice conversion systems from nonparallel corpora》, 《Phonetic posteriorgrams for many-to-one voice conversion without parallel data training》, 《Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks》, 《Parallel-data-free voice conversion using cycle-consistent adversarial networks》, 《Stargan-vc: Non-parallel many-to-many voice conversion using star generative adversarial networks》). Therefore, the challenge of one-shot VC is to convert between arbitrary speakers that may be unseen during training, with only a single target-speaker utterance available as reference. Previous one-shot VC methods are based on SRD and aim to separate speaker information from content as far as possible. Related techniques include adjustable information-constraining bottlenecks (《Autovc: Zero-shot voice style transfer with only autoencoder loss》, 《F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder》, 《Unsupervised speech decomposition via triple information bottleneck》), instance normalization (《One-shot voice conversion by separating speaker and content representations with instance normalization》, 《Again-vc: A one-shot voice conversion using activation guidance and adaptive instance normalization》), and vector quantization (VQ) (《Vqvc+: One-shot voice conversion by vector quantization and u-net architecture》, 《Unsupervised speech representation learning using wavenet autoencoders》). We use VQCPC (《Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge》, 《vq-wav2vec: Selfsupervised learning of discrete speech representations》), an improved form of VQ, to extract accurate content representations. Without explicit constraints between the different speech representations, information leakage easily occurs and degrades VC performance. Inspired by information theory (《Mutual information analysis》), we adopt MI as a regularizer to constrain the correlation between variables. Computing MI is challenging for variables with unknown distributions, so SRD-based speech tasks (《Learning speaker representations with mutual information》, 《Intra-class variation reduction of speaker representation in disentanglement framework》, 《Unsupervised style and content separation by minimizing mutual information for speech synthesis》) have explored various methods for estimating lower bounds of MI (《Estimating divergence functionals and the likelihood ratio by convex risk minimization》, 《Noise-contrastive estimation: A new estimation principle for unnormalized statistical models》, 《Mutual information neural estimation》). To ensure that the MI value actually decreases, we propose to use the variational contrastive log-ratio upper bound (vCLUB) (《Club: A contrastive log-ratio upper bound of mutual information》).
A recent study (《Improving zero-shot voice style transfer via disentangled representation learning》) also employs MI for VC, using speaker labels as supervision for learning speaker representations. The proposed method differs from (《Improving zero-shot voice style transfer via disentangled representation learning》) in that it combines VQCPC and MI for fully unsupervised training, and incorporates a pitch representation to preserve the intonation variation of the source speech.
3 Proposed method
This section first introduces the system architecture of the VQMIVC method, then elaborates on how MI minimization is integrated into the training process, and finally describes how one-shot VC is performed.
3.1 VQMIVC System architecture
As shown in Figure 1, the proposed VQMIVC system consists of four modules: a content encoder, a speaker encoder, a pitch extractor, and a decoder. The first three modules extract the content, speaker, and pitch representations from the input speech, respectively; the fourth module, the decoder, maps these representations back to acoustic features. Suppose there are $K$ utterances. We use the Mel spectrogram as the acoustic feature and randomly select $T$ frames from each utterance for training. The $k^{th}$ Mel spectrogram is denoted as $X_k=\{x_{k,1},x_{k,2},...,x_{k,T}\}$.
Content encoder $\theta_c$:
The content encoder aims to extract linguistic content information from $X_k$ using VQCPC. As shown in Figure 2, it consists of two networks, h-net: $X_k \to Z_k$ and g-net: $\hat{Z}_k \to R_k$, together with a VQ operation $q$: $Z_k \to \hat{Z}_k$. h-net derives a dense feature sequence $Z_k=\{z_{k,1},z_{k,2},...,z_{k,T/2}\}$ from $X_k$, reducing the length from $T$ to $T/2$. The quantizer $q$ then discretizes $Z_k$ into $\hat{Z}_k=\{\hat{z}_{k,1},\hat{z}_{k,2},...,\hat{z}_{k,T/2}\}$ using a trainable codebook $B$, where $\hat{z}_{k,t} \in B$ is the codebook vector closest to $z_{k,t}$. VQ imposes an information bottleneck on $Z_k$ that removes unnecessary details, so that $\hat{Z}_k$ is associated with the underlying linguistic information. The content encoder $\theta_c$ is then trained by minimizing the VQ loss (《Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge》):
$$L_{VQ}=\frac{2}{KT}\sum_{k=1}^{K}\sum_{t=1}^{T/2}\|z_{k,t}-sg(\hat{z}_{k,t})\|_2^2 \tag{1}$$
where $sg(\cdot)$ is the stop-gradient operator. To further encourage $\hat{Z}_k$ to capture local structure, an RNN-based g-net operating on $\hat{Z}_k$ produces aggregations $R_k=\{r_{k,1},r_{k,2},...,r_{k,T/2}\}$, following contrastive predictive coding (CPC). Given $r_{k,t}$, the model is trained by minimizing the InfoNCE loss (《Representation learning with contrastive predictive coding》) to distinguish the positive sample $\hat{z}_{k,t+m}$, located $m$ steps in the future, from negative samples drawn from the set $\Omega_{k,t,m}$:
$$L_{CPC}=-\frac{1}{KT'M}\sum_{k=1}^{K}\sum_{t=1}^{T'}\sum_{m=1}^{M}\log\left[\frac{\exp(\hat{z}_{k,t+m}^{T}W_m r_{k,t})}{\sum_{z\in\Omega_{k,t,m}}\exp(z^{T}W_m r_{k,t})}\right] \tag{2}$$
where $T'=T/2-M$ and $W_m\ (m=1,2,...,M)$ are trainable projection matrices. By predicting future samples with the contrastive loss (2), local features spanning several time steps (such as phonemes) are encoded into $\hat{Z}_k=f(X_k;\theta_c)$, the content representation used to accurately reconstruct the linguistic content. During training, the negative set $\Omega_{k,t,m}$ is formed by randomly drawing samples from the current utterance.
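As a concrete illustration, below is a minimal PyTorch sketch of the VQ loss of Eq. (1) and the InfoNCE/CPC loss of Eq. (2) for one batch. The tensor shapes, the nearest-neighbour codebook lookup, the straight-through estimator, and the way negatives are drawn from the current utterance are simplifying assumptions for readability, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def vqcpc_losses(z, codebook, r, W, n_negatives=10):
    """Sketch of the VQ loss (Eq. 1) and the InfoNCE/CPC loss (Eq. 2).

    z:        (K, T2, D)   dense features Z_k from h-net (T2 = T/2)
    codebook: (V, D)       trainable codebook B
    r:        (K, T2, Dr)  aggregations R_k from g-net
    W:        list of M trainable (Dr, D) projection matrices W_m
    """
    K, T2, D = z.shape
    M = len(W)

    # --- VQ: map each z_{k,t} to its nearest codebook entry ---
    dist = torch.cdist(z.reshape(-1, D), codebook)            # (K*T2, V)
    z_q = codebook[dist.argmin(dim=-1)].view(K, T2, D)        # quantized \hat{Z}_k
    vq_loss = F.mse_loss(z, z_q.detach())                     # ~ ||z - sg(z_hat)||^2
    z_q = z + (z_q - z).detach()                              # straight-through estimator

    # --- CPC / InfoNCE over M future steps ---
    Tp = T2 - M
    cpc_loss = 0.0
    for m in range(1, M + 1):
        pred = r[:, :Tp] @ W[m - 1]                           # W_m r_{k,t}      (K, Tp, D)
        pos = z_q[:, m:m + Tp]                                # \hat{z}_{k,t+m}  (K, Tp, D)
        pos_logit = (pred * pos).sum(-1, keepdim=True)        # (K, Tp, 1)
        # negatives drawn from the current utterance (may occasionally hit the positive)
        neg_idx = torch.randint(0, T2, (K, Tp, n_negatives), device=z.device)
        neg = torch.gather(z_q.unsqueeze(1).expand(K, Tp, T2, D), 2,
                           neg_idx.unsqueeze(-1).expand(-1, -1, -1, D))
        neg_logit = torch.einsum('ktd,ktnd->ktn', pred, neg)  # (K, Tp, N)
        logits = torch.cat([pos_logit, neg_logit], dim=-1)    # class 0 = positive
        target = torch.zeros(K * Tp, dtype=torch.long, device=z.device)
        cpc_loss = cpc_loss + F.cross_entropy(logits.reshape(K * Tp, -1), target)
    return vq_loss, cpc_loss / M, z_q
```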
Speaker encoder $\theta_s$: The speaker encoder takes $X_k$ and produces a vector $s_k=f(X_k;\theta_s)$ as the speaker representation. By capturing global speech characteristics, $s_k$ controls the speaker identity of the generated speech.
Pitch extractor: The pitch representation is expected to capture intonation variation while containing no content or speaker information, so we extract $F_0$ from the waveform and apply z-normalization independently to each utterance. In our experiments, the per-utterance normalized logarithmic $F_0$ (log-$F_0$), denoted $P_k=(p_{k,1},p_{k,2},...,p_{k,T})$, is used as the pitch representation; it is speaker-independent, which forces the speaker encoder to provide speaker-related information such as pitch range.
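A minimal sketch of the pitch extractor follows. The paper only states that log-F0 is extracted from the waveform and z-normalized per utterance; the use of the WORLD extractor (pyworld's DIO + StoneMask) and the zeroing of unvoiced frames below are illustrative assumptions.

```python
import numpy as np
import librosa
import pyworld as pw  # WORLD vocoder bindings, used here only as an example F0 extractor

def extract_pitch(wav_path, sr=16000, frame_period_ms=10.0):
    """Per-utterance z-normalized log-F0, i.e. the pitch representation P_k."""
    y, _ = librosa.load(wav_path, sr=sr)
    y = y.astype(np.float64)
    f0, t = pw.dio(y, sr, frame_period=frame_period_ms)      # coarse F0 at a 10 ms hop
    f0 = pw.stonemask(y, f0, t, sr)                           # refined F0
    voiced = f0 > 0
    log_f0 = np.zeros_like(f0)
    log_f0[voiced] = np.log(f0[voiced])
    mean, std = log_f0[voiced].mean(), log_f0[voiced].std()
    log_f0[voiced] = (log_f0[voiced] - mean) / (std + 1e-8)   # utterance-level z-norm
    return log_f0                                             # unvoiced frames stay at 0
```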
Decoder $\theta_d$: The decoder maps the content, speaker, and pitch representations back to the Mel spectrogram. $\hat{Z}_k$ is upsampled ($\times 2$) by linear interpolation and $s_k$ is repeated ($\times T$); together with $p_k$ they form the decoder input used to generate the Mel spectrogram $\hat{X}_k=\{\hat{x}_{k,1},\hat{x}_{k,2},...,\hat{x}_{k,T}\}$. The decoder is jointly trained with the content encoder and the speaker encoder by minimizing the reconstruction loss:
$$L_{REC}=\frac{1}{KT}\sum_{k=1}^{K}\sum_{t=1}^{T}\left[\|\hat{x}_t-x_t\|_1+\|\hat{x}_t-x_t\|_2\right] \tag{3}$$
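The following sketch shows one plausible way to align the three representations and compute the reconstruction loss of Eq. (3); the interpolation and broadcasting details, and the decoder interface (a single module taking the concatenated features), are assumptions based on the description above.

```python
import torch
import torch.nn.functional as F

def decode_and_reconstruct(decoder, z_q, spk, pitch, mel):
    """z_q:   (K, T/2, Dc) quantized content representation \\hat{Z}_k
       spk:   (K, Ds)      utterance-level speaker embedding s_k
       pitch: (K, T)       normalized log-F0 p_k
       mel:   (K, T, 80)   ground-truth Mel spectrogram X_k
    """
    K, T, _ = mel.shape
    # upsample the content (x2) by linear interpolation to the Mel frame rate
    z_up = F.interpolate(z_q.transpose(1, 2), size=T,
                         mode='linear', align_corners=False).transpose(1, 2)
    # repeat the speaker vector for every frame (xT)
    spk_rep = spk.unsqueeze(1).expand(-1, T, -1)
    dec_in = torch.cat([z_up, spk_rep, pitch.unsqueeze(-1)], dim=-1)
    mel_hat = decoder(dec_in)                                      # (K, T, 80)
    rec_loss = F.l1_loss(mel_hat, mel) + F.mse_loss(mel_hat, mel)  # Eq. (3): L1 + L2 terms
    return mel_hat, rec_loss
```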
3.2 Integrating MI minimization into VQMIVC training
Given random variables $u$ and $v$, MI is the Kullback-Leibler (KL) divergence between their joint distribution and the product of their marginals, $I(u,v)=D_{KL}(P(u,v)\,\|\,P(u)P(v))$. We use vCLUB (《Club: A contrastive log-ratio upper bound of mutual information》) to compute an upper bound of MI:
$$I(u,v)=\mathbb{E}_{P(u,v)}[\log Q_{\theta_{u,v}}(u|v)]-\mathbb{E}_{P(u)}\mathbb{E}_{P(v)}[\log Q_{\theta_{u,v}}(u|v)] \tag{4}$$
where $u,v\in\{\hat{Z},s,p\}$; $\hat{Z}$, $s$, and $p$ denote the content, speaker, and pitch representations, respectively; and $Q_{\theta_{u,v}}(u|v)$ is a variational approximation of the true posterior, parameterized by a network $\theta_{u,v}$. The unbiased vCLUB estimates between the different speech representations are given by:
$$\hat{I}(\hat{Z},s)=\frac{2}{K^2T}\sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{t=1}^{T/2}\left[\log Q_{\theta_{\hat{z},s}}(\hat{z}_{k,t}|s_k)-\log Q_{\theta_{\hat{z},s}}(\hat{z}_{l,t}|s_k)\right] \tag{5}$$
$$\hat{I}(p,s)=\frac{1}{K^2T}\sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{t=1}^{T}\left[\log Q_{\theta_{p,s}}(p_{k,t}|s_k)-\log Q_{\theta_{p,s}}(p_{l,t}|s_k)\right] \tag{6}$$
$$\hat{I}(\hat{Z},p)=\frac{2}{K^2T}\sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{t=1}^{T/2}\left[\log Q_{\theta_{\hat{z},p}}(\hat{z}_{k,t}|\hat{p}_{k,t})-\log Q_{\theta_{\hat{z},p}}(\hat{z}_{l,t}|\hat{p}_{k,t})\right] \tag{7}$$
where $\hat{p}_{k,t}=(p_{k,2t-1}+p_{k,2t})/2$. With a good variational approximation, Equation (4) provides a reliable MI upper bound. Therefore, we can minimize (5)-(7) to reduce the correlation between the different speech representations; the total MI loss is:
$$L_{MI}=\hat{I}(\hat{Z},s)+\hat{I}(\hat{Z},p)+\hat{I}(p,s) \tag{8}$$
During training, the variational approximation networks and the VC network are optimized alternately. The variational approximation networks are trained to maximize the log-likelihood:
$$L_{u,v}=\log Q_{\theta_{u,v}}(u|v),\quad u,v\in\{\hat{Z},s,p\} \tag{9}$$
while the VC network is trained to minimize the VC loss:
$$L_{VC}=L_{VQ}+L_{CPC}+L_{REC}+\lambda_{MI}L_{MI} \tag{10}$$
where $\lambda_{MI}$ is a constant weight that controls how much the MI loss contributes to disentanglement. The overall training procedure is summarized in Algorithm 1. Note that no text transcriptions or speaker labels are used during training, so the proposed method is fully unsupervised.
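To make Eqs. (4)-(10) concrete, here is a minimal PyTorch sketch of a Gaussian variational approximation network and the vCLUB estimate for the content/speaker pair (Eq. (5)); the network architecture and the diagonal-Gaussian parameterization are assumptions. The estimates for the other two pairs (Eqs. (6)-(7)) follow the same pattern.

```python
import torch
import torch.nn as nn

class GaussianVariational(nn.Module):
    """Variational approximation Q_theta(u|v): a diagonal Gaussian over u predicted from v."""
    def __init__(self, dim_v, dim_u, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_v, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, dim_u)
        self.logvar = nn.Linear(hidden, dim_u)

    def log_prob(self, u, v):
        h = self.net(v)
        mu, logvar = self.mu(h), self.logvar(h)
        # log N(u; mu, diag(exp(logvar))), constant terms omitted (they cancel in vCLUB)
        return (-0.5 * (logvar + (u - mu) ** 2 / logvar.exp())).sum(-1)


def vclub_content_speaker(q_zs, z_q, spk):
    """Unbiased vCLUB estimate of I(Z_hat, s), cf. Eq. (5).

    z_q: (K, T2, Dc) content representations; spk: (K, Ds) speaker vectors.
    """
    K, T2, Dc = z_q.shape
    # positive term: log Q(z_hat_{k,t} | s_k)
    s_pos = spk.unsqueeze(1).expand(-1, T2, -1)
    pos = q_zs.log_prob(z_q, s_pos).mean()
    # negative term: log Q(z_hat_{l,t} | s_k) over all (k, l) utterance pairs in the batch
    z_all = z_q.unsqueeze(0).expand(K, -1, -1, -1)            # indexed [k, l, t]
    s_all = spk.view(K, 1, 1, -1).expand(-1, K, T2, -1)
    neg = q_zs.log_prob(z_all, s_all).mean()
    return pos - neg                                          # MI upper bound, to be minimized
```

In the alternating scheme, the positive log-likelihood term is what the variational networks maximize (Eq. (9)); the returned estimates for the three representation pairs are summed into $L_{MI}$ (Eq. (8)), weighted by $\lambda_{MI}$, and minimized together with the other terms of Eq. (10) with respect to the VC network.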
3.3 One-shot VC
During conversion, the content and pitch representations $\hat{Z}_{src}=f(X_{src};\theta_c)$ and $p_{src}$ are first extracted from the source speaker's utterance $X_{src}$, while the speaker representation $s_{tgt}=f(X_{tgt};\theta_s)$ is extracted from only one utterance $X_{tgt}$ of the target speaker; the decoder then generates the converted Mel spectrogram as $f(\hat{Z}_{src},s_{tgt},p_{src};\theta_d)$.
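A sketch of this conversion step, assuming the trained modules are available as `content_encoder`, `speaker_encoder`, `decoder`, and `vocoder`; all names and interfaces are placeholders.

```python
import torch

@torch.no_grad()
def one_shot_convert(content_encoder, speaker_encoder, decoder, vocoder,
                     mel_src, mel_tgt, pitch_src):
    """mel_src, mel_tgt: (1, T, 80) Mel spectrograms of the source and of a single
    target utterance; pitch_src: (1, T) normalized log-F0 of the source."""
    z_src = content_encoder(mel_src)              # content representation of the source
    s_tgt = speaker_encoder(mel_tgt)              # speaker representation from ONE target utterance
    mel_conv = decoder(z_src, s_tgt, pitch_src)   # converted Mel spectrogram
    return vocoder(mel_conv)                      # waveform, e.g. via a Parallel WaveGAN vocoder
```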
4 Experiments
4.1 Experimental setup
All experiments are conducted on the VCTK corpus (《Superseded-cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit》); its 110 English speakers are randomly split into 90 and 20 speakers for the training and test sets, respectively. The test speakers are treated as unseen speakers and used to perform one-shot VC. For acoustic feature extraction, all recordings are downsampled to 16 kHz, and 80-dimensional Mel spectrograms and F0 are computed with a 25 ms Hanning window, a 10 ms frame shift, and a 400-point fast Fourier transform.
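A sketch of this feature-extraction setup using librosa; the log compression and any further normalization are assumptions, since the paper does not specify them.

```python
import numpy as np
import librosa

def extract_mel(wav_path, sr=16000):
    """80-dim log-Mel spectrogram: 25 ms Hann window, 10 ms hop, 400-point FFT."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, win_length=400,
                                         hop_length=160, window='hann', n_mels=80)
    return librosa.power_to_db(mel, ref=np.max).T   # (T, 80); dB scaling is an assumption
```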
The VC network consists of a content encoder, a speaker encoder, and a decoder. The content encoder contains h-net, the quantizer $q$, and g-net. h-net is composed of a stride-2 convolution layer followed by 4 blocks of layer normalization, 512-dimensional linear layers, and ReLU activations. The quantizer contains a codebook of 512 learnable 64-dimensional vectors. g-net is a 256-dimensional unidirectional RNN layer. For CPC, the number of future prediction steps $M$ is 6 and the number of negative samples $|\Omega_{k,t,m}|$ is 10. The speaker encoder follows (《One-shot voice conversion by separating speaker and content representations with instance normalization》); it uses 8 ConvBank layers to encode long-term information, 12 convolution layers with an average pooling layer, and 4 linear layers to derive the 256-dimensional speaker representation. Following (《Autovc: Zero-shot voice style transfer with only autoencoder loss》), the decoder has a 1024-dimensional LSTM layer, three convolution layers, two 1024-dimensional LSTM layers, and an 80-dimensional linear layer. In addition, a 5-layer convolutional Postnet refines the predicted Mel spectrogram, and a Parallel WaveGAN vocoder (《Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram》) trained on the VCTK corpus converts it to a waveform. The VC network is trained with the Adam optimizer (《Adam: A method for stochastic optimization》); the learning rate is warmed up from 1e-6 to 1e-3 over the first 15 epochs, then halved every 100 epochs after epoch 200, for a total of 500 epochs. The batch size is 256, and 128 frames are randomly selected from each utterance in every iteration. The variational approximation networks are trained with an Adam optimizer using a learning rate of 3e-4. We compare the proposed VQMIVC method with state-of-the-art one-shot VC methods, including AutoVC (《Autovc: Zero-shot voice style transfer with only autoencoder loss》), AdaIN-VC (《One-shot voice conversion by separating speaker and content representations with instance normalization》), and VQVC+ (《Vqvc+: One-shot voice conversion by vector quantization and u-net architecture》).
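The content-encoder description above can be summarized in a rough module skeleton; the layer counts, hidden sizes, and codebook size follow the text, while the convolution kernel size, the final projection, and the RNN type (a GRU here) are assumptions.

```python
import torch
import torch.nn as nn

class HNet(nn.Module):
    """h-net: one stride-2 convolution + 4 blocks of LayerNorm/Linear(512)/ReLU,
    projected to the 64-dim codebook space (kernel size and projection assumed)."""
    def __init__(self, n_mels=80, hidden=512, code_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hidden, kernel_size=4, stride=2, padding=1)
        self.blocks = nn.Sequential(*[
            nn.Sequential(nn.LayerNorm(hidden), nn.Linear(hidden, hidden), nn.ReLU())
            for _ in range(4)])
        self.proj = nn.Linear(hidden, code_dim)

    def forward(self, mel):                                   # mel: (K, T, 80)
        x = self.conv(mel.transpose(1, 2)).transpose(1, 2)    # (K, T/2, 512)
        return self.proj(self.blocks(x))                      # (K, T/2, 64)

class ContentEncoder(nn.Module):
    """h-net + a 512 x 64 codebook + a 256-dim unidirectional RNN (GRU assumed) as g-net."""
    def __init__(self):
        super().__init__()
        self.h_net = HNet()
        self.codebook = nn.Parameter(torch.randn(512, 64))
        self.g_net = nn.GRU(64, 256, batch_first=True)
```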
4.2 Experimental results and Analysis
4.2.1 Speech representation disentanglement performance
In the VC loss (10), $\lambda_{MI}$ determines how strongly MI enforces SRD. We first vary $\lambda_{MI}$ and evaluate the degree of disentanglement between the different speech representations extracted from all test utterances by computing vCLUB, as shown in Table 1. We can see that as $\lambda_{MI}$ increases, the MI values tend to decrease, reducing the correlation between the different speech representations.
To measure the entanglement between content information and the speaker representation, we generate speech in two ways: (1) same, i.e., using the content, speaker, and pitch representations of the same utterance; and (2) mixed, i.e., using the content and pitch representations of one utterance and the speaker representation of another utterance, where the two utterances belong to the same speaker. An automatic speech recognition (ASR) system is then used to obtain the character/word error rates (CER/WER) of the generated speech. The increases in CER and WER from same to mixed are denoted $\Delta_C$ and $\Delta_W$. Since the only difference between the two generation settings is the input speaker representation, larger $\Delta_C$ and $\Delta_W$ values indicate that more content information has leaked into the speaker representation. All test speakers are used for speech generation, and the ASR system is a publicly released Jasper-based model (《Jasper: An end-to-end convolutional neural acoustic model》). The results are shown in Table 2. When MI is not used ($\lambda_{MI}=0$), the generated speech is severely contaminated by undesired content information residing in the speaker representation, as reflected by the largest $\Delta_C$ and $\Delta_W$ values. When MI is used ($\lambda_{MI}>0$), $\Delta_C$ and $\Delta_W$ drop significantly. As $\lambda_{MI}$ increases, $\Delta_C$ and $\Delta_W$ keep decreasing, showing that a larger $\lambda_{MI}$ mitigates the leakage of content information into the speaker representation to a greater extent.
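For reference, a sketch of how $\Delta_C$/$\Delta_W$ could be computed, assuming the `jiwer` package for CER/WER and pre-computed ASR transcripts; the Jasper ASR system itself is not shown.

```python
from jiwer import wer, cer   # assumes the jiwer package for CER/WER

def delta_error_rates(refs, hyp_same, hyp_mixed):
    """refs: ground-truth transcripts; hyp_same / hyp_mixed: ASR transcripts of speech
    generated under the 'same' and 'mixed' conditions. Returns (Delta_C, Delta_W)."""
    delta_c = cer(refs, hyp_mixed) - cer(refs, hyp_same)
    delta_w = wer(refs, hyp_mixed) - wer(refs, hyp_same)
    return delta_c, delta_w
```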
In addition, we design two speaker classifiers, taking $\hat{Z}$ and $s$ as input respectively, and a predictor that infers $p$ from $\hat{Z}$. Both the classifiers and the predictor are 4-layer fully connected networks with a hidden size of 256. Higher speaker classification accuracy indicates that $\hat{Z}$ or $s$ carries more speaker information, while a higher prediction loss for $p$ (mean squared error) indicates that $\hat{Z}$ contains less pitch information. The results are shown in Table 3. We observe that as $\lambda_{MI}$ increases, $\hat{Z}$ contains less speaker and pitch information, yielding lower classification accuracy and higher pitch prediction loss. For all $\lambda_{MI}$, the speaker classification accuracy on $s$ is very high, indicating that $s$ carries rich speaker information, but the accuracy on $s$ drops as $\lambda_{MI}$ increases, suggesting that an overly large $\lambda_{MI}$ can cause $s$ to lose speaker information. To ensure proper disentanglement, $\lambda_{MI}$ is set to $1e-2$ in the following experiments.
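A sketch of these probing networks, under the stated 4-layer / 256-dimensional configuration; input dimensions, output dimensions, and training details are assumptions.

```python
import torch.nn as nn

def probe(in_dim, out_dim, hidden=256, n_layers=4):
    """4-layer fully connected probe with 256-dim hidden layers: used as a speaker
    classifier on Z_hat or s (out_dim = number of speakers, cross-entropy loss),
    or as a pitch predictor on Z_hat (out_dim = 1, MSE loss)."""
    layers, dim = [], in_dim
    for _ in range(n_layers - 1):
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)
```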
4.2.2 Content preservation and F0 variation consistency
To evaluate whether the converted speech preserves the linguistic content and intonation variation of the source speech, we measure the CER/WER of the converted speech and compute the Pearson correlation coefficient (PCC) (《Pearson correlation coefficient》) between the F0 of the source and converted speech. PCC ranges from $-1$ to $1$ and effectively measures the correlation between two variables; a higher F0-PCC indicates higher consistency between the F0 variation of the converted speech and that of the source speech. We randomly sample 10 test speakers as source speakers and use the remaining 10 test speakers as target speakers, forming 100 conversion pairs, and all source utterances are converted. The results of the different methods are shown in Table 4, where the results of the source speech are also reported as the performance upper bound. Among all methods, VQMIVC achieves the lowest CER and WER, showing that the proposed VQMIVC method is robust in preserving the content of the source speech. Moreover, we observe that without MI (w/o MI), ASR performance degrades significantly, because the converted speech is contaminated by the undesired content information entangled in the speaker representation. In addition, by providing the source pitch representation, the intonation variation of the converted speech can be effectively controlled, achieving high F0 consistency: the proposed method obtains the highest F0-PCC of 0.781.
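The F0-PCC metric can be computed as in the sketch below, assuming the two F0 contours are frame-aligned and comparing only frames voiced in both; the exact alignment and voicing handling used in the paper are not specified.

```python
import numpy as np
from scipy.stats import pearsonr

def f0_pcc(f0_src, f0_conv):
    """Pearson correlation between source and converted F0 contours,
    comparing only frames voiced in both (alignment assumed)."""
    voiced = (f0_src > 0) & (f0_conv > 0)
    r, _ = pearsonr(f0_src[voiced], f0_conv[voiced])
    return r
```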
4.2.3 Speech naturalness and speaker similarity
Subjective tests are conducted with 15 participants, who rate speech naturalness and speaker similarity on a 5-point mean opinion score (MOS) scale: 1 - bad, 2 - poor, 3 - fair, 4 - good, 5 - excellent. We randomly select two source speakers and two target speakers from the test speakers, with each source or target set containing one male and one female, forming 4 conversion pairs; 18 converted utterances per pair are evaluated by each participant. Figure 3 reports the average scores over all pairs. The source (Oracle) and target (Oracle) utterances are synthesized by Parallel WaveGAN from the ground-truth spectrograms of the source and target speech, respectively. We observe that the proposed method without MI (denoted w/o MI) outperforms AutoVC and VQVC+ but is inferior to AdaIN-VC. In our listening tests, pronunciation errors are frequently detected in the speech converted by w/o MI, which is consistent with its higher CER/WER in Table 4. These problems are greatly alleviated by MI minimization, which improves both speech naturalness and speaker similarity. This shows that MI minimization benefits proper SRD, yielding accurate content representations and effective speaker representations, and thus generates natural speech with high similarity to the target speaker.
5 Conclusion
We propose an unsupervised SRD-based one-shot VC method that combines VQCPC and MI. To achieve proper separation of the content, speaker, and pitch representations, the method trains the VC model not only to minimize the reconstruction loss, but also with the VQCPC loss, which exploits the local structure of speech to obtain the content representation, and the MI loss, which reduces the correlation between the different speech representations. Experiments verify the effectiveness of the proposed method: by learning accurate content representations that preserve the source linguistic content, speaker representations that capture the desired speaker characteristics, and pitch representations that preserve the source intonation variation, the information leakage problem is alleviated and high-quality converted speech is produced.
Wang D, Deng L, Yeung Y T, et al. VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion[J]. arXiv preprint arXiv:2106.10132, 2021.