[Paper Study] VQMIVC
2022-06-25 08:11:00 【FallenDarkStar】
《VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion》 — Paper Study
Abstract
One-shot voice conversion (VC) between arbitrary speakers can be effectively achieved through speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes content information to leak into the speaker representation and degrades VC performance. To alleviate this problem, we employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as a correlation metric during training; by reducing the interdependence between the content, speaker, and pitch representations in an unsupervised manner, proper disentanglement is achieved. Experimental results show that the proposed method is effective in learning disentangled speech representations: it retains the content and intonation of the source speech while capturing the characteristics of the target speaker. Compared with state-of-the-art one-shot VC systems, the proposed approach achieves higher speech naturalness and speaker similarity.
Keywords: vector quantization, mutual information, unsupervised disentanglement, one-shot, voice conversion
1 Introduction
Voice conversion (VC) is a technique for modifying the paralinguistic factors of speech so that an utterance from the source speaker sounds as if it were produced by the target speaker. Paralinguistic factors include speaker identity (《An overview of voice conversion systems, Speech Communication》), prosody (《Transformation of speaker characteristics for voice conversion》), accent (《Non-native speech conversion with consistencyaware recursive network and generative adversarial network》), and so on. In this paper, we focus on the one-shot scenario, i.e., converting speaker identity between arbitrary speakers when only a single utterance of the target speaker is given as reference (《Voice conversion across arbitrary speakers based on a single target-speaker utterance》, 《Oneshot voice conversion with global speaker embeddings》).
Previous work based on speech representation disentanglement (SRD) (《Autovc: Zero-shot voice style transfer with only autoencoder loss》, 《One-shot voice conversion by separating speaker and content representations with instance normalization》, 《Vqvc+: One-shot voice conversion by vector quantization and u-net architecture》) attempts to solve one-shot VC by decomposing speech into speaker and content representations, and then converts speaker identity by replacing the source-speaker representation with that of the target speaker. However, the degree of SRD is hard to measure. Moreover, previous methods impose no correlation constraint between the speaker and content representations, which allows content information to leak into the speaker representation and degrades VC performance. To alleviate these problems, this paper proposes a vector-quantization- and mutual-information-based VC (VQMIVC) method: mutual information (MI) measures the dependencies between different representations and can be integrated effectively into training to achieve SRD in an unsupervised manner. Concretely, we first decompose speech into three factors, content, speaker, and pitch, and then propose a VC system consisting of four parts: (1) a content encoder using vector quantization with contrastive predictive coding (VQCPC) (《Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge》, 《vq-wav2vec: Selfsupervised learning of discrete speech representations》), which extracts frame-level content representations from acoustic features; (2) a speaker encoder, which takes acoustic features and produces a single fixed-dimensional vector as the speaker representation; (3) a pitch extractor, which computes the utterance-level normalized fundamental frequency (F0) as the pitch representation; and (4) a decoder that maps the content, speaker, and pitch representations back to acoustic features. During training, the VC system is optimized by minimizing the VQCPC, reconstruction, and MI losses: VQCPC exploits the local structure of speech, while MI reduces the interdependence between the different speech representations. During inference, one-shot VC is achieved by replacing the source speaker representation with a target speaker representation extracted from a single target utterance. The main contribution of this work is the combination of VQCPC and MI for SRD, which requires no supervision such as text transcriptions or speaker labels. We conduct extensive experiments and analyze the importance of MI in depth, showing that enhanced SRD significantly alleviates the information leakage problem.
2 Related work
The performance of VC training depends heavily on the availability of speech data from the target speaker (《Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory》, 《Voice conversion using partial least squares regression》, 《Inca algorithm for training voice conversion systems from nonparallel corpora》, 《Phonetic posteriorgrams for many-to-one voice conversion without parallel data training》, 《Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks》, 《Parallel-data-free voice conversion using cycle-consistent adversarial networks》, 《Stargan-vc: Non-parallel many-to-many voice conversion using star generative adversarial networks》). Therefore, the challenge of one-shot VC is to convert between arbitrary speakers that may be unseen during training, with only a single target-speaker utterance available as reference. Previous one-shot VC methods are based on SRD and aim to separate speaker information from content as far as possible. Related techniques include adjustable information-constraining bottlenecks (《Autovc: Zero-shot voice style transfer with only autoencoder loss》, 《F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder》, 《Unsupervised speech decomposition via triple information bottleneck》), instance normalization (《One-shot voice conversion by separating speaker and content representations with instance normalization》, 《Again-vc: A one-shot voice conversion using activation guidance and adaptive instance normalization》), and vector quantization (VQ) (《Vqvc+: One-shot voice conversion by vector quantization and u-net architecture》, 《Unsupervised speech representation learning using wavenet autoencoders》). We use VQCPC (《Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge》, 《vq-wav2vec: Selfsupervised learning of discrete speech representations》), an improved form of VQ, to extract accurate content representations. Without explicit constraints between the different speech representations, information leakage easily occurs and degrades VC performance. Inspired by information theory (《Mutual information analysis》), we adopt MI as a regularizer to constrain the correlation between variables. Computing MI is challenging for variables with unknown distributions, so SRD-based speech tasks (《Learning speaker representations with mutual information》, 《Intra-class variation reduction of speaker representation in disentanglement framework》, 《Unsupervised style and content separation by minimizing mutual information for speech synthesis》) have explored various methods for estimating lower bounds of MI (《Estimating divergence functionals and the likelihood ratio by convex risk minimization》, 《Noise-contrastive estimation: A new estimation principle for unnormalized statistical models》, 《Mutual information neural estimation》). To ensure that the MI value actually decreases, we propose to use the variational contrastive log-ratio upper bound (vCLUB) (《Club: A contrastive log-ratio upper bound of mutual information》).
A recent study (《Improving zero-shot voice style transfer via disentangled representation learning》) also employs MI for VC, using speaker labels as supervision for learning speaker representations. The proposed method differs from (《Improving zero-shot voice style transfer via disentangled representation learning》) in that it combines VQCPC and MI for fully unsupervised training, and incorporates a pitch representation to preserve the intonation variation of the source speech.
3 Proposed method
This section first introduces the system architecture of the VQMIVC method, then elaborates on how MI minimization is integrated into the training process, and finally describes how one-shot VC is performed.
3.1 VQMIVC System architecture
As shown in Figure 1, the proposed VQMIVC system consists of four modules: a content encoder, a speaker encoder, a pitch extractor, and a decoder. The first three modules extract the content, speaker, and pitch representations from the input speech, respectively; the fourth module, the decoder, maps these representations back to acoustic features. Suppose there are $K$ utterances. We use the Mel spectrogram as the acoustic feature and randomly select $T$ frames from each utterance for training. The $k^{th}$ Mel spectrogram is denoted as $X_k=\{x_{k,1},x_{k,2},...,x_{k,T}\}$.
Content encoder $\theta_c$:
The content encoder aims to extract linguistic content information from $X_k$ using VQCPC. As shown in Figure 2, it consists of two networks, h-net: $X_k \to Z_k$ and g-net: $\hat{Z}_k \to R_k$, together with a VQ operation $q$: $Z_k \to \hat{Z}_k$. h-net derives a dense feature sequence $Z_k=\{z_{k,1},z_{k,2},...,z_{k,T/2}\}$ from $X_k$, reducing the length from $T$ to $T/2$. The quantizer $q$ then discretizes $Z_k$ into $\hat{Z}_k=\{\hat{z}_{k,1},\hat{z}_{k,2},...,\hat{z}_{k,T/2}\}$ using a trainable codebook $B$, where $\hat{z}_{k,t} \in B$ is the codebook vector closest to $z_{k,t}$. VQ imposes an information bottleneck on $Z_k$ that removes unnecessary details, so that $\hat{Z}_k$ is associated with the underlying linguistic information. The content encoder $\theta_c$ is then trained by minimizing the VQ loss (《Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge》):
$$L_{VQ}=\frac{2}{KT}\sum_{k=1}^{K}\sum_{t=1}^{T/2}\|z_{k,t}-sg(\hat{z}_{k,t})\|_2^2 \tag{1}$$
where $sg(\cdot)$ is the stop-gradient operator. To further encourage $\hat{Z}_k$ to capture local structure, an RNN-based g-net operating on $\hat{Z}_k$ produces aggregations $R_k=\{r_{k,1},r_{k,2},...,r_{k,T/2}\}$, following contrastive predictive coding (CPC). Given $r_{k,t}$, the model is trained by minimizing the InfoNCE loss (《Representation learning with contrastive predictive coding》) to distinguish the positive sample $\hat{z}_{k,t+m}$, located $m$ steps in the future, from negative samples drawn from the set $\Omega_{k,t,m}$:
$$L_{CPC}=-\frac{1}{KT'M}\sum_{k=1}^{K}\sum_{t=1}^{T'}\sum_{m=1}^{M}\log\left[\frac{\exp(\hat{z}_{k,t+m}^{T}W_m r_{k,t})}{\sum_{z\in\Omega_{k,t,m}}\exp(z^{T}W_m r_{k,t})}\right] \tag{2}$$
where $T'=T/2-M$ and $W_m\ (m=1,2,...,M)$ are trainable projection matrices. By predicting future samples with the contrastive loss (2), local features spanning several time steps (such as phonemes) are encoded into $\hat{Z}_k=f(X_k;\theta_c)$, the content representation used to accurately reconstruct the linguistic content. During training, the negative set $\Omega_{k,t,m}$ is formed by randomly drawing samples from the current utterance.
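As a concrete illustration, below is a minimal PyTorch sketch of the VQ loss of Eq. (1) and the InfoNCE/CPC loss of Eq. (2) for one batch. The tensor shapes, the nearest-neighbour codebook lookup, the straight-through estimator, and the way negatives are drawn from the current utterance are simplifying assumptions for readability, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def vqcpc_losses(z, codebook, r, W, n_negatives=10):
    """Sketch of the VQ loss (Eq. 1) and the InfoNCE/CPC loss (Eq. 2).

    z:        (K, T2, D)   dense features Z_k from h-net (T2 = T/2)
    codebook: (V, D)       trainable codebook B
    r:        (K, T2, Dr)  aggregations R_k from g-net
    W:        list of M trainable (Dr, D) projection matrices W_m
    """
    K, T2, D = z.shape
    M = len(W)

    # --- VQ: map each z_{k,t} to its nearest codebook entry ---
    dist = torch.cdist(z.reshape(-1, D), codebook)            # (K*T2, V)
    z_q = codebook[dist.argmin(dim=-1)].view(K, T2, D)        # quantized \hat{Z}_k
    vq_loss = F.mse_loss(z, z_q.detach())                     # ~ ||z - sg(z_hat)||^2
    z_q = z + (z_q - z).detach()                              # straight-through estimator

    # --- CPC / InfoNCE over M future steps ---
    Tp = T2 - M
    cpc_loss = 0.0
    for m in range(1, M + 1):
        pred = r[:, :Tp] @ W[m - 1]                           # W_m r_{k,t}      (K, Tp, D)
        pos = z_q[:, m:m + Tp]                                # \hat{z}_{k,t+m}  (K, Tp, D)
        pos_logit = (pred * pos).sum(-1, keepdim=True)        # (K, Tp, 1)
        # negatives drawn from the current utterance (may occasionally hit the positive)
        neg_idx = torch.randint(0, T2, (K, Tp, n_negatives), device=z.device)
        neg = torch.gather(z_q.unsqueeze(1).expand(K, Tp, T2, D), 2,
                           neg_idx.unsqueeze(-1).expand(-1, -1, -1, D))
        neg_logit = torch.einsum('ktd,ktnd->ktn', pred, neg)  # (K, Tp, N)
        logits = torch.cat([pos_logit, neg_logit], dim=-1)    # class 0 = positive
        target = torch.zeros(K * Tp, dtype=torch.long, device=z.device)
        cpc_loss = cpc_loss + F.cross_entropy(logits.reshape(K * Tp, -1), target)
    return vq_loss, cpc_loss / M, z_q
```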
Speaker encoder $\theta_s$: The speaker encoder takes $X_k$ and produces a vector $s_k=f(X_k;\theta_s)$ as the speaker representation. By capturing global speech characteristics, $s_k$ controls the speaker identity of the generated speech.
Pitch extractor: The pitch representation is expected to capture intonation variation while containing no content or speaker information, so we extract $F_0$ from the waveform and apply z-normalization independently to each utterance. In our experiments, the per-utterance normalized logarithmic $F_0$ (log-$F_0$), denoted $P_k=(p_{k,1},p_{k,2},...,p_{k,T})$, is used as the pitch representation; it is speaker-independent, which forces the speaker encoder to provide speaker-related information such as pitch range.
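A minimal sketch of the pitch extractor follows. The paper only states that log-F0 is extracted from the waveform and z-normalized per utterance; the use of the WORLD extractor (pyworld's DIO + StoneMask) and the zeroing of unvoiced frames below are illustrative assumptions.

```python
import numpy as np
import librosa
import pyworld as pw  # WORLD vocoder bindings, used here only as an example F0 extractor

def extract_pitch(wav_path, sr=16000, frame_period_ms=10.0):
    """Per-utterance z-normalized log-F0, i.e. the pitch representation P_k."""
    y, _ = librosa.load(wav_path, sr=sr)
    y = y.astype(np.float64)
    f0, t = pw.dio(y, sr, frame_period=frame_period_ms)      # coarse F0 at a 10 ms hop
    f0 = pw.stonemask(y, f0, t, sr)                           # refined F0
    voiced = f0 > 0
    log_f0 = np.zeros_like(f0)
    log_f0[voiced] = np.log(f0[voiced])
    mean, std = log_f0[voiced].mean(), log_f0[voiced].std()
    log_f0[voiced] = (log_f0[voiced] - mean) / (std + 1e-8)   # utterance-level z-norm
    return log_f0                                             # unvoiced frames stay at 0
```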
Decoder $\theta_d$: The decoder maps the content, speaker, and pitch representations back to the Mel spectrogram. $\hat{Z}_k$ is upsampled ($\times 2$) by linear interpolation and $s_k$ is repeated ($\times T$); together with $p_k$ they form the decoder input used to generate the Mel spectrogram $\hat{X}_k=\{\hat{x}_{k,1},\hat{x}_{k,2},...,\hat{x}_{k,T}\}$. The decoder is jointly trained with the content encoder and the speaker encoder by minimizing the reconstruction loss:
$$L_{REC}=\frac{1}{KT}\sum_{k=1}^{K}\sum_{t=1}^{T}\left[\|\hat{x}_t-x_t\|_1+\|\hat{x}_t-x_t\|_2\right] \tag{3}$$
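The following sketch shows one plausible way to align the three representations and compute the reconstruction loss of Eq. (3); the interpolation and broadcasting details, and the decoder interface (a single module taking the concatenated features), are assumptions based on the description above.

```python
import torch
import torch.nn.functional as F

def decode_and_reconstruct(decoder, z_q, spk, pitch, mel):
    """z_q:   (K, T/2, Dc) quantized content representation \\hat{Z}_k
       spk:   (K, Ds)      utterance-level speaker embedding s_k
       pitch: (K, T)       normalized log-F0 p_k
       mel:   (K, T, 80)   ground-truth Mel spectrogram X_k
    """
    K, T, _ = mel.shape
    # upsample the content (x2) by linear interpolation to the Mel frame rate
    z_up = F.interpolate(z_q.transpose(1, 2), size=T,
                         mode='linear', align_corners=False).transpose(1, 2)
    # repeat the speaker vector for every frame (xT)
    spk_rep = spk.unsqueeze(1).expand(-1, T, -1)
    dec_in = torch.cat([z_up, spk_rep, pitch.unsqueeze(-1)], dim=-1)
    mel_hat = decoder(dec_in)                                      # (K, T, 80)
    rec_loss = F.l1_loss(mel_hat, mel) + F.mse_loss(mel_hat, mel)  # Eq. (3): L1 + L2 terms
    return mel_hat, rec_loss
```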
3.2 Integrating MI minimization into VQMIVC training
Given random variables $u$ and $v$, MI is the Kullback-Leibler (KL) divergence between their joint distribution and the product of their marginals, $I(u,v)=D_{KL}(P(u,v)\,\|\,P(u)P(v))$. We use vCLUB (《Club: A contrastive log-ratio upper bound of mutual information》) to compute an upper bound of MI:
$$I(u,v)=\mathbb{E}_{P(u,v)}[\log Q_{\theta_{u,v}}(u|v)]-\mathbb{E}_{P(u)}\mathbb{E}_{P(v)}[\log Q_{\theta_{u,v}}(u|v)] \tag{4}$$
where $u,v\in\{\hat{Z},s,p\}$; $\hat{Z}$, $s$, and $p$ denote the content, speaker, and pitch representations, respectively; and $Q_{\theta_{u,v}}(u|v)$ is a variational approximation of the true posterior, parameterized by a network $\theta_{u,v}$. The unbiased vCLUB estimates between the different speech representations are given by:
$$\hat{I}(\hat{Z},s)=\frac{2}{K^2T}\sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{t=1}^{T/2}\left[\log Q_{\theta_{\hat{z},s}}(\hat{z}_{k,t}|s_k)-\log Q_{\theta_{\hat{z},s}}(\hat{z}_{l,t}|s_k)\right] \tag{5}$$
$$\hat{I}(p,s)=\frac{1}{K^2T}\sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{t=1}^{T}\left[\log Q_{\theta_{p,s}}(p_{k,t}|s_k)-\log Q_{\theta_{p,s}}(p_{l,t}|s_k)\right] \tag{6}$$
$$\hat{I}(\hat{Z},p)=\frac{2}{K^2T}\sum_{k=1}^{K}\sum_{l=1}^{K}\sum_{t=1}^{T/2}\left[\log Q_{\theta_{\hat{z},p}}(\hat{z}_{k,t}|\hat{p}_{k,t})-\log Q_{\theta_{\hat{z},p}}(\hat{z}_{l,t}|\hat{p}_{k,t})\right] \tag{7}$$
where $\hat{p}_{k,t}=(p_{k,2t-1}+p_{k,2t})/2$. With a good variational approximation, Equation (4) provides a reliable MI upper bound. Therefore, we can minimize (5)-(7) to reduce the correlation between the different speech representations; the total MI loss is:
$$L_{MI}=\hat{I}(\hat{Z},s)+\hat{I}(\hat{Z},p)+\hat{I}(p,s) \tag{8}$$
During training, the variational approximation networks and the VC network are optimized alternately. The variational approximation networks are trained to maximize the log-likelihood:
$$L_{u,v}=\log Q_{\theta_{u,v}}(u|v),\quad u,v\in\{\hat{Z},s,p\} \tag{9}$$
while the VC network is trained to minimize the VC loss:
$$L_{VC}=L_{VQ}+L_{CPC}+L_{REC}+\lambda_{MI}L_{MI} \tag{10}$$
where $\lambda_{MI}$ is a constant weight that controls how much the MI loss contributes to disentanglement. The overall training procedure is summarized in Algorithm 1. Note that no text transcriptions or speaker labels are used during training, so the proposed method is fully unsupervised.
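To make Eqs. (4)-(10) concrete, here is a minimal PyTorch sketch of a Gaussian variational approximation network and the vCLUB estimate for the content/speaker pair (Eq. (5)); the network architecture and the diagonal-Gaussian parameterization are assumptions. The estimates for the other two pairs (Eqs. (6)-(7)) follow the same pattern.

```python
import torch
import torch.nn as nn

class GaussianVariational(nn.Module):
    """Variational approximation Q_theta(u|v): a diagonal Gaussian over u predicted from v."""
    def __init__(self, dim_v, dim_u, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_v, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, dim_u)
        self.logvar = nn.Linear(hidden, dim_u)

    def log_prob(self, u, v):
        h = self.net(v)
        mu, logvar = self.mu(h), self.logvar(h)
        # log N(u; mu, diag(exp(logvar))), constant terms omitted (they cancel in vCLUB)
        return (-0.5 * (logvar + (u - mu) ** 2 / logvar.exp())).sum(-1)


def vclub_content_speaker(q_zs, z_q, spk):
    """Unbiased vCLUB estimate of I(Z_hat, s), cf. Eq. (5).

    z_q: (K, T2, Dc) content representations; spk: (K, Ds) speaker vectors.
    """
    K, T2, Dc = z_q.shape
    # positive term: log Q(z_hat_{k,t} | s_k)
    s_pos = spk.unsqueeze(1).expand(-1, T2, -1)
    pos = q_zs.log_prob(z_q, s_pos).mean()
    # negative term: log Q(z_hat_{l,t} | s_k) over all (k, l) utterance pairs in the batch
    z_all = z_q.unsqueeze(0).expand(K, -1, -1, -1)            # indexed [k, l, t]
    s_all = spk.view(K, 1, 1, -1).expand(-1, K, T2, -1)
    neg = q_zs.log_prob(z_all, s_all).mean()
    return pos - neg                                          # MI upper bound, to be minimized
```

In the alternating scheme, the positive log-likelihood term is what the variational networks maximize (Eq. (9)); the returned estimates for the three representation pairs are summed into $L_{MI}$ (Eq. (8)), weighted by $\lambda_{MI}$, and minimized together with the other terms of Eq. (10) with respect to the VC network.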
3.3 One-shot VC
During conversion, the content and pitch representations $\hat{Z}_{src}=f(X_{src};\theta_c)$ and $p_{src}$ are first extracted from the source speaker's utterance $X_{src}$, while the speaker representation $s_{tgt}=f(X_{tgt};\theta_s)$ is extracted from only one utterance $X_{tgt}$ of the target speaker; the decoder then generates the converted Mel spectrogram as $f(\hat{Z}_{src},s_{tgt},p_{src};\theta_d)$.
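A sketch of this conversion step, assuming the trained modules are available as `content_encoder`, `speaker_encoder`, `decoder`, and `vocoder`; all names and interfaces are placeholders.

```python
import torch

@torch.no_grad()
def one_shot_convert(content_encoder, speaker_encoder, decoder, vocoder,
                     mel_src, mel_tgt, pitch_src):
    """mel_src, mel_tgt: (1, T, 80) Mel spectrograms of the source and of a single
    target utterance; pitch_src: (1, T) normalized log-F0 of the source."""
    z_src = content_encoder(mel_src)              # content representation of the source
    s_tgt = speaker_encoder(mel_tgt)              # speaker representation from ONE target utterance
    mel_conv = decoder(z_src, s_tgt, pitch_src)   # converted Mel spectrogram
    return vocoder(mel_conv)                      # waveform, e.g. via a Parallel WaveGAN vocoder
```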
4 Experiments
4.1 Experimental setup
All experiments are conducted on the VCTK corpus (《Superseded-cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit》); its 110 English speakers are randomly split into 90 and 20 speakers for the training and test sets, respectively. The test speakers are treated as unseen speakers and used to perform one-shot VC. For acoustic feature extraction, all recordings are downsampled to 16 kHz, and 80-dimensional Mel spectrograms and F0 are computed with a 25 ms Hanning window, a 10 ms frame shift, and a 400-point fast Fourier transform.
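A sketch of this feature-extraction setup using librosa; the log compression and any further normalization are assumptions, since the paper does not specify them.

```python
import numpy as np
import librosa

def extract_mel(wav_path, sr=16000):
    """80-dim log-Mel spectrogram: 25 ms Hann window, 10 ms hop, 400-point FFT."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, win_length=400,
                                         hop_length=160, window='hann', n_mels=80)
    return librosa.power_to_db(mel, ref=np.max).T   # (T, 80); dB scaling is an assumption
```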
The VC network consists of a content encoder, a speaker encoder, and a decoder. The content encoder contains h-net, the quantizer $q$, and g-net. h-net is composed of a stride-2 convolution layer followed by 4 blocks of layer normalization, 512-dimensional linear layers, and ReLU activations. The quantizer contains a codebook of 512 learnable 64-dimensional vectors. g-net is a 256-dimensional unidirectional RNN layer. For CPC, the number of future prediction steps $M$ is 6 and the number of negative samples $|\Omega_{k,t,m}|$ is 10. The speaker encoder follows (《One-shot voice conversion by separating speaker and content representations with instance normalization》); it uses 8 ConvBank layers to encode long-term information, 12 convolution layers with an average pooling layer, and 4 linear layers to derive the 256-dimensional speaker representation. Following (《Autovc: Zero-shot voice style transfer with only autoencoder loss》), the decoder has a 1024-dimensional LSTM layer, three convolution layers, two 1024-dimensional LSTM layers, and an 80-dimensional linear layer. In addition, a 5-layer convolutional Postnet refines the predicted Mel spectrogram, and a Parallel WaveGAN vocoder (《Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram》) trained on the VCTK corpus converts it to a waveform. The VC network is trained with the Adam optimizer (《Adam: A method for stochastic optimization》); the learning rate is warmed up from 1e-6 to 1e-3 over the first 15 epochs, then halved every 100 epochs after epoch 200, for a total of 500 epochs. The batch size is 256, and 128 frames are randomly selected from each utterance in every iteration. The variational approximation networks are trained with an Adam optimizer using a learning rate of 3e-4. We compare the proposed VQMIVC method with state-of-the-art one-shot VC methods, including AutoVC (《Autovc: Zero-shot voice style transfer with only autoencoder loss》), AdaIN-VC (《One-shot voice conversion by separating speaker and content representations with instance normalization》), and VQVC+ (《Vqvc+: One-shot voice conversion by vector quantization and u-net architecture》).
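The content-encoder description above can be summarized in a rough module skeleton; the layer counts, hidden sizes, and codebook size follow the text, while the convolution kernel size, the final projection, and the RNN type (a GRU here) are assumptions.

```python
import torch
import torch.nn as nn

class HNet(nn.Module):
    """h-net: one stride-2 convolution + 4 blocks of LayerNorm/Linear(512)/ReLU,
    projected to the 64-dim codebook space (kernel size and projection assumed)."""
    def __init__(self, n_mels=80, hidden=512, code_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hidden, kernel_size=4, stride=2, padding=1)
        self.blocks = nn.Sequential(*[
            nn.Sequential(nn.LayerNorm(hidden), nn.Linear(hidden, hidden), nn.ReLU())
            for _ in range(4)])
        self.proj = nn.Linear(hidden, code_dim)

    def forward(self, mel):                                   # mel: (K, T, 80)
        x = self.conv(mel.transpose(1, 2)).transpose(1, 2)    # (K, T/2, 512)
        return self.proj(self.blocks(x))                      # (K, T/2, 64)

class ContentEncoder(nn.Module):
    """h-net + a 512 x 64 codebook + a 256-dim unidirectional RNN (GRU assumed) as g-net."""
    def __init__(self):
        super().__init__()
        self.h_net = HNet()
        self.codebook = nn.Parameter(torch.randn(512, 64))
        self.g_net = nn.GRU(64, 256, batch_first=True)
```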
4.2 Experimental results and Analysis
4.2.1 Speech representation disentanglement performance
In the VC loss (10), $\lambda_{MI}$ determines how strongly MI enforces SRD. We first vary $\lambda_{MI}$ and evaluate the degree of disentanglement between the different speech representations extracted from all test utterances by computing vCLUB, as shown in Table 1. We can see that as $\lambda_{MI}$ increases, the MI values tend to decrease, reducing the correlation between the different speech representations.
To measure the entanglement between content information and the speaker representation, we generate speech in two ways: (1) same, i.e., using the content, speaker, and pitch representations of the same utterance; and (2) mixed, i.e., using the content and pitch representations of one utterance and the speaker representation of another utterance, where the two utterances belong to the same speaker. An automatic speech recognition (ASR) system is then used to obtain the character/word error rates (CER/WER) of the generated speech. The increases in CER and WER from same to mixed are denoted $\Delta_C$ and $\Delta_W$. Since the only difference between the two generation settings is the input speaker representation, larger $\Delta_C$ and $\Delta_W$ values indicate that more content information has leaked into the speaker representation. All test speakers are used for speech generation, and the ASR system is a publicly released Jasper-based model (《Jasper: An end-to-end convolutional neural acoustic model》). The results are shown in Table 2. When MI is not used ($\lambda_{MI}=0$), the generated speech is severely contaminated by undesired content information residing in the speaker representation, as reflected by the largest $\Delta_C$ and $\Delta_W$ values. When MI is used ($\lambda_{MI}>0$), $\Delta_C$ and $\Delta_W$ drop significantly. As $\lambda_{MI}$ increases, $\Delta_C$ and $\Delta_W$ keep decreasing, showing that a larger $\lambda_{MI}$ mitigates the leakage of content information into the speaker representation to a greater extent.
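For reference, a sketch of how $\Delta_C$/$\Delta_W$ could be computed, assuming the `jiwer` package for CER/WER and pre-computed ASR transcripts; the Jasper ASR system itself is not shown.

```python
from jiwer import wer, cer   # assumes the jiwer package for CER/WER

def delta_error_rates(refs, hyp_same, hyp_mixed):
    """refs: ground-truth transcripts; hyp_same / hyp_mixed: ASR transcripts of speech
    generated under the 'same' and 'mixed' conditions. Returns (Delta_C, Delta_W)."""
    delta_c = cer(refs, hyp_mixed) - cer(refs, hyp_same)
    delta_w = wer(refs, hyp_mixed) - wer(refs, hyp_same)
    return delta_c, delta_w
```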
In addition, we design two speaker classifiers, taking $\hat{Z}$ and $s$ as input respectively, and a predictor that infers $p$ from $\hat{Z}$. Both the classifiers and the predictor are 4-layer fully connected networks with a hidden size of 256. Higher speaker classification accuracy indicates that $\hat{Z}$ or $s$ carries more speaker information, while a higher prediction loss for $p$ (mean squared error) indicates that $\hat{Z}$ contains less pitch information. The results are shown in Table 3. We observe that as $\lambda_{MI}$ increases, $\hat{Z}$ contains less speaker and pitch information, yielding lower classification accuracy and higher pitch prediction loss. For all $\lambda_{MI}$, the speaker classification accuracy on $s$ is very high, indicating that $s$ carries rich speaker information, but the accuracy on $s$ drops as $\lambda_{MI}$ increases, suggesting that an overly large $\lambda_{MI}$ can cause $s$ to lose speaker information. To ensure proper disentanglement, $\lambda_{MI}$ is set to $1e-2$ in the following experiments.
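A sketch of these probing networks, under the stated 4-layer / 256-dimensional configuration; input dimensions, output dimensions, and training details are assumptions.

```python
import torch.nn as nn

def probe(in_dim, out_dim, hidden=256, n_layers=4):
    """4-layer fully connected probe with 256-dim hidden layers: used as a speaker
    classifier on Z_hat or s (out_dim = number of speakers, cross-entropy loss),
    or as a pitch predictor on Z_hat (out_dim = 1, MSE loss)."""
    layers, dim = [], in_dim
    for _ in range(n_layers - 1):
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)
```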
4.2.2 Content preservation and F0 variation consistency
To evaluate whether the converted speech preserves the linguistic content and intonation variation of the source speech, we measure the CER/WER of the converted speech and compute the Pearson correlation coefficient (PCC) (《Pearson correlation coefficient》) between the F0 of the source and converted speech. PCC ranges from $-1$ to $1$ and effectively measures the correlation between two variables; a higher F0-PCC indicates higher consistency between the F0 variation of the converted speech and that of the source speech. We randomly sample 10 test speakers as source speakers and use the remaining 10 test speakers as target speakers, forming 100 conversion pairs, and all source utterances are converted. The results of the different methods are shown in Table 4, where the results of the source speech are also reported as the performance upper bound. Among all methods, VQMIVC achieves the lowest CER and WER, showing that the proposed VQMIVC method is robust in preserving the content of the source speech. Moreover, we observe that without MI (w/o MI), ASR performance degrades significantly, because the converted speech is contaminated by the undesired content information entangled in the speaker representation. In addition, by providing the source pitch representation, the intonation variation of the converted speech can be effectively controlled, achieving high F0 consistency: the proposed method obtains the highest F0-PCC of 0.781.
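The F0-PCC metric can be computed as in the sketch below, assuming the two F0 contours are frame-aligned and comparing only frames voiced in both; the exact alignment and voicing handling used in the paper are not specified.

```python
import numpy as np
from scipy.stats import pearsonr

def f0_pcc(f0_src, f0_conv):
    """Pearson correlation between source and converted F0 contours,
    comparing only frames voiced in both (alignment assumed)."""
    voiced = (f0_src > 0) & (f0_conv > 0)
    r, _ = pearsonr(f0_src[voiced], f0_conv[voiced])
    return r
```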
4.2.3 Speech naturalness and speaker similarity
Subjective tests are conducted with 15 participants, who rate speech naturalness and speaker similarity on a 5-point mean opinion score (MOS) scale: 1 - bad, 2 - poor, 3 - fair, 4 - good, 5 - excellent. We randomly select two source speakers and two target speakers from the test speakers, with each source or target set containing one male and one female, forming 4 conversion pairs; 18 converted utterances per pair are evaluated by each participant. Figure 3 reports the average scores over all pairs. The source (Oracle) and target (Oracle) utterances are synthesized by Parallel WaveGAN from the ground-truth spectrograms of the source and target speech, respectively. We observe that the proposed method without MI (denoted w/o MI) outperforms AutoVC and VQVC+ but is inferior to AdaIN-VC. In our listening tests, pronunciation errors are frequently detected in the speech converted by w/o MI, which is consistent with its higher CER/WER in Table 4. These problems are greatly alleviated by MI minimization, which improves both speech naturalness and speaker similarity. This shows that MI minimization benefits proper SRD, yielding accurate content representations and effective speaker representations, and thus generates natural speech with high similarity to the target speaker.
5 Conclusion
We propose an unsupervised SRD-based one-shot VC method that combines VQCPC and MI. To achieve proper separation of the content, speaker, and pitch representations, the method trains the VC model not only to minimize the reconstruction loss, but also with the VQCPC loss, which exploits the local structure of speech to obtain the content representation, and the MI loss, which reduces the correlation between the different speech representations. Experiments verify the effectiveness of the proposed method: by learning accurate content representations that preserve the source linguistic content, speaker representations that capture the desired speaker characteristics, and pitch representations that preserve the source intonation variation, the information leakage problem is alleviated and high-quality converted speech is produced.
Wang D, Deng L, Yeung Y T, et al. VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion[J]. arXiv preprint arXiv:2106.10132, 2021.