VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech

----------------------------> Model Architecture <-----------------------

Fig.1 The training phase of the proposed VAW-GAN-based emotional VC framework with WORLD vocoder. Blue boxes are involved in the training, while white boxes are not.

Fig.2 The run-time conversion phase of the proposed VAW-GAN-based emotional VC framework. Green boxes represent the networks that are already trained.

VAW-GAN (SP+CWT): Converts the spectrum and the CWT-based F0 with VAW-GAN, with no F0 conditioning on the decoder;

VAW-GAN (SP+F0+C): Converts the spectrum with VAW-GAN conditioned on F0 without CWT decomposition, where F0 is converted with an LG-based linear transformation;

VAW-GAN (SP+CWT+C) (Proposed): Converts the spectrum with VAW-GAN conditioned on CWT-based F0, where F0 is itself converted with VAW-GAN after CWT decomposition.
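
The LG-based linear transformation used for F0 in the VAW-GAN (SP+F0+C) baseline can be illustrated with a short sketch. It assumes that "LG" refers to the widely used logarithm Gaussian normalized transformation, in which the source log-F0 is shifted and scaled to match the mean and standard deviation of the target emotion's log-F0. The function names and the toy F0 values below are hypothetical and are not taken from the authors' code or data.

```python
import math
from statistics import mean, pstdev


def log_f0_stats(f0_frames):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    log_f0 = [math.log(f) for f in f0_frames if f > 0]
    return mean(log_f0), pstdev(log_f0)


def lg_convert_f0(f0_source, src_stats, tgt_stats):
    """Map a source F0 contour onto the target emotion's log-F0 statistics.

    Unvoiced frames (F0 == 0) are passed through unchanged.
    """
    mu_src, sigma_src = src_stats
    mu_tgt, sigma_tgt = tgt_stats
    converted = []
    for f in f0_source:
        if f > 0:
            # Normalize in the log domain, then rescale to the target statistics.
            log_f = (math.log(f) - mu_src) / sigma_src * sigma_tgt + mu_tgt
            converted.append(math.exp(log_f))
        else:
            converted.append(0.0)
    return converted


# Toy contours in Hz (made-up values, not taken from the samples on this page).
neutral_f0 = [0.0, 110.0, 120.0, 130.0, 0.0, 125.0]
angry_f0 = [0.0, 180.0, 220.0, 260.0, 240.0, 0.0]

src_stats = log_f0_stats(neutral_f0)
tgt_stats = log_f0_stats(angry_f0)
converted_f0 = lg_convert_f0(neutral_f0, src_stats, tgt_stats)
```

In practice the statistics would be estimated over a whole training set for each emotion rather than a single utterance. Because this mapping only adjusts the global mean and variance of log-F0, it cannot reshape the contour at different time scales, which is the motivation for the CWT-based systems listed above.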

-----------------------> Emotional Speech Samples <-----------------------

(1) VAW-GAN (SP+CWT) vs. VAW-GAN (SP+CWT+C)(Proposed)

	Source	CWT-VAWGAN	C-CWT-VAWGAN (Proposed)	Target
Neutral-to-Angry

Neutral-to-Sleepy

(2) VAW-GAN (SP+F0+C) vs. VAW-GAN (SP+CWT+C)(Proposed)

	Source	CWT-VAWGAN	C-CWT-VAWGAN (Proposed)	Target
Neutral-to-Angry

Neutral-to-Sleepy