-----------------------> Model Architecture <-----------------------



Fig. 1: The training phase of the proposed VAW-GAN-based emotional voice conversion (VC) framework with the WORLD vocoder. Blue boxes are involved in training; white boxes are not.


Fig. 2: The run-time conversion phase of the proposed VAW-GAN-based emotional VC framework. Green boxes represent networks that are already trained.

Systems compared:
VAW-GAN (SP+CWT): converts the spectrum and the CWT-based F0 with VAW-GAN, with no F0 conditioning on the decoder;
VAW-GAN (SP+F0+C): converts the spectrum with VAW-GAN conditioned on F0 without CWT decomposition, where F0 is converted with the LG-based linear transformation;
VAW-GAN (SP+CWT+C) (Proposed): converts the spectrum with VAW-GAN conditioned on CWT-based F0, where F0 is converted with VAW-GAN after CWT decomposition.
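
The LG-based linear transformation used by the VAW-GAN (SP+F0+C) baseline is not spelled out on this page. The sketch below assumes the commonly used logarithm Gaussian (LG) normalized form, in which log-F0 is shifted and scaled so that its mean and standard deviation match those of the target emotion. The function names, the toy F0 values, and the choice to estimate statistics from single contours are illustrative assumptions, not the authors' code; in practice the statistics would be estimated over the training data.

```python
import math
from statistics import mean, pstdev


def log_f0_stats(f0):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    log_f0 = [math.log(x) for x in f0 if x > 0]
    return mean(log_f0), pstdev(log_f0)


def convert_f0_lg(f0, src_stats, tgt_stats):
    """Assumed LG transform: normalize log-F0 with source stats, rescale with target stats."""
    mu_src, sigma_src = src_stats
    mu_tgt, sigma_tgt = tgt_stats
    converted = []
    for x in f0:
        if x <= 0:
            converted.append(0.0)  # unvoiced frames stay unvoiced
        else:
            z = (math.log(x) - mu_src) / sigma_src
            converted.append(math.exp(z * sigma_tgt + mu_tgt))
    return converted


# Toy F0 contours in Hz (made-up values; 0.0 marks an unvoiced frame).
neutral_f0 = [0.0, 110.0, 120.0, 130.0, 0.0, 125.0]
angry_f0 = [0.0, 180.0, 220.0, 260.0, 240.0, 0.0]

src_stats = log_f0_stats(neutral_f0)
tgt_stats = log_f0_stats(angry_f0)
converted_f0 = convert_f0_lg(neutral_f0, src_stats, tgt_stats)
```

Because this transform only matches global log-F0 statistics, it cannot reshape the contour at different time scales, which is the limitation the CWT-based systems are designed to address.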


-----------------------> Emotional Speech Samples <-----------------------


(1) VAW-GAN (SP+CWT) vs. VAW-GAN (SP+CWT+C)(Proposed)

Source CWT-VAWGAN C-CWT-VAWGAN (Proposed) Target
Neutral-to-Angry
Neutral-to-Sleepy


(2) VAW-GAN (SP+F0+C) vs. VAW-GAN (SP+CWT+C)(Proposed)

Source VAW-GAN (SP+F0+C) C-CWT-VAWGAN (Proposed) Target
Neutral-to-Angry
Neutral-to-Sleepy