Training Universal Vocoders with Feature Smoothing-based Augmentation Methods for
High-quality TTS Systems
Abstract
While universal vocoders have achieved proficient waveform generation across diverse voices, their integration into text-to-speech (TTS) tasks often results in degraded synthetic quality. To address this challenge, we present a novel augmentation technique for training universal vocoders. Our training scheme randomly applies linear smoothing filters to input acoustic features, facilitating vocoder generalization across a wide range of smoothing effects. It significantly mitigates the training-inference mismatch, enhancing the naturalness of the synthetic output even when the acoustic model produces overly smoothed features. Notably, our method is applicable to any vocoder without requiring architectural modifications or dependencies on specific acoustic models. The experimental results validate the superiority of our vocoder over conventional methods, achieving 11.99% and 12.05% improvements in mean opinion scores when integrated with Tacotron 2 and FastSpeech 2 TTS acoustic models, respectively.
- Last update: 26 Mar 2024
Table of Contents
- Systems
- Training Method Comparison
- Vocoder Model Comparison
Systems
Vocoder Training Methods
- ST: Universal vocoder separately trained (ST) from the acoustic models, using ground-truth acoustic features
- FT: Speaker-dependent vocoder fine-tuned (FT) on acoustic features generated by the corresponding acoustic model
- ST-SA (Proposed): Universal vocoder separately trained with the proposed smoothing augmentation (SA) method (shown in Figure 1(b); see the sketch below Figure 1)
Figure 1: Block diagram of the vocoding process in the TTS framework: (a) ST and (b) ST-SA (Proposed).
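As a rough illustration of the ST-SA scheme, the snippet below applies a randomly sized linear (moving-average) smoothing filter along the time axis of a ground-truth mel-spectrogram before it conditions the vocoder, while the target waveform is left untouched. The filter family, lengths, and application probability shown here are illustrative assumptions, not the exact configuration used in the paper.

```python
import numpy as np

def smoothing_augmentation(mel, p_apply=0.5, max_filter_len=5, rng=None):
    """Randomly smooth ground-truth acoustic features for vocoder training.

    mel: (n_mels, n_frames) mel-spectrogram conditioning features.
    p_apply / max_filter_len: illustrative hyperparameters (not from the paper).
    """
    rng = rng or np.random.default_rng()
    if rng.random() > p_apply:
        return mel  # keep some examples unsmoothed so sharp inputs are still seen
    # Draw a random length for a linear (moving-average) smoothing filter.
    k = int(rng.integers(2, max_filter_len + 1))
    kernel = np.ones(k) / k
    # Smooth each mel channel along the time axis; 'same' keeps the frame count.
    return np.stack([np.convolve(row, kernel, mode="same") for row in mel])

# During training, only the conditioning features are smoothed; the target
# waveform stays intact, so the vocoder learns to generate sharp audio even
# when the acoustic model later produces over-smoothed features at inference.
```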
Vocoder Models
- HiFi-GAN V1 [1]
- UnivNet-c32 [2]
- eUnivNet (Proposed): UnivNet-c16 + harmonic-noise generator (HN-G) + MS-STFT/CoMB discriminators (M/C-D) (shown in Figure 2(b))
- eUnivNet-HN-G: eUnivNet only with HN-G (without M/C-D)
- eUnivNet-M/C-D: eUnivNet only with M/C-D (without HN-G)
Figure 2: The UnivNet architectures: (a) the vanilla UnivNet-c32 model and (b) the proposed eUnivNet model. The notations c and k denote the number of channels and the kernel size of the convolution layer, respectively.
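This page does not spell out the internals of the harmonic-noise generator, so the sketch below is only a loose illustration of the general harmonic-plus-noise idea: a sinusoidal excitation at multiples of F0 in voiced frames and noise in unvoiced frames, which a generator can consume alongside the mel-spectrogram. The function name, constants, and voiced/unvoiced handling are hypothetical, not taken from eUnivNet.

```python
import numpy as np

def harmonic_noise_excitation(f0, hop_length=256, sample_rate=24000,
                              n_harmonics=8, noise_std=0.03):
    """Toy harmonic-plus-noise excitation from a frame-level F0 contour.

    f0: (n_frames,) fundamental frequency in Hz; 0.0 marks unvoiced frames.
    All constants are illustrative, not values used by eUnivNet.
    """
    # Upsample frame-level F0 to sample level by repetition.
    f0_up = np.repeat(np.asarray(f0, dtype=np.float64), hop_length)
    voiced = f0_up > 0.0
    # Integrate instantaneous frequency to obtain the fundamental phase.
    phase = 2.0 * np.pi * np.cumsum(f0_up / sample_rate)
    harmonic = np.zeros_like(f0_up)
    for h in range(1, n_harmonics + 1):
        harmonic += np.sin(h * phase) / n_harmonics
    noise = np.random.randn(f0_up.size)
    # Harmonics plus weak noise in voiced regions, noise only elsewhere.
    excitation = np.where(voiced, harmonic + noise_std * noise, 0.3 * noise)
    return excitation.astype(np.float32)
```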
Training Method Comparison
Tacotron 2 + eUnivNet
| Seen Speakers | Recording | ST | FT | ST-SA (Proposed) |
|---|---|---|---|---|
| F1 | | | | |
| F2 | | | | |
| M1 | | | | |
| M2 | | | | |
Tacotron 2 + HiFi-GAN V1
| Seen Speakers | Recording | ST (ground-truth features) | ST | ST-SA (Proposed) |
|---|---|---|---|---|
| F1 | | | | |
| F2 | | | | |
| M1 | | | | |
| M2 | | | | |
FastSpeech 2 + eUnivNet
| Seen Speakers | Recording | ST | ST-SA (Proposed) |
|---|---|---|---|
| F1 | | | |
| F2 | | | |
| M1 | | | |
| M2 | | | |
Unseen Speakers (Tacotron 2 + eUnivNet)
| Unseen Speakers | Recording | ST | ST-SA (Proposed) |
|---|---|---|---|
| F3 | | | |
| M3 | | | |
Vocoder Model Comparison
Tacotron 2 + ST-SA vocoders
| Speaker | Recording | UnivNet-c32 | eUnivNet (Proposed) | eUnivNet-HN-G | eUnivNet-M/C-D |
|---|---|---|---|---|---|
| F1 | | | | | |
| F2 | | | | | |
| M1 | | | | | |
| M2 | | | | | |
[1] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, 2020, pp. 17022–17033.
(We trained the model for 1M steps using the official implementation.)
[2] W. Jang, D. Lim, J. Yoon, B. Kim, and J. Kim, “UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation,” in Proc. INTERSPEECH, 2021, pp. 2207–2211.
(We trained the model for 1M steps using an open-source implementation.)