Audio Sample
The main processes of TTS include:
Convert the original text into characters/phonemes through the text frontend module.
Convert characters/phonemes into acoustic features, such as linear spectrogram, mel spectrogram, and LPC features, through acoustic models.
Convert acoustic features into waveforms through vocoders.
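As a rough illustration of how the three stages hand data to each other, here is a minimal sketch in plain Python. Every function name, parameter, and numeric value below (`text_frontend`, `acoustic_model`, `vocoder`, `n_mels`, `frames_per_token`, `hop_length`) is a hypothetical placeholder chosen for this example. None of it is Parakeet's API, and the stubs produce dummy numbers rather than real speech; only the shape of the data flow is meant to be informative.

```python
from typing import List


def text_frontend(text: str) -> List[str]:
    """Stage 1 (toy): turn raw text into a sequence of character tokens."""
    # Keep letters and spaces only; a real frontend would normalize text
    # and usually map it to phonemes.
    return [ch for ch in text.lower() if ch.isalpha() or ch == " "]


def acoustic_model(tokens: List[str], n_mels: int = 4,
                   frames_per_token: int = 2) -> List[List[float]]:
    """Stage 2 (toy): emit a fixed number of feature frames per token."""
    frames = []
    for tok in tokens:
        value = (ord(tok) % 32) / 32.0  # dummy value, not a real mel bin
        for _ in range(frames_per_token):
            frames.append([value] * n_mels)
    return frames


def vocoder(frames: List[List[float]], hop_length: int = 3) -> List[float]:
    """Stage 3 (toy): expand each feature frame into hop_length samples."""
    waveform = []
    for frame in frames:
        mean = sum(frame) / len(frame)
        waveform.extend([mean] * hop_length)
    return waveform


def synthesize(text: str) -> List[float]:
    """Chain the three stages: text -> tokens -> features -> waveform."""
    return vocoder(acoustic_model(text_frontend(text)))
```

In this sketch the output length is determined by the token count, the frames emitted per token, and the hop length. A real acoustic model predicts the number of frames per token (or decodes autoregressively) instead of using a constant.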
When training Tacotron2, TransformerTTS, and WaveFlow, we use the English single-speaker TTS dataset LJSpeech by default. However, when training SpeedySpeech, FastSpeech2, and ParallelWaveGAN, we use the Chinese single-speaker dataset CSMSC by default.
In the future, Parakeet will mainly use Chinese TTS datasets for default examples.
Here, we will display three types of audio samples:
Analysis/synthesis (ground-truth spectrograms + Vocoder)
TTS (Acoustic model + Vocoder)
Chinese TTS with/without text frontend (mainly tone sandhi)
Analysis/synthesis
Audio samples generated from ground-truth spectrograms with a vocoder.
LJSpeech(English)
GT | WaveFlow |
---|---|
CSMSC(Chinese)
GT (convert to 24k) | ParallelWaveGAN |
---|---|
TTS
Audio samples generated by a TTS system. Text is first transformed into a spectrogram by a text-to-spectrogram model, and the spectrogram is then converted into raw audio by a vocoder.
TransformerTTS + WaveFlow | Tacotron2 + WaveFlow |
---|---|
SpeedySpeech + ParallelWaveGAN | FastSpeech2 + ParallelWaveGAN |
---|---|
Chinese TTS with/without text frontend
We provide a complete Chinese text frontend module in Parakeet. Text normalization and G2P (grapheme-to-phoneme conversion) are the most important modules in the text frontend. We assume that the texts are already normalized, and mainly compare the G2P module here.
We use FastSpeech2 + ParallelWaveGAN here.
With Text Frontend | Without Text Frontend |
---|---|
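To make the tone sandhi mentioned above more concrete, here is a minimal sketch of the best-known Mandarin rule: when two third-tone syllables are adjacent, the first is read with a second tone (for example, 你好 "ni3 hao3" is pronounced "ni2 hao3"). The function name `third_tone_sandhi` and the pinyin-with-tone-digit input format are assumptions made for this example, not Parakeet's implementation. A real frontend also has to consider word segmentation, the special behavior of 一 and 不, and neutral tones, none of which this sketch attempts.

```python
from typing import List


def third_tone_sandhi(syllables: List[str]) -> List[str]:
    """Apply a simplified third-tone sandhi rule to pinyin syllables.

    Each syllable is pinyin followed by a tone digit, e.g. "ni3".
    A third-tone syllable directly followed by another third-tone
    syllable (in the original sequence) is rewritten with tone 2.
    """
    result = list(syllables)
    for i in range(len(syllables) - 1):
        # Look at the original tones so earlier rewrites do not hide
        # a following third tone.
        if syllables[i].endswith("3") and syllables[i + 1].endswith("3"):
            result[i] = syllables[i][:-1] + "2"
    return result


# Usage: the first syllable of "ni3 hao3" changes to tone 2.
print(third_tone_sandhi(["ni3", "hao3"]))
```

For runs of three or more third tones, the actual pronunciation depends on how the phrase groups into words, so this left-to-right rule should be read as an approximation only.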