Audio Sample

The main processes of TTS include:

  1. Convert the original text into characters/phonemes through the text frontend module.

  2. Convert characters/phonemes into acoustic features, such as linear spectrograms, mel spectrograms, LPC features, etc., through acoustic models.

  3. Convert acoustic features into waveforms through vocoders.
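The three stages above can be sketched with placeholder models. All function names and shapes below are hypothetical stand-ins for illustration, not the Parakeet API: a real acoustic model (e.g. FastSpeech2) predicts mel frames from token IDs, and a real vocoder (e.g. WaveFlow) generates a fixed number of samples per frame.

```python
import numpy as np

# Stage 1 (text frontend, toy version): map characters to integer IDs.
def text_to_ids(text, vocab="abcdefghijklmnopqrstuvwxyz "):
    return [vocab.index(ch) for ch in text.lower() if ch in vocab]

# Stage 2 (acoustic model, stub): map IDs to a mel spectrogram
# of shape (n_frames, n_mels); here we just emit random features.
def acoustic_model(ids, n_mels=80, frames_per_token=5):
    n_frames = len(ids) * frames_per_token
    return np.random.rand(n_frames, n_mels).astype(np.float32)

# Stage 3 (vocoder, stub): map the spectrogram to a waveform,
# producing hop_length audio samples per spectrogram frame.
def vocoder(mel, hop_length=256):
    n_samples = mel.shape[0] * hop_length
    return np.random.uniform(-1.0, 1.0, size=n_samples).astype(np.float32)

ids = text_to_ids("hello world")   # 11 tokens
mel = acoustic_model(ids)          # shape (55, 80)
wav = vocoder(mel)                 # shape (14080,)
```

The point of the sketch is the data flow: text → discrete IDs → frame-level acoustic features → sample-level waveform, with each stage handled by a separate, swappable model.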

When training Tacotron2, TransformerTTS and WaveFlow, we use the English single-speaker TTS dataset LJSpeech by default. When training SpeedySpeech, FastSpeech2 and ParallelWaveGAN, we use the Chinese single-speaker dataset CSMSC by default.

In the future, Parakeet will mainly use Chinese TTS datasets for default examples.

Here, we will display three types of audio samples:

  1. Analysis/synthesis (ground-truth spectrograms + Vocoder)

  2. TTS (Acoustic model + Vocoder)

  3. Chinese TTS with/without text frontend (mainly tone sandhi)

Analysis/synthesis

Audio samples generated from ground-truth spectrograms with a vocoder.

LJSpeech (English)

Audio samples: GT | WaveFlow


CSMSC (Chinese)

Audio samples: GT (converted to 24 kHz) | ParallelWaveGAN

TTS

Audio samples generated by a full TTS system. Text is first transformed into a spectrogram by a text-to-spectrogram (acoustic) model; the spectrogram is then converted into raw audio by a vocoder.

Audio samples: TransformerTTS + WaveFlow | Tacotron2 + WaveFlow | SpeedySpeech + ParallelWaveGAN | FastSpeech2 + ParallelWaveGAN

Chinese TTS with/without text frontend

We provide a complete Chinese text frontend module in Parakeet. Text normalization and G2P (grapheme-to-phoneme conversion) are the most important modules in a text frontend. We assume the input texts are already normalized, so we mainly compare the G2P module here.
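Tone sandhi is one of the main things the G2P module gets right. As a toy illustration (not Parakeet code), two common Mandarin sandhi rules can be written over numbered-pinyin syllables; a real frontend handles many more cases (word boundaries, "yi1", the neutral tone):

```python
def apply_tone_sandhi(syllables):
    """Apply two simplified Mandarin tone sandhi rules to
    numbered-pinyin syllables (e.g. "ni3", "bu4").

    Rule 1 (third-tone sandhi): a 3rd tone before another 3rd tone
    becomes a 2nd tone, e.g. ni3 hao3 -> ni2 hao3.
    Rule 2 ("bu" sandhi): bu4 before a 4th-tone syllable becomes bu2,
    e.g. bu4 shi4 -> bu2 shi4.
    """
    out = list(syllables)
    for i in range(len(out) - 1):
        if out[i].endswith("3") and out[i + 1].endswith("3"):
            out[i] = out[i][:-1] + "2"
        if out[i] == "bu4" and out[i + 1].endswith("4"):
            out[i] = "bu2"
    return out

print(apply_tone_sandhi(["ni3", "hao3"]))  # ['ni2', 'hao3']
print(apply_tone_sandhi(["bu4", "shi4"]))  # ['bu2', 'shi4']
```

Without such rules, a naive character-to-pinyin lookup would synthesize "ni3 hao3" with two falling-rising tones, which is exactly the kind of error the "without text frontend" samples below exhibit.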

We use FastSpeech2 + ParallelWaveGAN here.

Audio samples: With text frontend | Without text frontend