Audio Sample

The main processes of TTS include:

Convert the original text into characters/phonemes through the text frontend module.
Convert characters/phonemes into acoustic features, such as linear spectrograms, mel spectrograms, or LPC features, through acoustic models.
Convert acoustic features into waveforms through vocoders.
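
The three stages above can be sketched as a chain of functions. The following is a toy illustration of the data flow only, not Parakeet's API: the function names, the symbol table, and the parameters frames_per_token, n_mels, hop_length and sample_rate are all assumptions chosen for the example, and the "models" are placeholders that produce data of the right shape without doing any real synthesis.

```python
import math


def text_frontend(text):
    """Toy frontend (hypothetical): map each known character to an integer ID."""
    symbols = "abcdefghijklmnopqrstuvwxyz ,.!?"
    symbol_to_id = {s: i for i, s in enumerate(symbols)}
    return [symbol_to_id[c] for c in text.lower() if c in symbol_to_id]


def acoustic_model(token_ids, frames_per_token=5, n_mels=80):
    """Toy acoustic model (hypothetical): emit a fixed number of frames per token.

    A real model predicts durations and spectral content; here every frame
    is just the token ID repeated n_mels times.
    """
    frames = []
    for token in token_ids:
        for _ in range(frames_per_token):
            frames.append([float(token)] * n_mels)
    return frames


def vocoder(frames, hop_length=256, sample_rate=22050):
    """Toy vocoder (hypothetical): emit hop_length waveform samples per frame."""
    waveform = []
    for frame in frames:
        freq = 100.0 + 10.0 * frame[0]  # arbitrary mapping from frame value to pitch
        for n in range(hop_length):
            waveform.append(math.sin(2 * math.pi * freq * n / sample_rate))
    return waveform


token_ids = text_frontend("Hello, world.")  # text -> characters/phonemes
mel = acoustic_model(token_ids)             # characters/phonemes -> acoustic features
wav = vocoder(mel)                          # acoustic features -> waveform
print(len(token_ids), len(mel), len(mel[0]), len(wav))
```

The point of the sketch is the shape of each intermediate result: a sequence of token IDs, then a sequence of feature frames, then a waveform whose length is the number of frames times the hop length.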

When training Tacotron2, TransformerTTS and WaveFlow, we use the English single-speaker TTS dataset LJSpeech by default. When training SpeedySpeech, FastSpeech2 and ParallelWaveGAN, we use the Chinese single-speaker dataset CSMSC by default.

In the future, Parakeet will mainly use Chinese TTS datasets for default examples.

Here, we will display three types of audio samples:

Analysis/synthesis (ground-truth spectrograms + Vocoder)
TTS (Acoustic model + Vocoder)
Chinese TTS with/without text frontend (mainly tone sandhi)

Analysis/synthesis

Audio samples generated from ground-truth spectrograms with a vocoder.

LJSpeech (English)

GT	WaveFlow

CSMSC (Chinese)

GT (converted to 24 kHz)	ParallelWaveGAN

TTS

Audio samples generated by a TTS system. Text is first transformed into a spectrogram by a text-to-spectrogram model, then the spectrogram is converted into raw audio by a vocoder.

TransformerTTS + WaveFlow	Tacotron2 + WaveFlow

SpeedySpeech + ParallelWaveGAN	FastSpeech2 + ParallelWaveGAN

Chinese TTS with/without text frontend

We provide a complete Chinese text frontend module in Parakeet. Text normalization and G2P (grapheme-to-phoneme conversion) are the most important modules in the text frontend. We assume that the texts are already normalized, and mainly compare the G2P module here.
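
To give a sense of what tone sandhi handling in a G2P step involves, here is a minimal sketch of the best-known Mandarin rule: a third tone followed by another third tone is pronounced as a second tone. This is not Parakeet's implementation. The function name and the input format (pinyin syllables ending in a tone digit) are assumptions made for the example, and a real frontend also relies on word segmentation and covers other cases, such as the sandhi of 一 and 不 and neutral tones.

```python
def apply_third_tone_sandhi(syllables):
    """Simplified third-tone sandhi (hypothetical helper).

    Each syllable is a pinyin string ending in a tone digit, e.g. "ni3".
    A third-tone syllable becomes second tone when the syllable that
    follows it in the input is also third tone.
    """
    result = []
    for i, syllable in enumerate(syllables):
        next_is_third = i + 1 < len(syllables) and syllables[i + 1].endswith("3")
        if syllable.endswith("3") and next_is_third:
            result.append(syllable[:-1] + "2")
        else:
            result.append(syllable)
    return result


# 你好: both syllables are third tone in the dictionary form.
print(apply_third_tone_sandhi(["ni3", "hao3"]))
```

Without such a step, the acoustic model receives the dictionary tones ("ni3 hao3") rather than the tones a speaker actually produces ("ni2 hao3"), which is the kind of difference the samples in this section are meant to expose.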

We use FastSpeech2 + ParallelWaveGAN here.

With Text Frontend	Without Text Frontend
