parakeet.models package

Submodules

parakeet.models.tacotron2 module

class parakeet.models.tacotron2.Tacotron2(vocab_size, n_tones=None, d_mels: int = 80, d_encoder: int = 512, encoder_conv_layers: int = 3, encoder_kernel_size: int = 5, d_prenet: int = 256, d_attention_rnn: int = 1024, d_decoder_rnn: int = 1024, attention_filters: int = 32, attention_kernel_size: int = 31, d_attention: int = 128, d_postnet: int = 512, postnet_kernel_size: int = 5, postnet_conv_layers: int = 5, reduction_factor: int = 1, p_encoder_dropout: float = 0.5, p_prenet_dropout: float = 0.5, p_attention_dropout: float = 0.1, p_decoder_dropout: float = 0.1, p_postnet_dropout: float = 0.5, d_global_condition=None, use_stop_token=False)

Bases: paddle.fluid.dygraph.layers.Layer

Tacotron2 model for end-to-end text-to-speech (E2E-TTS).

This is the spectrogram prediction network of Tacotron2, described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. It converts a sequence of characters into a sequence of mel spectrogram frames.

Parameters
vocab_size: int

Size of the phone vocabulary of the model.

n_tones: int

Vocabulary size of tones of the model. Defaults to None. If provided, the model has an extra tone embedding.

d_mels: int

Number of mel bands.

d_encoder: int

Hidden size in encoder module.

encoder_conv_layers: int

Number of conv layers in encoder.

encoder_kernel_size: int

Kernel size of conv layers in encoder.

d_prenet: int

Hidden size in decoder prenet.

d_attention_rnn: int

Hidden size of the attention RNN layer in the decoder.

d_decoder_rnn: int

Hidden size of the decoder RNN layer in the decoder.

attention_filters: int

Number of filters of the conv layer in location sensitive attention.

attention_kernel_size: int

Kernel size of the conv layer in location sensitive attention.

d_attention: int

Hidden size of the linear layer in location sensitive attention.

d_postnet: int

Hidden size of postnet.

postnet_kernel_size: int

Kernel size of the conv layer in postnet.

postnet_conv_layers: int

Number of conv layers in postnet.

reduction_factor: int

Reduction factor of Tacotron2, i.e. the number of mel frames predicted per decoder step.

p_encoder_dropout: float

Dropout probability in encoder.

p_prenet_dropout: float

Dropout probability in decoder prenet.

p_attention_dropout: float

Dropout probability in location sensitive attention.

p_decoder_dropout: float

Dropout probability in decoder.

p_postnet_dropout: float

Dropout probability in postnet.

d_global_condition: int

Feature size of global condition. Defaults to None. If provided, the model assumes a global condition that is concatenated to the encoder outputs.
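As a hedged illustration, a few of the documented defaults can be collected into a keyword dictionary and passed to the constructor. This is a minimal sketch that assumes parakeet and Paddle are installed; the vocabulary size of 70 is a placeholder, not a value taken from the library. The import is guarded so the snippet still runs where the package is missing.

```python
# Hypothetical vocabulary size; replace it with the size of your phone set.
VOCAB_SIZE = 70

# A subset of the keyword arguments, copied from the documented defaults.
tacotron2_kwargs = dict(
    vocab_size=VOCAB_SIZE,
    n_tones=None,
    d_mels=80,
    d_encoder=512,
    reduction_factor=1,
    p_encoder_dropout=0.5,
    d_global_condition=None,
    use_stop_token=False,
)

try:
    from parakeet.models.tacotron2 import Tacotron2

    model = Tacotron2(**tacotron2_kwargs)
except ImportError:
    # parakeet (or Paddle) is not available in this environment.
    model = None
```

Arguments that are left out fall back to the defaults listed in the signature.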

forward(text_inputs, text_lens, mels, output_lens=None, tones=None, global_condition=None)

Calculate forward propagation of tacotron2.

Parameters
text_inputs: Tensor [shape=(B, T_text)]

Batch of padded character id sequences.

text_lens: Tensor [shape=(B,)]

Length of each text sequence in the batch.

mels: Tensor [shape=(B, T_mel, C)]

Batch of padded mel spectrogram sequences.

output_lens: Tensor [shape=(B,)], optional

Length of each mel spectrogram in the batch. Defaults to None.

tones: Tensor [shape=(B, T_text)]

Batch of sequences of padded tone ids.

global_condition: Tensor [shape=(B, C)]

Batch of global conditions. Defaults to None. If the d_global_condition of the model is not None, this input should be provided.

use_stop_token: bool

Whether to include a binary classifier to predict the stop token. Defaults to False.

Returns
outputs: Dict[str, Tensor]

mel_output: output sequence of features (B, T_mel, C);

mel_outputs_postnet: output sequence of features after postnet (B, T_mel, C);

alignments: attention weights (B, T_mel, T_text);

stop_logits: output sequence of stop logits (B, T_mel)
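The text_inputs and text_lens arguments describe a padded batch together with the true length of each sequence. A minimal pure-Python sketch of that batching step is given below. The helper name pad_id_sequences and the padding id of 0 are assumptions made for illustration, not part of the library. The resulting lists would still have to be converted to tensors before being passed to forward.

```python
from typing import List, Tuple


def pad_id_sequences(
    sequences: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad variable-length id sequences to a common length.

    Returns the padded batch (B rows of T_text ids) and the original
    lengths (B values). pad_id=0 is an assumed padding id.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths) if lengths else 0
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# Three utterances of different lengths become one (B=3, T_text=4) batch.
text_inputs, text_lens = pad_id_sequences([[5, 9, 2], [7, 1], [3, 3, 3, 8]])
```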

classmethod from_pretrained(config, checkpoint_path)

Build a Tacotron2 model from a pretrained model.

Parameters
config: yacs.config.CfgNode

Model configuration.

checkpoint_path: Path or str

Path of the pretrained model checkpoint, without the file extension.

Returns
Tacotron2

The model built from pretrained result.

infer(text_inputs, max_decoder_steps=1000, tones=None, global_condition=None)

Generate the mel spectrogram given the sequences of character ids.

Parameters
text_inputs: Tensor [shape=(B, T_text)]

Batch of padded character id sequences.

max_decoder_steps: int, optional

Maximum number of decoder steps during synthesis. Defaults to 1000.

Returns
outputs: Dict[str, Tensor]

mel_output: output sequence of spectrogram (B, T_mel, C);

mel_outputs_postnet: output sequence of spectrogram after postnet (B, T_mel, C);

stop_logits: output sequence of stop logits (B, T_mel);

alignments: attention weights (B, T_mel, T_text). This key is only present when use_stop_token is True.
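The max_decoder_steps limit bounds the length of the generated spectrogram. The sketch below assumes that each decoder step emits reduction_factor mel frames, which is the usual meaning of a reduction factor in Tacotron-style decoders. The hop length of 256 and the sample rate of 22050 Hz are hypothetical values used only to turn frames into seconds.

```python
def max_mel_frames(max_decoder_steps: int = 1000, reduction_factor: int = 1) -> int:
    """Upper bound on T_mel, assuming reduction_factor frames per decoder step."""
    if max_decoder_steps < 0 or reduction_factor < 1:
        raise ValueError("max_decoder_steps must be >= 0 and reduction_factor >= 1")
    return max_decoder_steps * reduction_factor


def frames_to_seconds(n_frames: int, hop_length: int, sample_rate: int) -> float:
    """Duration covered by n_frames mel frames for a given hop length."""
    return n_frames * hop_length / sample_rate


# Hypothetical audio settings: hop length 256 samples at 22050 Hz.
longest = frames_to_seconds(max_mel_frames(1000, 1), hop_length=256, sample_rate=22050)
```

Under these assumptions the default limit corresponds to roughly 11.6 seconds of audio, so long inputs may need a larger max_decoder_steps.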

class parakeet.models.tacotron2.Tacotron2Loss(use_stop_token_loss=True, use_guided_attention_loss=False, sigma=0.2)

Bases: paddle.fluid.dygraph.layers.Layer

Tacotron2 Loss module

forward(mel_outputs, mel_outputs_postnet, mel_targets, attention_weights=None, slens=None, plens=None, stop_logits=None)

Calculate tacotron2 loss.

Parameters
mel_outputs: Tensor [shape=(B, T_mel, C)]

Output mel spectrogram sequence.

mel_outputs_postnet: Tensor [shape=(B, T_mel, C)]

Output mel spectrogram sequence after postnet.

mel_targets: Tensor [shape=(B, T_mel, C)]

Target mel spectrogram sequence.

attention_weights: Tensor [shape=(B, T_mel, T_enc)]

Attention weights. This should be provided when use_guided_attention_loss is True.

slens: Tensor [shape=(B,)]

Number of frames of mel spectrograms. This should be provided when use_guided_attention_loss is True.

plens: Tensor [shape=(B,)]

Number of text or phone ids of each utterance. This should be provided when use_guided_attention_loss is True.

stop_logits: Tensor [shape=(B, T_mel)]

Stop logits of each mel spectrogram frame. This should be provided when use_stop_token_loss is True.

Returns
losses: Dict[str, Tensor]

loss: the sum of the other losses;

mel_loss: MSE loss computed between mel_targets and mel_outputs;

post_mel_loss: MSE loss computed between mel_targets and mel_outputs_postnet;

guided_attn_loss: Guided attention loss for attention weights;

stop_loss: Binary cross entropy loss for stop token prediction.
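The guided attention term can be sketched in pure Python. The version below follows the commonly used formulation, in which attention mass far from the diagonal is penalized by W[t][n] = 1 - exp(-(t/T - n/N)^2 / (2 * sigma^2)). Whether the library computes exactly this, including its masking of padded positions, is an assumption. The function names are illustrative, and the example handles a single unpadded utterance, where T plays the role of slens and N the role of plens.

```python
import math
from typing import List


def attention_guide(dec_len: int, enc_len: int, sigma: float = 0.2) -> List[List[float]]:
    """Penalty matrix of shape (dec_len, enc_len), zero where t/T equals n/N."""
    guide = []
    for t in range(dec_len):
        row = []
        for n in range(enc_len):
            gap = t / dec_len - n / enc_len
            row.append(1.0 - math.exp(-(gap ** 2) / (2.0 * sigma ** 2)))
        guide.append(row)
    return guide


def guided_attention_loss(attention: List[List[float]], sigma: float = 0.2) -> float:
    """Mean of attention * penalty over all (decoder, encoder) positions."""
    dec_len, enc_len = len(attention), len(attention[0])
    guide = attention_guide(dec_len, enc_len, sigma)
    total = sum(
        attention[t][n] * guide[t][n]
        for t in range(dec_len)
        for n in range(enc_len)
    )
    return total / (dec_len * enc_len)


# A perfectly diagonal alignment versus a reversed (anti-diagonal) one.
diagonal = [[1.0 if t == n else 0.0 for n in range(4)] for t in range(4)]
anti_diagonal = [[1.0 if t == 3 - n else 0.0 for n in range(4)] for t in range(4)]
```

A diagonal alignment incurs no penalty, while the reversed alignment is penalized, which is what pushes the attention towards monotonic alignments early in training.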

parakeet.models.transformer_tts module

class parakeet.models.transformer_tts.TransformerTTS(frontend: parakeet.frontend.phonectic.Phonetics, d_encoder: int, d_decoder: int, d_mel: int, n_heads: int, d_ffn: int, encoder_layers: int, decoder_layers: int, d_prenet: int, d_postnet: int, postnet_layers: int, postnet_kernel_size: int, max_reduction_factor: int, decoder_prenet_dropout: float, dropout: float, n_tones=None)

Bases: paddle.fluid.dygraph.layers.Layer

decode(encoder_output, input, encoder_padding_mask)

encode(text, tones=None)

forward(text, mel, tones=None)

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters
*inputs: tuple

Unpacked tuple arguments.

**kwargs: dict

Unpacked dict arguments.

classmethod from_pretrained(frontend, config, checkpoint_path)

infer(input, max_length=1000, verbose=True, tones=None)

Predict log scale magnitude mel spectrogram from text input.

Parameters
input: Tensor [shape=(T,)]

Input text sequence, dtype int.

max_length: int, optional

Maximum number of decoder steps. Defaults to 1000.

verbose: bool, optional

Whether to display a progress bar. Defaults to True.

set_constants(reduction_factor, drop_n_heads)

class parakeet.models.transformer_tts.TransformerTTSLoss(stop_loss_scale)

Bases: paddle.fluid.dygraph.layers.Layer

forward(mel_output, mel_intermediate, mel_target, stop_logits, stop_probs)

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters
*inputs: tuple

Unpacked tuple arguments.

**kwargs: dict

Unpacked dict arguments.

parakeet.models.waveflow module

class parakeet.models.waveflow.ConditionalWaveFlow(upsample_factors: List[int], n_flows: int, n_layers: int, n_group: int, channels: int, n_mels: int, kernel_size: Union[int, List[int]])

Bases: paddle.fluid.dygraph.container.LayerList

ConditionalWaveFlow, an UpsampleNet with a WaveFlow model.

Parameters
upsample_factors: List[int]

Upsample factors for the upsample net.

n_flows: int

Number of flows in the WaveFlow model.

n_layers: int

Number of ResidualBlocks in each Flow.

n_group: int

Number of timesteps to fold as a group.

channels: int

Feature size of each ResidualBlock.

n_mels: int

Feature size of mel spectrogram (mel bands).

kernel_size: Union[int, List[int]]

Kernel size of the convolution layer in each ResidualBlock.
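The n_group parameter refers to folding the 1-D waveform into a 2-D arrangement before the flows are applied. A minimal sketch of the idea follows, assuming that n_group consecutive samples form one row. The helper names are illustrative, and the library's actual tensor layout (for example an additional transpose) may differ.

```python
from typing import List


def fold(audio: List[float], n_group: int) -> List[List[float]]:
    """Reshape T samples into T // n_group rows of n_group consecutive samples."""
    if n_group < 1:
        raise ValueError("n_group must be a positive integer")
    if len(audio) % n_group != 0:
        raise ValueError("audio length must be a multiple of n_group")
    return [audio[i:i + n_group] for i in range(0, len(audio), n_group)]


def unfold(folded: List[List[float]]) -> List[float]:
    """Inverse of fold: concatenate the rows back into one sequence."""
    return [sample for row in folded for sample in row]


samples = [float(i) for i in range(8)]
folded = fold(samples, n_group=4)
```

Because folding is a pure reshape, unfold(fold(x, n)) returns x unchanged, so no information is lost by the grouping.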

forward(audio, mel)

Compute the transformed random variable z (x to z) and the log of the determinant of the jacobian of the transformation from x to z.

Parameters
audio: Tensor [shape=(B, T)]

The audio.

mel: Tensor [shape=(B, C_mel, T_mel)]

The mel spectrogram.

Returns
z: Tensor [shape=(B, T)]

The transformed random variable z (x to z).

log_det_jacobian: Tensor [shape=(1,)]

The log of the determinant of the jacobian of the transformation from x to z.

classmethod from_pretrained(config, checkpoint_path)

Build a ConditionalWaveFlow model from a pretrained model.

Parameters
config: yacs.config.CfgNode

Model configuration.

checkpoint_path: Path or str

Path of the pretrained model checkpoint, without the file extension.

Returns
ConditionalWaveFlow

The model built from pretrained result.

infer(mel)

Generate raw audio given mel spectrogram.

Parameters
mel: Tensor [shape=(B, C_mel, T_mel)]

Mel spectrogram (in log-magnitude).

Returns
Tensor [shape=(B, T)]

The synthesized audio, where T <= T_mel * upsample_factors.
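Since upsample_factors is a list, the bound on the audio length is most naturally read as T_mel times the product of the factors. The sketch below makes that reading explicit. The factors [16, 16] are hypothetical and chosen only so that the product is 256.

```python
import math
from typing import Sequence


def max_audio_samples(n_mel_frames: int, upsample_factors: Sequence[int]) -> int:
    """Upper bound on the number of audio samples for n_mel_frames frames."""
    return n_mel_frames * math.prod(upsample_factors)


# Hypothetical configuration: two upsampling stages of 16x each.
bound = max_audio_samples(100, [16, 16])
```

In a typical setup the product of the factors matches the hop length used to extract the mel spectrogram, although that is an assumption about the training configuration rather than something this page states.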

predict(mel)

Generate raw audio given mel spectrogram.

Parameters
mel: np.ndarray [shape=(C_mel, T_mel)]

Mel spectrogram of an utterance (in log-magnitude).

Returns
np.ndarray [shape=(T,)]

The synthesized audio.

class parakeet.models.waveflow.WaveFlow(n_flows, n_layers, n_group, channels, mel_bands, kernel_size)

Bases: paddle.fluid.dygraph.container.LayerList

A deep reversible layer that is composed of several autoregressive flows.

Parameters
n_flows: int

Number of flows in the WaveFlow model.

n_layers: int

Number of ResidualBlocks in each Flow.

n_group: int

Number of timesteps to fold as a group.

channels: int

Feature size of each ResidualBlock.

mel_bands: int

Feature size of mel spectrogram (mel bands).

kernel_size: Union[int, List[int]]

Kernel size of the convolution layer in each ResidualBlock.

forward(x, condition)

Probability density estimation of random variable x given the condition.

Parameters
x: Tensor [shape=(batch_size, time_steps)]

The audio.

condition: Tensor [shape=(batch_size, condition_channel, time_steps)]

The local condition (mel spectrogram here).

Returns
z: Tensor [shape=(batch_size, time_steps)]

The transformed random variable.

log_det_jacobian: Tensor [shape=(1,)]

The log determinant of the jacobian of the transformation from x to z.

inverse(z, condition)

Sample from the distribution p(X).

Sampling is done by drawing a z from p(Z) and transforming it into x. Each Flow transforms z_{i-1} to z_{i} in an autoregressive manner.

Parameters
z: Tensor [shape=(batch, 1, time_steps)]

A sample of the distribution p(Z).

condition: Tensor [shape=(batch, condition_channel, time_steps)]

The local condition.

Returns
x: Tensor [shape=(batch_size, time_steps)]

The transformed sample (audio here).
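The relationship between forward (x to z, with a log-determinant) and inverse (z to x) can be illustrated with a toy 1-D affine autoregressive flow. This is a sketch under stated assumptions, not the library's implementation. The function _shift_and_log_scale is a hypothetical stand-in for the conditioned neural network and depends only on the previous sample.

```python
import math
from typing import List, Tuple


def _shift_and_log_scale(previous: float) -> Tuple[float, float]:
    """Toy stand-in for the network: shift and log-scale from the previous x."""
    return 0.5 * previous, 0.1 * math.tanh(previous)


def flow_forward(x: List[float]) -> Tuple[List[float], float]:
    """Map x to z in one pass and accumulate the log-determinant."""
    z, log_det, prev = [], 0.0, 0.0
    for x_t in x:
        mu, log_s = _shift_and_log_scale(prev)
        z.append(x_t * math.exp(log_s) + mu)
        log_det += log_s  # the jacobian is triangular with exp(log_s) on the diagonal
        prev = x_t
    return z, log_det


def flow_inverse(z: List[float]) -> List[float]:
    """Recover x from z one step at a time; each step needs the previous x."""
    x, prev = [], 0.0
    for z_t in z:
        mu, log_s = _shift_and_log_scale(prev)
        x_t = (z_t - mu) * math.exp(-log_s)
        x.append(x_t)
        prev = x_t
    return x


audio = [0.3, -0.7, 0.1, 0.9]
latent, log_det_jacobian = flow_forward(audio)
reconstructed = flow_inverse(latent)
```

The forward direction only reads already known values of x, while the inverse has to generate x sequentially, which is why sampling from an autoregressive flow is slower than density estimation.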

class parakeet.models.waveflow.WaveFlowLoss(sigma=1.0)

Bases: paddle.fluid.dygraph.layers.Layer

Criterion of a WaveFlow model.

Parameters
sigma: float

The standard deviation of the gaussian noise used in WaveFlow, by default 1.0.

forward(z, log_det_jacobian)

Compute the loss given the transformed random variable z and the log_det_jacobian of transformation from x to z.

Parameters
z: Tensor [shape=(B, T)]

The transformed random variable (x to z).

log_det_jacobian: Tensor [shape=(1,)]

The log of the determinant of the jacobian matrix of the transformation from x to z.

Returns
Tensor [shape=(1,)]

The loss.
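For a flow with a zero-mean Gaussian prior of standard deviation sigma, the usual training criterion is the negative log-likelihood obtained from the change-of-variables formula, averaged over all samples. The sketch below implements that standard expression in pure Python. Whether the library normalizes and reduces in exactly this way is an assumption, and the function name is illustrative.

```python
import math
from typing import List


def gaussian_flow_nll(
    z: List[List[float]], log_det_jacobian: float, sigma: float = 1.0
) -> float:
    """Average negative log-likelihood per sample for a batch z of shape (B, T).

    -log p(x) = -log N(z; 0, sigma^2) - log|det(dz/dx)|
    """
    n_elements = sum(len(row) for row in z)
    squared_sum = sum(value * value for row in z for value in row)
    # Constant part of the Gaussian log-density, per element.
    const = 0.5 * math.log(2.0 * math.pi) + math.log(sigma)
    data_term = squared_sum / (2.0 * sigma * sigma) - log_det_jacobian
    return data_term / n_elements + const


# With z = 0 and a zero log-determinant only the Gaussian constant remains.
baseline = gaussian_flow_nll([[0.0, 0.0], [0.0, 0.0]], log_det_jacobian=0.0)
```

A larger log-determinant lowers the loss, so the model is rewarded for mapping the data into a high-density region of the prior without collapsing the transformation.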

Module contents