parakeet.models package

Submodules

parakeet.models.tacotron2 module

class parakeet.models.tacotron2.Tacotron2(vocab_size, n_tones=None, d_mels: int = 80, d_encoder: int = 512, encoder_conv_layers: int = 3, encoder_kernel_size: int = 5, d_prenet: int = 256, d_attention_rnn: int = 1024, d_decoder_rnn: int = 1024, attention_filters: int = 32, attention_kernel_size: int = 31, d_attention: int = 128, d_postnet: int = 512, postnet_kernel_size: int = 5, postnet_conv_layers: int = 5, reduction_factor: int = 1, p_encoder_dropout: float = 0.5, p_prenet_dropout: float = 0.5, p_attention_dropout: float = 0.1, p_decoder_dropout: float = 0.1, p_postnet_dropout: float = 0.5, d_global_condition=None, use_stop_token=False)[source]

Bases: paddle.fluid.dygraph.layers.Layer

Tacotron2 model for end-to-end text-to-speech (E2E-TTS).

This is the spectrogram prediction network of Tacotron2, described in "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions". It converts a sequence of characters into a sequence of mel spectrogram frames.

Parameters

vocab_size: int: Vocabulary size of phones of the model.
n_tones: int: Vocabulary size of tones of the model. Defaults to None. If provided, the model has an extra tone embedding.
d_mels: int: Number of mel bands.
d_encoder: int: Hidden size in encoder module.
encoder_conv_layers: int: Number of conv layers in encoder.
encoder_kernel_size: int: Kernel size of conv layers in encoder.
d_prenet: int: Hidden size in decoder prenet.
d_attention_rnn: int: Attention rnn layer hidden size in decoder.
d_decoder_rnn: int: Decoder rnn layer hidden size in decoder.
attention_filters: int: Filter size of the conv layer in location sensitive attention.
attention_kernel_size: int: Kernel size of the conv layer in location sensitive attention.
d_attention: int: Hidden size of the linear layer in location sensitive attention.
d_postnet: int: Hidden size of postnet.
postnet_kernel_size: int: Kernel size of the conv layer in postnet.
postnet_conv_layers: int: Number of conv layers in postnet.
reduction_factor: int: Reduction factor of tacotron2.
p_encoder_dropout: float: Dropout probability in encoder.
p_prenet_dropout: float: Dropout probability in decoder prenet.
p_attention_dropout: float: Dropout probability in location sensitive attention.
p_decoder_dropout: float: Dropout probability in decoder.
p_postnet_dropout: float: Dropout probability in postnet.
d_global_condition: int: Feature size of global condition. Defaults to None. If provided, the model assumes a global condition that is concatenated to the encoder outputs.
use_stop_token: bool: Whether to include a binary classifier to predict the stop token. Defaults to False.

forward(text_inputs, text_lens, mels, output_lens=None, tones=None, global_condition=None)[source]

Calculate forward propagation of tacotron2.

Parameters

text_inputs: Tensor [shape=(B, T_text)]: Batch of sequences of padded character ids.
text_lens: Tensor [shape=(B,)]: Length of each text input in the batch.
mels: Tensor [shape=(B, T_mel, C)]: Batch of sequences of padded mel spectrograms.
output_lens: Tensor [shape=(B,)], optional: Length of each mel spectrogram in the batch. Defaults to None.
tones: Tensor [shape=(B, T_text)]: Batch of sequences of padded tone ids. Defaults to None.
global_condition: Tensor [shape=(B, C)]: Batch of global conditions. Defaults to None. If the d_global_condition of the model is not None, this input should be provided.

Returns

outputs: Dict[str, Tensor]

mel_output: output sequence of features (B, T_mel, C);

mel_outputs_postnet: output sequence of features after postnet (B, T_mel, C);

alignments: attention weights (B, T_mel, T_text);

stop_logits: output sequence of stop logits (B, T_mel)
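The padded inputs and length tensors expected by forward can be illustrated with a small pure-Python sketch. The helper name pad_batch and the padding id 0 are assumptions made for illustration; they are not part of the Parakeet API, and the actual padding id depends on the frontend in use.

```python
from typing import List, Tuple


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Right-pad variable-length id sequences to a common length.

    Returns the padded batch, shape (B, T_text), and the original
    lengths, shape (B,). pad_id=0 is an assumed value for illustration.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# Three utterances of different lengths become one (3, 4) batch.
text_inputs, text_lens = pad_batch([[5, 2, 9], [7, 1], [3, 3, 8, 4]])
```

The nested lists would then be converted to integer tensors before being passed as text_inputs and text_lens.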

classmethod from_pretrained(config, checkpoint_path)[source]

Build a Tacotron2 model from a pretrained model.

Parameters

config: yacs.config.CfgNode: model configs
checkpoint_path: Path or str: the path of pretrained model checkpoint, without extension name

Returns

Tacotron2: The model built from the pretrained checkpoint.

infer(text_inputs, max_decoder_steps=1000, tones=None, global_condition=None)[source]

Generate the mel spectrogram given the sequences of character ids.

Parameters

text_inputs: Tensor [shape=(B, T_text)]: Batch of sequences of padded character ids.
max_decoder_steps: int, optional: Maximum number of decoder steps during synthesis. Defaults to 1000.

Returns

outputs: Dict[str, Tensor]

mel_output: output sequence of spectrogram (B, T_mel, C);

mel_outputs_postnet: output sequence of spectrogram after postnet (B, T_mel, C);

stop_logits: output sequence of stop logits (B, T_mel). This key is only present when use_stop_token is True;

alignments: attention weights (B, T_mel, T_text).
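As a rough illustration of how stop logits can be interpreted, the sketch below converts the logits of one utterance to probabilities with a sigmoid and finds the first frame whose stop probability exceeds a threshold. The function names and the 0.5 threshold are assumptions for illustration; this is not taken from Parakeet's decoding loop.

```python
import math
from typing import List, Optional


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def first_stop_frame(stop_logits: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first frame predicted as 'stop', or None.

    stop_logits holds one logit per mel frame, shape (T_mel,).
    The threshold of 0.5 is an assumed value.
    """
    for frame, logit in enumerate(stop_logits):
        if sigmoid(logit) > threshold:
            return frame
    return None


# Logits rise as the utterance ends; frame 3 is the first with p > 0.5.
stop_frame = first_stop_frame([-4.0, -2.5, -0.3, 1.2, 3.0])
```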

class parakeet.models.tacotron2.Tacotron2Loss(use_stop_token_loss=True, use_guided_attention_loss=False, sigma=0.2)[source]

Bases: paddle.fluid.dygraph.layers.Layer

Tacotron2 Loss module

forward(mel_outputs, mel_outputs_postnet, mel_targets, attention_weights=None, slens=None, plens=None, stop_logits=None)[source]

Calculate tacotron2 loss.

Parameters

mel_outputs: Tensor [shape=(B, T_mel, C)]: Output mel spectrogram sequence.
mel_outputs_postnet: Tensor [shape=(B, T_mel, C)]: Output mel spectrogram sequence after postnet.
mel_targets: Tensor [shape=(B, T_mel, C)]: Target mel spectrogram sequence.
attention_weights: Tensor [shape=(B, T_mel, T_enc)]: Attention weights. This should be provided when use_guided_attention_loss is True.
slens: Tensor [shape=(B,)]: Number of frames of mel spectrograms. This should be provided when use_guided_attention_loss is True.
plens: Tensor [shape=(B,)]: Number of text or phone ids of each utterance. This should be provided when use_guided_attention_loss is True.
stop_logits: Tensor [shape=(B, T_mel)]: Stop logits of each mel spectrogram frame. This should be provided when use_stop_token_loss is True.

Returns

losses: Dict[str, Tensor]

loss: the sum of the other losses;

mel_loss: MSE loss computed from mel_targets and mel_outputs;

post_mel_loss: MSE loss computed from mel_targets and mel_outputs_postnet;

guided_attn_loss: Guided attention loss for attention weights;

stop_loss: Binary cross entropy loss for stop token prediction.
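The guided attention term can be sketched in pure Python for a single utterance. This follows the commonly used formulation, in which attention mass far from the diagonal is penalized by a weight 1 - exp(-(t/T - n/N)^2 / (2 * sigma^2)). Treating sigma as this width parameter, averaging over all positions, and the function names are assumptions for illustration rather than a description of Parakeet's implementation.

```python
import math
from typing import List


def guided_attention_penalty(n_frames: int, n_tokens: int, sigma: float = 0.2) -> List[List[float]]:
    """Penalty matrix of shape (T_mel, T_enc): near 0 on the diagonal, near 1 far from it."""
    return [
        [
            1.0 - math.exp(-((t / n_frames - n / n_tokens) ** 2) / (2.0 * sigma ** 2))
            for n in range(n_tokens)
        ]
        for t in range(n_frames)
    ]


def guided_attention_loss(attention: List[List[float]], sigma: float = 0.2) -> float:
    """Mean of attention weights multiplied by the penalty (assumed normalization)."""
    n_frames, n_tokens = len(attention), len(attention[0])
    penalty = guided_attention_penalty(n_frames, n_tokens, sigma)
    total = sum(
        attention[t][n] * penalty[t][n]
        for t in range(n_frames)
        for n in range(n_tokens)
    )
    return total / (n_frames * n_tokens)


# A perfectly diagonal alignment is not penalized; a reversed one is.
diagonal = [[1.0 if t == n else 0.0 for n in range(4)] for t in range(4)]
reversed_alignment = [[1.0 if t + n == 3 else 0.0 for n in range(4)] for t in range(4)]
```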

parakeet.models.transformer_tts module

class parakeet.models.transformer_tts.TransformerTTS(frontend: parakeet.frontend.phonectic.Phonetics, d_encoder: int, d_decoder: int, d_mel: int, n_heads: int, d_ffn: int, encoder_layers: int, decoder_layers: int, d_prenet: int, d_postnet: int, postnet_layers: int, postnet_kernel_size: int, max_reduction_factor: int, decoder_prenet_dropout: float, dropout: float, n_tones=None)[source]

Bases: paddle.fluid.dygraph.layers.Layer

decode(encoder_output, input, encoder_padding_mask)[source]

encode(text, tones=None)[source]

forward(text, mel, tones=None)[source]

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters

*inputs: tuple: Unpacked tuple arguments.
**kwargs: dict: Unpacked dict arguments.

classmethod from_pretrained(frontend, config, checkpoint_path)[source]

infer(input, max_length=1000, verbose=True, tones=None)[source]

Predict log scale magnitude mel spectrogram from text input.

Parameters

input: Tensor [shape=(T,)]: Input text sequence, dtype int.
max_length: int, optional: Maximum number of decoder steps. Defaults to 1000.
verbose: bool, optional: Whether to display a progress bar. Defaults to True.

set_constants(reduction_factor, drop_n_heads)[source]

class parakeet.models.transformer_tts.TransformerTTSLoss(stop_loss_scale)[source]

Bases: paddle.fluid.dygraph.layers.Layer

forward(mel_output, mel_intermediate, mel_target, stop_logits, stop_probs)[source]

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters

*inputs: tuple: Unpacked tuple arguments.
**kwargs: dict: Unpacked dict arguments.

parakeet.models.waveflow module

class parakeet.models.waveflow.ConditionalWaveFlow(upsample_factors: List[int], n_flows: int, n_layers: int, n_group: int, channels: int, n_mels: int, kernel_size: Union[int, List[int]])[source]

Bases: paddle.fluid.dygraph.container.LayerList

ConditionalWaveFlow, an UpsampleNet combined with a WaveFlow model.

Parameters

upsample_factors: List[int]: Upsample factors for the upsample net.
n_flows: int: Number of flows in the WaveFlow model.
n_layers: int: Number of ResidualBlocks in each Flow.
n_group: int: Number of timesteps to fold as a group.
channels: int: Feature size of each ResidualBlock.
n_mels: int: Feature size of mel spectrogram (mel bands).
kernel_size: Union[int, List[int]]: Kernel size of the convolution layer in each ResidualBlock.

forward(audio, mel)[source]

Compute the transformed random variable z (x to z) and the log of the determinant of the jacobian of the transformation from x to z.

Parameters

audio: Tensor [shape=(B, T)]: The audio.
mel: Tensor [shape=(B, C_mel, T_mel)]: The mel spectrogram.

Returns

z: Tensor [shape=(B, T)]: The transformed random variable z (x to z).
log_det_jacobian: Tensor [shape=(1,)]: The log of the determinant of the jacobian of the transformation from x to z.

classmethod from_pretrained(config, checkpoint_path)[source]

Build a ConditionalWaveFlow model from a pretrained model.

Parameters

config: yacs.config.CfgNode: model configs
checkpoint_path: Path or str: the path of pretrained model checkpoint, without extension name

Returns

ConditionalWaveFlow: The model built from the pretrained checkpoint.

infer(mel)[source]

Generate raw audio given mel spectrogram.

Parameters

mel: Tensor [shape=(B, C_mel, T_mel)]: Mel spectrogram (in log-magnitude).

Returns

Tensor [shape=(B, T)]: The synthesized audio, where T <= T_mel * upsample_factors.
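The length bound can be made concrete with a short calculation. The sketch reads "T_mel * upsample_factors" as T_mel times the product of the individual factors, which is an assumption about the intended meaning; the factor values [16, 16] are illustrative only.

```python
import math
from typing import List


def max_audio_length(n_mel_frames: int, upsample_factors: List[int]) -> int:
    """Upper bound on the number of audio samples for a given number of mel frames.

    Assumes each factor upsamples the time axis in turn, so the total
    upsampling rate is the product of all factors.
    """
    return n_mel_frames * math.prod(upsample_factors)


# 100 mel frames with two 16x upsampling stages: at most 100 * 256 samples.
bound = max_audio_length(100, [16, 16])
```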

predict(mel)[source]

Generate raw audio given mel spectrogram.

Parameters

mel: np.ndarray [shape=(C_mel, T_mel)]: Mel spectrogram of an utterance (in log-magnitude).

Returns

np.ndarray [shape=(T,)]: The synthesized audio.

class parakeet.models.waveflow.WaveFlow(n_flows, n_layers, n_group, channels, mel_bands, kernel_size)[source]

Bases: paddle.fluid.dygraph.container.LayerList

A deep reversible layer composed of several autoregressive flows.

Parameters

n_flows: int: Number of flows in the WaveFlow model.
n_layers: int: Number of ResidualBlocks in each Flow.
n_group: int: Number of timesteps to fold as a group.
channels: int: Feature size of each ResidualBlock.
mel_bands: int: Feature size of mel spectrogram (mel bands).
kernel_size: Union[int, List[int]]: Kernel size of the convolution layer in each ResidualBlock.
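What "folding timesteps as a group" means can be sketched without any tensor library. The layout below, where n_group consecutive samples form one column of an (n_group, T // n_group) matrix and trailing samples that do not fill a column are dropped, is an assumption for illustration; the function names are hypothetical.

```python
from typing import List


def fold(audio: List[float], n_group: int) -> List[List[float]]:
    """Fold a 1-D signal of length T into shape (n_group, T // n_group).

    Row h holds samples h, h + n_group, h + 2 * n_group, ..., so each
    column contains n_group consecutive samples. Trailing samples that
    do not fill a whole column are dropped (assumed behavior).
    """
    width = len(audio) // n_group
    trimmed = audio[: width * n_group]
    return [trimmed[h::n_group] for h in range(n_group)]


def unfold(folded: List[List[float]]) -> List[float]:
    """Inverse of fold: read the matrix column by column."""
    return [
        folded[h][w]
        for w in range(len(folded[0]))
        for h in range(len(folded))
    ]


folded = fold([0, 1, 2, 3, 4, 5, 6, 7], n_group=4)
restored = unfold(folded)
```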

forward(x, condition)[source]

Probability density estimation of random variable x given the condition.

Parameters

x: Tensor [shape=(batch_size, time_steps)]: The audio.
condition: Tensor [shape=(batch_size, condition_channel, time_steps)]: The local condition (mel spectrogram here).

Returns

z: Tensor [shape=(batch_size, time_steps)]: The transformed random variable.
log_det_jacobian: Tensor [shape=(1,)]: The log determinant of the jacobian of the transformation from x to z.

inverse(z, condition)[source]

Sample from the distribution p(X).

This is done by sampling a z from p(Z) and transforming it into x. Each Flow transforms z_{i-1} to z_{i} in an autoregressive manner.

Parameters

z: Tensor [shape=(batch, 1, time_steps)]: A sample of the distribution p(Z).
condition: Tensor [shape=(batch, condition_channel, time_steps)]: The local condition.

Returns

x: Tensor [shape=(batch_size, time_steps)]: The transformed sample (audio here).
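The relationship between forward (x to z) and inverse (z to x) can be illustrated with a toy affine autoregressive transform. The conditioner below is a hand-written stand-in, not WaveFlow's network, and all names are hypothetical. The sketch shows that the forward direction only needs the known x, while the inverse has to rebuild x one step at a time.

```python
import math
from typing import List, Tuple


def conditioner(previous: List[float]) -> Tuple[float, float]:
    """Toy stand-in for a neural network: (log_scale, shift) from past values."""
    context = sum(previous) / len(previous) if previous else 0.0
    return 0.1 * math.tanh(context), 0.5 * context


def forward_flow(x: List[float]) -> Tuple[List[float], float]:
    """Map x to z and accumulate the log-determinant of the Jacobian."""
    z, log_det = [], 0.0
    for t in range(len(x)):
        log_scale, shift = conditioner(x[:t])  # depends only on x before step t
        z.append(x[t] * math.exp(log_scale) + shift)
        log_det += log_scale  # triangular Jacobian: sum of log diagonal terms
    return z, log_det


def inverse_flow(z: List[float]) -> List[float]:
    """Map z back to x sequentially, reusing the x values recovered so far."""
    x: List[float] = []
    for t in range(len(z)):
        log_scale, shift = conditioner(x)  # x currently holds steps 0..t-1
        x.append((z[t] - shift) * math.exp(-log_scale))
    return x


x_in = [0.3, -1.2, 0.8, 0.1]
z_out, log_det = forward_flow(x_in)
x_back = inverse_flow(z_out)
```

Because each output depends only on earlier inputs, the Jacobian is triangular and its log-determinant reduces to the sum of the log scales.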

class parakeet.models.waveflow.WaveFlowLoss(sigma=1.0)[source]

Bases: paddle.fluid.dygraph.layers.Layer

Criterion of a WaveFlow model.

Parameters

sigma: float: The standard deviation of the gaussian noise used in WaveFlow, by default 1.0.

forward(z, log_det_jacobian)[source]

Compute the loss given the transformed random variable z and the log_det_jacobian of transformation from x to z.

Parameters

z: Tensor [shape=(B, T)]: The transformed random variable (x to z).
log_det_jacobian: Tensor [shape=(1,)]: The log of the determinant of the jacobian matrix of the transformation from x to z.

Returns

Tensor [shape=(1,)]: The loss.
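A flow criterion of this kind is typically the negative log-likelihood of x under the change of variables: the Gaussian log-density of z plus the log-determinant of the Jacobian. The sketch below computes it in pure Python for a flattened z. The per-sample averaging and the function name are assumptions for illustration and may differ from the actual WaveFlowLoss.

```python
import math
from typing import List


def gaussian_flow_nll(z: List[float], log_det_jacobian: float, sigma: float = 1.0) -> float:
    """Average negative log-likelihood per sample for a flow with a Gaussian prior.

    -log p(x) = sum(z^2) / (2 sigma^2) + n * (0.5 * log(2 pi) + log(sigma)) - log|det J|
    The result is divided by n, the number of samples (assumed normalization).
    """
    n = len(z)
    quadratic = sum(v * v for v in z) / (2.0 * sigma ** 2)
    constant = 0.5 * math.log(2.0 * math.pi) + math.log(sigma)
    return (quadratic - log_det_jacobian) / n + constant


# With z = 0 and an identity transform, only the Gaussian constant remains.
baseline = gaussian_flow_nll([0.0, 0.0, 0.0, 0.0], log_det_jacobian=0.0)
```

A larger log-determinant lowers the loss, while z values far from zero raise it, which is what drives the flow to map audio onto the Gaussian prior.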
