parakeet.models package
Submodules
parakeet.models.tacotron2 module
- class parakeet.models.tacotron2.Tacotron2(vocab_size, n_tones=None, d_mels: int = 80, d_encoder: int = 512, encoder_conv_layers: int = 3, encoder_kernel_size: int = 5, d_prenet: int = 256, d_attention_rnn: int = 1024, d_decoder_rnn: int = 1024, attention_filters: int = 32, attention_kernel_size: int = 31, d_attention: int = 128, d_postnet: int = 512, postnet_kernel_size: int = 5, postnet_conv_layers: int = 5, reduction_factor: int = 1, p_encoder_dropout: float = 0.5, p_prenet_dropout: float = 0.5, p_attention_dropout: float = 0.1, p_decoder_dropout: float = 0.1, p_postnet_dropout: float = 0.5, d_global_condition=None, use_stop_token=False)[source]
Bases: paddle.fluid.dygraph.layers.Layer
Tacotron2 model for end-to-end text-to-speech (E2E-TTS).
This is the spectrogram prediction network of Tacotron2, described in Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, which converts a sequence of characters into a sequence of mel spectrogram frames.
- Parameters
- vocab_size: int
Vocabulary size of the phones of the model.
- n_tones: int
Vocabulary size of tones of the model. Defaults to None. If provided, the model has an extra tone embedding.
- d_mels: int
Number of mel bands.
- d_encoder: int
Hidden size in encoder module.
- encoder_conv_layers: int
Number of conv layers in encoder.
- encoder_kernel_size: int
Kernel size of conv layers in encoder.
- d_prenet: int
Hidden size in decoder prenet.
- d_attention_rnn: int
Hidden size of the attention RNN layer in the decoder.
- d_decoder_rnn: int
Hidden size of the decoder RNN layer in the decoder.
- attention_filters: int
Number of filters of the conv layer in location sensitive attention.
- attention_kernel_size: int
Kernel size of the conv layer in location sensitive attention.
- d_attention: int
Hidden size of the linear layer in location sensitive attention.
- d_postnet: int
Hidden size of postnet.
- postnet_kernel_size: int
Kernel size of the conv layer in postnet.
- postnet_conv_layers: int
Number of conv layers in postnet.
- reduction_factor: int
Reduction factor of Tacotron2, i.e. the number of mel frames predicted per decoder step.
- p_encoder_dropout: float
Dropout probability in encoder.
- p_prenet_dropout: float
Dropout probability in decoder prenet.
- p_attention_dropout: float
Dropout probability in location sensitive attention.
- p_decoder_dropout: float
Dropout probability in decoder.
- p_postnet_dropout: float
Dropout probability in postnet.
- d_global_condition: int
Feature size of global condition. Defaults to None. If provided, the model assumes a global condition that is concatenated to the encoder outputs.
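As a rough illustration of reduction_factor: in Tacotron-style decoders the reduction factor usually means that each decoder step emits that many mel frames at once. The sketch below assumes this interpretation and uses plain Python lists instead of tensors; group_frames is a hypothetical helper, not part of parakeet.

```python
def group_frames(mel, reduction_factor):
    """Group consecutive mel frames so one decoder step covers r frames.

    mel: list of T_mel frames, each a list of C floats.
    Returns a list of T_mel // r steps, each a flat list of C * r floats.
    Assumes T_mel is divisible by reduction_factor (hypothetical helper).
    """
    if len(mel) % reduction_factor != 0:
        raise ValueError("T_mel must be divisible by reduction_factor")
    steps = []
    for start in range(0, len(mel), reduction_factor):
        chunk = mel[start:start + reduction_factor]
        # Concatenate the r frames of this step along the feature axis.
        steps.append([value for frame in chunk for value in frame])
    return steps


# 4 frames with 2 mel bands each, grouped with reduction_factor=2.
mel = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
steps = group_frames(mel, reduction_factor=2)
print(len(steps), len(steps[0]))
```

With reduction_factor=1 (the default) every step corresponds to exactly one frame, so the grouping is a no-op.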
- forward(text_inputs, text_lens, mels, output_lens=None, tones=None, global_condition=None)[source]
Calculate forward propagation of tacotron2.
- Parameters
- text_inputs: Tensor [shape=(B, T_text)]
Batch of padded sequences of character ids.
- text_lens: Tensor [shape=(B,)]
Length of each text sequence in the batch.
- mels: Tensor [shape=(B, T_mel, C)]
Batch of padded mel spectrogram sequences.
- output_lens: Tensor [shape=(B,)], optional
Length of each mel spectrogram in the batch. Defaults to None.
- tones: Tensor [shape=(B, T_text)]
Batch of sequences of padded tone ids.
- global_condition: Tensor [shape=(B, C)]
Batch of global conditions. Defaults to None. If the d_global_condition of the model is not None, this input should be provided.
- use_stop_token: bool (constructor argument)
Whether the model includes a binary classifier to predict the stop token. Defaults to False. This is set when the model is constructed, not passed to forward.
- Returns
- outputs: Dict[str, Tensor]
mel_output: output sequence of features (B, T_mel, C);
mel_outputs_postnet: output sequence of features after postnet (B, T_mel, C);
alignments: attention weights (B, T_mel, T_text);
stop_logits: output sequence of stop logits (B, T_mel)
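The padded inputs and their length vectors described above can be prepared along these lines. This is a minimal sketch with nested lists; pad_batch and the padding id 0 are assumptions for illustration, not part of parakeet.

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length id sequences to a (B, T_text) nested list.

    Returns (padded, lengths). pad_id=0 is an assumption for illustration.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# Three utterances of different lengths.
text_inputs, text_lens = pad_batch([[5, 9, 2], [7, 3], [4]])
print(text_inputs)
print(text_lens)
```

The nested lists would then be converted to tensors (for example with paddle.to_tensor) before being passed as text_inputs and text_lens.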
- classmethod from_pretrained(config, checkpoint_path)[source]
Build a Tacotron2 model from a pretrained model.
- Parameters
- config: yacs.config.CfgNode
Model configs.
- checkpoint_path: Path or str
Path of the pretrained model checkpoint, without the file extension.
- Returns
- Tacotron2
The model built from the pretrained checkpoint.
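Since checkpoint_path is given without the file extension, a path to a checkpoint file on disk needs its suffix removed first. A minimal sketch follows; the file name and the .pdparams extension are assumptions for illustration, and strip_extension is a hypothetical helper.

```python
from pathlib import PurePosixPath


def strip_extension(checkpoint_file):
    """Drop the final extension so the result can be used as checkpoint_path."""
    return str(PurePosixPath(checkpoint_file).with_suffix(""))


# Hypothetical checkpoint file name.
checkpoint_path = strip_extension("checkpoints/step-10000.pdparams")
print(checkpoint_path)
```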
- infer(text_inputs, max_decoder_steps=1000, tones=None, global_condition=None)[source]
Generate the mel spectrogram given the sequences of character ids.
- Parameters
- text_inputs: Tensor [shape=(B, T_text)]
Batch of padded sequences of character ids.
- max_decoder_steps: int, optional
Maximum number of decoder steps during synthesis. Defaults to 1000.
- Returns
- outputs: Dict[str, Tensor]
mel_output: output sequence of spectrogram (B, T_mel, C);
mel_outputs_postnet: output sequence of spectrogram after postnet (B, T_mel, C);
stop_logits: output sequence of stop logits (B, T_mel);
alignments: attention weights (B, T_mel, T_text). This key is only present when use_stop_token is True.
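One common way to use stop logits at synthesis time is to cut the output at the first frame whose stop probability passes a threshold, and to fall back to the step limit otherwise. The sketch below assumes this scheme; the 0.5 threshold and the first_stop_frame helper are hypothetical and not taken from parakeet.

```python
import math


def first_stop_frame(stop_logits, threshold=0.5, max_decoder_steps=1000):
    """Index of the first frame whose stop probability exceeds the threshold.

    stop_logits: list of floats for one utterance (one row of (B, T_mel)).
    Returns the number of inspected frames if no frame crosses the threshold.
    """
    limit = min(len(stop_logits), max_decoder_steps)
    for t in range(limit):
        prob = 1.0 / (1.0 + math.exp(-stop_logits[t]))  # sigmoid
        if prob > threshold:
            return t
    return limit


end = first_stop_frame([-4.0, -2.5, -0.3, 1.7, 3.0])
print(end)  # 3
```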
- class parakeet.models.tacotron2.Tacotron2Loss(use_stop_token_loss=True, use_guided_attention_loss=False, sigma=0.2)[source]
Bases: paddle.fluid.dygraph.layers.Layer
Tacotron2 Loss module
- forward(mel_outputs, mel_outputs_postnet, mel_targets, attention_weights=None, slens=None, plens=None, stop_logits=None)[source]
Calculate tacotron2 loss.
- Parameters
- mel_outputs: Tensor [shape=(B, T_mel, C)]
Output mel spectrogram sequence.
- mel_outputs_postnet: Tensor [shape=(B, T_mel, C)]
Output mel spectrogram sequence after postnet.
- mel_targets: Tensor [shape=(B, T_mel, C)]
Target mel spectrogram sequence.
- attention_weights: Tensor [shape=(B, T_mel, T_enc)]
Attention weights. This should be provided when use_guided_attention_loss is True.
- slens: Tensor [shape=(B,)]
Number of frames of mel spectrograms. This should be provided when use_guided_attention_loss is True.
- plens: Tensor [shape=(B, )]
Number of text or phone ids of each utterance. This should be provided when use_guided_attention_loss is True.
- stop_logits: Tensor [shape=(B, T_mel)]
Stop logits of each mel spectrogram frame. This should be provided when use_stop_token_loss is True.
- Returns
- losses: Dict[str, Tensor]
loss: the sum of the other losses;
mel_loss: MSE loss computed from mel_targets and mel_outputs;
post_mel_loss: MSE loss computed from mel_targets and mel_outputs_postnet;
guided_attn_loss: Guided attention loss for attention weights;
stop_loss: Binary cross entropy loss for stop token prediction.
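To give an idea of what a guided attention loss penalizes, the sketch below uses the commonly cited formulation W[t][n] = 1 - exp(-(n/N - t/T)^2 / (2 sigma^2)), where T is the number of mel frames (slens) and N the number of text tokens (plens). That parakeet uses exactly this form, and that sigma plays this role, is an assumption; the function names are hypothetical.

```python
import math


def guided_attention_weights(n_frames, n_tokens, sigma=0.2):
    """Penalty matrix of shape (n_frames, n_tokens).

    Entries near the diagonal are close to 0, entries far from it close to 1.
    """
    weights = []
    for t in range(n_frames):
        row = []
        for n in range(n_tokens):
            distance = n / n_tokens - t / n_frames
            row.append(1.0 - math.exp(-(distance ** 2) / (2 * sigma ** 2)))
        weights.append(row)
    return weights


def guided_attention_loss(attention, sigma=0.2):
    """Mean of attention * penalty for one utterance without padding."""
    n_frames, n_tokens = len(attention), len(attention[0])
    w = guided_attention_weights(n_frames, n_tokens, sigma)
    total = sum(attention[t][n] * w[t][n]
                for t in range(n_frames) for n in range(n_tokens))
    return total / (n_frames * n_tokens)


# A perfectly diagonal alignment is not penalized; a reversed one is.
diagonal = [[1.0 if t == n else 0.0 for n in range(4)] for t in range(4)]
reversed_alignment = [row[::-1] for row in diagonal]
print(guided_attention_loss(diagonal))
print(guided_attention_loss(reversed_alignment))
```

A smaller sigma narrows the band around the diagonal in which attention goes unpenalized.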
parakeet.models.transformer_tts module
- class parakeet.models.transformer_tts.TransformerTTS(frontend: parakeet.frontend.phonectic.Phonetics, d_encoder: int, d_decoder: int, d_mel: int, n_heads: int, d_ffn: int, encoder_layers: int, decoder_layers: int, d_prenet: int, d_postnet: int, postnet_layers: int, postnet_kernel_size: int, max_reduction_factor: int, decoder_prenet_dropout: float, dropout: float, n_tones=None)[source]
Bases: paddle.fluid.dygraph.layers.Layer
- forward(text, mel, tones=None)[source]
Defines the computation performed at every call. Should be overridden by all subclasses.
- infer(input, max_length=1000, verbose=True, tones=None)[source]
Predict log scale magnitude mel spectrogram from text input.
- Parameters
- input: Tensor [shape=(T,)]
Input text sequence, dtype int.
- max_length: int, optional
Maximum number of decoder steps. Defaults to 1000.
- verbose: bool, optional
Whether to display a progress bar. Defaults to True.
parakeet.models.waveflow module
- class parakeet.models.waveflow.ConditionalWaveFlow(upsample_factors: List[int], n_flows: int, n_layers: int, n_group: int, channels: int, n_mels: int, kernel_size: Union[int, List[int]])[source]
Bases: paddle.fluid.dygraph.container.LayerList
ConditionalWaveFlow, an UpsampleNet with a WaveFlow model.
- Parameters
- upsample_factors: List[int]
Upsample factors for the upsample net.
- n_flows: int
Number of flows in the WaveFlow model.
- n_layers: int
Number of ResidualBlocks in each Flow.
- n_group: int
Number of timesteps to fold as a group.
- channels: int
Feature size of each ResidualBlock.
- n_mels: int
Feature size of mel spectrogram (mel bands).
- kernel_size: Union[int, List[int]]
Kernel size of the convolution layer in each ResidualBlock.
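The upsample net stretches the mel spectrogram along time so that it can condition the audio sample by sample. Under the assumption that each factor multiplies the time resolution, the overall ratio between audio samples and mel frames is the product of upsample_factors. The values below are example values only, and total_upsample is a hypothetical helper.

```python
import math


def total_upsample(upsample_factors):
    """Overall time upsampling ratio, assuming the factors multiply."""
    return math.prod(upsample_factors)


upsample_factors = [16, 16]  # example values, an assumption
ratio = total_upsample(upsample_factors)
n_mel_frames = 100
print(ratio, n_mel_frames * ratio)  # 256 25600
```

In practice this ratio is typically chosen to match the hop length used when the mel spectrogram was extracted.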
- forward(audio, mel)[source]
Compute the transformed random variable z (x to z) and the log of the determinant of the jacobian of the transformation from x to z.
- Parameters
- audio: Tensor [shape=(B, T)]
The audio.
- mel: Tensor [shape=(B, C_mel, T_mel)]
The mel spectrogram.
- Returns
- z: Tensor [shape=(B, T)]
The inversely transformed random variable z (x to z).
- log_det_jacobian: Tensor [shape=(1,)]
The log of the determinant of the jacobian of the transformation from x to z.
- classmethod from_pretrained(config, checkpoint_path)[source]
Build a ConditionalWaveFlow model from a pretrained model.
- Parameters
- config: yacs.config.CfgNode
Model configs.
- checkpoint_path: Path or str
Path of the pretrained model checkpoint, without the file extension.
- Returns
- ConditionalWaveFlow
The model built from the pretrained checkpoint.
- class parakeet.models.waveflow.WaveFlow(n_flows, n_layers, n_group, channels, mel_bands, kernel_size)[source]
Bases: paddle.fluid.dygraph.container.LayerList
A deep reversible layer composed of several autoregressive flows.
- Parameters
- n_flows: int
Number of flows in the WaveFlow model.
- n_layers: int
Number of ResidualBlocks in each Flow.
- n_group: int
Number of timesteps to fold as a group.
- channels: int
Feature size of each ResidualBlock.
- mel_bands: int
Feature size of mel spectrogram (mel bands).
- kernel_size: Union[int, List[int]]
Kernel size of the convolution layer in each ResidualBlock.
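The n_group parameter controls how a 1-D waveform is folded into a 2-D matrix before the flows are applied. The sketch below assumes the layout in which adjacent samples are stacked down a column, giving a matrix of shape (n_group, T // n_group); fold is a hypothetical helper written with plain lists.

```python
def fold(samples, n_group):
    """Fold T samples into an (n_group, T // n_group) matrix.

    Adjacent samples go down a column, so row i holds samples
    i, i + n_group, i + 2 * n_group, ...
    Trailing samples that do not fill a whole group are dropped here,
    which is a simplification for this sketch.
    """
    n_columns = len(samples) // n_group
    return [[samples[col * n_group + row] for col in range(n_columns)]
            for row in range(n_group)]


folded = fold(list(range(8)), n_group=4)
print(folded)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```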
- forward(x, condition)[source]
Probability density estimation of random variable x given the condition.
- Parameters
- x: Tensor [shape=(batch_size, time_steps)]
The audio.
- condition: Tensor [shape=(batch_size, condition_channel, time_steps)]
The local condition (mel spectrogram here).
- Returns
- z: Tensor [shape=(batch_size, time_steps)]
The transformed random variable.
- log_det_jacobian: Tensor [shape=(1,)]
The log determinant of the jacobian of the transformation from x to z.
- inverse(z, condition)[source]
Sample from the distribution p(X).
This is done by sampling z from p(Z) and transforming it into x. Each Flow transforms $z_{i-1}$ to $z_{i}$ in an autoregressive manner.
- Parameters
- z: Tensor [shape=(batch, 1, time_steps)]
A sample of the distribution p(Z).
- condition: Tensor [shape=(batch, condition_channel, time_steps)]
The local condition.
- Returns
- x: Tensor [shape=(batch_size, time_steps)]
The transformed sample (audio here).
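The relationship between forward (x to z) and inverse (z to x) can be illustrated with a toy affine autoregressive transform on a short list of numbers. This is only a sketch of the general idea: toy_params is a hypothetical stand-in for the conditioning network, there is no local condition, and nothing here reflects the actual WaveFlow layers.

```python
import math


def toy_params(previous):
    """Hypothetical conditioner: shift and log-scale from earlier samples."""
    if not previous:
        return 0.0, 0.0
    mean = sum(previous) / len(previous)
    return 0.5 * mean, 0.1 * math.tanh(mean)


def forward(x):
    """x -> z, also returning the log-determinant of the Jacobian."""
    z, log_det = [], 0.0
    for i, value in enumerate(x):
        shift, log_scale = toy_params(x[:i])
        z.append(value * math.exp(log_scale) + shift)
        log_det += log_scale
    return z, log_det


def inverse(z):
    """z -> x, one element at a time, since x[i] depends on x[:i]."""
    x = []
    for value in z:
        shift, log_scale = toy_params(x)
        x.append((value - shift) * math.exp(-log_scale))
    return x


x = [0.3, -0.2, 0.5, 0.1]
z, log_det = forward(x)
x_rec = inverse(z)
max_error = max(abs(a - b) for a, b in zip(x, x_rec))
print(max_error)
```

In the forward direction all of x is known up front, so every element could be processed at once, whereas the inverse has to proceed element by element.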
- class parakeet.models.waveflow.WaveFlowLoss(sigma=1.0)[source]
Bases: paddle.fluid.dygraph.layers.Layer
Criterion of a WaveFlow model.
- Parameters
- sigma: float
The standard deviation of the Gaussian noise used in WaveFlow. Defaults to 1.0.
- forward(z, log_det_jacobian)[source]
Compute the loss given the transformed random variable z and the log_det_jacobian of transformation from x to z.
- Parameters
- z: Tensor [shape=(B, T)]
The transformed random variable (x to z).
- log_det_jacobian: Tensor [shape=(1,)]
The log of the determinant of the jacobian matrix of the transformation from x to z.
- Returns
- Tensor [shape=(1,)]
The loss.
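For a flow with a zero-mean Gaussian prior of standard deviation sigma, the usual training criterion is the negative log-likelihood of x, computed from z and the log-determinant. The sketch below averages it per sample; that parakeet normalizes in exactly this way is an assumption, and gaussian_flow_nll is a hypothetical helper.

```python
import math


def gaussian_flow_nll(z, log_det_jacobian, sigma=1.0):
    """Average negative log-likelihood per sample.

    z: flat list of transformed samples.
    log_det_jacobian: float, log-determinant of the x to z transformation.
    """
    n = len(z)
    energy = sum(v * v for v in z) / (2.0 * sigma * sigma)
    const = 0.5 * math.log(2.0 * math.pi) + math.log(sigma)
    return (energy - log_det_jacobian) / n + const


# With z = 0 and log-determinant 0, only the Gaussian constant remains.
baseline = gaussian_flow_nll([0.0, 0.0, 0.0, 0.0], log_det_jacobian=0.0)
print(baseline)
```

A larger log-determinant lowers the loss, which is what rewards the flow for mapping the data to high-density regions of the prior.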