parakeet.modules package

Submodules

parakeet.modules.attention module

class parakeet.modules.attention.LocationSensitiveAttention(d_query: int, d_key: int, d_attention: int, location_filters: int, location_kernel_size: int)[source]

Bases: paddle.fluid.dygraph.layers.Layer

Location Sensitive Attention module.

Reference: Attention-Based Models for Speech Recognition

Parameters
d_query: int

The feature size of query.

d_key: int

The feature size of key.

d_attention: int

The feature size of the attention hidden representation.

location_filters: int

Filter size of attention convolution.

location_kernel_size: int

Kernel size of attention convolution.

forward(query, processed_key, value, attention_weights_cat, mask=None)[source]

Compute context vector and attention weights.

Parameters
query: Tensor [shape=(batch_size, d_query)]

The queries.

processed_key: Tensor [shape=(batch_size, time_steps_k, d_attention)]

The keys after the linear layer.

value: Tensor [shape=(batch_size, time_steps_k, d_key)]

The values.

attention_weights_cat: Tensor [shape=(batch_size, time_steps_k, 2)]

The concatenated attention weights.

mask: Tensor, optional

The mask. Shape should be (batch_size, time_steps_k, 1). Defaults to None.

Returns
attention_context: Tensor [shape=(batch_size, d_attention)]

The context vector.

attention_weights: Tensor [shape=(batch_size, time_steps_k)]

The attention weights.
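
Examples

A minimal usage sketch following the shapes documented above; the concrete sizes and the all-zero initial attention weights are illustrative assumptions, not part of the API.

>>> import paddle
>>> from parakeet.modules.attention import LocationSensitiveAttention
>>> attn = LocationSensitiveAttention(
...     d_query=256, d_key=128, d_attention=64,
...     location_filters=32, location_kernel_size=31)
>>> batch_size, time_steps_k = 4, 20
>>> query = paddle.randn([batch_size, 256])                       # (batch_size, d_query)
>>> processed_key = paddle.randn([batch_size, time_steps_k, 64])  # (batch_size, time_steps_k, d_attention)
>>> value = paddle.randn([batch_size, time_steps_k, 128])         # (batch_size, time_steps_k, d_key)
>>> weights_cat = paddle.zeros([batch_size, time_steps_k, 2])
>>> context, weights = attn(query, processed_key, value, weights_cat)
>>> # context: (batch_size, d_attention), weights: (batch_size, time_steps_k) per the docs above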

class parakeet.modules.attention.MonoheadAttention(model_dim: int, dropout: float = 0.0, k_dim: Optional[int] = None, v_dim: Optional[int] = None)[source]

Bases: paddle.fluid.dygraph.layers.Layer

Monohead Attention module.

Parameters
model_dim: int

Feature size of the query.

dropout: float, optional

Dropout probability of scaled dot product attention and final context vector. Defaults to 0.0.

k_dim: int, optional

Feature size of the key of the scaled dot product attention. If not provided, it is set to model_dim. Defaults to None.

v_dim: int, optional

Feature size of the value of the scaled dot product attention. If not provided, it is set to model_dim. Defaults to None.

forward(q, k, v, mask)[source]

Compute context vector and attention weights.

Parameters
q: Tensor [shape=(batch_size, time_steps_q, model_dim)]

The queries.

k: Tensor [shape=(batch_size, time_steps_k, model_dim)]

The keys.

v: Tensor [shape=(batch_size, time_steps_k, model_dim)]

The values.

mask: Tensor [shape=(batch_size, time_steps_q, time_steps_k)] or broadcastable shape

The mask.

Returns
out: Tensor [shape=(batch_size, time_steps_q, model_dim)]

The context vector.

attention_weights: Tensor [shape=(batch_size, time_steps_q, time_steps_k)]

The attention weights.

class parakeet.modules.attention.MultiheadAttention(model_dim: int, num_heads: int, dropout: float = 0.0, k_dim: Optional[int] = None, v_dim: Optional[int] = None)[source]

Bases: paddle.fluid.dygraph.layers.Layer

Multihead Attention module.

Parameters
model_dim: int

The feature size of query.

num_heads: int

The number of attention heads.

dropout: float, optional

Dropout probability of scaled dot product attention and final context vector. Defaults to 0.0.

k_dim: int, optional

Feature size of the key of each scaled dot product attention. If not provided, it is set to model_dim / num_heads. Defaults to None.

v_dim: int, optional

Feature size of the value of each scaled dot product attention. If not provided, it is set to model_dim / num_heads. Defaults to None.

Raises
ValueError

If model_dim is not divisible by num_heads.

forward(q, k, v, mask)[source]

Compute context vector and attention weights.

Parameters
q: Tensor [shape=(batch_size, time_steps_q, model_dim)]

The queries.

k: Tensor [shape=(batch_size, time_steps_k, model_dim)]

The keys.

v: Tensor [shape=(batch_size, time_steps_k, model_dim)]

The values.

mask: Tensor [shape=(batch_size, time_steps_q, time_steps_k)] or broadcastable shape

The mask.

Returns
out: Tensor [shape=(batch_size, time_steps_q, model_dim)]

The context vector.

attention_weights: Tensor [shape=(batch_size, time_steps_q, time_steps_k)]

The attention weights.
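
Examples

A minimal usage sketch with illustrative sizes; the all-ones mask (meaning no padded positions) is an assumption.

>>> import paddle
>>> from parakeet.modules.attention import MultiheadAttention
>>> attn = MultiheadAttention(model_dim=256, num_heads=4, dropout=0.1)
>>> q = paddle.randn([8, 32, 256])   # (batch_size, time_steps_q, model_dim)
>>> k = paddle.randn([8, 24, 256])   # (batch_size, time_steps_k, model_dim)
>>> v = paddle.randn([8, 24, 256])   # (batch_size, time_steps_k, model_dim)
>>> mask = paddle.ones([8, 32, 24])  # 1 for valid positions, 0 for padding
>>> out, attn_weights = attn(q, k, v, mask)
>>> # out: (8, 32, 256); attn_weights per the docs above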

parakeet.modules.attention.drop_head(x, drop_n_heads, training=True)[source]

Drop n context vectors from multiple ones.

Parameters
x: Tensor [shape=(batch_size, num_heads, time_steps, channels)]

The input, multiple context vectors.

drop_n_heads: int [0 <= drop_n_heads <= num_heads]

Number of vectors to drop.

training: bool

A flag indicating whether it is in training. If False, no dropout is applied.

Returns
Tensor

The output.

parakeet.modules.attention.scaled_dot_product_attention(q, k, v, mask=None, dropout=0.0, training=True)[source]

Scaled dot product attention with masking.

Assume that q, k, v all have the same leading dimensions (denoted as * in descriptions below). Dropout is applied to attention weights before weighted sum of values.

Parameters
q: Tensor [shape=(*, T_q, d)]

the query tensor.

k: Tensor [shape=(*, T_k, d)]

the key tensor.

v: Tensor [shape=(*, T_k, d_v)]

the value tensor.

mask: Tensor [shape=(*, T_q, T_k) or broadcastable shape], optional

the mask tensor, zeros correspond to paddings. Defaults to None.

Returns
out: Tensor [shape=(*, T_q, d_v)]

the context vector.

attn_weights: Tensor [shape=(*, T_q, T_k)]

the attention weights.
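
Examples

A minimal usage sketch; the leading dimensions (batch, heads) and the all-ones mask are illustrative assumptions.

>>> import paddle
>>> from parakeet.modules.attention import scaled_dot_product_attention
>>> q = paddle.randn([4, 6, 10, 64])    # (*, T_q, d) with * = (batch, heads)
>>> k = paddle.randn([4, 6, 12, 64])    # (*, T_k, d)
>>> v = paddle.randn([4, 6, 12, 32])    # (*, T_k, d_v)
>>> mask = paddle.ones([4, 1, 10, 12])  # broadcastable; zeros would mark padding
>>> out, attn_weights = scaled_dot_product_attention(
...     q, k, v, mask=mask, dropout=0.0, training=False)
>>> # out: (4, 6, 10, 32), attn_weights: (4, 6, 10, 12)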

parakeet.modules.audio module

class parakeet.modules.audio.MelScale(sr, n_fft, n_mels, fmin, fmax)[source]

Bases: paddle.fluid.dygraph.layers.Layer

forward(spec)[source]

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters
*inputs: tuple

unpacked tuple arguments

**kwargs: dict

unpacked dict arguments

class parakeet.modules.audio.STFT(n_fft, hop_length=None, win_length=None, window='hanning', center=True, pad_mode='reflect')[source]

Bases: paddle.fluid.dygraph.layers.Layer

A module for computing stft transformation in a differentiable way.

Parameters
n_fft: int

Number of samples in a frame.

hop_length: int

Number of samples shifted between adjacent frames.

win_length: int

Length of the window.

window: str, optional

Name of window function, see scipy.signal.get_window for more details. Defaults to “hanning”.

center: bool

If True, the signal y is padded so that frame D[:, t] is centered at y[t * hop_length]. If False, then D[:, t] begins at y[t * hop_length]. Defaults to True.

pad_mode: string or function

If center=True, this argument is passed to np.pad for padding the edges of the signal y. By default (pad_mode=”reflect”), y is padded on both sides with its own reflection, mirrored around its first and last sample respectively. If center=False, this argument is ignored.

Notes

It behaves like librosa.core.stft. See librosa.core.stft for more details.

Given an audio waveform of T samples, the STFT outputs a spectrum of shape (C, frames) with complex dtype, where C = 1 + n_fft / 2 and frames = 1 + T // hop_length.

Only center and reflect padding are supported now.

forward(x)[source]

Compute the stft transform.

Parameters
x: Tensor [shape=(B, T)]

The input waveform.

Returns
real: Tensor [shape=(B, C, frames)]

The real part of the spectrogram.

imag: Tensor [shape=(B, C, frames)]

The imaginary part of the spectrogram.

magnitude(x)[source]

Compute the magnitude of the spectrum.

Parameters
x: Tensor [shape=(B, T)]

The input waveform.

Returns
Tensor [shape=(B, C, T)]

The magnitude of the spectrum.

power(x)[source]

Compute the power spectrum.

Parameters
x: Tensor [shape=(B, T)]

The input waveform.

Returns
Tensor [shape=(B, C, T)]

The power spectrum.
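
Examples

A minimal usage sketch of the STFT layer and its methods; the waveform length and the concrete frame sizes are illustrative assumptions.

>>> import paddle
>>> from parakeet.modules.audio import STFT
>>> stft = STFT(n_fft=1024, hop_length=256, win_length=1024, window='hanning')
>>> wav = paddle.randn([2, 16000])  # (B, T)
>>> real, imag = stft(wav)          # each (B, C, frames), C = 1 + n_fft // 2
>>> mag = stft.magnitude(wav)       # magnitude spectrum
>>> pow_spec = stft.power(wav)      # power spectrum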

parakeet.modules.audio.dequantize(quantized, n_bands, dtype=None)[source]

Linearly dequantize an integer Tensor into a float Tensor in the range [-1, 1).

Parameters
quantized: Tensor [dtype: int]

The quantized value in the range [0, n_bands).

n_bands: int

Number of bands. The input integer Tensor’s value is in the range [0, n_bands).

dtype: str, optional

Data type of the output.

Returns
Tensor

The dequantized tensor, dtype is specified by dtype. If dtype is not specified, the default float data type is used.

parakeet.modules.audio.quantize(values, n_bands)[source]

Linearly quantize a float Tensor in [-1, 1) to an integer Tensor in [0, n_bands).

Parameters
values: Tensor [dtype: float32 or float64]

The floating point value.

n_bands: int

The number of bands. The output integer Tensor’s value is in the range [0, n_bands).

Returns
Tensor [dtype: int64]

The quantized tensor.
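
Examples

A minimal round-trip sketch for quantize and dequantize; the clipped random waveform is an illustrative assumption.

>>> import paddle
>>> from parakeet.modules.audio import quantize, dequantize
>>> wav = paddle.clip(paddle.randn([4, 1000]) * 0.3, -1.0, 0.999)  # values in [-1, 1)
>>> q = quantize(wav, n_bands=256)                                 # integer tensor in [0, 256)
>>> recovered = dequantize(q, n_bands=256, dtype='float32')
>>> # recovered approximates wav up to the quantization step of 2 / n_bands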

parakeet.modules.conv module

class parakeet.modules.conv.Conv1dBatchNorm(in_channels, out_channels, kernel_size, stride=1, padding=0, weight_attr=None, bias_attr=None, data_format='NCL', momentum=0.9, epsilon=1e-05)[source]

Bases: paddle.fluid.dygraph.layers.Layer

A Conv1D Layer followed by a BatchNorm1D.

Parameters
in_channels: int

The feature size of the input.

out_channels: int

The feature size of the output.

kernel_size: int

The size of the convolution kernel.

stride: int, optional

The stride of the convolution, by default 1.

padding: int, str or Tuple[int], optional

The padding of the convolution. If int, a symmetrical padding is applied before convolution; If str, it should be “same” or “valid”; If Tuple[int], its length should be 2, meaning (pad_before, pad_after), by default 0.

weight_attr: ParamAttr, Initializer, str or bool, optional

The parameter attribute of the convolution kernel, by default None.

bias_attr: ParamAttr, Initializer, str or bool, optional

The parameter attribute of the bias of the convolution, by default None.

data_format: str [“NCL” or “NLC”], optional

The data layout of the input, by default “NCL”.

momentum: float, optional

The momentum of the BatchNorm1D layer, by default 0.9.

epsilon: float, optional

The epsilon of the BatchNorm1D layer, by default 1e-05

forward(x)[source]

Forward pass of the Conv1dBatchNorm layer.

Parameters
x: Tensor [shape=(B, C_in, T_in) or (B, T_in, C_in)]

The input tensor. Its data layout depends on data_format.

Returns
Tensor [shape=(B, C_out, T_out) or (B, T_out, C_out)]

The output tensor.
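
Examples

A minimal usage sketch; the mel-spectrogram-like input sizes are illustrative assumptions.

>>> import paddle
>>> from parakeet.modules.conv import Conv1dBatchNorm
>>> layer = Conv1dBatchNorm(in_channels=80, out_channels=256, kernel_size=5, padding=2)
>>> x = paddle.randn([8, 80, 100])  # (B, C_in, T) for the default data_format="NCL"
>>> y = layer(x)                    # (B, C_out, T); the symmetric padding of 2 keeps T unchanged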

class parakeet.modules.conv.Conv1dCell(in_channels, out_channels, kernel_size, dilation=1, weight_attr=None, bias_attr=None)[source]

Bases: paddle.nn.layer.conv.Conv1D

A subclass of Conv1D layer, which can be used in an autoregressive decoder like an RNN cell.

When used in autoregressive decoding, it performs causal temporal convolution incrementally. At each time step, it takes a step input and returns a step output.

Parameters
in_channels: int

The feature size of the input.

out_channels: int

The feature size of the output.

kernel_size: int or Tuple[int]

The size of the kernel.

dilation: int or Tuple[int]

The dilation of the convolution, by default 1

weight_attr: ParamAttr, Initializer, str or bool, optional

The parameter attribute of the convolution kernel, by default None.

bias_attr: ParamAttr, Initializer, str or bool, optional

The parameter attribute of the bias. If False, this layer does not have a bias, by default None.

Notes

This is done by caching an internal buffer of length receptive_field - 1. When adding a step input, the buffer is shifted by one step, the latest input is appended to the buffer, and the oldest step is discarded; a step output is then returned. In the single step case, convolution is equivalent to a linear transformation. Using this layer as a cell depends on several restrictions: 1. stride must be 1; 2. padding must be a causal padding (receptive_field - 1, 0). Thus, these arguments are removed from the __init__ method of this class.

Examples

>>> import paddle
>>> from parakeet.modules.conv import Conv1dCell
>>> cell = Conv1dCell(3, 4, kernel_size=5)
>>> inputs = [paddle.randn([4, 3]) for _ in range(16)]
>>> outputs = []
>>> cell.eval()
>>> cell.start_sequence()
>>> for xt in inputs:
...     outputs.append(cell.add_input(xt))
>>> len(outputs)
16
>>> outputs[0].shape
[4, 4]

add_input(x_t)[source]

Add step input and compute step output.

Parameters
x_t: Tensor [shape=(batch_size, in_channels)]

The step input.

Returns
y_t: Tensor [shape=(batch_size, out_channels)]

The step output.

initialize_buffer(x_t)[source]

Initialize the buffer for the step input.

Parameters
x_t: Tensor [shape=(batch_size, in_channels)]

The step input.

property receptive_field

The receptive field of the Conv1dCell.

start_sequence()[source]

Prepare the layer for a series of incremental forward.

Raises
Exception

If this method is called when the layer is in training mode.

Warning

This method should be called before a sequence of calls to add_input.

update_buffer(x_t)[source]

Shift the buffer by one step.

Parameters
x_t: Tensor [shape=(batch_size, in_channels)]

The step input.

parakeet.modules.geometry module

parakeet.modules.geometry.shuffle_dim(x, axis, perm=None)[source]

Permute the input tensor along an axis, given the permutation or randomly.

Parameters
x: Tensor

The input tensor.

axis: int

The axis to shuffle.

perm: List[int], ndarray, optional

The order to reorder the tensor along the axis-th dimension.

It is a permutation of [0, d), where d is the size of the axis-th dimension of the input tensor. If not provided, a random permutation is used. Defaults to None.

Returns
Tensor

The shuffled tensor, which has the same shape as x does.
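
Examples

A minimal usage sketch; the sizes and the explicit reversal permutation are illustrative assumptions.

>>> import paddle
>>> from parakeet.modules.geometry import shuffle_dim
>>> x = paddle.randn([4, 10, 8])
>>> y = shuffle_dim(x, axis=1)                               # random permutation along axis 1
>>> z = shuffle_dim(x, axis=1, perm=list(range(9, -1, -1)))  # explicit permutation (a reversal)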

parakeet.modules.losses module

parakeet.modules.losses.guided_attention_loss(attention_weight, dec_lens, enc_lens, g)[source]

Guided attention loss, masked to exclude padding parts.

parakeet.modules.losses.masked_l1_loss(prediction, target, mask)[source]

Compute masked L1 loss.

Parameters
prediction: Tensor

The prediction.

target: Tensor

The target. The shape should be broadcastable to prediction.

mask: Tensor

The mask. The shape should be broadcastable to the broadcasted shape of prediction and target.

Returns
Tensor [shape=(1,)]

The masked L1 loss.

parakeet.modules.losses.masked_softmax_with_cross_entropy(logits, label, mask, axis=-1)[source]

Compute masked softmax with cross entropy loss.

Parameters
logits: Tensor

The logits. The axis-th axis is the class dimension.

label: Tensor [dtype: int]

The label. The size of the axis-th axis should be 1.

mask: Tensor

The mask. The shape should be broadcastable to label.

axis: int, optional

The index of the class dimension in the shape of logits, by default -1.

Returns
Tensor [shape=(1,)]

The masked softmax with cross entropy loss.

parakeet.modules.losses.weighted_mean(input, weight)[source]

Weighted mean. It can also be used as masked mean.

Parameters
input: Tensor

The input tensor.

weight: Tensor

The weight tensor, with a shape broadcastable to that of the input.

Returns
Tensor [shape=(1,)]

Weighted mean tensor with the same dtype as input.
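
Examples

A minimal sketch of the masked losses above; the spectrogram-like shapes and the all-ones mask are illustrative assumptions.

>>> import paddle
>>> from parakeet.modules.losses import masked_l1_loss, weighted_mean
>>> prediction = paddle.randn([4, 50, 80])
>>> target = paddle.randn([4, 50, 80])
>>> mask = paddle.ones([4, 50, 1])  # broadcastable mask; zeros would mark padded frames
>>> loss = masked_l1_loss(prediction, target, mask)
>>> mean = weighted_mean(paddle.abs(prediction - target), mask)  # a masked mean of the same error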

parakeet.modules.masking module

parakeet.modules.masking.combine_mask(mask1, mask2)[source]

Combine two masks with multiplication or logical and.

Parameters
mask1: Tensor

The first mask.

mask2: Tensor

The second mask, with a shape broadcastable to mask1.

Returns
Tensor

Combined mask.

Notes

It is mainly used to combine the padding mask and the no-future mask for the transformer decoder.

The padding mask is used to mask padding positions of the decoder inputs, and the no-future mask is used to prevent the decoder from seeing future information.

parakeet.modules.masking.feature_mask(input, axis, dtype='bool')[source]

Compute mask from input features.

For input features represented as batched feature vectors, vectors that are all zeros are considered padding vectors.

Parameters
input: Tensor [dtype: float]

The input tensor, which represents features.

axis: int

The index of the feature dimension in input. Other dimensions are considered spatial dimensions.

dtype: str, optional

Data type of the generated mask, by default “bool”.

Returns
Tensor

The generated mask with spatial shape as mentioned above.

It has one less dimension than input does.

parakeet.modules.masking.future_mask(time_steps, dtype='bool')[source]

Generate lower triangular mask.

It is used in the transformer decoder to prevent the decoder from seeing future information.

Parameters
time_steps: int

Decoder time steps.

dtype: str, optional

The data type of the generated mask, by default “bool”.

Returns
Tensor

The generated mask.

parakeet.modules.masking.id_mask(input, padding_index=0, dtype='bool')[source]

Generate mask with input ids.

Those positions where the value equals padding_index correspond to 0 or False, otherwise, 1 or True.

Parameters
input: Tensor [dtype: int]

The input tensor. It represents the ids.

padding_index: int, optional

The id which represents padding, by default 0.

dtype: str, optional

Data type of the returned mask, by default “bool”.

Returns
Tensor

The generated mask. It has the same shape as input.
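
Examples

A minimal sketch combining the masks above into a decoder mask, as described in the combine_mask notes; the token ids and the float dtype are illustrative assumptions.

>>> import paddle
>>> from parakeet.modules import masking
>>> ids = paddle.to_tensor([[5, 7, 2, 0, 0], [3, 1, 0, 0, 0]])             # 0 is the padding id
>>> padding_mask = masking.id_mask(ids, padding_index=0, dtype='float32')  # (2, 5)
>>> no_future = masking.future_mask(5, dtype='float32')                    # (5, 5), lower triangular
>>> dec_mask = masking.combine_mask(padding_mask.unsqueeze(1), no_future)  # (2, 5, 5)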

parakeet.modules.positional_encoding module

parakeet.modules.positional_encoding.scaled_position_encoding(num_positions: int, feature_size: int, omega: paddle.VarBase, start_pos: int = 0, dtype=None) → paddle.VarBase[source]
parakeet.modules.positional_encoding.sinusoid_position_encoding(num_positions: int, feature_size: int, omega: float = 1.0, start_pos: int = 0, dtype=None) → paddle.VarBase[source]
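
A minimal usage sketch of sinusoid_position_encoding; the output shape of (num_positions, feature_size) is an assumption inferred from the signature, not stated in the docs.

>>> import paddle
>>> from parakeet.modules.positional_encoding import sinusoid_position_encoding
>>> pe = sinusoid_position_encoding(num_positions=100, feature_size=256)
>>> # presumably a (num_positions, feature_size) table of sinusoid position encodings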

parakeet.modules.transformer module

class parakeet.modules.transformer.PositionwiseFFN(input_size: int, hidden_size: int, dropout=0.0)[source]

Bases: paddle.fluid.dygraph.layers.Layer

A faithful implementation of the Position-wise Feed-Forward Network in Attention is All You Need. It is basically a 2-layer MLP, with ReLU activation and dropout in between.

Parameters
input_size: int

The feature size of the input. It is also the feature size of the output.

hidden_size: int

The hidden size.

dropout: float

The probability of the Dropout applied to the output of the first layer, by default 0.

forward(x)[source]

Forward pass of positionwise feed forward network.

Parameters
x: Tensor [shape=(*, input_size)]

The input tensor, where * means an arbitrary shape.

Returns
Tensor [shape=(*, input_size)]

The output tensor.
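
Examples

A minimal usage sketch; the batch and sequence sizes are illustrative assumptions.

>>> import paddle
>>> from parakeet.modules.transformer import PositionwiseFFN
>>> ffn = PositionwiseFFN(input_size=256, hidden_size=1024, dropout=0.1)
>>> x = paddle.randn([8, 50, 256])
>>> y = ffn(x)  # same shape as x: (8, 50, 256)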

class parakeet.modules.transformer.TransformerDecoderLayer(d_model, n_heads, d_ffn, dropout=0.0)[source]

Bases: paddle.fluid.dygraph.layers.Layer

A faithful implementation of Transformer decoder layer in Attention is All You Need.

Parameters
d_model: int

The feature size of the input. It is also the feature size of the output.

n_heads: int

The number of heads of attentions (MultiheadAttention layers).

d_ffn: int

The hidden size of the positional feed forward network (a PositionwiseFFN layer).

dropout: float, optional

The probability of the dropout in MultiHeadAttention and PositionwiseFFN, by default 0.

Notes

It uses the PostLN (post layer norm) scheme.

forward(q, k, v, encoder_mask, decoder_mask)[source]

Forward pass of TransformerDecoderLayer.

Parameters
q: Tensor [shape=(batch_size, time_steps_q, d_model)]

The decoder input.

k: Tensor [shape=(batch_size, time_steps_k, d_model)]

The keys.

v: Tensor [shape=(batch_size, time_steps_k, d_model)]

The values.

encoder_mask: Tensor

Encoder padding mask, shape is (batch_size, time_steps_k, time_steps_k) or broadcastable shape.

decoder_mask: Tensor

Decoder mask, shape is (batch_size, time_steps_q, time_steps_k) or broadcastable shape.

Returns
q: Tensor [shape=(batch_size, time_steps_q, d_model)]

The decoder output.

self_attn_weights: Tensor [shape=(batch_size, n_heads, time_steps_q, time_steps_q)]

Decoder self attention.

cross_attn_weights: Tensor [shape=(batch_size, n_heads, time_steps_q, time_steps_k)]

Decoder-encoder cross attention.

class parakeet.modules.transformer.TransformerEncoderLayer(d_model, n_heads, d_ffn, dropout=0.0)[source]

Bases: paddle.fluid.dygraph.layers.Layer

A faithful implementation of Transformer encoder layer in Attention is All You Need.

Parameters
d_model: int

The feature size of the input. It is also the feature size of the output.

n_heads: int

The number of heads of self attention (a MultiheadAttention layer).

d_ffn: int

The hidden size of the positional feed forward network (a PositionwiseFFN layer).

dropout: float, optional

The probability of the dropout in MultiHeadAttention and PositionwiseFFN, by default 0.

Notes

It uses the PostLN (post layer norm) scheme.

forward(x, mask)[source]

Forward pass of TransformerEncoderLayer.

Parameters
x: Tensor [shape=(batch_size, time_steps, d_model)]

The input.

mask: Tensor

The padding mask. The shape is (batch_size, time_steps, time_steps) or broadcastable shape.

Returns
x: Tensor [shape=(batch_size, time_steps, d_model)]

The encoded output.

attn_weights: Tensor [shape=(batch_size, n_heads, time_steps, time_steps)]

The attention weights of the self attention.
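
Examples

A minimal usage sketch; the sizes and the all-ones padding mask are illustrative assumptions.

>>> import paddle
>>> from parakeet.modules.transformer import TransformerEncoderLayer
>>> layer = TransformerEncoderLayer(d_model=256, n_heads=4, d_ffn=1024, dropout=0.1)
>>> x = paddle.randn([8, 50, 256])
>>> mask = paddle.ones([8, 1, 50])  # broadcastable padding mask; zeros would mark padding
>>> y, attn_weights = layer(x, mask)
>>> # y: (8, 50, 256); attn_weights: (8, 4, 50, 50) per the docs above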

Module contents