art.estimators.speech_recognition

Module containing estimators for speech recognition.

Mixin Base Class Speech Recognizer

class art.estimators.speech_recognition.SpeechRecognizerMixin

Mix-in Base class for ART speech recognizers.

Speech Recognizer Deep Speech

class art.estimators.speech_recognition.PyTorchDeepSpeech(model: Optional[DeepSpeech] = None, pretrained_model: Optional[str] = None, filename: Optional[str] = None, url: Optional[str] = None, use_half: bool = False, optimizer: Optional[torch.optim.Optimizer] = None, use_amp: bool = False, opt_level: str = 'O1', loss_scale: Optional[Union[float, str]] = 1.0, decoder_type: str = 'greedy', lm_path: str = '', top_paths: int = 1, alpha: float = 0.0, beta: float = 0.0, cutoff_top_n: int = 40, cutoff_prob: float = 1.0, beam_width: int = 10, lm_workers: int = 4, clip_values: Optional[CLIP_VALUES_TYPE] = None, preprocessing_defences: Optional[Union[Preprocessor, List[Preprocessor]]] = None, postprocessing_defences: Optional[Union[Postprocessor, List[Postprocessor]]] = None, preprocessing: PREPROCESSING_TYPE = None, device_type: str = 'gpu')

This class implements a model-specific automatic speech recognizer using the end-to-end speech recognizer DeepSpeech and PyTorch.

__init__(model: Optional[DeepSpeech] = None, pretrained_model: Optional[str] = None, filename: Optional[str] = None, url: Optional[str] = None, use_half: bool = False, optimizer: Optional[torch.optim.Optimizer] = None, use_amp: bool = False, opt_level: str = 'O1', loss_scale: Optional[Union[float, str]] = 1.0, decoder_type: str = 'greedy', lm_path: str = '', top_paths: int = 1, alpha: float = 0.0, beta: float = 0.0, cutoff_top_n: int = 40, cutoff_prob: float = 1.0, beam_width: int = 10, lm_workers: int = 4, clip_values: Optional[CLIP_VALUES_TYPE] = None, preprocessing_defences: Optional[Union[Preprocessor, List[Preprocessor]]] = None, postprocessing_defences: Optional[Union[Postprocessor, List[Postprocessor]]] = None, preprocessing: PREPROCESSING_TYPE = None, device_type: str = 'gpu')

Initialize a PyTorchDeepSpeech instance.

Parameters
  • model – DeepSpeech model.

  • pretrained_model – The pretrained model to load, if one is required. This estimator currently supports three pretrained models: an4, librispeech, and tedlium.

  • filename – Name of the file.

  • url – Download URL.

  • use_half (bool) – Whether to use FP16 for pretrained model.

  • optimizer – The optimizer used to train the estimator.

  • use_amp (bool) – Whether to use the automatic mixed precision tool to enable mixed precision training or gradient computation, e.g. with loss gradient computation. When set to True, this option is only triggered if there are GPUs available.

  • opt_level (str) – Specify a pure or mixed precision optimization level. Used when use_amp is True. Accepted values are O0, O1, O2, and O3.

  • loss_scale – Loss scaling. Used when use_amp is True. Default is 1.0 due to warp-ctc not supporting scaling of gradients. If passed as a string, must be a string representing a number, e.g., “1.0”, or the string “dynamic”.

  • decoder_type (str) – Decoder type. Either greedy or beam. This parameter is only used when users want transcription outputs.

  • lm_path (str) – Path to an (optional) kenlm language model for use with beam search. This parameter is only used when users want transcription outputs.

  • top_paths (int) – Number of beams to be returned. This parameter is only used when users want transcription outputs.

  • alpha (float) – The weight used for the language model. This parameter is only used when users want transcription outputs.

  • beta (float) – Language model word bonus (all words). This parameter is only used when users want transcription outputs.

  • cutoff_top_n (int) – Only the cutoff_top_n characters with the highest probabilities in the vocabulary are used in beam search. This parameter is only used when users want transcription outputs.

  • cutoff_prob (float) – Cutoff probability in pruning. This parameter is only used when users want transcription outputs.

  • beam_width (int) – The width of beam to be used. This parameter is only used when users want transcription outputs.

  • lm_workers (int) – Number of language model processes to use. This parameter is only used when users want transcription outputs.

  • clip_values – Tuple of the form (min, max) of floats or np.ndarray representing the minimum and maximum values allowed for features. If floats are provided, these will be used as the range of all features. If arrays are provided, each value will be considered the bound for a feature, thus the shape of clip values needs to match the total number of features.

  • preprocessing_defences – Preprocessing defence(s) to be applied by the estimator.

  • postprocessing_defences – Postprocessing defence(s) to be applied by the estimator.

  • preprocessing – Tuple of the form (subtrahend, divisor) of floats or np.ndarray of values to be used for data preprocessing. The first value will be subtracted from the input. The input will then be divided by the second one.

  • device_type (str) – Type of device to be used for model and tensors. If cpu, run on the CPU. If gpu, run on the GPU if one is available, otherwise run on the CPU.
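As a minimal sketch of how these parameters fit together, the snippet below collects a CPU-only, greedy-decoding configuration and passes it to the constructor. The helper names example_kwargs and build_estimator are illustrative and not part of ART. The import is deferred because instantiating the class requires ART, PyTorch, and the DeepSpeech package to be installed, and loading a pretrained model may trigger a download.

```python
def example_kwargs():
    """Keyword arguments drawn from the parameter list above (values are an illustrative choice)."""
    return {
        "pretrained_model": "librispeech",  # one of: an4, librispeech, tedlium
        "device_type": "cpu",               # "gpu" falls back to CPU if no GPU is available
        "use_amp": False,                   # mixed precision only takes effect on GPUs
        "decoder_type": "greedy",           # or "beam"; only used for transcription outputs
    }


def build_estimator(**overrides):
    """Instantiate PyTorchDeepSpeech; requires ART and its DeepSpeech dependencies."""
    # Deferred import so this module can be loaded without ART installed.
    from art.estimators.speech_recognition import PyTorchDeepSpeech

    kwargs = example_kwargs()
    kwargs.update(overrides)
    return PyTorchDeepSpeech(**kwargs)
```

Calling build_estimator(decoder_type="beam", beam_width=20) would override the defaults shown here while keeping the remaining arguments unchanged.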

property channel_index
Returns

Index of the axis containing the color channels in the samples x.

property channels_first
Returns

Boolean indicating whether the channel axis comes first in the samples x.

property clip_values

Return the clip values of the input samples.

Returns

Clip values (min, max).

property device

Get the device currently in use.

Returns

The device currently in use.

fit(x: numpy.ndarray, y: numpy.ndarray, batch_size: int = 128, nb_epochs: int = 10, **kwargs) → None

Fit the estimator on the training set (x, y).

Parameters
  • x (ndarray) – Samples of shape (nb_samples, seq_length). Sequences in the batch may have different lengths. A possible example of x could be: x = np.array([np.array([0.1, 0.2, 0.1, 0.4]), np.array([0.3, 0.1])], dtype=object). Recent NumPy versions require dtype=object when the inner arrays differ in length.

  • y (ndarray) – Target values of shape (nb_samples). Each sample in y is a string, and the strings may differ in length. A possible example of y could be: y = np.array(['SIXTY ONE', 'HELLO']).

  • batch_size (int) – Size of batches.

  • nb_epochs (int) – Number of epochs to use for training.

  • kwargs – Dictionary of framework-specific arguments. This parameter is not currently supported for PyTorch, and providing it has no effect.
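The variable-length input format described above can be built as follows. This is a sketch only: the helper make_ragged_batch is hypothetical, and the float32 dtype is an assumption rather than a documented requirement. The call to fit is left as a comment because it needs a constructed estimator.

```python
import numpy as np


def make_ragged_batch(waveforms):
    """Pack variable-length 1-D waveforms into an object array of shape (nb_samples,)."""
    batch = np.empty(len(waveforms), dtype=object)
    for i, waveform in enumerate(waveforms):
        # Each entry is its own 1-D array, so lengths may differ between samples.
        batch[i] = np.asarray(waveform, dtype=np.float32)
    return batch


x = make_ragged_batch([[0.1, 0.2, 0.1, 0.4], [0.3, 0.1]])
y = np.array(["SIXTY ONE", "HELLO"])

# With an already constructed estimator (not created here):
# estimator.fit(x, y, batch_size=2, nb_epochs=1)
```

Filling a preallocated object array avoids the ambiguity of letting NumPy infer the shape from nested sequences.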

fit_generator(generator: DataGenerator, nb_epochs: int = 20, **kwargs) → None

Fit the estimator using a generator yielding training batches. Implementations can provide framework-specific versions of this function to speed-up computation.

Parameters
  • generator – Batch generator providing (x, y) for each epoch.

  • nb_epochs (int) – Number of training epochs.

get_activations(x: numpy.ndarray, layer: Union[int, str], batch_size: int, framework: bool = False) → numpy.ndarray

Return the output of a specific layer for samples x, where layer is either the index of the layer, between 0 and nb_layers - 1, or the name of the layer. The number of layers can be determined by counting the entries returned by layer_names.

Return type

ndarray

Parameters
  • x (ndarray) – Samples

  • layer – Index or name of the layer.

  • batch_size (int) – Batch size.

  • framework (bool) – If true, return the intermediate tensor representation of the activation.

Returns

The output of layer, where the first dimension is the batch size corresponding to x.

get_params() → Dict[str, Any]

Get all parameters and their values of this estimator.

Returns

A dictionary of string parameter names to their value.

property input_shape

Return the shape of one input sample.

Returns

Shape of one input sample.

property layer_names

Return the names of the hidden layers in the model, if applicable.

Returns

The names of the hidden layers in the model. Input and output layers are ignored.

Warning

layer_names tries to infer the internal structure of the model. This feature comes with no guarantees on the correctness of the result. The intended order of the layers tries to match their order in the model, but this is not guaranteed either.

property learning_phase

The learning phase set by the user. Possible values are True for training or False for prediction and None if it has not been set by the library. In the latter case, the library does not do any explicit learning phase manipulation and the current value of the backend framework is used. If a value has been set by the user for this property, it will impact all following computations for model fitting, prediction and gradients.

Returns

Learning phase.

loss(x: numpy.ndarray, y: numpy.ndarray, **kwargs) → numpy.ndarray

Compute the loss of the neural network for samples x.

Parameters
  • x (ndarray) – Samples of shape (nb_samples, nb_features) or (nb_samples, nb_pixels_1, nb_pixels_2, nb_channels) or (nb_samples, nb_channels, nb_pixels_1, nb_pixels_2).

  • y (ndarray) – Target values (class labels) one-hot-encoded of shape (nb_samples, nb_classes) or indices of shape (nb_samples,).

Returns

Loss values.

Return type

Format as expected by the model

loss_gradient(x: numpy.ndarray, y: numpy.ndarray, **kwargs) → numpy.ndarray

Compute the gradient of the loss function w.r.t. x.

Return type

ndarray

Parameters
  • x (ndarray) – Samples of shape (nb_samples, seq_length). Sequences in the batch may have different lengths. A possible example of x could be: x = np.array([np.array([0.1, 0.2, 0.1, 0.4]), np.array([0.3, 0.1])], dtype=object). Recent NumPy versions require dtype=object when the inner arrays differ in length.

  • y (ndarray) – Target values of shape (nb_samples). Each sample in y is a string, and the strings may differ in length. A possible example of y could be: y = np.array(['SIXTY ONE', 'HELLO']).

Returns

Loss gradients of the same shape as x.
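Because the returned gradients have the same shape as x, they can be applied sample by sample even when sequence lengths differ. The sketch below shows one signed-gradient (FGSM-style) step on stand-in data. The function fgsm_step, the eps value, and the gradient numbers are illustrative assumptions; in practice the gradients would come from loss_gradient(x, y).

```python
import numpy as np


def fgsm_step(x, grads, eps):
    """Apply one signed-gradient step to each variable-length sample."""
    x_adv = np.empty(len(x), dtype=object)
    for i in range(len(x)):
        # The sign of the gradient gives the perturbation direction; eps gives its size.
        x_adv[i] = x[i] + eps * np.sign(grads[i])
    return x_adv


# Stand-in batch with two sequences of different lengths.
x = np.empty(2, dtype=object)
x[0] = np.array([0.1, 0.2, 0.1, 0.4])
x[1] = np.array([0.3, 0.1])

# Stand-in gradients; with a real estimator: grads = estimator.loss_gradient(x, y)
grads = np.empty(2, dtype=object)
grads[0] = np.array([0.5, -0.2, 0.0, 1.0])
grads[1] = np.array([-3.0, 0.7])

x_adv = fgsm_step(x, grads, eps=0.01)
```

Entries with a zero gradient are left unchanged, since np.sign returns 0 for them.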

property model

Get current model.

Returns

Current model.

predict(x: numpy.ndarray, batch_size: int = 128, **kwargs) → Union[Tuple[numpy.ndarray, numpy.ndarray], numpy.ndarray]

Perform prediction for a batch of inputs.

Parameters
  • x (ndarray) – Samples of shape (nb_samples, seq_length). Sequences in the batch may have different lengths. A possible example of x could be: x = np.array([np.array([0.1, 0.2, 0.1, 0.4]), np.array([0.3, 0.1])], dtype=object). Recent NumPy versions require dtype=object when the inner arrays differ in length.

  • batch_size (int) – Batch size.

  • transcription_output (bool) – Indicate whether the function returns probabilities or transcriptions as the prediction output. If transcription_output is not provided, the probability output is returned.

Returns

Probability (if transcription_output is None or False) or transcription (if transcription_output is True) predictions:

  • Probability return is a tuple of (probs, sizes), where probs is the probability of characters of shape (nb_samples, seq_length, nb_classes) and sizes is the real sequence length of shape (nb_samples,).

  • Transcription return is a numpy array of characters. A possible example of a transcription return is np.array(['SIXTY ONE', 'HELLO']).
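To illustrate how the (probs, sizes) tuple relates to a transcription, the following sketch performs greedy best-path CTC decoding by hand. It is not the estimator's own decoder: the function greedy_ctc_decode, the three-symbol label string, and the blank index of 0 are all hypothetical, and the toy probabilities are made up for the example.

```python
def greedy_ctc_decode(probs, sizes, labels, blank_index=0):
    """Best-path decoding: argmax per frame, merge repeats, drop blanks."""
    transcripts = []
    for sample, size in zip(probs, sizes):
        chars = []
        previous = None
        # Only the first `size` frames are real; the rest is padding.
        for frame in sample[: int(size)]:
            best = max(range(len(frame)), key=lambda k: frame[k])
            if best != previous and best != blank_index:
                chars.append(labels[best])
            previous = best
        transcripts.append("".join(chars))
    return transcripts


labels = "_AB"  # hypothetical label set; index 0 is the CTC blank
probs = [
    [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7]],
    [[0.2, 0.1, 0.7], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.1, 0.8, 0.1]],
]
sizes = [5, 2]  # the second sample has only two real frames

decoded = greedy_ctc_decode(probs, sizes, labels)
```

The sizes entry matters here: without it, the padded frames of the second sample would be decoded as extra characters.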

set_learning_phase(train: bool) → None

Set the learning phase for the backend framework.

Parameters

train (bool) – True if the learning phase is training, otherwise False.

set_params(**kwargs) → None

Take a dictionary of parameters and apply checks before setting them as attributes.

Parameters

kwargs – A dictionary of attributes.

transform_model_input(x: Union[numpy.ndarray, torch.Tensor], y: Optional[numpy.ndarray] = None, compute_gradient: bool = False, tensor_input: bool = False, real_lengths: Optional[numpy.ndarray] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, List]

Transform the user input space into the model input space.

Return type

Tuple

Parameters
  • x – Samples of shape (nb_samples, seq_length). Sequences in the batch may have different lengths. A possible example of x could be: x = np.array([np.array([0.1, 0.2, 0.1, 0.4]), np.array([0.3, 0.1])], dtype=object).

  • y – Target values of shape (nb_samples). Each sample in y is a string, and the strings may differ in length. A possible example of y could be: y = np.array(['SIXTY ONE', 'HELLO']).

  • compute_gradient (bool) – Indicate whether to compute gradients for the input x.

  • tensor_input (bool) – Indicate whether input is tensor.

  • real_lengths – Real lengths of original sequences.

Returns

A tuple of inputs and targets in the model space with the original index (inputs, targets, input_percentages, target_sizes, batch_idx), where:

  • inputs: model inputs of shape (nb_samples, nb_frequencies, seq_length).

  • targets: ground truth targets of shape (sum over nb_samples of real seq_lengths).

  • input_percentages: percentages of real inputs in inputs.

  • target_sizes: list of real seq_lengths.

  • batch_idx: original index of inputs.
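The idea behind input_percentages can be sketched on raw sequences: pad every sample to the longest length and record what fraction of each padded row is real data. This is an assumption about the meaning of the field, not the library's implementation, which additionally converts the audio to a frequency representation and encodes the targets. The helper pad_and_measure is hypothetical.

```python
def pad_and_measure(sequences, pad_value=0.0):
    """Pad sequences to a common length and return the real-data fraction of each row."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    percentages = [len(seq) / max_len for seq in sequences]
    return padded, percentages


padded, percentages = pad_and_measure([[0.1, 0.2, 0.1, 0.4], [0.3, 0.1]])
```

In this toy batch the first row is entirely real data and the second row is half padding.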