simulstream.server.speech_processors.base_streamatt.BaseStreamAtt

class simulstream.server.speech_processors.base_streamatt.BaseStreamAtt(config: SimpleNamespace)

A partial implementation of BaseSpeechProcessor that provides common logic for the StreamAtt policy, introduced in:

S. Papi, et al. 2024. “StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection” (https://aclanthology.org/2024.acl-long.202/)

The approach relies on selecting the audio history based on the cross-attention mechanism. Specifically, the history for the next decoding step is defined as follows:

  • First, the new textual history is selected by the text_history_method, which is in charge of selecting the tokens to retain;

  • Second, the new audio history is selected according to cross-attention scores between the audio features and the retained textual history by discarding past features that are not attended by any tokens of the textual history.
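The two-step selection described here can be illustrated with a minimal sketch. It is not the library's implementation: the function name select_audio_history, the list-of-lists attention layout (one row per text token, one column per audio frame), and the choice of treating the highest-scoring frame as the one a token "attends" are all assumptions made for illustration.

```python
def select_audio_history(cross_attention, retained_tokens):
    """Return the index of the first audio frame to keep (hypothetical helper).

    cross_attention: one row per text token, one score per audio frame.
    retained_tokens: indices of the tokens kept in the textual history.
    Assumption: a token "attends" the frame where its score is highest.
    """
    if not retained_tokens:
        raise ValueError("at least one retained token is required")
    attended = []
    for t in retained_tokens:
        row = cross_attention[t]
        # Index of the most attended frame for this token (first one on ties).
        attended.append(max(range(len(row)), key=row.__getitem__))
    # Frames before the earliest attended frame are not attended by any
    # retained token, so they can be discarded.
    return min(attended)


# Toy example: 3 tokens, 6 audio frames.
attn = [
    [0.7, 0.2, 0.1, 0.0, 0.0, 0.0],  # token 0 attends frame 0
    [0.1, 0.1, 0.6, 0.2, 0.0, 0.0],  # token 1 attends frame 2
    [0.0, 0.0, 0.1, 0.2, 0.6, 0.1],  # token 2 attends frame 4
]
features = ["f0", "f1", "f2", "f3", "f4", "f5"]

# Suppose the text history method retained only tokens 1 and 2.
first_kept = select_audio_history(attn, retained_tokens=[1, 2])
kept_features = features[first_kept:]
```

In this toy case, dropping token 0 from the textual history also lets the two oldest frames be dropped, since no retained token points at them.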

The derived class should implement the following methods:
  • audio_max_len: Returns the maximum audio feature length.

  • load_model: Loads the model to device.

  • _preprocess: Preprocesses the audio features before feeding them into the model.

  • _generate: Runs generation and also returns the cross-attention scores.
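A rough sketch of what a derived class might look like is given below. The real BaseStreamAtt is not imported here: StreamAttLikeBase is a hypothetical stand-in, and every method signature, argument name, and return shape is an assumption, since this page lists only the method names and their purpose.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace


class StreamAttLikeBase(ABC):
    """Hypothetical stand-in for BaseStreamAtt; NOT the real class."""

    def __init__(self, config: SimpleNamespace):
        self.config = config

    @abstractmethod
    def audio_max_len(self):
        """Return the maximum audio feature length."""

    @abstractmethod
    def load_model(self):
        """Load the model to the device."""

    @abstractmethod
    def _preprocess(self, waveform):
        """Turn raw audio into model input features (assumed signature)."""

    @abstractmethod
    def _generate(self, features, prefix_tokens):
        """Return (tokens, cross_attention) (assumed signature)."""


class ToyProcessor(StreamAttLikeBase):
    """Toy subclass that fakes a model, for illustration only."""

    def audio_max_len(self):
        return 3000  # arbitrary illustrative value

    def load_model(self):
        self.model = "toy-model"  # a real subclass would load weights here

    def _preprocess(self, waveform):
        return [float(x) for x in waveform]

    def _generate(self, features, prefix_tokens):
        tokens = ["hello"]
        n = len(features)
        # One row per generated token, uniform attention over all frames.
        cross_attention = [[1.0 / n] * n for _ in tokens]
        return tokens, cross_attention


processor = ToyProcessor(SimpleNamespace())
processor.load_model()
feats = processor._preprocess([1, 2, 3, 4])
tokens, cross_attention = processor._generate(feats, prefix_tokens=[])
```

The point of the sketch is only that _generate must expose cross-attention scores alongside the output tokens, because the history selection depends on them.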

Parameters:

config (SimpleNamespace) –

Configuration object. The following attributes are expected:

  • text_history (SimpleNamespace): Nested configuration with the following attribute:

      • type (str): Name of the class used to determine the text history to keep as context for the next predictions.

  • audio_subsampling_factor (int): Subsampling factor of the model, if any. Defaults to 1.

  • text_history_max_len (int): The maximum length of the textual history after which the current content is cut. Defaults to 128.

  • cross_attention_layer (int): Layer from which to extract the cross-attention scores.

  • cutoff_frame_num (int): Number of trailing (most recent) frames that tokens are not allowed to attend to in the AlignAtt policy.

  • word_level_postprocess (bool): Whether to postprocess the generated tokens to keep only complete words in the emitted hypothesis. To be disabled when operating with character-level languages. Defaults to True.
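As an illustration of the expected shape, a configuration object with the attributes listed above could be built as follows. The attribute names come from this page; all values are illustrative, and "MyTextHistory" is a hypothetical class name rather than one known to exist in the library.

```python
from types import SimpleNamespace

config = SimpleNamespace(
    # Nested namespace; "MyTextHistory" is a placeholder class name.
    text_history=SimpleNamespace(type="MyTextHistory"),
    audio_subsampling_factor=4,    # illustrative; documented default is 1
    text_history_max_len=128,      # documented default
    cross_attention_layer=3,       # illustrative layer index
    cutoff_frame_num=2,            # illustrative value
    word_level_postprocess=True,   # documented default
)
```

Such an object would then be passed to the constructor of a concrete subclass of BaseStreamAtt.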