simulstream.server.speech_processors.base_streamatt.BaseStreamAtt
- class simulstream.server.speech_processors.base_streamatt.BaseStreamAtt(config: SimpleNamespace)
A partial implementation of BaseSpeechProcessor that provides common logic for the StreamAtt policy, introduced in:
S. Papi, et al. 2024. “StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection” (https://aclanthology.org/2024.acl-long.202/)
The approach relies on selecting the audio history based on the cross-attention mechanism. Specifically, the history for the next decoding step is defined as follows:
- First, the new textual history is selected by the text_history_method, which is in charge of selecting the tokens to retain;
- Second, the new audio history is selected according to the cross-attention scores between the audio features and the retained textual history, by discarding past features that are not attended by any token of the textual history.
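The second step can be illustrated with a minimal sketch. It is not the simulstream implementation: the function name first_frame_to_keep is hypothetical, and it assumes that a token "attends" the single frame with the highest cross-attention score, which is only one possible reading of "attended".

```python
from typing import Sequence


def _argmax(row: Sequence[float]) -> int:
    """Index of the highest score in a row (first one on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


def first_frame_to_keep(
    cross_attention: Sequence[Sequence[float]],
    num_retained_tokens: int,
) -> int:
    """Return the index of the earliest audio frame to keep.

    cross_attention has one row per generated token and one column per
    audio frame. Only the last num_retained_tokens rows (the retained
    textual history) are considered.
    """
    num_frames = len(cross_attention[0]) if cross_attention else 0
    if num_retained_tokens <= 0 or num_frames == 0:
        # No textual history is retained, so no frame is attended.
        return num_frames
    retained = cross_attention[-num_retained_tokens:]
    # Assumption: each token attends the frame with the highest score.
    most_attended = [_argmax(row) for row in retained]
    # Frames before the earliest attended frame can be discarded.
    return min(most_attended)


# Three generated tokens over five audio frames (illustrative scores).
scores = [
    [0.7, 0.2, 0.1, 0.0, 0.0],  # token 0 attends frame 0
    [0.1, 0.1, 0.6, 0.1, 0.1],  # token 1 attends frame 2
    [0.0, 0.1, 0.1, 0.7, 0.1],  # token 2 attends frame 3
]
audio_frames = ["f0", "f1", "f2", "f3", "f4"]

# Keep only the last two tokens as textual history.
start = first_frame_to_keep(scores, num_retained_tokens=2)
kept_frames = audio_frames[start:]
```

With the last two tokens retained, frames f0 and f1 are dropped because neither retained token attends them; retaining all three tokens would keep the whole audio.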
- The derived class should implement the following methods:
audio_max_len: Returns the maximum audio feature length.
load_model: Loads the model to device.
_preprocess: Preprocesses the audio features before feeding them into the model.
_generate: Runs generation and also returns the cross-attention scores.
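The division of work between the base class and a derived class can be sketched as follows. The sketch does not import simulstream: StreamAttLikeBase is a hypothetical stand-in for BaseStreamAtt, and every signature and return type shown is an assumption (for example, audio_max_len may be a property and load_model a class method in the real code), so the actual source should be checked before subclassing.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace
from typing import List, Tuple


class StreamAttLikeBase(ABC):
    """Hypothetical stand-in for BaseStreamAtt, NOT the real class."""

    def __init__(self, config: SimpleNamespace):
        self.config = config

    @abstractmethod
    def audio_max_len(self) -> int:
        """Return the maximum audio feature length."""

    @abstractmethod
    def load_model(self) -> None:
        """Load the model to the device."""

    @abstractmethod
    def _preprocess(self, waveform: List[float]) -> List[float]:
        """Turn raw audio into model input features."""

    @abstractmethod
    def _generate(
        self, features: List[float]
    ) -> Tuple[List[str], List[List[float]]]:
        """Return tokens and one row of cross-attention scores per token."""


class ToyProcessor(StreamAttLikeBase):
    """Toy subclass with no real model, only to show the required methods."""

    def audio_max_len(self) -> int:
        return 3000  # illustrative value

    def load_model(self) -> None:
        self.model = "toy-model"  # placeholder instead of a real model

    def _preprocess(self, waveform: List[float]) -> List[float]:
        return [abs(sample) for sample in waveform]

    def _generate(self, features):
        tokens = ["hello"]
        # Uniform attention over all frames, one row per generated token.
        uniform = 1.0 / len(features)
        return tokens, [[uniform] * len(features) for _ in tokens]


processor = ToyProcessor(SimpleNamespace())
processor.load_model()
tokens, attention = processor._generate(processor._preprocess([0.1, -0.2]))
```

Declaring the four methods as abstract means that a subclass that omits one of them cannot be instantiated, which surfaces a missing method early.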
- Parameters:
config (SimpleNamespace) – Configuration object. The following attributes are expected:
text_history (SimpleNamespace): Configuration with the following attribute:
    type (str): Name of the class used to determine the text history to keep as context for the next predictions.
audio_subsampling_factor (int): Subsampling factor of the model, if any. Defaults to 1.
text_history_max_len (int): The maximum length of the textual history after which the current content is cut. Defaults to 128.
cross_attention_layer (int): Layer from which to extract the cross-attention scores.
cutoff_frame_num (int): Number of last frames that cannot be attended by tokens in the AlignAtt policy.
word_level_postprocess (bool): Whether to postprocess the generated tokens to keep only complete words in the emitted hypothesis. To be disabled when operating with character-level languages. Defaults to True.
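As an illustration, a configuration object with these attributes could be built as below. This is a sketch only: the class name "MyTextHistory" is a hypothetical placeholder, and the values of cross_attention_layer and cutoff_frame_num are arbitrary examples, not recommended settings. The getattr calls only mirror the defaults documented above; how the class itself reads missing attributes is not shown here.

```python
from types import SimpleNamespace

config = SimpleNamespace(
    # Nested namespace; "MyTextHistory" is a hypothetical class name.
    text_history=SimpleNamespace(type="MyTextHistory"),
    cross_attention_layer=3,  # illustrative value
    cutoff_frame_num=2,       # illustrative value
)

# Attributes left out above fall back to the documented defaults.
audio_subsampling_factor = getattr(config, "audio_subsampling_factor", 1)
text_history_max_len = getattr(config, "text_history_max_len", 128)
word_level_postprocess = getattr(config, "word_level_postprocess", True)
```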