simulstream.server.speech_processors.base_streamatt.BaseStreamAtt
- class simulstream.server.speech_processors.base_streamatt.BaseStreamAtt(config: SimpleNamespace)
Bases: BaseSpeechProcessor
A partial implementation of BaseSpeechProcessor that provides common logic for the StreamAtt policy, introduced in:
S. Papi, et al. 2024. “StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection” (https://aclanthology.org/2024.acl-long.202/)
The approach relies on selecting the audio history based on the cross-attention mechanism. Specifically, the history for the next decoding step is defined as follows:
- First, the new textual history is selected by the text_history_method, which is in charge of selecting the tokens to retain;
- Second, the new audio history is selected according to the cross-attention scores between the audio features and the retained textual history, by discarding past features that are not attended to by any token of the textual history.
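The second step can be illustrated with a minimal, self-contained sketch. It assumes that a frame counts as "attended" by a token when it receives that token's highest cross-attention score, and that the audio is trimmed up to the earliest attended frame. The function name `select_audio_history` and the list-of-lists layout (one row per token, one column per frame) are illustrative and not part of the library.

```python
from typing import List


def select_audio_history(
    cross_attn: List[List[float]], retained_tokens: List[int]
) -> int:
    """Return the index of the first audio frame to keep.

    cross_attn[t][f] is the attention of token t on frame f.
    retained_tokens holds the indices of the tokens kept in the textual history.
    Assumption: a frame is "attended" by a token if it has that token's
    highest attention score.
    """
    if not retained_tokens:
        # Assumption: with no retained tokens, keep all the audio.
        return 0
    attended_frames = []
    for t in retained_tokens:
        row = cross_attn[t]
        # Index of the frame with the highest score for this token.
        attended_frames.append(max(range(len(row)), key=row.__getitem__))
    # Frames before the earliest attended one are not needed by any token.
    return min(attended_frames)


# Three tokens over five frames; the first token is dropped from the history.
attn = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.20, 0.60],
]
first_kept = select_audio_history(attn, retained_tokens=[1, 2])
print(first_kept)  # frames before this index would be discarded
```

In this toy input, the retained tokens attend most to frames 2 and 4, so the frames before index 2 would be dropped from the audio history.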
- The derived class should implement the following methods:
audio_max_len: Returns the maximum audio feature length.
load_model: Loads the model to device.
_preprocess: Preprocesses the audio features before feeding them into the model.
_generate: Runs generation and also returns the cross-attention scores.
- Parameters:
config (SimpleNamespace) –
Configuration object. The following attributes are expected:
- text_history (str): config (SimpleNamespace) with the following attribute:
type (str): Name of the class to use to determine the text history to keep as context for next predictions.
audio_subsampling_factor (int): Subsampling factor of the model, if any. Defaults to 1.
text_history_max_len (int): The maximum length of the textual history after which the current content is cut. Defaults to 128.
cross_attention_layer (int): Layer from which to extract the cross-attention scores.
cutoff_frame_num (int): Number of last frames that cannot be attended by tokens in the AlignAtt policy.
word_level_postprocess (bool): Whether to postprocess the generated tokens to keep only complete words in the emitted hypothesis. To be disabled when operating with character-level languages. Defaults to True.
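As a rough illustration of the attributes listed above, a configuration object could be assembled by hand as follows. This is only a sketch: it assumes that text_history is a nested SimpleNamespace carrying the type attribute, and the class name `MyTextHistory` and the values for cross_attention_layer and cutoff_frame_num are placeholders rather than recommended settings.

```python
from types import SimpleNamespace

config = SimpleNamespace(
    # Hypothetical class name; use the text history class available in your setup.
    text_history=SimpleNamespace(type="MyTextHistory"),
    audio_subsampling_factor=1,    # documented default
    text_history_max_len=128,      # documented default
    cross_attention_layer=3,       # placeholder value
    cutoff_frame_num=2,            # placeholder value
    word_level_postprocess=True,   # documented default
)

print(config.text_history.type)
```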
- __init__(config: SimpleNamespace)
Initialize the speech processor with a given configuration.
- Parameters:
config (SimpleNamespace) – Configuration loaded from a YAML file.
Methods
__init__(config) – Initialize the speech processor with a given configuration.
alignatt_policy(generated_tokens, cross_attn) – Apply the AlignAtt policy by cutting off tokens whose attention falls beyond the allowed frame range. The AlignAtt policy was introduced in: S. Papi, et al. 2023. "AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation" (https://www.isca-archive.org/interspeech_2023/papi23_interspeech.html).
clear() – Clear internal states, such as history of cached audio and/or tokens, in preparation for a new stream or conversation.
end_of_stream() – This method is called at the end of audio chunk processing.
load_model(config) – Load and initialize the underlying speech model.
normalize_attn(attn) – Normalize the cross-attention scores along the frame dimension to avoid attention sinks.
process_chunk(waveform) – Process a chunk of waveform and produce incremental output.
set_source_language(language) – Set the source language for the speech processor.
set_target_language(language) – Set the target language for the speech processor (for translation).
tokens_to_string(tokens) – Converts token sequences into human-readable strings.
Attributes
audio_max_len – Return the maximum allowed length of the audio features, beyond which the audio is cut off.
Return the size of the speech chunks to be processed (in seconds).
- alignatt_policy(generated_tokens, cross_attn) → List[str]
Apply the AlignAtt policy by cutting off tokens whose attention falls beyond the allowed frame range. The AlignAtt policy was introduced in:
S. Papi, et al. 2023. “AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation” (https://www.isca-archive.org/interspeech_2023/papi23_interspeech.html)
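The idea behind the policy can be sketched in a few lines of plain Python. This is not the library's implementation: the function name `alignatt_cut`, the list-of-lists attention layout (one row per generated token, one column per audio frame), and the use of the most-attended frame as the alignment point are assumptions made for illustration.

```python
from typing import List


def alignatt_cut(
    tokens: List[str], cross_attn: List[List[float]], cutoff_frame_num: int
) -> List[str]:
    """Keep the tokens generated before the first one aligned to the last frames."""
    if not cross_attn:
        return []
    num_frames = len(cross_attn[0])
    # Frames from this index onward are the last `cutoff_frame_num` frames.
    first_forbidden = num_frames - cutoff_frame_num
    kept = []
    for token, row in zip(tokens, cross_attn):
        # Assumption: a token is aligned to its most-attended frame.
        most_attended = max(range(len(row)), key=row.__getitem__)
        if most_attended >= first_forbidden:
            # The token relies on audio that is too recent: stop emitting here.
            break
        kept.append(token)
    return kept


tokens = ["Hello", "world", "again"]
attn = [
    [0.1, 0.6, 0.1, 0.1, 0.05, 0.05],  # most-attended frame: 1
    [0.1, 0.1, 0.1, 0.5, 0.1, 0.1],    # most-attended frame: 3
    [0.0, 0.1, 0.1, 0.1, 0.1, 0.6],    # most-attended frame: 5 (among the last 2)
]
emitted = alignatt_cut(tokens, attn, cutoff_frame_num=2)
print(emitted)
```

With six frames and a cutoff of two, the third token is aligned to one of the last two frames, so only the first two tokens are kept in this example.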
- abstract property audio_max_len: float
Return the maximum allowed length of the audio features, beyond which the audio is cut off.
- clear() → None
Clear internal states, such as history of cached audio and/or tokens, in preparation for a new stream or conversation.
- end_of_stream() → IncrementalOutput
This method is called at the end of audio chunk processing. It can be used to emit hypotheses at the end of the speech to conclude the output.
- Returns:
The incremental output (new and deleted tokens/strings).
- Return type:
IncrementalOutput
- abstractmethod classmethod load_model(config: SimpleNamespace)
Load and initialize the underlying speech model.
- Parameters:
config (SimpleNamespace) – Configuration of the speech processor.
- static normalize_attn(attn)
Normalize the cross-attention scores along the frame dimension to avoid attention sinks.
- process_chunk(waveform: float32) → IncrementalOutput
Process a chunk of waveform and produce incremental output.
- Parameters:
waveform (np.float32) – A 1D NumPy array of the audio chunk. The array is PCM audio normalized to the range [-1.0, 1.0], sampled at simulstream.server.speech_processors.SAMPLE_RATE.
- Returns:
The incremental output (new and deleted tokens/strings).
- Return type:
IncrementalOutput
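To make the expected input format concrete, the sketch below converts 16-bit PCM samples into a 1D float32 array in the range [-1.0, 1.0]. The helper name `pcm16_to_waveform` is hypothetical, and the local `SAMPLE_RATE` value of 16000 is an assumption used only to size the toy input; the actual rate is the one defined in simulstream.server.speech_processors.SAMPLE_RATE.

```python
import numpy as np

# Assumption for this sketch only; the real value is defined by the library.
SAMPLE_RATE = 16000


def pcm16_to_waveform(pcm: np.ndarray) -> np.ndarray:
    """Convert 16-bit PCM samples to float32 values in [-1.0, 1.0]."""
    return (pcm.astype(np.float32) / 32768.0).clip(-1.0, 1.0)


# Half a second of silence followed by the two extreme 16-bit sample values.
pcm = np.concatenate(
    [
        np.zeros(SAMPLE_RATE // 2, dtype=np.int16),
        np.array([-32768, 32767], dtype=np.int16),
    ]
)
waveform = pcm16_to_waveform(pcm)
print(waveform.dtype, waveform.shape)
```

An array prepared this way has the shape and value range described for the waveform parameter, and could then be passed to process_chunk of a concrete subclass.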
- abstractmethod set_source_language(language: str) → None
Set the source language for the speech processor.
- Parameters:
language (str) – Language code (e.g., "en", "it").
- abstractmethod set_target_language(language: str) → None
Set the target language for the speech processor (for translation).
- Parameters:
language (str) – Language code (e.g., "en", "it").