simulstream.server.speech_processors.base_streamatt

Classes

`BaseStreamAtt`(config)	A partial implementation of `BaseSpeechProcessor` that provides common logic for the StreamAtt policy.
`FixedWordsTextHistory`(config)	Fixed Words textual history selection method that retains a pre-defined number of words in the history (history_words).
`PunctuationTextHistory`(config)	Punctuation textual history selection method that retains the sentence before the last strong punctuation character.

class simulstream.server.speech_processors.base_streamatt.BaseStreamAtt(config: SimpleNamespace)

A partial implementation of BaseSpeechProcessor that provides common logic for the StreamAtt policy, introduced in:

S. Papi, et al. 2024. “StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection” (https://aclanthology.org/2024.acl-long.202/)

The approach relies on selecting the audio history based on the cross-attention mechanism. Specifically, the history for the next decoding step is defined as follows:

- First, the new textual history is selected by the text_history_method, which is in charge of selecting the tokens to retain;
- Second, the new audio history is selected according to the cross-attention scores between the audio features and the retained textual history, by discarding past features that are not attended by any token of the textual history.
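The second step can be illustrated with a minimal, library-independent sketch. It assumes that the cross-attention is available as a tokens × frames matrix, that a token "attends" the frame where its score is highest, and that the history is trimmed from the start up to the earliest attended frame. The function name `select_audio_history` and these choices are illustrative assumptions, not the simulstream implementation.

```python
from typing import Sequence


def select_audio_history(
    cross_attn: Sequence[Sequence[float]],
    retained_token_idxs: Sequence[int],
) -> int:
    """Return the index of the first audio frame to keep.

    cross_attn[t][f] is the attention score of token t over frame f.
    Assumption: a token "attends" the frame where its score is highest.
    """
    if not retained_token_idxs:
        # Assumption: with no retained tokens, keep all audio frames.
        return 0
    attended = []
    for t in retained_token_idxs:
        row = cross_attn[t]
        attended.append(max(range(len(row)), key=row.__getitem__))
    # Frames before the earliest attended one are not attended by any
    # retained token, so they can be discarded.
    return min(attended)


attn = [
    [0.7, 0.2, 0.1, 0.0, 0.0],  # token 0 attends frame 0
    [0.1, 0.1, 0.6, 0.1, 0.1],  # token 1 attends frame 2
    [0.0, 0.1, 0.2, 0.6, 0.1],  # token 2 attends frame 3
]
# Suppose the textual history keeps only tokens 1 and 2.
first_frame = select_audio_history(attn, retained_token_idxs=[1, 2])
kept_frames = list(range(first_frame, len(attn[0])))
```

In this toy case, frames 0 and 1 are attended only by the discarded token 0, so they are dropped from the audio history.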

The derived class should implement the following methods:

audio_max_len: Returns the maximum audio feature length.

load_model: Loads the model to device.

_preprocess: Preprocesses the audio features before feeding them into the model.

_generate: Runs generation and also returns the cross-attention scores.

Parameters:

config (SimpleNamespace) –

Configuration object. The following attributes are expected:

- text_history (SimpleNamespace): Config with the following attribute:
  - type (str): Name of the class to use to determine the text history to keep as context for next predictions.
- audio_subsampling_factor (int): Subsampling factor of the model, if any. Defaults to 1.
- text_history_max_len (int): The maximum length of the textual history, after which the current content is cut. Defaults to 128.
- cross_attention_layer (int): Layer from which to extract the cross-attention.
- cutoff_frame_num (int): Number of last frames that cannot be attended by tokens in the AlignAtt policy.
- word_level_postprocess (bool): Whether to postprocess the generated tokens to keep only complete words in the emitted hypothesis. To be disabled when operating with character-level languages. Defaults to True.
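As a rough illustration, a configuration with the attributes listed above could be assembled with the standard-library `SimpleNamespace`. The values for `cross_attention_layer` and `cutoff_frame_num` are placeholders, and whether `type` expects a bare class name or a fully qualified path is an assumption here; check the deployment's own configuration files.

```python
from types import SimpleNamespace

config = SimpleNamespace(
    # Nested config selecting the textual history method.
    # Assumption: the class is referenced by its fully qualified name.
    text_history=SimpleNamespace(
        type="simulstream.server.speech_processors.base_streamatt.FixedWordsTextHistory",
    ),
    audio_subsampling_factor=1,   # documented default
    text_history_max_len=128,     # documented default
    cross_attention_layer=3,      # placeholder value, model dependent
    cutoff_frame_num=2,           # placeholder value
    word_level_postprocess=True,  # documented default
)
```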

alignatt_policy(generated_tokens, cross_attn) → List[str]: Apply the AlignAtt policy by cutting off tokens whose attention falls beyond the allowed frame range. The AlignAtt policy was introduced in:

S. Papi, et al. 2023. “AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation” (https://www.isca-archive.org/interspeech_2023/papi23_interspeech.html)
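The core idea of the policy can be sketched without any model: a token is emitted only if the frame it attends most is not among the last `cutoff_frame_num` frames, and emission stops at the first token that violates this. The function `alignatt_cut` below is a hypothetical, simplified stand-in written for illustration. It is not the `alignatt_policy` method, whose internals are not shown in this reference.

```python
from typing import List, Sequence


def alignatt_cut(
    tokens: Sequence[str],
    cross_attn: Sequence[Sequence[float]],
    cutoff_frame_num: int,
) -> List[str]:
    """Keep the tokens preceding the first one aligned to the last frames.

    cross_attn[t][f] is the attention score of token t over frame f.
    """
    emitted: List[str] = []
    for token, row in zip(tokens, cross_attn):
        num_frames = len(row)
        most_attended = max(range(num_frames), key=row.__getitem__)
        if most_attended >= num_frames - cutoff_frame_num:
            # The token relies on audio that is too recent: stop here.
            break
        emitted.append(token)
    return emitted


tokens = ["▁Hello", "▁world", "▁again"]
attn = [
    [0.10, 0.60, 0.10, 0.10, 0.05, 0.05],  # most attended frame: 1
    [0.05, 0.10, 0.20, 0.50, 0.10, 0.05],  # most attended frame: 3
    [0.00, 0.05, 0.05, 0.10, 0.20, 0.60],  # most attended frame: 5
]
emitted = alignatt_cut(tokens, attn, cutoff_frame_num=2)
```

With six frames and a cutoff of two, frames 4 and 5 are off limits, so the third token is held back until more audio arrives.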

abstract property audio_max_len: float: Return the maximum allowed length of the audio features, beyond which the audio is cut off.

clear() → None: Clear internal states, such as history of cached audio and/or tokens, in preparation for a new stream or conversation.

end_of_stream() → IncrementalOutput

This method is called at the end of audio chunk processing. It can be used to emit hypotheses at the end of the speech to conclude the output.

Returns: The incremental output (new and deleted tokens/strings).
Return type: IncrementalOutput

static normalize_attn(attn): Normalize the cross-attention scores along the frame dimension to avoid attention sinks.

process_chunk(waveform: float32) → IncrementalOutput

Process a chunk of waveform and produce incremental output.

Parameters: waveform (np.float32) – A 1D NumPy array of the audio chunk. The array is PCM audio normalized to the range [-1.0, 1.0], sampled at simulstream.server.speech_processors.SAMPLE_RATE.
Returns: The incremental output (new and deleted tokens/strings).
Return type: IncrementalOutput
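The input format described above can be prepared as in the following sketch, which converts 16-bit PCM samples into a normalized float32 array and splits a waveform into fixed-size chunks. The helper names and the 16 kHz value of `SAMPLE_RATE` are assumptions made for the example. In real code, the constant should be read from simulstream.server.speech_processors.SAMPLE_RATE.

```python
from typing import List

import numpy as np

# Assumption for this sketch only; the real value is defined by
# simulstream.server.speech_processors.SAMPLE_RATE.
SAMPLE_RATE = 16_000


def pcm16_to_float32(pcm: np.ndarray) -> np.ndarray:
    """Convert 16-bit PCM samples to float32 in the range [-1.0, 1.0]."""
    return (pcm.astype(np.float32) / 32768.0).clip(-1.0, 1.0)


def split_into_chunks(waveform: np.ndarray, chunk_seconds: float) -> List[np.ndarray]:
    """Split a 1D waveform into consecutive chunks of chunk_seconds each."""
    step = int(chunk_seconds * SAMPLE_RATE)
    return [waveform[i:i + step] for i in range(0, len(waveform), step)]


pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
chunk = pcm16_to_float32(pcm)

# One second of silence split into half-second chunks.
chunks = split_into_chunks(np.zeros(SAMPLE_RATE, dtype=np.float32), 0.5)
```

Each element of `chunks` would then be a suitable argument for `process_chunk`.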

class simulstream.server.speech_processors.base_streamatt.FixedWordsTextHistory(config: SimpleNamespace)

Fixed Words textual history selection method that retains a pre-defined number of words in the history (history_words).

The current implementation supports only SentencePiece.
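The dependency on SentencePiece comes from how word boundaries are recovered from tokens: SentencePiece marks the start of a word with the "▁" symbol. A minimal sketch of retaining the last `history_words` words from a token list, under that convention, is shown below. The function `keep_last_words` is a hypothetical helper for illustration and does not reproduce the class's actual code.

```python
from typing import List

WORD_START = "\u2581"  # "▁", the SentencePiece word-start marker


def keep_last_words(tokens: List[str], history_words: int) -> List[str]:
    """Retain the tokens that make up the last `history_words` words."""
    if history_words <= 0:
        return []
    words_seen = 0
    # Walk backwards, counting word starts until enough words are found.
    for idx in range(len(tokens) - 1, -1, -1):
        if tokens[idx].startswith(WORD_START):
            words_seen += 1
            if words_seen == history_words:
                return tokens[idx:]
    # Fewer words than requested: keep everything.
    return list(tokens)


tokens = ["▁The", "▁cat", "▁s", "at", "▁down"]
history = keep_last_words(tokens, history_words=2)
```

Here the last two words are "sat" (split into "▁s" and "at") and "down", so three tokens are retained.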

class simulstream.server.speech_processors.base_streamatt.PunctuationTextHistory(config: SimpleNamespace)

Punctuation textual history selection method that retains the sentence before the last strong punctuation character.

The current implementation supports only SentencePiece.