simulstream.server.speech_processors.base_streamatt
Classes
BaseStreamAtt: A partial implementation of BaseSpeechProcessor that provides common logic for the StreamAtt policy.
FixedWordsTextHistory: Fixed Words textual history selection method that retains a pre-defined number of words in the history (history_words).
PunctuationTextHistory: Punctuation textual history selection method that retains the sentence before the last strong punctuation character.
- class simulstream.server.speech_processors.base_streamatt.BaseStreamAtt(config: SimpleNamespace)
A partial implementation of BaseSpeechProcessor that provides common logic for the StreamAtt policy, introduced in:
S. Papi, et al. 2024. “StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection” (https://aclanthology.org/2024.acl-long.202/)
The approach selects the audio history using the cross-attention mechanism. Specifically, the history for the next decoding step is defined as follows:
- First, the new textual history is selected by the text_history_method, which is in charge of selecting the tokens to retain;
- Second, the new audio history is selected according to the cross-attention scores between the audio features and the retained textual history, discarding past features that are not attended by any token of the textual history.
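The second step can be illustrated with a minimal, self-contained sketch. It is not the library's code: the function name `select_audio_history_start`, the list-of-lists attention layout, and the use of the single most-attended frame per token are all assumptions made for illustration.

```python
from typing import List


def select_audio_history_start(
    cross_attn: List[List[float]], kept_token_indices: List[int]
) -> int:
    """Return the index of the first audio frame to keep.

    cross_attn[t][f] is the attention score of token t on audio frame f
    (hypothetical layout chosen for this sketch).
    """
    if not kept_token_indices:
        # Nothing to align against, so keep everything (an assumption of this sketch).
        return 0
    # For every retained token, find the frame it attends to the most.
    most_attended = [
        max(range(len(cross_attn[t])), key=cross_attn[t].__getitem__)
        for t in kept_token_indices
    ]
    # Frames before the earliest attended frame are not needed by any retained token.
    return min(most_attended)


# Toy example: 4 tokens, 6 audio frames.
attn = [
    [0.7, 0.2, 0.1, 0.0, 0.0, 0.0],  # token 0 peaks at frame 0
    [0.1, 0.6, 0.2, 0.1, 0.0, 0.0],  # token 1 peaks at frame 1
    [0.0, 0.1, 0.1, 0.7, 0.1, 0.0],  # token 2 peaks at frame 3
    [0.0, 0.0, 0.1, 0.2, 0.6, 0.1],  # token 3 peaks at frame 4
]
start = select_audio_history_start(attn, kept_token_indices=[2, 3])
print(start)  # frames [start:] would form the new audio history
```

In this toy case only tokens 2 and 3 are retained, so the frames before the peak of token 2 can be dropped from the audio history.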
Derived classes should implement the following members:
audio_max_len: Returns the maximum audio feature length.
load_model: Loads the model to the device.
_preprocess: Preprocesses the audio features before feeding them into the model.
_generate: Generates the output tokens and also returns the cross-attention scores.
- Parameters:
config (SimpleNamespace) – Configuration object. The following attributes are expected:
text_history (SimpleNamespace): Nested configuration with the following attribute:
    type (str): Name of the class to use to determine the text history to keep as context for next predictions.
audio_subsampling_factor (int): Subsampling factor of the model, if any. Defaults to 1.
text_history_max_len (int): The maximum length of the textual history after which the current content is cut. Defaults to 128.
cross_attention_layer (int): Layer from which to extract the cross-attention scores.
cutoff_frame_num (int): Number of last frames that cannot be attended by tokens in the AlignAtt policy.
word_level_postprocess (bool): Whether to postprocess the generated tokens to keep only complete words in the emitted hypothesis. To be disabled when operating with character-level languages. Defaults to True.
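As a rough illustration of the expected shape, a configuration object could be assembled as below. The attribute names come from the list above; every value is made up for the example, and both the form of `type` (bare class name rather than a full import path) and the placement of `history_words` inside `text_history` are assumptions.

```python
from types import SimpleNamespace

# Illustrative values only; the attribute names follow the parameter list above.
config = SimpleNamespace(
    text_history=SimpleNamespace(
        type="FixedWordsTextHistory",  # assumed form: bare class name
        history_words=20,              # assumed location of this option
    ),
    audio_subsampling_factor=1,   # documented default
    text_history_max_len=128,     # documented default
    cross_attention_layer=3,      # hypothetical layer index
    cutoff_frame_num=2,           # hypothetical value
    word_level_postprocess=True,  # documented default
)

print(config.text_history.type)
```

A concrete subclass of BaseStreamAtt would receive such an object in its constructor.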
- alignatt_policy(generated_tokens, cross_attn) → List[str]
Apply the AlignAtt policy by cutting off tokens whose attention falls beyond the allowed frame range. The AlignAtt policy was introduced in:
S. Papi, et al. 2023. “AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation” (https://www.isca-archive.org/interspeech_2023/papi23_interspeech.html)
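The idea behind the policy can be sketched in plain Python. This is a simplified illustration rather than the method's actual implementation: the function name `alignatt_cut` and the list-of-lists attention layout are hypothetical, and the sketch assumes one row of scores per generated token.

```python
from typing import List


def alignatt_cut(
    tokens: List[str], cross_attn: List[List[float]], cutoff_frame_num: int
) -> List[str]:
    """Keep tokens until one attends mostly to the last cutoff_frame_num frames.

    cross_attn[t][f] is the attention score of token t on audio frame f
    (hypothetical layout chosen for this sketch).
    """
    num_frames = len(cross_attn[0]) if cross_attn else 0
    # Frames at or beyond this index cannot be attended by emitted tokens.
    first_forbidden = num_frames - cutoff_frame_num
    emitted = []
    for token, scores in zip(tokens, cross_attn):
        most_attended = max(range(len(scores)), key=scores.__getitem__)
        if most_attended >= first_forbidden:
            # This token relies on audio that may still be incomplete: stop here.
            break
        emitted.append(token)
    return emitted


tokens = ["Hello", "world", "again"]
attn = [
    [0.1, 0.7, 0.1, 0.1, 0.0, 0.0],  # peaks at frame 1
    [0.0, 0.2, 0.6, 0.1, 0.1, 0.0],  # peaks at frame 2
    [0.0, 0.0, 0.1, 0.1, 0.2, 0.6],  # peaks at frame 5, inside the cutoff region
]
print(alignatt_cut(tokens, attn, cutoff_frame_num=2))
```

With six frames and `cutoff_frame_num=2`, the third token peaks inside the last two frames, so it and anything after it would be withheld until more audio arrives.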
- abstract property audio_max_len: float
Return the maximum allowed length of the audio features, beyond which the audio is cut off.
- clear() → None
Clear internal states, such as history of cached audio and/or tokens, in preparation for a new stream or conversation.
- end_of_stream() → IncrementalOutput
This method is called at the end of audio chunk processing. It can be used to emit hypotheses at the end of the speech to conclude the output.
- Returns:
The incremental output (new and deleted tokens/strings).
- Return type:
IncrementalOutput
- static normalize_attn(attn)
Normalize the cross-attention scores along the frame dimension to avoid attention sinks.
- process_chunk(waveform: float32) → IncrementalOutput
Process a chunk of waveform and produce incremental output.
- Parameters:
waveform (np.float32) – A 1D NumPy array of the audio chunk. The array is PCM audio normalized to the range [-1.0, 1.0], sampled at simulstream.server.speech_processors.SAMPLE_RATE.
- Returns:
The incremental output (new and deleted tokens/strings).
- Return type:
IncrementalOutput
- class simulstream.server.speech_processors.base_streamatt.FixedWordsTextHistory(config: SimpleNamespace)
Fixed Words textual history selection method that retains a pre-defined number of words in the history (history_words).
The current implementation supports only SentencePiece.
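The dependence on SentencePiece can be seen in a small sketch of fixed-words selection. It is an illustration under stated assumptions, not the class's code: the helper `keep_last_words` is hypothetical, and it assumes that every word starts with a piece prefixed by the SentencePiece marker "▁" (U+2581).

```python
from typing import List

WORD_START = "\u2581"  # SentencePiece marker that prefixes the first piece of a word


def keep_last_words(tokens: List[str], history_words: int) -> List[str]:
    """Return the trailing tokens that make up the last history_words words."""
    if history_words <= 0:
        return []
    words_seen = 0
    # Walk backwards and count word starts until enough words are collected.
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i].startswith(WORD_START):
            words_seen += 1
            if words_seen == history_words:
                return tokens[i:]
    # Fewer words than requested: keep the whole history.
    return list(tokens)


pieces = [WORD_START + "Hel", "lo", WORD_START + "wor", "ld", WORD_START + "a", "gain"]
print(keep_last_words(pieces, history_words=2))
```

Counting word-start markers, rather than tokens, keeps words intact even when the tokenizer splits them into several pieces.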
- class simulstream.server.speech_processors.base_streamatt.PunctuationTextHistory(config: SimpleNamespace)
Punctuation textual history selection method that retains the sentence before the last strong punctuation character.
The current implementation supports only SentencePiece.