simulstream.server.speech_processors.sliding_window_retranslation.SlidingWindowRetranslator
- class simulstream.server.speech_processors.sliding_window_retranslation.SlidingWindowRetranslator(config: SimpleNamespace)
Bases: BaseSpeechProcessor
A speech processor that applies a fixed-length sliding window retranslation with deduplication to mitigate overlapping outputs when processing unsegmented audio streams.
This class implements the algorithm introduced in:
S. Sen, et al. 2025. “Simultaneous Translation for Unsegmented Input: A Sliding Window Approach” (https://arxiv.org/pdf/2210.09754)
The approach relies on detecting the longest common subsequence between the current window and the previous one, in order to prevent repeating tokens caused by overlapping audio windows.
- Parameters:
config (SimpleNamespace) – Configuration object. The following attributes are expected:
window_len (int): Length of the sliding window (in seconds).
matching_threshold (float, optional): Minimum fraction of the current tokens that must match the previous history to be considered aligned. Default = 0.1.
override_on_failed_match (bool, optional): If True, the previous history is deleted from the output when no sufficient match is found. Otherwise, the previous history is kept and the new output is appended to the end of it. Default = False.
max_tokens_per_second (int, optional): Maximum output tokens allowed per second of audio. Default = 10.
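The interaction between matching_threshold and override_on_failed_match can be illustrated with a minimal sketch. This is not the library's implementation: the function name merge_window is hypothetical, and it uses the longest contiguous matching block from Python's difflib as a stand-in for the matching step described above. It also assumes that, on a successful match, history tokens after the matched span are retracted and replaced by the tokens that follow the match in the current window.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def merge_window(
    history: List[str],
    current: List[str],
    matching_threshold: float = 0.1,
    override_on_failed_match: bool = False,
) -> Tuple[List[str], List[str]]:
    """Return (deleted_tokens, new_tokens) for the latest window output.

    Hypothetical helper for illustration only.
    """
    if not current:
        return [], []

    # Longest contiguous run of tokens shared by the history and the new window.
    matcher = SequenceMatcher(None, history, current, autojunk=False)
    match = matcher.find_longest_match(0, len(history), 0, len(current))

    # The match is accepted when it covers a large enough fraction of `current`.
    if match.size > 0 and match.size / len(current) >= matching_threshold:
        # Assumption: retract whatever followed the match in the history
        # and emit whatever follows the match in the current window.
        deleted = history[match.a + match.size:]
        new = current[match.b + match.size:]
        return deleted, new

    if override_on_failed_match:
        # No sufficient match: drop the previous history, emit the new output.
        return list(history), list(current)

    # No sufficient match: keep the history and append the new output.
    return [], list(current)


deleted, new = merge_window(
    ["the", "cat", "sat", "on"],
    ["sat", "on", "the", "mat"],
)
print(deleted, new)
```

With these inputs the shared run is "sat on", so nothing is deleted and only "the mat" is appended.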
- __init__(config: SimpleNamespace)
Initialize the speech processor with a given configuration.
- Parameters:
config (SimpleNamespace) – Configuration loaded from a YAML file.
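As a rough illustration, a configuration with the documented attributes can be built from a plain mapping, which is the shape a YAML loader would typically return. The resolve helper and the token_budget calculation below are assumptions for illustration: the class may resolve defaults internally, and how it applies max_tokens_per_second to a window is not stated here.

```python
from types import SimpleNamespace

# Mapping as a YAML loader would typically produce it (values are illustrative).
raw = {"window_len": 10, "override_on_failed_match": True}
config = SimpleNamespace(**raw)

# Documented defaults for the optional attributes.
DEFAULTS = {
    "matching_threshold": 0.1,
    "override_on_failed_match": False,
    "max_tokens_per_second": 10,
}


def resolve(cfg: SimpleNamespace) -> SimpleNamespace:
    """Hypothetical helper: fill in documented defaults for missing options."""
    merged = {**DEFAULTS, **vars(cfg)}
    if "window_len" not in merged:
        raise ValueError("window_len is required")
    return SimpleNamespace(**merged)


resolved = resolve(config)

# Assumption: the per-second limit scales with the window length,
# giving an upper bound on tokens generated for one full window.
token_budget = resolved.window_len * resolved.max_tokens_per_second
print(resolved, token_budget)
```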
Methods
__init__(config) – Initialize the speech processor with a given configuration.
clear() – Clear internal states, such as history of cached audio and/or tokens, in preparation for a new stream or conversation.
end_of_stream() – This method is called at the end of audio chunk processing.
load_model(config) – Load and initialize the underlying speech model.
process_chunk(waveform) – Process a chunk of waveform and produce incremental output.
set_source_language(language) – Set the source language for the speech processor.
set_target_language(language) – Set the target language for the speech processor (for translation).
tokens_to_string(tokens) – Converts token sequences into human-readable strings.
Attributes
Return the size of the speech chunks to be processed (in seconds).
- clear() None
Clear internal states, such as history of cached audio and/or tokens, in preparation for a new stream or conversation.
- end_of_stream() IncrementalOutput
This method is called at the end of audio chunk processing. It can be used to emit hypotheses at the end of the speech to conclude the output.
- Returns:
The incremental output (new and deleted tokens/strings).
- Return type:
IncrementalOutput
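Since each call returns new and deleted tokens rather than a full transcript, a client has to apply the output to what it already displays. The sketch below shows one way to do that. OutputSketch is a hypothetical stand-in, not the real IncrementalOutput class, and its field names are assumptions. It also assumes that deleted tokens always come from the end of the transcript.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OutputSketch:
    """Hypothetical stand-in for an incremental output (field names assumed)."""
    new_tokens: List[str] = field(default_factory=list)
    deleted_tokens: List[str] = field(default_factory=list)


def apply_output(displayed: List[str], output: OutputSketch) -> List[str]:
    """Return the transcript after retracting deleted tokens and appending new ones."""
    n = len(output.deleted_tokens)
    if n:
        # Assumption: deletions always refer to the tail of the transcript.
        if displayed[-n:] != output.deleted_tokens:
            raise ValueError("deleted tokens do not match the end of the transcript")
        displayed = displayed[:-n]
    return displayed + output.new_tokens


transcript: List[str] = []
transcript = apply_output(transcript, OutputSketch(new_tokens=["hello", "word"]))
# A later window revises the last token and continues the sentence.
transcript = apply_output(
    transcript,
    OutputSketch(new_tokens=["world", "!"], deleted_tokens=["word"]),
)
print(transcript)
```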
- abstractmethod classmethod load_model(config: SimpleNamespace)
Load and initialize the underlying speech model.
- Parameters:
config (SimpleNamespace) – Configuration of the speech processor.
- process_chunk(waveform: float32) IncrementalOutput
Process a chunk of waveform and produce incremental output.
- Parameters:
waveform (np.float32) – A 1D NumPy array of the audio chunk. The array is PCM audio normalized to the range [-1.0, 1.0], sampled at simulstream.server.speech_processors.SAMPLE_RATE.
- Returns:
The incremental output (new and deleted tokens/strings).
- Return type:
IncrementalOutput
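The expected input format can be sketched as follows: 16-bit PCM samples are converted to float32 in the range [-1.0, 1.0] and then split into fixed-size chunks. The helper names are hypothetical, and ASSUMED_SAMPLE_RATE is a placeholder value; in real use the rate should be read from simulstream.server.speech_processors.SAMPLE_RATE.

```python
import numpy as np

# Placeholder only: use simulstream.server.speech_processors.SAMPLE_RATE in practice.
ASSUMED_SAMPLE_RATE = 16_000


def pcm16_to_float32(pcm: np.ndarray) -> np.ndarray:
    """Convert int16 PCM samples to float32 normalized to [-1.0, 1.0]."""
    return (pcm.astype(np.float32) / 32768.0).clip(-1.0, 1.0)


def split_chunks(waveform: np.ndarray, chunk_seconds: float, sample_rate: int):
    """Split a 1D waveform into consecutive chunks of chunk_seconds each.

    The final chunk may be shorter than the others.
    """
    step = int(chunk_seconds * sample_rate)
    return [waveform[i:i + step] for i in range(0, len(waveform), step)]


pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
waveform = pcm16_to_float32(pcm)

# 2.5 seconds of silence split into 1-second chunks.
silence = np.zeros(int(2.5 * ASSUMED_SAMPLE_RATE), dtype=np.float32)
chunks = split_chunks(silence, 1.0, ASSUMED_SAMPLE_RATE)
print(waveform.dtype, [len(c) for c in chunks])
```

Each element of chunks has the shape and dtype described for the waveform parameter and could be passed to process_chunk in order.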
- abstractmethod set_source_language(language: str) None
Set the source language for the speech processor.
- Parameters:
language (str) – Language code (e.g., "en", "it").
- abstractmethod set_target_language(language: str) None
Set the target language for the speech processor (for translation).
- Parameters:
language (str) – Language code (e.g., "en", "it").