simulstream.server.speech_processors.vad_wrapper

Classes

VADWrapperSpeechProcessor(config)

A speech processor that integrates Voice Activity Detection (VAD) to filter and split continuous audio streams into meaningful speech chunks before processing them with an underlying speech processor.

class simulstream.server.speech_processors.vad_wrapper.VADWrapperSpeechProcessor(config: SimpleNamespace)

A speech processor that integrates Voice Activity Detection (VAD) to filter and split continuous audio streams into meaningful speech chunks before processing them with an underlying speech processor.

This class wraps a SpeechProcessor implementation (defined in the configuration via the attribute base_speech_processor_class) with a Silero VAD-based iterator that detects the start and end of speech segments. Audio outside of speech is ignored, and each detected segment is passed to the underlying speech processor.
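The segmentation idea can be illustrated with a simplified sketch. It does not reproduce Silero VAD or this class's internals: it assumes a list of per-frame speech probabilities is already available, and the function name find_speech_segments and the 32 ms frame duration are hypothetical. It only shows how a probability threshold and a minimum silence duration could turn frame-level decisions into segments.

```python
from typing import List, Tuple


def find_speech_segments(
    probs: List[float],
    frame_ms: int = 32,          # assumed frame duration, for illustration only
    threshold: float = 0.5,      # plays the role of vad_threshold
    min_silence_ms: int = 100,   # plays the role of vad_min_silence_duration_ms
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, with end_frame exclusive."""
    segments = []
    start = None       # frame index where the current segment began
    silence_run = 0    # consecutive non-speech frames inside a segment
    # Number of silent frames needed to close a segment (ceiling division).
    min_silence_frames = -(-min_silence_ms // frame_ms)

    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # The segment ends at the frame where the silence began.
                segments.append((start, i - silence_run + 1))
                start = None
                silence_run = 0

    if start is not None:
        # The input ended while a segment was still open.
        segments.append((start, len(probs) - silence_run))
    return segments
```

In this sketch a short dip below the threshold does not split a segment; only a run of silence at least min_silence_ms long does.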

Parameters:

config (SimpleNamespace) –

Configuration object. The following attributes are used:

base_speech_processor_class (str): full name of the underlying speech processor class to use.
vad_threshold (float, optional): VAD probability threshold. Default = 0.5.
vad_min_silence_duration_ms (int, optional): Minimum silence duration (milliseconds) to consider the end of a speech segment. Default = 100.
vad_speech_pad_ms (int, optional): Padding (milliseconds) to include before and after detected speech. Default = 30.
min_speech_size (int, optional): Minimum segment size in seconds; shorter segments are ignored. Default = 1.
Any additional attributes required by the class named in base_speech_processor_class.
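As a minimal sketch of how such a configuration might be assembled, the snippet below builds a SimpleNamespace and resolves the documented defaults with getattr. The class path my_package.MySpeechProcessor is hypothetical, the sample rate of 16000 is an assumption (the real value is simulstream.server.speech_processors.SAMPLE_RATE), and how the wrapper actually reads its defaults is not shown in this page.

```python
from types import SimpleNamespace

# Assumed value for illustration; the real constant lives in
# simulstream.server.speech_processors.SAMPLE_RATE.
SAMPLE_RATE = 16000

config = SimpleNamespace(
    # Hypothetical class path, used only as a placeholder.
    base_speech_processor_class="my_package.MySpeechProcessor",
    vad_threshold=0.6,
    vad_min_silence_duration_ms=250,
)

# Optional attributes fall back to the documented defaults.
vad_threshold = getattr(config, "vad_threshold", 0.5)
vad_min_silence_duration_ms = getattr(config, "vad_min_silence_duration_ms", 100)
vad_speech_pad_ms = getattr(config, "vad_speech_pad_ms", 30)
min_speech_size = getattr(config, "min_speech_size", 1)

# min_speech_size is expressed in seconds; convert it to a sample count.
min_speech_samples = min_speech_size * SAMPLE_RATE
```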

clear() → None

Clear internal states, such as history of cached audio and/or tokens, in preparation for a new stream or conversation.

end_of_stream() → IncrementalOutput

This method is called after the last audio chunk has been processed. It can be used to emit the final hypotheses and conclude the output.

Returns: The incremental output (new and deleted tokens/strings).
Return type: IncrementalOutput
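The call order implied by these methods (process_chunk for each chunk, then end_of_stream, then clear) can be sketched with stand-in classes. Everything below is hypothetical: FakeProcessor and FakeIncrementalOutput only mimic the documented method names, and the real IncrementalOutput fields may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeIncrementalOutput:
    # Hypothetical stand-in; the real IncrementalOutput may be shaped differently.
    new_tokens: List[str] = field(default_factory=list)
    deleted_tokens: List[str] = field(default_factory=list)


class FakeProcessor:
    """Hypothetical processor exposing the documented method names."""

    def __init__(self) -> None:
        self.buffered = 0  # number of samples seen in the current stream

    def process_chunk(self, waveform) -> FakeIncrementalOutput:
        self.buffered += len(waveform)
        return FakeIncrementalOutput()  # nothing emitted mid-stream here

    def end_of_stream(self) -> FakeIncrementalOutput:
        # Emit one final pseudo-token summarizing the stream.
        return FakeIncrementalOutput(new_tokens=[f"<{self.buffered} samples>"])

    def clear(self) -> None:
        self.buffered = 0


def run_stream(processor, chunks):
    """Feed every chunk, flush at the end, then reset for the next stream."""
    outputs = [processor.process_chunk(chunk) for chunk in chunks]
    outputs.append(processor.end_of_stream())
    processor.clear()
    return outputs
```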

classmethod load_model(config: SimpleNamespace)

Load and initialize the underlying speech model.

Parameters: config (SimpleNamespace) – Configuration of the speech processor.

process_chunk(waveform: float32) → IncrementalOutput

Process a chunk of waveform and produce incremental output.

Parameters: waveform (np.float32) – A 1D NumPy array of the audio chunk. The array is PCM audio normalized to the range [-1.0, 1.0] sampled at simulstream.server.speech_processors.SAMPLE_RATE.
Returns: The incremental output (new and deleted tokens/strings).
Return type: IncrementalOutput
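If the audio arrives as 16-bit integer PCM, it has to be converted to the expected format first. The helper below is a sketch of one common conversion; the name int16_to_float32 is hypothetical and not part of simulstream, and resampling to SAMPLE_RATE is not covered.

```python
import numpy as np


def int16_to_float32(pcm: np.ndarray) -> np.ndarray:
    """Scale 16-bit PCM samples to float32 values in [-1.0, 1.0]."""
    # Dividing by 32768 maps the int16 range [-32768, 32767] into [-1.0, 1.0).
    return pcm.astype(np.float32) / 32768.0


# Example: a 1D chunk of four int16 samples.
chunk = int16_to_float32(np.array([0, 16384, -32768, 32767], dtype=np.int16))
```

The resulting 1D float32 array has the shape and value range described for the waveform parameter.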

set_source_language(language: str) → None

Set the source language for the speech processor.

Parameters: language (str) – Language code (e.g., "en", "it").

set_target_language(language: str) → None

Set the target language for the speech processor (for translation).

Parameters: language (str) – Language code (e.g., "en", "it").

tokens_to_string(tokens: List[str]) → str

Converts token sequences into human-readable strings.

Returns: The textual representation of the tokens.
Return type: str