simulstream.server.speech_processors.vad_wrapper.VADWrapperSpeechProcessor
- class simulstream.server.speech_processors.vad_wrapper.VADWrapperSpeechProcessor(config: SimpleNamespace)
Bases: SpeechProcessor
A speech processor that integrates Voice Activity Detection (VAD) to filter and split continuous audio streams into meaningful speech chunks before processing them with an underlying speech processor.
This class wraps a SpeechProcessor implementation (defined in the configuration via the attribute base_speech_processor_class) with a Silero VAD-based iterator that detects the start and end of speech segments. Audio outside of speech is ignored, and each detected segment is passed to the underlying speech processor.
- Parameters:
config (SimpleNamespace) – Configuration object. The following attributes are used:
  - base_speech_processor_class (str): Full name of the underlying speech processor class to use.
  - vad_threshold (float, optional): VAD probability threshold. Default = 0.5.
  - vad_min_silence_duration_ms (int, optional): Minimum silence duration (milliseconds) to consider the end of a speech segment. Default = 100.
  - vad_speech_pad_ms (int, optional): Padding (milliseconds) to include before and after detected speech. Default = 30.
  - min_speech_size (int, optional): Minimum segment size in seconds; shorter segments are ignored. Default = 1.
  - Any additional attributes required by the subclass speech_processor_class.
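To illustrate how a probability threshold, a minimum silence duration, and a minimum segment size could interact, here is a toy sketch in plain Python. It is not the Silero VAD iterator and not this class's implementation: the function name split_segments, the per-frame probabilities, and the 10 ms frame length are assumptions made for illustration, and padding (vad_speech_pad_ms) is left out.

```python
def split_segments(probs, frame_ms=10, threshold=0.5,
                   min_silence_ms=100, min_speech_ms=1000):
    """Return (start_ms, end_ms) speech segments from per-frame speech probabilities.

    Hypothetical helper: frame_ms and the frame-level input are assumptions.
    """
    segments = []
    start = None       # index of the first frame of the open segment, if any
    silence_run = 0    # consecutive below-threshold frames inside the open segment
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            # Close the segment once the silence has lasted long enough.
            if silence_run * frame_ms >= min_silence_ms:
                segments.append((start, i - silence_run + 1))
                start, silence_run = None, 0
    if start is not None:
        # Stream ended while a segment was still open; trim trailing silence.
        segments.append((start, len(probs) - silence_run))
    # Drop segments shorter than the minimum size and convert frames to ms.
    return [(s * frame_ms, e * frame_ms) for s, e in segments
            if (e - s) * frame_ms >= min_speech_ms]


# 1.2 s of speech, 0.2 s of silence, a 0.3 s burst, then 0.2 s of silence.
probs = [0.9] * 120 + [0.1] * 20 + [0.9] * 30 + [0.1] * 20
segments = split_segments(probs)
```

With these inputs, `segments` is `[(0, 1200)]`: the first segment is closed after 100 ms of silence, and the 0.3 s burst is discarded because it is shorter than the one-second minimum.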
- __init__(config: SimpleNamespace)
Initialize the speech processor with a given configuration.
- Parameters:
config (SimpleNamespace) – Configuration loaded from a YAML file.
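For quick experiments, a configuration with the documented attributes can also be built directly in Python instead of being loaded from YAML. The sketch below uses only the standard library. The dotted path given for base_speech_processor_class is a placeholder, not a real class, and the numeric values simply repeat the documented defaults.

```python
from types import SimpleNamespace

config = SimpleNamespace(
    # Placeholder path (an assumption): replace with a real SpeechProcessor class.
    base_speech_processor_class="my_package.my_module.MySpeechProcessor",
    vad_threshold=0.5,                 # VAD probability threshold
    vad_min_silence_duration_ms=100,   # silence needed to end a segment
    vad_speech_pad_ms=30,              # padding around detected speech
    min_speech_size=1,                 # minimum segment size, in seconds
)

# Optional attributes can be read with a fallback equal to the documented default.
threshold = getattr(config, "vad_threshold", 0.5)
```

Any extra attributes needed by the underlying speech processor would be added to the same namespace.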
Methods
__init__(config) – Initialize the speech processor with a given configuration.
append_to_speech_buffer(audio_chunk)
clear() – Clear internal states, such as history of cached audio and/or tokens, in preparation for a new stream or conversation.
end_of_stream() – This method is called at the end of audio chunk processing.
load_model(config) – Load and initialize the underlying speech model.
process_chunk(waveform) – Process a chunk of waveform and produce incremental output.
set_source_language(language) – Set the source language for the speech processor.
set_target_language(language) – Set the target language for the speech processor (for translation).
speech_processor_class(config)
tokens_to_string(tokens) – Converts token sequences into human-readable strings.
Attributes
Return the size of the speech chunks to be processed (in seconds).
- clear() None
Clear internal states, such as history of cached audio and/or tokens, in preparation for a new stream or conversation.
- end_of_stream() IncrementalOutput
This method is called at the end of audio chunk processing. It can be used to emit hypotheses at the end of the speech to conclude the output.
- Returns:
The incremental output (new and deleted tokens/strings).
- Return type:
IncrementalOutput
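The sketch below shows one plausible call order for a single stream: process_chunk for each chunk, end_of_stream once at the end, then clear before the next stream. StubProcessor is a hypothetical stand-in, not the real class. It returns plain strings rather than IncrementalOutput objects, whose fields are not described here, and the ordering itself is an assumption based on the method descriptions.

```python
class StubProcessor:
    """Hypothetical stand-in that exposes the three documented method names."""

    def __init__(self):
        self.buffered = []  # sizes of the chunks seen in the current stream

    def process_chunk(self, waveform):
        self.buffered.append(len(waveform))
        return f"partial:{sum(self.buffered)}"

    def end_of_stream(self):
        return f"final:{sum(self.buffered)}"

    def clear(self):
        self.buffered = []


def run_stream(processor, chunks):
    """Feed every chunk, flush at the end of the stream, then reset state."""
    outputs = [processor.process_chunk(chunk) for chunk in chunks]
    outputs.append(processor.end_of_stream())
    processor.clear()
    return outputs


processor = StubProcessor()
outputs = run_stream(processor, [[0.0] * 4, [0.0] * 6])
```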
- classmethod load_model(config: SimpleNamespace)
Load and initialize the underlying speech model.
- Parameters:
config (SimpleNamespace) – Configuration of the speech processor.
- process_chunk(waveform: float32) IncrementalOutput
Process a chunk of waveform and produce incremental output.
- Parameters:
waveform (np.float32) – A 1D NumPy array of the audio chunk. The array is PCM audio normalized to the range [-1.0, 1.0], sampled at simulstream.server.speech_processors.SAMPLE_RATE.
- Returns:
The incremental output (new and deleted tokens/strings).
- Return type:
IncrementalOutput
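As a minimal sketch of preparing input in the expected format, the snippet below converts 16-bit integer PCM samples into a 1D float32 array in the range [-1.0, 1.0]. The helper name pcm16_to_float32 is an assumption for illustration. The snippet does not resample, so the audio is assumed to be at SAMPLE_RATE already.

```python
import numpy as np


def pcm16_to_float32(pcm: np.ndarray) -> np.ndarray:
    """Scale int16 PCM samples to float32 values in [-1.0, 1.0]."""
    # int16 spans [-32768, 32767], so dividing by 32768 keeps values in range.
    return (pcm.astype(np.float32) / 32768.0).clip(-1.0, 1.0)


pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
waveform = pcm16_to_float32(pcm)
```

The resulting `waveform` has dtype float32 and one dimension, which matches the documented input of process_chunk.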
- set_source_language(language: str) None
Set the source language for the speech processor.
- Parameters:
language (str) – Language code (e.g., "en", "it").
- set_target_language(language: str) None
Set the target language for the speech processor (for translation).
- Parameters:
language (str) – Language code (e.g., "en", "it").