simulstream.server.speech_processors.vad_wrapper
Classes
A speech processor that integrates Voice Activity Detection (VAD) to filter and split continuous audio streams into meaningful speech chunks before processing them with an underlying speech processor.
- class simulstream.server.speech_processors.vad_wrapper.VADWrapperSpeechProcessor(config: SimpleNamespace)
A speech processor that integrates Voice Activity Detection (VAD) to filter and split continuous audio streams into meaningful speech chunks before processing them with an underlying speech processor.
This class wraps a SpeechProcessor implementation (defined in the configuration via the attribute base_speech_processor_class) with a Silero VAD-based iterator that detects the start and end of speech segments. Audio outside of speech is ignored, and each detected segment is passed to the underlying speech processor.
- Parameters:
config (SimpleNamespace) –
Configuration object. The following attributes are used:
base_speech_processor_class (str): full name of the underlying speech processor class to use.
vad_threshold (float, optional): VAD probability threshold. Default = 0.5.
vad_min_silence_duration_ms (int, optional): Minimum silence duration (milliseconds) to consider the end of a speech segment. Default = 100.
vad_speech_pad_ms (int, optional): Padding (milliseconds) to include before and after detected speech. Default = 30.
min_speech_size (int, optional): Minimum segment size in seconds; shorter segments are ignored. Default = 1.
Any additional attributes required by the subclass speech_processor_class.
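As an illustration, a configuration for this wrapper could be assembled as sketched below. This is a minimal sketch under stated assumptions: the class path `my_package.MySpeechProcessor` is a placeholder rather than a real simulstream class, and the helper `with_vad_defaults` is hypothetical. It only mirrors the defaults documented above and is not part of the library.

```python
from types import SimpleNamespace

# Defaults as documented above for the VAD-related attributes.
VAD_DEFAULTS = {
    "vad_threshold": 0.5,
    "vad_min_silence_duration_ms": 100,
    "vad_speech_pad_ms": 30,
    "min_speech_size": 1,
}


def with_vad_defaults(config: SimpleNamespace) -> SimpleNamespace:
    """Hypothetical helper: copy ``config`` and fill in missing VAD attributes."""
    values = dict(VAD_DEFAULTS)
    values.update(vars(config))  # explicitly set attributes win over defaults
    return SimpleNamespace(**values)


config = with_vad_defaults(SimpleNamespace(
    base_speech_processor_class="my_package.MySpeechProcessor",  # placeholder name
    vad_min_silence_duration_ms=300,  # wait longer before closing a segment
))
```

The resulting namespace would then be passed to `VADWrapperSpeechProcessor(config)`, together with whatever attributes the underlying speech processor requires.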
- clear() → None
Clear internal states, such as history of cached audio and/or tokens, in preparation for a new stream or conversation.
- end_of_stream() → IncrementalOutput
This method is called once all audio chunks of a stream have been processed. It can be used to emit the final hypotheses that conclude the output.
- Returns:
The incremental output (new and deleted tokens/strings).
- Return type:
IncrementalOutput
- classmethod load_model(config: SimpleNamespace)
Load and initialize the underlying speech model.
- Parameters:
config (SimpleNamespace) – Configuration of the speech processor.
- process_chunk(waveform: float32) → IncrementalOutput
Process a chunk of waveform and produce incremental output.
- Parameters:
waveform (np.float32) – A 1D NumPy array of the audio chunk. The array is PCM audio normalized to the range [-1.0, 1.0] sampled at simulstream.server.speech_processors.SAMPLE_RATE.
- Returns:
The incremental output (new and deleted tokens/strings).
- Return type:
IncrementalOutput
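The input format expected by process_chunk can be illustrated with a short sketch that converts 16-bit integer PCM to normalized float32 and splits a waveform into fixed-size chunks. The helper names `pcm16_to_float32` and `split_into_chunks` are hypothetical, and the value 16000 is an assumption used only to keep the example self-contained; in practice the rate should be read from simulstream.server.speech_processors.SAMPLE_RATE.

```python
import numpy as np

# Assumption for this sketch only; use the library's SAMPLE_RATE in real code.
SAMPLE_RATE = 16000


def pcm16_to_float32(pcm: np.ndarray) -> np.ndarray:
    """Convert 16-bit integer PCM samples to float32 within [-1.0, 1.0]."""
    return pcm.astype(np.float32) / 32768.0


def split_into_chunks(waveform: np.ndarray, chunk_seconds: float):
    """Yield consecutive 1D chunks of at most ``chunk_seconds`` seconds."""
    step = max(1, int(chunk_seconds * SAMPLE_RATE))
    for start in range(0, len(waveform), step):
        yield waveform[start:start + step]


pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
waveform = pcm16_to_float32(pcm)

# One second of silence split into 250 ms chunks.
chunks = list(split_into_chunks(np.zeros(SAMPLE_RATE, dtype=np.float32), 0.25))
```

Each element of `chunks` is a 1D float32 array of the kind that could be passed to process_chunk.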
- set_source_language(language: str) → None
Set the source language for the speech processor.
- Parameters:
language (str) – Language code (e.g., "en", "it").
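Taken together, the methods above suggest a per-stream lifecycle, sketched here with a stand-in object. `run_stream` and `RecordingProcessor` are hypothetical names introduced for illustration, and the call order (clear, then set_source_language, then the chunks, then end_of_stream) is an assumption about typical use rather than documented server behaviour.

```python
from typing import Any, Iterable, List


def run_stream(processor: Any, chunks: Iterable[Any], language: str) -> List[Any]:
    """Feed ``chunks`` to ``processor`` and collect every incremental output."""
    processor.clear()  # reset state left over from any previous stream
    processor.set_source_language(language)
    outputs = [processor.process_chunk(chunk) for chunk in chunks]
    outputs.append(processor.end_of_stream())  # emit the concluding output
    return outputs


class RecordingProcessor:
    """Hypothetical stand-in that only records which methods were called."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def clear(self) -> None:
        self.calls.append("clear")

    def set_source_language(self, language: str) -> None:
        self.calls.append(f"set_source_language:{language}")

    def process_chunk(self, waveform: Any) -> str:
        self.calls.append("process_chunk")
        return "partial"

    def end_of_stream(self) -> str:
        self.calls.append("end_of_stream")
        return "final"


stub = RecordingProcessor()
outputs = run_stream(stub, chunks=[[0.0], [0.0]], language="en")
```

With a real VADWrapperSpeechProcessor in place of the stand-in, the returned items would be IncrementalOutput objects instead of strings.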