speech activity detection with Silero VAD

Praat uses the whisper.cpp implementation of the Silero VAD speech activity detector. The pre-trained Silero VAD model weights have been converted to ggml format and compiled into Praat, so no external model files are required. The sound is automatically resampled to 16 kHz (the sampling frequency expected by the Silero VAD model) before being processed by the model.

Purpose

To detect which parts of a sound contain speech. The output is a list of speech segments, each defined by a start time and an end time.

Algorithm

The Silero VAD model processes the sound in fixed frames of 512 samples (32 ms, since the sound is resampled to 16 kHz). For each frame, it outputs a probability that the frame contains speech. Based on this output, the list of speech segments is constructed as described below. This description reflects the whisper.cpp implementation, which differs slightly from Silero’s original implementation.
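The frame arithmetic above can be illustrated with a short Python sketch. The names below are illustrative and not part of Praat or whisper.cpp. The sketch assumes that only complete frames are counted; how a trailing partial frame is handled is not described here.

```python
SAMPLE_RATE_HZ = 16_000  # rate the sound is resampled to before processing
FRAME_SAMPLES = 512      # fixed frame size used by the model


def frame_duration_s() -> float:
    """Duration of one frame in seconds (512 / 16000 = 0.032)."""
    return FRAME_SAMPLES / SAMPLE_RATE_HZ


def frame_start_times(n_samples: int) -> list[float]:
    """Start time (s) of each complete frame in a 16 kHz sound of n_samples.

    Assumption: a trailing partial frame is ignored.
    """
    n_frames = n_samples // FRAME_SAMPLES
    return [i * FRAME_SAMPLES / SAMPLE_RATE_HZ for i in range(n_frames)]
```

For example, one second of sound (16000 samples) contains 31 complete frames, and the model yields one speech probability per frame.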

A speech segment begins at the first frame whose speech probability exceeds the Speech probability threshold. Once a segment has begun, it continues as long as the speech probability stays above a lower threshold (called the “negative threshold” in the Silero source code, equal to the speech probability threshold minus 0.15). A brief dip below the negative threshold does not end the segment: the segment ends only when the probability stays below the negative threshold for at least Min. gap between speech segments. A segment that has ended is kept only if it is longer than Min. speech segment.
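The two-threshold rule described above can be sketched in Python as follows. This is a minimal illustration of the description on this page, not the whisper.cpp code. The function name is hypothetical, and the sketch makes several assumptions: a segment's end is placed at the start of the low-probability run that closes it, a segment still open at the end of the sound is closed at the last frame, and frame times are computed as multiples of 32 ms.

```python
FRAME_S = 512 / 16_000  # frame duration in seconds (32 ms)


def detect_segments(probs, threshold=0.5, min_gap_s=0.1, min_speech_s=0.25):
    """Turn per-frame speech probabilities into (start, end) times in seconds."""
    neg_threshold = threshold - 0.15  # the "negative threshold"
    segments = []
    start = None      # start time of the currently open segment, if any
    low_since = None  # time at which the current low-probability run began
    for i, p in enumerate(probs):
        t = i * FRAME_S
        if start is None:
            if p > threshold:  # a segment opens only above the upper threshold
                start = t
                low_since = None
        elif p < neg_threshold:
            if low_since is None:
                low_since = t
            # Close the segment once the low run has lasted at least min_gap_s.
            if t + FRAME_S - low_since >= min_gap_s:
                if low_since - start > min_speech_s:
                    segments.append((start, low_since))
                start = None
                low_since = None
        else:
            low_since = None  # probability recovered: the dip was too short
    if start is not None:  # assumption: close an open segment at the end
        end = low_since if low_since is not None else len(probs) * FRAME_S
        if end - start > min_speech_s:
            segments.append((start, end))
    return segments
```

With the standard values, a probability of 0.4 does not open a segment (it is below 0.5), but it does keep an open segment alive (it is above 0.35).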

After all segments have been formed, segments separated by a gap shorter than 0.2 s are merged (this is hardcoded in whisper.cpp and not configurable). After that, padding is applied: each segment is extended on both sides by Padding around speech segments. If padding would cause two segments to overlap, they instead meet at the midpoint of the gap between them.
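The merging and padding steps can be sketched in Python as follows. The function names are hypothetical, and the sketch follows the description on this page rather than the whisper.cpp source. Clamping the first segment's start to 0 s is an assumption. Segments are assumed to be sorted by time and non-overlapping.

```python
def merge_close(segments, max_gap_s=0.2):
    """Merge neighbouring segments separated by a gap shorter than max_gap_s."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < max_gap_s:
            merged[-1] = (merged[-1][0], end)  # extend the previous segment
        else:
            merged.append((start, end))
    return merged


def pad_segments(segments, padding_s=0.03):
    """Extend each segment by padding_s on both sides.

    Where padding would make two neighbours overlap, both are cut at the
    midpoint of the gap between them.
    """
    padded = []
    last = len(segments) - 1
    for k, (start, end) in enumerate(segments):
        new_start = max(0.0, start - padding_s)  # assumption: clamp at 0 s
        new_end = end + padding_s
        if k > 0:
            gap = start - segments[k - 1][1]
            if gap < 2 * padding_s:
                new_start = segments[k - 1][1] + gap / 2
        if k < last:
            gap = segments[k + 1][0] - end
            if gap < 2 * padding_s:
                new_end = end + gap / 2
        padded.append((new_start, new_end))
    return padded
```

Because merging leaves no gap shorter than 0.2 s, two padded segments can only meet at a midpoint when the padding exceeds 0.1 s; with the standard padding of 0.03 s every segment is simply extended.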

The result is a list of speech segments.

Settings

Speech probability threshold (0-1) (standard value: 0.5): determines the sensitivity of the speech detector. Higher values make the detector less sensitive, meaning that a frame requires a higher speech probability to be considered part of a speech segment. This reduces false positives (non-speech incorrectly classified as speech), but may cause some speech to be missed. Lower values make the detector more sensitive. The default of 0.5 works well for most use cases.
Min. gap between speech segments (s) (standard value: 0.1): the minimum duration of a gap between two speech segments. You might want to increase this value if short silences within speech (e.g. plosive closures) are splitting speech into multiple segments. Note that gaps shorter than 0.2 s are removed from the output (with their adjacent speech segments merged), regardless of this setting.
Min. speech segment (s) (standard value: 0.25): the minimum duration of a speech segment. Shorter segments are discarded.
Padding around speech segments (s) (standard value: 0.03): extends each detected speech segment by this amount on both sides. You might want to increase this value if speech onsets and offsets are being clipped.
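To show how the settings relate to one another, the following Python sketch collects them with their standard values, derives the negative threshold, and expresses a duration in 32 ms frames. The class and its field names are hypothetical and are not Praat or whisper.cpp identifiers. The validation rules are assumptions based on the ranges given above.

```python
from dataclasses import dataclass

FRAME_S = 512 / 16_000  # frame duration in seconds (32 ms)


@dataclass(frozen=True)
class VadSettings:
    """Hypothetical container for the four settings, with standard values."""
    speech_probability_threshold: float = 0.5
    min_gap_s: float = 0.1
    min_speech_segment_s: float = 0.25
    padding_s: float = 0.03

    def __post_init__(self):
        # Assumed validation: threshold within 0-1, durations non-negative.
        if not 0.0 <= self.speech_probability_threshold <= 1.0:
            raise ValueError("threshold must lie between 0 and 1")
        for name in ("min_gap_s", "min_speech_segment_s", "padding_s"):
            if getattr(self, name) < 0.0:
                raise ValueError(f"{name} must not be negative")

    @property
    def negative_threshold(self) -> float:
        """Lower threshold: the speech probability threshold minus 0.15."""
        return self.speech_probability_threshold - 0.15

    @staticmethod
    def in_frames(seconds: float) -> float:
        """Express a duration as a (possibly fractional) number of frames."""
        return seconds / FRAME_S
```

With the standard values, the negative threshold is 0.35, and the minimum gap of 0.1 s corresponds to 3.125 frames. For thresholds below 0.15 the formula yields a negative value; how that case is handled is not described here.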

Availability in Praat

Silero VAD speech activity detection is available in Praat:

• as part of transcription, running just before it to remove non-speech regions from the analysed sound (see transcription with whisper.cpp);

• standalone, producing a new TextGrid with non-speech and speech intervals (see Sound: To TextGrid (speech activity, Silero)...).
