Praat uses the whisper.cpp implementation of the Silero VAD speech activity detector. The pre-trained Silero VAD model weights have been converted to ggml format and compiled into Praat, so no external model files are required. The sound is automatically resampled to 16 kHz (the sampling frequency expected by the Silero VAD model) before being processed by the model.
Silero VAD is used to detect which parts of a sound contain speech. The output is a list of speech segments, each defined by a start and an end time.
The Silero VAD model processes the sound in fixed frames of 512 samples (32 ms, since the sound is resampled to 16 kHz). For each frame, it outputs a probability that the frame contains speech. Based on this output, the list of speech segments is constructed as described below. This description reflects the whisper.cpp implementation, which differs slightly from Silero’s original implementation.
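The frame arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and counting only complete frames is an assumption, since the text does not say how a trailing partial frame is handled.

```python
SAMPLE_RATE_HZ = 16_000  # rate the sound is resampled to before detection
FRAME_SAMPLES = 512      # samples per analysis frame


def frame_duration(sample_rate_hz=SAMPLE_RATE_HZ, frame_samples=FRAME_SAMPLES):
    """Duration of one frame in seconds."""
    return frame_samples / sample_rate_hz


def frame_start_time(index, sample_rate_hz=SAMPLE_RATE_HZ, frame_samples=FRAME_SAMPLES):
    """Start time in seconds of the frame with the given zero-based index."""
    return index * frame_samples / sample_rate_hz


def complete_frame_count(duration_s, sample_rate_hz=SAMPLE_RATE_HZ, frame_samples=FRAME_SAMPLES):
    """Number of complete frames in a sound of the given duration.

    Assumption: a trailing partial frame is ignored here; the actual
    implementation may treat it differently.
    """
    total_samples = round(duration_s * sample_rate_hz)
    return total_samples // frame_samples


print(frame_duration())           # 0.032 s, i.e. 32 ms
print(complete_frame_count(1.0))  # 31 complete frames in one second
```

One consequence is that segment boundaries computed from frame probabilities have a time resolution of about 32 ms.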
A speech segment begins at the first frame whose speech probability exceeds the Speech probability threshold. The segment continues as long as the speech probability stays above a lower threshold (called the “negative threshold” in the Silero source code, equal to the speech probability threshold minus 0.15). The segment ends only when the probability stays below the negative threshold for at least Min. gap between speech segments. When a segment ends, it is kept only if it is longer than Min. speech segment.
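The two-threshold rule described above can be sketched in Python as follows. This is a simplified illustration, not the whisper.cpp code: the function name, the default values, placing the segment end at the start of the closing silence, and the handling of a segment still open at the end of the sound are all assumptions.

```python
def detect_segments(probs, threshold=0.5, min_gap_s=0.1, min_speech_s=0.25, frame_s=0.032):
    """Turn per-frame speech probabilities into (start, end) times in seconds.

    All default values are hypothetical placeholders, not Praat's defaults.
    """
    neg_threshold = threshold - 0.15  # the lower ("negative") threshold
    segments = []
    start = None          # start time of the currently open segment, if any
    silence_start = None  # start time of the current run of low-probability frames

    for i, p in enumerate(probs):
        t = i * frame_s
        if start is None:
            if p > threshold:  # a segment opens only above the upper threshold
                start = t
                silence_start = None
            continue
        if p < neg_threshold:
            if silence_start is None:
                silence_start = t
            # Close the segment once the low run has lasted at least min_gap_s.
            if t + frame_s - silence_start >= min_gap_s:
                if silence_start - start > min_speech_s:
                    segments.append((start, silence_start))
                start = None
                silence_start = None
        else:
            silence_start = None  # probability recovered: the segment continues

    # Assumption: a segment still open at the end of the sound is closed there.
    if start is not None:
        end = silence_start if silence_start is not None else len(probs) * frame_s
        if end - start > min_speech_s:
            segments.append((start, end))
    return segments


probs = [0.1] * 3 + [0.9] * 10 + [0.4] * 2 + [0.9] * 5 + [0.1] * 10
print(detect_segments(probs, min_speech_s=0.05))
```

In this example the two frames at 0.4 lie between the two thresholds, so they do not interrupt the segment; only the final run of frames at 0.1 closes it.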
After all segments have been formed, segments separated by a gap shorter than 0.2 s are merged (this is hardcoded in whisper.cpp and not configurable). After that, padding is applied: each segment is extended on both sides by Padding around speech segments. If padding would cause two segments to overlap, they instead meet at the midpoint of the gap between them.
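The merging and padding steps can be illustrated with the following Python sketch. The function names are hypothetical, and clamping the padded boundaries to the start and end of the sound is an assumption that the text above does not state.

```python
def merge_close_segments(segments, max_gap_s=0.2):
    """Merge neighbouring (start, end) segments separated by less than max_gap_s."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < max_gap_s:
            merged[-1] = (merged[-1][0], end)  # extend the previous segment
        else:
            merged.append((start, end))
    return merged


def pad_segments(segments, padding_s, total_duration_s):
    """Extend each segment by padding_s on both sides.

    Where padding would make two neighbours overlap, both boundaries are
    placed at the midpoint of the gap between them instead.
    """
    padded = []
    for i, (start, end) in enumerate(segments):
        # Assumption: padded boundaries are clamped to the sound's time range.
        new_start = max(0.0, start - padding_s)
        new_end = min(total_duration_s, end + padding_s)
        if i > 0:
            prev_end = segments[i - 1][1]
            if start - prev_end < 2 * padding_s:
                new_start = (prev_end + start) / 2
        if i + 1 < len(segments):
            next_start = segments[i + 1][0]
            if next_start - end < 2 * padding_s:
                new_end = (end + next_start) / 2
        padded.append((new_start, new_end))
    return padded


raw = [(1.0, 2.0), (2.1, 3.0), (4.0, 5.0)]
merged = merge_close_segments(raw)  # the 0.1 s gap is closed
print(pad_segments(merged, 0.3, 6.0))
print(pad_segments(merged, 0.6, 6.0))
```

With a padding of 0.3 s the remaining 1 s gap is wide enough for both segments to be padded in full. With a padding of 0.6 s the padded segments would overlap, so they meet at 3.5 s, the midpoint of the gap.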
The result is a list of speech segments.
Silero VAD speech activity detection is available in Praat:
© Anastasia Shchupak 2026-06-01