Sound: To TextGrid (speech activity, Silero)...
A command that creates a TextGrid for the selected Sound with one tier, in which the non-speech and speech intervals are marked. The discrimination between the two is based on the Silero VAD neural network model.
Speech activity detection, sometimes referred to as voice activity detection (VAD), is a method to discriminate speech from non-speech segments in audio. Unlike the spectral-flatness-based method in Sound: To TextGrid (speech activity, LTSF)..., this command uses a deep neural network (Silero VAD) that was trained on a large dataset of speech and non-speech audio.
The Silero VAD model is loaded from data compiled into Praat, so no external model files are required. The sound is internally resampled to 16 kHz before being processed by the model.
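As a rough illustration of the resampling step, the sketch below converts a signal to 16 kHz by linear interpolation. This is purely illustrative: the function name is invented here, and Praat's internal resampler uses a higher-quality method than linear interpolation.

```python
def resample_linear(samples, src_rate, dst_rate=16000):
    """Illustrative resampler: linear interpolation to dst_rate.

    Praat's actual resampling is of higher quality; this only shows
    the idea of mapping the signal onto a 16 kHz sample grid.
    """
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        # Position of output sample i, expressed in input samples.
        t = i * src_rate / dst_rate
        j = int(t)
        frac = t - j
        right = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + right * frac)
    return out

one_second = [0.0] * 44100          # one second of silence at 44.1 kHz
print(len(resample_linear(one_second, 44100)))  # 16000
```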
Settings
- Speech probability threshold (0 - 1)
- determines the sensitivity of the speech detector. Higher values make the detector less sensitive: stronger evidence of speech is then required before a segment is marked as speech, which reduces false positives (non-speech incorrectly labelled as speech) but may cause some speech to be missed. Lower values make the detector more sensitive. The default value of 0.5 works well for most recordings.
- Min. non-speech interval (s)
- determines the minimum duration for a gap between speech intervals to be considered non-speech. Shorter gaps will be merged with the surrounding speech. If you find that brief pauses (such as plosive closures) are splitting speech intervals and you don't want this to happen, increase this value.
- Min. speech interval (s)
- determines the minimum duration for an interval to be marked as speech. Shorter intervals are discarded. This helps filter out very short bursts of noise that might otherwise be detected as speech.
- Padding added around each speech segment (s)
- determines how much silence (or noise) is included before and after each detected speech segment. Adding a small amount of padding ensures that speech onsets and offsets are not clipped.
- Non-speech interval label
- the label assigned to intervals classified as non-speech in the resulting TextGrid.
- Speech interval label
- the label assigned to intervals classified as speech in the resulting TextGrid.
Algorithm
The Silero VAD model processes the audio in small frames and outputs a probability that each frame contains speech. Consecutive frames with speech probability above the threshold are grouped into speech segments, subject to the minimum duration and padding constraints. All other parts of the audio are marked as non-speech.
If no speech is detected anywhere in the sound, the entire duration is marked with a single non-speech interval.
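The grouping described above can be sketched as follows. The frame duration, function name, and the exact order in which the merging, minimum-duration, and padding rules are applied are illustrative assumptions, not Praat's actual implementation:

```python
FRAME = 0.032  # assumed frame duration in seconds (illustrative)

def detect_speech(probs, threshold=0.5, min_nonspeech=0.1,
                  min_speech=0.25, padding=0.05, frame=FRAME):
    """Group per-frame speech probabilities into speech intervals."""
    # 1. Collect runs of consecutive frames above the threshold.
    segs, start = [], None
    for i, p in enumerate(probs):
        if p > threshold and start is None:
            start = i * frame
        elif p <= threshold and start is not None:
            segs.append((start, i * frame))
            start = None
    if start is not None:
        segs.append((start, len(probs) * frame))
    # 2. Merge segments separated by gaps shorter than min_nonspeech.
    merged = []
    for s in segs:
        if merged and s[0] - merged[-1][1] < min_nonspeech:
            merged[-1] = (merged[-1][0], s[1])
        else:
            merged.append(s)
    # 3. Drop segments shorter than min_speech, then pad the edges,
    #    clamping to the duration of the sound.
    total = len(probs) * frame
    return [(max(0.0, a - padding), min(total, b + padding))
            for a, b in merged if b - a >= min_speech]

# Ten low-probability frames, twelve high, then ten low:
probs = [0.1] * 10 + [0.9] * 12 + [0.1] * 10
segs = detect_speech(probs)
print([(round(a, 3), round(b, 3)) for a, b in segs])  # → [(0.27, 0.754)]
```

Everything outside the returned intervals would be marked as non-speech; an empty result corresponds to a single non-speech interval spanning the whole sound.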
© Anastasia Shchupak 2026-03-15