TextGrid & Sound: Transcribe interval

Transcribes the audio of the selected Sound that corresponds to a specific interval of the selected TextGrid, using the whisper.cpp engine, and writes the transcription result into the TextGrid.

This command extracts the sound corresponding to the selected interval, runs speech recognition on it, and splits the interval into sentence-level sub-intervals with the recognized text as labels. Optionally, a word-level tier is also created.

The original interval is split into multiple intervals, one per recognized sentence. Sentence boundaries are determined by terminal punctuation (periods, exclamation marks, question marks). If Include words is on, word-level alignment is also performed: a new word tier is created if one does not already exist, and this tier then contains one interval per recognized word, with boundaries derived from Whisper's token-level timestamps produced using Dynamic Time Warping (DTW).

Settings

Tier number
the number of the interval tier in which the interval to be transcribed resides.
Interval number
the number of the interval within the tier to transcribe.
Include words
if on, a new word-level tier is created (or reused if it already exists) directly below the selected tier. This tier is named by appending /word to the name of the selected tier. Then, each recognized word gets its own interval in the word tier, with word-level boundaries computed using whisper.cpp's internal DTW.
Allow silences
if on, Silero VAD (Voice Activity Detection) is used to detect and skip non-speech portions of the audio before recognition. This generally improves both speed and accuracy. Speed is improved by making the audio for recognition shorter, and accuracy by preventing Whisper from hallucinating text for silent segments.
Whisper model
determines which Whisper model to use for recognition. The list is populated with the .bin files found in the whispercpp subfolder of the models folder in the Praat preferences folder. Models that contain .en in their name are English-only; all other models are multilingual.
Language
determines the language to be recognized. Choose Autodetect language to let the model detect the language automatically. If you know the language of the audio, selecting it explicitly may improve recognition accuracy. Note that English-only models (those with .en in the name) can only be used with Autodetect language or English.
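From a Praat script, the command can be applied to a selected Sound and TextGrid pair. The sketch below is a hypothetical invocation: the object names are examples, and the argument order is assumed to follow the settings listed above (tier number, interval number, include words, allow silences, model, language); verify the exact form against the command history in your Praat version.

```praat
# Hypothetical example; argument order assumed to match the settings above.
selectObject: "Sound speech", "TextGrid speech"
Transcribe interval: 1, 3, 1, 1, "medium", "English"
```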

Installing Whisper models

Before you can use this command, you need to install one or more Whisper model files (in GGML format, with extension .bin) into the subfolder whispercpp of the folder models in the Praat preferences folder.

Whisper models come in several sizes, each offering a different trade-off between speed and accuracy. Model names that contain .en are English-only models. All other models are multilingual. Available model sizes are: tiny, base, small, medium, large-v1, large-v2, large-v3, and large-v3-turbo (also known as turbo). Larger models are more accurate but require more memory and processing time.

Model files can be obtained from the Hugging Face repository at https://huggingface.co/ggerganov/whisper.cpp/tree/main.
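To locate the right folder from within Praat, you can use the built-in preferencesDirectory$ variable. The following sketch assumes the folder layout described above and only creates the folder and reports its path; the model files themselves must still be downloaded manually from the Hugging Face page.

```praat
# Create the model folder inside the Praat preferences folder
# (layout as described above: models/whispercpp).
folder$ = preferencesDirectory$ + "/models/whispercpp"
createFolder: folder$
writeInfoLine: "Place your .bin model files in: ", folder$
```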


© Anastasia Shchupak 2026-03-15