transcription with whisper.cpp

Praat can automatically transcribe a sound using whisper.cpp. This requires at least one Whisper model to be installed on your computer; the Speech recognition tutorial explains how to install models and how to run transcription. Before transcription, the sound is automatically resampled to 16 kHz, the sampling frequency expected by whisper.cpp. This page documents the transcription settings.
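To illustrate what resampling to 16 kHz involves, here is a minimal Python sketch that converts a mono signal by linear interpolation. It is not the code used by Praat or whisper.cpp: the function name resample_linear and its parameters are hypothetical, and a production resampler would also apply a low-pass filter before downsampling to avoid aliasing.

```python
def resample_linear(samples, source_rate, target_rate=16000):
    """Resample a mono signal by linear interpolation (illustration only).

    samples      list of floats, one per sample
    source_rate  original sampling frequency in Hz
    target_rate  desired sampling frequency in Hz (16000 for whisper.cpp)
    """
    if source_rate <= 0 or target_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if not samples:
        return []
    # Number of output samples that covers the same duration.
    n_out = max(1, round(len(samples) * target_rate / source_rate))
    # Distance between output samples, measured in input samples.
    step = source_rate / target_rate
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Weighted average of the two neighbouring input samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out
```

With this sketch, a one-second sound sampled at 44100 Hz comes out with 16000 samples.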

Behaviour

When transcription is run on a TextGrid interval, it modifies the TextGrid: intervals are split, and tiers may be added or renamed. The resulting TextGrid structure depends on the combination of the Include words and Include diarization settings. See the Speech recognition tutorial for the transcription output under each combination of these settings.

Settings

Whisper model: determines which Whisper model is used. The list is populated with the .bin files found in the whispercpp subfolder of the models folder in the Praat preferences folder. See the Speech recognition tutorial for details on how to install models.
Language (standard value: Autodetect language): determines the language to be used for transcription. Choose Autodetect language to let the model detect the language automatically. If you know the language you want to use for transcription, selecting it explicitly may improve transcription accuracy. Note that English-only models (those with .en in the name) can only be used with Autodetect language or English.
Include words (standard: on): if on, each transcribed word is given a start and an end time, computed using whisper.cpp’s internal dynamic time warping (DTW) algorithm.
Detect non-speech (standard: on): if on, speech activity detection with Silero VAD runs before transcription to identify speech regions. Only those regions are then passed to the Whisper model. This generally improves both speed and accuracy of transcription. Speed is improved by reducing the length of the sound sent to the model, and accuracy by preventing the model from hallucinating text for silent regions.
Speech probability threshold (0-1) (standard value: 0.5): see speech activity detection with Silero VAD.
Min. gap between speech segments (s) (standard value: 0.1): see speech activity detection with Silero VAD.
Min. speech segment (s) (standard value: 0.25): see speech activity detection with Silero VAD.
Padding around speech segments (s) (standard value: 0.03): see speech activity detection with Silero VAD.
Include diarization (standard: off): if on, speaker diarization is run alongside transcription (see speaker diarization with adapted pyannote.audio). The results of both are later combined to attribute portions of transcribed speech to different speakers.
Max. number of speakers (≥ 2) (standard value: 2): see speaker diarization with adapted pyannote.audio.
Allow speakers to overlap (standard: on): see speaker diarization with adapted pyannote.audio.
Clustering threshold (0-2) (standard value: 0.7): see speaker diarization with adapted pyannote.audio.
Segmentation step (0-1) (standard value: 0.1): see speaker diarization with adapted pyannote.audio.
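
The four speech-detection settings above (speech probability threshold, minimum gap, minimum speech segment and padding) are documented in detail on the Silero VAD page. As a rough illustration of how parameters of this kind usually interact, the following Python sketch turns per-frame speech probabilities into speech segments. It is an assumption about the general technique, not Praat's implementation: the function name speech_segments, the frame_dur argument and the order of the steps are all hypothetical.

```python
def speech_segments(probs, frame_dur, threshold=0.5, min_gap=0.1,
                    min_speech=0.25, padding=0.03):
    """Turn per-frame speech probabilities into (start, end) times in seconds.

    probs      list of speech probabilities, one per analysis frame
    frame_dur  duration of one frame in seconds (hypothetical parameter)
    """
    total = len(probs) * frame_dur

    # 1. Collect runs of consecutive frames at or above the threshold.
    raw = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            raw.append((start * frame_dur, i * frame_dur))
            start = None
    if start is not None:
        raw.append((start * frame_dur, total))

    # 2. Merge runs separated by a gap shorter than min_gap.
    merged = []
    for seg in raw:
        if merged and seg[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)

    # 3. Drop segments shorter than min_speech.
    kept = [s for s in merged if s[1] - s[0] >= min_speech]

    # 4. Pad each segment, without leaving the bounds of the sound.
    return [(max(0.0, s - padding), min(total, e + padding)) for s, e in kept]
```

In this sketch, a higher threshold or a longer minimum speech segment discards more of the sound, while a longer minimum gap or more padding keeps more of it.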

Availability in Praat

Transcription with whisper.cpp is available in two ways in Praat:

• by selecting a Sound together with its TextGrid and choosing Transcribe interval...;

• by choosing Transcribe interval from the Interval menu in the TextGridEditor. The settings for this command are set via Transcription settings... in the same menu and are remembered across Praat sessions.
