Transcribing audio and labeling speakers
The annotation workspace can transcribe the audio track of any video
and, optionally, attach a speaker label to every transcript segment.
This is the standalone path used by the Transcribe Audio toolbar
button on AnnotationWorkspace, distinct from the audio path the
summary pipeline drives internally (covered in
Guide > Audio transcription).
What runs end to end
- The user clicks Transcribe Audio on the workspace toolbar.
- The frontend calls POST /api/videos/:videoId/transcribe on the backend with the requested language and diarization options.
- The backend resolves the video path, then calls the model service:
  - POST /api/transcribe for the transcript itself (default loader: faster-whisper).
  - POST /api/diarize for speaker turns, only when enableDiarization=true (default loader: pyannote 3.1).
- The backend merges the two segment streams by per-second overlap, so every transcript segment carries the speaker who was talking the longest during it.
- The TranscriptPanel UI renders one row per segment with a color-coded speaker chip, a click-to-seek timestamp, and an active-segment highlight that follows playback.
Diarization failures are non-fatal. If the diarizer errors out, the backend logs the failure and returns the plain transcript so the user still sees something useful.
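The merge step described above can be sketched as follows. This is a minimal illustration, not the backend's actual code: the type names, `overlapSeconds`, and `assignSpeakers` are hypothetical, and the tie-breaking rule is an assumption. It sums, per speaker, the seconds each diarization turn shares with a transcript segment and keeps the speaker with the largest total.

```typescript
// Hypothetical shapes: field names follow the response example in this guide,
// but the backend's real types may differ.
interface TranscriptSegment {
  start: number; // seconds
  end: number; // seconds
  text: string;
}

interface SpeakerTurn {
  start: number; // seconds
  end: number; // seconds
  speaker: string; // e.g. "SPEAKER_00"
}

interface LabeledSegment extends TranscriptSegment {
  speaker: string | null;
}

// Seconds shared by two intervals; 0 when they do not overlap.
function overlapSeconds(aStart: number, aEnd: number, bStart: number, bEnd: number): number {
  return Math.max(0, Math.min(aEnd, bEnd) - Math.max(aStart, bStart));
}

// Label each segment with the speaker whose turns overlap it the longest.
// Segments that no turn overlaps get speaker = null.
function assignSpeakers(segments: TranscriptSegment[], turns: SpeakerTurn[]): LabeledSegment[] {
  return segments.map((segment) => {
    const totals: Record<string, number> = {};
    let best: string | null = null;
    let bestSeconds = 0;
    for (const turn of turns) {
      const shared = overlapSeconds(segment.start, segment.end, turn.start, turn.end);
      if (shared <= 0) continue;
      const total = (totals[turn.speaker] || 0) + shared;
      totals[turn.speaker] = total;
      // Strict ">" keeps the earlier leader on ties (an arbitrary, assumed choice).
      if (total > bestSeconds) {
        best = turn.speaker;
        bestSeconds = total;
      }
    }
    return { start: segment.start, end: segment.end, text: segment.text, speaker: best };
  });
}
```

If the diarizer fails, the same shape can be produced by calling `assignSpeakers(segments, [])`, which leaves every `speaker` as null; whether the real backend does this or omits the field entirely is not specified here.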
Request shape
POST /api/videos/:videoId/transcribe
Content-Type: application/json
{
  "language": "en",            // optional ISO-639-1 code; omit to auto-detect
  "enableDiarization": true,   // optional, default false
  "numSpeakers": null,         // optional exact count
  "minSpeakers": null,         // optional lower bound
  "maxSpeakers": null          // optional upper bound
}
numSpeakers / minSpeakers / maxSpeakers are accepted for forward
compatibility. The current pyannote adapter binds these at loader-config
time and ignores per-request overrides; the model service logs a
warning when they are supplied.
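As an illustration of the request shape, the sketch below builds the URL and `fetch` options for this endpoint. `TranscribeOptions`, `TranscribeRequest`, and `buildTranscribeRequest` are hypothetical names, not part of the frontend codebase, and dropping unset fields from the body is an assumption made to keep the payload small.

```typescript
// Hypothetical option bag mirroring the documented request body.
interface TranscribeOptions {
  language?: string; // ISO-639-1 code; omit to auto-detect
  enableDiarization?: boolean; // defaults to false on the backend
  numSpeakers?: number | null;
  minSpeakers?: number | null;
  maxSpeakers?: number | null;
}

interface TranscribeRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the URL and fetch() options without sending anything.
// Unset or null fields are left out of the JSON body.
function buildTranscribeRequest(videoId: string, options: TranscribeOptions = {}): TranscribeRequest {
  const body: Record<string, unknown> = {};
  if (options.language) body.language = options.language;
  if (options.enableDiarization) body.enableDiarization = true;

  // Speaker-count hints are passed through, although the current pyannote
  // adapter ignores per-request overrides.
  const hints: Array<"numSpeakers" | "minSpeakers" | "maxSpeakers"> = [
    "numSpeakers",
    "minSpeakers",
    "maxSpeakers",
  ];
  for (const key of hints) {
    const value = options[key];
    if (value !== null && value !== undefined) body[key] = value;
  }

  return {
    url: `/api/videos/${encodeURIComponent(videoId)}/transcribe`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}
```

A caller would typically pass the result straight to `fetch(request.url, request.init)` and parse the JSON response.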
Response shape
{
  "text": "full concatenated transcript",
  "segments": [
    {
      "start": 0.0,
      "end": 2.4,
      "text": "Welcome to the broadcast.",
      "confidence": 0.93,
      "speaker": "SPEAKER_00"
    }
  ],
  "language": "en",
  "duration": 14.3,
  "processingTime": 1.42,
  "modelUsed": "Systran/faster-whisper-tiny",
  "speakers": ["SPEAKER_00", "SPEAKER_01"],
  "diarizationModelUsed": "pyannote/speaker-diarization-3.1",
  "diarizationProcessingTime": 2.18
}
speakers, diarizationModelUsed, and diarizationProcessingTime
are present only when diarization actually ran. segments[].speaker
is null when no diarization turn overlaps that segment (typically
silence).
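A client can model the optional diarization fields with a type like the one sketched here. The interface and function names are hypothetical, and marking `segments[].speaker` as optional is an assumption, since this page only states that it is null when no turn overlaps. The helper shows one way to handle null speakers: it totals talk time per speaker and skips unlabeled segments.

```typescript
// Hypothetical client-side types mirroring the response example.
interface ApiSegment {
  start: number; // seconds
  end: number; // seconds
  text: string;
  confidence: number;
  speaker?: string | null; // null when no diarization turn overlaps
}

interface TranscribeResponse {
  text: string;
  segments: ApiSegment[];
  language: string;
  duration: number;
  processingTime: number;
  modelUsed: string;
  // Present only when diarization actually ran.
  speakers?: string[];
  diarizationModelUsed?: string;
  diarizationProcessingTime?: number;
}

// True when the response carries diarization output.
function diarizationRan(response: TranscribeResponse): boolean {
  return Array.isArray(response.speakers);
}

// Total seconds of labeled speech per speaker; unlabeled segments are skipped.
function talkTimeBySpeaker(response: TranscribeResponse): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const segment of response.segments) {
    const speaker = segment.speaker;
    if (speaker === null || speaker === undefined) continue;
    totals[speaker] = (totals[speaker] || 0) + (segment.end - segment.start);
  }
  return totals;
}
```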
Model selection
The audio loaders are configured under the
audio_transcription and speaker_diarization task slots in
model-service/config/models.yaml (GPU build) and
model-service/config/models-cpu.yaml (CPU build). The CPU build
selects the faster-whisper-tiny-cpu loader option (model_id
Systran/faster-whisper-tiny) for transcription so the demo
laptop and the docs-test runners can transcribe a 30-second
clip in roughly two seconds.
The diarization loader is pyannote-3-1 in both builds. Pyannote
requires a Hugging Face account, a one-time acceptance of the model
license on huggingface.co, and an HF_TOKEN environment variable on
the model-service container; see
Reference > Environment variables.
Error responses
400 the model service rejected the request body
404 the video row does not exist
500 transcription failed (model load error, decoding error, etc.)
502 the model service is unreachable
504 the model service exceeded MODEL_SERVICE_TIMEOUTS.transcribe
The MODEL_SERVICE_TIMEOUTS.transcribe ceiling is overridable per
deployment via MODEL_SERVICE_TIMEOUT_TRANSCRIBE_MS; CPU-first-load
diarization on a long clip is the most common case for raising it.
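The status codes and the timeout override can be handled along these lines. This is a sketch under stated assumptions: `describeTranscribeError` and `resolveTranscribeTimeoutMs` are hypothetical helpers, the message strings simply restate the list above, and the fallback timeout is supplied by the caller because the default value is not given on this page.

```typescript
// Messages restate the documented meaning of each status code.
const TRANSCRIBE_ERROR_MESSAGES: Record<number, string> = {
  400: "The model service rejected the request body.",
  404: "The video row does not exist.",
  500: "Transcription failed (model load error, decoding error, etc.).",
  502: "The model service is unreachable.",
  504: "The model service exceeded the transcribe timeout.",
};

// Map an HTTP status to a readable message, with a generic fallback.
function describeTranscribeError(status: number): string {
  return TRANSCRIBE_ERROR_MESSAGES[status] || `Unexpected status ${status}.`;
}

// Read MODEL_SERVICE_TIMEOUT_TRANSCRIBE_MS from an env-like object.
// Missing, empty, non-numeric, or non-positive values fall back to fallbackMs.
function resolveTranscribeTimeoutMs(
  env: Record<string, string | undefined>,
  fallbackMs: number
): number {
  const raw = env["MODEL_SERVICE_TIMEOUT_TRANSCRIBE_MS"];
  const parsed = raw === undefined ? NaN : Number(raw);
  return isFinite(parsed) && parsed > 0 ? parsed : fallbackMs;
}
```

In a Node.js backend the first argument would usually be `process.env`; taking it as a parameter keeps the helper easy to test.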
TranscriptPanel UI
The panel renders inside the Audio Transcript dialog
(data-testid="transcript-dialog") opened by the Transcribe
Audio toolbar button in the annotation workspace. Each row is:
- A colored chip for the speaker (SPEAKER_00, SPEAKER_01, ...). Colors are assigned by first-appearance order so the same clip always paints the same speaker palette across reloads.
- A mm:ss start timestamp. Clicking it seeks the player.
- The segment text. The currently-playing segment highlights as the player crosses its start..end window.
When diarization is disabled, the chip column collapses and the rows render as plain timestamped transcript lines.
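The three row behaviors can be sketched with small pure helpers. These are illustrative only and do not reproduce the TranscriptPanel source: the function names, the caller-supplied palette, the zero-padded minutes, and the half-open `start <= t < end` window are all assumptions.

```typescript
interface PanelSegment {
  start: number; // seconds
  end: number; // seconds
  text: string;
  speaker: string | null;
}

// Format seconds as mm:ss. Minutes are not wrapped, so 3725 s gives "62:05".
function formatTimestamp(seconds: number): string {
  const total = Math.max(0, Math.floor(seconds));
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  return pad(Math.floor(total / 60)) + ":" + pad(total % 60);
}

// Assign palette colors in first-appearance order, so the same clip always
// yields the same mapping. Assumes a non-empty palette; colors repeat when
// there are more speakers than palette entries.
function assignSpeakerColors(segments: PanelSegment[], palette: string[]): Record<string, string> {
  const colors: Record<string, string> = {};
  let next = 0;
  for (const segment of segments) {
    const speaker = segment.speaker;
    if (speaker === null || Object.prototype.hasOwnProperty.call(colors, speaker)) continue;
    colors[speaker] = palette[next % palette.length];
    next += 1;
  }
  return colors;
}

// Index of the segment whose window contains currentTime, or -1 if none does.
function findActiveSegmentIndex(segments: PanelSegment[], currentTime: number): number {
  for (let i = 0; i < segments.length; i++) {
    if (currentTime >= segments[i].start && currentTime < segments[i].end) return i;
  }
  return -1;
}
```

A player time-update handler could call `findActiveSegmentIndex` with the current playback time and highlight the returned row, clearing the highlight when the result is -1.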