Audio transcription

Use the audio transcription path to attach speech-to-text output to a video summary. Every transcriber sits behind the IAudioTranscriber outbound port (see Concepts > Clean Architecture). The model service ships seven external-API vendor adapters under model-service/src/infrastructure/adapters/outbound/external_apis/audio/:

assemblyai_client.py
aws_transcribe_client.py
azure_speech_client.py
deepgram_client.py
gladia_client.py
google_speech_client.py
revai_client.py

It also ships on-device adapters under model-service/src/infrastructure/adapters/outbound/models/audio/:

loader.py      Whisper, faster-whisper
canary.py      Canary-Qwen 2.5B
parakeet.py    Parakeet TDT 1.1B
whisperx.py    WhisperX large-v3
adapters.py    WhisperTranscriberAdapter, PyannoteDiarizerAdapter, SileroVADAdapter

Each adapter implements the common interface defined in base.py and is selected by the audio model name in model-service/config/models.yaml. Speaker diarization (ISpeakerDiarizer, backed by pyannote 3.1) and voice activity detection (IVoiceActivityDetector, backed by Silero VAD) sit behind their own ports.
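
The port-and-adapter arrangement described above can be sketched roughly as follows. This is a minimal illustration, not the model service's actual code: the transcribe method, the Transcript type, the two stand-in adapters, the ADAPTERS registry, and select_transcriber are all hypothetical names, and the real interface in base.py may look different.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


@dataclass
class Transcript:
    # Hypothetical result type; the real adapters may return more fields.
    text: str
    language: str


class IAudioTranscriber(Protocol):
    """Outbound port. The method name and signature here are assumptions."""

    def transcribe(self, audio_path: str) -> Transcript: ...


class FakeDeepgramAdapter:
    """Stand-in for a vendor adapter; it makes no network call."""

    def transcribe(self, audio_path: str) -> Transcript:
        return Transcript(text=f"[deepgram] {audio_path}", language="en")


class FakeWhisperAdapter:
    """Stand-in for an on-device adapter; it loads no model."""

    def transcribe(self, audio_path: str) -> Transcript:
        return Transcript(text=f"[whisper] {audio_path}", language="en")


# Hypothetical registry keyed by the audio model name from models.yaml.
ADAPTERS: Dict[str, Callable[[], IAudioTranscriber]] = {
    "deepgram": FakeDeepgramAdapter,
    "whisper": FakeWhisperAdapter,
}


def select_transcriber(model_name: str) -> IAudioTranscriber:
    """Return the adapter registered under model_name, or raise ValueError."""
    try:
        factory = ADAPTERS[model_name]
    except KeyError:
        raise ValueError(f"unknown audio model: {model_name!r}") from None
    return factory()
```

Because callers depend only on the port, swapping a vendor adapter for an on-device one is a configuration change rather than a code change.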

Selection

The summary generation pipeline runs audio transcription, visual summarization, and a fusion step. The audio model is configured under the audio_transcription task slot in models.yaml. audioModelUsed on the resulting VideoSummary row records which adapter actually ran.

Vendor credentials

Each vendor needs its own credential. Store them as user-level or admin-level API keys (see Guide > API keys) under the matching provider name (assemblyai, aws, azure, deepgram, gladia, google, revai). The model service reads the resolved credential from the backend at call time; adapters do not have direct database access.
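
One way to picture the credential flow is an adapter-side helper that asks the backend for a key by provider name instead of querying a database. The sketch below is illustrative only: resolve_credential, CredentialNotFound, and the fetch callable (which stands in for whatever call the model service makes to the backend) are hypothetical, and only the provider names come from this page.

```python
from typing import Callable, Optional

# Provider names listed on this page.
PROVIDERS = ("assemblyai", "aws", "azure", "deepgram", "gladia", "google", "revai")


class CredentialNotFound(LookupError):
    """Raised when the backend has no key stored for a provider."""


def resolve_credential(provider: str, fetch: Callable[[str], Optional[str]]) -> str:
    """Look up the API key for provider through the injected fetch callable.

    fetch is a hypothetical stand-in for the backend call; it returns the
    resolved key, or None when no key is stored.
    """
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    key = fetch(provider)
    if not key:
        raise CredentialNotFound(f"no API key stored for {provider!r}")
    return key
```

In a test, fetch can be a plain dictionary lookup such as {"deepgram": "dg-key"}.get, which keeps the adapter free of any database dependency.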

Fusion strategies

The fusionStrategy column on VideoSummary records how audio and visual outputs were combined:

sequential           visual summary generated first, then refined with audio context
timestamp_aligned    visual and audio descriptions aligned at matching timestamps, then fused
native_multimodal    a single model processes both modalities jointly
hybrid               combines sequential and timestamp-aligned passes

The fusion strategy is supplied by the caller via fusion_strategy on the SummarizeVideoRequest; create_fusion_strategy in model-service/src/application/use_cases/fuse_modalities.py dispatches that value to one of SequentialFusion, TimestampAlignedFusion, NativeMultimodalFusion, or HybridFusion. The default is sequential.
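
A dispatcher of this kind might look roughly like the sketch below. The class and function names come from this page, but the class bodies, the name attribute, the lookup table, and the ValueError on unknown values are assumptions made for illustration; the real code in fuse_modalities.py may differ.

```python
from typing import Optional


class SequentialFusion:
    name = "sequential"


class TimestampAlignedFusion:
    name = "timestamp_aligned"


class NativeMultimodalFusion:
    name = "native_multimodal"


class HybridFusion:
    name = "hybrid"


# Hypothetical lookup table from the fusion_strategy value to its class.
_STRATEGIES = {
    cls.name: cls
    for cls in (
        SequentialFusion,
        TimestampAlignedFusion,
        NativeMultimodalFusion,
        HybridFusion,
    )
}


def create_fusion_strategy(fusion_strategy: Optional[str] = None):
    """Instantiate the strategy for fusion_strategy, defaulting to sequential."""
    key = fusion_strategy or "sequential"
    try:
        return _STRATEGIES[key]()
    except KeyError:
        # Rejecting unknown values is an assumption; the real code may fall back.
        raise ValueError(f"unknown fusion strategy: {key!r}") from None
```

Calling create_fusion_strategy() with no argument returns a SequentialFusion instance in this sketch, matching the documented default.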

Standalone transcribe button

The summary pipeline above is the indirect path: audio runs as part of a larger generate-summary job. The direct path is a Transcribe Audio button on the workspace toolbar that calls POST /api/videos/:videoId/transcribe. It can optionally enable speaker diarization via pyannote 3.1, and it renders the result in a TranscriptPanel with click-to-seek timestamps and color-coded speaker chips. See Guide > Transcribe and diarize for the full request and response contract.

The standalone route calls the model service's /api/transcribe and /api/diarize endpoints directly and does not produce a VideoSummary row. Use it when a user wants the transcript without running the full summary pipeline.
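
For scripted use, a request to the standalone route could be assembled along these lines. Only the path POST /api/videos/:videoId/transcribe comes from this page; the base URL, the build_transcribe_request helper, and the diarize body field are hypothetical placeholders, so check Guide > Transcribe and diarize for the actual contract before relying on them. The sketch builds the request without sending it.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_transcribe_request(base_url: str, video_id: str, diarize: bool = False) -> Request:
    """Build, but do not send, a POST to the standalone transcribe route."""
    # Percent-encode the video id so it is safe as a single path segment.
    url = f"{base_url.rstrip('/')}/api/videos/{quote(video_id, safe='')}/transcribe"
    # "diarize" is a hypothetical field name; the real body may differ.
    body = json.dumps({"diarize": diarize}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Passing the returned object to urllib.request.urlopen would send it; authentication headers, if the backend requires them, would need to be added first.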

Languages

audioLanguage on the summary row is the ISO language code returned by the vendor. Vendors differ in which languages they can auto-detect; consult the relevant vendor adapter for the exact list.