Audio transcription

Use the audio transcription path to attach speech-to-text output to a video summary. Since v0.3.0 every transcriber sits behind the IAudioTranscriber outbound port (see Concepts > Clean Architecture). The model service ships seven external-API vendor adapters under model-service/src/infrastructure/adapters/outbound/external_apis/audio/:

assemblyai_client.py
aws_transcribe_client.py
azure_speech_client.py
deepgram_client.py
gladia_client.py
google_speech_client.py
revai_client.py

plus the on-device adapters under model-service/src/infrastructure/adapters/outbound/models/audio/:

loader.py       Whisper, faster-whisper
canary.py       Canary-Qwen 2.5B (v0.3.0 Wave 3)
parakeet.py     Parakeet TDT 1.1B (v0.3.0 Wave 3)
whisperx.py     WhisperX large-v3 (v0.3.0 Wave 3)
adapters.py     WhisperTranscriberAdapter, PyannoteDiarizerAdapter, SileroVADAdapter

Each adapter implements the common base.py interface and is selected by the audio model name in model-service/config/models.yaml. Speaker diarization (ISpeakerDiarizer / pyannote 3.1) and voice activity detection (IVoiceActivityDetector / Silero VAD) sit behind their own ports.
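As a rough sketch of the port-and-adapter shape described above, the following is a minimal, hypothetical version of an `IAudioTranscriber` port with one vendor adapter and name-based selection. The class, method, and field names here are illustrative assumptions; the real interface lives in `base.py` and may differ.

```python
# Hypothetical sketch of the IAudioTranscriber outbound port and
# name-based adapter selection; real names in base.py may differ.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class TranscriptionResult:
    text: str
    language: str                 # ISO code reported by the backend
    segments: list = field(default_factory=list)  # per-segment timestamps, if any


class IAudioTranscriber(ABC):
    @abstractmethod
    def transcribe(self, audio_path: str) -> TranscriptionResult:
        ...


class DeepgramTranscriber(IAudioTranscriber):
    """Illustrative vendor adapter; the real one wraps the Deepgram API."""

    def __init__(self, api_key: str):
        self.api_key = api_key

    def transcribe(self, audio_path: str) -> TranscriptionResult:
        # The vendor API call would go here; stubbed for illustration.
        return TranscriptionResult(text="", language="en")


# Registry keyed by the audio model name from models.yaml (illustrative).
ADAPTERS = {"deepgram": DeepgramTranscriber}


def build_transcriber(name: str, api_key: str) -> IAudioTranscriber:
    return ADAPTERS[name](api_key)
```

Because every adapter satisfies the same port, the pipeline only ever depends on `IAudioTranscriber`, never on a concrete vendor class.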

Selection

The summary generation pipeline runs audio transcription, visual summarization, and a fusion step. The audio model is configured under the audio_transcription task slot in models.yaml; the audioModelUsed field on the resulting VideoSummary row records which adapter actually ran.
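For orientation, a models.yaml task slot might look like the fragment below. This is an illustrative shape only; consult the shipped config for the actual schema and key names.

```yaml
# Illustrative fragment; the real models.yaml schema may differ.
tasks:
  audio_transcription:
    model: whisperx-large-v3   # selects the matching adapter by name
```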

Vendor credentials

Each vendor needs its own credential. Store them as user-level or admin-level API keys (see Guide > API keys) under the matching provider name (assemblyai, aws, azure, deepgram, gladia, google, revai). The model service reads the resolved credential from the backend at call time; adapters do not have direct database access.
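The call-time resolution described above can be sketched as follows. The helper and the `fetch` callable are assumptions for illustration; the real model service presumably issues an HTTP request to the backend's key store rather than taking a callback.

```python
# Hedged sketch of call-time credential resolution; the backend lookup
# mechanism and helper names here are assumptions, not the project's API.
PROVIDERS = {"assemblyai", "aws", "azure", "deepgram", "gladia", "google", "revai"}


def resolve_credential(provider: str, fetch) -> str:
    """Ask the backend for the stored API key; adapters never touch the DB."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    key = fetch(provider)  # e.g. an HTTP call to the backend key store
    if not key:
        raise RuntimeError(f"no credential stored for {provider}")
    return key
```

Keeping the lookup in the backend means rotating a vendor key never requires touching the model service.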

Fusion strategies

The fusionStrategy column on VideoSummary records how audio and visual outputs were combined:

sequential      visual then audio; the audio pass refines the visual summary
parallel        visual and audio run independently; fused at the end
audio-first     audio drives the narrative; visual fills in detail

The model service selects a strategy automatically, based on the configured fusion model, the input characteristics, and the audio-presence detection in av_fusion.py.
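One plausible shape for that selection is below. The function name, the `speech_ratio` input, and the thresholds are all assumptions for illustration; av_fusion.py's actual logic may weigh different signals.

```python
# Hypothetical sketch of strategy selection from audio-presence detection;
# the thresholds and signal are assumptions, not av_fusion.py's real logic.
def pick_fusion_strategy(speech_ratio: float) -> str:
    """speech_ratio: fraction of the video with detected speech (0.0 to 1.0)."""
    if speech_ratio < 0.1:
        return "sequential"   # little speech: audio only refines the visual pass
    if speech_ratio > 0.6:
        return "audio-first"  # speech-heavy: transcript drives the narrative
    return "parallel"         # mixed content: fuse independent passes at the end
```

Whichever branch fires, the chosen value is what ends up in the fusionStrategy column on VideoSummary.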

Languages

audioLanguage on the summary row is the ISO code returned by the vendor. Different vendors auto-detect different language sets; consult the vendor adapter for the exact list.
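Vendors also vary in the shape of the tag they return (some send region-qualified tags like "en-US", others a bare "en"). A minimal normalizer, shown here purely as an illustration of the problem, could reduce them to a bare ISO 639-1 code before writing audioLanguage; whether the service actually normalizes is not specified here.

```python
# Illustrative only: collapse region-qualified language tags to a bare
# ISO 639-1 code. The real service may store the vendor tag verbatim.
def normalize_language(tag: str) -> str:
    """'en-US' -> 'en', 'PT-br' -> 'pt'."""
    return tag.split("-")[0].lower()
```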