Audio transcription
Use the audio transcription path to attach speech-to-text output
to a video summary. Every transcriber sits behind the
IAudioTranscriber outbound port (see
Concepts > Clean Architecture).
The model service ships seven external-API vendor adapters under
model-service/src/infrastructure/adapters/outbound/external_apis/audio/:
assemblyai_client.py
aws_transcribe_client.py
azure_speech_client.py
deepgram_client.py
gladia_client.py
google_speech_client.py
revai_client.py
plus the on-device adapters under
model-service/src/infrastructure/adapters/outbound/models/audio/:
loader.py: Whisper, faster-whisper
canary.py: Canary-Qwen 2.5B
parakeet.py: Parakeet TDT 1.1B
whisperx.py: WhisperX large-v3
adapters.py: WhisperTranscriberAdapter, PyannoteDiarizerAdapter,
SileroVADAdapter
Each adapter implements the common interface in base.py and is
selected by the audio model name in
model-service/config/models.yaml. Speaker diarization
(ISpeakerDiarizer / pyannote 3.1) and voice activity
detection (IVoiceActivityDetector / Silero VAD) sit behind
their own ports.
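
As a rough illustration of the port-and-adapter arrangement described
above, the sketch below shows one way a transcriber port and a
name-based lookup could fit together. The Transcript type, the
transcribe method signature, the ADAPTERS registry, and
select_transcriber are all hypothetical names chosen for this example;
the real definitions live in base.py and may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


@dataclass
class Transcript:
    """Hypothetical result type; the real port may return a richer object."""
    text: str
    language: str


class IAudioTranscriber(Protocol):
    """Assumed shape of the outbound port: a single transcribe method."""

    def transcribe(self, audio_path: str) -> Transcript: ...


class FakeDeepgramTranscriber:
    """Stand-in adapter for illustration only; it makes no network call."""

    def transcribe(self, audio_path: str) -> Transcript:
        return Transcript(text=f"transcript of {audio_path}", language="en")


# Hypothetical registry keyed by the audio model name read from models.yaml.
ADAPTERS: Dict[str, Callable[[], IAudioTranscriber]] = {
    "deepgram": FakeDeepgramTranscriber,
}


def select_transcriber(model_name: str) -> IAudioTranscriber:
    """Return the adapter registered under model_name, or raise ValueError."""
    try:
        factory = ADAPTERS[model_name]
    except KeyError:
        raise ValueError(f"unknown audio model: {model_name!r}") from None
    return factory()
```

Because callers depend only on the port, swapping a vendor adapter for
an on-device one would, under these assumptions, be a change to the
configured name rather than to the calling code.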
Selection
The summary generation pipeline runs audio transcription, visual
summarization, and a fusion step. The audio model is configured
under the audio_transcription task slot in models.yaml.
The audioModelUsed field on the resulting VideoSummary row records
which adapter actually ran.
Vendor credentials
Each vendor needs its own credential. Store them as user-level or
admin-level API keys (see Guide > API keys) under
the matching provider name (assemblyai, aws, azure,
deepgram, gladia, google, revai). The model service
reads the resolved credential from the backend at call time;
adapters do not have direct database access.
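
The sketch below illustrates the "resolve at call time" idea from the
paragraph above: the adapter is handed a resolver callable instead of a
database handle and asks for the key each time it needs one. The
CredentialResolver type, the VendorTranscriber class, and current_key
are hypothetical names for this example, and how the model service
actually reaches the backend is not shown here.

```python
from typing import Callable

# Provider names listed in this section.
PROVIDERS = {"assemblyai", "aws", "azure", "deepgram", "gladia", "google", "revai"}

# Hypothetical resolver type: given a provider name, return the API key
# that the backend resolved for the current request.
CredentialResolver = Callable[[str], str]


class VendorTranscriber:
    """Illustrative adapter that looks up its key only when it is called."""

    def __init__(self, provider: str, resolve: CredentialResolver) -> None:
        if provider not in PROVIDERS:
            raise ValueError(f"unknown provider: {provider!r}")
        self._provider = provider
        self._resolve = resolve

    def current_key(self) -> str:
        # Resolved on every call rather than cached at construction time,
        # so the adapter never holds database access or a stale key.
        return self._resolve(self._provider)
```

In this sketch a key that changes on the backend is picked up on the
next call, because nothing is stored on the adapter.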
Fusion strategies
The fusionStrategy column on VideoSummary records how audio
and visual outputs were combined:
sequential: visual summary generated first, then refined with audio context
timestamp_aligned: visual and audio descriptions aligned at matching timestamps, then fused
native_multimodal: a single model processes both modalities jointly
hybrid: combines sequential and timestamp-aligned passes
The fusion strategy is supplied by the caller via fusion_strategy
on the SummarizeVideoRequest; create_fusion_strategy in
model-service/src/application/use_cases/fuse_modalities.py
dispatches that value to one of SequentialFusion,
TimestampAlignedFusion, NativeMultimodalFusion, or
HybridFusion. The default is sequential.
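
The dispatch described above could look roughly like the following.
The four class names and the function name come from this page, but
the class bodies are empty placeholders, and the FusionStrategy enum,
the _STRATEGIES table, and the error handling are assumptions made for
this sketch; the real code in fuse_modalities.py may be structured
differently.

```python
from enum import Enum


class FusionStrategy(str, Enum):
    """The four values recorded in the fusionStrategy column."""
    SEQUENTIAL = "sequential"
    TIMESTAMP_ALIGNED = "timestamp_aligned"
    NATIVE_MULTIMODAL = "native_multimodal"
    HYBRID = "hybrid"


class SequentialFusion:
    """Placeholder: visual summary first, then refined with audio context."""


class TimestampAlignedFusion:
    """Placeholder: align descriptions at matching timestamps, then fuse."""


class NativeMultimodalFusion:
    """Placeholder: one model processes both modalities jointly."""


class HybridFusion:
    """Placeholder: combine sequential and timestamp-aligned passes."""


# Assumed lookup table from strategy value to implementing class.
_STRATEGIES = {
    FusionStrategy.SEQUENTIAL: SequentialFusion,
    FusionStrategy.TIMESTAMP_ALIGNED: TimestampAlignedFusion,
    FusionStrategy.NATIVE_MULTIMODAL: NativeMultimodalFusion,
    FusionStrategy.HYBRID: HybridFusion,
}


def create_fusion_strategy(name: str = "sequential"):
    """Map the caller-supplied fusion_strategy value to a strategy object."""
    try:
        key = FusionStrategy(name)
    except ValueError:
        raise ValueError(f"unknown fusion strategy: {name!r}") from None
    return _STRATEGIES[key]()
```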
Standalone transcribe button
The summary pipeline above is the indirect path: audio runs as
part of a larger generate-summary job. The direct path is a
Transcribe Audio button on the workspace toolbar. It calls
POST /api/videos/:videoId/transcribe, optionally with speaker
diarization via pyannote 3.1, and renders the result in a
TranscriptPanel with click-to-seek timestamps and color-coded
speaker chips. See Guide > Transcribe and diarize
for the full request and response contract.
The standalone route calls the model service's /api/transcribe and
/api/diarize endpoints directly and does not produce a VideoSummary
row; it is meant for users who want the transcript alone,
without the full summary pipeline.
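
For scripted use, a request to the standalone route could be built as
in the sketch below. Only the method and path come from this page; the
base URL, the build_transcribe_request helper, and the "diarize" body
field are assumptions for illustration, and authentication is omitted.
The actual contract is in Guide > Transcribe and diarize.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_transcribe_request(
    base_url: str, video_id: str, diarize: bool = False
) -> Request:
    """Build (but do not send) a POST to /api/videos/:videoId/transcribe."""
    # Percent-encode the id so it is safe as a single path segment.
    path = f"/api/videos/{quote(video_id, safe='')}/transcribe"
    url = base_url.rstrip("/") + path
    # "diarize" is a hypothetical field name; check the guide for the real one.
    body = json.dumps({"diarize": diarize}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Passing the returned object to urllib.request.urlopen would send it;
the helper stops short of that so it can be inspected or tested
without a running backend.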
Languages
The audioLanguage field on the summary row holds the ISO language
code returned by the vendor. Vendors differ in which languages they
can auto-detect; consult the vendor adapter for the exact list.