Audio Transcription API

Audio transcription capabilities for video summarization. The model service supports transcribing audio from videos with optional speaker diarization and audio-visual fusion strategies.

Overview

Audio transcription is integrated into the video summarization endpoint (/api/summarize). Enable audio processing by setting enable_audio: true in the request. The system supports:

  • Local transcription models (Whisper, Faster-Whisper)
  • External audio APIs (AssemblyAI, Deepgram, Azure, AWS, Google, Rev.ai, Gladia)
  • Speaker diarization with Pyannote Audio
  • Four fusion strategies for combining audio and visual analysis

Audio Parameters

Add these parameters to the /api/summarize request to enable audio transcription:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| enable_audio | boolean | No | false | Enable audio transcription |
| audio_language | string \| null | No | null | Language code (e.g., "en", "es"). Auto-detects if null |
| enable_speaker_diarization | boolean | No | false | Enable speaker identification |
| fusion_strategy | string \| null | No | "sequential" | Audio-visual fusion strategy |

Fusion Strategy Options

| Strategy | Description |
|---|---|
| sequential | Process audio and visual independently, then combine |
| timestamp_aligned | Align audio segments with visual frames by timestamp |
| native_multimodal | Use a multimodal model (GPT-4o, Gemini 2.5 Flash) for joint processing |
| hybrid | Weighted combination of multiple approaches |
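
As an illustration, the audio parameters above can be combined in a single request. The sketch below is a hypothetical Python client using the requests library against the same endpoint and placeholder IDs as the curl examples later on this page; it is not an official SDK.

import requests

# Hypothetical request combining the audio parameters documented above.
# The endpoint, video_id, and persona_id match the curl examples on this page.
payload = {
    "video_id": "abc-123",
    "persona_id": "persona-456",
    "enable_audio": True,
    "audio_language": "en",                  # or None to auto-detect
    "enable_speaker_diarization": True,
    "fusion_strategy": "timestamp_aligned",  # any strategy from the table above
}

response = requests.post("http://localhost:8000/api/summarize", json=payload, timeout=600)
response.raise_for_status()
print(response.json()["audio_transcript"])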

Response Fields

When audio transcription is enabled, the response includes these additional fields:

| Field | Type | Description |
|---|---|---|
| audio_transcript | string \| null | Full transcript text |
| transcript_json | object \| null | Structured transcript with segments and speakers |
| audio_language | string \| null | Detected or specified language code |
| speaker_count | number \| null | Number of distinct speakers identified |
| audio_model_used | string \| null | Audio transcription model name |
| visual_model_used | string \| null | Visual analysis model name |
| fusion_strategy | string \| null | Fusion strategy applied |
| processing_time_audio | number \| null | Audio processing time in seconds |
| processing_time_visual | number \| null | Visual processing time in seconds |
| processing_time_fusion | number \| null | Fusion processing time in seconds |
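
For reference, the audio-related fields could be typed in client code roughly as follows; the class name is invented for this sketch, and only the field names and types come from the table above.

from typing import Optional, TypedDict

# Illustrative client-side typing of the audio-related response fields.
class AudioSummaryFields(TypedDict):
    audio_transcript: Optional[str]
    transcript_json: Optional[dict]          # see "Transcript JSON Schema" below
    audio_language: Optional[str]
    speaker_count: Optional[int]
    audio_model_used: Optional[str]
    visual_model_used: Optional[str]
    fusion_strategy: Optional[str]
    processing_time_audio: Optional[float]   # seconds
    processing_time_visual: Optional[float]  # seconds
    processing_time_fusion: Optional[float]  # seconds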

Transcript JSON Schema

The transcript_json field contains structured transcription data:

{
  "segments": [
    {
      "start": 5.2,
      "end": 12.8,
      "text": "Welcome to the presentation.",
      "speaker": "Speaker 1",
      "confidence": 0.94
    },
    {
      "start": 13.5,
      "end": 20.1,
      "text": "Today we will discuss the quarterly results.",
      "speaker": "Speaker 2",
      "confidence": 0.91
    }
  ],
  "language": "en",
  "speaker_count": 2
}

Transcript Segment Fields

| Field | Type | Description |
|---|---|---|
| start | number | Start time in seconds |
| end | number | End time in seconds |
| text | string | Transcribed text for this segment |
| speaker | string \| null | Speaker label (e.g., "Speaker 1") if diarization enabled |
| confidence | number | Confidence score from 0.0 to 1.0 |
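
As a quick illustration of working with these fields, the sketch below computes per-speaker talking time and a flat, timestamped transcript string from a parsed transcript_json object; the helper name is arbitrary.

from collections import defaultdict

# Illustrative post-processing of transcript_json: per-speaker talking time
# plus a flat, timestamped transcript string.
def summarize_transcript(transcript_json: dict) -> tuple:
    talk_time = defaultdict(float)
    lines = []
    for seg in transcript_json["segments"]:
        speaker = seg["speaker"] or "Unknown"
        talk_time[speaker] += seg["end"] - seg["start"]
        lines.append(f'[{seg["start"]:.1f}s] {speaker}: {seg["text"]}')
    return dict(talk_time), "\n".join(lines)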

Examples

Basic Audio Transcription

Enable audio transcription with default settings:

curl -X POST http://localhost:8000/api/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "video_id": "abc-123",
    "persona_id": "persona-456",
    "frame_sample_rate": 1,
    "max_frames": 30,
    "enable_audio": true
  }'

Response (200):

{
  "id": "summary-789",
  "video_id": "abc-123",
  "persona_id": "persona-456",
  "summary": "The video shows a business meeting where two speakers discuss quarterly results...",
  "visual_analysis": "Frame 0: Conference room with presentation screen...",
  "audio_transcript": "Welcome to the presentation. Today we will discuss the quarterly results.",
  "key_frames": [
    {
      "frame_number": 0,
      "timestamp": 0.0,
      "description": "Opening slide",
      "confidence": 0.95
    }
  ],
  "confidence": 0.92,
  "transcript_json": {
    "segments": [
      {
        "start": 5.2,
        "end": 12.8,
        "text": "Welcome to the presentation.",
        "speaker": null,
        "confidence": 0.94
      }
    ],
    "language": "en",
    "speaker_count": 1
  },
  "audio_language": "en",
  "speaker_count": null,
  "audio_model_used": "whisper-large-v3",
  "visual_model_used": "llama-4-maverick",
  "fusion_strategy": "sequential",
  "processing_time_audio": 12.5,
  "processing_time_visual": 8.3,
  "processing_time_fusion": 1.2
}

With Speaker Diarization

Enable speaker diarization to identify multiple speakers:

curl -X POST http://localhost:8000/api/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "video_id": "abc-123",
    "persona_id": "persona-456",
    "enable_audio": true,
    "enable_speaker_diarization": true,
    "audio_language": "en"
  }'

Response (200):

{
  "id": "summary-790",
  "video_id": "abc-123",
  "persona_id": "persona-456",
  "summary": "Two speakers discuss quarterly results in a business meeting...",
  "audio_transcript": "Speaker 1: Welcome to the presentation. Speaker 2: Thank you for joining us.",
  "transcript_json": {
    "segments": [
      {
        "start": 5.2,
        "end": 12.8,
        "text": "Welcome to the presentation.",
        "speaker": "Speaker 1",
        "confidence": 0.94
      },
      {
        "start": 13.5,
        "end": 20.1,
        "text": "Thank you for joining us.",
        "speaker": "Speaker 2",
        "confidence": 0.91
      }
    ],
    "language": "en",
    "speaker_count": 2
  },
  "audio_language": "en",
  "speaker_count": 2,
  "audio_model_used": "whisper-large-v3",
  "visual_model_used": "llama-4-maverick",
  "fusion_strategy": "sequential",
  "processing_time_audio": 18.7,
  "processing_time_visual": 8.3,
  "processing_time_fusion": 1.5
}
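
With diarization enabled, segments can be filtered by speaker label. A minimal, hypothetical helper:

# Illustrative: collect everything a given speaker said from the diarized
# transcript_json shown above.
def segments_for_speaker(transcript_json: dict, speaker: str) -> list:
    return [seg["text"] for seg in transcript_json["segments"] if seg["speaker"] == speaker]

# segments_for_speaker(response["transcript_json"], "Speaker 2")
# -> ["Thank you for joining us."]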

Timestamp-Aligned Fusion

Use timestamp alignment to correlate audio segments with visual frames:

curl -X POST http://localhost:8000/api/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "video_id": "abc-123",
    "persona_id": "persona-456",
    "enable_audio": true,
    "fusion_strategy": "timestamp_aligned"
  }'

Response (200):

{
  "id": "summary-791",
  "video_id": "abc-123",
  "persona_id": "persona-456",
  "summary": "At 5.2 seconds, the speaker welcomes viewers while a title slide appears...",
  "visual_analysis": "Frame 0 (0.0s): Title slide. Frame 150 (5.0s): Speaker at podium...",
  "audio_transcript": "Welcome to the presentation. Today we will discuss the quarterly results.",
  "fusion_strategy": "timestamp_aligned",
  "audio_model_used": "whisper-large-v3",
  "visual_model_used": "llama-4-maverick",
  "processing_time_audio": 12.5,
  "processing_time_visual": 8.3,
  "processing_time_fusion": 2.8
}
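
Conceptually, timestamp alignment pairs each transcript segment with the visual frames that fall in the same time window. The sketch below only illustrates that idea (matching by segment midpoint); it is not the service's internal implementation.

# Conceptual sketch of timestamp alignment: pair each transcript segment with
# the key frame whose timestamp is closest to the segment's midpoint.
def align_segments_to_frames(segments: list, key_frames: list) -> list:
    pairs = []
    for seg in segments:
        midpoint = (seg["start"] + seg["end"]) / 2
        nearest = min(key_frames, key=lambda f: abs(f["timestamp"] - midpoint))
        pairs.append((seg, nearest))
    return pairs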

Native Multimodal Processing

Use GPT-4o or Gemini 2.5 Flash for joint audio-visual processing:

curl -X POST http://localhost:8000/api/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "video_id": "abc-123",
    "persona_id": "persona-456",
    "enable_audio": true,
    "fusion_strategy": "native_multimodal"
  }'

Response (200):

{
  "id": "summary-792",
  "video_id": "abc-123",
  "persona_id": "persona-456",
  "summary": "A comprehensive analysis showing the speaker's body language aligned with key financial points...",
  "audio_transcript": "Welcome to the presentation. Today we will discuss the quarterly results.",
  "fusion_strategy": "native_multimodal",
  "audio_model_used": "gpt-4o",
  "visual_model_used": "gpt-4o",
  "processing_time_audio": 15.2,
  "processing_time_visual": 15.2,
  "processing_time_fusion": 0.0
}

Error Responses

400 Bad Request

Invalid parameters or missing API keys:

{
  "error": "BadRequest",
  "message": "Audio transcription requires API key for provider 'deepgram'",
  "details": {
    "provider": "deepgram",
    "resolution_chain": "user_keys -> system_keys -> environment"
  }
}

404 Not Found

Video file not found:

{
  "error": "NotFound",
  "message": "Video not found: abc-123"
}

422 Validation Error

Invalid parameter values:

{
  "detail": [
    {
      "loc": ["body", "fusion_strategy"],
      "msg": "value is not a valid enumeration member; permitted: 'sequential', 'timestamp_aligned', 'native_multimodal', 'hybrid'",
      "type": "type_error.enum"
    }
  ]
}

500 Internal Server Error

Processing failure:

{
  "error": "InternalServerError",
  "message": "Audio transcription failed: Unsupported audio format",
  "details": {
    "video_id": "abc-123",
    "audio_codec": "unknown"
  }
}
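
A client might handle these status codes along the following lines; the helper is illustrative and assumes the JSON error shapes shown above.

import requests

# Illustrative error handling for the documented status codes.
def summarize_with_audio(payload: dict) -> dict:
    resp = requests.post("http://localhost:8000/api/summarize", json=payload, timeout=600)
    if resp.status_code == 422:
        raise ValueError(f'Validation error: {resp.json()["detail"]}')
    if resp.status_code in (400, 404, 500):
        body = resp.json()
        raise RuntimeError(f'{body.get("error")}: {body.get("message")}')
    resp.raise_for_status()  # any other unexpected non-2xx status
    return resp.json()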

Configuration

Audio transcription providers are configured via API keys. The system resolves API keys in this order (a code sketch of the resolution chain follows the list):

  1. User-level keys - Set in Settings > API Keys (user-scoped)
  2. System-level keys - Set in Admin Panel > API Keys (admin-only)
  3. Environment variables - Set in model-service/.env (fallback)
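
The resolution chain can be pictured as a simple fallback lookup. The sketch below is a schematic, not the service's actual code; the user_keys and system_keys dictionaries stand in for the Settings and Admin Panel stores.

import os
from typing import Optional

# Schematic of the key resolution order: user-level, then system-level,
# then environment variable.
def resolve_api_key(env_var: str, user_keys: dict, system_keys: dict) -> Optional[str]:
    return (
        user_keys.get(env_var)
        or system_keys.get(env_var)
        or os.environ.get(env_var)
    )

# resolve_api_key("DEEPGRAM_API_KEY", user_keys={}, system_keys={})
# falls through to the DEEPGRAM_API_KEY environment variable.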

Supported Audio Providers

| Provider | Model | API Key Environment Variable |
|---|---|---|
| AssemblyAI | Universal-2 | ASSEMBLYAI_API_KEY |
| Deepgram | Nova-3 | DEEPGRAM_API_KEY |
| Azure Speech | default | AZURE_SPEECH_KEY + AZURE_SPEECH_REGION |
| AWS Transcribe | default | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + AWS_DEFAULT_REGION |
| Google Speech | Chirp 2 | GOOGLE_APPLICATION_CREDENTIALS |
| Rev.ai | default | REVAI_API_KEY |
| Gladia | default | GLADIA_API_KEY |

Local Models

Configure local transcription models in model-service/config/models.yaml:

tasks:
  audio_transcription:
    models:
      - name: whisper-v3-turbo
        model_id: openai/whisper-large-v3-turbo
        framework: faster_whisper
        device: cuda
      - name: whisper-large-v3
        model_id: openai/whisper-large-v3
        framework: transformers
        device: cuda

Performance Considerations

Processing Time

Processing time varies by configuration (a rough worked estimate follows the list):

  • Local models: 0.5x to 2x real-time (depends on hardware)
  • External APIs: 0.3x to 1x real-time (depends on provider)
  • Speaker diarization: Adds 30-50% processing time
  • Fusion strategies: Minimal overhead (1-3 seconds)
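
As a rough worked example of these multipliers (the exact factors depend on hardware and provider):

# Back-of-the-envelope estimate using the multipliers listed above; the
# factors are rough midpoints, not measured values.
def estimate_audio_seconds(video_duration_s: float,
                           realtime_factor: float = 1.0,  # ~0.5-2.0 local, ~0.3-1.0 API
                           diarization: bool = False) -> float:
    estimate = video_duration_s * realtime_factor
    if diarization:
        estimate *= 1.4    # diarization adds roughly 30-50%
    return estimate + 2.0  # fusion overhead is typically 1-3 seconds

# A 10-minute video with a local model and diarization:
# estimate_audio_seconds(600, realtime_factor=1.0, diarization=True) -> ~842 s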

GPU Memory

Local transcription models require GPU memory:

  • Whisper Large v3: 10GB VRAM (float16)
  • Whisper Turbo: 6GB VRAM (float16)
  • Faster-Whisper: 4-8GB VRAM (int8_float16)
  • Pyannote Audio: 2GB VRAM (additional)

Accuracy Tradeoffs

| Model | Speed | Accuracy | GPU Memory |
|---|---|---|---|
| Whisper Turbo | Fast | Good | 6GB |
| Whisper Large v3 | Slow | Excellent | 10GB |
| AssemblyAI | Fast | Excellent | N/A |
| Deepgram | Very Fast | Excellent | N/A |

See Also