Audio Transcription API

Audio transcription capabilities for video summarization. The model service supports transcribing audio from videos with optional speaker diarization and audio-visual fusion strategies.

Overview

Audio transcription is integrated into the video summarization endpoint (/api/summarize). Enable audio processing by setting enable_audio: true in the request. The system supports:

  • Local transcription models (Whisper, Faster-Whisper)
  • External audio APIs (AssemblyAI, Deepgram, Azure, AWS, Google, Rev.ai, Gladia)
  • Speaker diarization with Pyannote Audio
  • Four fusion strategies for combining audio and visual analysis

Audio Parameters

Add these parameters to the /api/summarize request to enable audio transcription:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| enable_audio | boolean | No | false | Enable audio transcription |
| audio_language | string \| null | No | null | Language code (e.g., "en", "es"). Auto-detects if null |
| enable_speaker_diarization | boolean | No | false | Enable speaker identification |
| fusion_strategy | string \| null | No | "sequential" | Audio-visual fusion strategy |

Fusion Strategy Options

| Strategy | Description |
|---|---|
| sequential | Process audio and visual independently, then combine |
| timestamp_aligned | Align audio segments with visual frames by timestamp |
| native_multimodal | Use a multimodal model (GPT-4o, Gemini 2.5 Flash) for joint processing |
| hybrid | Weighted combination of multiple approaches |
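
As an illustration, the audio parameters above can be combined in a single request. The sketch below is a hypothetical Python client using the requests library against the same endpoint and placeholder IDs as the curl examples later on this page; it is not an official SDK.

import requests

# Hypothetical request combining the audio parameters documented above.
# The endpoint, video_id, and persona_id match the curl examples on this page.
payload = {
    "video_id": "abc-123",
    "persona_id": "persona-456",
    "enable_audio": True,
    "audio_language": "en",                  # or None to auto-detect
    "enable_speaker_diarization": True,
    "fusion_strategy": "timestamp_aligned",  # any strategy from the table above
}

response = requests.post("http://localhost:8000/api/summarize", json=payload, timeout=600)
response.raise_for_status()
print(response.json()["audio_transcript"])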

Response Fields

When audio transcription is enabled, the response includes these additional fields:

| Field | Type | Description |
|---|---|---|
| audio_transcript | string \| null | Full transcript text |
| transcript_json | object \| null | Structured transcript with segments and speakers |
| audio_language | string \| null | Detected or specified language code |
| speaker_count | number \| null | Number of distinct speakers identified |
| audio_model_used | string \| null | Audio transcription model name |
| visual_model_used | string \| null | Visual analysis model name |
| fusion_strategy | string \| null | Fusion strategy applied |
| processing_time_audio | number \| null | Audio processing time in seconds |
| processing_time_visual | number \| null | Visual processing time in seconds |
| processing_time_fusion | number \| null | Fusion processing time in seconds |
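
For reference, the audio-related fields could be typed in client code roughly as follows; the class name is invented for this sketch, and only the field names and types come from the table above.

from typing import Optional, TypedDict

# Illustrative client-side typing of the audio-related response fields.
class AudioSummaryFields(TypedDict):
    audio_transcript: Optional[str]
    transcript_json: Optional[dict]          # see "Transcript JSON Schema" below
    audio_language: Optional[str]
    speaker_count: Optional[int]
    audio_model_used: Optional[str]
    visual_model_used: Optional[str]
    fusion_strategy: Optional[str]
    processing_time_audio: Optional[float]   # seconds
    processing_time_visual: Optional[float]  # seconds
    processing_time_fusion: Optional[float]  # seconds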

Transcript JSON Schema

The transcript_json field contains structured transcription data:

{
  "segments": [
    {
      "start": 5.2,
      "end": 12.8,
      "text": "Welcome to the presentation.",
      "speaker": "Speaker 1",
      "confidence": 0.94
    },
    {
      "start": 13.5,
      "end": 20.1,
      "text": "Today we will discuss the quarterly results.",
      "speaker": "Speaker 2",
      "confidence": 0.91
    }
  ],
  "language": "en",
  "speaker_count": 2
}

Transcript Segment Fields

| Field | Type | Description |
|---|---|---|
| start | number | Start time in seconds |
| end | number | End time in seconds |
| text | string | Transcribed text for this segment |
| speaker | string \| null | Speaker label (e.g., "Speaker 1") if diarization enabled |
| confidence | number | Confidence score from 0.0 to 1.0 |
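
As a quick illustration of working with these fields, the sketch below computes per-speaker talking time and a flat, timestamped transcript string from a parsed transcript_json object; the helper name is arbitrary.

from collections import defaultdict

# Illustrative post-processing of transcript_json: per-speaker talking time
# plus a flat, timestamped transcript string.
def summarize_transcript(transcript_json: dict) -> tuple:
    talk_time = defaultdict(float)
    lines = []
    for seg in transcript_json["segments"]:
        speaker = seg["speaker"] or "Unknown"
        talk_time[speaker] += seg["end"] - seg["start"]
        lines.append(f'[{seg["start"]:.1f}s] {speaker}: {seg["text"]}')
    return dict(talk_time), "\n".join(lines)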

Examples

Basic Audio Transcription

Enable audio transcription with default settings:

curl -X POST http://localhost:8000/api/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "video_id": "abc-123",
    "persona_id": "persona-456",
    "frame_sample_rate": 1,
    "max_frames": 30,
    "enable_audio": true
  }'

Response (200):

{
  "id": "summary-789",
  "video_id": "abc-123",
  "persona_id": "persona-456",
  "summary": "The video shows a business meeting where two speakers discuss quarterly results...",
  "visual_analysis": "Frame 0: Conference room with presentation screen...",
  "audio_transcript": "Welcome to the presentation. Today we will discuss the quarterly results.",
  "key_frames": [
    {
      "frame_number": 0,
      "timestamp": 0.0,
      "description": "Opening slide",
      "confidence": 0.95
    }
  ],
  "confidence": 0.92,
  "transcript_json": {
    "segments": [
      {
        "start": 5.2,
        "end": 12.8,
        "text": "Welcome to the presentation.",
        "speaker": null,
        "confidence": 0.94
      }
    ],
    "language": "en",
    "speaker_count": 1
  },
  "audio_language": "en",
  "speaker_count": null,
  "audio_model_used": "whisper-large-v3",
  "visual_model_used": "llama-4-maverick",
  "fusion_strategy": "sequential",
  "processing_time_audio": 12.5,
  "processing_time_visual": 8.3,
  "processing_time_fusion": 1.2
}

With Speaker Diarization

Enable speaker diarization to identify multiple speakers:

curl -X POST http://localhost:8000/api/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "video_id": "abc-123",
    "persona_id": "persona-456",
    "enable_audio": true,
    "enable_speaker_diarization": true,
    "audio_language": "en"
  }'

Response (200):

{
  "id": "summary-790",
  "video_id": "abc-123",
  "persona_id": "persona-456",
  "summary": "Two speakers discuss quarterly results in a business meeting...",
  "audio_transcript": "Speaker 1: Welcome to the presentation. Speaker 2: Thank you for joining us.",
  "transcript_json": {
    "segments": [
      {
        "start": 5.2,
        "end": 12.8,
        "text": "Welcome to the presentation.",
        "speaker": "Speaker 1",
        "confidence": 0.94
      },
      {
        "start": 13.5,
        "end": 20.1,
        "text": "Thank you for joining us.",
        "speaker": "Speaker 2",
        "confidence": 0.91
      }
    ],
    "language": "en",
    "speaker_count": 2
  },
  "audio_language": "en",
  "speaker_count": 2,
  "audio_model_used": "whisper-large-v3",
  "visual_model_used": "llama-4-maverick",
  "fusion_strategy": "sequential",
  "processing_time_audio": 18.7,
  "processing_time_visual": 8.3,
  "processing_time_fusion": 1.5
}
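
With diarization enabled, segments can be filtered by speaker label. A minimal, hypothetical helper:

# Illustrative: collect everything a given speaker said from the diarized
# transcript_json shown above.
def segments_for_speaker(transcript_json: dict, speaker: str) -> list:
    return [seg["text"] for seg in transcript_json["segments"] if seg["speaker"] == speaker]

# segments_for_speaker(response["transcript_json"], "Speaker 2")
# -> ["Thank you for joining us."]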

Timestamp-Aligned Fusion

Use timestamp alignment to correlate audio segments with visual frames:

curl -X POST http://localhost:8000/api/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "video_id": "abc-123",
    "persona_id": "persona-456",
    "enable_audio": true,
    "fusion_strategy": "timestamp_aligned"
  }'

Response (200):

{
  "id": "summary-791",
  "video_id": "abc-123",
  "persona_id": "persona-456",
  "summary": "At 5.2 seconds, the speaker welcomes viewers while a title slide appears...",
  "visual_analysis": "Frame 0 (0.0s): Title slide. Frame 150 (5.0s): Speaker at podium...",
  "audio_transcript": "Welcome to the presentation. Today we will discuss the quarterly results.",
  "fusion_strategy": "timestamp_aligned",
  "audio_model_used": "whisper-large-v3",
  "visual_model_used": "llama-4-maverick",
  "processing_time_audio": 12.5,
  "processing_time_visual": 8.3,
  "processing_time_fusion": 2.8
}
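
Conceptually, timestamp alignment pairs each transcript segment with the visual frames that fall in the same time window. The sketch below only illustrates that idea (matching by segment midpoint); it is not the service's internal implementation.

# Conceptual sketch of timestamp alignment: pair each transcript segment with
# the key frame whose timestamp is closest to the segment's midpoint.
def align_segments_to_frames(segments: list, key_frames: list) -> list:
    pairs = []
    for seg in segments:
        midpoint = (seg["start"] + seg["end"]) / 2
        nearest = min(key_frames, key=lambda f: abs(f["timestamp"] - midpoint))
        pairs.append((seg, nearest))
    return pairs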

Native Multimodal Processing

Use GPT-4o or Gemini 2.5 Flash for joint audio-visual processing:

curl -X POST http://localhost:8000/api/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "video_id": "abc-123",
    "persona_id": "persona-456",
    "enable_audio": true,
    "fusion_strategy": "native_multimodal"
  }'

Response (200):

{
  "id": "summary-792",
  "video_id": "abc-123",
  "persona_id": "persona-456",
  "summary": "A comprehensive analysis showing the speaker's body language aligned with key financial points...",
  "audio_transcript": "Welcome to the presentation. Today we will discuss the quarterly results.",
  "fusion_strategy": "native_multimodal",
  "audio_model_used": "gpt-4o",
  "visual_model_used": "gpt-4o",
  "processing_time_audio": 15.2,
  "processing_time_visual": 15.2,
  "processing_time_fusion": 0.0
}

Error Responses

400 Bad Request

Invalid parameters or missing API keys:

{
  "error": "BadRequest",
  "message": "Audio transcription requires API key for provider 'deepgram'",
  "details": {
    "provider": "deepgram",
    "resolution_chain": "user_keys -> system_keys -> environment"
  }
}

404 Not Found

Video file not found:

{
  "error": "NotFound",
  "message": "Video not found: abc-123"
}

422 Validation Error

Invalid parameter values:

{
  "detail": [
    {
      "loc": ["body", "fusion_strategy"],
      "msg": "value is not a valid enumeration member; permitted: 'sequential', 'timestamp_aligned', 'native_multimodal', 'hybrid'",
      "type": "type_error.enum"
    }
  ]
}

500 Internal Server Error

Processing failure:

{
  "error": "InternalServerError",
  "message": "Audio transcription failed: Unsupported audio format",
  "details": {
    "video_id": "abc-123",
    "audio_codec": "unknown"
  }
}
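
A client might handle these status codes along the following lines; the helper is illustrative and assumes the JSON error shapes shown above.

import requests

# Illustrative error handling for the documented status codes.
def summarize_with_audio(payload: dict) -> dict:
    resp = requests.post("http://localhost:8000/api/summarize", json=payload, timeout=600)
    if resp.status_code == 422:
        raise ValueError(f'Validation error: {resp.json()["detail"]}')
    if resp.status_code in (400, 404, 500):
        body = resp.json()
        raise RuntimeError(f'{body.get("error")}: {body.get("message")}')
    resp.raise_for_status()  # any other unexpected non-2xx status
    return resp.json()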

Configuration

Audio transcription providers are configured via API keys. The system resolves API keys in this order (a code sketch of the resolution chain follows the list):

  1. User-level keys - Set in Settings > API Keys (user-scoped)
  2. System-level keys - Set in Admin Panel > API Keys (admin-only)
  3. Environment variables - Set in model-service/.env (fallback)
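
The resolution chain can be pictured as a simple fallback lookup. The sketch below is a schematic, not the service's actual code; the user_keys and system_keys dictionaries stand in for the Settings and Admin Panel stores.

import os
from typing import Optional

# Schematic of the key resolution order: user-level, then system-level,
# then environment variable.
def resolve_api_key(env_var: str, user_keys: dict, system_keys: dict) -> Optional[str]:
    return (
        user_keys.get(env_var)
        or system_keys.get(env_var)
        or os.environ.get(env_var)
    )

# resolve_api_key("DEEPGRAM_API_KEY", user_keys={}, system_keys={})
# falls through to the DEEPGRAM_API_KEY environment variable.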

Supported Audio Providers

| Provider | Model | API Key Environment Variable |
|---|---|---|
| AssemblyAI | Universal-2 | ASSEMBLYAI_API_KEY |
| Deepgram | Nova-3 | DEEPGRAM_API_KEY |
| Azure Speech | default | AZURE_SPEECH_KEY + AZURE_SPEECH_REGION |
| AWS Transcribe | default | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + AWS_DEFAULT_REGION |
| Google Speech | Chirp 2 | GOOGLE_APPLICATION_CREDENTIALS |
| Rev.ai | default | REVAI_API_KEY |
| Gladia | default | GLADIA_API_KEY |

Local Models

Configure local transcription models in model-service/config/models.yaml:

tasks:
  audio_transcription:
    models:
      - name: whisper-v3-turbo
        model_id: openai/whisper-large-v3-turbo
        framework: faster_whisper
        device: cuda
      - name: whisper-large-v3
        model_id: openai/whisper-large-v3
        framework: transformers
        device: cuda

Performance Considerations

Processing Time

Processing time varies by configuration (a rough worked estimate follows the list):

  • Local models: 0.5x to 2x real-time (depends on hardware)
  • External APIs: 0.3x to 1x real-time (depends on provider)
  • Speaker diarization: Adds 30-50% processing time
  • Fusion strategies: Minimal overhead (1-3 seconds)
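
As a rough worked example of these multipliers (the exact factors depend on hardware and provider):

# Back-of-the-envelope estimate using the multipliers listed above; the
# factors are rough midpoints, not measured values.
def estimate_audio_seconds(video_duration_s: float,
                           realtime_factor: float = 1.0,  # ~0.5-2.0 local, ~0.3-1.0 API
                           diarization: bool = False) -> float:
    estimate = video_duration_s * realtime_factor
    if diarization:
        estimate *= 1.4    # diarization adds roughly 30-50%
    return estimate + 2.0  # fusion overhead is typically 1-3 seconds

# A 10-minute video with a local model and diarization:
# estimate_audio_seconds(600, realtime_factor=1.0, diarization=True) -> ~842 s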

GPU Memory

Local transcription models require GPU memory:

  • Whisper Large v3: 10GB VRAM (float16)
  • Whisper Turbo: 6GB VRAM (float16)
  • Faster-Whisper: 4-8GB VRAM (int8_float16)
  • Pyannote Audio: 2GB VRAM (additional)

Accuracy Tradeoffs

| Model | Speed | Accuracy | GPU Memory |
|---|---|---|---|
| Whisper Turbo | Fast | Good | 6GB |
| Whisper Large v3 | Slow | Excellent | 10GB |
| AssemblyAI | Fast | Excellent | N/A |
| Deepgram | Very Fast | Excellent | N/A |

See Also