Model service

The model service is a FastAPI process that hosts the VLM, LLM, detector, and tracker models plus the audio vendor adapters. It exposes an HTTP surface to the backend and never talks to the database. The backend invokes it via BullMQ jobs (for the long-running summarization, extraction, and synthesis flows) and direct HTTP (for detection, tracking, and thumbnail generation).

Since v0.3.0 the model service is laid out as a Clean Architecture stack (domain / application / infrastructure). This page covers the task-slot configuration, the loader hierarchy, and the external-API path; the layered structure is documented in Clean Architecture.

Task-slot configuration

model-service/config/models.yaml (GPU) and model-service/config/models-cpu.yaml (CPU) declare one entry per task slot. Each slot has a selected model id and a dictionary of options. The Docker build creates a symlink at /app/config/active-models.yaml pointing at the right file based on the DEVICE build arg. MODEL_CONFIG_PATH overrides the path at runtime. The model manager loads the selected option on startup; switching options requires a restart.

The slots are:

video_summarization
ontology_augmentation
claim_extraction
claim_synthesis
object_detection
object_tracking
audio_transcription
speaker_diarization
voice_activity_detection

The schema is documented in Reference > Model config.
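
For orientation, a slot entry might look like the sketch below. The selected / options key names follow the description above; the model id, framework, and option values are illustrative placeholders, not the shipped defaults.

video_summarization:
  selected: qwen2.5-vl-7b        # hypothetical model id
  options:
    framework: sglang            # one of the frameworks listed below
    max_new_tokens: 1024         # illustrative tuning knob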

Loaders

Loaders live under model-service/src/infrastructure/adapters/outbound/models/, with one subdirectory per modality. Each has a loader.py factory and a base.py shared interface, so callers depend on the interface and never import the factory directly:

models/llm/        SGLang, vLLM, Transformers LLMs
models/vlm/        SGLang, vLLM, Transformers VLMs
models/detection/  OWLv2, Grounding DINO, YOLO-World
models/tracking/   CoTracker, SAMURAI, SAM2
models/audio/      Whisper, faster-whisper, pyannote, Silero VAD, Canary, Parakeet, WhisperX
models/onnx/       ONNX Runtime YOLO-World, Florence-2, Grounding DINO (CPU mode)
models/llama_cpp/  llama.cpp LLM and VLM (GGUF, CPU mode)
models/sam3/       SAM 3 / 3.1 with detection and tracking adapters

Each loader returns a callable wrapped behind one of the outbound ports (ILanguageModel, IVisionLanguageModel, IDetectionModel, ITrackingModel, IAudioTranscriber). The ModelManager caches loaders in process memory.
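
As a rough sketch of that contract: the port name ILanguageModel comes from this page, but the generate method, the load_llm factory, and the concrete class are assumptions for illustration. The page describes loaders as returning callables behind the ports; the sketch models the port as a Protocol instead.

from typing import Protocol

class ILanguageModel(Protocol):
    # Outbound port (name from this page); the method signature is assumed.
    def generate(self, prompt: str, **kwargs) -> str: ...

class TransformersLLM:
    # Stand-in for a concrete backend; the real classes live under models/llm/.
    def __init__(self, model_id: str, options: dict):
        self.model_id, self.options = model_id, options

    def generate(self, prompt: str, **kwargs) -> str:
        return f"[{self.model_id}] {prompt}"  # placeholder, no real inference

def load_llm(model_id: str, options: dict) -> ILanguageModel:
    # loader.py-style factory: pick a backend from the configured framework
    # and return it behind the port, so callers never see the concrete class.
    return TransformersLLM(model_id, options)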

Frameworks

sglang        on-GPU inference via the SGLang runtime
vllm          on-GPU inference via vLLM
transformers  on-GPU inference via Transformers (also for SmolVLM / Moondream on CPU)
llama_cpp     CPU inference via llama.cpp (GGUF quantizations)
onnx          CPU inference via ONNX Runtime
external_api  delegate to a hosted provider

external_api options dispatch through the matching client under model-service/src/infrastructure/adapters/outbound/external_apis/. The client receives the API key from the backend at call time (the model service never reads the key from the database).
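
A minimal sketch of that key-at-call-time contract, assuming an httpx-based client; the route, response field, and method name are hypothetical:

import httpx

class ExternalApiClient:
    # Hypothetical base for the hosted-provider clients in external_apis/.
    def __init__(self, base_url: str):
        self.base_url = base_url

    def complete(self, prompt: str, *, api_key: str) -> str:
        # The key arrives as a per-call argument from the backend; the model
        # service stores no credentials and never touches the database.
        resp = httpx.post(
            f"{self.base_url}/v1/complete",  # hypothetical route
            headers={"Authorization": f"Bearer {api_key}"},
            json={"prompt": prompt},
            timeout=60.0,
        )
        resp.raise_for_status()
        return resp.json()["text"]  # hypothetical response field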

Audio adapters

The seven audio vendor adapters live under model-service/src/infrastructure/adapters/outbound/external_apis/audio/ and share a common base.py. They normalize transcripts to a common shape: paragraph text, per-word offsets, speaker labels (where supported), and language code; a sketch of that shape follows the file list below. The AudioProcessingService consumes the normalized transcript and hands it to the summarization use case for fusion.

assemblyai_client.py
aws_transcribe_client.py
azure_speech_client.py
deepgram_client.py
gladia_client.py
google_speech_client.py
revai_client.py
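
A minimal sketch of the normalized shape, assuming dataclass field names; the page specifies the four ingredients but not the exact schema:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Word:
    text: str
    start: float                   # offset in seconds (assumed unit)
    end: float
    speaker: Optional[str] = None  # label, where the vendor supports it

@dataclass
class NormalizedTranscript:
    text: str                                        # paragraph text
    words: list[Word] = field(default_factory=list)  # per-word offsets
    language: Optional[str] = None                   # language code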

The on-device path (Whisper, faster-whisper, Canary, Parakeet, WhisperX) lives under models/audio/ and runs through the same IAudioTranscriber port.

Reasoning traces

Use cases that call thinking-capable models return a ReasonedText DTO with an optional ThinkingTrace. See Guide > Reasoning traces.
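
Sketched as dataclasses (ReasonedText and ThinkingTrace are the names from this page; the fields are assumptions):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ThinkingTrace:
    content: str  # raw reasoning text emitted by the model (assumed field)

@dataclass
class ReasonedText:
    text: str                                  # the final answer
    thinking: Optional[ThinkingTrace] = None   # set only when the model thinks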

Observability

Every use case wraps its execute method in an OpenTelemetry span, and every outbound adapter emits a model_inference metric on each call, tagged with model_id, task, and framework. Spans and metrics ship to the endpoint configured by OTEL_EXPORTER_OTLP_ENDPOINT; see Guide > Observability.
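
A minimal sketch of that instrumentation with the OpenTelemetry Python API; the span name, use case, and tag values are illustrative:

from opentelemetry import metrics, trace

tracer = trace.get_tracer("model-service")
meter = metrics.get_meter("model-service")
inference_counter = meter.create_counter("model_inference")

def run_model(request: dict) -> dict:
    # Stand-in for an outbound adapter: the real one runs inference and
    # emits the metric on every call with the three tags named above.
    inference_counter.add(
        1,
        {"model_id": "qwen2.5-vl-7b", "task": "video_summarization", "framework": "sglang"},
    )
    return {"summary": "placeholder"}

def execute(request: dict) -> dict:
    # Use-case entry point: the whole execution runs inside a span.
    with tracer.start_as_current_span("video_summarization.execute"):
        return run_model(request)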