Model configuration

Use model-service/config/models.yaml (GPU) and model-service/config/models-cpu.yaml (CPU) to declare which model backs each task and how it is loaded. The model service reads the file pointed at by MODEL_CONFIG_PATH; the Dockerfile creates a build-time symlink at /app/config/active-models.yaml based on the DEVICE build arg (see Guide > Deployment).
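As a rough illustration of how a service might locate the file, the sketch below reads MODEL_CONFIG_PATH from the environment. The function name resolve_config_path and the fallback to the symlink path are assumptions made for this example; the actual model service may not define a default at all.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Assumption: fall back to the build-time symlink when the variable is unset.
DEFAULT_CONFIG_PATH = "/app/config/active-models.yaml"


def resolve_config_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the config file path named by MODEL_CONFIG_PATH (hypothetical helper)."""
    env = os.environ if env is None else env
    return Path(env.get("MODEL_CONFIG_PATH", DEFAULT_CONFIG_PATH))


# An explicit setting wins; otherwise the symlink path is used.
explicit = resolve_config_path({"MODEL_CONFIG_PATH": "/app/config/models-cpu.yaml"})
fallback = resolve_config_path({})
print(explicit)
print(fallback)
```

Passing the environment as an argument keeps the helper easy to test without touching the real process environment.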

The configuration is loaded by the YamlModelRepository adapter, an implementation of the IModelRepository outbound port. Application code reads TaskConfig and ModelConfig domain entities, never raw YAML.

Task slots

The top-level keys are task slots:

video_summarization    VLM that produces summary text
ontology_augmentation  LLM that suggests ontology additions
claim_extraction       LLM that pulls claims from a summary
claim_synthesis        LLM that re-derives claims from a revised summary
object_detection       open-vocabulary detector for POST /detect
video_tracking         tracker used to fill keyframes
audio_transcription    vendor adapter for speech-to-text

Slot shape

Each slot has a selected model id and an options dictionary:

video_summarization:
  selected: "qwen-2-5-vl-7b"
  options:
    qwen-2-5-vl-7b:
      model_id: "Qwen/Qwen2.5-VL-7B-Instruct"
      quantization: "4bit"
      framework: "sglang"
      vram_gb: 8
      speed: "fast"
      description: "Compact VLM, ungated, fits well on A10G"
    claude-sonnet-4-5:
      model_id: "claude-sonnet-4-5"
      framework: "external_api"
      provider: "anthropic"
      api_endpoint: "https://api.anthropic.com/v1/messages"
      requires_api_key: true

The full schema is in Reference > Model config.
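The slot shape above could be mapped onto typed entities roughly as follows. This is a minimal sketch, not the YamlModelRepository implementation: the field layout of ModelConfig and TaskConfig, the parse_slot helper, and the check that selected names a declared option are all assumptions for illustration. The dictionary stands in for what a YAML parser would return for one slot.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ModelConfig:
    """One entry under a slot's options (hypothetical field subset)."""
    name: str
    model_id: str
    framework: str
    extra: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class TaskConfig:
    """One task slot: the selected option id plus every declared option."""
    task: str
    selected: str
    options: dict[str, ModelConfig]

    @property
    def active(self) -> ModelConfig:
        return self.options[self.selected]


def parse_slot(task: str, raw: dict[str, Any]) -> TaskConfig:
    """Build a TaskConfig from the parsed mapping of a single slot."""
    options = {}
    for name, opt in raw.get("options", {}).items():
        # Keys other than model_id and framework are kept as free-form extras.
        extra = {k: v for k, v in opt.items() if k not in ("model_id", "framework")}
        options[name] = ModelConfig(name, opt["model_id"], opt["framework"], extra)
    selected = raw["selected"]
    if selected not in options:
        raise ValueError(f"{task}: selected model {selected!r} is not in options")
    return TaskConfig(task, selected, options)


# Parsed form of the video_summarization slot shown above (abbreviated).
raw_slot = {
    "selected": "qwen-2-5-vl-7b",
    "options": {
        "qwen-2-5-vl-7b": {
            "model_id": "Qwen/Qwen2.5-VL-7B-Instruct",
            "quantization": "4bit",
            "framework": "sglang",
            "vram_gb": 8,
        },
        "claude-sonnet-4-5": {
            "model_id": "claude-sonnet-4-5",
            "framework": "external_api",
            "provider": "anthropic",
            "requires_api_key": True,
        },
    },
}

slot = parse_slot("video_summarization", raw_slot)
print(slot.active.model_id)   # Qwen/Qwen2.5-VL-7B-Instruct
print(slot.active.framework)  # sglang
```

Rejecting a selected value that has no matching option at parse time surfaces a typo when the file is read rather than when the first request arrives.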

Frameworks

Common frameworks include:

sglang          on-GPU inference via SGLang runtime
transformers    on-GPU inference via Transformers (also for SmolVLM / Moondream on CPU)
llama_cpp       CPU inference via llama.cpp (GGUF quantizations)
onnx            CPU inference via ONNX Runtime
faster_whisper  CPU/GPU speech-to-text via faster-whisper
whisper         reference Whisper implementation
whisperx        WhisperX with alignment and diarization
nemo_canary     NVIDIA NeMo Canary ASR
nemo_parakeet   NVIDIA NeMo Parakeet ASR
pyannote        pyannote speaker diarization
sam3            Segment Anything 3 detection/segmentation
ultralytics     Ultralytics detectors (YOLO family)
pytorch         generic PyTorch loader
external_api    delegate to a hosted provider

Query /api/models/frameworks for the live framework set recognized by the running model service.

requires_api_key: true on an external_api option means the backend resolves a user-level or admin-level API key for the named provider; see Guide > API keys.

Switching models

Edit models.yaml (or models-cpu.yaml for CPU builds), change the selected field for the relevant task to one of the ids under that task's options, and restart the model service:

docker compose restart model-service

GET /api/models/config returns the parsed configuration for the frontend's model picker. GET /api/models/status reports the actual loaded model. POST /api/models/validate validates a candidate config before applying it.
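Before editing the file, it can help to confirm that the new selected value names a declared option. The sketch below does that on an already-parsed configuration dictionary; select_model is a hypothetical helper, and it is only a local pre-check, not a substitute for POST /api/models/validate, whose exact checks are not described here.

```python
import copy
from typing import Any


def select_model(config: dict[str, Any], task: str, option: str) -> dict[str, Any]:
    """Return a copy of config with task's selected field set to option."""
    if task not in config:
        raise KeyError(f"unknown task slot: {task}")
    if option not in config[task].get("options", {}):
        raise ValueError(f"{task}: {option!r} is not a declared option")
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    updated[task]["selected"] = option
    return updated


config = {
    "video_summarization": {
        "selected": "qwen-2-5-vl-7b",
        "options": {
            "qwen-2-5-vl-7b": {"framework": "sglang"},
            "claude-sonnet-4-5": {"framework": "external_api"},
        },
    }
}

updated = select_model(config, "video_summarization", "claude-sonnet-4-5")
print(updated["video_summarization"]["selected"])  # claude-sonnet-4-5
print(config["video_summarization"]["selected"])   # qwen-2-5-vl-7b
```

Returning a copy means the candidate can be validated or discarded without mutating the configuration currently in use.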

Per-persona overrides

Persona-level inference overrides live in PersonaPreferences and are merged with the user-level defaults from UserPreferences; when a key is present in both, the persona value takes precedence. The merged GenerationOverrides and AudioOverrides structures are passed through CreateSummaryRequest, then the BullMQ job payload, and finally into the model-service request body as generation_overrides and audio_overrides.

The Inference Settings panel is on the user-facing Settings page under the Inference tab, which is available to all authenticated users. It binds to /api/models/defaults and /api/models/frameworks. See Reference > Model loaders for the option set.
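The precedence rule can be sketched as a key-by-key merge in which persona values overwrite user defaults. This is an illustration only: merge_overrides is a hypothetical helper, the merge is assumed to be shallow, and the keys temperature, max_tokens, and language are invented for the example rather than taken from the real GenerationOverrides or AudioOverrides structures.

```python
from typing import Any


def merge_overrides(user: dict[str, Any], persona: dict[str, Any]) -> dict[str, Any]:
    """Shallow merge: persona values win for keys present in both."""
    return {**user, **persona}


# Hypothetical preference values for one user and one persona.
user_prefs = {
    "generation": {"temperature": 0.7, "max_tokens": 512},
    "audio": {"language": "en"},
}
persona_prefs = {
    "generation": {"temperature": 0.2},
    "audio": {},
}

# Fragment of a model-service request body carrying the merged overrides.
body = {
    "generation_overrides": merge_overrides(
        user_prefs["generation"], persona_prefs["generation"]
    ),
    "audio_overrides": merge_overrides(user_prefs["audio"], persona_prefs["audio"]),
}
print(body["generation_overrides"])  # {'temperature': 0.2, 'max_tokens': 512}
print(body["audio_overrides"])       # {'language': 'en'}
```

Keys the persona does not set fall through to the user-level default, so a persona only needs to declare the values it wants to change.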