Model configuration
Use model-service/config/models.yaml (GPU) and
model-service/config/models-cpu.yaml (CPU) to declare which
model backs each task and how it is loaded. The model service
reads the file pointed at by MODEL_CONFIG_PATH; the Dockerfile
creates a build-time symlink at /app/config/active-models.yaml
based on the DEVICE build arg (see
Guide > Deployment).
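The selection happens at image build time. The project's actual Dockerfile lines are not reproduced here; as a minimal sketch of the mechanism, assuming the config files are copied into /app/config:

ARG DEVICE=gpu
# Link the config matching the build target under the stable path
# that MODEL_CONFIG_PATH points at (illustrative lines, not the
# project's actual Dockerfile).
RUN if [ "$DEVICE" = "cpu" ]; then \
      ln -s /app/config/models-cpu.yaml /app/config/active-models.yaml; \
    else \
      ln -s /app/config/models.yaml /app/config/active-models.yaml; \
    fi
ENV MODEL_CONFIG_PATH=/app/config/active-models.yaml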
Since v0.3.0 the configuration is loaded by the
YamlModelRepository adapter, an implementation of the
IModelRepository outbound port. Application code reads
TaskConfig and ModelConfig domain entities, never raw
YAML.
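For orientation, here is a minimal sketch of that port/adapter pair, assuming a Python model service and PyYAML; the get_task_config method name and the trimmed field set are illustrative, not the real interface:

from dataclasses import dataclass
from typing import Protocol

import yaml

@dataclass
class ModelConfig:
    model_id: str
    framework: str
    # remaining option fields (quantization, vram_gb, ...) omitted

@dataclass
class TaskConfig:
    selected: str
    options: dict[str, ModelConfig]

class IModelRepository(Protocol):
    # Outbound port: the only shape the application layer depends on.
    def get_task_config(self, task: str) -> TaskConfig: ...

class YamlModelRepository:
    # Adapter: satisfies the port by parsing the file at
    # MODEL_CONFIG_PATH and mapping raw YAML onto domain entities.
    def __init__(self, path: str) -> None:
        self._path = path

    def get_task_config(self, task: str) -> TaskConfig:
        with open(self._path) as f:
            raw = yaml.safe_load(f)[task]
        options = {
            name: ModelConfig(model_id=opt["model_id"],
                              framework=opt["framework"])
            for name, opt in raw["options"].items()
        }
        return TaskConfig(selected=raw["selected"], options=options)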
Task slots
The top-level keys are task slots:
video_summarization - VLM that produces summary text
ontology_augmentation - LLM that suggests ontology additions
claim_extraction - LLM that pulls claims from a summary
claim_synthesis - LLM that re-derives claims from a revised summary
object_detection - open-vocabulary detector for POST /detect
object_tracking - tracker used to fill keyframes
audio_transcription - vendor adapter for speech-to-text
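Schematically, the top level of the file is just these seven keys, each holding a slot of the shape described in the next section (the selections below are placeholders):

video_summarization:
  selected: "qwen-2-5-vl-7b"
  options: {}   # per-model option blocks, see the next section
audio_transcription:
  selected: "whisper-adapter"   # placeholder id
  options: {}
# ...one block per task slot...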
Slot shape
Each slot has a selected model id and an options dictionary:
video_summarization:
  selected: "qwen-2-5-vl-7b"
  options:
    qwen-2-5-vl-7b:
      model_id: "Qwen/Qwen2.5-VL-7B-Instruct"
      quantization: "4bit"
      framework: "sglang"
      vram_gb: 8
      speed: "fast"
      description: "Compact VLM, ungated, fits well on A10G"
    claude-sonnet-4-5:
      model_id: "claude-sonnet-4-5"
      framework: "external_api"
      provider: "anthropic"
      api_endpoint: "https://api.anthropic.com/v1/messages"
      requires_api_key: true
The full schema is in Reference > Model config.
Frameworks
sglang - on-GPU inference via the SGLang runtime
vllm - on-GPU inference via vLLM
transformers - on-GPU inference via Transformers (also for SmolVLM / Moondream on CPU)
llama_cpp - CPU inference via llama.cpp (GGUF quantizations)
onnx - CPU inference via ONNX Runtime
external_api - delegate to a hosted provider
Setting requires_api_key: true on an external_api option tells the
backend to resolve a user-level or admin-level API key for the
named provider; see Guide > API keys.
Switching models
Edit models.yaml, change the selected field for the relevant
task, and restart the model service:
docker compose restart model-service
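For example, to route video summarization to the hosted Claude option from the slot-shape example above, only the selected field changes; the options blocks stay as they are:

video_summarization:
  selected: "claude-sonnet-4-5"   # was "qwen-2-5-vl-7b"
  options:
    # ...unchanged...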
GET /api/models/config returns the parsed configuration for the
frontend's model picker.
GET /api/models/status reports the model that is actually loaded.
POST /api/models/validate checks a candidate config before it is
applied.
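For instance, assuming the API is reachable on localhost:8000 (adjust host, port, and any auth headers to your deployment; the payload shape for validate is a guess here, not a documented contract):

curl -s http://localhost:8000/api/models/config
curl -s http://localhost:8000/api/models/status
curl -s -X POST http://localhost:8000/api/models/validate \
  -H "Content-Type: application/json" \
  -d @candidate-config.json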
Per-persona overrides
Persona-level inference overrides live in PersonaPreferences
(introduced in v0.3.0) and are merged with the user-level
defaults from UserPreferences; for keys present in both, the
user-level value takes precedence (if, say, a persona sets a
temperature of 0.2 and the user default is 0.7, the merged value
is 0.7). The merged GenerationOverrides and AudioOverrides
structures are threaded through CreateSummaryRequest, then the
BullMQ job payload, and finally into the model-service request
body as generation_overrides and audio_overrides. The Inference
Settings panel in the admin UI binds to /api/models/defaults and
/api/models/frameworks. See Reference > Model loaders for the
option set.
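Concretely, the model-service request body ends up carrying something like the following. The two top-level field names come from this section; the nested keys are illustrative assumptions rather than the documented option set:

{
  "generation_overrides": {
    "temperature": 0.7,
    "max_new_tokens": 512
  },
  "audio_overrides": {
    "language": "en"
  }
}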