Model configuration

Use model-service/config/models.yaml (GPU) and model-service/config/models-cpu.yaml (CPU) to declare which model backs each task and how it is loaded. The model service reads the file pointed at by MODEL_CONFIG_PATH; the Dockerfile creates a build-time symlink at /app/config/active-models.yaml based on the DEVICE build arg (see Guide > Deployment).

Since v0.3.0 the configuration is loaded by the YamlModelRepository adapter, an implementation of the IModelRepository outbound port. Application code reads TaskConfig and ModelConfig domain entities, never raw YAML.

Task slots

The top-level keys are task slots:

video_summarization      VLM that produces summary text
ontology_augmentation    LLM that suggests ontology additions
claim_extraction         LLM that pulls claims from a summary
claim_synthesis          LLM that re-derives claims from a revised summary
object_detection         open-vocabulary detector for POST /detect
object_tracking          tracker used to fill keyframes
audio_transcription      vendor adapter for speech-to-text

Slot shape

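Each slot has a selected key and an options dictionary. The selected value names one of the entries under options; the model_id inside that entry is the identifier handed to the framework or provider: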

video_summarization:
  selected: "qwen-2-5-vl-7b"
  options:
    qwen-2-5-vl-7b:
      model_id: "Qwen/Qwen2.5-VL-7B-Instruct"
      quantization: "4bit"
      framework: "sglang"
      vram_gb: 8
      speed: "fast"
      description: "Compact VLM, ungated, fits well on A10G"
    claude-sonnet-4-5:
      model_id: "claude-sonnet-4-5"
      framework: "external_api"
      provider: "anthropic"
      api_endpoint: "https://api.anthropic.com/v1/messages"
      requires_api_key: true

The full schema is in Reference > Model config.
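As a rough illustration of how the slot shape above could map onto TaskConfig and ModelConfig entities behind an IModelRepository port, here is a minimal Python sketch. The entity and port names come from this page, but every field, the get_task_config method, and the MappingModelRepository class are assumptions made for the example, not the project's actual code. The sketch starts from an already-parsed mapping (for instance the result of yaml.safe_load) so that it has no third-party dependency.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Protocol


@dataclass(frozen=True)
class ModelConfig:
    """One entry under a slot's `options` (illustrative subset of fields)."""
    key: str
    model_id: str
    framework: str
    quantization: Optional[str] = None
    provider: Optional[str] = None
    requires_api_key: bool = False


@dataclass(frozen=True)
class TaskConfig:
    """One task slot: the selected key plus every declared option."""
    task: str
    selected: str
    options: Mapping[str, ModelConfig]

    @property
    def selected_model(self) -> ModelConfig:
        return self.options[self.selected]


class IModelRepository(Protocol):
    """Outbound port: callers depend on this interface, not on YAML."""
    def get_task_config(self, task: str) -> TaskConfig: ...


class MappingModelRepository:
    """Hypothetical adapter over an already-parsed config mapping."""

    def __init__(self, raw: Mapping[str, Any]) -> None:
        self._raw = raw

    def get_task_config(self, task: str) -> TaskConfig:
        slot = self._raw[task]
        options = {
            key: ModelConfig(
                key=key,
                model_id=opt["model_id"],
                framework=opt["framework"],
                quantization=opt.get("quantization"),
                provider=opt.get("provider"),
                requires_api_key=bool(opt.get("requires_api_key", False)),
            )
            for key, opt in slot["options"].items()
        }
        return TaskConfig(task=task, selected=slot["selected"], options=options)


# Parsed form of the video_summarization example above (trimmed).
raw = {
    "video_summarization": {
        "selected": "qwen-2-5-vl-7b",
        "options": {
            "qwen-2-5-vl-7b": {
                "model_id": "Qwen/Qwen2.5-VL-7B-Instruct",
                "quantization": "4bit",
                "framework": "sglang",
            },
            "claude-sonnet-4-5": {
                "model_id": "claude-sonnet-4-5",
                "framework": "external_api",
                "provider": "anthropic",
                "requires_api_key": True,
            },
        },
    }
}

repo: IModelRepository = MappingModelRepository(raw)
summarizer = repo.get_task_config("video_summarization").selected_model
print(summarizer.model_id, summarizer.framework)
```

Callers in this sketch only ever see typed entities, which is the point of keeping raw YAML behind the port.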

Frameworks

sglang         on-GPU inference via SGLang runtime
vllm           on-GPU inference via vLLM
transformers   on-GPU inference via Transformers (also for SmolVLM /
               Moondream on CPU)
llama_cpp      CPU inference via llama.cpp (GGUF quantizations)
onnx           CPU inference via ONNX Runtime
external_api   delegate to a hosted provider

requires_api_key: true on an external_api option means the backend resolves a user-level or admin-level API key for the named provider; see Guide > API keys.
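Two properties of a config follow from the sections above and can be checked before a restart: each slot's selected value should name an entry under options, and each option's framework should be one of the six names listed. The Python sketch below is a hypothetical pre-flight check on an already-parsed mapping; check_slots and the sample model names are invented for illustration, and it does not replace POST /api/models/validate or the full schema.

```python
from typing import Any, List, Mapping

# Framework names taken from the table above.
KNOWN_FRAMEWORKS = {
    "sglang", "vllm", "transformers", "llama_cpp", "onnx", "external_api",
}


def check_slots(config: Mapping[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    for task, slot in config.items():
        options = slot.get("options", {})
        selected = slot.get("selected")
        if selected not in options:
            problems.append(
                f"{task}: selected {selected!r} is not listed under options"
            )
        for key, opt in options.items():
            framework = opt.get("framework")
            if framework not in KNOWN_FRAMEWORKS:
                problems.append(
                    f"{task}.{key}: unknown framework {framework!r}"
                )
    return problems


# Deliberately broken sample; both model names are made up.
sample = {
    "object_detection": {
        "selected": "missing-model",
        "options": {"some-detector": {"framework": "tensorrt"}},
    }
}

for problem in check_slots(sample):
    print(problem)
```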

Switching models

Edit models.yaml (or models-cpu.yaml for a CPU build), change the selected field of the relevant task to another key listed under that task's options, and restart the model service:

docker compose restart model-service

GET /api/models/config returns the parsed configuration for the frontend's model picker. GET /api/models/status reports which model is actually loaded, so it is the endpoint to check after a restart. POST /api/models/validate validates a candidate config before it is applied.
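After a restart, the two GET endpoints can be read side by side to confirm that the loaded model matches the configuration. The Python sketch below uses only the standard library. The base URL is a placeholder, the helper names are hypothetical, and the sketch assumes the endpoints return JSON and need no authentication, none of which this page states. The response shape is not documented here, so the results are returned as-is rather than interpreted.

```python
import json
import urllib.request
from typing import Any, Tuple


def endpoint_url(base_url: str, path: str) -> str:
    """Join a base URL and an absolute API path without doubling slashes."""
    return base_url.rstrip("/") + path


def get_json(base_url: str, path: str, timeout: float = 10.0) -> Any:
    """GET a path and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(endpoint_url(base_url, path), timeout=timeout) as resp:
        return json.load(resp)


def fetch_config_and_status(base_url: str) -> Tuple[Any, Any]:
    """Return (parsed configuration, loaded-model status) for comparison."""
    config = get_json(base_url, "/api/models/config")
    status = get_json(base_url, "/api/models/status")
    return config, status
```

Calling fetch_config_and_status with the address of a running deployment and printing both values is enough to compare them by eye.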

Per-persona overrides

Persona-level inference overrides live in PersonaPreferences (introduced in v0.3.0) and are merged with the user-level defaults from UserPreferences. When a key is present in both, the user-level value takes precedence. The merged GenerationOverrides and AudioOverrides structures are passed through CreateSummaryRequest, then the BullMQ job payload, and finally into the model-service request body as generation_overrides and audio_overrides. The Inference Settings panel in the admin UI binds to /api/models/defaults and /api/models/frameworks. See Reference > Model loaders for the option set.
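The merge rule can be pictured as a shallow dictionary merge. The Python sketch below reads the rule as the user-level value winning whenever both sources define a key. The function name merge_overrides and the keys temperature, max_tokens, and language are placeholders chosen for illustration; the real option set is in Reference > Model loaders.

```python
from typing import Any, Dict, Mapping


def merge_overrides(persona: Mapping[str, Any], user: Mapping[str, Any]) -> Dict[str, Any]:
    """Shallow merge: start from persona values, then let user values win."""
    merged = dict(persona)
    merged.update(user)  # keys present in both take the user-level value
    return merged


# Placeholder keys; not the documented option set.
persona_generation = {"temperature": 0.2, "max_tokens": 512}
user_generation = {"temperature": 0.7}

generation_overrides = merge_overrides(persona_generation, user_generation)
audio_overrides = merge_overrides({"language": "en"}, {})

# Only the two top-level key names below come from this page.
request_body = {
    "generation_overrides": generation_overrides,
    "audio_overrides": audio_overrides,
}
print(request_body["generation_overrides"])  # {'temperature': 0.7, 'max_tokens': 512}
```

Keys that only one side defines pass through unchanged, which is why max_tokens survives from the persona in this example.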
