
Model Service Configuration

The model service uses config/models.yaml to define available models, inference settings, and hardware allocation. Configuration determines which models load and how they run.

Configuration File Structure

The configuration file has two main sections:

models:
  # Task-specific model definitions
  video_summarization:
    selected: "model-name"
    options: {...}

inference:
  # Global inference settings
  max_memory_per_model: "auto"
  offload_threshold: 0.85

Models Section

Defines available models for each task type. Each task has:

  • selected: Currently active model
  • options: Available model configurations
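
For example, a task entry exposing two of the models listed later on this page might look like the following (the option bodies are abbreviated here; the full fields appear under Task Types):

models:
  video_summarization:
    selected: "gemma-3-27b"
    options:
      gemma-3-27b: {...}
      qwen2-5-vl-72b: {...}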

Inference Section

Global settings for model loading and memory management:

  • max_memory_per_model: VRAM limit per model
  • offload_threshold: Memory usage trigger for CPU offloading
  • warmup_on_startup: Load models at startup (default: false)
  • default_batch_size: Batch size for inference
  • max_batch_size: Maximum batch size allowed
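
Putting these together, a representative inference block looks like the following; the values mirror the defaults and examples used elsewhere on this page:

inference:
  max_memory_per_model: "auto"
  offload_threshold: 0.85
  warmup_on_startup: false
  default_batch_size: 1
  max_batch_size: 8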

Task Types

Video Summarization

Vision-language models (VLMs) for video frame analysis and description generation.

video_summarization:
  selected: "llama-4-maverick"
  options:
    llama-4-maverick:
      model_id: "meta-llama/Llama-4-Maverick"
      quantization: "4bit"
      framework: "sglang"
      vram_gb: 62
      speed: "fast"
      description: "MoE model with 17B active parameters, multimodal, 10M context"

Available models:

Model            | VRAM (4-bit) | Speed     | Context Length | Notes
llama-4-maverick | 62 GB        | Fast      | 10M tokens     | MoE with 17B active params
gemma-3-27b      | 14 GB        | Very fast | 8K tokens      | Document analysis, OCR
internvl3-78b    | 40 GB        | Medium    | 32K tokens     | High accuracy benchmarks
pixtral-large    | 62 GB        | Medium    | 128K tokens    | Long context processing
qwen2-5-vl-72b   | 36 GB        | Fast      | 32K tokens     | Stable baseline model

Ontology Augmentation

Large language models (LLMs) for generating ontology suggestions.

ontology_augmentation:
  selected: "llama-4-scout"
  options:
    llama-4-scout:
      model_id: "meta-llama/Llama-4-Scout"
      quantization: "4bit"
      framework: "sglang"
      vram_gb: 55
      speed: "very_fast"
      description: "MoE model with 17B active, 10M context, multimodal"

Available models:

Model            | VRAM (4-bit) | Speed     | Context Length | Notes
llama-4-scout    | 55 GB        | Very fast | 10M tokens     | MoE, multimodal capable
llama-3-3-70b    | 35 GB        | Fast      | 128K tokens    | Matches 405B quality
deepseek-v3      | 85 GB        | Fast      | 128K tokens    | MoE, 37B active params
gemma-3-27b-text | 14 GB        | Very fast | 8K tokens      | Lightweight, fast iteration

Object Detection

Models for detecting and localizing objects in video frames.

object_detection:
  selected: "yolo-world-v2"
  options:
    yolo-world-v2:
      model_id: "ultralytics/yolo-world-v2-l"
      framework: "pytorch"
      vram_gb: 2
      speed: "real_time"
      fps: 52
      description: "Speed and accuracy balance, image prompts"

Available models:

Model              | VRAM | FPS (GPU) | Type       | Notes
yolo-world-v2      | 2 GB | 52        | Open-world | Image prompt support
grounding-dino-1-5 | 4 GB | 20        | Zero-shot  | Text prompt, 52.5 AP
owlv2              | 6 GB | 15        | Zero-shot  | Rare class detection
florence-2         | 2 GB | 30        | Unified    | Captioning support

Video Tracking

Models for tracking objects across multiple frames.

video_tracking:
  selected: "samurai"
  options:
    samurai:
      model_id: "yangchris11/samurai"
      framework: "pytorch"
      vram_gb: 3
      speed: "real_time"
      description: "Motion-aware, occlusion handling, 7.1% improvement over the SAM2 baseline"

Available models:

Model       | VRAM | Speed     | Notes
samurai     | 3 GB | Real-time | Motion-aware, occlusion handling
sam2long    | 3 GB | Real-time | Long video support, error correction
sam2-1      | 3 GB | Real-time | Baseline SAM2 model
yolo11n-seg | 1 GB | Very fast | Lightweight segmentation

Model Selection Strategies

Automatic Selection (Default)

The service uses the selected field for each task type.

video_summarization:
  selected: "gemma-3-27b"  # This model loads automatically

When a request arrives:

  1. Service reads selected field
  2. Loads model if not already cached
  3. Performs inference
  4. Keeps model in memory for future requests
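
Conceptually, this flow reduces to a small lookup-and-cache routine. The sketch below is illustrative only; the helper names are hypothetical and do not describe the service's actual internals:

# Hypothetical sketch of selected-model resolution and caching.
_loaded = {}  # model name -> loaded model object

def load_model(opts: dict):
    # Placeholder for the real loader (transformers, vLLM, or SGLang,
    # chosen by the option's "framework" field).
    return object()

def get_model(task: str, config: dict, override: str | None = None):
    task_cfg = config["models"][task]
    name = override or task_cfg["selected"]   # a per-request "model" parameter wins
    if name not in _loaded:                   # load on first use...
        _loaded[name] = load_model(task_cfg["options"][name])
    return _loaded[name]                      # ...then reuse from memory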

Manual Selection

Override the default model per request using the model parameter:

curl -X POST http://localhost:8000/api/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/example.mp4",
    "model": "qwen2-5-vl-72b",
    "frame_count": 8
  }'

This loads qwen2-5-vl-72b instead of the configured default.

Switching Models

To permanently switch models:

  1. Edit config/models.yaml:

     video_summarization:
       selected: "qwen2-5-vl-72b"  # Changed from gemma-3-27b

  2. Restart the service:

     docker compose restart model-service

The new model loads on the next inference request.

Device Configuration

Environment Variables

DEVICE: Target device for inference

# CPU mode
export DEVICE=cpu

# CUDA GPU mode
export DEVICE=cuda

# Specific GPU
export DEVICE=cuda:0

# Apple Silicon (experimental)
export DEVICE=mps

CUDA_VISIBLE_DEVICES: Control GPU visibility

# Use only GPU 0
export CUDA_VISIBLE_DEVICES=0

# Use GPUs 0 and 1
export CUDA_VISIBLE_DEVICES=0,1

# Use all GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3

BUILD_MODE: Build profile selection

# Minimal build (PyTorch only, fast builds)
export BUILD_MODE=minimal

# Recommended build (adds bitsandbytes)
export BUILD_MODE=recommended

# Full build (adds vLLM, SGLang, requires GPU)
export BUILD_MODE=full

Docker Compose Configuration

Set device in .env file:

# For CPU development
DEVICE=cpu
BUILD_MODE=minimal

# For GPU production
DEVICE=cuda
BUILD_MODE=full
CUDA_VISIBLE_DEVICES=0,1,2,3

Start with appropriate profile:

# CPU mode
docker compose up

# GPU mode
docker compose --profile gpu up

Memory Requirements

VRAM Requirements by Model

Full precision (FP16/BF16):

Model            | VRAM
Llama-4-Maverick | 240 GB
Llama-4-Scout    | 220 GB
DeepSeek-V3      | 340 GB
InternVL3-78B    | 156 GB
Qwen2.5-VL-72B   | 144 GB
Llama-3.3-70B    | 140 GB
Gemma-3-27b      | 54 GB

4-bit quantization (AWQ/GPTQ):

Model            | VRAM  | Reduction
Llama-4-Maverick | 62 GB | 74%
Llama-4-Scout    | 55 GB | 75%
DeepSeek-V3      | 85 GB | 75%
InternVL3-78B    | 40 GB | 74%
Qwen2.5-VL-72B   | 36 GB | 75%
Llama-3.3-70B    | 35 GB | 75%
Gemma-3-27b      | 14 GB | 74%

Detection and tracking models use 1-6 GB regardless of quantization.
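
The ~75% reductions follow from weight precision: FP16 stores 2 bytes per parameter, while 4-bit stores 0.5 bytes plus some overhead for quantization scales and activations. A rough back-of-the-envelope estimate, for illustration only:

# Rough weight-memory estimate: billions of parameters x bytes per weight ~ GB.
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * bits_per_weight / 8

print(weight_vram_gb(70, 16))  # ~140 GB, matches Llama-3.3-70B at FP16 above
print(weight_vram_gb(70, 4))   # ~35 GB, matches the 4-bit figure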

RAM Requirements

Mode | Minimum | Recommended
CPU  | 16 GB   | 32 GB
GPU  | 32 GB   | 64 GB

CPU mode needs the extra RAM to hold model weights that would otherwise reside in VRAM.

Model Caching

Cache Directory Structure

Models download to cache directories:

~/.cache/huggingface/hub/
├── models--meta-llama--Llama-4-Maverick/
│   └── snapshots/
│       └── abc123def456/
│           ├── config.json
│           ├── model-00001-of-00005.safetensors
│           └── tokenizer.json
└── models--ultralytics--yolo-world-v2-l/
    └── snapshots/
        └── def789ghi012/
            └── yolo_world_v2_l.pt

Environment Variables

TRANSFORMERS_CACHE: HuggingFace cache location

export TRANSFORMERS_CACHE=/path/to/cache

HF_HOME: Alternative cache location

export HF_HOME=/mnt/models

MODEL_CACHE_DIR: Service-specific cache

export MODEL_CACHE_DIR=/data/model-cache

Cache Behavior

First run: Models download from HuggingFace Hub. This requires internet and takes 5-60 minutes depending on model size.

Subsequent runs: Models load from local cache in seconds.

Shared cache: Multiple services can share the same cache directory.
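
One way to share the cache is to mount a host directory into each container. The snippet below assumes the service runs as root and uses the default HuggingFace cache path; adjust the paths for your image:

services:
  model-service:
    volumes:
      - /mnt/models:/root/.cache/huggingface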

Clearing Cache

Remove specific model:

rm -rf ~/.cache/huggingface/hub/models--meta-llama--Llama-4-Maverick

Clear entire cache:

rm -rf ~/.cache/huggingface/hub/

Note: Models re-download on next use.
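
If the huggingface_hub CLI is installed, you can also inspect what is cached (and how much space each model uses) before deleting anything:

huggingface-cli scan-cache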

Build Profiles

Minimal Profile

Purpose: Fast builds for development and CI/CD

Includes:

  • PyTorch 2.5+
  • Transformers 4.47+
  • Ultralytics (YOLO)
  • FastAPI
  • OpenCV

Excludes:

  • vLLM
  • SGLang
  • SAM-2
  • bitsandbytes

Build time: 1-2 minutes

Use when:

  • Developing API endpoints
  • Testing business logic
  • Running CI/CD pipelines
  • CPU-only environments

Recommended Profile

Purpose: Development with model optimization

Includes:

  • All minimal components
  • bitsandbytes (4-bit/8-bit quantization)

Excludes:

  • vLLM
  • SGLang
  • SAM-2

Build time: 1-2 minutes

Use when:

  • GPU available but limited VRAM
  • Need quantization for lighter models
  • Development with actual inference

Full Profile

Purpose: Production deployment with all features

Includes:

  • All recommended components
  • vLLM 0.6+ (LLM serving)
  • SGLang 0.4+ (structured generation)
  • SAM-2 (segmentation)

Build time: 10-15 minutes

Image size: 8-10 GB

Use when:

  • Production GPU deployment
  • Need all model types
  • Performance critical workloads
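
Assuming the compose service is named model-service, as in the other build example on this page, a profile is chosen at build time via BUILD_MODE:

# Pick one profile per build
BUILD_MODE=minimal docker compose build model-service
BUILD_MODE=recommended docker compose build model-service
BUILD_MODE=full docker compose build model-service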

Adding Custom Models

Step 1: Add to Configuration

Edit config/models.yaml:

video_summarization:
  options:
    my-custom-model:
      model_id: "organization/model-name"
      quantization: "4bit"
      framework: "sglang"
      vram_gb: 20
      speed: "fast"
      description: "Custom model for specific use case"

Step 2: Set as Selected

video_summarization:
  selected: "my-custom-model"

Step 3: Verify Model ID

Ensure model exists on HuggingFace Hub:

curl https://huggingface.co/api/models/organization/model-name

Step 4: Test Loading

Restart service and check logs:

docker compose restart model-service
docker compose logs -f model-service

Look for:

INFO: Loading model organization/model-name
INFO: Model loaded successfully in 15.3s

Step 5: Test Inference

curl -X POST http://localhost:8000/api/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/test.mp4",
    "model": "my-custom-model",
    "frame_count": 4
  }'

Inference Settings

Max Memory Per Model

Controls VRAM allocation per model:

inference:
  max_memory_per_model: "auto"  # Automatic allocation
  # or
  max_memory_per_model: "20GB"  # Fixed limit

auto: Service calculates based on available VRAM and model requirements.

Fixed value: Hard limit in GB. Model loading fails if exceeded.

Offload Threshold

Triggers CPU offloading when VRAM usage exceeds threshold:

inference:
  offload_threshold: 0.85  # 85% VRAM usage

When VRAM usage exceeds 85%:

  1. Service identifies least recently used model
  2. Offloads model layers to CPU RAM
  3. Frees VRAM for new models
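
For example, on an 80 GB GPU a threshold of 0.85 triggers offloading once allocated VRAM passes roughly 68 GB (0.85 × 80 GB).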

Warmup on Startup

Load models at service startup instead of on first request:

inference:
  warmup_on_startup: true

Advantages:

  • Faster first request
  • Validate models at startup
  • Detect configuration errors early

Disadvantages:

  • Slower startup time (30-60 seconds)
  • VRAM allocated immediately
  • Not suitable for CPU mode

When to use: Production GPU deployments with predictable model usage.

Batch Size

Control batch processing:

inference:
  default_batch_size: 1
  max_batch_size: 8

default_batch_size: Batch size when not specified in request.

max_batch_size: Maximum allowed batch size. Requests exceeding this are split into multiple batches.

Example batch request:

curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/example.mp4",
    "frame_numbers": [0, 10, 20, 30, 40, 50, 60, 70],
    "batch_size": 4
  }'

This processes the eight frames in two batches of four.

Troubleshooting

Model Not Found

Symptom: Error "Model organization/model-name not found"

Causes:

  • Typo in model_id
  • Model not on HuggingFace Hub
  • Private model without authentication

Solutions:

  1. Verify model ID:

     curl https://huggingface.co/api/models/organization/model-name

  2. Check for typos in config.

  3. Add HuggingFace token for private models:

     export HF_TOKEN=hf_xxxxxxxxxxxxx

Quantization Error

Symptom: Error "bitsandbytes not installed"

Cause: Using quantization: "4bit" with minimal build profile.

Solutions:

  1. Use recommended or full build profile:

     BUILD_MODE=recommended docker compose build model-service

  2. Or remove quantization:

     quantization: null  # Use full precision

VRAM Limit Exceeded

Symptom: Error "CUDA out of memory"

Causes:

  • Model too large for GPU
  • Multiple models loaded
  • Batch size too large

Solutions:

  1. Enable quantization:

     quantization: "4bit"

  2. Reduce max_memory_per_model:

     max_memory_per_model: "10GB"

  3. Lower offload_threshold:

     offload_threshold: 0.7  # Offload earlier

  4. Use smaller model:

     video_summarization:
       selected: "gemma-3-27b"  # Only 14 GB

Slow Model Loading

Symptom: Model takes 5+ minutes to load

Causes:

  • Downloading from HuggingFace Hub
  • Slow disk I/O
  • Large model size

Solutions:

  1. Pre-download models:

     python -c "
     from transformers import AutoModel
     AutoModel.from_pretrained('meta-llama/Llama-4-Maverick')
     "

  2. Use faster storage for the cache (NVMe SSD).

  3. Use a HuggingFace CDN mirror if available.

  4. Enable warmup_on_startup to load models during service initialization.
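
If the huggingface_hub CLI is available, the same pre-download can also be done from the shell (same model ID as in models.yaml):

huggingface-cli download meta-llama/Llama-4-Maverick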

Next Steps