Model Service Overview

The model service provides AI inference capabilities for video analysis, object detection, tracking, and ontology augmentation. Built with FastAPI, PyTorch, and Transformers, it supports both CPU and GPU execution.

Architecture

The model service uses a layered architecture:

  • FastAPI Application Layer
  • Model Manager (Lazy Loading)
  • Inference Engines (SGLang, vLLM, PyTorch)
  • Model Weights Cache

Core Components

FastAPI Application (src/main.py): HTTP API server handling requests and responses.

Model Manager (src/model_manager.py): Lazy model loading and memory management. Models load on first use, not at startup.

Inference Engines:

  • SGLang: Primary engine for LLM and VLM inference
  • vLLM: Fallback engine for high-throughput LLM serving
  • PyTorch: Direct inference for detection and tracking models

Configuration (config/models.yaml): Model selection and parameters.

Lazy Loading System

Models load only when needed. This approach:

  • Reduces startup time from minutes to seconds
  • Allows running without GPU during development
  • Enables selective model loading based on usage
  • Conserves VRAM by loading models on demand

Example: A video summarization request triggers VLM loading, but detection models remain unloaded until needed.
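The sketch below shows one way such a lazy-loading manager can be structured. The class and method names are illustrative only and are not the actual src/model_manager.py API.

from typing import Any, Callable, Dict

class LazyModelManager:
    """Illustrative lazy loader: model weights are touched only on first use."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}  # cheap factories registered at startup
        self._loaded: Dict[str, Any] = {}                  # models already in memory

    def register(self, model_id: str, loader: Callable[[], Any]) -> None:
        # Registering a loader does not download or load any weights.
        self._loaders[model_id] = loader

    def get(self, model_id: str) -> Any:
        # Load on first request, then serve from the in-memory cache.
        if model_id not in self._loaded:
            self._loaded[model_id] = self._loaders[model_id]()
        return self._loaded[model_id]

    def unload(self, model_id: str) -> None:
        # Drop a model to free RAM/VRAM; it will be reloaded on next use.
        self._loaded.pop(model_id, None)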

Available Tasks

Video Summarization

VLM-based analysis generates text descriptions from video frames.

Endpoint: POST /api/summarize

Models: Llama-4-Maverick, Gemma-3-27b, InternVL3-78B, Pixtral-Large, Qwen2.5-VL-72B

Use cases:

  • Generate video summaries for annotation context
  • Extract text from video frames (OCR)
  • Identify key events in footage

See Video Summarization for details.

Object Detection

Detect and localize objects in video frames.

Endpoint: POST /api/detect

Models: YOLO-World-v2, GroundingDINO 1.5, OWLv2, Florence-2

Use cases:

  • Initialize bounding boxes for annotation
  • Detect specific object classes (COCO dataset)
  • Zero-shot detection with text prompts

See Object Detection for details.

Video Tracking

Track objects across multiple frames.

Endpoint: POST /api/track

Models: SAMURAI, SAM2Long, SAM2.1, YOLO11n-seg

Use cases:

  • Generate annotation sequences automatically
  • Track moving objects through occlusions
  • Reduce manual keyframe placement

See Video Tracking for details.

Ontology Augmentation

LLM-based suggestions for ontology types and relationships.

Endpoint: POST /api/augment

Models: Llama-4-Scout, Llama-3.3-70B, DeepSeek-V3, Gemma-3-27b

Use cases:

  • Suggest entity types based on domain
  • Generate relationship definitions
  • Expand ontologies with persona context

See Ontology Augmentation for details.

Inference Engines

SGLang (Primary)

SGLang provides fast inference with structured generation support.

Advantages:

  • Supports both LLM and VLM models
  • 10M context length for Llama-4 models
  • Batching and continuous batching
  • JSON mode for structured outputs

When to use: Default for all VLM and LLM tasks.

vLLM (Fallback)

vLLM offers high-throughput serving for large batches.

Advantages:

  • PagedAttention for memory efficiency
  • Dynamic batching
  • Tensor parallelism for multi-GPU

When to use: When SGLang is unavailable, or for batch processing.

PyTorch (Direct)

Direct PyTorch inference for detection and tracking.

Advantages:

  • Full control over inference pipeline
  • Custom preprocessing and postprocessing
  • Lower overhead for single predictions

When to use: Object detection and tracking tasks.
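To make the routing above concrete, here is a minimal, hypothetical selection function. It mirrors the priorities described in this section (PyTorch for detection/tracking, SGLang first for LLM/VLM tasks, vLLM as fallback) but is not the service's actual code.

def select_engine(task: str, sglang_available: bool = True) -> str:
    # Detection and tracking always run through direct PyTorch inference.
    if task in ("object_detection", "video_tracking"):
        return "pytorch"
    # LLM/VLM tasks prefer SGLang and fall back to vLLM.
    return "sglang" if sglang_available else "vllm"

assert select_engine("object_detection") == "pytorch"
assert select_engine("video_summarization") == "sglang"
assert select_engine("ontology_augmentation", sglang_available=False) == "vllm"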

System Requirements

CPU Mode (Development)

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| CPU | 8 cores | 16 cores |
| RAM | 16 GB | 32 GB |
| Storage | 50 GB | 100 GB |
| OS | Linux, macOS, Windows | Linux |

Model availability: PyTorch models (detection, tracking) are practical on CPU. VLM inference runs but is slow (30-60 seconds per frame).

When to use:

  • Local development without GPU
  • Testing API endpoints
  • Annotation workflows without AI assistance

GPU Mode (Production)

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU | NVIDIA T4 (16GB VRAM) | A100 (40GB VRAM) |
| CPU | 8 cores | 16 cores |
| RAM | 32 GB | 64 GB |
| Storage | 100 GB | 500 GB |
| OS | Linux with CUDA 12.1+ | Ubuntu 22.04 |

Model availability: All models supported.

When to use:

  • Production deployments
  • Real-time video processing
  • Multiple concurrent users

VRAM Requirements by Model

| Model | Task | VRAM (4-bit) | VRAM (full) |
|-------|------|--------------|-------------|
| Llama-4-Maverick | Summarization | 62 GB | 240 GB |
| Gemma-3-27b | Summarization | 14 GB | 54 GB |
| InternVL3-78B | Summarization | 40 GB | 156 GB |
| Qwen2.5-VL-72B | Summarization | 36 GB | 144 GB |
| Llama-4-Scout | Augmentation | 55 GB | 220 GB |
| DeepSeek-V3 | Augmentation | 85 GB | 340 GB |
| YOLO-World-v2 | Detection | 2 GB | 2 GB |
| GroundingDINO 1.5 | Detection | 4 GB | 4 GB |
| SAMURAI | Tracking | 3 GB | 3 GB |
| SAM2.1 | Tracking | 3 GB | 3 GB |

4-bit quantization reduces VRAM usage by approximately 75% with minimal accuracy loss.
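As an illustration of what 4-bit loading looks like with Hugging Face Transformers and bitsandbytes (the model ID is a placeholder and the service's internal loading path may differ):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/accuracy
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",    # placeholder; any supported causal LM
    quantization_config=bnb_config,
    device_map="auto",                      # shard layers across available GPUs
)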

Service Endpoints

Health Check

curl http://localhost:8000/health

Response:

{
  "status": "healthy",
  "models_loaded": ["llama-4-maverick"],
  "device": "cuda",
  "gpu_memory_allocated": "14.5 GB",
  "gpu_memory_reserved": "16.0 GB"
}
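A small client-side helper (a sketch, not part of the service) can poll this endpoint until the service reports healthy before any inference requests are sent:

import time
import requests

def wait_for_healthy(base_url: str = "http://localhost:8000", timeout: float = 120.0) -> dict:
    # Poll /health until the service reports "healthy" or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            resp = requests.get(f"{base_url}/health", timeout=5)
            if resp.ok and resp.json().get("status") == "healthy":
                return resp.json()
        except requests.ConnectionError:
            pass  # service not up yet; keep retrying
        time.sleep(2)
    raise TimeoutError("model service did not become healthy in time")

info = wait_for_healthy()
print(info["device"], info.get("models_loaded", []))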

Model Info

curl http://localhost:8000/models/info

Response:

{
  "available_models": {
    "video_summarization": ["llama-4-maverick", "gemma-3-27b", "qwen2-5-vl-72b"],
    "ontology_augmentation": ["llama-4-scout", "llama-3-3-70b"],
    "object_detection": ["yolo-world-v2", "grounding-dino-1-5"],
    "video_tracking": ["samurai", "sam2-1"]
  },
  "loaded_models": ["llama-4-maverick"],
  "device": "cuda:0"
}

Summarize Video

curl -X POST http://localhost:8000/api/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/example.mp4",
    "persona_context": "Baseball game analyst",
    "frame_count": 8,
    "sampling_strategy": "uniform"
  }'

See Video Summarization for full API.
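The same request can also be issued from Python with the requests library; the payload mirrors the curl example above, and the response schema is the one documented in Video Summarization.

import requests

payload = {
    "video_path": "/data/example.mp4",
    "persona_context": "Baseball game analyst",
    "frame_count": 8,
    "sampling_strategy": "uniform",
}

# A long timeout leaves room for the first-request model load (see Latency below).
resp = requests.post("http://localhost:8000/api/summarize", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json())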

Detect Objects

curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/example.mp4",
    "frame_numbers": [0, 10, 20],
    "model": "yolo-world-v2",
    "confidence_threshold": 0.5
  }'

See Object Detection for full API.

Track Objects

curl -X POST http://localhost:8000/api/track \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/example.mp4",
    "frame_range": {"start": 0, "end": 100},
    "tracking_model": "samurai",
    "confidence_threshold": 0.7
  }'

See Video Tracking for full API.

Augment Ontology

curl -X POST http://localhost:8000/api/augment \
  -H "Content-Type: application/json" \
  -d '{
    "persona_name": "Baseball Analyst",
    "existing_ontology": {"entity_types": ["Player", "Ball"]},
    "domain_context": "Baseball game analysis",
    "task_description": "Annotating pitcher actions"
  }'

See Ontology Augmentation for full API.

Performance Characteristics

Throughput

| Task | Model | CPU | GPU (T4) | GPU (A100) |
|------|-------|-----|----------|------------|
| Summarization | Gemma-3-27b | 0.5 frames/sec | 4 frames/sec | 12 frames/sec |
| Summarization | Qwen2.5-VL-72B | 0.2 frames/sec | 2 frames/sec | 8 frames/sec |
| Detection | YOLO-World-v2 | 15 frames/sec | 52 frames/sec | 85 frames/sec |
| Detection | GroundingDINO | 8 frames/sec | 20 frames/sec | 35 frames/sec |
| Tracking | SAMURAI | 5 frames/sec | 25 frames/sec | 45 frames/sec |
| Tracking | SAM2.1 | 6 frames/sec | 30 frames/sec | 50 frames/sec |

Latency

| Task | First Request | Subsequent Requests |
|------|---------------|---------------------|
| Summarization | 15-30 seconds (model load) | 0.5-2 seconds |
| Detection | 5-10 seconds (model load) | 0.05-0.2 seconds |
| Tracking | 5-10 seconds (model load) | 0.1-0.5 seconds |
| Augmentation | 15-30 seconds (model load) | 1-5 seconds |

First request includes model loading time. Subsequent requests use cached models.
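One practical consequence: a deployment can absorb the model-load cost itself by firing a cheap warm-up request right after startup, so end users only see warm latency. The sketch below reuses the documented /api/detect fields; the warm-up clip path is a placeholder.

import requests

warmup_payload = {
    "video_path": "/data/warmup.mp4",   # placeholder: any short clip visible to the service
    "frame_numbers": [0],
    "model": "yolo-world-v2",
    "confidence_threshold": 0.5,
}
# The first call triggers model loading; later user requests hit the cached model.
requests.post("http://localhost:8000/api/detect", json=warmup_payload, timeout=120)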

When to Use Model Service

Use Model Service When:

  1. Automating repetitive tasks: Detecting hundreds of objects across video frames.
  2. Bootstrapping annotations: Generating initial bounding boxes for manual refinement.
  3. Analyzing large video datasets: Processing hours of footage efficiently.
  4. Enriching ontologies: Suggesting types and relationships from domain knowledge.
  5. Extracting video content: OCR, scene detection, or summarization.

Skip Model Service When:

  1. Annotating small datasets: Manual annotation may be faster for 10-20 objects.
  2. Complex edge cases: AI struggles with unusual perspectives or rare objects.
  3. Precision requirements: Manual annotation provides exact bounding boxes.
  4. Resource constraints: CPU inference is too slow for interactive use.
  5. Offline environments: Model downloads require internet connection.

Troubleshooting

Model Loading Fails

Symptom: Error "Failed to load model llama-4-maverick"

Causes:

  • Insufficient VRAM
  • Missing model files in cache
  • Incorrect model configuration

Solutions:

  1. Check VRAM availability:
     nvidia-smi
  2. Verify the model cache:
     ls ~/.cache/huggingface/hub/
  3. Switch to a smaller model in config/models.yaml:
     video_summarization:
       selected: "gemma-3-27b"  # Requires only 14 GB VRAM

Slow Inference on CPU

Symptom: Summarization takes 60+ seconds per frame

Cause: CPU inference is inherently slow for large models.

Solutions:

  1. Use GPU mode if available.
  2. Switch to lighter models (Gemma-3-27b instead of Llama-4-Maverick).
  3. Reduce frame count in requests.
  4. Use detection/tracking only (skip summarization).

CUDA Out of Memory

Symptom: Error "RuntimeError: CUDA out of memory"

Causes:

  • Model too large for GPU
  • Multiple models loaded simultaneously
  • Batch size too large

Solutions:

  1. Enable 4-bit quantization in config:
     quantization: "4bit"
  2. Unload unused models:
     curl -X POST http://localhost:8000/models/unload \
       -H "Content-Type: application/json" \
       -d '{"model_id": "unused-model"}'
  3. Reduce batch size in requests.

Connection Refused

Symptom: Error "Connection refused to localhost:8000"

Causes:

  • Service not running
  • Port conflict
  • Firewall blocking

Solutions:

  1. Check service status:
     docker compose ps model-service
  2. Check logs:
     docker compose logs model-service
  3. Verify port availability:
     lsof -i :8000

Next Steps