Model Management API
The Model Management API provides endpoints for configuring ML models, monitoring memory usage, and managing model selection for different AI tasks. These endpoints proxy requests to the Python model service.
Overview
Fovea supports multiple AI models for different tasks (object detection, tracking, video summarization). The Model Management API allows you to:
- View available models and their specifications
- Select which model to use for each task
- Monitor loaded models and VRAM usage
- Validate memory budgets before loading models
Supported Tasks
Object Detection:
- YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l
- GroundingDINO (zero-shot detection)
Object Tracking:
- SAMURAI (segment anything with tracking)
- SAM2, SAM2Long
- YOLO11n-seg
- ByteTrack, BoT-SORT
Video Summarization:
- LLaVA-NeXT variants (local VLMs)
- External APIs (Anthropic Claude, OpenAI GPT-4o, Google Gemini)
Base URL
http://localhost:3001/api/models
Authentication
No authentication required. Model configuration is system-wide and not user-specific.
Endpoints
Get Model Configuration
Retrieve available models for all tasks and currently selected models.
Endpoint:
GET /api/models/config
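Example Request:
curl http://localhost:3001/api/models/config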
Response 200 (Success):
{
  "models": {
    "video_summarization": {
      "selected": "llama-4-scout",
      "options": {
        "llama-4-scout": {
          "model_id": "meta-llama/Llama-4-Scout",
          "framework": "sglang",
          "vram_gb": 8.0,
          "speed": "fast",
          "description": "Fast VLM for video understanding",
          "fps": 2.5
        },
        "llama-4-maverick": {
          "model_id": "meta-llama/Llama-4-Maverick",
          "framework": "vllm",
          "vram_gb": 16.0,
          "speed": "medium",
          "description": "High-quality VLM",
          "fps": 1.2
        }
      }
    },
    "object_detection": {
      "selected": "yolov8n",
      "options": {
        "yolov8n": {
          "model_id": "yolov8n",
          "vram_mb": 512,
          "speed": "fast",
          "description": "Nano YOLO model"
        },
        "groundingdino": {
          "model_id": "IDEA-Research/grounding-dino-base",
          "vram_mb": 2048,
          "speed": "medium",
          "description": "Zero-shot detection"
        }
      }
    },
    "object_tracking": {
      "selected": null,
      "options": {
        "samurai": {
          "model_id": "yangchris11/samurai",
          "vram_mb": 4096,
          "speed": "slow",
          "description": "Segment anything with tracking"
        },
        "bytetrack": {
          "model_id": "bytetrack",
          "vram_mb": 256,
          "speed": "fast",
          "description": "Lightweight tracking"
        }
      }
    }
  },
  "inference": {
    "max_memory_per_model": 24.0,
    "offload_threshold": 0.8,
    "warmup_on_startup": true,
    "default_batch_size": 1,
    "max_batch_size": 8
  },
  "cuda_available": true,
  "total_vram_gb": 24.0
}
Response Fields:
| Field | Type | Description |
|---|---|---|
models | object | Models grouped by task type |
models.<task>.selected | string or null | Currently selected model name
models.<task>.options | object | Available models for this task |
inference | object | Global inference configuration |
cuda_available | boolean | Whether CUDA/GPU is available |
total_vram_gb | number | Total VRAM in gigabytes |
Model Option Fields:
| Field | Type | Description |
|---|---|---|
model_id | string | Hugging Face model ID or local name |
framework | string | Inference framework (sglang, vllm, transformers) |
vram_gb or vram_mb | number | VRAM requirement (in GB or MB, matching the field name)
speed | string | Speed classification (fast, medium, slow) |
description | string | Human-readable description |
fps | number | Frames per second (for VLMs) |
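For typed clients, the configuration response can be modeled roughly as the following TypeScript interfaces. This is a sketch inferred from the example response and field tables above; either vram_gb or vram_mb is present per model, and framework and fps appear only on VLM entries, so those fields are optional here.
// Sketch of the /api/models/config response shape, inferred from the example above.
interface ModelOption {
  model_id: string                      // Hugging Face model ID or local name
  framework?: string                    // sglang, vllm, or transformers (VLMs only)
  vram_gb?: number                      // VRAM requirement in GB (VLMs)
  vram_mb?: number                      // VRAM requirement in MB (detection/tracking)
  speed: 'fast' | 'medium' | 'slow'
  description: string
  fps?: number                          // frames per second (VLMs only)
}

interface TaskModels {
  selected: string | null               // currently selected model name
  options: Record<string, ModelOption>  // available models for this task
}

interface ModelConfigResponse {
  models: Record<string, TaskModels>    // keyed by task type
  inference: {
    max_memory_per_model: number
    offload_threshold: number
    warmup_on_startup: boolean
    default_batch_size: number
    max_batch_size: number
  }
  cuda_available: boolean
  total_vram_gb: number
}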
Response 500 (Server Error):
{
"error": "Model service unavailable"
}
Response 503 (Service Unavailable):
{
"error": "Connection to model service failed"
}
Get Model Status
Retrieve information about currently loaded models, memory usage, and health status.
Endpoint:
GET /api/models/status
Response 200 (Success):
{
  "loaded_models": {
    "video_summarization": {
      "model_id": "meta-llama/Llama-4-Scout",
      "memory_usage_gb": 7.8,
      "load_time": 12.5
    },
    "object_detection": {
      "model_id": "yolov8n",
      "memory_usage_gb": 0.5,
      "load_time": 0.8
    }
  },
  "total_vram_allocated_gb": 8.3,
  "total_vram_available_gb": 24.0,
  "cuda_available": true,
  "device": "cuda:0"
}
Response Fields:
| Field | Type | Description |
|---|---|---|
loaded_models | object | Currently loaded models by task |
loaded_models.<task>.model_id | string | Model identifier |
loaded_models.<task>.memory_usage_gb | number | VRAM used by this model |
loaded_models.<task>.load_time | number | Load time in seconds |
total_vram_allocated_gb | number | Total VRAM allocated |
total_vram_available_gb | number | Total VRAM capacity |
cuda_available | boolean | GPU availability |
device | string | PyTorch device string |
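The status response can be modeled the same way; this TypeScript sketch is inferred from the example and field table above.
// Sketch of the /api/models/status response shape, inferred from the example above.
interface LoadedModel {
  model_id: string          // model identifier
  memory_usage_gb: number   // VRAM used by this model
  load_time: number         // load time in seconds
}

interface ModelStatusResponse {
  loaded_models: Record<string, LoadedModel>  // keyed by task type
  total_vram_allocated_gb: number
  total_vram_available_gb: number
  cuda_available: boolean
  device: string                              // PyTorch device string, e.g. "cuda:0"
}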
Response 500 (Server Error):
{
"error": "Failed to query model status"
}
Select Model
Select a specific model for a task type. This may trigger model loading.
Endpoint:
POST /api/models/select
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
task_type | string | Yes | Task type (video_summarization, object_detection, object_tracking) |
model_name | string | Yes | Model name from configuration (e.g., "llama-4-scout", "yolov8n") |
Example Request:
curl -X POST "http://localhost:3001/api/models/select?task_type=video_summarization&model_name=llama-4-scout"
Response 200 (Success):
{
  "status": "success",
  "task_type": "video_summarization",
  "selected_model": "llama-4-scout",
  "model_id": "meta-llama/Llama-4-Scout",
  "message": "Model selection updated"
}
Response 400 (Bad Request):
{
"error": "Invalid task_type: invalid_task"
}
Response 404 (Not Found):
{
"error": "Model 'nonexistent-model' not found for task 'video_summarization'"
}
Response 500 (Server Error):
{
"error": "Failed to load model: Out of memory"
}
Response 503 (Service Unavailable):
{
"error": "Model service timeout during model loading"
}
Validate Memory Budget
Check whether currently selected models can fit in available VRAM.
Endpoint:
POST /api/models/validate
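Example Request:
curl -X POST http://localhost:3001/api/models/validate
No query parameters or request body are required; validation runs against the currently selected models.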
Response 200 (Valid Budget):
{
  "valid": true,
  "total_vram_gb": 24.0,
  "total_required_gb": 18.3,
  "threshold": 0.8,
  "max_allowed_gb": 19.2,
  "model_requirements": {
    "video_summarization": {
      "model_name": "llama-4-maverick",
      "vram_gb": 16.0
    },
    "object_detection": {
      "model_name": "yolov8s",
      "vram_gb": 1.5
    },
    "object_tracking": {
      "model_name": "bytetrack",
      "vram_gb": 0.8
    }
  },
  "warnings": []
}
Response 200 (Invalid Budget):
{
  "valid": false,
  "total_vram_gb": 8.0,
  "total_required_gb": 20.5,
  "threshold": 0.8,
  "max_allowed_gb": 6.4,
  "model_requirements": {
    "video_summarization": {
      "model_name": "llama-4-maverick",
      "vram_gb": 16.0
    },
    "object_detection": {
      "model_name": "yolov8l",
      "vram_gb": 4.0
    },
    "object_tracking": {
      "model_name": "samurai",
      "vram_gb": 0.5
    }
  },
  "warnings": [
    "Total required VRAM (20.5 GB) exceeds available VRAM (8.0 GB)",
    "Consider selecting smaller models or reducing simultaneous tasks"
  ]
}
Response Fields:
| Field | Type | Description |
|---|---|---|
valid | boolean | Whether budget is valid |
total_vram_gb | number | Total VRAM capacity |
total_required_gb | number | VRAM needed for selected models |
threshold | number | Maximum utilization threshold (0-1) |
max_allowed_gb | number | Maximum VRAM allocation allowed |
model_requirements | object | VRAM per task |
warnings | string[] | Warnings about memory usage |
Response 500 (Server Error):
{
"error": "Failed to validate memory budget"
}
Usage Examples
Check Available Models
const response = await fetch('/api/models/config')
const config = await response.json()
// List detection models
Object.entries(config.models.object_detection.options).forEach(([name, model]) => {
  console.log(`${name}: ${model.description} (${model.vram_mb} MB)`)
})
// Get current selection
console.log('Current detection model:', config.models.object_detection.selected)
Switch to Different Model
// Select YOLOv8m for better accuracy
const response = await fetch(
  '/api/models/select?task_type=object_detection&model_name=yolov8m',
  { method: 'POST' }
)
if (response.ok) {
  const result = await response.json()
  console.log('Model selected:', result.selected_model)
} else {
  const error = await response.json()
  console.error('Selection failed:', error.error)
}
Validate Before Selection
// Check if we have enough memory for high-quality models
const validateResponse = await fetch('/api/models/validate', {
  method: 'POST'
})
const validation = await validateResponse.json()
if (validation.valid) {
  // Safe to use selected models
  console.log(`Using ${validation.total_required_gb.toFixed(1)} GB of ${validation.total_vram_gb} GB`)
} else {
  // Need to select smaller models
  console.warn('Insufficient VRAM:', validation.warnings.join(', '))
}
Monitor Memory Usage
const statusResponse = await fetch('/api/models/status')
const status = await statusResponse.json()
const usagePercent = (
  (status.total_vram_allocated_gb / status.total_vram_available_gb) * 100
).toFixed(1)
console.log(`VRAM Usage: ${status.total_vram_allocated_gb.toFixed(1)} GB / ${status.total_vram_available_gb} GB (${usagePercent}%)`)
// List loaded models
Object.entries(status.loaded_models).forEach(([task, model]) => {
  console.log(`${task}: ${model.model_id} (${model.memory_usage_gb.toFixed(1)} GB)`)
})
Model Selection Strategy
Performance vs Accuracy Tradeoff
Fast Models (Low VRAM):
- Detection: YOLOv8n (512 MB)
- Tracking: ByteTrack (256 MB)
- Summarization: External API (0 MB)
Balanced Models (Medium VRAM):
- Detection: YOLOv8s (1.5 GB)
- Tracking: SAM2 (2 GB)
- Summarization: Llama-4-Scout (8 GB)
High-Quality Models (High VRAM):
- Detection: YOLOv8l (4 GB)
- Tracking: SAMURAI (4 GB)
- Summarization: Llama-4-Maverick (16 GB)
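A client can also pick a tier programmatically from the /api/models/config response. The sketch below is illustrative only: the VRAM thresholds and the specific model names are assumptions based on the figures listed above, not values defined by the API.
// Illustrative tier selection for object detection based on reported VRAM.
// Thresholds and model names are assumptions drawn from the lists above.
async function pickDetectionModel(): Promise<string> {
  const config = await (await fetch('/api/models/config')).json()
  const vramGb = config.cuda_available ? config.total_vram_gb : 0

  if (vramGb >= 16) return 'yolov8l'  // high-quality tier
  if (vramGb >= 8) return 'yolov8s'   // balanced tier
  return 'yolov8n'                    // fast / low-VRAM tier
}

const modelName = await pickDetectionModel()
await fetch(
  `/api/models/select?task_type=object_detection&model_name=${modelName}`,
  { method: 'POST' }
)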
Lazy Loading
Models are loaded on-demand when first needed. Unused models are automatically unloaded when memory is needed for other tasks.
Loading Triggers:
- First API request for a task
- Model selection change
- Explicit warmup on startup (if configured)
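Because loading is lazy, a selection request can return before the model is actually resident. One way for a client to confirm readiness is to poll /api/models/status until the task appears under loaded_models; the helper below is a sketch, and its polling interval and timeout are arbitrary choices, not API behavior.
// Sketch: wait until a task's model appears in loaded_models.
// The 2-second interval and 2-minute timeout are illustrative defaults.
async function waitForModel(taskType: string, timeoutMs = 120_000): Promise<boolean> {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    const status = await (await fetch('/api/models/status')).json()
    if (status.loaded_models[taskType]) return true
    await new Promise((resolve) => setTimeout(resolve, 2000))
  }
  return false
}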
Memory Management
The model manager uses these strategies:
- Threshold-based: Keeps usage below offload_threshold (default 80%); see the sketch after this list
- LRU Eviction: Unloads least recently used models
- Priority Loading: Critical tasks load first
- Graceful Degradation: Falls back to CPU if GPU is full
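As a rough illustration of the threshold rule (not the service's actual implementation), a model only fits while projected usage stays under offload_threshold of total VRAM:
// Illustrative check mirroring the threshold-based strategy.
// offloadThreshold corresponds to inference.offload_threshold (default 0.8).
function fitsWithinBudget(
  allocatedGb: number,   // total_vram_allocated_gb from /api/models/status
  requiredGb: number,    // VRAM requirement of the model to load
  totalVramGb: number,   // total_vram_available_gb
  offloadThreshold = 0.8
): boolean {
  const maxAllowedGb = totalVramGb * offloadThreshold
  return allocatedGb + requiredGb <= maxAllowedGb
}

// Using figures from this page: fitsWithinBudget(8.3, 4.0, 24.0)
// -> 8.3 + 4.0 = 12.3 <= 19.2, so the model fits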
Frontend Integration
Model Selection UI
The frontend provides a model configuration panel:
// annotation-tool/src/components/settings/ModelConfigPanel.tsx
import { useModelConfig, useSelectModel } from '@/hooks/useModels'
// toast and ModelSelector are assumed to come from the app's UI layer (imports omitted)
export function ModelConfigPanel() {
  const { config, loading } = useModelConfig()
  const selectModel = useSelectModel()

  const handleSelect = async (taskType: string, modelName: string) => {
    try {
      await selectModel.mutateAsync({ taskType, modelName })
      toast.success('Model selected successfully')
    } catch (error) {
      toast.error('Failed to select model')
    }
  }

  // Wait for the configuration to load before rendering selectors
  if (loading || !config) return null

  return (
    <div>
      {Object.entries(config.models).map(([task, taskConfig]) => (
        <ModelSelector
          key={task}
          task={task}
          options={taskConfig.options}
          selected={taskConfig.selected}
          onSelect={(name) => handleSelect(task, name)}
        />
      ))}
    </div>
  )
}
Memory Monitoring
Real-time VRAM usage display:
import { useModelStatus } from '@/hooks/useModels'
// ProgressBar is assumed to come from the app's UI layer (import omitted)
export function MemoryMonitor() {
  const { data: status } = useModelStatus({ refetchInterval: 5000 })

  // status is undefined until the first fetch completes
  if (!status) return null

  const usagePercent =
    (status.total_vram_allocated_gb / status.total_vram_available_gb) * 100

  return (
    <div>
      <ProgressBar value={usagePercent} />
      <span>
        {status.total_vram_allocated_gb.toFixed(1)} GB /{' '}
        {status.total_vram_available_gb} GB
      </span>
    </div>
  )
}
Configuration File
Models are configured in model-service/config/models.yaml:
tasks:
  video_summarization:
    selected: llama-4-scout
    options:
      llama-4-scout:
        model_id: meta-llama/Llama-4-Scout
        framework: sglang
        vram_gb: 8.0
        speed: fast
  object_detection:
    selected: yolov8n
    options:
      yolov8n:
        model_id: yolov8n
        vram_mb: 512
        speed: fast
inference:
  max_memory_per_model: 24.0
  offload_threshold: 0.8
  warmup_on_startup: false
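After editing the file and restarting the model service, the active selection can be read back through the documented configuration endpoint to confirm the change took effect:
// Confirm that the YAML selection is reflected by the running service.
const config = await (await fetch('/api/models/config')).json()
console.log('Detection model:', config.models.object_detection.selected)
// Expected to match the selected value in models.yaml (yolov8n in the example above)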
Best Practices
Memory Management
- Check Before Loading: Always validate before selecting high-VRAM models
- Monitor Usage: Track VRAM allocation to prevent OOM errors
- Sequential Tasks: Process one heavy task at a time if memory is limited
- Use External APIs: Offload to cloud providers for resource-intensive tasks
Model Selection
- Start Small: Begin with nano models, scale up if accuracy is insufficient
- Task Priority: Allocate more VRAM to your primary use case
- Benchmark: Test different models on your specific data
- CPU Fallback: Enable CPU mode for development/testing
Error Handling
try {
  const response = await fetch('/api/models/select?...', { method: 'POST' })
  if (!response.ok) {
    const error = await response.json()
    if (response.status === 404) {
      console.error('Model not found:', error.error)
    } else if (response.status === 500) {
      console.error('Model loading failed:', error.error)
      // Try fallback model or external API
    }
  }
} catch (error) {
  console.error('Network error:', error)
}
Troubleshooting
"Model service unavailable"
The Python model service is not running or unreachable.
Solutions:
- Check the model service container: docker ps | grep model-service
- Verify the MODEL_SERVICE_URL environment variable
- Check the model service logs: docker logs fovea-model-service
"Out of memory" during model loading
Insufficient VRAM for the selected model.
Solutions:
- Run /api/models/validate to check the memory budget
- Select a smaller model variant
- Unload unused models first
- Use CPU mode or external API
Model selection not persisting
Model selections made through the API are stored by the running model service and may be reset when the service restarts.
Solutions:
- Update config/models.yaml for persistent changes
- Set the selected field for each task in the YAML file
- Rebuild the model service container to apply the changes
See Also
- Model Service Configuration - Configuration file format
- Video Summarization - VLM models and parameters
- Object Detection - Detection models and usage
- Video Tracking - Tracking models and configuration
- External API Integration - Cloud-based models