GPU Mode Deployment

GPU mode enables GPU-accelerated inference for significantly faster processing. This mode requires NVIDIA GPU with CUDA support and is recommended for production deployments.

Overview

GPU mode uses Docker Compose GPU profiles to start the GPU-enabled model service. All inference operations run on GPU, providing 5-10x speedup compared to CPU mode.

Best for:

Production deployments
High-volume video processing (10+ videos/day)
Real-time or near-real-time inference
Processing high-resolution video
Running large AI models

Prerequisites

Ensure you have met the GPU mode prerequisites:

Docker Engine 24.0+
Docker Compose 2.20+
NVIDIA GPU with 8GB+ VRAM (24GB recommended)
NVIDIA Driver 525.60+
CUDA Toolkit 11.8+
NVIDIA Container Toolkit (nvidia-docker2)
8 cores minimum (16 cores recommended)
16GB RAM minimum (32GB recommended)
50GB disk space minimum (100GB recommended)

Important: Complete GPU setup before proceeding.

Step-by-Step Deployment

Step 1: Verify GPU Access

Before deployment, verify Docker can access GPU:

nvidia-smi

Expected output shows GPU information, driver version, and CUDA version.

Test Docker GPU access:

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

This should display GPU information from inside a container.

Step 2: Clone Repository

git clone https://github.com/parafovea/fovea.git
cd fovea

Step 3: Configure Environment (Optional)

Default GPU configuration is suitable for most deployments. To customize:

cp .env.example .env

Key GPU-related settings in .env:

# GPU Configuration
CUDA_VISIBLE_DEVICES=0,1,2,3  # GPU indices to use
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512  # Memory management

See Configuration for details.

Step 4: Start Services with GPU Profile

Start all services with GPU acceleration:

docker compose --profile gpu up -d

This command:

Downloads Docker images (first run only)
Creates persistent volumes
Starts model-service-gpu with GPU access
Enables GPU-accelerated inference

Initial startup takes 5-10 minutes for image download and model loading.

Step 5: Verify GPU Detection

Check GPU is detected by model service:

docker compose exec model-service-gpu nvidia-smi

Expected output shows GPU information from inside the container.

Check CUDA availability in PyTorch:

docker compose exec model-service-gpu python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU count: {torch.cuda.device_count()}')"

Expected output:

CUDA available: True
GPU count: 1

Step 6: Verify Service Health

Check model service is using GPU:

curl http://localhost:8000/health

Expected response includes "device":"cuda".

Check backend connectivity:

curl http://localhost:3001/api/health

Expected response: {"status":"ok"}

Step 7: Access Application

Open your browser and navigate to:

Frontend: http://localhost:3000

Model Service docs: http://localhost:8000/docs (Swagger UI)

Grafana dashboards: http://localhost:3002 (admin/admin)

Performance Expectations

GPU mode provides significant performance improvements over CPU mode. Performance varies based on:

GPU VRAM (larger models require more memory)
GPU compute capability (newer GPUs perform better)
Batch size configuration
Model size and precision settings

Typical speedup compared to CPU mode: 5-10x for most inference operations.

Performance scales with GPU capabilities. High-end GPUs (RTX 4090, A100) provide faster inference than entry-level GPUs (RTX 3070).

Resource Usage

Typical resource consumption in GPU mode:

Resource	Usage	Notes
RAM	8-12GB	System memory
GPU VRAM	4-8GB	Model and batch data
GPU Utilization	60-90%	During inference
Disk	20GB	Larger images with full build

Monitor GPU usage:

# From host
nvidia-smi -l 1

# From container
docker compose exec model-service-gpu nvidia-smi -l 1

Monitor overall resource usage:

docker stats

Troubleshooting GPU Issues

GPU Not Detected

If GPU is not visible inside container:

# Check GPU profile is active
docker compose --profile gpu ps

# Verify CUDA_VISIBLE_DEVICES
docker compose exec model-service-gpu env | grep CUDA

# Check Docker runtime
docker info | grep -i runtime

Ensure you started with --profile gpu flag.

CUDA Out of Memory

If you see "CUDA out of memory" errors:

Reduce batch sizes in model-service/config/models.yaml
Use fewer GPUs: Set CUDA_VISIBLE_DEVICES=0 in .env
Use smaller models: Switch to lighter models in configuration
Increase GPU memory: Upgrade to GPU with more VRAM

Driver Version Mismatch

If you see CUDA version incompatibility errors:

# Check driver CUDA version
nvidia-smi

# Check toolkit version in container
docker compose exec model-service-gpu nvcc --version

Update NVIDIA driver to 525.60+ if needed.

Container Fails to Start

Check logs for specific errors:

docker compose --profile gpu logs model-service-gpu

Common issues:

NVIDIA Container Toolkit not installed
Docker not restarted after toolkit installation
Insufficient permissions to access /dev/nvidia*

GPU Memory Management

Monitor Memory Usage

Check current memory usage:

nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv

Optimize Memory Usage

Edit .env for memory optimization:

# Limit memory fragmentation
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,roundup_power2_divisions:16

# Use mixed precision training/inference
MIXED_PRECISION=true

Clear GPU Cache

If memory is not released:

docker compose restart model-service-gpu

Multi-GPU Setup

To use multiple GPUs:

Configure GPU indices in .env:

CUDA_VISIBLE_DEVICES=0,1,2,3  # Use GPUs 0-3

Restart services:

docker compose --profile gpu down
docker compose --profile gpu up -d

Verify all GPUs detected:

docker compose exec model-service-gpu python -c "import torch; print(torch.cuda.device_count())"

Model service will automatically distribute workload across available GPUs.

Production Considerations

Security

Change default passwords in .env
Restrict access to ports 3001, 8000
Use HTTPS reverse proxy (nginx, Traefik)
Enable authentication on Grafana

Monitoring

Set up alerts for GPU memory usage >90%
Monitor GPU temperature
Track inference latency via Prometheus
Review Grafana dashboards regularly

See Monitoring Overview for details.

Backup Strategy

Backup PostgreSQL database regularly
Store video files on reliable storage
Keep .env configuration in secure location

See Common Tasks for backup commands.

Switching from CPU to GPU

To switch existing CPU deployment to GPU:

# Stop CPU services
docker compose down

# Start GPU services
docker compose --profile gpu up -d

Data in volumes is preserved during the switch.

Next Steps

Build Modes: Understand minimal vs full builds
Configuration: Advanced GPU configuration
Monitoring: Set up monitoring
Common Tasks: Daily operations

Overview​

Prerequisites​

Step-by-Step Deployment​

Step 1: Verify GPU Access​

Step 2: Clone Repository​

Step 3: Configure Environment (Optional)​

Step 4: Start Services with GPU Profile​

Step 5: Verify GPU Detection​

Step 6: Verify Service Health​

Step 7: Access Application​

Performance Expectations​

Resource Usage​

Troubleshooting GPU Issues​

GPU Not Detected​

CUDA Out of Memory​

Driver Version Mismatch​

Container Fails to Start​

GPU Memory Management​

Monitor Memory Usage​

Optimize Memory Usage​

Clear GPU Cache​

Multi-GPU Setup​

Production Considerations​

Security​

Monitoring​

Backup Strategy​

Switching from CPU to GPU​

Next Steps​