Build Modes

The model service supports three build modes to balance features, build time, and image size. Choose the mode that best fits your development workflow and deployment requirements.

Build Modes Comparison

| Aspect | Minimal | Recommended | Full |
| --- | --- | --- | --- |
| Build time | 1-2 min | 1-2 min | 10-15 min |
| Image size | 3-4GB | 3-4GB | 8-10GB |
| Inference engines | Basic | Basic | All engines |
| Quantization | No | Yes (bitsandbytes) | Yes |
| vLLM/SGLang | No | No | Yes |
| SAM-2 segmentation | No | No | Yes |
| Best for | Development, CI/CD | Development with optimization | Production |
| Platforms | CPU, GPU, ARM64 | CPU, GPU, ARM64 | GPU only (Linux) |

Minimal Build

The minimal build includes only the essential dependencies for basic functionality.

Included Features

  • PyTorch 2.5+
  • Transformers 4.47+
  • Ultralytics (YOLO detection)
  • FastAPI
  • Basic video processing (OpenCV)
  • Standard tracking models

Use Cases

  • Fast iteration during development
  • CI/CD pipelines with time constraints
  • Testing basic functionality
  • Limited disk space or bandwidth
  • Quick experimentation

Building Minimal Mode

# Default for CPU mode
docker compose build model-service

# Explicit minimal build
MODEL_BUILD_MODE=minimal docker compose build model-service

# GPU with minimal build
MODEL_BUILD_MODE=minimal docker compose --profile gpu build model-service-gpu

Recommended Build

The recommended build adds quantization support for memory-efficient inference.

Included Features

All minimal features plus:

  • bitsandbytes for 4-bit and 8-bit quantization
  • Reduced VRAM usage with quantized models
  • Better memory efficiency
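
The memory savings from quantization can be sketched with quick shell arithmetic. The 7B parameter count below is a hypothetical example, and the figures cover model weights only (activations and KV cache are extra):

```shell
# Back-of-the-envelope VRAM needed for model weights alone,
# assuming a hypothetical 7-billion-parameter model.
params=7000000000
fp16_gb=$(( params * 2 / 1024 / 1024 / 1024 ))  # 16-bit weights: 2 bytes/param
int8_gb=$(( params * 1 / 1024 / 1024 / 1024 ))  # 8-bit (bitsandbytes): 1 byte/param
int4_gb=$(( params / 2 / 1024 / 1024 / 1024 ))  # 4-bit (bitsandbytes): 0.5 bytes/param
echo "fp16 ~${fp16_gb}GB, int8 ~${int8_gb}GB, int4 ~${int4_gb}GB"
```

Actual usage varies with the model and engine; this only illustrates why 4-bit and 8-bit loading helps in memory-constrained environments.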

Use Cases

  • Development with model optimization
  • Memory-constrained environments
  • Testing quantization strategies
  • Balanced performance and build time

Building Recommended Mode

# CPU mode
MODEL_BUILD_MODE=recommended docker compose build model-service

# GPU mode
MODEL_BUILD_MODE=recommended docker compose --profile gpu build model-service-gpu

Full Build

The full build includes all inference engines and features, and is intended for production use.

Included Features

All recommended features plus:

  • vLLM 0.6+ for high-throughput LLM serving
  • SGLang 0.4+ for structured generation
  • SAM-2 for video segmentation
  • All tracking models (SAMURAI, ByteTrack, BoT-SORT)
  • Production-optimized inference

Use Cases

  • Production deployments
  • Maximum feature set
  • High-throughput inference
  • Video segmentation tasks

Requirements

  • GPU required: vLLM and SGLang require an NVIDIA GPU with CUDA
  • Linux only: will not build on macOS or Windows
  • CUDA 11.8+: minimum supported CUDA version
  • 20GB disk space: needed for the large dependencies
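
A quick preflight check can catch these requirements before a long build starts. This is a sketch: it assumes Docker's data lives under /var/lib/docker (adjust the path for your setup) and only warns rather than failing:

```shell
# Preflight check before a full build: Linux + NVIDIA GPU + ~20GB free disk.
os="$(uname -s)"
[ "$os" = "Linux" ] || echo "WARN: full mode builds on Linux only (got $os)"

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
else
  echo "WARN: no NVIDIA GPU detected; use minimal or recommended mode"
fi

# Free space (GB) on the filesystem holding Docker data
free_gb="$(df -Pk /var/lib/docker 2>/dev/null | awk 'NR==2 {print int($4/1024/1024)}')"
[ "${free_gb:-0}" -ge 20 ] || echo "WARN: less than 20GB free for the build"
```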

Building Full Mode

# GPU mode (default for GPU profile)
docker compose --profile gpu build model-service-gpu

# Explicit full build
MODEL_BUILD_MODE=full docker compose --profile gpu build model-service-gpu

Note: Full build will fail on CPU-only systems or macOS due to GPU-specific dependencies.

Setting Build Mode

Using Environment Variables

Set MODEL_BUILD_MODE before building:

# Minimal (default)
MODEL_BUILD_MODE=minimal docker compose build model-service

# Recommended
MODEL_BUILD_MODE=recommended docker compose build model-service

# Full (GPU only)
MODEL_BUILD_MODE=full docker compose --profile gpu build model-service-gpu

Using .env File

Add to .env:

MODEL_BUILD_MODE=recommended

Then build:

docker compose build model-service

Default Build Modes

Without explicit configuration:

  • CPU mode: Uses minimal build
  • GPU mode: Uses full build
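
The defaults above can be mirrored in a small helper for scripts that wrap the build (a sketch; the actual defaults live in the compose files, this just restates them):

```shell
# Mirror of the documented defaults: GPU profile -> full, anything else -> minimal.
default_mode() {
  if [ "$1" = "gpu" ]; then echo full; else echo minimal; fi
}

default_mode gpu   # -> full
default_mode cpu   # -> minimal
```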

Switching Between Modes

To switch build modes, rebuild the service:

# Stop service
docker compose down

# Rebuild with new mode
MODEL_BUILD_MODE=recommended docker compose build model-service

# Start service
docker compose up -d

Data in volumes is preserved during rebuild.

Build Mode Details

Minimal Dependencies

# Core packages (minimal mode)
torch==2.5.0
transformers==4.47.1
ultralytics
opencv-python
fastapi

Recommended Dependencies

# Minimal + quantization
bitsandbytes
accelerate

Full Dependencies

# Recommended + inference engines
vllm==0.6.0
sglang==0.4.0
sam2
supervision

Platform Compatibility

| Build Mode | Linux CPU | Linux GPU | macOS | Windows |
| --- | --- | --- | --- | --- |
| Minimal | Yes | Yes | Yes | Yes* |
| Recommended | Yes | Yes | Yes | Yes* |
| Full | No | Yes | No | No |

*Windows requires WSL2 for Docker

Build Time Optimization

Use BuildKit

Enable BuildKit for faster builds:

export DOCKER_BUILDKIT=1
docker compose build

Use Build Cache

BuildKit caches layers automatically. Subsequent builds are faster:

# First build: 10 minutes
docker compose --profile gpu build

# Rebuild after code change: 1-2 minutes
docker compose --profile gpu build

Parallel Builds

Build multiple services in parallel:

docker compose build --parallel

Troubleshooting

Build Fails on macOS/Windows

If building full mode fails:

  • Solution: Use minimal or recommended mode
  • Reason: vLLM and SGLang require Linux + NVIDIA GPU
  • Alternative: Build on Linux GPU instance or cloud

Out of Disk Space

If build fails with disk space errors:

# Clean Docker cache
docker system prune -a

# Check available space
df -h

Build Takes Too Long

If builds are slow:

  • Use minimal mode for development
  • Enable BuildKit caching
  • Ensure adequate internet bandwidth for package downloads
  • Use Docker layer caching in CI/CD

ImportError After Switching Modes

If you see import errors after switching modes:

# Rebuild completely without cache
docker compose build --no-cache model-service

# Restart service
docker compose up -d model-service

Recommendations by Use Case

Local Development

  • Mode: Minimal
  • Reason: Fast builds, quick iteration
  • Command: docker compose up -d

Development with Optimization

  • Mode: Recommended
  • Reason: Test quantization without long builds
  • Command: MODEL_BUILD_MODE=recommended docker compose up -d --build

CI/CD Pipeline

  • Mode: Minimal
  • Reason: Minimize build time, test core functionality
  • Command: MODEL_BUILD_MODE=minimal docker compose build

Production Deployment

  • Mode: Full (GPU only)
  • Reason: All features, maximum performance
  • Command: docker compose --profile gpu up -d

Next Steps