External API Integration
The model service integrates with external model providers (Anthropic Claude, OpenAI GPT, Google Gemini) through a provider abstraction layer. This architecture enables using external APIs alongside self-hosted models without changing client code.
Architecture Overview
The external API integration uses three layers:
- Provider Abstraction Layer: The base ExternalAPIClient class defines the common interface
- Router Layer: ExternalModelRouter routes requests to the appropriate provider client
- Client Implementations: Provider-specific clients (Anthropic, OpenAI, Google)
┌─────────────────────────────────────────────┐
│ Model Service API │
│ (routes.py) │
└──────────────────┬──────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Model Manager │
│ (Decides: self-hosted or external?) │
└──────┬──────────────────────────────────────┘
│
├──────────────────────┬────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌─────────────┐
│ Self-Hosted │ │ External │ │ External │
│ VLM/LLM │ │ API Router │ │ Audio APIs │
│ (SGLang) │ └──────┬───────┘ └──────┬──────┘
└──────────────┘ │ │
│ │
┌──────────────┼────────────┐ │
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────┐ ┌──────────┐ ┌────────────┐
│ Anthropic │ │ OpenAI │ │ Google │
│ Client │ │ Client │ │ Client │
└──────┬──────┘ └─────┬────┘ └─────┬──────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────────┐
│ External Provider APIs │
│ (Claude, GPT-4o, Gemini via HTTPS) │
└──────────────────────────────────────────┘
Provider Abstraction Layer
Base Client Interface
The ExternalAPIClient abstract base class defines the common interface:
# model-service/src/external_apis/base.py
import httpx

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class ExternalAPIConfig:
    """Configuration for external API client."""
    api_key: str
    api_endpoint: str
    model_id: str
    provider: str = ""          # "anthropic", "openai", or "google"
    timeout: int = 60
    max_retries: int = 3


class ExternalAPIClient(ABC):
    """Base class for external API clients."""

    def __init__(self, config: ExternalAPIConfig):
        self.config = config
        self.client = httpx.AsyncClient(timeout=config.timeout)

    @abstractmethod
    async def generate_text(
        self, prompt: str, max_tokens: int = 1024,
        temperature: float = 0.7, **kwargs
    ) -> dict[str, Any]:
        """Generate text from a prompt."""
        pass

    @abstractmethod
    async def generate_from_images(
        self, images: list[bytes], prompt: str,
        max_tokens: int = 1024, **kwargs
    ) -> dict[str, Any]:
        """Generate text from images and a prompt."""
        pass

    @abstractmethod
    async def validate_key(self) -> bool:
        """Validate the API key with a minimal API call."""
        pass
All provider clients implement this interface, ensuring consistent behavior across providers.
Common Patterns
All client implementations follow these patterns:
- Configuration via dataclass: API keys, endpoints, and settings are passed in ExternalAPIConfig
- Async HTTP client: Uses httpx.AsyncClient for non-blocking requests
- Retry logic: Automatic retries with exponential backoff using tenacity
- Error handling: Provider-specific errors are mapped to common exceptions
- Response normalization: All responses return dict[str, Any] with text, usage, and model keys (example below)
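For illustration, a normalized response looks roughly like this (values are made up; the usage dict is passed through from the provider, so its keys vary by provider):

# Example of the normalized response shape (illustrative values)
{
    "text": "The video shows a cyclist crossing a busy intersection...",
    "usage": {"input_tokens": 512, "output_tokens": 87},
    "model": "claude-sonnet-4-5"
}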
Router Layer
The ExternalModelRouter manages client lifecycle and routes requests:
# model-service/src/external_apis/router.py
from .anthropic_client import AnthropicClient
from .base import ExternalAPIClient, ExternalAPIConfig
from .google_client import GoogleClient
from .openai_client import OpenAIClient


class ExternalModelRouter:
    """Routes requests to appropriate external API clients."""

    def __init__(self):
        self._clients: dict[str, ExternalAPIClient] = {}

    def get_client(
        self, config: ExternalAPIConfig, provider: str
    ) -> ExternalAPIClient:
        """Get or create a client for the specified provider."""
        cache_key = f"{provider}:{config.model_id}"
        if cache_key in self._clients:
            return self._clients[cache_key]

        # Create a client based on the provider
        if provider == "anthropic":
            client = AnthropicClient(config)
        elif provider == "openai":
            client = OpenAIClient(config)
        elif provider == "google":
            client = GoogleClient(config)
        else:
            raise ValueError(f"Unsupported provider: {provider}")

        self._clients[cache_key] = client
        return client
Client Caching
The router caches client instances using {provider}:{model_id} as the cache key. This:
- Reuses HTTP connections across requests
- Prevents recreating client objects
- Improves performance for repeated API calls
Clients remain cached until the router is closed with close_all().
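close_all() is not shown in the excerpt above; a plausible sketch, assuming each client owns the httpx.AsyncClient created in the base class:

# Sketch only: a possible cleanup method on ExternalModelRouter.
async def close_all(self) -> None:
    """Close every cached client's HTTP connection pool and empty the cache."""
    for client in self._clients.values():
        await client.client.aclose()   # httpx.AsyncClient.aclose()
    self._clients.clear()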
Provider Implementations
Anthropic Claude Client
Supports Claude Sonnet 4.5 with vision capabilities:
# model-service/src/external_apis/anthropic_client.py
import httpx

from typing import Any

from tenacity import (
    retry, retry_if_exception_type, stop_after_attempt, wait_exponential
)

from .base import ExternalAPIClient


class AnthropicClient(ExternalAPIClient):
    """Client for Anthropic Claude API."""

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        retry=retry_if_exception_type((httpx.TimeoutException, httpx.HTTPStatusError))
    )
    async def generate_text(self, prompt: str, max_tokens: int = 1024,
                            temperature: float = 0.7, **kwargs) -> dict[str, Any]:
        headers = {
            "x-api-key": self.config.api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json"
        }
        payload = {
            "model": self.config.model_id,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "messages": [{"role": "user", "content": prompt}]
        }
        response = await self.client.post(
            self.config.api_endpoint, headers=headers, json=payload
        )
        response.raise_for_status()
        data = response.json()
        return {
            "text": data["content"][0]["text"],
            "usage": data["usage"],
            "model": data["model"]
        }
Features:
- Vision support via generate_from_images() (max 20 images; sketched below)
- Automatic retry with exponential backoff
- System prompts via the system kwarg
- Stop sequences support
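The vision path is not shown in the excerpt above. A minimal sketch of what it could look like, assuming the Messages API base64 image content blocks and the same header conventions as generate_text (the body below is illustrative, not the exact implementation):

# Sketch only: a possible generate_from_images() on AnthropicClient.
import base64

async def generate_from_images(
    self, images: list[bytes], prompt: str,
    max_tokens: int = 1024, **kwargs
) -> dict[str, Any]:
    images = images[:20]                       # Claude accepts at most 20 images per request
    image_blocks = [
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": base64.b64encode(img).decode(),
            },
        }
        for img in images
    ]
    headers = {
        "x-api-key": self.config.api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    payload = {
        "model": self.config.model_id,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": image_blocks + [{"type": "text", "text": prompt}],
        }],
    }
    response = await self.client.post(
        self.config.api_endpoint, headers=headers, json=payload
    )
    response.raise_for_status()
    data = response.json()
    return {
        "text": data["content"][0]["text"],
        "usage": data["usage"],
        "model": data["model"],
    }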
OpenAI GPT Client
Supports GPT-4o with vision:
# model-service/src/external_apis/openai_client.py
from typing import Any

from .base import ExternalAPIClient


class OpenAIClient(ExternalAPIClient):
    """Client for OpenAI GPT API."""

    async def generate_text(self, prompt: str, **kwargs) -> dict[str, Any]:
        headers = {
            "Authorization": f"Bearer {self.config.api_key}",
            "Content-Type": "application/json"
        }
        payload = {
            "model": self.config.model_id,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": kwargs.get("max_tokens", 1024),
            "temperature": kwargs.get("temperature", 0.7)
        }
        response = await self.client.post(
            self.config.api_endpoint, headers=headers, json=payload
        )
        response.raise_for_status()
        data = response.json()
        return {
            "text": data["choices"][0]["message"]["content"],
            "usage": data["usage"],
            "model": data["model"]
        }
Features:
- Vision support via generate_from_images()
- JSON mode via the response_format kwarg (example below)
- Function calling support
- Streaming via stream=true
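For example, JSON mode in the Chat Completions API is enabled by adding response_format to the request payload. A hedged sketch of how the kwarg might be forwarded (the pass-through plumbing is assumed, not shown in the excerpt above):

# Sketch: request JSON output from GPT-4o.
# Note: OpenAI requires the prompt itself to mention JSON when json_object mode is used.
result = await client.generate_text(
    prompt="Summarize the clip as JSON with keys 'title' and 'summary'.",
    response_format={"type": "json_object"},   # assumed to be copied into the payload via **kwargs
)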
Google Gemini Client
Supports Gemini 2.5 Flash multimodal:
# model-service/src/external_apis/google_client.py
import base64

from typing import Any

from .base import ExternalAPIClient


class GoogleClient(ExternalAPIClient):
    """Client for Google Gemini API."""

    async def generate_from_images(
        self, images: list[bytes], prompt: str, **kwargs
    ) -> dict[str, Any]:
        url = f"{self.config.api_endpoint}?key={self.config.api_key}"

        # Convert images to base64 inline_data parts
        image_parts = [
            {
                "inline_data": {
                    "mime_type": "image/jpeg",
                    "data": base64.b64encode(img).decode()
                }
            }
            for img in images
        ]
        payload = {
            "contents": [{
                "parts": [{"text": prompt}] + image_parts
            }],
            "generationConfig": {
                "maxOutputTokens": kwargs.get("max_tokens", 1024),
                "temperature": kwargs.get("temperature", 0.7)
            }
        }
        response = await self.client.post(url, json=payload)
        response.raise_for_status()
        data = response.json()
        return {
            "text": data["candidates"][0]["content"]["parts"][0]["text"],
            "usage": {"total_tokens": 0},  # usage metadata is not parsed here
            "model": self.config.model_id
        }
Features:
- Native multimodal support (text + images in single request)
- Safety settings configuration
- Response candidates filtering
- Function calling support
API Key Resolution
API keys are resolved in this priority order:
- User-level keys: Set in Settings > API Keys (user-scoped)
- System-level keys: Set in Admin Panel > API Keys (admin-only, userId: null)
- Environment variables: Fallback from the .env file
Resolution Logic
# model-service/src/model_manager.py
async def get_external_api_config(self, task_name: str) -> ExternalAPIConfig:
    """Get external API configuration with key resolution."""
    task_config = self.tasks.get(task_name)
    selected_model = task_config.get_selected_config()
    provider = selected_model.provider

    # 1. Try user-level key (from backend API)
    api_key = await self._fetch_user_api_key(provider)

    # 2. Try system-level key (from backend API, userId: null)
    if not api_key:
        api_key = await self._fetch_system_api_key(provider)

    # 3. Fall back to environment variable
    if not api_key:
        api_key = os.getenv(f"{provider.upper()}_API_KEY")

    if not api_key:
        raise ValueError(
            f"API key for provider '{provider}' not found. "
            f"Set via UI or {provider.upper()}_API_KEY env var."
        )

    return ExternalAPIConfig(
        api_key=api_key,
        api_endpoint=selected_model.api_endpoint,
        model_id=selected_model.model_id,
        provider=provider
    )
This design allows:
- Users to configure their own API keys
- Admins to set fallback keys for all users
- Environment variables for simple deployment
Configuration
External API models are configured in config/models.yaml:
tasks:
  video_summarization:
    selected: "claude-sonnet-4-5"
    options:
      claude-sonnet-4-5:
        model_id: "claude-sonnet-4-5"
        framework: "external_api"
        provider: "anthropic"
        api_endpoint: "https://api.anthropic.com/v1/messages"
        requires_api_key: true
        speed: "fast"
        description: "Anthropic Claude Sonnet 4.5, vision capable, API-based"
      gpt-4o:
        model_id: "gpt-4o"
        framework: "external_api"
        provider: "openai"
        api_endpoint: "https://api.openai.com/v1/chat/completions"
        requires_api_key: true
        speed: "fast"
        description: "OpenAI GPT-4 Omni, vision capable, API-based"
      gemini-2-5-flash:
        model_id: "gemini-2.5-flash"
        framework: "external_api"
        provider: "google"
        api_endpoint: "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent"
        requires_api_key: true
        speed: "very_fast"
        description: "Google Gemini 2.5 Flash, multimodal, API-based"
Model Selection
The ModelManager checks the framework field to determine if a model is external:
def is_external_api(self, task_name: str) -> bool:
    """Check if task uses external API model."""
    task_config = self.tasks.get(task_name)
    if not task_config:
        return False
    selected_model = task_config.get_selected_config()
    return selected_model.framework == "external_api"
If framework == "external_api", the model manager creates an ExternalAPIConfig and uses the router instead of loading a self-hosted model.
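Condensed, the dispatch path combines the pieces shown above; the router instance and task name below are illustrative:

# Sketch: how an external model is reached once selected in config.
task = "video_summarization"
if manager.is_external_api(task):
    api_config = await manager.get_external_api_config(task)
    client = external_router.get_client(api_config, api_config.provider)
    result = await client.generate_text("Describe the video.")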
Frame Sampling Strategies
Different providers require different frame sampling approaches:
Anthropic Claude
- Max images per request: 20
- Sampling strategy: Uniform sampling across video duration
- Format: JPEG at 1024px width (maintains aspect ratio)
# Sample 16 frames uniformly from a 300-frame video
import numpy as np

MAX_IMAGES_PER_REQUEST = 20   # Claude's per-request image limit

total_frames = 300
max_frames = min(16, MAX_IMAGES_PER_REQUEST)
frame_indices = np.linspace(0, total_frames - 1, max_frames, dtype=int)
OpenAI GPT-4o
- Max images per request: 10 (unofficial limit; subject to change)
- Sampling strategy: Adaptive based on video length
- Format: JPEG at 2048px width (higher detail mode)
Google Gemini
- Max images per request: 16
- Sampling strategy: Keyframe detection or uniform
- Format: JPEG at 1536px width
The model service automatically adjusts frame count and sampling strategy based on the selected provider.
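A hedged sketch of how that per-provider adjustment could look; the limits come from the lists above, while the helper name and default are illustrative:

# Sketch: clamp the requested frame count to the provider's per-request limit.
PROVIDER_FRAME_LIMITS = {
    "anthropic": 20,   # Claude
    "openai": 10,      # GPT-4o (unofficial)
    "google": 16,      # Gemini
}

def frames_for_provider(provider: str, requested: int = 16) -> int:
    """Return how many frames to sample for the given provider."""
    return min(requested, PROVIDER_FRAME_LIMITS.get(provider, 8))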
Retry Mechanisms
All clients use the tenacity library for automatic retries:

import httpx
from tenacity import (
    retry, retry_if_exception_type, stop_after_attempt, wait_exponential
)

@retry(
    stop=stop_after_attempt(3),                           # Max 3 attempts
    wait=wait_exponential(multiplier=1, min=2, max=10),   # 2s, then 4s backoff
    retry=retry_if_exception_type((
        httpx.TimeoutException,
        httpx.HTTPStatusError
    ))
)
async def generate_text(self, prompt: str, **kwargs):
    # API call here
    pass
Retry Behavior
| Attempt | Wait Time | Total Elapsed |
|---|---|---|
| 1 | 0s | 0s |
| 2 | 2s | 2s |
| 3 | 4s | 6s |
| Fail | - | 6s+ |
Retries occur for:
- Timeout errors: Network or API timeout
- 429 Rate Limit: Too many requests
- 500/502/503: Server errors
Retries do NOT occur for the following (a status-aware predicate for this is sketched after the list):
- 401 Unauthorized: Invalid API key (permanent failure)
- 400 Bad Request: Invalid parameters (permanent failure)
- 404 Not Found: Model or endpoint not found (permanent failure)
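Note that retry_if_exception_type(httpx.HTTPStatusError) alone would retry every non-2xx response, including 401 and 400. Achieving the behavior described above requires a predicate that inspects the status code; a sketch under that assumption (function names are illustrative):

import httpx
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

def _is_retryable(exc: BaseException) -> bool:
    """Retry on timeouts, 429, and 5xx; treat other HTTP errors as permanent."""
    if isinstance(exc, httpx.TimeoutException):
        return True
    if isinstance(exc, httpx.HTTPStatusError):
        code = exc.response.status_code
        return code == 429 or code >= 500
    return False

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception(_is_retryable),
)
async def call_provider(client: httpx.AsyncClient, url: str, payload: dict) -> httpx.Response:
    response = await client.post(url, json=payload)
    response.raise_for_status()   # raises HTTPStatusError; the predicate decides whether to retry
    return response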
Error Handling
Provider-specific errors are caught and re-raised as common exceptions:
try:
    response = await self.client.post(url, headers=headers, json=payload)
    response.raise_for_status()
except httpx.HTTPStatusError as e:
    if e.response.status_code == 401:
        raise ValueError(f"Invalid API key for {provider}")
    elif e.response.status_code == 429:
        raise ValueError(f"Rate limit exceeded for {provider}")
    elif e.response.status_code >= 500:
        raise RuntimeError(f"{provider} API server error")
    else:
        raise
Client code can catch these errors and provide user-friendly messages.
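For example, a caller might translate these into HTTP responses (the handler snippet is illustrative, not the service's exact code):

# Sketch: surfacing the common exceptions to API consumers.
from fastapi import HTTPException

try:
    result = await client.generate_text(prompt)
except ValueError as e:      # invalid key or rate limit exceeded
    raise HTTPException(status_code=400, detail=str(e))
except RuntimeError as e:    # upstream provider server error
    raise HTTPException(status_code=502, detail=str(e))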
Integration with Summarization
The video summarization workflow integrates external APIs seamlessly:
# model-service/src/routes.py
@router.post("/api/summarize")
async def summarize_video(request: SummarizeRequest):
    manager = get_model_manager()
    # video_path and model_config are resolved earlier in the handler (omitted here)

    # Check if using external API
    if manager.is_external_api("video_summarization"):
        api_config = await manager.get_external_api_config("video_summarization")
        provider = api_config.provider

        # Use external API summarization
        response = await summarize_video_with_external_api(
            request=request,
            video_path=video_path,
            api_config=api_config,
            provider=provider
        )
    else:
        # Use self-hosted VLM
        response = await summarize_video_with_vlm(
            request=request,
            video_path=video_path,
            model_config=model_config
        )
    return response
This design allows switching between self-hosted and external models by changing the selected field in the config:
video_summarization:
  selected: "claude-sonnet-4-5"   # Use external API
  # OR
  selected: "llama-4-maverick"    # Use self-hosted
Performance Considerations
Latency
External API latency varies by provider:
| Provider | Typical Latency (8 frames) | P95 Latency |
|---|---|---|
| Anthropic Claude | 2-4 seconds | 6 seconds |
| OpenAI GPT-4o | 3-5 seconds | 8 seconds |
| Google Gemini | 1-3 seconds | 5 seconds |
Self-hosted models on GPU typically have 1-2 second latency for the same workload.
Cost
External APIs charge per token:
| Provider | Input (1K tokens) | Output (1K tokens) | Images (each) |
|---|---|---|---|
| Anthropic Claude Sonnet 4.5 | $0.003 | $0.015 | $0.0012 |
| OpenAI GPT-4o | $0.0025 | $0.01 | $0.002 |
| Google Gemini 2.5 Flash | $0.0001 | $0.0004 | $0.00004 |
A typical 8-frame video summary (500 input tokens + 100 output tokens) costs:
- Claude: ~$0.012
- GPT-4o: ~$0.018
- Gemini: ~$0.0004
Self-hosted models have no per-request cost (only infrastructure).
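The per-request estimates above follow directly from the pricing table; a small helper to reproduce them (rates hard-coded from the table, purely illustrative):

# Sketch: reproduce the cost estimates from the pricing table above.
RATES = {
    # provider: (input $ per 1K tokens, output $ per 1K tokens, $ per image)
    "anthropic": (0.003, 0.015, 0.0012),
    "openai": (0.0025, 0.01, 0.002),
    "google": (0.0001, 0.0004, 0.00004),
}

def estimate_cost(provider: str, input_tokens: int, output_tokens: int, images: int) -> float:
    """Estimate the cost of one request in USD."""
    in_rate, out_rate, img_rate = RATES[provider]
    return (input_tokens / 1000) * in_rate + (output_tokens / 1000) * out_rate + images * img_rate

# estimate_cost("anthropic", 500, 100, 8) -> ~0.0126 (matches the ~$0.012 figure above)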
Throughput
External APIs limit concurrent requests:
- Anthropic: Tier-based (5-500 requests/min)
- OpenAI: Tier-based (3-10,000 requests/min)
- Google: 60 requests/min (free tier)
The model service does not implement request queuing. Clients must handle rate limits.
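Callers that fan out many requests therefore need their own throttling. A common minimal approach is to cap in-flight requests with a semaphore (a sketch; a token bucket would be needed for strict requests-per-minute limits):

# Sketch: caller-side cap on concurrent external API requests.
import asyncio

semaphore = asyncio.Semaphore(5)   # stay comfortably under the provider's rate tier

async def generate_with_limit(client, prompt: str):
    async with semaphore:
        return await client.generate_text(prompt)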
Design Rationale
Why Provider Abstraction?
- Flexibility: Switch providers without changing client code
- Testability: Mock external APIs in tests
- Extensibility: Add new providers by implementing ExternalAPIClient
- Consistency: Uniform error handling and retry logic
Why Async HTTP Client?
- Performance: Non-blocking I/O for concurrent requests
- Scalability: Handle multiple requests without threads
- Compatibility: Integrates naturally with FastAPI's async request handlers
Why Separate Router?
- Lifecycle Management: Centralized client caching and cleanup
- Configuration: Single place to configure all providers
- Testing: Easier to mock router than individual clients
See Also
- Audio Transcription API: External audio API integration
- External API Configuration: User guide for API keys
- Model Service Configuration: Model selection and settings
- Audio Processing Architecture: Audio API provider abstraction