Grafana Dashboards
Grafana provides visualization and analysis of FOVEA metrics. Dashboards are auto-provisioned from the `grafana-dashboards/` directory.
Accessing Grafana
- Open http://localhost:3002 (see the compose sketch after this list)
- Log in with username `admin` and password `admin`
- Change the password on first login (for production deployments)
- Navigate to Dashboards → Browse
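Grafana listens on port 3000 by default, so the URL above implies a host port mapping in docker-compose.yml. A minimal sketch, assuming a standard service definition (image tag and volume path are illustrative):
```yaml
services:
  grafana:
    image: grafana/grafana:latest                      # tag is illustrative
    ports:
      - "3002:3000"                                    # host 3002 -> Grafana's default 3000
    volumes:
      - ./grafana-dashboards:/etc/grafana/dashboards   # assumed mount for auto-provisioning
```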
Dashboard Overview
System Overview Dashboard
Displays overall system health and status.
Panels:
- Service health status
- Request rates across all services
- Error rates
- Resource usage summary
Use for:
- Quick health check
- Identifying service issues
- Monitoring overall load
API Performance Dashboard
Tracks backend API performance metrics.
Panels:
- Request duration (p50, p95, p99)
- Requests per second by endpoint
- Error rate by endpoint
- Slow query detection
Key Metrics:
- API latency trends
- Endpoint-specific performance
- Error patterns
Use for:
- Identifying slow endpoints
- Monitoring API health
- Detecting performance regressions
Model Service Dashboard
Monitors AI inference performance.
Panels:
- Inference duration by model
- GPU utilization (if applicable)
- Queue depth for inference jobs
- Model loading times
Key Metrics:
- Model performance
- Resource usage
- Queue backlog
Use for:
- Monitoring inference performance
- Detecting model issues
- GPU utilization tracking
Database Dashboard
Tracks database performance and health.
Panels:
- Connection pool usage
- Query duration
- Active connections
- Table sizes (if configured)
Key Metrics:
- Database latency
- Connection pool health
- Query performance
Use for:
- Identifying slow queries
- Monitoring connection usage
- Detecting database issues
Key Metrics to Watch
Critical Thresholds
API Latency:
- p95 > 1s: Investigate slow endpoints
- p99 > 5s: Critical performance issue
Error Rate:
- Above 1%: Check logs for errors
- Sudden spikes: Possible service failure
GPU Memory (GPU mode):
- Above 90%: Risk of CUDA out of memory
- Consider reducing batch sizes
Database Connections:
- Above 80% of pool: Increase pool size or investigate leaks (see the rule sketch below)
- Connection timeouts: Database overload
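These thresholds translate directly into alert conditions. As an example, the connection-pool threshold could be encoded as a Prometheus-style rule like the sketch below; the pool metric names are hypothetical, so check the Metrics Reference for the actual ones:
```yaml
groups:
  - name: fovea-threshold-alerts
    rules:
      - alert: ConnectionPoolNearLimit
        # fovea_db_pool_connections_active / _max are hypothetical metric names
        expr: fovea_db_pool_connections_active / fovea_db_pool_connections_max > 0.80
        for: 5m
        annotations:
          summary: Database connection pool above 80% of capacity
```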
Interpreting Metrics
Request Rate
Normal Pattern:
- Steady during work hours
- Low during off hours
- Gradual changes
Anomalies:
- Sudden spikes: Possible bot or attack
- Sudden drops: Service unavailable or upstream issue
- Oscillating: Request retry storms
Latency
Normal Pattern:
- Consistent p50 and p95
- p99 higher but stable
- Slight variations acceptable
Anomalies:
- Increasing trend: Resource exhaustion
- Sudden spikes: Slow query or external dependency
- High variance: Inconsistent performance
Error Rate
Normal Pattern:
- Near zero for healthy service
- Occasional 4xx errors acceptable (client errors)
Anomalies:
- Increasing 5xx errors: Service issues
- Sudden 5xx spike: Service failure or deployment issue
- High 4xx rate: API misuse or client bugs
Creating Custom Dashboards
Add New Dashboard
- Navigate to Dashboards → New → New Dashboard
- Click "Add visualization"
- Select Prometheus data source
- Enter PromQL query
- Configure visualization type
- Save dashboard
Example PromQL Queries
API request rate:
```promql
rate(fovea_api_requests_total[5m])
```
API latency p95:
```promql
histogram_quantile(0.95, rate(fovea_api_request_duration_milliseconds_bucket[5m]))
```
Error rate percentage:
```promql
100 * (
  rate(fovea_api_requests_total{status=~"5.."}[5m])
  /
  rate(fovea_api_requests_total[5m])
)
```
Job processing rate:
```promql
rate(fovea_queue_job_submitted{status="completed"}[5m])
```
Job failure rate:
```promql
rate(fovea_queue_job_submitted{status="failed"}[5m])
```
Average job duration:
```promql
rate(fovea_queue_job_duration_sum[5m]) / rate(fovea_queue_job_duration_count[5m])
```
Setting Up Alerts
Create Alert Rule
- Open dashboard panel
- Click panel title → Edit
- Go to Alert tab
- Click "Create alert rule from this panel"
- Configure conditions and thresholds
- Set notification channels
Example Alert: High Error Rate
Condition (as a ratio, so the threshold matches the 5% in the description):
```promql
rate(fovea_api_requests_total{status=~"5.."}[5m])
  / rate(fovea_api_requests_total[5m]) > 0.05
```
For: 5 minutes
Annotations:
- Summary: "High error rate detected"
- Description: "Error rate exceeds 5% for 5 minutes"
Example Alert: High Latency
Condition:
```promql
histogram_quantile(0.95, rate(fovea_api_request_duration_milliseconds_bucket[5m])) > 1000
```
For: 10 minutes
Annotations:
- Summary: "High API latency"
- Description: "P95 latency exceeds 1 second"
Example Alert: GPU Memory
Condition (GPU mode):
```promql
fovea_model_gpu_memory_bytes / fovea_model_gpu_memory_total_bytes * 100 > 90
```
For: 5 minutes
Annotations:
- Summary: "High GPU memory usage"
- Description: "GPU memory usage exceeds 90%"
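If you prefer to keep alerts in version control rather than in the UI, the three examples above map directly onto Prometheus-style alerting rules. A sketch (group and alert names are illustrative; Grafana-managed alerts use their own provisioning format):
```yaml
groups:
  - name: fovea-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(fovea_api_requests_total{status=~"5.."}[5m])
            / rate(fovea_api_requests_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: High error rate detected
          description: Error rate exceeds 5% for 5 minutes
      - alert: HighApiLatency
        expr: histogram_quantile(0.95, rate(fovea_api_request_duration_milliseconds_bucket[5m])) > 1000
        for: 10m
        annotations:
          summary: High API latency
          description: P95 latency exceeds 1 second
      - alert: HighGpuMemory
        expr: fovea_model_gpu_memory_bytes / fovea_model_gpu_memory_total_bytes * 100 > 90
        for: 5m
        annotations:
          summary: High GPU memory usage
          description: GPU memory usage exceeds 90%
```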
Notification Channels
Configure Email Notifications
- Navigate to Alerting → Notification channels
- Click "Add channel"
- Select "Email"
- Enter recipient email addresses
- Test the notification (delivery requires SMTP; see the sketch below)
- Save
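Email notifications only work once Grafana can reach an SMTP server. A sketch of configuring this via environment overrides in docker-compose.yml (host, credentials, and addresses are placeholders):
```yaml
services:
  grafana:
    environment:
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=smtp.example.com:587          # placeholder SMTP relay
      - GF_SMTP_USER=alerts@example.com            # placeholder credentials
      - GF_SMTP_PASSWORD=changeme
      - GF_SMTP_FROM_ADDRESS=grafana@example.com
```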
Configure Slack Notifications
- Create Slack incoming webhook
- Add notification channel in Grafana
- Select "Slack"
- Enter webhook URL
- Configure channel and mention settings
- Test and save (see the provisioning sketch below)
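On Grafana versions with unified alerting, the same Slack destination can instead be provisioned from a file under provisioning/alerting/. A sketch; the contact point name is illustrative and the webhook URL is a placeholder:
```yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: fovea-slack
    receivers:
      - uid: fovea-slack
        type: slack
        settings:
          url: https://hooks.slack.com/services/T000/B000/XXXX   # placeholder webhook
```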
Dashboard Management
Export Dashboard
- Open dashboard
- Click settings icon (gear)
- Select "JSON Model"
- Copy JSON
- Save to file or share
Import Dashboard
- Navigate to Dashboards → Import
- Paste JSON or upload file
- Select data source
- Click Import
Auto-Provisioning
Dashboards in the `grafana-dashboards/` directory are auto-provisioned on startup.
To add a permanent dashboard:
- Create a JSON file in `grafana-dashboards/`
- Restart Grafana: `docker compose restart grafana`
- The dashboard appears automatically (see the provider file sketch below)
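Under the hood, auto-provisioning relies on a dashboard provider file, usually mounted under provisioning/dashboards/ inside the container. A sketch of what such a file likely looks like; the path must match wherever `grafana-dashboards/` is mounted:
```yaml
apiVersion: 1
providers:
  - name: FOVEA                       # provider name is illustrative
    type: file
    updateIntervalSeconds: 30         # how often Grafana rescans the directory
    options:
      path: /etc/grafana/dashboards   # assumed container mount of grafana-dashboards/
```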
Best Practices
Dashboard Organization
- Use folders to organize dashboards by service
- Name dashboards descriptively
- Add descriptions to panels
- Use consistent color schemes
Query Optimization
- Use appropriate time ranges
- Limit number of series
- Use recording rules for complex queries (see the sketch below)
- Cache expensive queries
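Recording rules are defined on the Prometheus side rather than in Grafana. As a sketch, the p95 latency expression used throughout this page could be precomputed like this (group and record names are illustrative):
```yaml
groups:
  - name: fovea-recording-rules
    interval: 30s
    rules:
      # Precompute p95 API latency so dashboards query one cheap series
      - record: fovea:api_request_duration_ms:p95_5m
        expr: histogram_quantile(0.95, rate(fovea_api_request_duration_milliseconds_bucket[5m]))
```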
Alert Configuration
- Set appropriate thresholds
- Use "For" duration to avoid flapping
- Group related alerts
- Test alerts before deploying
Regular Maintenance
- Review dashboard relevance
- Update queries as system changes
- Clean up unused dashboards
- Update alert thresholds based on baselines
Troubleshooting
Dashboard Not Loading
Check:
- Prometheus is running: `docker compose ps prometheus`
- Data source configured: Connections → Data sources (provisioning sketch below)
- Time range includes data
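If the Prometheus data source is missing entirely, it can be pinned down with a file under provisioning/datasources/. A sketch, assuming Prometheus runs as a compose service named prometheus on its default port:
```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # assumes the compose service name and default port
    isDefault: true
```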
No Data in Panels
Check:
- Services are exporting metrics
- OTEL Collector is running
- Prometheus is scraping successfully
- Query syntax is correct
Alerts Not Firing
Check:
- Alert rule is enabled
- Condition threshold is correct
- Evaluation interval is appropriate
- Notification channel is configured
Next Steps
- Metrics Reference: Complete metrics catalog
- Overview: Monitoring stack overview
- Common Tasks: Daily operations