# Object Detection
Object detection identifies and localizes objects in video frames using bounding boxes. The service supports class-based detection (COCO classes), zero-shot detection with text prompts, and automatic object discovery.
## How It Works

The object detection pipeline:

1. Frame Extraction: Load specified frames from video
2. Preprocessing: Resize and normalize frames for model input
3. Model Inference: Run detection model on frames
4. Post-processing: Filter detections by confidence, apply NMS (Non-Maximum Suppression)
5. Response Generation: Return bounding boxes with labels and confidence scores
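For reference, the first two stages can be approximated client-side with OpenCV, which is useful when debugging why a frame produces no detections. A minimal sketch, assuming a 640×640 resize and [0, 1] pixel normalization (typical defaults, not necessarily the service's internal settings):

```python
import cv2
import numpy as np

def load_frames(video_path: str, frame_numbers: list[int]) -> list[np.ndarray]:
    """Step 1: extract the requested frames from the video."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for n in frame_numbers:
        cap.set(cv2.CAP_PROP_POS_FRAMES, n)  # seek to frame n
        ok, frame = cap.read()
        if not ok:
            raise ValueError(f"could not read frame {n} from {video_path}")
        frames.append(frame)
    cap.release()
    return frames

def preprocess(frame: np.ndarray, size: int = 640) -> np.ndarray:
    """Step 2: resize and normalize a frame for model input."""
    resized = cv2.resize(frame, (size, size))
    return resized.astype(np.float32) / 255.0  # pixel values scaled to [0, 1]
```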
## Available Models

### YOLO-World-v2

Model ID: `ultralytics/yolo-world-v2-l`
Type: Open-world object detector
Characteristics:
- VRAM: 2 GB
- FPS: 52 (GPU T4)
- Supports: Image prompts, COCO classes
Best for:
- Real-time detection
- Balanced speed and accuracy
- General-purpose object detection
Example:

```bash
curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/game.mp4",
    "model": "yolo-world-v2",
    "frame_numbers": [0, 30, 60],
    "classes": ["person", "sports ball", "baseball bat"],
    "confidence_threshold": 0.5
  }'
```
### GroundingDINO 1.5

Model ID: `IDEA-Research/grounding-dino-1.5-pro`
Type: Zero-shot detector with text prompts
Characteristics:
- VRAM: 4 GB
- FPS: 20 (GPU T4)
- AP: 52.5
- Supports: Text prompts, fine-grained detection
Best for:
- Custom object categories
- Text-based queries
- High accuracy requirements
Example:

```bash
curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/game.mp4",
    "model": "grounding-dino-1-5",
    "frame_numbers": [0, 30, 60],
    "text_prompt": "baseball pitcher in motion",
    "confidence_threshold": 0.6
  }'
```
### OWLv2

Model ID: `google/owlv2-large-patch14-ensemble`
Type: Open-vocabulary detector
Characteristics:
- VRAM: 6 GB
- FPS: 15 (GPU T4)
- Supports: Rare classes, long-tail distribution
Best for:
- Uncommon objects
- Rare class detection
- Fine-grained categories
Example (note the shell escape for the apostrophe inside the single-quoted payload):

```bash
curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/game.mp4",
    "model": "owlv2",
    "frame_numbers": [0, 30, 60],
    "text_prompt": "catcher'\''s mitt",
    "confidence_threshold": 0.4
  }'
```
### Florence-2

Model ID: `microsoft/Florence-2-large`
Type: Unified vision model
Characteristics:
- VRAM: 2 GB
- FPS: 30 (GPU T4)
- Supports: Detection, captioning, segmentation
Best for:
- Multi-task scenarios
- Combined detection and captioning
- Efficient inference
Example:

```bash
curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/game.mp4",
    "model": "florence-2",
    "frame_numbers": [0, 30, 60],
    "classes": ["person", "sports equipment"],
    "confidence_threshold": 0.5
  }'
```
## Detection Modes

### Class-Based Detection (COCO)

Detect objects from the 80 COCO dataset classes.
COCO Classes: person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, fire hydrant, stop sign, parking meter, bench, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe, backpack, umbrella, handbag, tie, suitcase, frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket, bottle, wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake, chair, couch, potted plant, bed, dining table, toilet, tv, laptop, mouse, remote, keyboard, cell phone, microwave, oven, toaster, sink, refrigerator, book, clock, vase, scissors, teddy bear, hair drier, toothbrush
Example:

```json
{
  "model": "yolo-world-v2",
  "classes": ["person", "sports ball", "baseball bat"]
}
```
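Because misspelled or non-COCO names silently return no detections (see Troubleshooting), it can help to validate class names client-side before sending a request. A small sketch using the class list above (the `validate_classes` helper is illustrative, not part of the service):

```python
# The 80 COCO class names listed above.
COCO_CLASSES = {
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train",
    "truck", "boat", "traffic light", "fire hydrant", "stop sign",
    "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
    "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag",
    "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite",
    "baseball bat", "baseball glove", "skateboard", "surfboard",
    "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon",
    "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot",
    "hot dog", "pizza", "donut", "cake", "chair", "couch", "potted plant",
    "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote",
    "keyboard", "cell phone", "microwave", "oven", "toaster", "sink",
    "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
    "hair drier", "toothbrush",
}

def validate_classes(classes: list[str]) -> None:
    """Raise if any requested class is not a COCO class name."""
    unknown = [c for c in classes if c not in COCO_CLASSES]
    if unknown:
        raise ValueError(f"not COCO class names: {unknown}")

validate_classes(["person", "sports ball"])  # OK
# validate_classes(["human"]) would raise ValueError
```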
### Text Prompt Detection (Zero-Shot)
Detect objects using natural language descriptions.
Supported models: GroundingDINO, OWLv2
Example:

```json
{
  "model": "grounding-dino-1-5",
  "text_prompt": "pitcher throwing baseball from mound"
}
```
Best practices:
- Use specific descriptions ("red bicycle" vs "bicycle")
- Include context ("person holding tennis racket" vs "person")
- Avoid ambiguous terms
- Test prompt variations for best results
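One way to follow the last point is to score several phrasings of the same prompt against a few sample frames and keep the one that detects most reliably. A sketch (endpoint and fields as in the API Endpoint section below; the prompt list is illustrative):

```python
import requests

PROMPTS = [
    "pitcher",
    "baseball pitcher",
    "baseball pitcher throwing from the mound",
]

for prompt in PROMPTS:
    resp = requests.post(
        "http://localhost:8000/api/detect",
        json={
            "video_path": "/data/game.mp4",
            "model": "grounding-dino-1-5",
            "frame_numbers": [0, 30, 60],
            "text_prompt": prompt,
            "confidence_threshold": 0.6,
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(prompt, "->", resp.json()["total_detections"], "detections")
```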
### Automatic Detection

Detect all objects without specifying classes.

Example:

```json
{
  "model": "yolo-world-v2",
  "auto_detect": true,
  "confidence_threshold": 0.6
}
```

Returns all detected objects above the confidence threshold.
## API Endpoint

### Request

```http
POST /api/detect
Content-Type: application/json
```

Request Schema:
```json
{
  "video_path": "/data/example.mp4",
  "model": "yolo-world-v2",
  "frame_numbers": [0, 30, 60, 90],
  "classes": ["person", "sports ball"],
  "confidence_threshold": 0.5,
  "batch_size": 4,
  "nms_threshold": 0.45
}
```
Parameters:
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
video_path | string | Yes | - | Path to video file |
model | string | No | (from config) | Detection model to use |
frame_numbers | array | Yes | - | Frame indices to process |
classes | array | No | null | COCO class names to detect |
text_prompt | string | No | null | Text description for zero-shot |
confidence_threshold | float | No | 0.5 | Minimum confidence (0.0-1.0) |
batch_size | integer | No | 4 | Frames per batch |
nms_threshold | float | No | 0.45 | NMS IoU threshold |
auto_detect | boolean | No | false | Detect all objects |
### Response
Status: 200 OK
```json
{
  "detections": [
    {
      "frame_number": 0,
      "objects": [
        {
          "bbox": {"x": 245, "y": 180, "width": 120, "height": 200},
          "label": "person",
          "confidence": 0.92,
          "class_id": 0
        },
        {
          "bbox": {"x": 520, "y": 340, "width": 45, "height": 45},
          "label": "sports ball",
          "confidence": 0.87,
          "class_id": 32
        }
      ],
      "detection_count": 2
    },
    {
      "frame_number": 30,
      "objects": [
        {
          "bbox": {"x": 250, "y": 185, "width": 115, "height": 195},
          "label": "person",
          "confidence": 0.94,
          "class_id": 0
        }
      ],
      "detection_count": 1
    }
  ],
  "total_frames": 2,
  "total_detections": 3,
  "model_used": "yolo-world-v2",
  "inference_time_ms": 145
}
```
Response Fields:
Field | Type | Description |
---|---|---|
detections | array | Detections per frame |
frame_number | integer | Frame index |
objects | array | Detected objects in frame |
bbox | object | Bounding box (x, y, width, height in pixels) |
label | string | Object class name |
confidence | float | Detection confidence (0.0-1.0) |
class_id | integer | COCO class ID |
total_frames | integer | Number of frames processed |
total_detections | integer | Total objects detected |
model_used | string | Model that performed inference |
inference_time_ms | integer | Total inference duration |
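Putting the request and response schemas together, a minimal Python client might look like this (host, port, and payload values are the same placeholders used in the examples above):

```python
import requests

resp = requests.post(
    "http://localhost:8000/api/detect",
    json={
        "video_path": "/data/example.mp4",
        "model": "yolo-world-v2",
        "frame_numbers": [0, 30, 60, 90],
        "classes": ["person", "sports ball"],
        "confidence_threshold": 0.5,
    },
    timeout=120,
)
resp.raise_for_status()
result = resp.json()

print(f"{result['total_detections']} objects in {result['total_frames']} frames "
      f"({result['inference_time_ms']} ms, model={result['model_used']})")
for frame in result["detections"]:
    for obj in frame["objects"]:
        b = obj["bbox"]
        print(f"frame {frame['frame_number']}: {obj['label']} "
              f"{obj['confidence']:.2f} at ({b['x']}, {b['y']}, {b['width']}x{b['height']})")
```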
### Error Responses
400 Bad Request:
```json
{
  "error": "Validation Error",
  "message": "frame_numbers cannot be empty",
  "details": {
    "field": "frame_numbers",
    "constraint": "min_length:1"
  }
}
```
404 Not Found:
```json
{
  "error": "Not Found",
  "message": "Video file not found: /data/missing.mp4"
}
```
500 Internal Server Error:
```json
{
  "error": "Detection Error",
  "message": "Model inference failed on frame 60",
  "model": "yolo-world-v2",
  "frame_number": 60
}
```
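All three error shapes carry `error` and `message` fields, so a client can surface them uniformly. A sketch (the `detect` wrapper is illustrative, not part of the service):

```python
import requests

def detect(payload: dict) -> dict:
    """POST a detection request; raise with the service's error message on failure."""
    resp = requests.post("http://localhost:8000/api/detect", json=payload, timeout=120)
    if resp.status_code != 200:
        err = resp.json()  # error bodies carry "error" and "message" fields
        raise RuntimeError(f"{resp.status_code} {err.get('error')}: {err.get('message')}")
    return resp.json()
```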
## Bounding Box Format
Bounding boxes use pixel coordinates with origin at top-left corner.
Format:

```json
{
  "x": 100,      // Left edge (pixels from left)
  "y": 150,      // Top edge (pixels from top)
  "width": 200,  // Box width (pixels)
  "height": 250  // Box height (pixels)
}
```
Coordinate system:

```
(0,0)────────────────> x
│
│   (x,y)───────────┐
│   │               │
│   │    Object     │ height
│   │               │
│   └───────────────┘
│         width
v
y
```
Converting to other formats:

To center-based (x_center, y_center, width, height):

```
x_center = x + width / 2
y_center = y + height / 2
```

To corner-based (x1, y1, x2, y2):

```
x1 = x
y1 = y
x2 = x + width
y2 = y + height
```

To normalized (0.0-1.0):

```
x_norm = x / frame_width
y_norm = y / frame_height
width_norm = width / frame_width
height_norm = height / frame_height
```
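The same conversions as plain Python functions:

```python
def to_center(x, y, w, h):
    """(x, y, w, h) -> (x_center, y_center, w, h)."""
    return (x + w / 2, y + h / 2, w, h)

def to_corners(x, y, w, h):
    """(x, y, w, h) -> (x1, y1, x2, y2)."""
    return (x, y, x + w, y + h)

def to_normalized(x, y, w, h, frame_w, frame_h):
    """(x, y, w, h) in pixels -> fractions of the frame size in [0.0, 1.0]."""
    return (x / frame_w, y / frame_h, w / frame_w, h / frame_h)

to_corners(245, 180, 120, 200)  # -> (245, 180, 365, 380): the "person" box above
```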
## Confidence Thresholds and Filtering

### Confidence Threshold

The minimum confidence score for a detection to be returned. Lower values increase recall; higher values increase precision.
Threshold | Effect | Use Case |
---|---|---|
0.3 | High recall, low precision | Exploratory analysis, find all candidates |
0.5 | Balanced (default) | General-purpose detection |
0.7 | High precision, low recall | Production, minimize false positives |
0.9 | Very high precision | Critical applications, manual review |
Example:

```json
{
  "confidence_threshold": 0.7  // Only detections above 70% confidence
}
```
### NMS (Non-Maximum Suppression)

NMS removes duplicate detections of the same object. The NMS threshold is the IoU above which a lower-confidence overlapping box is suppressed, so lower values merge overlapping boxes more aggressively.
NMS Threshold | Effect |
---|---|
0.3 | Aggressive merging, fewer boxes |
0.45 | Balanced (default) |
0.6 | Lenient merging, more boxes |
Example:

```json
{
  "nms_threshold": 0.45
}
```
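For intuition about what `nms_threshold` controls, here is a reference implementation of greedy NMS over `(x, y, width, height)` boxes. This is a sketch for illustration; the service's internal implementation may differ.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def nms(boxes, scores, threshold=0.45):
    """Greedy NMS: keep the highest-scoring box, drop overlaps above threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < threshold for j in keep):
            keep.append(i)
    return keep  # indices of surviving boxes
```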
### Class-Specific Thresholds
Different thresholds per class (requires model support):
```json
{
  "class_thresholds": {
    "person": 0.6,
    "sports ball": 0.4,
    "baseball bat": 0.5
  }
}
```
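If the selected model does not support `class_thresholds`, a similar effect can be approximated client-side: request with the lowest per-class threshold (0.4 here) and filter the response. A sketch using the response fields documented above:

```python
CLASS_THRESHOLDS = {"person": 0.6, "sports ball": 0.4, "baseball bat": 0.5}

def filter_by_class(frames, thresholds, default=0.5):
    """Drop objects below their class's threshold; frames is the "detections" array."""
    for frame in frames:
        frame["objects"] = [
            o for o in frame["objects"]
            if o["confidence"] >= thresholds.get(o["label"], default)
        ]
        frame["detection_count"] = len(frame["objects"])
    return frames
```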
## Batch Processing

Process multiple frames efficiently using batching.

### Single Frame
```json
{
  "frame_numbers": [45],
  "batch_size": 1
}
```
Inference time: ~50ms (YOLO-World-v2 on GPU T4)
### Multiple Frames
```json
{
  "frame_numbers": [0, 10, 20, 30, 40, 50, 60, 70],
  "batch_size": 4
}
```

The eight frames are processed in two batches of four. Inference time: ~80ms (vs ~400ms sequentially).
Batch size recommendations:
Model | CPU | GPU T4 | GPU A100 |
---|---|---|---|
YOLO-World-v2 | 1 | 4-8 | 16-32 |
GroundingDINO | 1 | 2-4 | 8-16 |
OWLv2 | 1 | 2-4 | 8-16 |
Florence-2 | 1 | 4-8 | 16-32 |
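For frame lists much longer than one batch, one option is to split them across several requests so that each response stays small and a failure only loses one chunk. A sketch (the chunk size of 64 is arbitrary):

```python
import requests

def chunked(seq, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

all_frames = list(range(0, 3000, 30))  # every 30th frame of a 3000-frame video
results = []
for chunk in chunked(all_frames, 64):
    resp = requests.post(
        "http://localhost:8000/api/detect",
        json={
            "video_path": "/data/game.mp4",
            "model": "yolo-world-v2",
            "frame_numbers": chunk,
            "batch_size": 8,  # per the GPU T4 recommendation above
        },
        timeout=300,
    )
    resp.raise_for_status()
    results.extend(resp.json()["detections"])
```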
## Performance Benchmarks

### Frames Per Second (GPU T4)
Model | Single Frame | Batch (4) | Batch (8) |
---|---|---|---|
YOLO-World-v2 | 52 fps | 60 fps | 65 fps |
GroundingDINO | 20 fps | 24 fps | 26 fps |
OWLv2 | 15 fps | 18 fps | 20 fps |
Florence-2 | 30 fps | 36 fps | 40 fps |
### Accuracy (COCO Dataset)
Model | mAP | AP50 | AP75 |
---|---|---|---|
YOLO-World-v2 | 45.2 | 62.8 | 48.5 |
GroundingDINO | 52.5 | 69.7 | 56.9 |
OWLv2 | 48.3 | 65.1 | 52.2 |
Florence-2 | 43.8 | 61.5 | 46.8 |
### Latency Breakdown (Single Frame, GPU T4)
Using YOLO-World-v2:
Step | Duration | Percentage |
---|---|---|
Frame loading | 8 ms | 16% |
Preprocessing | 4 ms | 8% |
Model inference | 32 ms | 64% |
Post-processing (NMS) | 4 ms | 8% |
Response formatting | 2 ms | 4% |
Total | 50 ms | 100% |
## Use Cases and Limitations

### When to Use Object Detection
- Bootstrap annotations: Generate initial bounding boxes for manual refinement
- Object tracking initialization: Provide starting boxes for tracking
- Quality control: Verify annotation coverage
- Dataset analysis: Understand object distribution
- Automated annotation: Process large video datasets
### Limitations
- Small objects: Objects smaller than 32x32 pixels may be missed
- Occlusions: Partially hidden objects have lower confidence
- Class limitations: COCO-based models limited to 80 classes
- Domain shift: Models trained on natural images struggle with specialized domains
- Motion blur: Fast-moving objects may have lower detection rates
### Accuracy Expectations
Scenario | Expected mAP |
---|---|
Clear, well-lit objects | 70-85% |
Partial occlusions | 50-65% |
Small objects (< 32px) | 30-45% |
Rare classes | 40-60% |
Domain-specific (medical, aerial) | 35-55% |
## Troubleshooting

### No Detections
Symptom: Empty detections array for frames with objects
Causes:
- Confidence threshold too high
- Wrong class names
- Objects too small
Solutions:

- Lower the confidence threshold:

  ```json
  {
    "confidence_threshold": 0.3
  }
  ```

- Verify class names (COCO classes only):

  ```json
  {
    "classes": ["person", "sports ball"]  // Not "human", "ball"
  }
  ```

- Use auto-detect to find all objects:

  ```json
  {
    "auto_detect": true
  }
  ```
### Duplicate Detections
Symptom: Multiple boxes for same object
Cause: NMS threshold too high
Solution: Lower the NMS threshold:

```json
{
  "nms_threshold": 0.3  // More aggressive merging
}
```
### Slow Inference
Symptom: Detection takes multiple seconds
Causes:
- CPU mode
- Large batch size
- High-resolution frames
Solutions:

- Use GPU mode (see Configuration)

- Reduce the batch size:

  ```json
  {
    "batch_size": 1
  }
  ```

- Use a faster model:

  ```json
  {
    "model": "yolo-world-v2"  // Fastest option
  }
  ```
### CUDA Out of Memory
Symptom: Error during batch processing
Cause: Batch size too large for GPU
Solution: Reduce batch size:

```json
{
  "batch_size": 2  // Down from 8
}
```
## Example Workflows

### Workflow 1: Bootstrap Annotation
Detect objects for manual annotation:
```bash
curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/game.mp4",
    "model": "yolo-world-v2",
    "frame_numbers": [0, 30, 60, 90, 120],
    "classes": ["person", "sports ball", "baseball bat"],
    "confidence_threshold": 0.5,
    "batch_size": 4
  }'
```
### Workflow 2: Zero-Shot Detection
Detect custom objects with text prompt:
```bash
curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/game.mp4",
    "model": "grounding-dino-1-5",
    "frame_numbers": [0, 30, 60],
    "text_prompt": "baseball pitcher in windup position",
    "confidence_threshold": 0.6
  }'
```
### Workflow 3: High-Recall Detection
Find all candidate objects:
```bash
curl -X POST http://localhost:8000/api/detect \
  -H "Content-Type: application/json" \
  -d '{
    "video_path": "/data/game.mp4",
    "model": "yolo-world-v2",
    "frame_numbers": [0, 30, 60],
    "auto_detect": true,
    "confidence_threshold": 0.3,
    "nms_threshold": 0.4
  }'
```
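The same workflow in Python, dumping the candidates to a file for manual review (the output filename is illustrative):

```python
import json
import requests

resp = requests.post(
    "http://localhost:8000/api/detect",
    json={
        "video_path": "/data/game.mp4",
        "model": "yolo-world-v2",
        "frame_numbers": [0, 30, 60],
        "auto_detect": True,
        "confidence_threshold": 0.3,
        "nms_threshold": 0.4,
    },
    timeout=120,
)
resp.raise_for_status()
with open("candidates.json", "w") as f:
    json.dump(resp.json()["detections"], f, indent=2)
```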
## Next Steps
- Set up video tracking using detections
- Configure models for your hardware
- Use video summarization for context
- Enable ontology augmentation
- Return to overview