Video Tracking
Video tracking follows objects across multiple frames, maintaining consistent identities through occlusions, scale changes, and motion. The service provides automated track generation for annotation workflows.
How It Works
The tracking pipeline (a code sketch follows the list):
- Initial Detection: Detect objects in the first frame (or use provided bounding boxes)
- Feature Extraction: Extract visual features from detected regions
- Frame-to-Frame Association: Match objects across consecutive frames
- Track Management: Maintain track IDs, handle births and deaths
- Interpolation: Fill gaps in tracks due to occlusions
- Response Generation: Return track sequences with bounding boxes per frame
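Below is a minimal sketch of the detection-association and track-management steps, assuming per-frame detections are already available as (x, y, w, h) boxes. The helper names are illustrative and do not reflect the service's internals.

# Minimal greedy IoU tracker sketch. Detections per frame are (x, y, w, h).
def iou(a, b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def track(frames, iou_min=0.3):
    """frames: one list of (x, y, w, h) detections per video frame."""
    tracks = {}    # track_id -> last matched box
    history = {}   # track_id -> [(frame_number, box), ...]
    next_id = 1
    for frame_number, detections in enumerate(frames):
        unmatched = list(detections)
        for tid, last_box in list(tracks.items()):
            best = max(unmatched, key=lambda d: iou(last_box, d), default=None)
            if best is not None and iou(last_box, best) >= iou_min:
                tracks[tid] = best                      # continuation
                history[tid].append((frame_number, best))
                unmatched.remove(best)
            else:
                del tracks[tid]                         # track death
        for det in unmatched:                           # track birth
            tracks[next_id] = det
            history[next_id] = [(frame_number, det)]
            next_id += 1
    return history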
Available Models
SAMURAI
Model ID: yangchris11/samurai
Type: Motion-aware segmentation and tracking
Characteristics:
- VRAM: 3 GB
- Speed: Real-time (25-45 fps)
- Accuracy improvement: 7.1% over SAM2.1
Best for:
- Fast-moving objects
- Occlusion handling
- Motion-based tracking
Example:
curl -X POST http://localhost:8000/api/track \
-H "Content-Type: application/json" \
-d '{
"video_path": "/data/game.mp4",
"model": "samurai",
"frame_range": {"start": 0, "end": 100},
"confidence_threshold": 0.7,
"decimation": 5
}'
SAM2Long
Model ID: Mark12Ding/SAM2Long
Type: Long-video tracking with error correction
Characteristics:
- VRAM: 3 GB
- Speed: Real-time (30-50 fps)
- Accuracy improvement: 5.3% over SAM2.1
Best for:
- Extended video sequences
- Error accumulation prevention
- Long-duration tracking
Example:
curl -X POST http://localhost:8000/api/track \
-H "Content-Type: application/json" \
-d '{
"video_path": "/data/game.mp4",
"model": "sam2long",
"frame_range": {"start": 0, "end": 500},
"confidence_threshold": 0.7
}'
SAM2.1
Model ID: facebook/sam2.1-hiera-large
Type: Segment Anything Model for video
Characteristics:
- VRAM: 3 GB
- Speed: Real-time (30-50 fps)
- Baseline model
Best for:
- Stable baseline tracking
- Proven reliability
- General-purpose use
Example:
curl -X POST http://localhost:8000/api/track \
-H "Content-Type: application/json" \
-d '{
"video_path": "/data/game.mp4",
"model": "sam2-1",
"frame_range": {"start": 0, "end": 200},
"confidence_threshold": 0.6
}'
YOLO11n-seg
Model ID: ultralytics/yolo11n-seg
Type: Lightweight instance segmentation
Characteristics:
- VRAM: 1 GB
- Speed: Very fast (40-70 fps)
- Smallest model
Best for:
- Speed-critical applications
- Limited VRAM environments
- Real-time tracking
Example:
curl -X POST http://localhost:8000/api/track \
-H "Content-Type: application/json" \
-d '{
"video_path": "/data/game.mp4",
"model": "yolo11n-seg",
"frame_range": {"start": 0, "end": 100},
"confidence_threshold": 0.5
}'
Tracking Workflow
Detection → Association → Interpolation
Step 1: Detection
Initialize tracks with object detections from the first frame.
{
"initial_detections": [
{"bbox": {"x": 100, "y": 150, "width": 80, "height": 120}, "label": "person"}
]
}
Or use automatic detection:
{
"auto_detect": true,
"detection_model": "yolo-world-v2"
}
Step 2: Association
Match objects across frames using visual similarity and motion prediction.
Association methods (a motion-prediction sketch follows the list):
- IoU (Intersection over Union): Spatial overlap
- Feature similarity: Visual appearance matching
- Kalman filter: Motion prediction
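As a concrete illustration of the motion-prediction method, here is a constant-velocity extrapolation, a simplified stand-in for a full Kalman filter (the function name is illustrative):

def predict_next(prev_box, curr_box):
    """Extrapolate an (x, y, w, h) box one frame ahead from its last two positions."""
    dx = curr_box[0] - prev_box[0]
    dy = curr_box[1] - prev_box[1]
    return (curr_box[0] + dx, curr_box[1] + dy, curr_box[2], curr_box[3])

# A box moving right 5 px/frame is expected at x=110 in the next frame.
print(predict_next((100, 150, 80, 120), (105, 150, 80, 120)))
# (110, 150, 80, 120)

Matching detections against the predicted box rather than the last observed one keeps fast-moving objects associated even when consecutive detections barely overlap.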
Step 3: Interpolation
Fill gaps in tracks when objects are temporarily occluded.
{
"interpolation": "linear",
"max_gap_frames": 10
}
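A sketch of what linear interpolation does to a gap, assuming (x, y, w, h) boxes at the keyframes that bracket it:

def interpolate_gap(frame_a, box_a, frame_b, box_b):
    """Yield (frame_number, box) for the frames strictly between the two keyframes."""
    span = frame_b - frame_a
    for f in range(frame_a + 1, frame_b):
        t = (f - frame_a) / span
        yield f, tuple(a + t * (b - a) for a, b in zip(box_a, box_b))

# Object last seen at frame 0, reappears at frame 10: fill frames 1-9.
for f, box in interpolate_gap(0, (100, 150, 80, 120), 10, (110, 160, 84, 116)):
    print(f, box)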
API Endpoint
Request
POST /api/track
Content-Type: application/json
Request Schema:
{
"video_path": "/data/example.mp4",
"model": "samurai",
"frame_range": {"start": 0, "end": 100},
"initial_detections": [
{"bbox": {"x": 100, "y": 150, "width": 80, "height": 120}, "label": "person"}
],
"confidence_threshold": 0.7,
"decimation": 5,
"interpolation": "linear",
"max_gap_frames": 10
}
Parameters:
Parameter | Type | Required | Default | Description |
---|---|---|---|---|
video_path | string | Yes | - | Path to video file |
model | string | No | (from config) | Tracking model to use |
frame_range | object | Yes | - | Start and end frame numbers |
initial_detections | array | No | null | Starting bounding boxes (auto-detect if null) |
auto_detect | boolean | No | false | Automatically detect objects in first frame |
detection_model | string | No | "yolo-world-v2" | Model for auto-detection |
confidence_threshold | float | No | 0.7 | Minimum confidence for tracks |
decimation | integer | No | 1 | Keep every Nth frame (1 = all frames) |
interpolation | string | No | "linear" | Interpolation mode: linear, cubic, none |
max_gap_frames | integer | No | 10 | Maximum frames to interpolate across |
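For reference, the same request issued from Python, a minimal client sketch using the third-party requests library (adjust the host and timeout to your deployment):

import requests

payload = {
    "video_path": "/data/example.mp4",
    "model": "samurai",
    "frame_range": {"start": 0, "end": 100},
    "confidence_threshold": 0.7,
    "decimation": 5,
    "interpolation": "linear",
    "max_gap_frames": 10,
}
resp = requests.post("http://localhost:8000/api/track", json=payload, timeout=300)
resp.raise_for_status()
result = resp.json()
print(result["total_tracks"], "tracks in", result["inference_time_ms"], "ms")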
Response
Status: 200 OK
{
"tracks": [
{
"track_id": 1,
"label": "person",
"boxes": [
{"frame_number": 0, "bbox": {"x": 100, "y": 150, "width": 80, "height": 120}, "confidence": 0.92},
{"frame_number": 5, "bbox": {"x": 105, "y": 155, "width": 82, "height": 118}, "confidence": 0.90},
{"frame_number": 10, "bbox": {"x": 110, "y": 160, "width": 84, "height": 116}, "confidence": 0.89}
],
"frame_count": 3,
"start_frame": 0,
"end_frame": 10
},
{
"track_id": 2,
"label": "ball",
"boxes": [
{"frame_number": 0, "bbox": {"x": 300, "y": 200, "width": 40, "height": 40}, "confidence": 0.85},
{"frame_number": 5, "bbox": {"x": 320, "y": 210, "width": 38, "height": 38}, "confidence": 0.83}
],
"frame_count": 2,
"start_frame": 0,
"end_frame": 5
}
],
"total_tracks": 2,
"frames_processed": 100,
"model_used": "samurai",
"inference_time_ms": 3450,
"decimated": true,
"decimation_factor": 5
}
Response Fields:
Field | Type | Description |
---|---|---|
tracks | array | Object tracks across frames |
track_id | integer | Unique identifier for track |
label | string | Object class |
boxes | array | Bounding boxes per frame |
frame_number | integer | Frame index |
bbox | object | Bounding box coordinates |
confidence | float | Tracking confidence (0.0-1.0) |
frame_count | integer | Number of frames in track |
start_frame | integer | First frame of track |
end_frame | integer | Last frame of track |
total_tracks | integer | Number of tracks returned |
frames_processed | integer | Total frames analyzed |
model_used | string | Tracking model used |
inference_time_ms | integer | Total tracking duration |
decimated | boolean | Whether decimation was applied |
decimation_factor | integer | Decimation interval |
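A common next step in annotation workflows is regrouping this track-oriented response by frame. A minimal sketch, assuming result holds the parsed JSON above:

from collections import defaultdict

def boxes_by_frame(result):
    """Map frame_number -> list of boxes visible at that frame."""
    frames = defaultdict(list)
    for track in result["tracks"]:
        for box in track["boxes"]:
            frames[box["frame_number"]].append({
                "track_id": track["track_id"],
                "label": track["label"],
                "bbox": box["bbox"],
                "confidence": box["confidence"],
            })
    return dict(frames)

# boxes_by_frame(result)[5] -> every tracked box at frame 5, with stable IDs.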
Error Responses
400 Bad Request:
{
"error": "Validation Error",
"message": "frame_range.end must be greater than frame_range.start",
"details": {
"field": "frame_range",
"start": 100,
"end": 50
}
}
404 Not Found:
{
"error": "Not Found",
"message": "Video file not found: /data/missing.mp4"
}
500 Internal Server Error:
{
"error": "Tracking Error",
"message": "Track lost at frame 45",
"track_id": 1,
"frame_number": 45
}
Track ID Preservation
Track IDs remain consistent across frames, enabling object identity tracking.
Track Lifecycle
Birth: Track starts when object first appears
{
"track_id": 1,
"start_frame": 0,
"label": "person"
}
Continuation: Track persists across frames
{
"frame_number": 10,
"track_id": 1,
"confidence": 0.87
}
Death: Track ends when object disappears or confidence drops
{
"track_id": 1,
"end_frame": 95,
"reason": "low_confidence"
}
Re-identification
When objects reappear after occlusion:
Same track ID (successful re-ID):
{
"track_id": 1,
"frame_number": 85,
"confidence": 0.75,
"gap_frames": 15
}
New track ID (failed re-ID):
{
"track_id": 3,
"frame_number": 85,
"confidence": 0.70,
"note": "Possible duplicate of track 1"
}
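When a failed re-ID produces a flagged duplicate, the two fragments can be merged after the fact. A sketch of that client-side cleanup (a post-processing workaround, not a service feature):

def merge_tracks(tracks, keep_id, drop_id):
    """Fold drop_id's boxes into keep_id's track and delete drop_id."""
    by_id = {t["track_id"]: t for t in tracks}
    keep, drop = by_id[keep_id], by_id.pop(drop_id)
    keep["boxes"] = sorted(keep["boxes"] + drop["boxes"],
                           key=lambda b: b["frame_number"])
    keep["start_frame"] = keep["boxes"][0]["frame_number"]
    keep["end_frame"] = keep["boxes"][-1]["frame_number"]
    keep["frame_count"] = len(keep["boxes"])
    return list(by_id.values())

# merge_tracks(result["tracks"], keep_id=1, drop_id=3)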
Handling Occlusions
Temporary Occlusions
Objects briefly hidden behind other objects.
Solution: Interpolation fills gaps up to max_gap_frames.
{
"max_gap_frames": 10,
"interpolation": "linear"
}
Before interpolation:
Frame 0: ✓ Track 1
Frame 5: ✗ Occluded
Frame 10: ✓ Track 1
After interpolation:
Frame 0: ✓ Track 1 (detected)
Frame 5: ✓ Track 1 (interpolated)
Frame 10: ✓ Track 1 (detected)
Permanent Occlusions
Objects remain hidden or leave the scene.
Solution: Track terminates after max_gap_frames.
{
"max_gap_frames": 10
}
If an object is missing for 11 or more frames, the track ends.
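The boundary in one place: gaps of at most max_gap_frames are interpolated, anything longer terminates the track.

def handle_gap(gap_frames, max_gap_frames=10):
    if gap_frames <= max_gap_frames:
        return "interpolate"   # temporary occlusion: fill the gap
    return "terminate"         # permanent occlusion: end the track

print(handle_gap(10))  # interpolate
print(handle_gap(11))  # terminate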
Decimation for Efficiency
Process every Nth frame to reduce computation while maintaining track quality.
Full Frame Processing
{
"decimation": 1 // Process all frames
}
For 100 frames: processes frames 0, 1, 2, 3, ..., 99
Keyframes: 100
Inference time: 4000ms
Decimated Processing
{
"decimation": 5 // Process every 5th frame
}
For 100 frames: processes frames 0, 5, 10, 15, ..., 95
Keyframes: 20
Inference time: 800ms (5x faster)
Interpolated frames: 80
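The keyframe arithmetic, assuming a half-open [start, end) frame range (consistent with the 100 frames processed above):

def keyframes(start, end, decimation):
    """Frame indices actually sent to the model."""
    return list(range(start, end, decimation))

ks = keyframes(0, 100, 5)
print(ks[:4], "...", ks[-1])                       # [0, 5, 10, 15] ... 95
print(len(ks), "keyframes,", 100 - len(ks), "interpolated frames")
# 20 keyframes, 80 interpolated frames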
Impact on Accuracy
Decimation Factor | Keyframes (100 frames) | Speed Gain | Accuracy Loss |
---|---|---|---|
1 (none) | 100 | 1x | 0% |
2 | 50 | 2x | <5% |
5 | 20 | 5x | 5-10% |
10 | 10 | 10x | 10-15% |
20 | 5 | 20x | 15-25% |
Recommendations:
- Slow motion: decimation 2-5
- Normal speed: decimation 5-10
- Fast motion: decimation 1-2
- Static scenes: decimation 10-20
Tracking Model Recommendations
By Use Case
Fast-moving objects (sports, wildlife):
{
"model": "samurai",
"decimation": 2,
"confidence_threshold": 0.6
}
Long videos (surveillance, dashcam):
{
"model": "sam2long",
"decimation": 10,
"max_gap_frames": 30
}
Real-time requirements:
{
"model": "yolo11n-seg",
"decimation": 1,
"confidence_threshold": 0.7
}
General-purpose:
{
"model": "sam2-1",
"decimation": 5,
"confidence_threshold": 0.7
}
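The four configurations above, collected into a lookup table of request presets (the preset names are illustrative):

TRACKING_PRESETS = {
    "fast_motion": {"model": "samurai", "decimation": 2, "confidence_threshold": 0.6},
    "long_video":  {"model": "sam2long", "decimation": 10, "max_gap_frames": 30},
    "real_time":   {"model": "yolo11n-seg", "decimation": 1, "confidence_threshold": 0.7},
    "general":     {"model": "sam2-1", "decimation": 5, "confidence_threshold": 0.7},
}

payload = {"video_path": "/data/game.mp4",
           "frame_range": {"start": 0, "end": 200},
           **TRACKING_PRESETS["fast_motion"]}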
Performance Comparison
Model | FPS (GPU T4) | MOTA | Occlusion Handling | Best For |
---|---|---|---|---|
SAMURAI | 25 | 78.3 | Excellent | Fast motion |
SAM2Long | 30 | 76.8 | Excellent | Long videos |
SAM2.1 | 30 | 73.5 | Good | Baseline |
YOLO11n-seg | 40 | 69.2 | Fair | Speed |
MOTA: Multiple Object Tracking Accuracy (higher is better)
Performance Benchmarks
Throughput (GPU T4)
Model | 100 frames (decimation 1) | 100 frames (decimation 5) | 500 frames (decimation 10) |
---|---|---|---|
SAMURAI | 4.0s (25 fps) | 0.8s (125 fps) | 5.0s (100 fps) |
SAM2Long | 3.3s (30 fps) | 0.7s (143 fps) | 4.2s (119 fps) |
SAM2.1 | 3.3s (30 fps) | 0.7s (143 fps) | 4.2s (119 fps) |
YOLO11n-seg | 2.5s (40 fps) | 0.5s (200 fps) | 3.2s (156 fps) |
Accuracy Metrics
Model | MOTA | IDF1 | MT | ML |
---|---|---|---|---|
SAMURAI | 78.3 | 82.1 | 65% | 8% |
SAM2Long | 76.8 | 80.5 | 62% | 10% |
SAM2.1 | 73.5 | 77.2 | 58% | 12% |
YOLO11n-seg | 69.2 | 72.8 | 52% | 15% |
Metrics:
- MOTA: Multiple Object Tracking Accuracy (formula sketched after this list)
- IDF1: ID F1 Score (identity preservation)
- MT: Mostly Tracked (objects tracked >80% of frames)
- ML: Mostly Lost (objects tracked <20% of frames)
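For reference, MOTA follows the standard CLEAR MOT definition; the counts in this example are invented to show the arithmetic:

def mota(false_negatives, false_positives, id_switches, gt_boxes):
    """MOTA = 1 - (FN + FP + IDSW) / total ground-truth boxes."""
    return 1.0 - (false_negatives + false_positives + id_switches) / gt_boxes

# 120 FN, 80 FP, 17 ID switches over 1000 ground-truth boxes:
print(round(mota(120, 80, 17, 1000), 3))  # 0.783, i.e. MOTA 78.3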
Use Cases and Limitations
When to Use Video Tracking
- Annotation automation: Generate bounding box sequences automatically
- Keyframe reduction: Decimation produces sparse keyframes for manual refinement
- Motion analysis: Track object trajectories and velocities
- Multi-object scenarios: Track multiple objects simultaneously
- Long videos: Process extended sequences efficiently
Limitations
- Identity switches: Similar-looking objects may swap IDs
- Occlusion recovery: Re-identification may fail after long occlusions
- Scale changes: Large size changes reduce accuracy
- Crowded scenes: Many overlapping objects cause confusion
- Motion blur: Fast motion reduces feature quality
Accuracy Expectations
Scenario | Expected MOTA |
---|---|
Clean, isolated objects | 80-90% |
Partial occlusions | 70-80% |
Crowded scenes | 60-70% |
Long-term tracking (>500 frames) | 65-75% |
Fast erratic motion | 55-70% |
Troubleshooting
Track ID Switches
Symptom: Objects exchange IDs
Causes:
- Similar appearance
- Close proximity
- Occlusions
Solutions:
- Reduce decimation:
{
"decimation": 2 // More frequent updates
}
- Increase confidence threshold:
{
"confidence_threshold": 0.8
}
- Use motion-aware model:
{
"model": "samurai"
}
Tracks Lost During Occlusion
Symptom: Tracks terminate when objects are hidden
Cause: max_gap_frames too low
Solution: Increase gap tolerance:
{
"max_gap_frames": 20 // Up from 10
}
Slow Tracking
Symptom: Processing takes minutes for a short video
Causes:
- No decimation
- CPU mode
- Heavy model
Solutions:
- Enable decimation:
{
"decimation": 5
}
- Use faster model:
{
"model": "yolo11n-seg"
}
- Switch to GPU mode (see Configuration)
Too Many False Tracks
Symptom: Background regions tracked as objects
Cause: Confidence threshold too low
Solution: Raise threshold:
{
"confidence_threshold": 0.8 // Up from 0.5
}
Example Workflows
Workflow 1: Automated Annotation
Generate tracks for manual refinement:
curl -X POST http://localhost:8000/api/track \
-H "Content-Type: application/json" \
-d '{
"video_path": "/data/game.mp4",
"model": "samurai",
"frame_range": {"start": 0, "end": 200},
"auto_detect": true,
"detection_model": "yolo-world-v2",
"confidence_threshold": 0.7,
"decimation": 5,
"interpolation": "linear"
}'
Workflow 2: Long Video Surveillance
Track objects in extended footage:
curl -X POST http://localhost:8000/api/track \
-H "Content-Type: application/json" \
-d '{
"video_path": "/data/surveillance.mp4",
"model": "sam2long",
"frame_range": {"start": 0, "end": 5000},
"auto_detect": true,
"confidence_threshold": 0.6,
"decimation": 10,
"max_gap_frames": 30
}'
Workflow 3: High-Precision Tracking
Track with minimal interpolation:
curl -X POST http://localhost:8000/api/track \
-H "Content-Type: application/json" \
-d '{
"video_path": "/data/game.mp4",
"model": "samurai",
"frame_range": {"start": 0, "end": 100},
"initial_detections": [
{"bbox": {"x": 100, "y": 150, "width": 80, "height": 120}, "label": "pitcher"}
],
"confidence_threshold": 0.8,
"decimation": 1,
"interpolation": "cubic"
}'
Next Steps
- Use object detection for track initialization
- Configure models for your hardware
- Use video summarization for context
- Enable ontology augmentation
- Return to overview