A modular and extensible multi-object tracking system with interchangeable detection and tracking algorithms, providing a complete pipeline for video processing, object detection, tracking, and visualization.
- Features
- Installation
- Quick Start
- Components
- Project Structure
- Configuration
- Key Configuration Parameters
- Detailed Component Documentation
- Contributing
- License
- Citation
- Modular Architecture: Easily swap between different components (preprocessors, detectors, trackers, post-processors)
- Multiple Detection Models:
  - YOLOv8
  - RT-DETR
  - Grounding DINO
- Multiple Tracking Algorithms:
  - ByteTrack
  - DeepSORT
- Advanced Video Processing:
  - Adaptive frame sampling
  - Scene change detection
  - Temporal smoothing
  - Batch processing support
- Visualization Tools: Built-in tools for visualizing detections and tracks
- Configuration System: Flexible configuration system with defaults and easy overrides
- Performance Optimized: Support for GPU acceleration and batch processing
```bash
# Clone the repository
git clone https://github.com/DanBenAmi/tracking_system.git
cd tracking_system

# Install dependencies
pip install -r requirements.txt
```
There are two main ways to run the tracking pipeline:
The easiest way to run the system is through `main.py`, which uses predefined configurations:
```bash
python main.py
```
The pipeline behavior is controlled through the configurations in `configs/run_configs.py`. Here are the key configuration sections:
```python
# Basic video processing settings
VIDEO_CONFIG = {
    "video_path": "path/to/your/video.mp4",
    "start_time": 180,  # Start processing from 3 minutes
    "end_time": 240     # Process until 4 minutes
}

# Visualization settings
VISUALIZATION_CONFIG = {
    "show_detections": False,  # Show detection boxes
    "show_tracks": True,       # Show tracking results
    "tracks_vis_params": {
        "display": True,       # Show visualization window
        "keep_size": False,    # Maintain original video size
        "frame_delay": 1/10,   # Playback speed
        "save_video": True,    # Save output video
        "output_video_suffix": "_visualized.mp4"
    }
}

# Output settings
OUTPUT_CONFIG = {
    "save_tracks": True,
    "tracks_format": "pickle",  # Options: pickle, json, yaml
    "output_dir": "output",
    "save_original_frames": True
}
```
For more control, you can use the tracking system API directly:
```python
from tracking_system import create_tracking_system
from tracking_system.configs.run_configs import CUSTOM_RUN_CONFIG

# Create tracking system with custom configuration
tracking_system = create_tracking_system(CUSTOM_RUN_CONFIG)

# Process video
video_path = "path/to/your/video.mp4"
tracks = tracking_system.process_video(
    video_path,
    start_time=0,   # Start time in seconds
    end_time=None   # Process until the end
)

# Visualize results
tracking_system.visualize(
    tracking_system._last_frames,
    tracking_system._last_tracks,
    display=True,
    keep_size=True,
    output_path="output_video.mp4"
)
```
Preprocessors:
- BasicPreprocessor: Simple frame resizing and batch processing
- OfflinePreprocessor: Advanced features like scene detection and adaptive sampling

Detectors:
- YOLOv8: Fast and accurate object detection
- RT-DETR: Real-time detection transformer
- Grounding DINO: Vision-language detector

Trackers:
- ByteTrack: High-performance multi-object tracker
- DeepSORT: Classic tracking algorithm with deep association metrics

Post-Processors:
- BasicPostProcessor: Simple filtering based on track length and confidence
- AdvancedOfflinePostProcessor: Sophisticated track refinement with interpolation and smoothing
Note: For detailed explanations of each component's architecture, methodology, and parameter effects, see the Detailed Component Documentation section below.
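To illustrate the modular design, a custom component can be slotted in by implementing the corresponding interface from `base.py`. The sketch below assumes a `BaseDetector` base class with a `detect(frames)` method; the actual interface in `base.py` may differ:

```python
from tracking_system.base import BaseDetector  # assumed interface name


class ThresholdWrappedDetector(BaseDetector):
    """Hypothetical wrapper that drops low-confidence detections."""

    def __init__(self, inner_detector, confidence_threshold=0.5):
        self.inner = inner_detector
        self.confidence_threshold = confidence_threshold

    def detect(self, frames):
        # Delegate to the wrapped detector, then filter each frame's boxes
        all_detections = self.inner.detect(frames)
        return [
            [d for d in frame_dets if d["score"] >= self.confidence_threshold]
            for frame_dets in all_detections
        ]
```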
```
tracking_system/
├── base.py           # Core classes and interfaces
├── preprocessors/    # Video preprocessing components
├── detectors/        # Object detection models
├── trackers/         # Tracking algorithms
├── postprocessors/   # Track post-processing and refinement
├── configs/          # Configuration system
└── utils/            # Utility functions
```
The system uses a three-level configuration hierarchy:
- Default Configs: Base configurations for all components
- Run Configs: Specific configurations for different use cases
- Custom Configs: User-defined configurations that override defaults
Example configuration:
```python
CUSTOM_RUN_CONFIG = {
    "preprocessor": {
        "type": "offline",
        "params": {
            "batch_size": 16,
            "frame_sampling": "uniform",
            "fps": 10.0
        }
    },
    "detector": {
        "type": "yolov8",
        "params": {
            "confidence_threshold": 0.5
        }
    },
    "tracker": {
        "type": "bytetrack",
    },
    "post_processor": {
        "type": "basic",
        "params": {
            "min_track_length": 7
        }
    }
}
```
"detector": {
"type": "yolov8", # Options: yolov8, rtdetr, dino
"params": {
"confidence_threshold": 0.5, # Detection confidence threshold
"model_path": "path/to/weights.pt", # Model weights path
"device": None, # Auto-select GPU/CPU
}
}
"tracker": {
"type": "bytetrack", # Options: bytetrack, deepsort
"params": {
"track_thresh": 0.5, # Tracking confidence threshold
"track_buffer": 30, # Frames to keep track alive
"match_thresh": 0.8 # IOU threshold for matching
}
}
"preprocessor": {
"type": "offline", # Options: basic, offline
"params": {
"batch_size": 32,
"frame_sampling": "adaptive", # Options: uniform, adaptive, scene_based
"fps": 30.0, # Target processing FPS
"target_size": [640, 640] # Input resolution [height, width]
}
}
"post_processor": {
"type": "basic", # Options: basic, advanced_offline
"params": {
"min_track_length": 5, # Minimum track length to keep
"min_confidence": 0.3 # Minimum average confidence
}
}
Note: For a complete list of configurable parameters and their default values, see `configs/default_configs.py`. Feel free to experiment with different parameter combinations to optimize for your specific use case.
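Because run configs override defaults, a custom config only needs the keys you want to change. The recursive merge below is a generic sketch of the idea (the `DEFAULT_RUN_CONFIG` name is illustrative, not necessarily what `default_configs.py` exports):

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` onto `defaults` without mutating either."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Example: raise only the detector threshold, keeping every other default
custom = deep_merge(
    DEFAULT_RUN_CONFIG,  # illustrative name for the base config
    {"detector": {"params": {"confidence_threshold": 0.6}}},
)
```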
BasicPreprocessor: Simple frame preprocessing with minimal overhead.
Key Parameters:
- `target_size`: Controls input resolution
  - Larger sizes improve detection of small objects but increase processing time
  - `None` keeps the original video resolution
- `batch_size`: Number of frames processed together
  - Larger batches improve GPU utilization but require more memory
  - Recommended: 16-32 for a 4 GB GPU, 32-64 for an 8 GB+ GPU
OfflinePreprocessor: Advanced preprocessing with scene analysis and adaptive sampling.
Key Parameters:
- `frame_sampling`: Controls frame selection strategy
  - `"uniform"`: Regular intervals, good for stable scenes
  - `"adaptive"`: More frames in high-motion scenes
  - `"scene_based"`: Focuses on scene changes
- `scene_threshold`: Sensitivity for scene change detection (0-100)
  - Higher values detect subtle changes
  - Lower values only detect major scene changes
- `temporal_smooth`: Applies temporal smoothing
  - Reduces noise but may blur fast motion
- `min_scene_length`: Minimum frames between scene changes
  - Prevents over-segmentation of scenes
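As a rough illustration of scene-based sampling, the sketch below flags a cut when the mean frame difference exceeds a sensitivity-derived score and enough frames have passed since the last cut. This is a stand-in under stated assumptions, not the OfflinePreprocessor's actual algorithm:

```python
import numpy as np


def detect_scene_changes(frames, scene_threshold=50.0, min_scene_length=15):
    """Return indices where the scene appears to change.

    frames: list of HxWx3 uint8 arrays; scene_threshold: sensitivity in 0-100,
    interpreted here so that higher values flag subtler changes (an assumption
    matching the parameter description above).
    """
    cuts, last_cut = [], -min_scene_length
    # Higher sensitivity lowers the difference score required to declare a cut
    required_score = 100.0 - scene_threshold
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(np.int16) - frames[i - 1].astype(np.int16))
        score = diff.mean() / 255.0 * 100.0  # mean pixel change, scaled to 0-100
        if score > required_score and i - last_cut >= min_scene_length:
            cuts.append(i)
            last_cut = i
    return cuts
```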
YOLOv8 is a single-stage object detector that processes the entire image in one forward pass, making it extremely fast. It uses a CSP-Darknet backbone with multiple detection heads at different scales. The architecture employs anchor-free detection with objectness prediction and integrates advanced training techniques like mosaic augmentation and adaptive image scaling. YOLOv8 is particularly good at real-time applications and maintains a good balance between speed and accuracy.
Pros:
- Excellent speed-accuracy trade-off
- Good performance on small objects
- Easy to deploy with many optimized backends
Cons:
- May struggle with densely packed objects
- Less accurate than two-stage detectors in some scenarios
Key Parameters:
- `confidence_threshold`: Minimum detection confidence
  - Higher values (e.g., 0.7): Fewer false positives but might miss objects
  - Lower values (e.g., 0.3): Better recall but more false positives
- `iou_threshold`: NMS overlap threshold
  - Higher values keep more overlapping boxes
  - Lower values aggressively remove overlaps
- `input_size`: Input resolution [height, width]
  - Larger sizes: Better for small objects but slower
  - Smaller sizes: Faster but might miss small objects
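For reference, running YOLOv8 standalone with these parameters via the `ultralytics` package looks roughly like this (the checkpoint name is a placeholder; use whatever weights `model_path` points to):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder checkpoint
results = model.predict(
    "path/to/frame.jpg",
    conf=0.5,   # confidence_threshold
    iou=0.45,   # iou_threshold for NMS
    imgsz=640,  # input_size
)
for result in results:
    # Boxes as (x1, y1, x2, y2), confidence scores, and class ids
    print(result.boxes.xyxy, result.boxes.conf, result.boxes.cls)
```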
RT-DETR (Real-Time Detection Transformer) combines the efficiency of YOLO-style architectures with the power of transformers. It uses a hybrid architecture with a CNN backbone for feature extraction and a lightweight transformer decoder for object detection. The model employs deformable attention and iterative refinement to achieve high accuracy while maintaining real-time performance. RT-DETR is designed to handle complex scenes with varying object scales and occlusions.
Pros:
- Better handling of occlusions and complex scenes
- Strong performance on varying object scales
- More accurate than traditional CNN-only detectors
Cons:
- Slightly slower than pure CNN approaches
- Higher memory requirements
Key Parameters:
- `max_det`: Maximum detections per frame
  - Higher values catch more objects but slow down post-processing
  - Lower values are faster but might miss objects in crowded scenes
- `half`: Use FP16 precision
  - `True`: Faster and uses less memory on supported GPUs
  - `False`: More accurate but slower
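Ultralytics also distributes RT-DETR weights, so a standalone call with the parameters above might look like this (the checkpoint name is illustrative):

```python
from ultralytics import RTDETR

model = RTDETR("rtdetr-l.pt")  # illustrative checkpoint
results = model.predict(
    "path/to/frame.jpg",
    max_det=300,  # cap on detections per frame
    half=True,    # FP16 on supported GPUs
)
```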
Grounding DINO is a vision-language object detector that can detect objects based on natural language descriptions. It uses a transformer-based architecture that jointly processes visual and textual inputs, allowing for zero-shot detection of new object categories. The model employs cross-attention mechanisms to ground language descriptions to visual features and can handle both open-vocabulary and closed-set detection scenarios.
Pros:
- Flexible object category definition through text
- Zero-shot detection capabilities
- Strong semantic understanding
Cons:
- Slower than pure object detectors
- May require careful prompt engineering
- Higher computational requirements
Key Parameters:
- `text_prompt`: Text description of objects to detect
- `box_threshold`: Minimum box confidence
- `text_threshold`: Minimum text-grounding confidence
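A standalone Grounding DINO call using the upstream repository's inference helpers shows how the three parameters fit together (config and weight paths are placeholders):

```python
from groundingdino.util.inference import load_model, load_image, predict

# Placeholder config/weight paths; download from the Grounding DINO repository
model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("path/to/frame.jpg")

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="person . car . bicycle",  # text_prompt: categories separated by " . "
    box_threshold=0.35,
    text_threshold=0.25,
)
```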
ByteTrack is a simple yet effective tracking-by-detection approach that utilizes all detection boxes instead of just high-confidence ones. It employs a two-stage association strategy: first matching high-confidence detections with existing tracks, then using low-confidence detections to recover occluded objects. This approach significantly improves tracking performance in crowded scenes and during occlusions. ByteTrack maintains high efficiency by using simple motion models and IoU-based matching.
Pros:
- State-of-the-art tracking performance
- Robust to occlusions and crowded scenes
- Computationally efficient
Cons:
- May need careful threshold tuning
- Can be sensitive to detection quality
- Limited appearance modeling
Key Parameters:
- `track_thresh`: High-confidence threshold
  - Above this: Create new tracks
  - Below this: Used for track association only
- `match_thresh`: IOU matching threshold
  - Higher values: Stricter matching, fewer ID switches
  - Lower values: More lenient matching, better track continuity
- `track_buffer`: Frames to keep inactive tracks
  - Larger buffer: Better recovery from occlusions
  - Smaller buffer: Less memory usage
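The two-stage association can be sketched as follows. This is a simplified stand-in: greedy IoU matching replaces the Hungarian assignment and Kalman-predicted boxes the real tracker uses, and the data layout is assumed:

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def byte_associate(tracks, detections, track_thresh=0.5, match_thresh=0.8):
    """One frame of ByteTrack-style two-stage association.

    tracks: dicts with a "box" key; detections: dicts with "box" and "score".
    Returns (matched track-detection pairs, unmatched high-confidence detections).
    """
    high = [d for d in detections if d["score"] >= track_thresh]
    low = [d for d in detections if d["score"] < track_thresh]

    def greedy_match(open_tracks, dets):
        matches, leftovers, remaining = [], [], list(dets)
        for t in open_tracks:
            best = max(remaining, key=lambda d: iou(t["box"], d["box"]), default=None)
            if best is not None and iou(t["box"], best["box"]) >= match_thresh:
                matches.append((t, best))
                remaining.remove(best)
            else:
                leftovers.append(t)
        return matches, leftovers, remaining

    # Stage 1: match existing tracks to high-confidence detections
    matches, unmatched_tracks, unmatched_high = greedy_match(tracks, high)
    # Stage 2: try to recover still-unmatched tracks with low-confidence detections
    recovered, _, _ = greedy_match(unmatched_tracks, low)
    return matches + recovered, unmatched_high  # unmatched high boxes seed new tracks
```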
DeepSORT extends the traditional SORT algorithm with deep learning-based appearance features. It combines Kalman filtering for motion prediction with a deep association metric learned from a large-scale person re-identification dataset. The algorithm maintains appearance feature galleries for each track and uses both motion and appearance information for data association. This makes it particularly effective at handling long-term occlusions and ID switches.
Pros:
- Robust to long-term occlusions
- Good identity preservation
- Well-suited for person tracking
Cons:
- Higher computational overhead
- Requires feature extraction
- May struggle with dense crowds
Key Parameters:
- `max_cosine_distance`: Feature similarity threshold
  - Lower values: Stricter feature matching
  - Higher values: More lenient matching
- `nn_budget`: Maximum size of appearance descriptor gallery
  - Larger values: Better reidentification but more memory
- `max_iou_distance`: Maximum IOU distance for matching
  - Controls spatial association strictness
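As an illustration of the deep association metric, the sketch below gates a detection against one track's appearance gallery by cosine distance (the feature layout and gallery handling are assumptions, not DeepSORT's exact code):

```python
import numpy as np


def appearance_cost(track_gallery, detection_feature, max_cosine_distance=0.2):
    """Smallest cosine distance between a detection and one track's gallery.

    track_gallery: (N, D) array of the track's past appearance features,
    typically capped at the last nn_budget entries. Returns the cost,
    or None when it exceeds the gate.
    """
    gallery = track_gallery / np.linalg.norm(track_gallery, axis=1, keepdims=True)
    feature = detection_feature / np.linalg.norm(detection_feature)
    cost = float(np.min(1.0 - gallery @ feature))  # 1 - best cosine similarity
    return cost if cost <= max_cosine_distance else None
```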
BasicPostProcessor: Simple filtering based on track statistics.
Key Parameters:
- `min_track_length`: Minimum frames for a valid track
  - Higher values: More stable tracks but might miss short interactions
  - Lower values: Catches brief appearances but more false tracks
- `min_confidence`: Minimum average confidence
  - Filters out uncertain tracks
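The filtering reduces to something like this sketch (the per-track data layout is an assumption):

```python
def filter_tracks(tracks, min_track_length=5, min_confidence=0.3):
    """Keep tracks that are long enough and confident enough on average.

    tracks: list of tracks, each a list of {"box", "score"} detections.
    """
    kept = []
    for track in tracks:
        if len(track) < min_track_length:
            continue  # too short to be reliable
        mean_conf = sum(d["score"] for d in track) / len(track)
        if mean_conf >= min_confidence:
            kept.append(track)
    return kept
```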
AdvancedOfflinePostProcessor: Sophisticated track refinement with interpolation and smoothing.
Key Parameters:
- `max_frame_gap`: Maximum frames to interpolate
  - Larger gaps: Better track continuity but might create false connections
- `velocity_threshold`: Maximum allowed object velocity
  - Filters out physically impossible movements
- `smooth_window`: Temporal smoothing window size
  - Larger window: Smoother tracks but might lag behind fast motion
- `interpolate_gaps`: Whether to fill tracking gaps
  - `True`: More complete tracks but might create false trajectories
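Gap filling can be pictured as linear interpolation between the last box before a gap and the first box after it, skipped when the gap exceeds `max_frame_gap` (a sketch under an assumed box layout, not the actual implementation):

```python
import numpy as np


def interpolate_gap(box_before, box_after, gap_length, max_frame_gap=10):
    """Linearly interpolate [x1, y1, x2, y2] boxes across a tracking gap.

    Returns gap_length interpolated boxes, or None when the gap is too
    long to bridge without risking a false trajectory.
    """
    if gap_length > max_frame_gap:
        return None
    a = np.asarray(box_before, dtype=float)
    b = np.asarray(box_after, dtype=float)
    fractions = np.linspace(0.0, 1.0, gap_length + 2)[1:-1]  # interior points only
    return [(1.0 - t) * a + t * b for t in fractions]
```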
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this project in your research, please cite:
```bibtex
@misc{tracking_system,
  author    = {Dan Ben Ami},
  title     = {Multi-Object Tracking System},
  year      = {2024},
  publisher = {GitHub},
  url       = {https://github.com/DanBenAmi/tracking_system}
}
```