Hello @dpascualhe

Technical Improvements and Evaluation Alignment

This document outlines recent contributions to PerceptionMetrics focused on aligning detection evaluation with industry-standard practices and identifying areas for future improvement.

Recent Contributions

1. Evaluation Methodology Alignment

After analyzing discrepancies between our DetectionMetrics results and Ultralytics' YOLO evaluation, I updated the pipeline to match established industry practice:

Removed Confidence Thresholds from mAP/PR Curve Computation

Problem: Previously, predictions were filtered by the model's confidence threshold before computing mAP and PR curves. This violates ranking-based evaluation: mAP and PR curves are defined over the full confidence ranking, so pre-filtering truncates the curves and biases the scores.

Solution:

# Before: predictions filtered by model config threshold
predictions = model.predict(image, confidence_threshold=config['confidence_threshold'])

# After: keep all predictions for ranking-based evaluation
predictions = model.predict(image, confidence_threshold=0.0)
metrics = DetectionMetricsFactory(predictions, ground_truth, iou_threshold=0.5)
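
For context, this is the ranking-based computation the change enables. Below is a minimal sketch (the function name compute_ap and its inputs are illustrative, not the library's API): all predictions are sorted by confidence and precision/recall are accumulated over the full ranking, so no operating point is discarded.

import numpy as np

def compute_ap(scores: np.ndarray, is_tp: np.ndarray, num_gt: int) -> float:
    """AP over all predictions: scores are confidences, is_tp is a boolean
    array marking predictions that matched ground truth, num_gt is the
    number of ground-truth boxes."""
    order = np.argsort(-scores)              # highest confidence first
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # 101-point interpolated AP over the full ranking (COCO-style)
    return float(np.mean([
        precision[recall >= r].max() if (recall >= r).any() else 0.0
        for r in np.linspace(0.0, 1.0, 101)
    ]))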

Files Modified:

  • perceptionmetrics/models/utils/yolo.py
  • perceptionmetrics/models/utils/torchvision.py
  • perceptionmetrics/models/torch_detection.py

Implemented Automatic F1-Maximizing Threshold Selection

Feature: Users can now omit confidence_threshold from the model config; the optimal threshold is then selected automatically.

Implementation:

def _find_optimal_confidence_threshold(self) -> Tuple[float, float]:
    """Find the confidence threshold that maximizes the F1 score."""
    # Assumes predictions and ground truth are stored on the instance
    thresholds = np.linspace(0.01, 0.99, 99)
    best_f1, best_thresh = 0.0, 0.5

    for thresh in thresholds:
        filtered_preds = [p for p in self.predictions if p["confidence"] >= thresh]
        precision, recall = self._compute_pr(filtered_preds, self.ground_truth)
        f1 = 2 * (precision * recall) / (precision + recall + 1e-6)

        if f1 > best_f1:
            best_f1, best_thresh = f1, thresh

    return best_thresh, best_f1
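
A hedged sketch of how this could be used downstream (model_config, predictions, and metrics are placeholder names for this example, not the project's actual objects): the F1-maximizing threshold only acts as a fallback for the final reported detections, while mAP and PR curves are still computed on the unfiltered predictions.

# Fall back to the F1-maximizing threshold when the config omits one
threshold = model_config.get("confidence_threshold")
if threshold is None:
    threshold, best_f1 = metrics._find_optimal_confidence_threshold()

# The chosen threshold only filters the reported detections;
# mAP / PR curves still use the full, unfiltered prediction set
reported = [p for p in predictions if p["confidence"] >= threshold]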

Files Modified:

  • perceptionmetrics/utils/detection_metrics.py

Added Background Class Support to Confusion Matrices

Feature: Following Ultralytics convention, confusion matrices now include an implicit "background" class for complete error analysis.

Implementation:

# Unmatched predictions (FP) -> predicted_class vs background
conf_matrix[self.num_classes, pred_class_id] += 1

# Unmatched ground truth (FN) -> background vs true_class  
conf_matrix[true_class_id, self.num_classes] += 1

Matrix Structure (N classes → (N+1)×(N+1) matrix):

                 Predicted Classes
                 C0   C1   C2   ... Background
True Classes   
C0              TP   FP   FP   ... FN
C1              FP   TP   FP   ... FN  
C2              FP   FP   TP   ... FN
...
Background      FP   FP   FP   ... --
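
For reference, a minimal sketch of how the full (N+1)×(N+1) matrix is populated (matched_pairs, unmatched_pred_classes, and unmatched_gt_classes are assumed outputs of the IoU matching step, named here only for illustration):

import numpy as np

num_classes = 3  # N real classes; index N is the implicit background
conf_matrix = np.zeros((num_classes + 1, num_classes + 1), dtype=int)

# Matched pairs: true class row vs predicted class column (diagonal = TP)
for true_cls, pred_cls in matched_pairs:
    conf_matrix[true_cls, pred_cls] += 1

# Unmatched predictions (FP): background row, predicted class column
for pred_cls in unmatched_pred_classes:
    conf_matrix[num_classes, pred_cls] += 1

# Unmatched ground truth (FN): true class row, background column
for true_cls in unmatched_gt_classes:
    conf_matrix[true_cls, num_classes] += 1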

Files Modified:

  • perceptionmetrics/utils/detection_metrics.py

2. Documentation Overhaul

Created comprehensive documentation to help users and contributors understand the evaluation rationale:

New Documentation:

  • docs/_pages/detection_evaluation.md - Complete evaluation methodology guide
  • examples/MODEL_CONFIG_README.md - Model configuration guide with examples
  • examples/yolo_model_config_example.json - Example YOLO configuration
  • CHANGELOG.md - Detailed changelog with migration guide

Updated Documentation:

  • docs/_pages/compatibility.md - Added evaluation strategy section
  • README.md - Added note about detection evaluation approach
  • docs/_data/navigation.yml - Added navigation link

3. Installation Process Clarification

Restructured installation documentation to distinguish between regular users (pip) and developers (Poetry):

Files Modified:

  • README.md - Clear separation of installation tracks
  • docs/_pages/installation.md - Enhanced installation guide

Why These Changes Matter

  1. Result Reproducibility: Alignment with Ultralytics ensures metrics are comparable with the most widely used detection framework
  2. Cross-Framework Compatibility: Standardized evaluation methodology across different model types
  3. Complete Error Analysis: Background class in confusion matrices provides full visibility into model failures
  4. User Experience: Clear documentation prevents evaluation misinterpretation

Identified Technical Considerations

1. IoU Threshold Handling

Current State: mAP is computed at a single IoU threshold.

Issue: COCO-style evaluation requires mAP averaged over multiple IoU thresholds (0.5:0.05:0.95).

Current Implementation:

# Current: single IoU threshold
metrics = DetectionMetricsFactory(..., iou_threshold=0.5)

Proposed Enhancement:

# Proposed: COCO-style multi-threshold mAP (mAP@[.5:.95])
map_scores = []
for iou_thresh in np.linspace(0.5, 0.95, 10):  # 0.50, 0.55, ..., 0.95
    metrics = DetectionMetricsFactory(..., iou_threshold=iou_thresh)
    map_scores.append(metrics.compute_map())
coco_map = np.mean(map_scores)

2. Memory Efficiency for Large-Scale Evaluation

Current State: All predictions loaded into memory simultaneously.

Issue: For datasets with 100k+ images, this can cause out-of-memory (OOM) errors.

Example from coco.py:

# Current: loads all annotations at once
ann_ids = self.coco.getAnnIds(imgIds=image_id)
anns = self.coco.loadAnns(ann_ids)  # OK for single image

# For large datasets: need batch-wise accumulation
# to avoid OOM when evaluating 100k+ images

Proposed Solution: Implement streaming evaluation with batch-wise metric accumulation.
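
A hedged sketch of what this could look like (StreamingDetectionMetrics is a hypothetical interface, not an existing class in the repo): per-image statistics are folded into running accumulators, so at most one image's predictions are resident in memory, and mAP is derived from accumulated (confidence, is_tp) tuples plus per-class ground-truth counts.

class StreamingDetectionMetrics:
    """Hypothetical streaming accumulator: keeps only running statistics,
    never the full prediction set for the whole dataset."""

    def __init__(self, num_classes: int, iou_threshold: float = 0.5):
        self.num_classes = num_classes
        self.iou_threshold = iou_threshold
        self.scores_and_matches = [[] for _ in range(num_classes)]  # (confidence, is_tp)
        self.gt_counts = [0] * num_classes

    def update(self, predictions, ground_truth):
        # Match one image's predictions to its ground truth, then append
        # only (confidence, is_tp) per prediction and increment GT counts.
        ...

    def compute(self):
        # Derive PR curves / mAP from the accumulated statistics.
        ...

# Usage: iterate the dataset without holding all annotations in memory
metrics = StreamingDetectionMetrics(num_classes=80)
for image, annotations in dataset:
    metrics.update(model.predict(image, confidence_threshold=0.0), annotations)
results = metrics.compute()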

3. Multi-GPU Inference

Current State: Sequential inference on single GPU.

Issue: Large benchmarks take hours to evaluate.

Current Implementation:

# Current: sequential inference
for img in dataset:
    predictions = model.predict(img)

Proposed Enhancement:

# Proposed: DataParallel/DistributedDataParallel
model = torch.nn.DataParallel(model, device_ids=[0,1,2,3])
predictions = model(batched_images)  # up to ~4x throughput on 4 GPUs

4. Evaluation Reproducibility

Issue: Non-deterministic data loading and augmentation can cause result variance.

Proposed Solution:

  • Add deterministic seeding for DataLoader workers (sketched below)
  • Document reproducibility requirements (CUDA determinism, etc.)
  • Add --deterministic flag to CLI
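
A minimal sketch of the deterministic setup this would imply, using standard PyTorch facilities (the --deterministic CLI wiring itself is not shown):

import random
import numpy as np
import torch

def set_deterministic(seed: int = 42) -> None:
    """Seed all relevant RNGs and force deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
    # CUDA may additionally require CUBLAS_WORKSPACE_CONFIG=:4096:8 in the environment

def seed_worker(worker_id: int) -> None:
    """worker_init_fn for DataLoader: derive per-worker seeds deterministically."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# loader = torch.utils.data.DataLoader(dataset, num_workers=4,
#                                      worker_init_fn=seed_worker,
#                                      generator=torch.Generator().manual_seed(42))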

5. Format Standardization

Current State: Detection outputs vary across frameworks (YOLO, TorchVision, MMDetection).

Issue: Each format requires custom parsing logic, increasing bug surface area.

Proposed Solution:

from dataclasses import dataclass

import numpy as np

# Standardized detection format
@dataclass
class Detection:
    bbox: np.ndarray  # [x1, y1, x2, y2], normalized or absolute
    confidence: float
    class_id: int

# All framework adapters convert to this format
# before metric computation
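
As an illustration, converting one framework's raw output then becomes a small adapter (shown here for a TorchVision-style output dict with "boxes", "scores", and "labels"; the function name is ours, not an existing one):

from typing import List

def torchvision_to_detections(raw: dict) -> List[Detection]:
    """Convert one image's TorchVision-style output dict into Detection objects."""
    return [
        Detection(bbox=box.cpu().numpy(), confidence=float(score), class_id=int(label))
        for box, score, label in zip(raw["boxes"], raw["scores"], raw["labels"])
    ]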

Benefits:

  • Simplifies metric computation code
  • Reduces format-specific bugs
  • Makes adding new frameworks easier

Implementation Roadmap

If these enhancements are pursued, the recommended order is:

  1. Format Standardization (foundational - impacts all future work)
  2. Memory Efficiency (unblocks large-scale benchmarks)
  3. Multi-GPU Inference (significant performance improvement)
  4. COCO-style mAP (feature completeness)
  5. Reproducibility Guarantees (research-grade quality)

Each improvement would be:

  • Thoroughly tested with unit and integration tests
  • Documented with examples and migration guides
  • Backward compatible where possible

Code Quality Improvements

All recent contributions include:

  • Type hints for better IDE support
  • Comprehensive docstrings following NumPy style
  • Unit test coverage for critical paths
  • Clear separation of concerns (data loading, inference, metrics)
