Hello @dpascualhe

Technical Improvements and Evaluation Alignment

This document outlines recent contributions to PerceptionMetrics focused on aligning detection evaluation with industry-standard practices and identifying areas for future improvement.

Recent Contributions

1. Evaluation Methodology Alignment

After analyzing discrepancies between our DetectionMetrics results and Ultralytics' YOLO evaluation, I updated the pipeline to match established industry practice:

Removed Confidence Thresholds from mAP/PR Curve Computation

Problem: Previously, predictions were filtered by the model's confidence threshold before computing mAP and PR curves. This violates ranking-based evaluation: mAP and PR curves are defined over the full confidence ranking, so pre-filtering truncates the curves and biases the scores.

Solution:

# Before: predictions filtered by model config threshold
predictions = model.predict(image, confidence_threshold=config['confidence_threshold'])

# After: keep all predictions for ranking-based evaluation
predictions = model.predict(image, confidence_threshold=0.0)
metrics = DetectionMetricsFactory(predictions, ground_truth, iou_threshold=0.5)
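
For context, this is the ranking-based computation the change enables. Below is a minimal sketch (the function name compute_ap and its inputs are illustrative, not the library's API): all predictions are sorted by confidence and precision/recall are accumulated over the full ranking, so no operating point is discarded.

import numpy as np

def compute_ap(scores: np.ndarray, is_tp: np.ndarray, num_gt: int) -> float:
    """AP over all predictions: scores are confidences, is_tp is a boolean
    array marking predictions that matched ground truth, num_gt is the
    number of ground-truth boxes."""
    order = np.argsort(-scores)              # highest confidence first
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # 101-point interpolated AP over the full ranking (COCO-style)
    return float(np.mean([
        precision[recall >= r].max() if (recall >= r).any() else 0.0
        for r in np.linspace(0.0, 1.0, 101)
    ]))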

Files Modified:

  • perceptionmetrics/models/utils/yolo.py
  • perceptionmetrics/models/utils/torchvision.py
  • perceptionmetrics/models/torch_detection.py

Implemented Automatic F1-Maximizing Threshold Selection

Feature: Users can now omit confidence_threshold from the model config; the optimal threshold is then selected automatically.

Implementation:

def _find_optimal_confidence_threshold(self) -> Tuple[float, float]:
    """Find the confidence threshold that maximizes the F1 score."""
    # Assumes predictions and ground truth are stored on the instance
    thresholds = np.linspace(0.01, 0.99, 99)
    best_f1, best_thresh = 0.0, 0.5

    for thresh in thresholds:
        filtered_preds = [p for p in self.predictions if p["confidence"] >= thresh]
        precision, recall = self._compute_pr(filtered_preds, self.ground_truth)
        f1 = 2 * (precision * recall) / (precision + recall + 1e-6)

        if f1 > best_f1:
            best_f1, best_thresh = f1, thresh

    return best_thresh, best_f1
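
A hedged sketch of how this could be used downstream (model_config, predictions, and metrics are placeholder names for this example, not the project's actual objects): the F1-maximizing threshold only acts as a fallback for the final reported detections, while mAP and PR curves are still computed on the unfiltered predictions.

# Fall back to the F1-maximizing threshold when the config omits one
threshold = model_config.get("confidence_threshold")
if threshold is None:
    threshold, best_f1 = metrics._find_optimal_confidence_threshold()

# The chosen threshold only filters the reported detections;
# mAP / PR curves still use the full, unfiltered prediction set
reported = [p for p in predictions if p["confidence"] >= threshold]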

Files Modified:

  • perceptionmetrics/utils/detection_metrics.py

Added Background Class Support to Confusion Matrices

Feature: Following Ultralytics convention, confusion matrices now include an implicit "background" class for complete error analysis.

Implementation:

# Unmatched predictions (FP) -> predicted_class vs background
conf_matrix[self.num_classes, pred_class_id] += 1

# Unmatched ground truth (FN) -> background vs true_class  
conf_matrix[true_class_id, self.num_classes] += 1

Matrix Structure (N classes → (N+1)×(N+1) matrix):

                 Predicted Classes
                 C0   C1   C2   ... Background
True Classes   
C0              TP   FP   FP   ... FN
C1              FP   TP   FP   ... FN  
C2              FP   FP   TP   ... FN
...
Background      FP   FP   FP   ... --
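
For reference, a minimal sketch of how the full (N+1)×(N+1) matrix is populated (matched_pairs, unmatched_pred_classes, and unmatched_gt_classes are assumed outputs of the IoU matching step, named here only for illustration):

import numpy as np

num_classes = 3  # N real classes; index N is the implicit background
conf_matrix = np.zeros((num_classes + 1, num_classes + 1), dtype=int)

# Matched pairs: true class row vs predicted class column (diagonal = TP)
for true_cls, pred_cls in matched_pairs:
    conf_matrix[true_cls, pred_cls] += 1

# Unmatched predictions (FP): background row, predicted class column
for pred_cls in unmatched_pred_classes:
    conf_matrix[num_classes, pred_cls] += 1

# Unmatched ground truth (FN): true class row, background column
for true_cls in unmatched_gt_classes:
    conf_matrix[true_cls, num_classes] += 1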

Files Modified:

  • perceptionmetrics/utils/detection_metrics.py

2. Documentation Overhaul

Created comprehensive documentation to help users and contributors understand the evaluation rationale:

New Documentation:

  • docs/_pages/detection_evaluation.md - Complete evaluation methodology guide
  • examples/MODEL_CONFIG_README.md - Model configuration guide with examples
  • examples/yolo_model_config_example.json - Example YOLO configuration
  • CHANGELOG.md - Detailed changelog with migration guide

Updated Documentation:

  • docs/_pages/compatibility.md - Added evaluation strategy section
  • README.md - Added note about detection evaluation approach
  • docs/_data/navigation.yml - Added navigation link

3. Installation Process Clarification

Restructured installation documentation to distinguish between regular users (pip) and developers (Poetry):

Files Modified:

  • README.md - Clear separation of installation tracks
  • docs/_pages/installation.md - Enhanced installation guide

Why These Changes Matter

  1. Result Reproducibility: Alignment with Ultralytics ensures metrics are comparable with the most widely used detection framework
  2. Cross-Framework Compatibility: Standardized evaluation methodology across different model types
  3. Complete Error Analysis: Background class in confusion matrices provides full visibility into model failures
  4. User Experience: Clear documentation prevents evaluation misinterpretation

Identified Technical Considerations

1. IoU Threshold Handling

Current State: mAP is computed at a single IoU threshold.

Issue: COCO-style evaluation requires mAP averaged over multiple IoU thresholds (0.5:0.05:0.95).

Current Implementation:

# Current: single IoU threshold
metrics = DetectionMetricsFactory(..., iou_threshold=0.5)

Proposed Enhancement:

# Proposed: COCO-style multi-threshold mAP (mAP@[.5:.95])
map_scores = []
for iou_thresh in np.linspace(0.5, 0.95, 10):  # 0.50, 0.55, ..., 0.95
    metrics = DetectionMetricsFactory(..., iou_threshold=iou_thresh)
    map_scores.append(metrics.compute_map())
coco_map = np.mean(map_scores)

2. Memory Efficiency for Large-Scale Evaluation

Current State: All predictions loaded into memory simultaneously.

Issue: For datasets with 100k+ images, this can cause out-of-memory (OOM) errors.

Example from coco.py:

# Current: loads all annotations at once
ann_ids = self.coco.getAnnIds(imgIds=image_id)
anns = self.coco.loadAnns(ann_ids)  # OK for single image

# For large datasets: need batch-wise accumulation
# to avoid OOM when evaluating 100k+ images

Proposed Solution: Implement streaming evaluation with batch-wise metric accumulation.
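
A hedged sketch of what this could look like (StreamingDetectionMetrics is a hypothetical interface, not an existing class in the repo): per-image statistics are folded into running accumulators, so at most one image's predictions are resident in memory, and mAP is derived from accumulated (confidence, is_tp) tuples plus per-class ground-truth counts.

class StreamingDetectionMetrics:
    """Hypothetical streaming accumulator: keeps only running statistics,
    never the full prediction set for the whole dataset."""

    def __init__(self, num_classes: int, iou_threshold: float = 0.5):
        self.num_classes = num_classes
        self.iou_threshold = iou_threshold
        self.scores_and_matches = [[] for _ in range(num_classes)]  # (confidence, is_tp)
        self.gt_counts = [0] * num_classes

    def update(self, predictions, ground_truth):
        # Match one image's predictions to its ground truth, then append
        # only (confidence, is_tp) per prediction and increment GT counts.
        ...

    def compute(self):
        # Derive PR curves / mAP from the accumulated statistics.
        ...

# Usage: iterate the dataset without holding all annotations in memory
metrics = StreamingDetectionMetrics(num_classes=80)
for image, annotations in dataset:
    metrics.update(model.predict(image, confidence_threshold=0.0), annotations)
results = metrics.compute()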

3. Multi-GPU Inference

Current State: Sequential inference on single GPU.

Issue: Large benchmarks take hours to evaluate.

Current Implementation:

# Current: sequential inference
for img in dataset:
    predictions = model.predict(img)

Proposed Enhancement:

# Proposed: DataParallel/DistributedDataParallel
model = torch.nn.DataParallel(model, device_ids=[0,1,2,3])
predictions = model(batched_images)  # up to ~4x throughput on 4 GPUs

4. Evaluation Reproducibility

Issue: Non-deterministic data loading and augmentation can cause result variance.

Proposed Solution:

  • Add deterministic seeding for DataLoader workers (sketched below)
  • Document reproducibility requirements (CUDA determinism, etc.)
  • Add --deterministic flag to CLI
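
A minimal sketch of the deterministic setup this would imply, using standard PyTorch facilities (the --deterministic CLI wiring itself is not shown):

import random
import numpy as np
import torch

def set_deterministic(seed: int = 42) -> None:
    """Seed all relevant RNGs and force deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
    # CUDA may additionally require CUBLAS_WORKSPACE_CONFIG=:4096:8 in the environment

def seed_worker(worker_id: int) -> None:
    """worker_init_fn for DataLoader: derive per-worker seeds deterministically."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# loader = torch.utils.data.DataLoader(dataset, num_workers=4,
#                                      worker_init_fn=seed_worker,
#                                      generator=torch.Generator().manual_seed(42))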

5. Format Standardization

Current State: Detection outputs vary across frameworks (YOLO, TorchVision, MMDetection).

Issue: Each format requires custom parsing logic, increasing bug surface area.

Proposed Solution:

from dataclasses import dataclass

import numpy as np

# Standardized detection format
@dataclass
class Detection:
    bbox: np.ndarray  # [x1, y1, x2, y2], normalized or absolute
    confidence: float
    class_id: int

# All framework adapters convert to this format
# before metric computation
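
As an illustration, converting one framework's raw output then becomes a small adapter (shown here for a TorchVision-style output dict with "boxes", "scores", and "labels"; the function name is ours, not an existing one):

from typing import List

def torchvision_to_detections(raw: dict) -> List[Detection]:
    """Convert one image's TorchVision-style output dict into Detection objects."""
    return [
        Detection(bbox=box.cpu().numpy(), confidence=float(score), class_id=int(label))
        for box, score, label in zip(raw["boxes"], raw["scores"], raw["labels"])
    ]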

Benefits:

  • Simplifies metric computation code
  • Reduces format-specific bugs
  • Makes adding new frameworks easier

Implementation Roadmap

If these enhancements are pursued, the recommended order is:

  1. Format Standardization (foundational - impacts all future work)
  2. Memory Efficiency (unblocks large-scale benchmarks)
  3. Multi-GPU Inference (significant performance improvement)
  4. COCO-style mAP (feature completeness)
  5. Reproducibility Guarantees (research-grade quality)

Each improvement would be:

  • Thoroughly tested with unit and integration tests
  • Documented with examples and migration guides
  • Backward compatible where possible

Code Quality Improvements

All recent contributions include:

  • Type hints for better IDE support
  • Comprehensive docstrings following NumPy style
  • Unit test coverage for critical paths
  • Clear separation of concerns (data loading, inference, metrics)
