# Object detection evaluation improvements (inspired by YOLO evaluation) #322 #361
base: master
Conversation
Hello @dpascualhe,

## Technical Improvements and Evaluation Alignment

This document outlines recent contributions to PerceptionMetrics focused on aligning detection evaluation with industry-standard practices and identifying areas for future improvement.

## Recent Contributions

### 1. Evaluation Methodology Alignment

After analyzing discrepancies between our DetectionMetrics and Ultralytics' YOLO evaluation, the pipeline has been updated to match established industry practices.

#### Removed Confidence Thresholds from mAP/PR Curve Computation

**Problem:** Previously, predictions were filtered by the model's confidence threshold before computing mAP and PR curves, which violated the ranking-based evaluation principle.

**Solution:**

```python
# Before: predictions filtered by model config threshold
predictions = model.predict(image, confidence_threshold=config['confidence_threshold'])
# After: keep all predictions for ranking-based evaluation
predictions = model.predict(image, confidence_threshold=0.0)
metrics = DetectionMetricsFactory(predictions, ground_truth, iou_threshold=0.5)
```

**Files Modified:**
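For context, here is a minimal sketch of why the ranking matters: AP is computed by sweeping down the confidence-sorted list of *all* predictions, so pre-filtering by a threshold truncates the low-confidence tail and biases the PR curve. The function below is illustrative only; its name and signature are not part of DetectionMetrics.

```python
import numpy as np

def average_precision(confidences, is_true_positive, num_gt):
    """Ranking-based AP over ALL predictions, sorted by confidence."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    # Monotonically non-increasing precision envelope (VOC/COCO style)
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.trapz(precision, recall))

# Dropping predictions below a confidence threshold before calling this
# function removes the tail of the ranking and changes the resulting AP.
```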
#### Implemented Automatic F1-Maximizing Threshold Selection

**Feature:** Users can now omit the confidence threshold; the evaluator automatically selects the value that maximizes the F1 score.

**Implementation:**

```python
def _find_optimal_confidence_threshold(self) -> Tuple[float, float]:
"""Find confidence threshold that maximizes F1 score"""
thresholds = np.linspace(0.01, 0.99, 99)
best_f1, best_thresh = 0.0, 0.5
for thresh in thresholds:
filtered_preds = [p for p in predictions if p['confidence'] >= thresh]
precision, recall = self._compute_pr(filtered_preds, ground_truth)
f1 = 2 * (precision * recall) / (precision + recall + 1e-6)
if f1 > best_f1:
best_f1, best_thresh = f1, thresh
return best_thresh, best_f1Files Modified:
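A hypothetical usage sketch follows; the factory arguments and the exact entry point are assumptions based on the snippets above, not the final API.

```python
# No confidence threshold is passed: the evaluator searches for the
# F1-optimal one over the full, unfiltered set of predictions.
metrics = DetectionMetricsFactory(predictions, ground_truth, iou_threshold=0.5)
best_thresh, best_f1 = metrics._find_optimal_confidence_threshold()
print(f"Selected confidence threshold: {best_thresh:.2f} (F1 = {best_f1:.3f})")
```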
#### Added Background Class Support to Confusion Matrices

**Feature:** Following the Ultralytics convention, confusion matrices now include an implicit "background" class for complete error analysis.

**Implementation:**

```python
# Unmatched predictions (FP) -> predicted_class vs background
conf_matrix[self.num_classes, pred_class_id] += 1
# Unmatched ground truth (FN) -> background vs true_class
conf_matrix[true_class_id, self.num_classes] += 1
```

**Matrix Structure** (N classes → (N+1)×(N+1) matrix): row/column index N is the implicit background class.

**Files Modified:**
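A small self-contained illustration of the convention (toy numbers, not the library's internal code), using rows for ground truth and columns for predictions as implied by the indexing above:

```python
import numpy as np

num_classes = 3  # toy example
# (N+1) x (N+1): the last row/column is the implicit "background" class
conf_matrix = np.zeros((num_classes + 1, num_classes + 1), dtype=int)

# Matched detection (IoU >= threshold): true class row, predicted class column
conf_matrix[1, 1] += 1                # class 1 correctly detected
conf_matrix[2, 0] += 1                # class 2 object misclassified as class 0

# Unmatched prediction (false positive): background row, predicted class column
conf_matrix[num_classes, 0] += 1

# Unmatched ground truth (false negative): true class row, background column
conf_matrix[2, num_classes] += 1
```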
### 2. Documentation Overhaul

Created comprehensive documentation to help users and contributors understand the evaluation rationale.

**New Documentation:**
**Updated Documentation:**
### 3. Installation Process Clarification

Restructured the installation documentation to distinguish between regular users (pip) and developers (Poetry).

**Files Modified:**
## Why These Changes Matter
## Identified Technical Considerations

### 1. IoU Threshold Handling

**Current State:** A single IoU threshold is hardcoded for mAP computation.

**Issue:** COCO-style evaluation requires mAP averaged over multiple IoU thresholds (0.5:0.05:0.95).

**Current Implementation:**

```python
# Current: single IoU threshold
metrics = DetectionMetricsFactory(..., iou_threshold=0.5)
```

**Proposed Enhancement:**

```python
import numpy as np

# Proposed: COCO-style multi-threshold mAP
map_scores = []
for iou_thresh in np.arange(0.5, 1.0, 0.05):
    metrics = DetectionMetricsFactory(..., iou_threshold=iou_thresh)
    map_scores.append(metrics.compute_map())
coco_map = np.mean(map_scores)
```
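As a sanity check, the multi-threshold result could be compared against the reference pycocotools implementation. The snippet below is an optional validation sketch (file names are placeholders), not part of the proposed change itself.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val.json")                # ground-truth annotations
coco_dt = coco_gt.loadRes("detections_coco.json")   # predictions in COCO results format
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
reference_map = evaluator.stats[0]  # AP @ IoU=0.50:0.95, to compare with coco_map
```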
### 2. Memory Efficiency for Large-Scale Evaluation

**Current State:** All predictions are loaded into memory simultaneously.

**Issue:** For datasets with 100k+ images, this causes out-of-memory (OOM) errors.

**Example:**

```python
# Current: loads all annotations at once
ann_ids = self.coco.getAnnIds(imgIds=image_id)
anns = self.coco.loadAnns(ann_ids) # OK for single image
# For large datasets: need batch-wise accumulation
# to avoid OOM when evaluating 100k+ images
```

**Proposed Solution:** Implement streaming evaluation with batch-wise metric accumulation (a sketch follows).
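A minimal sketch of what batch-wise accumulation could look like; the class name and method signatures are illustrative, not an existing DetectionMetrics API. Only per-prediction scalars (confidence and matched/unmatched) plus a ground-truth count are kept, so full per-image prediction objects never accumulate in memory. The final AP step repeats the ranking-based computation shown earlier.

```python
import numpy as np

class StreamingAPAccumulator:
    """Accumulates per-prediction statistics batch by batch."""

    def __init__(self):
        self.confidences = []  # confidence of every prediction seen so far
        self.matched = []      # 1.0 if the prediction matched a GT box, else 0.0
        self.num_gt = 0        # running count of ground-truth boxes

    def update(self, batch_confidences, batch_matched, batch_num_gt):
        """Add one batch worth of already-matched predictions, then discard the batch."""
        self.confidences.extend(batch_confidences)
        self.matched.extend(batch_matched)
        self.num_gt += batch_num_gt

    def compute_ap(self):
        """Ranking-based AP computed from the accumulated scalars."""
        order = np.argsort(-np.asarray(self.confidences, dtype=float))
        tp = np.asarray(self.matched, dtype=float)[order]
        cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1.0 - tp)
        recall = cum_tp / max(self.num_gt, 1)
        precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
        precision = np.maximum.accumulate(precision[::-1])[::-1]
        return float(np.trapz(precision, recall))
```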
### 3. Multi-GPU Inference

**Current State:** Sequential inference on a single GPU.

**Issue:** Large benchmarks take hours to evaluate.

**Current Implementation:**

```python
# Current: sequential inference
for img in dataset:
    predictions = model.predict(img)
```

**Proposed Enhancement:**

```python
import torch

# Proposed: DataParallel/DistributedDataParallel
model = torch.nn.DataParallel(model, device_ids=[0,1,2,3])
predictions = model(batched_images)  # ~4x speedup on 4 GPUs
```

### 4. Evaluation Reproducibility

**Issue:** Non-deterministic data loading and augmentation can cause variance in results.

**Proposed Solution:** Seed every source of randomness and make data loading deterministic; one possible setup is sketched below.
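This is the standard PyTorch determinism recipe rather than existing DetectionMetrics code; the helper names are illustrative.

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> torch.Generator:
    """Fix all common RNG sources and return a generator for the DataLoader."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # deterministic conv algorithms
    torch.backends.cudnn.benchmark = False      # disable non-deterministic autotuning
    generator = torch.Generator()
    generator.manual_seed(seed)
    return generator

def seed_worker(worker_id: int) -> None:
    """Derive per-worker seeds so multi-process data loading stays reproducible."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# Usage (standard torch.utils.data.DataLoader arguments):
# loader = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=4,
#                                      worker_init_fn=seed_worker,
#                                      generator=seed_everything(42))
```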
### 5. Format Standardization

**Current State:** Detection outputs vary across frameworks (YOLO, TorchVision, MMDetection).

**Issue:** Each format requires custom parsing logic, increasing the bug surface area.

**Proposed Solution:**

```python
from dataclasses import dataclass

import numpy as np

# Standardized detection format
@dataclass
class Detection:
    bbox: np.ndarray  # [x1, y1, x2, y2], normalized or absolute
    confidence: float
    class_id: int
# All framework adapters convert to this format
# before metric computation
```

**Benefits:**
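As an illustration of how such adapters could look, here is a hypothetical converter for TorchVision-style detection outputs (dicts with "boxes", "scores" and "labels" tensors), reusing the Detection dataclass above:

```python
from typing import Dict, List

import torch

def torchvision_to_detections(output: Dict[str, torch.Tensor]) -> List[Detection]:
    """Convert one TorchVision-style output dict into standardized Detection objects."""
    boxes = output["boxes"].detach().cpu().numpy()
    scores = output["scores"].detach().cpu().numpy()
    labels = output["labels"].detach().cpu().numpy()
    return [
        Detection(bbox=box, confidence=float(score), class_id=int(label))
        for box, score, label in zip(boxes, scores, labels)
    ]
```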
## Implementation Roadmap

If these enhancements are pursued, the recommended order is:
Each improvement would be:
## Code Quality Improvements

All recent contributions follow:
## References