Skip to content

Benchmarks

Feature comparison

Feature pycocotools faster-coco-eval hotcoco
Installation Prebuilt wheels available Prebuilt wheels available Prebuilt wheels — pip install just works
Metric parity Reference Exact ≤1e-4 bbox, ≤2e-4 segm, exact keypoints
LVIS evaluation No Yes — via lvis_style=True flag Yes — 13 metrics, LVISeval class, init_as_lvis()
TIDE error analysis No No Yes — 6 error types, ΔAP per type
Confusion matrix No No Yes — cross-category, configurable threshold
F-scores No No Yes — F-beta at any β
Per-class AP Manual only Yes — via extended_metrics Built-in via get_results(per_class=True)
Dataset operations No No Yes — filter, merge, split, sample, stats
Format conversion No No Yes — COCO ↔ YOLO, VOC, CVAT, DOTA
PyTorch integration Via torchvision Yes — TorchVision compatible Yes — CocoDetection, CocoEvaluator
Rust API No No Yes — native crate on crates.io
CLI No No Yes — coco (Python) + coco-eval (Rust)
Results export No No Yes — JSON with params + metrics + per-class
Memory at scale 24 GB committed on O365 30 GB committed on O365 8 GB committed on O365
Python versions 3.9+ 3.7+ 3.9+
License BSD Apache 2.0 MIT

Speed benchmarks

Hardware: Apple M1 MacBook Air, 16 GB RAM Dataset: COCO val2017 — 5,000 images Detections: 36,781 synthetic (seed=42; AP scores are not meaningful) Timing: Wall clock time, single run Versions: pycocotools 2.0.11, faster-coco-eval 1.7.2, hotcoco 0.3.0

Results (1x detections)

Eval Type pycocotools faster-coco-eval hotcoco
bbox 9.46s 2.45s (3.9×) 0.41s (23.0×)
segm 9.16s 4.36s (2.1×) 0.49s (18.6×)
keypoints 2.62s 1.78s (1.5×) 0.21s (12.7×)

Speedups in parentheses are vs pycocotools.

Results (10x detections)

Scaling detections by 10x (~368,000) to test behavior under higher load:

Eval Type pycocotools faster-coco-eval hotcoco
bbox 34.53s 5.72s (6.0×) 1.91s (18.0×)
segm 39.91s 11.91s (3.4×) 3.42s (11.7×)
keypoints 16.93s 16.28s (1.0×) 1.76s (9.6×)

hotcoco scales better at higher detection counts due to multi-threaded evaluation.

Objects365 scale benchmark

Hardware: Windows 11, AMD Ryzen 5 5600X, 16 GB RAM + swap Dataset: Objects365 val — 80,000 images, 1.2M annotations, 365 categories Detections: ~1.2M synthetic bbox (capped at 100/image, seed=42) Timing: Wall clock time, single run Versions: pycocotools 2.0.11, faster-coco-eval 1.7.2, hotcoco 0.3.0

Library Time Peak RAM Committed Speedup
pycocotools 721.18s 14.34 GB 23.71 GB baseline
faster-coco-eval 250.90s 14.57 GB 29.96 GB 2.9x
hotcoco 18.32s 7.47 GB 8.11 GB 39.4x

Peak RAM is the peak working set (physical memory). Committed includes swap — both pycocotools and faster-coco-eval exceeded physical RAM and relied heavily on the pagefile, which significantly inflated their wall clock times. hotcoco completed within physical memory with minimal swap.

Metric parity

Dataset: COCO val2017 — 5,000 images, synthetic detections (included in repository)

All 34 metrics match pycocotools within tolerance (bbox ≤1e-4, segm ≤2e-4, keypoints exact):

Bounding box

Metric pycocotools hotcoco Diff
AP 0.578 0.578 0.000
AP50 0.861 0.861 0.000
AP75 0.600 0.600 0.000
APs 0.327 0.327 0.000
APm 0.707 0.707 0.000
APl 0.918 0.918 0.000
AR1 0.427 0.427 0.000
AR10 0.687 0.687 0.000
AR100 0.701 0.701 0.000
ARs 0.437 0.437 0.000
ARm 0.806 0.806 0.000
ARl 0.960 0.960 0.000

7 of 12 metrics are exact; the remaining 5 differ by less than 1e-4.

Segmentation

Metric pycocotools hotcoco Diff
AP 0.658 0.658 0.000
AP50 0.923 0.923 0.000
AP75 0.701 0.701 0.000
APs 0.461 0.461 0.000
APm 0.772 0.772 0.000
APl 0.934 0.934 0.000
AR1 0.455 0.455 0.000
AR10 0.746 0.746 0.000
AR100 0.762 0.762 0.000
ARs 0.546 0.546 0.000
ARm 0.859 0.859 0.000
ARl 0.981 0.981 0.000

All metrics accurate to within 2e-4 (shown rounded to 3 decimal places).

Keypoints

Metric pycocotools hotcoco Diff
AP 0.413 0.413 0.000
AP50 0.606 0.606 0.000
AP75 0.429 0.429 0.000
APm 0.403 0.403 0.000
APl 0.883 0.883 0.000
AR1 0.766 0.766 0.000
AR10 0.975 0.975 0.000
AR100 0.806 0.806 0.000
ARm 0.622 0.622 0.000
ARl 0.963 0.963 0.000

Keypoint metrics are exact.

Methodology

  • Wall clock time includes file I/O, evaluation, and accumulation. Excludes Python import time.
  • Detections are synthetic — generated from GT annotations with a fixed seed (seed=42), so AP scores are meaningless but detection count and format are representative of real model output. Fixed seed means results are identical across runs.
  • Only detections are scaled for the 10x benchmark — ground truth annotations are unchanged.
  • Benchmark scripts are in scripts/ at the repo root.

Reproducing the benchmarks

You'll need the COCO val2017 annotation files and a working hotcoco build — see the installation page for setup. Then:

just bench                                        # speed benchmark (1x)
uv run python scripts/bench.py --scale 10        # 10x stress test
just parity                                       # metric parity vs pycocotools
uv run python scripts/bench_objects365.py        # O365 scale (requires O365 annotations)