Benchmarks¶
Feature comparison¶
| Feature | pycocotools | faster-coco-eval | hotcoco |
|---|---|---|---|
| Installation | Prebuilt wheels available | Prebuilt wheels available | Prebuilt wheels — pip install just works |
| Metric parity | Reference | Exact | ≤1e-4 bbox, ≤2e-4 segm, exact keypoints |
| LVIS evaluation | No | Yes — via lvis_style=True flag | Yes — 13 metrics, LVISeval class, init_as_lvis() |
| TIDE error analysis | No | No | Yes — 6 error types, ΔAP per type |
| Confusion matrix | No | No | Yes — cross-category, configurable threshold |
| F-scores | No | No | Yes — F-beta at any β |
| Per-class AP | Manual only | Yes — via extended_metrics | Built-in via get_results(per_class=True) |
| Dataset operations | No | No | Yes — filter, merge, split, sample, stats |
| Format conversion | No | No | Yes — COCO ↔ YOLO, VOC, CVAT, DOTA |
| PyTorch integration | Via torchvision | Yes — TorchVision compatible | Yes — CocoDetection, CocoEvaluator |
| Rust API | No | No | Yes — native crate on crates.io |
| CLI | No | No | Yes — coco (Python) + coco-eval (Rust) |
| Results export | No | No | Yes — JSON with params + metrics + per-class |
| Memory at scale | 24 GB committed on O365 | 30 GB committed on O365 | 8 GB committed on O365 |
| Python versions | 3.9+ | 3.7+ | 3.9+ |
| License | BSD | Apache 2.0 | MIT |
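The "Manual only" entry for pycocotools refers to digging per-category values out of `COCOeval`'s accumulated precision array yourself. A minimal sketch of that extraction — using a toy array in place of a real `COCOeval.eval['precision']`, since the indexing convention is what matters:

```python
import numpy as np

# pycocotools stores accumulated precision as a 5-D array indexed
# [iou_threshold, recall, category, area_range, max_dets].
def per_class_ap(precision: np.ndarray) -> np.ndarray:
    """AP per category, averaged over IoU thresholds and recall points
    (area range 'all' = index 0, largest maxDets = index -1)."""
    p = precision[:, :, :, 0, -1]            # shape (T, R, K)
    ap = np.full(p.shape[2], np.nan)
    for k in range(p.shape[2]):
        valid = p[:, :, k][p[:, :, k] > -1]  # -1 marks missing entries
        if valid.size:
            ap[k] = valid.mean()
    return ap

# Toy stand-in for COCOeval.eval['precision'] with 2 categories:
T, R, K, A, M = 10, 101, 2, 4, 3
prec = np.full((T, R, K, A, M), -1.0)
prec[:, :, 0, 0, -1] = 0.5   # category 0: AP 0.5
prec[:, :, 1, 0, -1] = 0.8   # category 1: AP 0.8
print(per_class_ap(prec))    # -> [0.5 0.8]
```

With a real evaluation, `prec` would be `COCOeval.eval['precision']` after calling `accumulate()`.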
Speed benchmarks¶
- Hardware: Apple M1 MacBook Air, 16 GB RAM
- Dataset: COCO val2017 — 5,000 images
- Detections: 36,781 synthetic (seed=42; AP scores are not meaningful)
- Timing: Wall clock time, single run
- Versions: pycocotools 2.0.11, faster-coco-eval 1.7.2, hotcoco 0.3.0
Results (1x detections)¶
| Eval Type | pycocotools | faster-coco-eval | hotcoco |
|---|---|---|---|
| bbox | 9.46s | 2.45s (3.9×) | 0.41s (23.0×) |
| segm | 9.16s | 4.36s (2.1×) | 0.49s (18.6×) |
| keypoints | 2.62s | 1.78s (1.5×) | 0.21s (12.7×) |
Speedups in parentheses are vs pycocotools.
Results (10x detections)¶
Scaling detections by 10x (~368,000) to test behavior under higher load:
| Eval Type | pycocotools | faster-coco-eval | hotcoco |
|---|---|---|---|
| bbox | 34.53s | 5.72s (6.0×) | 1.91s (18.0×) |
| segm | 39.91s | 11.91s (3.4×) | 3.42s (11.7×) |
| keypoints | 16.93s | 16.28s (1.0×) | 1.76s (9.6×) |
hotcoco scales better at higher detection counts due to multi-threaded evaluation.
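The timings in these tables are single-run wall clock. A minimal harness in that style — the `COCOeval` pipeline shown in the comment follows pycocotools' API, while the `time.sleep` call is just a runnable stand-in for an evaluation:

```python
import time

def bench(fn, *args, **kwargs):
    """Single-run wall-clock timing, as used for the tables above."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - t0, result

# With pycocotools this would wrap the full pipeline, e.g.:
#   E = COCOeval(coco_gt, coco_dt, iouType="bbox")
#   elapsed, _ = bench(lambda: (E.evaluate(), E.accumulate(), E.summarize()))
elapsed, _ = bench(time.sleep, 0.05)
print(f"{elapsed:.2f}s")
```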
Objects365 scale benchmark¶
- Hardware: Windows 11, AMD Ryzen 5 5600X, 16 GB RAM + swap
- Dataset: Objects365 val — 80,000 images, 1.2M annotations, 365 categories
- Detections: ~1.2M synthetic bbox (capped at 100/image, seed=42)
- Timing: Wall clock time, single run
- Versions: pycocotools 2.0.11, faster-coco-eval 1.7.2, hotcoco 0.3.0
| Library | Time | Peak RAM | Committed | Speedup |
|---|---|---|---|---|
| pycocotools | 721.18s | 14.34 GB | 23.71 GB | baseline |
| faster-coco-eval | 250.90s | 14.57 GB | 29.96 GB | 2.9x |
| hotcoco | 18.32s | 7.47 GB | 8.11 GB | 39.4x |
Peak RAM is the peak working set (physical memory). Committed includes swap — both pycocotools and faster-coco-eval exceeded physical RAM and relied heavily on the pagefile, which significantly inflated their wall clock times. hotcoco completed within physical memory with minimal swap.
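The Windows numbers above come from OS-level counters (peak working set and commit charge). On Unix-likes the closest stdlib equivalent is `resource.getrusage`, which reports peak resident set size — a sketch of that measurement, assuming a Linux or macOS host:

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MB.
    ru_maxrss is reported in KB on Linux and bytes on macOS."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        rss //= 1024
    return rss / 1024

# Allocate ~50 MB of zero-filled (hence resident) memory to move the peak,
# standing in for an evaluation's working set.
big = bytearray(50 * 1024 * 1024)
print(f"peak RSS: {peak_rss_mb():.0f} MB")
```

Note this captures physical memory only; measuring committed memory (including swap) needs platform-specific tooling.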
Metric parity¶
Dataset: COCO val2017 — 5,000 images, synthetic detections (included in repository)
All 34 metrics match pycocotools within tolerance (bbox ≤1e-4, segm ≤2e-4, keypoints exact):
Bounding box¶
| Metric | pycocotools | hotcoco | Diff |
|---|---|---|---|
| AP | 0.578 | 0.578 | 0.000 |
| AP50 | 0.861 | 0.861 | 0.000 |
| AP75 | 0.600 | 0.600 | 0.000 |
| APs | 0.327 | 0.327 | 0.000 |
| APm | 0.707 | 0.707 | 0.000 |
| APl | 0.918 | 0.918 | 0.000 |
| AR1 | 0.427 | 0.427 | 0.000 |
| AR10 | 0.687 | 0.687 | 0.000 |
| AR100 | 0.701 | 0.701 | 0.000 |
| ARs | 0.437 | 0.437 | 0.000 |
| ARm | 0.806 | 0.806 | 0.000 |
| ARl | 0.960 | 0.960 | 0.000 |
7 of 12 metrics are exact; the remaining 5 differ by less than 1e-4.
Segmentation¶
| Metric | pycocotools | hotcoco | Diff |
|---|---|---|---|
| AP | 0.658 | 0.658 | 0.000 |
| AP50 | 0.923 | 0.923 | 0.000 |
| AP75 | 0.701 | 0.701 | 0.000 |
| APs | 0.461 | 0.461 | 0.000 |
| APm | 0.772 | 0.772 | 0.000 |
| APl | 0.934 | 0.934 | 0.000 |
| AR1 | 0.455 | 0.455 | 0.000 |
| AR10 | 0.746 | 0.746 | 0.000 |
| AR100 | 0.762 | 0.762 | 0.000 |
| ARs | 0.546 | 0.546 | 0.000 |
| ARm | 0.859 | 0.859 | 0.000 |
| ARl | 0.981 | 0.981 | 0.000 |
All metrics accurate to within 2e-4 (shown rounded to 3 decimal places).
Keypoints¶
| Metric | pycocotools | hotcoco | Diff |
|---|---|---|---|
| AP | 0.413 | 0.413 | 0.000 |
| AP50 | 0.606 | 0.606 | 0.000 |
| AP75 | 0.429 | 0.429 | 0.000 |
| APm | 0.403 | 0.403 | 0.000 |
| APl | 0.883 | 0.883 | 0.000 |
| AR1 | 0.766 | 0.766 | 0.000 |
| AR10 | 0.975 | 0.975 | 0.000 |
| AR100 | 0.806 | 0.806 | 0.000 |
| ARm | 0.622 | 0.622 | 0.000 |
| ARl | 0.963 | 0.963 | 0.000 |
Keypoint metrics are exact.
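The parity tables above reduce to an elementwise tolerance check per eval type. A sketch — the arrays here are short illustrative slices, not the full 34-metric vectors:

```python
import numpy as np

# Tolerances from the parity claim above.
TOL = {"bbox": 1e-4, "segm": 2e-4, "keypoints": 0.0}

def check_parity(reference, candidate, iou_type: str) -> bool:
    """True if every metric matches within the documented tolerance.
    With real evaluators, `reference` would be pycocotools'
    COCOeval.stats and `candidate` the corresponding hotcoco results."""
    diff = np.abs(np.asarray(reference) - np.asarray(candidate))
    return bool(np.all(diff <= TOL[iou_type]))

ref  = np.array([0.578, 0.861, 0.600])
cand = np.array([0.57803, 0.86098, 0.60001])  # all within 1e-4 of ref
print(check_parity(ref, cand, "bbox"))        # -> True
```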
Methodology¶
- Wall clock time includes file I/O, evaluation, and accumulation. Excludes Python import time.
- Detections are synthetic — generated from GT annotations with a fixed seed (seed=42), so AP scores are meaningless but detection count and format are representative of real model output. The fixed seed means results are identical across runs.
- Only detections are scaled for the 10x benchmark — ground truth annotations are unchanged.
- Benchmark scripts are in `scripts/` at the repo root.
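The actual generator lives in the benchmark scripts; as an illustration of the seeded approach described above, this sketch perturbs ground-truth boxes with a fixed seed so repeated runs produce identical detections (the jitter magnitude and score range are made up for the example):

```python
import numpy as np

def jitter_gt(gt_boxes, seed=42):
    """Illustrative stand-in for synthetic-detection generation:
    perturb GT boxes with a seeded RNG for reproducible runs."""
    rng = np.random.default_rng(seed)
    boxes = np.asarray(gt_boxes, dtype=float)
    noise = rng.normal(0.0, 2.0, size=boxes.shape)    # small coordinate jitter
    scores = rng.uniform(0.05, 1.0, size=len(boxes))  # fake confidences
    return boxes + noise, scores

gt = [[10, 10, 50, 40], [100, 80, 30, 60]]  # COCO-style [x, y, w, h]
dets_a, scores_a = jitter_gt(gt)
dets_b, scores_b = jitter_gt(gt)
print(np.array_equal(dets_a, dets_b))  # -> True: fixed seed, identical runs
```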
Reproducing the benchmarks¶
You'll need the COCO val2017 annotation files and a working hotcoco build — see the installation page for setup. Then:
```bash
just bench                                   # speed benchmark (1x)
uv run python scripts/bench.py --scale 10    # 10x stress test
just parity                                  # metric parity vs pycocotools
uv run python scripts/bench_objects365.py    # O365 scale (requires O365 annotations)
```