Evaluation

hotcoco supports four evaluation types: bounding box, segmentation, keypoints, and oriented bounding box (OBB). All four follow the same workflow.

The three-step pipeline

Every COCO evaluation follows the same pattern:

Python:

from hotcoco import COCO, COCOeval

coco_gt = COCO("annotations.json")
coco_dt = coco_gt.load_res("detections.json")

ev = COCOeval(coco_gt, coco_dt, iou_type)
ev.evaluate()    # Per-image matching
ev.accumulate()  # Aggregate into precision/recall curves
ev.summarize()   # Print and compute the 12 summary metrics

Rust:

use hotcoco::{COCO, COCOeval};
use hotcoco::params::IouType;
use std::path::Path;

let coco_gt = COCO::new(Path::new("annotations.json"))?;
let coco_dt = coco_gt.load_res(Path::new("detections.json"))?;

let mut ev = COCOeval::new(coco_gt, coco_dt, iou_type);
ev.evaluate();    // Per-image matching
ev.accumulate();  // Aggregate into precision/recall curves
ev.summarize();   // Print and compute the 12 summary metrics

The only thing that changes between eval types is the iou_type parameter and the format of your detections.

Bounding box evaluation

Set iou_type to "bbox" (Python) or IouType::Bbox (Rust).

Detection format — each result needs image_id, category_id, bbox as [x, y, width, height], and score:

[
  {"image_id": 42, "category_id": 1, "bbox": [10.0, 20.0, 30.0, 40.0], "score": 0.95},
  ...
]

IoU is computed as the intersection-over-union of the two bounding boxes.
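
As a sketch of that computation (an illustration of the math, not hotcoco's internals):

```python
def bbox_iou(a, b):
    """IoU of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width/height of the intersection rectangle, clamped at zero
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

One wrinkle worth knowing: in COCO-style matching, iscrowd ground truth divides the intersection by the detection's area instead of the union, so a detection inside a crowd region is not penalized.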

Segmentation evaluation

Set iou_type to "segm" (Python) or IouType::Segm (Rust).

Detection format — each result needs image_id, category_id, segmentation as a compressed RLE dict, and score:

[
  {
    "image_id": 42,
    "category_id": 1,
    "segmentation": {"counts": "abc123...", "size": [480, 640]},
    "score": 0.95
  },
  ...
]

What is RLE? Run-Length Encoding is a compact format for binary masks. Instead of storing every pixel, it stores the lengths of alternating runs of 0s and 1s. The size field is [height, width] and counts is a UTF-8 string. To convert a binary mask (numpy array) from your model:

import numpy as np
from hotcoco import mask as mask_utils

binary_mask = ...  # your H×W uint8 array
rle = mask_utils.encode(np.asfortranarray(binary_mask))
if isinstance(rle["counts"], bytes):
    rle["counts"] = rle["counts"].decode("utf-8")

See the Mask Operations guide for more details.

IoU is computed on the binary masks after RLE decoding.

Tip

If your results only have bounding boxes, use bbox evaluation instead. load_res generates polygon segmentations from bboxes, but these are axis-aligned rectangles — not instance masks.

Keypoint evaluation

Set iou_type to "keypoints" (Python) or IouType::Keypoints (Rust).

Detection format — each result needs image_id, category_id, keypoints as a flat list of [x1, y1, v1, x2, y2, v2, ...], and score:

[
  {
    "image_id": 42,
    "category_id": 1,
    "keypoints": [x1, y1, v1, x2, y2, v2, ...],
    "score": 0.95
  },
  ...
]

Each keypoint has an (x, y) position and a visibility flag v (0 = not labeled, 1 = labeled but not visible, 2 = labeled and visible).
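
A minimal sketch of assembling that flat list from per-keypoint (x, y, v) triples:

```python
def flatten_keypoints(triples):
    """[(x, y, v), ...] -> [x1, y1, v1, x2, y2, v2, ...]."""
    flat = []
    for x, y, v in triples:
        flat.extend([x, y, v])
    return flat
```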

Similarity is measured using Object Keypoint Similarity (OKS) instead of IoU. OKS uses per-keypoint sigma values that account for annotation noise — keypoints with higher variance (like hips) are weighted less strictly than precise ones (like eyes).
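
For reference, the standard COCO OKS formula, sketched in Python (an illustration, not hotcoco's implementation; `sigmas` are the per-keypoint constants from the COCO keypoints spec, and the object scale s² is the GT area):

```python
import math

def oks(gt_kpts, dt_kpts, area, sigmas):
    """gt_kpts/dt_kpts: [(x, y, v), ...]; area: GT object area.
    OKS = mean over labeled keypoints (v > 0) of exp(-d^2 / (2 s^2 k^2)),
    with k = 2 * sigma and s^2 = area."""
    num, den = 0.0, 0
    for (gx, gy, v), (dx, dy, _), sigma in zip(gt_kpts, dt_kpts, sigmas):
        if v > 0:  # only ground-truth keypoints that are labeled count
            d2 = (gx - dx) ** 2 + (gy - dy) ** 2
            k = 2 * sigma
            num += math.exp(-d2 / (2 * area * k ** 2 + 1e-12))
            den += 1
    return num / den if den else 0.0
```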

Differences from bbox/segm:

  • 10 metrics instead of 12 (no small area range — keypoints are only meaningful on medium and large objects)
  • Default max detections is [20] instead of [1, 10, 100]
  • Ground truth annotations with num_keypoints == 0 are automatically ignored

Oriented bounding box (OBB) evaluation

Set iou_type to "obb" (Python) or IouType::Obb (Rust).

OBB evaluation is for rotated/oriented detection tasks — aerial imagery, document analysis, and scene text. Each annotation uses a 5-parameter representation: center coordinates, dimensions, and rotation angle.

Detection format — each result needs image_id, category_id, obb as [cx, cy, w, h, angle], and score:

[
  {"image_id": 42, "category_id": 1, "obb": [150.0, 200.0, 80.0, 40.0, 0.785], "score": 0.95},
  ...
]

The obb field is [cx, cy, w, h, angle] where:

  • cx, cy — center of the rotated rectangle
  • w, h — width and height of the rectangle
  • angle — rotation angle in radians, counter-clockwise positive

IoU is computed via polygon intersection of the two rotated rectangles (Sutherland-Hodgman clipping). This is exact — no approximation or rasterization.
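
To make the geometry concrete, the four corners of a [cx, cy, w, h, angle] rectangle follow from a standard 2-D rotation (a sketch of the parameterization, not hotcoco's Rust implementation):

```python
import math

def obb_corners(cx, cy, w, h, angle):
    """Corner points of a rotated rectangle; angle in radians, CCW positive."""
    c, s = math.cos(angle), math.sin(angle)
    corners = []
    # Half-extent offsets of the axis-aligned rectangle, then rotate + translate
    for dx, dy in [(-w/2, -h/2), (w/2, -h/2), (w/2, h/2), (-w/2, h/2)]:
        corners.append((cx + dx * c - dy * s, cy + dx * s + dy * c))
    return corners
```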

Differences from bbox:

  • Uses rotated IoU instead of axis-aligned IoU
  • Same 12 metrics as bbox/segm
  • load_res automatically computes area (w × h) and an axis-aligned bbox (for area-range filtering) from the OBB
  • No pycocotools equivalent exists — hotcoco defines the evaluation protocol

Tip

hotcoco includes DOTA format conversion (to_dota() / from_dota()) for the standard aerial detection benchmark format. See format conversion.

The 12 COCO metrics

summarize() computes and prints these metrics (10 for keypoints). The evaluation protocol is defined in the COCO detection evaluation specification (Lin et al., ECCV 2014):

| Index | Metric | IoU       | Area   | MaxDets |
|-------|--------|-----------|--------|---------|
| 0     | AP     | 0.50:0.95 | all    | 100     |
| 1     | AP     | 0.50      | all    | 100     |
| 2     | AP     | 0.75      | all    | 100     |
| 3     | AP     | 0.50:0.95 | small  | 100     |
| 4     | AP     | 0.50:0.95 | medium | 100     |
| 5     | AP     | 0.50:0.95 | large  | 100     |
| 6     | AR     | 0.50:0.95 | all    | 1       |
| 7     | AR     | 0.50:0.95 | all    | 10      |
| 8     | AR     | 0.50:0.95 | all    | 100     |
| 9     | AR     | 0.50:0.95 | small  | 100     |
| 10    | AR     | 0.50:0.95 | medium | 100     |
| 11    | AR     | 0.50:0.95 | large  | 100     |

Key concepts

IoU (Intersection over Union) measures how well a predicted box (or mask) overlaps with the ground truth. IoU = 1.0 is a perfect match; IoU = 0.0 means no overlap. A detection "counts" only when its IoU with a GT exceeds the threshold. IoU=0.50 is lenient (half the area must overlap); IoU=0.95 is very strict (near-perfect alignment required).

AP (Average Precision) measures how precisely your model ranks its detections. It is the area under the precision-recall curve: a model that confidently finds all objects scores AP=1.0; a model that misses many or produces lots of false positives scores lower. The headline AP metric averages over 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05 — it rewards both finding objects (recall) and being confident only when correct (precision).

AR (Average Recall) measures what fraction of ground-truth objects your model finds, given a cap on detections per image (1, 10, or 100). AR@1 tells you how good your model's single best detection is; AR@100 tells you how much it can find when allowed 100 guesses.

Area ranges break results down by object size in the image:

| Range  | Pixel area      | Intuition                     |
|--------|-----------------|-------------------------------|
| small  | < 1,024 px²     | Smaller than ~32×32 pixels    |
| medium | 1,024–9,216 px² | Roughly 32×32 to 96×96 pixels |
| large  | > 9,216 px²     | Larger than ~96×96 pixels     |

Many models perform differently across sizes — area-range metrics help you identify where improvement is needed.
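
A hypothetical helper (not part of hotcoco) that buckets an annotation's area with these thresholds:

```python
def area_range(area):
    """Bucket a pixel area per the COCO small/medium/large thresholds."""
    if area < 32 ** 2:   # < 1,024 px²
        return "small"
    if area < 96 ** 2:   # < 9,216 px²
        return "medium"
    return "large"
```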

Customizing parameters

Modify ev.params before calling evaluate():

Python:

ev = COCOeval(coco_gt, coco_dt, "bbox")

# Evaluate a subset of categories
ev.params.cat_ids = [1, 3]

# Evaluate a subset of images
ev.params.img_ids = [42, 139]

# Custom IoU thresholds
ev.params.iou_thrs = [0.5, 0.75, 0.9]

# Custom max detections
ev.params.max_dets = [1, 10, 100]

# Category-agnostic evaluation (pool all categories)
ev.params.use_cats = False

ev.evaluate()
ev.accumulate()
ev.summarize()
Rust:

let mut ev = COCOeval::new(coco_gt, coco_dt, IouType::Bbox);

ev.params.cat_ids = vec![1, 3];
ev.params.img_ids = vec![42, 139];
ev.params.iou_thrs = vec![0.5, 0.75, 0.9];
ev.params.max_dets = vec![1, 10, 100];
ev.params.use_cats = false;

ev.evaluate();
ev.accumulate();
ev.summarize();

Note

Changing iou_thrs, max_dets, or area_ranges from their defaults affects what summarize() can display. The 12-metric output format is fixed — for example, AP50 looks for IoU=0.50 in your thresholds and shows -1.000 if it's not there. A warning is printed when your parameters don't match the expected defaults. Filtering by img_ids, cat_ids, or setting use_cats is safe and won't trigger warnings.

See Params for the full list of configurable parameters.

LVIS evaluation

LVIS (Gupta et al., CVPR 2019) is a large-vocabulary instance segmentation dataset with ~1,200 categories. It uses federated annotation — each image is only exhaustively labeled for a subset of categories. Running standard COCO eval on LVIS over-penalizes detectors by treating every unannotated category as a missed detection. hotcoco handles this correctly out of the box.

Drop-in replacement for lvis-api

If your pipeline uses lvis-api (Detectron2, MMDetection, or any code that does from lvis import LVISEval), call init_as_lvis() once at startup:

from hotcoco import init_as_lvis
init_as_lvis()

# Existing lvis-api code works unchanged
from lvis import LVIS, LVISEval, LVISResults

lvis_gt = LVIS("lvis_v1_val.json")
lvis_dt = LVISResults(lvis_gt, "detections.json")
ev = LVISEval(lvis_gt, lvis_dt, "bbox")
ev.run()
ev.print_results()
results = ev.get_results()

Direct usage

If you're not using lvis-api, use LVISeval or pass lvis_style=True to COCOeval:

from hotcoco import COCO, LVISeval

lvis_gt = COCO("lvis_v1_val.json")
lvis_dt = lvis_gt.load_res("detections.json")

ev = LVISeval(lvis_gt, lvis_dt, "segm")  # lvis_style=True is set automatically
ev.run()
results = ev.get_results()
# {"AP": 0.42, "APr": 0.38, "APc": 0.44, "APf": 0.45, "AR@300": ..., ...}

Or equivalently:

from hotcoco import COCO, COCOeval

ev = COCOeval(lvis_gt, lvis_dt, "segm", lvis_style=True)
ev.evaluate()
ev.accumulate()
ev.summarize()
results = ev.get_results()

The 13 LVIS metrics

| Metric  | Description                                    |
|---------|------------------------------------------------|
| AP      | mAP @ IoU[0.5:0.05:0.95]                       |
| AP50    | mAP @ IoU=0.5                                  |
| AP75    | mAP @ IoU=0.75                                 |
| APs     | AP for small objects (area < 32²)              |
| APm     | AP for medium objects (32² ≤ area < 96²)       |
| APl     | AP for large objects (area ≥ 96²)              |
| APr     | AP for rare categories (1–10 training images)  |
| APc     | AP for common categories (11–100 training images) |
| APf     | AP for frequent categories (100+ training images) |
| AR@300  | Mean recall @ max 300 detections per image     |
| ARs@300 | AR for small objects                           |
| ARm@300 | AR for medium objects                          |
| ARl@300 | AR for large objects                           |

The frequency split (rare / common / frequent) is determined by the frequency field on each category in the LVIS annotation file ("r", "c", "f"). These correspond to the number of training images in which the category appears, as defined in the LVIS paper.
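
For example, to inspect the split yourself, group category IDs by that field (a minimal sketch; assumes each category dict carries the LVIS-style frequency field):

```python
from collections import defaultdict

def frequency_split(categories):
    """Group category IDs by LVIS frequency code: 'r', 'c', or 'f'."""
    buckets = defaultdict(list)
    for cat in categories:
        buckets[cat["frequency"]].append(cat["id"])
    return dict(buckets)
```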

get_results() returns all 13 metrics as a dict for programmatic access.

Open Images evaluation

Open Images uses a different evaluation protocol from COCO: a single AP@IoU=0.5, a category hierarchy for annotation expansion, and is_group_of annotations in place of iscrowd. hotcoco supports all three.

Quick start

from hotcoco import COCO, COCOeval

coco_gt = COCO("oid_annotations.json")
coco_dt = coco_gt.load_res("detections.json")

ev = COCOeval(coco_gt, coco_dt, "bbox", oid_style=True)
ev.run()

result = ev.get_results()
# {"AP": 0.573}

oid_style=True sets:

  • IoU threshold = 0.5 (single threshold, no sweep)
  • Area range = "all" only
  • Max detections = 100

Category hierarchy

Open Images categories form a hierarchy — a "Dog" detection also counts as an "Animal" detection if Animal is an ancestor of Dog. Pass a Hierarchy to expand GT annotations automatically at evaluation time.

from hotcoco import COCO, COCOeval, Hierarchy

# From the OID hierarchy JSON (bbox_labels_600_hierarchy.json)
label_to_id = {cat["name"]: cat["id"] for cat in coco_gt.dataset["categories"]}
h = Hierarchy.from_file("bbox_labels_600_hierarchy.json", label_to_id=label_to_id)

ev = COCOeval(coco_gt, coco_dt, "bbox", oid_style=True, hierarchy=h)
ev.run()

Without a hierarchy, oid_style=True still uses OID matching semantics (group-of handling, single IoU threshold) — the hierarchy only affects annotation expansion.

If you don't have a hierarchy JSON, you can derive a hierarchy from supercategory fields in your annotation file:

from hotcoco import Hierarchy

# Derives parent→child relationships from Category.supercategory
h = Hierarchy.from_categories(coco_gt.dataset["categories"])

Or build one manually from a parent map:

h = Hierarchy.from_parent_map({
    3: 1,   # cat 3's parent is cat 1
    4: 1,   # cat 4's parent is cat 1
    5: 2,   # cat 5's parent is cat 2
})
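
Conceptually, expansion walks each category's ancestor chain: a ground truth labeled with a category also counts for every ancestor. A sketch against a plain {child: parent} map like the one above (illustration only, not the Hierarchy internals):

```python
def ancestors(parent_map, cat_id):
    """All ancestor category IDs of cat_id in a {child: parent} map."""
    out = []
    while cat_id in parent_map:
        cat_id = parent_map[cat_id]
        out.append(cat_id)
    return out

def expand_labels(parent_map, cat_id):
    """A GT labeled cat_id also counts for every ancestor category."""
    return [cat_id] + ancestors(parent_map, cat_id)
```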

Detection expansion

By default only GT annotations are expanded up the hierarchy. To also expand detections (so a "Dog" detection also counts as an "Animal" detection):

ev = COCOeval(coco_gt, coco_dt, "bbox", oid_style=True, hierarchy=h)
ev.params.expand_dt = True
ev.run()

Group-of annotations

OID uses is_group_of: true on annotations that represent a cluster of objects rather than a single instance. These are handled differently from iscrowd:

  • Ignored for false negatives — a group-of GT that goes undetected does not count as a miss.
  • Multiple detections can match — if two detections both overlap a group-of GT at IoU ≥ 0.5, both are genuine TPs (no duplicate penalty).

Your annotations need "is_group_of": true in the JSON for this to take effect. Standard annotations without this field default to false.

The OID metric

summarize() reports a single metric:

| Metric | IoU  | Area | MaxDets |
|--------|------|------|---------|
| AP     | 0.50 | all  | 100     |

get_results() returns {"AP": <float>}.

See Hierarchy in the API reference for full construction and query methods.


Confusion matrix

The standard AP pipeline only ever matches detections against ground truth of the same category. That means it can't tell you which categories your model confuses. confusion_matrix() fixes this with a separate cross-category matching pass.

ev = COCOeval(coco_gt, coco_dt, "bbox")
cm = ev.confusion_matrix(iou_thr=0.5, max_det=100)

No evaluate() call is needed first — confusion_matrix() is fully standalone.

Reading the matrix

cm["matrix"] is a (K+1) × (K+1) numpy int64 array where K is the number of categories. Rows are ground truth, columns are predicted. The extra row and column at index K represent "background" — unmatched ground truth (missed detections / false negatives) and unmatched detections (false positives) respectively.

|            | pred cat A    | pred cat B | background |
|------------|---------------|------------|------------|
| gt cat A   | TP (same cat) | confusion  | FN         |
| gt cat B   | confusion     | TP         | FN         |
| background | FP            | FP         | 0          |

cm = ev.confusion_matrix(iou_thr=0.5)

# Raw counts
matrix = cm["matrix"]          # np.ndarray int64, shape (K+1, K+1)
cat_ids = cm["cat_ids"]        # list of category IDs for rows/cols 0..K-1

# True positives per category (diagonal, excluding background)
tp_per_cat = matrix.diagonal()[:-1]

# False negatives per category (GT matched to background column)
fn_per_cat = matrix[:-1, -1]

# False positives per category (background row)
fp_per_cat = matrix[-1, :-1]

# Row-normalized version (each row sums to 1.0)
norm = cm["normalized"]

Finding class confusions

The off-diagonal cells (excluding the background row and column) tell you about cross-category confusions:

import numpy as np

matrix = cm["matrix"][:-1, :-1]   # drop background row/col
cat_ids = cm["cat_ids"]

# Zero the diagonal (TPs) to see only confusions
confusion_only = matrix.copy()
np.fill_diagonal(confusion_only, 0)

# Top confusions
flat = confusion_only.flatten()
top_idx = np.argsort(flat)[::-1][:10]
for idx in top_idx:
    if flat[idx] == 0:
        break
    gt_cat = cat_ids[idx // len(cat_ids)]
    pred_cat = cat_ids[idx % len(cat_ids)]
    print(f"GT {gt_cat} predicted as {pred_cat}: {flat[idx]} times")

Parameters

| Parameter | Default                    | Description                                              |
|-----------|----------------------------|----------------------------------------------------------|
| iou_thr   | 0.5                        | IoU threshold for a DT↔GT match                          |
| max_det   | last params.max_dets value | Max detections per image, sorted by score                |
| min_score | None (keep all)            | Drop detections below this confidence before max_det truncation |

# Stricter threshold, limit to top 50 dets, ignore low-confidence dets
cm = ev.confusion_matrix(iou_thr=0.75, max_det=50, min_score=0.3)

TIDE error analysis

Once AP tells you how good your model is, TIDE (Bolya et al., ECCV 2020) tells you why it falls short. tide_errors() decomposes every false positive and false negative into one of six mutually exclusive error types and reports the ΔAP — how much AP would improve if each error type were eliminated.

evaluate() must be called before tide_errors().

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.evaluate()

result = ev.tide_errors(pos_thr=0.5, bg_thr=0.1)

print(f"Baseline AP: {result['ap_base']:.3f}")
print("\nΔAP by error type (higher = fixing this type gives more AP gain):")
for name in ["Loc", "Bkg", "Miss", "Cls", "Both", "Dupe"]:
    print(f"  {name:4s}: {result['delta_ap'][name]:.4f}")

Error types

Each false-positive detection is assigned exactly one error type (highest-priority match wins):

| Type | Meaning                                                      | Priority |
|------|--------------------------------------------------------------|----------|
| Loc  | Right class, poor localization (bg_thr ≤ IoU < pos_thr)      | 1        |
| Cls  | Wrong class, good location (cross-class IoU ≥ pos_thr)       | 2        |
| Dupe | Duplicate — correct GT already claimed by a higher-scored TP | 3        |
| Bkg  | Pure background (IoU < bg_thr with all GTs)                  | 4        |
| Both | Wrong class AND poor localization (IoU ∈ [bg_thr, pos_thr))  | 5        |

Every unmatched (non-ignored) ground-truth annotation that has no correctable FP DT targeting it is counted as Miss.
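
The priority order can be sketched as a small classifier for a single false-positive detection (an illustration of the table's logic, not hotcoco's implementation; `duplicate` here means the detection's best same-class GT was already claimed by a higher-scored TP):

```python
def classify_fp(best_same_iou, best_cross_iou, duplicate,
                pos_thr=0.5, bg_thr=0.1):
    """Assign a TIDE error type to a false-positive detection.
    best_same_iou / best_cross_iou: highest IoU against same-class /
    other-class GT. Walks the priority order; first rule that fires wins."""
    if bg_thr <= best_same_iou < pos_thr:
        return "Loc"   # right class, poor localization
    if best_cross_iou >= pos_thr:
        return "Cls"   # wrong class, good location
    if duplicate:
        return "Dupe"  # GT already matched by a better detection
    if max(best_same_iou, best_cross_iou) < bg_thr:
        return "Bkg"   # pure background
    return "Both"      # wrong class AND poor localization
```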

Return value

tide_errors() returns a dict:

| Key        | Type             | Description |
|------------|------------------|-------------|
| "delta_ap" | dict[str, float] | ΔAP for each error type. Keys: "Cls", "Loc", "Both", "Dupe", "Bkg", "Miss", "FP" (all FP types combined), "FN" (same as "Miss"). |
| "counts"   | dict[str, int]   | Count of each error type. Keys: "Cls", "Loc", "Both", "Dupe", "Bkg", "Miss". |
| "ap_base"  | float            | Baseline mean AP at pos_thr. |
| "pos_thr"  | float            | IoU threshold for TP/FP classification (default 0.5). |
| "bg_thr"   | float            | Background IoU threshold (default 0.1). |

Prioritizing improvements

The delta_ap values rank where to spend engineering effort:

result = ev.tide_errors()

# Sort errors by impact
deltas = [(k, v) for k, v in result["delta_ap"].items()
          if k not in ("FP", "FN")]
deltas.sort(key=lambda x: -x[1])

print("Priority order for improvement:")
for rank, (name, delta) in enumerate(deltas, 1):
    count = result["counts"].get(name, "—")
    print(f"  {rank}. {name:4s}  ΔAP={delta:.4f}  n={count}")

From the CLI

Pass --tide to coco eval to print the TIDE table after the standard metrics:

coco eval --gt instances_val2017.json --dt bbox_results.json --tide

# Custom thresholds
coco eval --gt instances_val2017.json --dt bbox_results.json \
    --tide --tide-pos-thr 0.75 --tide-bg-thr 0.2

Confidence calibration

A model that outputs confidence 0.9 should be correct about 90% of the time. calibration() measures how well your model's confidence scores align with actual detection accuracy by binning detections by confidence and comparing predicted confidence to the fraction of true positives in each bin.

Requires evaluate() to have been called first.

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.evaluate()

cal = ev.calibration(n_bins=10, iou_threshold=0.5)

print(f"ECE: {cal['ece']:.4f}")   # Expected Calibration Error
print(f"MCE: {cal['mce']:.4f}")   # Maximum Calibration Error
print(f"Detections analyzed: {cal['num_detections']:,}")

Interpreting the results

ECE (Expected Calibration Error) is the weighted average of |accuracy - confidence| across all bins. Lower is better — a perfectly calibrated model has ECE = 0. An ECE of 0.05 means confidence scores are off by 5% on average.

MCE (Maximum Calibration Error) is the worst-case gap in any single bin. Useful for safety-critical applications where the worst bin matters more than the average.
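
Both metrics reduce to simple arithmetic over the per-bin statistics (a sketch using the count, avg_confidence, and avg_accuracy fields returned in cal["bins"]):

```python
def ece_mce(bins):
    """ECE: detection-weighted mean |accuracy - confidence| over bins.
    MCE: the largest gap in any single non-empty bin."""
    total = sum(b["count"] for b in bins)
    gaps = [(b["count"], abs(b["avg_accuracy"] - b["avg_confidence"]))
            for b in bins if b["count"] > 0]
    ece = sum(n * g for n, g in gaps) / total
    mce = max(g for _, g in gaps)
    return ece, mce
```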

Per-bin breakdown

The bins list shows the calibration gap per confidence range:

for b in cal["bins"]:
    if b["count"] > 0:
        gap = b["avg_accuracy"] - b["avg_confidence"]
        label = "overconfident" if gap < 0 else "underconfident"
        print(f"  [{b['bin_lower']:.1f}, {b['bin_upper']:.1f}): "
              f"conf={b['avg_confidence']:.3f} acc={b['avg_accuracy']:.3f} "
              f"({label}, n={b['count']})")

Per-category calibration

per_category maps each category name to its ECE. Some categories may be well-calibrated while others are wildly off:

# Top 5 worst-calibrated categories
worst = sorted(cal["per_category"].items(), key=lambda x: -x[1])[:5]
for name, ece in worst:
    print(f"  {name}: ECE={ece:.4f}")

Reliability diagram

Visualize calibration with plot.reliability_diagram():

from hotcoco import plot

with plot.style():
    fig, ax = plot.reliability_diagram(cal)
    # Or pass the COCOeval directly:
    fig, ax = plot.reliability_diagram(ev, n_bins=15, iou_threshold=0.5)

From the CLI

Pass --calibration to coco eval:

coco eval --gt annotations.json --dt detections.json --calibration

# Custom bins and IoU threshold
coco eval --gt annotations.json --dt detections.json \
    --calibration --cal-bins 15 --cal-iou-thr 0.75

See calibration in the API reference for full parameter details.


F-scores

f_scores() computes F-beta scores from the precision/recall curves built by accumulate(). It finds the confidence threshold that maximizes F-beta for each (IoU, category) combination, then averages — the same summarization strategy as mAP.
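
For reference, the F-beta score at a single operating point (a sketch of the standard formula; hotcoco applies it across the accumulated curves as described above):

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta < 1 weights precision more heavily; beta > 1 weights recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```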

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.run()

scores = ev.f_scores()
# {"F1": 0.523, "F150": 0.712, "F175": 0.581}

Use beta to shift the precision/recall trade-off:

ev.f_scores(beta=0.5)  # precision-weighted  → {"F0.5": ..., "F0.550": ..., "F0.575": ...}
ev.f_scores(beta=2.0)  # recall-weighted     → {"F2.0": ..., "F2.050": ..., "F2.075": ...}

F-scores complement get_results() when you care about a specific operating point rather than area-under-curve. A high AP with a low F1 often signals that performance is concentrated at high recall or high precision, not both simultaneously.

See f_scores in the API reference for full parameter details.

Sliced evaluation

slice_by() re-computes all summary metrics for named subsets of images — without re-running IoU computation. This is useful for comparing model performance across data splits (e.g., indoor vs outdoor, day vs night, small images vs large images).

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.evaluate()

# Define slices as {name: [img_ids]}
slices = {
    "indoor": [42, 139, 203, ...],
    "outdoor": [78, 412, 901, ...],
}

results = ev.slice_by(slices)

Each slice in results contains the full set of summary metrics plus a delta vs the overall baseline:

for name, sr in results.items():
    if name == "_overall":
        continue
    print(f"{name}: AP={sr['AP']:.3f} (Δ{sr['delta']['AP']:+.3f})")

You can also pass a callable instead of a dict — it receives each image dict and returns a slice name (or None to skip):

results = ev.slice_by(lambda img: "large" if img["width"] > 1000 else "small")

From the CLI

Pass --slices <path> to coco eval with a JSON file mapping slice names to image ID lists:

coco eval --gt annotations.json --dt detections.json --slices slices.json

Model comparison

Compare two models on the same dataset with hotcoco.compare(). It computes per-metric deltas, per-category AP differences, and optional bootstrap confidence intervals.

import hotcoco

gt = hotcoco.COCO("annotations.json")
dt_a = gt.load_res("model_a.json")
dt_b = gt.load_res("model_b.json")

ev_a = hotcoco.COCOeval(gt, dt_a, "bbox")
ev_a.evaluate()
ev_b = hotcoco.COCOeval(gt, dt_b, "bbox")
ev_b.evaluate()

result = hotcoco.compare(ev_a, ev_b)
# result["deltas"]["AP"]  → 0.033 (B is better by 3.3 AP points)

Bootstrap confidence intervals

Add n_bootstrap to get confidence intervals on the metric deltas. This resamples images with replacement and re-accumulates metrics for each sample — parallelized with rayon.

result = hotcoco.compare(ev_a, ev_b, n_bootstrap=1000, confidence=0.95)

ci = result["ci"]["AP"]
print(f"AP delta: {result['deltas']['AP']:+.3f}")
print(f"95% CI:   [{ci['lower']:+.3f}, {ci['upper']:+.3f}]")
print(f"P(B > A): {ci['prob_positive']:.1%}")

A CI that excludes zero indicates a statistically significant difference.
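
As a sketch of that check for a CI/CD gate (key names taken from the ci dict above):

```python
def delta_is_significant(ci):
    """True when the bootstrap CI on the metric delta excludes zero."""
    return not (ci["lower"] <= 0.0 <= ci["upper"])
```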

Per-category breakdown

result["per_category"] is sorted by delta ascending (worst regressions first):

for cat in result["per_category"][:5]:  # top 5 regressions
    print(f"{cat['cat_name']:<20} {cat['delta']:+.3f}")

From the CLI

coco compare --gt ann.json --dt-a baseline.json --dt-b improved.json
coco compare --gt ann.json --dt-a a.json --dt-b b.json --bootstrap 1000
coco compare --gt ann.json --dt-a a.json --dt-b b.json --json  # CI/CD

Plotting

from hotcoco import plot

result = hotcoco.compare(ev_a, ev_b, n_bootstrap=1000)
plot.comparison_bar(result, save_path="comparison.png")
plot.category_deltas(result, top_k=10, save_path="deltas.png")

See compare and comparison_bar in the API reference.


Per-image diagnostics & label error detection

image_diagnostics() gives you per-image F1 and AP scores, per-annotation TP/FP/FN classification, and automatically flags suspected label errors in your ground truth. It's the single call that answers "which images should I look at?" and "are my annotations trustworthy?"

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.evaluate()

diag = ev.image_diagnostics(iou_thr=0.5, score_thr=0.5)

Per-image scores

Every image gets an F1 score, AP at the selected IoU threshold, TP/FP/FN counts, and an error profile (perfect, fp_heavy, fn_heavy, or mixed):

# Find the worst images
worst = sorted(diag["img_summary"].items(), key=lambda x: x[1]["f1"])
for img_id, s in worst[:5]:
    print(f"Image {img_id}: F1={s['f1']:.3f}  TP={s['tp']} FP={s['fp']} FN={s['fn']}")

Label error detection

Two types of GT errors are flagged:

  • Wrong label — a high-confidence FP detection that overlaps an unmatched GT of a different category (bbox IoU ≥ 0.5). The model thinks it's a dog, the annotation says cat, and they're in the same spot.
  • Missing annotation — a high-confidence FP with no nearby GT at all (max bbox IoU < 0.1). Likely a real object the annotators missed.

for le in diag["label_errors"][:5]:
    if le["type"] == "wrong_label":
        print(f"Image {le['image_id']}: {le['dt_category']} → {le['gt_category']} IoU={le['iou']:.2f}")
    else:
        print(f"Image {le['image_id']}: {le['dt_category']} (score={le['dt_score']:.2f}) — no GT match")

Only detections with score >= score_thr are considered, so lower the threshold to cast a wider net or raise it to focus on high-confidence candidates.

Annotation index

The result also includes the per-annotation TP/FP/FN index that powers the browse viewer's eval coloring:

diag["dt_status"]  # {ann_id: "tp" | "fp"}
diag["gt_status"]  # {ann_id: "matched" | "fn"}
diag["dt_match"]   # {dt_id: gt_id}  (TP pairs)
diag["gt_match"]   # {gt_id: dt_id}  (reverse)

From the CLI

coco eval --gt annotations.json --dt detections.json --diagnostics

# Custom thresholds
coco eval --gt ann.json --dt det.json --diagnostics --diag-iou-thr 0.75 --diag-score-thr 0.3

See image_diagnostics in the API reference for full parameter details.


Saving results to JSON

results() and save_results() serialize the full evaluation output — parameters, summary metrics, and optionally per-category AP — to a JSON dict or file. Both require summarize() (or run()) first.

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.run()

# Get results as a dict
r = ev.results()
print(r["metrics"]["AP"])      # 0.378
print(r["params"]["iou_type"]) # "bbox"

# Save to a file
ev.save_results("results.json")

# Include per-category AP
ev.save_results("results.json", per_class=True)

The JSON structure:

{
  "params": {
    "iou_type": "bbox",
    "iou_thresholds": [0.5, 0.55, ...],
    "area_ranges": {"all": [0, 10000000000.0], "small": [0, 1024.0], ...},
    "max_dets": [1, 10, 100],
    "is_lvis": false
  },
  "metrics": {
    "AP": 0.378, "AP50": 0.584, "AP75": 0.412, ...
  },
  "per_class": {
    "person": 0.58, "car": 0.41, ...
  }
}

From the CLI, pass --output <path> to coco eval to write the same JSON automatically (always includes per-category AP):

coco eval --gt instances_val2017.json --dt bbox_results.json --output results.json

Logging metrics

get_results() accepts an optional prefix and per_class flag, returning a flat dict[str, float] that plugs directly into any experiment tracker.

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.run()

metrics = ev.get_results(prefix="val/bbox", per_class=True)
# {"val/bbox/AP": 0.578, ..., "val/bbox/AP/person": 0.82, "val/bbox/AP/car": 0.71, ...}

Weights & Biases

import wandb
wandb.log(ev.get_results(prefix="val/bbox", per_class=True), step=epoch)

MLflow

import mlflow
mlflow.log_metrics(ev.get_results(prefix="val/bbox"), step=epoch)

TensorBoard

from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
for k, v in ev.get_results(prefix="val/bbox").items():
    writer.add_scalar(k, v, global_step=epoch)

See get_results in the API reference for full parameter details.