COCOeval

Run COCO evaluation to compute AP/AR metrics.

Python:

from hotcoco import COCO, COCOeval

coco_gt = COCO("instances_val2017.json")
coco_dt = coco_gt.load_res("detections.json")

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()

Rust:

use hotcoco::{COCO, COCOeval};
use hotcoco::params::IouType;
use std::path::Path;

let coco_gt = COCO::new(Path::new("instances_val2017.json"))?;
let coco_dt = coco_gt.load_res(Path::new("detections.json"))?;

let mut ev = COCOeval::new(coco_gt, coco_dt, IouType::Bbox);
ev.evaluate();
ev.accumulate();
ev.summarize();

Constructor

Python:

COCOeval(coco_gt: COCO, coco_dt: COCO, iou_type: str)

Parameter Type Description
coco_gt COCO Ground truth COCO object
coco_dt COCO Detections COCO object (from load_res)
iou_type str "bbox", "segm", or "keypoints"

Rust:

COCOeval::new(coco_gt: COCO, coco_dt: COCO, iou_type: IouType) -> Self

Parameter Type Description
coco_gt COCO Ground truth COCO object
coco_dt COCO Detections COCO object (from load_res)
iou_type IouType IouType::Bbox, IouType::Segm, or IouType::Keypoints

Properties

params

params: Params

Evaluation parameters. Modify before calling evaluate().

Python:

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.params.cat_ids = [1, 2, 3]
ev.params.max_dets = [1, 10, 100]

Rust:

pub params: Params

let mut ev = COCOeval::new(coco_gt, coco_dt, IouType::Bbox);
ev.params.cat_ids = vec![1, 2, 3];
ev.params.max_dets = vec![1, 10, 100];

See Params for all configurable fields.


stats

stats: list[float] | None

The 12 summary metrics (10 for keypoints), populated after summarize(). None before summarize() is called.

Python:

ev.summarize()
print(f"AP: {ev.stats[0]:.3f}")
print(f"AP50: {ev.stats[1]:.3f}")

Rust:

pub stats: Option<Vec<f64>>

ev.summarize();
if let Some(stats) = &ev.stats {
    println!("AP: {:.3}", stats[0]);
    println!("AP50: {:.3}", stats[1]);
}

eval_imgs

Per-image evaluation results, populated after evaluate(). See Working with Results for details.

Python:

eval_imgs: list[dict | None]

Rust:

pub eval_imgs: Vec<Option<EvalImg>>
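
A quick sanity check in Python: in pycocotools-style evaluators each entry corresponds to one (category, area range, image) combination, with None marking combinations that had nothing to evaluate. This sketch assumes hotcoco follows that convention:

ev.evaluate()

# Count populated entries; None means no detections or ground truth
# existed for that (category, area range, image) combination.
populated = sum(1 for e in ev.eval_imgs if e is not None)
print(f"{populated} of {len(ev.eval_imgs)} evaluation slots populated")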

eval

Accumulated precision/recall arrays, populated after accumulate(). See Working with Results for details.

Python:

eval: dict | None

Contains "precision", "recall", and "scores" arrays.

Rust:

pub eval: Option<AccumulatedEval>

Access elements with precision_idx(t, r, k, a, m) and recall_idx(t, k, a, m).
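
A minimal Python sketch of reading the accumulated precision array, assuming it follows the standard COCO layout [T, R, K, A, M] (IoU thresholds × recall thresholds × categories × area ranges × max-det settings) with -1 marking empty cells; hotcoco's exact array shape may differ:

import numpy as np

ev.accumulate()
precision = np.asarray(ev.eval["precision"])

# Slice: all IoU thresholds and recall points, all categories,
# area range 0 ("all"), last max-dets setting (assumed layout).
p = precision[:, :, :, 0, -1]
ap = p[p > -1].mean()  # ignore empty cells marked -1
print(f"AP: {ap:.3f}")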


Methods

evaluate

evaluate() -> None

Run per-image evaluation. Detections are matched to ground-truth annotations greedily, in descending confidence order. Must be called before accumulate().

Populates eval_imgs.
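
The matching rule is the standard COCO one: detections are visited in descending score order, and each claims the unmatched ground truth with the highest IoU at or above the threshold. A simplified, self-contained illustration of that rule (not hotcoco's internal code; crowd regions and per-category grouping are omitted):

def greedy_match(num_dts, num_gts, iou, thr=0.5):
    """Illustrative COCO-style greedy matching.

    Detections are assumed pre-sorted by descending score.
    iou[d][g] is the precomputed IoU between detection d and GT g.
    Returns a dict mapping detection index -> matched GT index.
    """
    matched_gts = set()
    matches = {}
    for d in range(num_dts):
        best_g, best_iou = -1, thr
        for g in range(num_gts):
            if g in matched_gts or iou[d][g] < best_iou:
                continue
            best_g, best_iou = g, iou[d][g]
        if best_g >= 0:
            matched_gts.add(best_g)
            matches[d] = best_g
    return matches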


accumulate

accumulate() -> None

Accumulate per-image results into precision/recall curves using interpolated precision at 101 recall thresholds.

Populates eval.
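
"Interpolated" here means each point on the raw PR curve is replaced by the maximum precision at any equal-or-higher recall, then sampled at 101 evenly spaced recall values (0.00, 0.01, ..., 1.00). A standalone sketch of that step, following the standard COCO procedure rather than hotcoco's internals:

import numpy as np

def interpolate_precision(precision, recall, num_points=101):
    """Sample max-interpolated precision at evenly spaced recall thresholds.

    precision, recall: raw PR curve arrays, recall sorted ascending.
    """
    # p_interp(r) = max over r' >= r of p(r'): cumulative max from the right.
    p = np.maximum.accumulate(precision[::-1])[::-1]
    recall_thrs = np.linspace(0.0, 1.0, num_points)
    # For each threshold, take precision at the first recall point >= it;
    # thresholds beyond the max achieved recall get precision 0.
    idx = np.searchsorted(recall, recall_thrs, side="left")
    out = np.zeros(num_points)
    valid = idx < len(p)
    out[valid] = p[idx[valid]]
    return out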


summarize

summarize() -> None

Compute and print the standard COCO metrics. Populates stats.

Non-default parameters

summarize() uses a fixed display format that assumes default iou_thrs, max_dets, and area_rng_lbl. If you've changed any of these, a warning is printed to stderr and some metrics may show -1.000 (e.g. AP50 when iou_thrs doesn't include 0.50). The stats array always has 12 entries (10 for keypoints) regardless of your parameters.

Prints 12 lines for bbox/segm (10 for keypoints):

 Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.382
 Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.584
 ...
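
For example, restricting iou_thrs leaves the fixed-format slots that need other thresholds uncomputable; a sketch of what to expect (index 2 is the AP75 slot, per the standard ordering of the stats array):

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.params.iou_thrs = [0.5]  # drop the default 0.50:0.95 sweep
ev.evaluate()
ev.accumulate()
ev.summarize()  # warns on stderr about non-default parameters

# stats keeps its 12 slots; AP75 needs IoU=0.75, which is gone
print(ev.stats[2])  # -1.0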

run

run() -> None

Run the full pipeline in one call: evaluate() → accumulate() → summarize(). Primarily used with LVIS pipelines (Detectron2, MMDetection) that expect a single run() call.


get_results

get_results() -> dict[str, float]

Return the summary metrics as a dict. Call it after summarize() (or run()); if summarize() has not been called yet, it returns an empty dict.

Standard bbox/segm keys: AP, AP50, AP75, APs, APm, APl, AR1, AR10, AR100, ARs, ARm, ARl.

Keypoint keys: AP, AP50, AP75, APm, APl, AR, AR50, AR75, ARm, ARl.

LVIS keys: AP, AP50, AP75, APs, APm, APl, APr, APc, APf, AR@300, ARs@300, ARm@300, ARl@300.

ev.run()
results = ev.get_results()
print(f"AP: {results['AP']:.3f}, AP50: {results['AP50']:.3f}")

print_results

print_results() -> None

Print a formatted results table to stdout. For LVIS, matches the lvis-api print_results() style. Must be called after summarize() (or run()).
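
Typical usage:

ev.run()
ev.print_results()  # formatted metrics table on stdout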


confusion_matrix

confusion_matrix(
    iou_thr: float = 0.5,
    max_det: int | None = None,
    min_score: float | None = None,
) -> dict

Compute a per-category confusion matrix. Unlike evaluate(), this method compares all detections in an image against all ground truth boxes regardless of category, enabling cross-category confusion analysis.

This method is standalone — no evaluate() call is needed first.

Parameters:

Parameter Type Default Description
iou_thr float 0.5 IoU threshold for a DT↔GT match
max_det int | None last params.max_dets value Max detections per image by score
min_score float | None None Discard detections below this confidence before max_det truncation

Returns a dict with:

Key Type Description
"matrix" np.ndarray[int64] shape (K+1, K+1) Raw confusion counts. Rows = GT category, cols = predicted. Index K is background.
"normalized" np.ndarray[float64] shape (K+1, K+1) Row-normalised version (rows sum to 1.0; zero rows stay zero).
"cat_ids" list[int] Category IDs for rows/cols 0..K-1.
"num_cats" int Number of categories K.
"iou_thr" float IoU threshold used.

Matrix layout (rows = GT, cols = predicted):

  • matrix[i][j] where i ≠ K, j ≠ K — GT category i matched to predicted category j. On-diagonal = TP; off-diagonal = class confusion.
  • matrix[i][K] — GT category i unmatched (false negative).
  • matrix[K][j] — Predicted category j unmatched (false positive).
ev = COCOeval(coco_gt, coco_dt, "bbox")
cm = ev.confusion_matrix(iou_thr=0.5, max_det=100)

matrix = cm["matrix"]
cat_ids = cm["cat_ids"]

# True positives per category
tp = matrix.diagonal()[:-1]

# False negatives per category
fn = matrix[:-1, -1]

# False positives per category
fp = matrix[-1, :-1]

# Normalised view
print(cm["normalized"])

See Confusion Matrix in the evaluation guide for a full walkthrough.


tide_errors

tide_errors(
    pos_thr: float = 0.5,
    bg_thr: float = 0.1,
) -> dict

Decompose detection errors into six TIDE error types (Bolya et al., ECCV 2020) and compute ΔAP — the AP gain from eliminating each error type.

Requires evaluate() to have been called first.

Parameters:

Parameter Type Default Description
pos_thr float 0.5 IoU threshold for TP/FP classification
bg_thr float 0.1 Background IoU threshold for Loc/Both/Bkg discrimination

Returns a dict with:

Key Type Description
"delta_ap" dict[str, float] ΔAP for each error type. Keys: "Cls", "Loc", "Both", "Dupe", "Bkg", "Miss", "FP", "FN".
"counts" dict[str, int] Count of each error type. Keys: "Cls", "Loc", "Both", "Dupe", "Bkg", "Miss".
"ap_base" float Baseline mean AP at pos_thr.
"pos_thr" float IoU threshold used.
"bg_thr" float Background threshold used.
ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.evaluate()
result = ev.tide_errors(pos_thr=0.5, bg_thr=0.1)

print(f"ap_base: {result['ap_base']:.3f}")
for k, v in sorted(result["delta_ap"].items(), key=lambda x: -x[1]):
    if k not in ("FP", "FN"):
        print(f"  {k}: ΔAP={v:.4f}  n={result['counts'].get(k, '—')}")

See TIDE Error Analysis in the evaluation guide for a detailed walkthrough.