COCOeval

Run COCO evaluation to compute AP/AR metrics.

Python:

from hotcoco import COCO, COCOeval

coco_gt = COCO("instances_val2017.json")
coco_dt = coco_gt.load_res("detections.json")

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()

Rust:

use hotcoco::{COCO, COCOeval};
use hotcoco::params::IouType;
use std::path::Path;

let coco_gt = COCO::new(Path::new("instances_val2017.json"))?;
let coco_dt = coco_gt.load_res(Path::new("detections.json"))?;

let mut ev = COCOeval::new(coco_gt, coco_dt, IouType::Bbox);
ev.evaluate();
ev.accumulate();
ev.summarize();

Constructor

Python:

COCOeval(coco_gt: COCO, coco_dt: COCO, iou_type: str)

Parameter Type Description
coco_gt COCO Ground truth COCO object
coco_dt COCO Detections COCO object (from load_res)
iou_type str "bbox", "segm", or "keypoints"

Rust:

COCOeval::new(coco_gt: COCO, coco_dt: COCO, iou_type: IouType) -> Self

Parameter Type Description
coco_gt COCO Ground truth COCO object
coco_dt COCO Detections COCO object (from load_res)
iou_type IouType IouType::Bbox, IouType::Segm, or IouType::Keypoints

Properties

params

params: Params

Evaluation parameters. Modify before calling evaluate().

Python:

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.params.cat_ids = [1, 2, 3]
ev.params.max_dets = [1, 10, 100]

Rust:

pub params: Params

let mut ev = COCOeval::new(coco_gt, coco_dt, IouType::Bbox);
ev.params.cat_ids = vec![1, 2, 3];
ev.params.max_dets = vec![1, 10, 100];

See Params for all configurable fields.


stats

stats: list[float] | None

The 12 summary metrics (10 for keypoints), populated after summarize(). None before summarize() is called.

Python:

ev.summarize()
print(f"AP: {ev.stats[0]:.3f}")
print(f"AP50: {ev.stats[1]:.3f}")

Rust:

pub stats: Option<Vec<f64>>

ev.summarize();
if let Some(stats) = &ev.stats {
    println!("AP: {:.3}", stats[0]);
    println!("AP50: {:.3}", stats[1]);
}

eval_imgs

Per-image evaluation results, populated after evaluate(). See Working with Results for details.

Python:

eval_imgs: list[dict | None]

Rust:

pub eval_imgs: Vec<Option<EvalImg>>
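
A quick sanity check in Python: in pycocotools-style evaluators each entry corresponds to one (category, area range, image) combination, with None marking combinations that had nothing to evaluate. This sketch assumes hotcoco follows that convention:

ev.evaluate()

# Count populated entries; None means no detections or ground truth
# existed for that (category, area range, image) combination.
populated = sum(1 for e in ev.eval_imgs if e is not None)
print(f"{populated} of {len(ev.eval_imgs)} evaluation slots populated")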

eval

Accumulated precision/recall arrays, populated after accumulate(). See Working with Results for details.

Python:

eval: dict | None

Contains "precision", "recall", and "scores" arrays.

Rust:

pub eval: Option<AccumulatedEval>

Access elements with precision_idx(t, r, k, a, m) and recall_idx(t, k, a, m).
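
A minimal Python sketch of reading the accumulated precision array, assuming it follows the standard COCO layout [T, R, K, A, M] (IoU thresholds × recall thresholds × categories × area ranges × max-det settings) with -1 marking empty cells; hotcoco's exact array shape may differ:

import numpy as np

ev.accumulate()
precision = np.asarray(ev.eval["precision"])

# Slice: all IoU thresholds and recall points, all categories,
# area range 0 ("all"), last max-dets setting (assumed layout).
p = precision[:, :, :, 0, -1]
ap = p[p > -1].mean()  # ignore empty cells marked -1
print(f"AP: {ap:.3f}")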


Methods

evaluate

evaluate() -> None

Run per-image evaluation. Detections are matched to ground-truth annotations greedily, in descending confidence order. Must be called before accumulate().

Populates eval_imgs.
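
The matching rule is the standard COCO one: detections are visited in descending score order, and each claims the unmatched ground truth with the highest IoU at or above the threshold. A simplified, self-contained illustration of that rule (not hotcoco's internal code; crowd regions and per-category grouping are omitted):

def greedy_match(num_dts, num_gts, iou, thr=0.5):
    """Illustrative COCO-style greedy matching.

    Detections are assumed pre-sorted by descending score.
    iou[d][g] is the precomputed IoU between detection d and GT g.
    Returns a dict mapping detection index -> matched GT index.
    """
    matched_gts = set()
    matches = {}
    for d in range(num_dts):
        best_g, best_iou = -1, thr
        for g in range(num_gts):
            if g in matched_gts or iou[d][g] < best_iou:
                continue
            best_g, best_iou = g, iou[d][g]
        if best_g >= 0:
            matched_gts.add(best_g)
            matches[d] = best_g
    return matches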


accumulate

accumulate() -> None

Accumulate per-image results into precision/recall curves using interpolated precision at 101 recall thresholds.

Populates eval.
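
"Interpolated" here means each point on the raw PR curve is replaced by the maximum precision at any equal-or-higher recall, then sampled at 101 evenly spaced recall values (0.00, 0.01, ..., 1.00). A standalone sketch of that step, following the standard COCO procedure rather than hotcoco's internals:

import numpy as np

def interpolate_precision(precision, recall, num_points=101):
    """Sample max-interpolated precision at evenly spaced recall thresholds.

    precision, recall: raw PR curve arrays, recall sorted ascending.
    """
    # p_interp(r) = max over r' >= r of p(r'): cumulative max from the right.
    p = np.maximum.accumulate(precision[::-1])[::-1]
    recall_thrs = np.linspace(0.0, 1.0, num_points)
    # For each threshold, take precision at the first recall point >= it;
    # thresholds beyond the max achieved recall get precision 0.
    idx = np.searchsorted(recall, recall_thrs, side="left")
    out = np.zeros(num_points)
    valid = idx < len(p)
    out[valid] = p[idx[valid]]
    return out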


summarize

summarize() -> None

Compute and print the standard COCO metrics. Populates stats.

Non-default parameters

summarize() uses a fixed display format that assumes default iou_thrs, max_dets, and area_rng_lbl. If you've changed any of these, a warning is printed to stderr and some metrics may show -1.000 (e.g. AP50 when iou_thrs doesn't include 0.50). The stats array always has 12 entries (10 for keypoints) regardless of your parameters.

Prints 12 lines for bbox/segm (10 for keypoints):

 Average Precision (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.382
 Average Precision (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.584
 ...
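
For example, restricting iou_thrs leaves the fixed-format slots that need other thresholds uncomputable; a sketch of what to expect (index 2 is the AP75 slot, per the standard ordering of the stats array):

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.params.iou_thrs = [0.5]  # drop the default 0.50:0.95 sweep
ev.evaluate()
ev.accumulate()
ev.summarize()  # warns on stderr about non-default parameters

# stats keeps its 12 slots; AP75 needs IoU=0.75, which is gone
print(ev.stats[2])  # -1.0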

run

run() -> None

Run the full pipeline in one call: evaluate() → accumulate() → summarize(). Primarily used with LVIS pipelines (Detectron2, MMDetection) that expect a single run() call.


get_results

get_results() -> dict[str, float]

Return the summary metrics as a dict. Call it after summarize() (or run()); if summarize() has not been called yet, it returns an empty dict.

Standard bbox/segm keys: AP, AP50, AP75, APs, APm, APl, AR1, AR10, AR100, ARs, ARm, ARl.

Keypoint keys: AP, AP50, AP75, APm, APl, AR, AR50, AR75, ARm, ARl.

LVIS keys: AP, AP50, AP75, APs, APm, APl, APr, APc, APf, AR@300, ARs@300, ARm@300, ARl@300.

ev.run()
results = ev.get_results()
print(f"AP: {results['AP']:.3f}, AP50: {results['AP50']:.3f}")

print_results

print_results() -> None

Print a formatted results table to stdout. For LVIS, matches the lvis-api print_results() style. Must be called after summarize() (or run()).
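
Typical usage:

ev.run()
ev.print_results()  # formatted metrics table on stdout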


confusion_matrix

confusion_matrix(
    iou_thr: float = 0.5,
    max_det: int | None = None,
    min_score: float | None = None,
) -> dict

Compute a per-category confusion matrix. Unlike evaluate(), this method compares all detections in an image against all ground truth boxes regardless of category, enabling cross-category confusion analysis.

This method is standalone — no evaluate() call is needed first.

Parameters:

Parameter Type Default Description
iou_thr float 0.5 IoU threshold for a DT↔GT match
max_det int | None last params.max_dets value Max detections per image by score
min_score float | None None Discard detections below this confidence before max_det truncation

Returns a dict with:

Key Type Description
"matrix" np.ndarray[int64] shape (K+1, K+1) Raw confusion counts. Rows = GT category, cols = predicted. Index K is background.
"normalized" np.ndarray[float64] shape (K+1, K+1) Row-normalised version (rows sum to 1.0; zero rows stay zero).
"cat_ids" list[int] Category IDs for rows/cols 0..K-1.
"num_cats" int Number of categories K.
"iou_thr" float IoU threshold used.

Matrix layout (rows = GT, cols = predicted):

  • matrix[i][j] where i ≠ K, j ≠ K — GT category i matched to predicted category j. On-diagonal = TP; off-diagonal = class confusion.
  • matrix[i][K] — GT category i unmatched (false negative).
  • matrix[K][j] — Predicted category j unmatched (false positive).
ev = COCOeval(coco_gt, coco_dt, "bbox")
cm = ev.confusion_matrix(iou_thr=0.5, max_det=100)

matrix = cm["matrix"]
cat_ids = cm["cat_ids"]

# True positives per category
tp = matrix.diagonal()[:-1]

# False negatives per category
fn = matrix[:-1, -1]

# False positives per category
fp = matrix[-1, :-1]

# Normalised view
print(cm["normalized"])

See Confusion Matrix in the evaluation guide for a full walkthrough.


tide_errors

tide_errors(
    pos_thr: float = 0.5,
    bg_thr: float = 0.1,
) -> dict

Decompose detection errors into six TIDE error types (Bolya et al., ECCV 2020) and compute ΔAP — the AP gain from eliminating each error type.

Requires evaluate() to have been called first.

Parameters:

Parameter Type Default Description
pos_thr float 0.5 IoU threshold for TP/FP classification
bg_thr float 0.1 Background IoU threshold for Loc/Both/Bkg discrimination

Returns a dict with:

Key Type Description
"delta_ap" dict[str, float] ΔAP for each error type. Keys: "Cls", "Loc", "Both", "Dupe", "Bkg", "Miss", "FP", "FN".
"counts" dict[str, int] Count of each error type. Keys: "Cls", "Loc", "Both", "Dupe", "Bkg", "Miss".
"ap_base" float Baseline mean AP at pos_thr.
"pos_thr" float IoU threshold used.
"bg_thr" float Background threshold used.
ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.evaluate()
result = ev.tide_errors(pos_thr=0.5, bg_thr=0.1)

print(f"ap_base: {result['ap_base']:.3f}")
for k, v in sorted(result["delta_ap"].items(), key=lambda x: -x[1]):
    if k not in ("FP", "FN"):
        print(f"  {k}: ΔAP={v:.4f}  n={result['counts'].get(k, '—')}")

See TIDE Error Analysis in the evaluation guide for a detailed walkthrough.