Skip to content

Dataset Operations

hotcoco can do more than evaluate — it can reshape your datasets before evaluation starts.

All operations return a new COCO object and leave the original unchanged. They compose naturally, so you can chain filter → split → sample in a single expression.

Tip

Run coco.stats() first to understand your dataset before reshaping it. See the stats API reference for the full return structure.


filter

Subset a dataset to a specific set of categories, images, or annotation sizes. Returns a new COCO with matching annotations and — by default — only the images that have at least one match.

from hotcoco import COCO

coco = COCO("instances_val2017.json")

# Keep only "person" annotations
person_id = coco.get_cat_ids(cat_nms=["person"])[0]
people = coco.filter(cat_ids=[person_id])

print(len(people.dataset["images"]))       # 2693
print(len(people.dataset["annotations"])) # 10777

Pass drop_empty_images=False to keep all images even if they have no matching annotations — useful when you need consistent image IDs across filtered splits.

# Same annotations, all 5000 images preserved
people_all_imgs = coco.filter(cat_ids=[person_id], drop_empty_images=False)

Filter by annotation area to focus on a size range:

# Medium-to-large objects only (32² – 96² px²)
medium = coco.filter(area_rng=[1024.0, 9216.0])

Filters compose — all criteria are ANDed:

medium_people = coco.filter(cat_ids=[person_id], area_rng=[1024.0, 9216.0])

split

Split a dataset into train/val (or train/val/test) subsets. Images are shuffled deterministically and partitioned by fraction. Annotations follow their images; all splits share the full category list.

# 80/20 train/val split
train, val = coco.split(val_frac=0.2, seed=42)

print(len(train.dataset["images"]))  # 4000
print(len(val.dataset["images"]))    # 1000

Add a test set with a second fraction:

train, val, test = coco.split(val_frac=0.15, test_frac=0.15, seed=42)
# train ~70%, val ~15%, test ~15%

The same seed always produces the same split — important for reproducibility across experiments:

# These are identical
train_a, val_a = coco.split(val_frac=0.2, seed=42)
train_b, val_b = coco.split(val_frac=0.2, seed=42)

A typical eval workflow — filter first, then split:

people = coco.filter(cat_ids=[person_id])
train, val = people.split(val_frac=0.2, seed=42)

sample

Draw a random subset of images (with their annotations). Useful for quick iteration during development without running full-dataset evaluation.

# Sample 500 images
subset = coco.sample(n=500, seed=0)

# Or by fraction
subset = coco.sample(frac=0.1, seed=0)

Like split, the sample is deterministic for the same seed:

# Always the same 500 images
a = coco.sample(n=500, seed=0)
b = coco.sample(n=500, seed=0)

merge

Combine multiple annotation files into one. Common when annotations arrive in separate batches or from separate labeling jobs.

All datasets must share the same category taxonomy (same names and supercategories). Image and annotation IDs are remapped automatically to be globally unique.

batch1 = COCO("batch1.json")
batch2 = COCO("batch2.json")

combined = COCO.merge([batch1, batch2])

print(len(combined.dataset["images"]))
# len(batch1.images) + len(batch2.images)

Merging a dataset with itself doubles the image and annotation count — useful for stress-testing:

doubled = COCO.merge([coco, coco])

merge raises ValueError if the datasets have different category sets:

# Raises ValueError: category 'horse' not found in first dataset
COCO.merge([coco_animals, coco_vehicles])

save

Write any COCO object back to a JSON file. The output format is standard COCO JSON, readable by any tool that accepts COCO annotations.

merged = COCO.merge([batch1, batch2])
merged.save("combined.json")

save works at any point in a pipeline:

coco.filter(cat_ids=[person_id]).sample(n=1000, seed=0).save("person_sample.json")

CLI

All four operations are available as coco subcommands — no Python required beyond the initial install. See the CLI reference for full flag documentation.

coco filter instances_val2017.json --cat-ids 1 -o person.json
coco split  person.json --val-frac 0.2 -o splits/person
coco sample person.json --n 500 --seed 0 -o person_sample.json
coco merge  batch1.json batch2.json -o combined.json