Annotation Schema
Schema version: 2026.04
The EdgeFirst annotation schema uses a flat, columnar layout: one row per annotation instance. All columns are nullable unless noted otherwise. Optional columns may be absent entirely from a file.
Column Reference
Identity & Classification
| Column | Type | Description |
|---|---|---|
| name | String | Sample identifier (derived from filename) |
| frame | UInt32 | Sequence frame number (null for standalone images) |
| object_id | String | Instance tracking UUID |
| label | Categorical | Class label (JSON field: label_name) |
| label_index | UInt64 | Source-faithful numeric class index (see label_index details) |
| group | Categorical | Dataset split — train, val, test (JSON field: group_name) |
Geometry: Polygon
| Column | Type | Description |
|---|---|---|
| polygon | List&lt;List&lt;f32&gt;&gt; | Interleaved [x1, y1, x2, y2, ...] coordinate pairs per ring |
| polygon_score | Float32 | Confidence score (0..1), nullable, optional |
New in 2026.04
The polygon column replaces the 2025.10 mask: List<Float32> column that stored
NaN-separated polygon coordinates. See Migration Guide for details.
Outer list: Multiple polygon rings per instance (disjoint parts, holes).
Inner list: Interleaved [x1, y1, x2, y2, ...] pairs for one ring. Coordinates
are always normalized (0..1) relative to the full image. Multiply by image dimensions
to get pixel coordinates.
Coordinate space: Polygon coordinates are always image-space normalized, regardless of
whether box2d coexists on the same row. The box provides object location; the polygon
provides the precise boundary in full-image coordinates.
Validity rules:
- Inner lists must have an even number of values (coordinate pairs)
- Minimum 6 values (3 points) per valid ring
- Odd-length inner lists are invalid — writers reject, readers drop with a warning
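These rules are straightforward to apply reader-side. A minimal sketch (the `rings_to_pixels` helper is hypothetical, not part of any EdgeFirst API) that drops invalid rings and denormalizes the valid ones to pixel coordinates:

```python
def rings_to_pixels(polygon, width, height):
    """Convert one instance's normalized rings to pixel-space point lists.

    `polygon` is the List<List<f32>> cell value: one inner list of
    interleaved [x1, y1, x2, y2, ...] floats per ring. Rings with an
    odd value count or fewer than 3 points are dropped, mirroring the
    reader behavior described above (a real reader would also warn).
    """
    out = []
    for ring in polygon:
        if len(ring) % 2 != 0 or len(ring) < 6:
            continue  # invalid ring: drop
        # De-interleave and scale by image dimensions
        points = [(x * width, y * height)
                  for x, y in zip(ring[0::2], ring[1::2])]
        out.append(points)
    return out

# Two rings: a valid triangle and an invalid 2-point fragment
pixels = rings_to_pixels(
    [[0.1, 0.1, 0.5, 0.1, 0.5, 0.5], [0.2, 0.2, 0.3, 0.3]], 640, 480)
```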
Geometry: Raster Mask
| Column | Type | Description |
|---|---|---|
| mask | Binary | PNG-encoded grayscale raster pixels |
| mask_score | Float32 | Per-instance confidence (0..1), nullable, optional |
Type changed in 2026.04
The mask column changed from List<Float32> (NaN-separated polygons in 2025.10)
to Binary (PNG-encoded raster pixels in 2026.04). Code that assumes Float32 will
fail on 2026.04 files.
Encoding: Masks are stored as single-channel (grayscale) PNG images within the
Binary column. The PNG format provides:
- Self-describing dimensions — width and height in the PNG header (first 24 bytes), readable without full decode
- Lossless compression — typically 2–10× smaller than raw pixel arrays
- Variable bit depth — 1-bit for binary masks, 8-bit for confidence/sigmoid/logits, 16-bit for high-precision outputs
| Source | PNG bit depth | Pixel values | Use case |
|---|---|---|---|
| Binary mask (any source) | 1-bit (preferred) | 0/1 | Ground truth, thresholded output, COCO RLE import |
| Sigmoid scores | 8-bit | 0–255 (quantized) | Model confidence per-pixel |
| High-precision scores | 16-bit | 0–65535 | When 8-bit quantization is insufficient |
1-bit is the preferred encoding for all binary masks, regardless of source (COCO RLE, thresholded model output, ground truth annotations). Alternatives like 8-bit with 0/255 are valid but wasteful and ambiguous — a reader cannot distinguish "binary mask stored as 8-bit" from "8-bit score data." 1-bit encoding is self-documenting: if the PNG is 1-bit, the mask is binary.
Dimensions: Mask dimensions are defined by the PNG image itself, not by the size
column or box2d. The producer determines the resolution — it could be the original
image size, model input size, or model output size. Consumers read the PNG header to
discover the mask dimensions and rescale to the target coordinate space as needed.
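Because width and height sit at fixed offsets in the IHDR chunk, a consumer can recover mask dimensions without a PNG decoder. A minimal sketch (helper name hypothetical):

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_dimensions(png_bytes: bytes) -> tuple[int, int]:
    """Read (width, height) from the first 24 bytes of a PNG without decoding.

    Layout: 8-byte signature, 4-byte IHDR chunk length, 4-byte b"IHDR"
    tag, then big-endian u32 width and height at offsets 16 and 20.
    """
    if png_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG")
    width, height = struct.unpack(">II", png_bytes[16:24])
    return width, height
```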
Coverage: The mask covers the full image, not a crop of the bounding box. For instance segmentation, most pixels are 0 (background) and the object region has confidence scores or binary 1 values. This avoids lossy cropping and handles interpolation that extends beyond box bounds.
Interpretation by context:
| Context | Pixel values | Label source |
|---|---|---|
| mask + box2d (instance seg) | Sigmoid confidence (0–255) or binary (0/1) for a single instance | label column on the row |
| mask without box2d (semantic seg) | Argmax class indices | Optional file-level labels metadata; index ordering is model-specific |
Interpretation: Controlled by mask_interpretation file-level metadata. The PNG
bit depth determines the value range:
| Value | Description |
|---|---|
| binary | 0/1 values — use 1-bit PNG (default) |
| confidence | Quantized confidence scores — use 8-bit (0–255) or 16-bit (0–65535) PNG |
| sigmoid | Quantized sigmoid outputs — use 8-bit or 16-bit PNG |
| logits | Quantized logit outputs — use 8-bit or 16-bit PNG |
JSON representation: base64-encoded PNG bytes.
Relationship to polygon: polygon and mask can coexist in the same file
(e.g., panoptic segmentation). Typically a dataset uses one or the other. Both use
full-image coordinates — polygons are normalized (0..1), masks cover the full image.
Geometry: 2D Bounding Box
| Column | Type | Description |
|---|---|---|
| box2d | Array&lt;f32, 4&gt; | Layout described by box2d_format metadata |
| box2d_score | Float32 | Confidence score (0..1), nullable, optional |
The array element order depends on the box2d_format file metadata. Default is
[center_x, center_y, width, height] (cxcywh). See Box Formats
for all layouts.
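As a sketch of handling the default layout, a hypothetical converter from cxcywh to corner coordinates (pass the image size to denormalize when box2d_normalized is "true", the default; leave it at 1.0 to stay in normalized space):

```python
def cxcywh_to_xyxy(box, width=1.0, height=1.0):
    """Convert a default-layout box2d value to [x_min, y_min, x_max, y_max].

    `box` is the Array<f32, 4> cell value in [center_x, center_y,
    width, height] order. Scaling by the image dimensions converts
    normalized coordinates to pixels.
    """
    cx, cy, w, h = box
    return [(cx - w / 2) * width, (cy - h / 2) * height,
            (cx + w / 2) * width, (cy + h / 2) * height]
```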
Geometry: 3D Bounding Box
| Column | Type | Description |
|---|---|---|
| box3d | Array&lt;f32, 6&gt; | Layout described by box3d_format metadata |
| box3d_score | Float32 | Confidence score (0..1), nullable, optional |
Default layout: [center_x, center_y, center_z, width, height, length] (cxcyczwhl).
- Width (w) = X-axis extent
- Height (h) = Y-axis extent
- Length (l) = Z-axis extent
All coordinates represent the geometric center of the bounding box. See Box Formats for details.
Annotation Metadata
| Column | Type | Description |
|---|---|---|
| iscrowd | Boolean | true = crowd region, false or absent = single instance. Optional, from COCO. |
| category_frequency | Categorical | Long-tail frequency group: "f", "c", or "r". Optional. |
New in 2026.04
These columns support COCO/LVIS dataset extensions. They are optional and will be absent from files that don't originate from COCO-family datasets.
iscrowd: COCO crowd annotations mark regions containing multiple overlapping
instances. Evaluation protocols treat them differently (matched but not penalized as
false negatives). LVIS does not use crowd annotations — this column will be absent or
null for LVIS-sourced data.
category_frequency: LVIS assigns each category to a frequency group based on how
many training images contain it:
- "f" (frequent) — appears in >100 images
- "c" (common) — appears in 11–100 images
- "r" (rare) — appears in 1–10 images
This enables disaggregated AP metrics (AP_r, AP_c, AP_f) and long-tail distribution analysis. Use with Polars:
import polars as pl

# Count annotations by frequency group
df.group_by("category_frequency").len()

# Filter to rare categories only
rare = df.filter(pl.col("category_frequency") == "r")
Sample Metadata
| Column | Type | Description |
|---|---|---|
| size | Array&lt;u32, 2&gt; | [width, height] — original image dimensions in pixels. Optional. |
| location | Array&lt;f32, 2&gt; | [latitude, longitude] GPS coordinates |
| pose | Array&lt;f32, 3&gt; | [yaw, pitch, roll] IMU orientation in degrees |
| degradation | String | Visual quality indicator (none, low, medium, high) |
| neg_label_indices | List&lt;UInt32&gt; | label_index values for categories verified absent from this image |
| not_exhaustive_label_indices | List&lt;UInt32&gt; | label_index values for categories with possibly incomplete annotation |
neg_label_indices and not_exhaustive_label_indices come from LVIS's federated
annotation protocol. They enable correct evaluation by indicating which categories have
known annotation status in each image:
- neg_label_indices: Categories confirmed as not present. A model prediction for one of these categories is a valid false positive.
- not_exhaustive_label_indices: Categories where annotation may be incomplete. Unmatched predictions for these categories are ignored during evaluation (not penalized).
Both columns reference label_index values from the same file. They are sample-level
fields (repeated per annotation row for a given image).
UInt32 vs UInt64
These lists use UInt32 elements while label_index is UInt64. This is safe for
COCO (max ID ~90) and LVIS (max ID ~1723). If you cross-join these columns in Polars,
cast to a common type first: pl.col("neg_label_indices").cast(pl.List(pl.UInt64)).
Pose array order
The pose array is always [yaw, pitch, roll] in degrees. The JSON representation
uses named fields {yaw, pitch, roll} in the sensors.imu object.
Instrumentation
| Column | Type | Description |
|---|---|---|
| timing | Struct | Pipeline timing data (optional) |
The timing struct contains Int64 nanosecond duration fields:
| Field | Description |
|---|---|
| load | Time to load input data |
| preprocess | Time for preprocessing transforms |
| inference | Model inference time |
| decode | Time for postprocessing / decoding outputs |
Example:
timing: {load: 1500000, preprocess: 3200000, inference: 12500000, decode: 800000}
# = 1.5 ms load, 3.2 ms preprocess, 12.5 ms inference, 0.8 ms decode
Fields are extensible — future fields do not break older readers since Struct access is by name.
Score Columns
box2d_score, box3d_score, polygon_score, and mask_score are independent
per-geometry confidence values in the range 0..1.
- A single row may have different scores for different geometry types (e.g., high box confidence but lower polygon confidence)
- Raster masks additionally carry per-pixel scores via mask_interpretation metadata; mask_score is the per-instance aggregate
- Ground truth files: score columns should be omitted entirely (not filled with nulls). Readers must treat absent score columns as "not applicable."
Complete Polars Schema
For reference, the full Polars-style schema:
(
# ── Identity & Classification ──────────────────────
('name', String),
('frame', UInt32),
('object_id', String),
('label', Categorical(ordering='physical')),
('label_index', UInt64), # source-faithful, may be non-contiguous
('group', Categorical(ordering='physical')),
# ── Geometry: Polygon ──────────────────────────────
('polygon', List(List(Float32))), # interleaved [x1,y1,x2,y2,...] per ring
('polygon_score', Float32), # OPTIONAL
# ── Geometry: Raster Mask ──────────────────────────
('mask', Binary), # PNG-encoded grayscale raster pixels
('mask_score', Float32), # OPTIONAL
# ── Geometry: 2D Bounding Box ──────────────────────
('box2d', Array(Float32, shape=(4,))), # layout from metadata
('box2d_score', Float32), # OPTIONAL
# ── Geometry: 3D Bounding Box ──────────────────────
('box3d', Array(Float32, shape=(6,))), # [cx, cy, cz, w, h, l]
('box3d_score', Float32), # OPTIONAL
# ── Annotation Metadata (optional) ─────────────────
('iscrowd', Boolean), # OPTIONAL - true = crowd region, false or absent
('category_frequency', Categorical(ordering='physical')), # OPTIONAL - LVIS "f"/"c"/"r"
# ── Sample Metadata (optional) ─────────────────────
('size', Array(UInt32, shape=(2,))), # [width, height]
('location', Array(Float32, shape=(2,))), # [lat, lon]
('pose', Array(Float32, shape=(3,))), # [yaw, pitch, roll]
('degradation', String),
('neg_label_indices', List(UInt32)), # OPTIONAL - LVIS negative categories
('not_exhaustive_label_indices', List(UInt32)), # OPTIONAL - LVIS incomplete categories
# ── Instrumentation (optional) ─────────────────────
('timing', Struct({
'load': Int64,
'preprocess': Int64,
'inference': Int64,
'decode': Int64,
})),
)
File-Level Metadata
Arrow IPC stores key-value metadata on the schema, while Parquet stores key-value metadata in the file footer. In both formats, all metadata values are strings.
| Key | Values | Default (absent) | Description |
|---|---|---|---|
| schema_version | "2026.04" | "2025.10" | Format version. Absent = legacy file. |
| box2d_format | "cxcywh", "xyxy", "ltwh" | "cxcywh" | Box2D layout descriptor |
| box2d_normalized | "true", "false" | "true" | Box2D coordinate system |
| box3d_format | "cxcyczwhl" | "cxcyczwhl" | Box3D layout descriptor |
| box3d_normalized | "true", "false" | "true" | Box3D coordinate system |
| mask_interpretation | "binary", "confidence", "sigmoid", "logits" | "binary" | Pixel value meaning |
| category_metadata | JSON string | absent | Per-label metadata (synset, synonyms, definition) |
| labels | JSON array ["person", "car", ...] | absent | Ordered class names for semantic segmentation masks. labels[i] = class name for argmax pixel value i. |
Version format is YYYY.MM with mandatory zero-padding (e.g., "2025.10", "2026.04").
Versions are compared lexicographically. Unknown future versions should trigger a warning
(not an error) and attempt best-effort reading via schema introspection.
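A reader-side sketch of this policy (function name hypothetical); because the YYYY.MM strings are zero-padded, plain string comparison orders them correctly with no numeric parsing:

```python
SUPPORTED = "2026.04"

def resolve_schema_version(metadata: dict) -> str:
    """Resolve the schema version from file-level metadata.

    An absent schema_version key means a legacy 2025.10 file. Unknown
    future versions warn and fall through to best-effort reading.
    """
    version = metadata.get("schema_version", "2025.10")
    if version > SUPPORTED:  # lexicographic compare is safe for YYYY.MM
        print(f"warning: unknown schema version {version}; "
              "attempting best-effort read via schema introspection")
    return version
```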
Category Metadata
The category_metadata key stores per-label reference data as a JSON-encoded string.
This enriches label semantics without adding per-row columns for data that is constant
across all annotations sharing the same label.
{
"aerosol_can": {
"id": 1,
"supercategory": "accessory",
"synset": "aerosol.n.02",
"synonyms": ["aerosol_can", "spray_can"],
"definition": "a dispenser that holds a substance under pressure"
},
"person": {
    "id": 2,
"supercategory": "human",
"synset": "person.n.01",
"synonyms": ["person", "individual"],
"definition": "a human being"
}
}
| Field | Type | Description |
|---|---|---|
| id | integer | Source category ID (used to reconstruct category_id for categories with no annotations) |
| supercategory | string | Parent category name (e.g., "vehicle", "animal") |
| synset | string | WordNet synset identifier (e.g., "aerosol.n.02") |
| synonyms | array of strings | Alternate names for the category |
| definition | string | Natural language definition (LVIS def field, renamed for clarity) |
Source: When importing from COCO with LVIS extensions, these fields are populated
from the LVIS categories array. Other datasets with taxonomic metadata can populate
the same fields.
Frequency is a column, not metadata
The frequency field from LVIS is stored as the category_frequency column
(not in category_metadata) because it is directly useful for DataFrame filtering
and disaggregated metrics. The image_count and instance_count fields from LVIS
are intentionally not stored — they are recomputable statistics.
Labels Metadata
New in 2026.04
The labels key stores an ordered array of class names as a JSON-encoded string. This
metadata provides the index-to-name mapping for semantic segmentation masks where each
pixel value is an argmax class index.
Structure: JSON array where labels[i] is the class name for pixel value i:
["background", "person", "car", "bicycle", "dog"]
In this example, pixel value 0 = "background", pixel value 1 = "person",
pixel value 2 = "car", etc.
When written: Optional — only written when the source dataset provides an ordered category list (e.g., COCO categories sorted by ID).
Relationship to category_metadata: The labels array provides index ordering for
mask pixel interpretation. The category_metadata object provides rich per-label
reference data (synset, synonyms, definition). Both may be present; they complement
each other.
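A minimal sketch of using the labels metadata to name argmax pixel values (the metadata dict below is invented for illustration; in practice it comes from the Arrow schema or Parquet footer):

```python
import json

# File-level metadata as read from the schema (all values are strings)
file_metadata = {"labels": '["background", "person", "car", "bicycle", "dog"]'}

# Decode the JSON array once; labels[i] names argmax pixel value i
labels = json.loads(file_metadata["labels"])

def class_name(pixel_value: int) -> str:
    """Map a semantic-mask argmax pixel value to its class name."""
    return labels[pixel_value]
```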
label_index
The label_index column stores the source-faithful category identifier. When
importing from COCO or LVIS, the original category_id is preserved directly as
label_index.
Key characteristics:
- May be non-contiguous — COCO uses IDs 1–90 for 80 categories (gaps at 12, 26, 29, 30, etc.). LVIS uses IDs up to ~1723 for 1,203 categories.
- May not start at zero — COCO starts at 1, not 0.
- Must be preserved on round-trip — exporting back to COCO/LVIS reconstructs the original category_id from label_index.
Model training note
Models are typically trained with a dense remapping (e.g., 80 contiguous class indices for COCO). This remapping is a model-specific concern handled in the training pipeline, not in the dataset format. Some legacy models (notably older SSDs) are trained with the original gaps and produce 91 outputs (90 categories + background); this is likewise a model-specific convention.
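To illustrate the split between the dataset format and the training pipeline, a sketch of a dense remapping built from source-faithful indices (the excerpted ID list is illustrative, showing COCO's gap at 12):

```python
# Source-faithful label_index values with gaps (COCO excerpt; 12 is absent)
source_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13]

# Training-side dense remapping: contiguous 0..N-1 model class indices.
# This lives in the training pipeline, not in the dataset format.
to_dense = {src: i for i, src in enumerate(sorted(source_ids))}

# Inverse mapping for export: model index back to original category_id
to_source = {dense: src for src, dense in to_dense.items()}
```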