Annotation Schema

Schema version: 2026.04

The EdgeFirst annotation schema uses a flat, columnar layout: one row per annotation instance. All columns are nullable unless noted otherwise. Optional columns may be absent entirely from a file.

Column Reference

Identity & Classification

| Column | Type | Description |
|---|---|---|
| name | String | Sample identifier (derived from filename) |
| frame | UInt32 | Sequence frame number (null for standalone images) |
| object_id | String | Instance tracking UUID |
| label | Categorical | Class label (JSON field: label_name) |
| label_index | UInt64 | Source-faithful numeric class index (see label_index details) |
| group | Categorical | Dataset split — train, val, test (JSON field: group_name) |

Geometry: Polygon

| Column | Type | Description |
|---|---|---|
| polygon | List<List<f32>> | Interleaved [x1, y1, x2, y2, ...] coordinate pairs per ring |
| polygon_score | Float32 | Confidence score (0..1), nullable, optional |

New in 2026.04

The polygon column replaces the 2025.10 mask: List<Float32> column that stored NaN-separated polygon coordinates. See Migration Guide for details.

Outer list: Multiple polygon rings per instance (disjoint parts, holes).

Inner list: Interleaved [x1, y1, x2, y2, ...] pairs for one ring. Coordinates are always normalized (0..1) relative to the full image. Multiply by image dimensions to get pixel coordinates.

Coordinate space: Polygon coordinates are always image-space normalized, regardless of whether box2d coexists on the same row. The box provides object location; the polygon provides the precise boundary in full-image coordinates.

Validity rules:

  • Inner lists must have an even number of values (coordinate pairs)
  • Minimum 6 values (3 points) per valid ring
  • Odd-length inner lists are invalid — writers reject, readers drop with a warning
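The validity rules and the normalized-to-pixel conversion can be sketched as follows. This is a minimal illustration, not part of the schema; the function names are hypothetical.

```python
def validate_ring(ring):
    """Check one inner list against the schema's ring validity rules:
    even length (interleaved x/y pairs) and at least 3 points (6 values)."""
    return len(ring) % 2 == 0 and len(ring) >= 6

def ring_to_pixels(ring, width, height):
    """Convert a normalized [x1, y1, x2, y2, ...] ring to pixel (x, y) tuples."""
    if not validate_ring(ring):
        raise ValueError("invalid ring: odd length or fewer than 3 points")
    return [(ring[i] * width, ring[i + 1] * height) for i in range(0, len(ring), 2)]
```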

Geometry: Raster Mask

| Column | Type | Description |
|---|---|---|
| mask | Binary | PNG-encoded grayscale raster pixels |
| mask_score | Float32 | Per-instance confidence (0..1), nullable, optional |

Type changed in 2026.04

The mask column changed from List<Float32> (NaN-separated polygons in 2025.10) to Binary (PNG-encoded raster pixels in 2026.04). Code that assumes Float32 will fail on 2026.04 files.

Encoding: Masks are stored as single-channel (grayscale) PNG images within the Binary column. The PNG format provides:

  • Self-describing dimensions — width and height in the PNG header (first 24 bytes), readable without full decode
  • Lossless compression — typically 2–10× smaller than raw pixel arrays
  • Variable bit depth — 1-bit for binary masks, 8-bit for confidence/sigmoid/logits, 16-bit for high-precision outputs
| Source | PNG bit depth | Pixel values | Use case |
|---|---|---|---|
| Binary mask (any source) | 1-bit (preferred) | 0/1 | Ground truth, thresholded output, COCO RLE import |
| Sigmoid scores | 8-bit | 0–255 (quantized) | Per-pixel model confidence |
| High-precision scores | 16-bit | 0–65535 | When 8-bit quantization is insufficient |

1-bit is the preferred encoding for all binary masks, regardless of source (COCO RLE, thresholded model output, ground truth annotations). Alternatives like 8-bit with 0/255 are valid but wasteful and ambiguous — a reader cannot distinguish "binary mask stored as 8-bit" from "8-bit score data." 1-bit encoding is self-documenting: if the PNG is 1-bit, the mask is binary.

Dimensions: Mask dimensions are defined by the PNG image itself, not by the size column or box2d. The producer determines the resolution — it could be the original image size, model input size, or model output size. Consumers read the PNG header to discover the mask dimensions and rescale to the target coordinate space as needed.
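Reading the dimensions from the PNG header can be done with the standard library alone, since width, height, and bit depth sit at fixed offsets in the IHDR chunk. A minimal sketch (the function name is hypothetical; it assumes well-formed PNG bytes and does no CRC validation):

```python
import struct

def mask_header(png_bytes):
    """Read width, height, and bit depth from a PNG's IHDR chunk
    without decoding any pixel data."""
    if png_bytes[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG")
    # IHDR payload starts at byte 16: width and height as big-endian u32,
    # followed by a single bit-depth byte at offset 24.
    width, height = struct.unpack(">II", png_bytes[16:24])
    bit_depth = png_bytes[24]
    return width, height, bit_depth
```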

Coverage: The mask covers the full image, not a crop of the bounding box. For instance segmentation, most pixels are 0 (background) and the object region has confidence scores or binary 1 values. This avoids lossy cropping and handles interpolation that extends beyond box bounds.

Interpretation by context:

| Context | Pixel values | Label source |
|---|---|---|
| mask + box2d (instance seg) | Sigmoid confidence (0–255) or binary (0/1) for a single instance | label column on the row |
| mask without box2d (semantic seg) | Argmax class indices | Optional file-level labels metadata; index ordering is model-specific |

Interpretation: Controlled by mask_interpretation file-level metadata. The PNG bit depth determines the value range:

| Value | Description |
|---|---|
| binary | 0/1 values — use 1-bit PNG (default) |
| confidence | Quantized confidence scores — use 8-bit (0–255) or 16-bit (0–65535) PNG |
| sigmoid | Quantized sigmoid outputs — use 8-bit or 16-bit PNG |
| logits | Quantized logit outputs — use 8-bit or 16-bit PNG |

JSON representation: base64-encoded PNG bytes.

Relationship to polygon: polygon and mask can coexist in the same file (e.g., panoptic segmentation). Typically a dataset uses one or the other. Both use full-image coordinates — polygons are normalized (0..1), masks cover the full image.

Geometry: 2D Bounding Box

| Column | Type | Description |
|---|---|---|
| box2d | Array<f32, 4> | Layout described by box2d_format metadata |
| box2d_score | Float32 | Confidence score (0..1), nullable, optional |

The array element order depends on the box2d_format file metadata. Default is [center_x, center_y, width, height] (cxcywh). See Box Formats for all layouts.
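Converting from the default layout to corner form is a two-line affair. A sketch with a hypothetical function name, assuming the default cxcywh layout:

```python
def cxcywh_to_xyxy(box):
    """Convert the default [center_x, center_y, width, height] layout
    to [x_min, y_min, x_max, y_max] corner coordinates."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]
```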

Geometry: 3D Bounding Box

| Column | Type | Description |
|---|---|---|
| box3d | Array<f32, 6> | Layout described by box3d_format metadata |
| box3d_score | Float32 | Confidence score (0..1), nullable, optional |

Default layout: [center_x, center_y, center_z, width, height, length] (cxcyczwhl).

  • Width (w) = X-axis extent
  • Height (h) = Y-axis extent
  • Length (l) = Z-axis extent

All coordinates represent the geometric center of the bounding box. See Box Formats for details.
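The axis-extent convention above maps directly to min/max corner points. A small illustrative helper (hypothetical name, default cxcyczwhl layout assumed):

```python
def cxcyczwhl_to_bounds(box):
    """Convert [cx, cy, cz, w, h, l] to axis-aligned (min, max) corners:
    w spans X, h spans Y, l spans Z, all centered on (cx, cy, cz)."""
    cx, cy, cz, w, h, l = box
    mins = (cx - w / 2, cy - h / 2, cz - l / 2)
    maxs = (cx + w / 2, cy + h / 2, cz + l / 2)
    return mins, maxs
```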

Annotation Metadata

| Column | Type | Description |
|---|---|---|
| iscrowd | Boolean | true = crowd region, false or absent = single instance. Optional, from COCO. |
| category_frequency | Categorical | Long-tail frequency group: "f", "c", or "r". Optional. |

New in 2026.04

These columns support COCO/LVIS dataset extensions. They are optional and will be absent from files that don't originate from COCO-family datasets.

iscrowd: COCO crowd annotations mark regions containing multiple overlapping instances. Evaluation protocols treat them differently (matched but not penalized as false negatives). LVIS does not use crowd annotations — this column will be absent or null for LVIS-sourced data.

category_frequency: LVIS assigns each category to a frequency group based on how many training images contain it:

  • "f" (frequent) — appears in >100 images
  • "c" (common) — appears in 11–100 images
  • "r" (rare) — appears in 1–10 images

This enables disaggregated AP metrics (AP_r, AP_c, AP_f) and long-tail distribution analysis. Use with Polars:

import polars as pl

# Count annotations by frequency group
df.group_by("category_frequency").len()

# Filter to rare categories only
rare = df.filter(pl.col("category_frequency") == "r")

Sample Metadata

| Column | Type | Description |
|---|---|---|
| size | Array<u32, 2> | [width, height] — original image dimensions in pixels. Optional. |
| location | Array<f32, 2> | [latitude, longitude] GPS coordinates |
| pose | Array<f32, 3> | [yaw, pitch, roll] IMU orientation in degrees |
| degradation | String | Visual quality indicator (none, low, medium, high) |
| neg_label_indices | List<UInt32> | label_index values for categories verified absent from this image |
| not_exhaustive_label_indices | List<UInt32> | label_index values for categories with possibly incomplete annotation |

neg_label_indices and not_exhaustive_label_indices come from LVIS's federated annotation protocol. They enable correct evaluation by indicating which categories have known annotation status in each image:

  • neg_label_indices: Categories confirmed as not present. A model prediction for one of these categories is a valid false positive.
  • not_exhaustive_label_indices: Categories where annotation may be incomplete. Unmatched predictions for these categories are ignored during evaluation (not penalized).

Both columns reference label_index values from the same file. They are sample-level fields (repeated per annotation row for a given image).

UInt32 vs UInt64

These lists use UInt32 elements while label_index is UInt64. This is safe for COCO (max ID ~90) and LVIS (max ID ~1723). If you join or compare these columns against label_index in Polars, cast to a common type first: pl.col("neg_label_indices").cast(pl.List(pl.UInt64)).

Pose array order

The pose array is always [yaw, pitch, roll] in degrees. The JSON representation uses named fields {yaw, pitch, roll} in the sensors.imu object.

Instrumentation

| Column | Type | Description |
|---|---|---|
| timing | Struct | Pipeline timing data (optional) |

The timing struct contains Int64 nanosecond duration fields:

| Field | Description |
|---|---|
| load | Time to load input data |
| preprocess | Time for preprocessing transforms |
| inference | Model inference time |
| decode | Time for postprocessing / decoding outputs |

Example:

timing: {load: 1500000, preprocess: 3200000, inference: 12500000, decode: 800000}
# = 1.5 ms load, 3.2 ms preprocess, 12.5 ms inference, 0.8 ms decode

Fields are extensible — future fields do not break older readers since Struct access is by name.
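The nanosecond-to-millisecond conversion in the example above can be sketched generically; because field access is by name, the same helper works unchanged when future fields are added (hypothetical function, assuming the struct is read as a Python dict):

```python
def timing_ms(timing):
    """Convert a timing struct of Int64 nanosecond durations to
    milliseconds, skipping any null fields."""
    return {field: ns / 1e6 for field, ns in timing.items() if ns is not None}
```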

Score Columns

box2d_score, box3d_score, polygon_score, and mask_score are independent per-geometry confidence values in the range 0..1.

  • A single row may have different scores for different geometry types (e.g., high box confidence but lower polygon confidence)
  • Raster masks additionally carry per-pixel scores via mask_interpretation metadata; mask_score is the per-instance aggregate
  • Ground truth files: score columns should be omitted entirely (not filled with nulls). Readers must treat absent score columns as "not applicable."

Complete Polars Schema

For reference, the full Polars-style schema:

(
    # ── Identity & Classification ──────────────────────
    ('name', String),
    ('frame', UInt32),
    ('object_id', String),
    ('label', Categorical(ordering='physical')),
    ('label_index', UInt64),                    # source-faithful, may be non-contiguous
    ('group', Categorical(ordering='physical')),

    # ── Geometry: Polygon ──────────────────────────────
    ('polygon', List(List(Float32))),           # interleaved [x1,y1,x2,y2,...] per ring
    ('polygon_score', Float32),                 # OPTIONAL

    # ── Geometry: Raster Mask ──────────────────────────
    ('mask', Binary),                            # PNG-encoded grayscale raster pixels
    ('mask_score', Float32),                    # OPTIONAL

    # ── Geometry: 2D Bounding Box ──────────────────────
    ('box2d', Array(Float32, shape=(4,))),      # layout from metadata
    ('box2d_score', Float32),                   # OPTIONAL

    # ── Geometry: 3D Bounding Box ──────────────────────
    ('box3d', Array(Float32, shape=(6,))),      # [cx, cy, cz, w, h, l]
    ('box3d_score', Float32),                   # OPTIONAL

    # ── Annotation Metadata (optional) ─────────────────
    ('iscrowd', Boolean),                       # OPTIONAL - true = crowd region, false or absent
    ('category_frequency', Categorical(ordering='physical')),  # OPTIONAL - LVIS "f"/"c"/"r"

    # ── Sample Metadata (optional) ─────────────────────
    ('size', Array(UInt32, shape=(2,))),         # [width, height]
    ('location', Array(Float32, shape=(2,))),    # [lat, lon]
    ('pose', Array(Float32, shape=(3,))),        # [yaw, pitch, roll]
    ('degradation', String),
    ('neg_label_indices', List(UInt32)),          # OPTIONAL - LVIS negative categories
    ('not_exhaustive_label_indices', List(UInt32)),  # OPTIONAL - LVIS incomplete categories

    # ── Instrumentation (optional) ─────────────────────
    ('timing', Struct({
        'load': Int64,
        'preprocess': Int64,
        'inference': Int64,
        'decode': Int64,
    })),
)

File-Level Metadata

Arrow IPC stores key-value metadata on the schema, while Parquet stores key-value metadata in the file footer. In both formats, all metadata values are strings.

| Key | Values | Default (absent) | Description |
|---|---|---|---|
| schema_version | "2026.04" | "2025.10" | Format version. Absent = legacy file. |
| box2d_format | "cxcywh", "xyxy", "ltwh" | "cxcywh" | Box2D layout descriptor |
| box2d_normalized | "true", "false" | "true" | Box2D coordinate system |
| box3d_format | "cxcyczwhl" | "cxcyczwhl" | Box3D layout descriptor |
| box3d_normalized | "true", "false" | "true" | Box3D coordinate system |
| mask_interpretation | "binary", "confidence", "sigmoid", "logits" | "binary" | Pixel value meaning |
| category_metadata | JSON string | absent | Per-label metadata (synset, synonyms, definition) |
| labels | JSON array ["person", "car", ...] | absent | Ordered class names for semantic segmentation masks; labels[i] = class name for argmax pixel value i |

Version format is YYYY.MM with mandatory zero-padding (e.g., "2025.10", "2026.04"). Versions are compared lexicographically. Unknown future versions should trigger a warning (not an error) and attempt best-effort reading via schema introspection.
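Because versions are zero-padded YYYY.MM strings, plain string comparison orders them correctly. A sketch of the recommended reader behavior (hypothetical function name; the metadata dict shape is an assumption):

```python
def check_version(metadata):
    """Resolve the schema version from file-level metadata, warning
    (not failing) on unknown future versions."""
    version = metadata.get("schema_version", "2025.10")  # absent = legacy file
    latest_known = "2026.04"
    # Zero-padded YYYY.MM strings compare correctly lexicographically.
    if version > latest_known:
        print(f"warning: unknown schema version {version}; "
              "attempting best-effort read via schema introspection")
    return version
```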

Category Metadata

The category_metadata key stores per-label reference data as a JSON-encoded string. This enriches label semantics without adding per-row columns for data that is constant across all annotations sharing the same label.

{
  "aerosol_can": {
    "id": 1,
    "supercategory": "accessory",
    "synset": "aerosol.n.02",
    "synonyms": ["aerosol_can", "spray_can"],
    "definition": "a dispenser that holds a substance under pressure"
  },
  "person": {
    "id": 1,
    "supercategory": "human",
    "synset": "person.n.01",
    "synonyms": ["person", "individual"],
    "definition": "a human being"
  }
}
| Field | Type | Description |
|---|---|---|
| id | integer | Source category ID (used to reconstruct category_id for categories with no annotations) |
| supercategory | string | Parent category name (e.g., "vehicle", "animal") |
| synset | string | WordNet synset identifier (e.g., "aerosol.n.02") |
| synonyms | array of strings | Alternate names for the category |
| definition | string | Natural language definition (LVIS def field, renamed for clarity) |

Source: When importing from COCO with LVIS extensions, these fields are populated from the LVIS categories array. Other datasets with taxonomic metadata can populate the same fields.

Frequency is a column, not metadata

The frequency field from LVIS is stored as the category_frequency column (not in category_metadata) because it is directly useful for DataFrame filtering and disaggregated metrics. The image_count and instance_count fields from LVIS are intentionally not stored — they are recomputable statistics.

Labels Metadata

New in 2026.04

The labels key stores an ordered array of class names as a JSON-encoded string. This metadata provides the index-to-name mapping for semantic segmentation masks where each pixel value is an argmax class index.

Structure: JSON array where labels[i] is the class name for pixel value i:

["background", "person", "car", "bicycle", "dog"]

In this example, pixel value 0 = "background", pixel value 1 = "person", pixel value 2 = "car", etc.
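Decoding a pixel value to a class name is then a simple list lookup on the parsed metadata. A minimal sketch (the function name is hypothetical; out-of-range pixel values are treated as unmapped):

```python
import json

def pixel_to_class(pixel_value, labels_json):
    """Map a semantic-segmentation argmax pixel value to its class name
    using the JSON-encoded labels metadata."""
    labels = json.loads(labels_json)
    if pixel_value >= len(labels):
        return None  # pixel index outside the label list
    return labels[pixel_value]
```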

When written: Optional — only written when the source dataset provides an ordered category list (e.g., COCO categories sorted by ID).

Relationship to category_metadata: The labels array provides index ordering for mask pixel interpretation. The category_metadata object provides rich per-label reference data (synset, synonyms, definition). Both may be present; they complement each other.

label_index

The label_index column stores the source-faithful category identifier. When importing from COCO or LVIS, the original category_id is preserved directly as label_index.

Key characteristics:

  • May be non-contiguous — COCO uses IDs 1–90 for 80 categories (gaps at 12, 26, 29, 30, etc.). LVIS uses IDs up to ~1723 for 1,203 categories.
  • May not start at zero — COCO starts at 1, not 0.
  • Must be preserved on round-trip — exporting back to COCO/LVIS reconstructs the original category_id from label_index.

Model training note

Models are typically trained with a dense remapping (e.g., 80 contiguous class indices for COCO). This remapping is a model-specific concern handled in the training pipeline, not in the dataset format. Some legacy models (notably older SSDs) are trained with the original gaps and produce 91 outputs (90 categories + background); this is likewise a model-specific convention.