Dataset Format Overview

The EdgeFirst Dataset Format provides a structured, self-describing representation for multi-sensor annotations. Version 2026.04 introduces Parquet support, polygon geometry, raster masks, confidence scores, and file-level metadata that makes every file interpretable without external context.

graph TB
    subgraph Dataset["EdgeFirst Dataset"]
        direction TB
        Storage["Storage Container<br/>(ZIP or Directory)"]
        Annotations["Annotations<br/>(Arrow, Parquet, or JSON)"]
    end

    Storage --> |"Images, PCD, etc."| Sensor["Sensor Data<br/>(Immutable)"]
    Annotations --> |"Labels, Boxes, Masks"| Labels["Annotation Data<br/>(Editable)"]

    Sensor --> Camera["Camera"]
    Sensor --> Radar["Radar"]
    Sensor --> LiDAR["LiDAR"]

    Labels --> Box2D["2D Boxes"]
    Labels --> Box3D["3D Boxes"]
    Labels --> Polygons["Polygons"]
    Labels --> Masks["Raster Masks"]

    style Dataset fill:#e1f5ff,stroke:#0277bd,stroke-width:3px
    style Storage fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
    style Annotations fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style Sensor fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
    style Labels fill:#fce4ec,stroke:#c2185b,stroke-width:2px

Key Principles

  • Normalized coordinates — all spatial data uses the 0..1 range (resolution-independent)
  • Three storage formats — Arrow IPC (local performance), Parquet (transfer / interop), JSON (human-readable)
  • Self-describing files — file-level metadata records schema_version, box layouts, and mask interpretation
  • One row per instance — flat columnar layout optimized for ML queries
  • Sensor data always external — Arrow/Parquet/JSON contain annotations only; sensor data lives in sibling folders or ZIP files
  • Lossless data representation — annotation data converts between formats without loss
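
As an illustration of the normalized-coordinate principle, the sketch below scales a normalized point to pixel coordinates for two different image sizes. The helper name `to_pixels` and the image sizes are hypothetical and are not part of the EdgeFirst SDK; the sketch assumes x is scaled by image width and y by image height.

```python
def to_pixels(x: float, y: float, width: int, height: int) -> tuple[float, float]:
    """Scale a normalized (0..1) point to pixel coordinates (hypothetical helper)."""
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError(f"expected normalized coordinates in 0..1, got ({x}, {y})")
    return x * width, y * height


# The same normalized point maps onto any resolution without re-annotation.
point_hd = to_pixels(0.5, 0.25, 1920, 1080)
point_vga = to_pixels(0.5, 0.25, 640, 480)
print(point_hd, point_vga)
```

Because the stored values never depend on image size, the same annotation row can be applied to resized or cropped-and-rescaled copies of the sensor data by changing only `width` and `height`.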

Quick Start (Python)

import polars as pl

# Read Arrow IPC or Parquet
df = pl.read_ipc("dataset.arrow")
# df = pl.read_parquet("dataset.parquet")

# Quick version check — for robust metadata-based detection see Conversion Guidelines
if "polygon" in df.columns:
    print("2026.04 format (has polygon column)")
    polygons = df["polygon"]       # List<List<f32>> — interleaved xy per ring
elif "mask" in df.columns and df["mask"].dtype == pl.Binary:
    print("2026.04 format (has Binary mask)")
else:
    print("2025.10 or earlier format")
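
The polygon comment above describes each ring as a flat list of interleaved coordinates. A minimal sketch of unpacking one such value into (x, y) pairs follows; it assumes the order within a ring is x0, y0, x1, y1, and so on, and the helper name `ring_to_points` and the sample triangle are hypothetical.

```python
def ring_to_points(ring: list[float]) -> list[tuple[float, float]]:
    """Split a flat [x0, y0, x1, y1, ...] ring into (x, y) pairs (hypothetical helper)."""
    if len(ring) % 2 != 0:
        raise ValueError("interleaved ring must hold an even number of values")
    # Even indices are x values, odd indices are y values.
    return list(zip(ring[0::2], ring[1::2]))


# One polygon value: a list of rings, here a single triangle in normalized coordinates.
polygon = [[0.1, 0.1, 0.9, 0.1, 0.5, 0.8]]
points = [ring_to_points(ring) for ring in polygon]
print(points)
```

With Polars, a value shaped like `polygon` could be obtained from one element of `df["polygon"].to_list()`; the Schema page remains the authority on the exact ring layout.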

Use the EdgeFirst Client SDK

The SDK handles version detection, format conversion, and metadata extraction automatically. Direct Polars access is shown here for illustration; prefer the SDK for production code.

Section Map

Page                 Contents
Schema               Full column definitions, types, and field semantics
Box Formats          Box2D / Box3D layout descriptors and metadata
Storage Formats      Arrow IPC, Parquet, and JSON comparison
Conversion           Code examples for reading and converting between formats
Sensors              Camera, Radar, and LiDAR sensor file types
Directory Structure  File naming, directory layout, and ZIP support
Migration Guide      Upgrading from 2025.10 to 2026.04