Conversion Guidelines

2025.10 code is incompatible with 2026.04 files

Code written for the 2025.10 schema (NaN-separated masks, mask: List<Float32>) will produce corrupt data when applied to 2026.04 files. Always check the schema version before processing. See the Migration Guide for upgrade instructions.

Version Detection

Always detect the schema version before reading annotation data.

Arrow / Parquet Files

import pyarrow.ipc as ipc
import pyarrow.parquet as pq
import polars as pl

# Method 1 (preferred): Check schema_version metadata
def get_schema_version(path: str) -> str:
    """Read schema_version from Arrow IPC or Parquet file metadata."""
    if path.endswith(".parquet"):
        metadata = pq.read_schema(path).metadata or {}
    else:
        with open(path, "rb") as f:
            metadata = ipc.open_file(f).schema.metadata or {}
    return metadata.get(b"schema_version", b"").decode()

schema_version = get_schema_version("dataset.arrow")

if schema_version:
    version = schema_version  # e.g. "2025.10" or "2026.04"
else:
    # Method 2 (fallback): Inspect column presence and types
    df = pl.read_ipc("dataset.arrow")  # or pl.read_parquet(...)

    if "polygon" in df.columns:
        version = "2026.04"
    elif "mask" in df.columns:
        mask_dtype = str(df["mask"].dtype)
        if mask_dtype.startswith("List(Float32"):
            version = "2025.10"    # NaN-separated polygon coordinates
        elif mask_dtype == "Binary":
            version = "2026.04"    # PNG-encoded raster pixels
        else:
            version = "unknown"
    else:
        version = "2025.10"        # no geometry columns, no metadata

JSON Files

import json

with open("annotations.json") as f:
    data = json.load(f)

if isinstance(data, list):
    # 2025.10: bare array of samples
    samples = data
    version = "2025.10"
else:
    # 2026.04: object wrapper with metadata
    samples = data["samples"]
    version = data.get("schema_version", "2025.10")
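The two layouts can be sketched side by side. Only the list-vs-object shape and the `schema_version` key come from the format; the sample field values below are illustrative, and the helper name is hypothetical.

```python
def detect_json_version(data):
    """Distinguish a 2025.10 bare list from a 2026.04 wrapper object."""
    if isinstance(data, list):
        return "2025.10", data
    return data.get("schema_version", "2025.10"), data["samples"]

legacy = [{"image_name": "img001.png", "annotations": []}]  # 2025.10 shape
wrapped = {"schema_version": "2026.04", "samples": []}      # 2026.04 shape

print(detect_json_version(legacy)[0])   # 2025.10
print(detect_json_version(wrapped)[0])  # 2026.04
```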

Reading 2026.04 Files

Arrow IPC / Parquet

import polars as pl

# Arrow IPC
df = pl.read_ipc("dataset.arrow")

# Parquet
df = pl.read_parquet("dataset.parquet")

# Access polygon data
if "polygon" in df.columns:
    for row in df.iter_rows(named=True):
        if row["polygon"] is not None:
            for ring in row["polygon"]:
                # ring is [x1, y1, x2, y2, ...] interleaved
                points = list(zip(ring[0::2], ring[1::2]))

# Access raster mask data
if "mask" in df.columns:
    for row in df.iter_rows(named=True):
        if row["mask"] is not None and row["size"] is not None:
            width, height = row["size"]
            png_bytes = row["mask"]  # bytes (PNG-encoded raster pixels)
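To turn those PNG bytes into a pixel array, one option is Pillow plus NumPy (both assumed available here; neither is required by the format itself). The snippet PNG-encodes a tiny stand-in mask and decodes it back; how the pixel values should be interpreted (binary, confidence, etc.) is governed by the `mask_interpretation` file-level metadata.

```python
import io
import numpy as np
from PIL import Image

# Stand-in for row["mask"]: PNG-encode a tiny 4x3 binary mask.
mask = np.array([[0, 255, 255, 0],
                 [0, 255, 255, 0],
                 [0, 0, 255, 0]], dtype=np.uint8)
buf = io.BytesIO()
Image.fromarray(mask, mode="L").save(buf, format="PNG")
png_bytes = buf.getvalue()

# Decode the Binary column back into a (height, width) array.
decoded = np.asarray(Image.open(io.BytesIO(png_bytes)))
print(decoded.shape)  # (3, 4)
```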

# Access box2d — check format metadata
# (Schema metadata access depends on Polars version; prefer the EdgeFirst Client SDK)
if "box2d" in df.columns:
    for row in df.iter_rows(named=True):
        if row["box2d"] is not None:
            # Default: [cx, cy, w, h] — check box2d_format metadata if available
            cx, cy, w, h = row["box2d"]
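Many drawing and IoU routines expect corner-based boxes, so a pair of small helpers (names are hypothetical) for converting out of the default cxcywh layout can be handy:

```python
def cxcywh_to_xyxy(box):
    """[cx, cy, w, h] -> [x_min, y_min, x_max, y_max]."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]

def cxcywh_to_ltwh(box):
    """[cx, cy, w, h] -> COCO-style [left, top, w, h]."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, w, h]

print(cxcywh_to_xyxy([50.0, 40.0, 20.0, 10.0]))  # [40.0, 35.0, 60.0, 45.0]
print(cxcywh_to_ltwh([50.0, 40.0, 20.0, 10.0]))  # [40.0, 35.0, 20.0, 10.0]
```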

# Access timing instrumentation
if "timing" in df.columns:
    for row in df.iter_rows(named=True):
        if row["timing"] is not None:
            t = row["timing"]
            load_ms = t["load"] / 1_000_000
            inference_ms = t["inference"] / 1_000_000

Reading Parquet with DuckDB

import duckdb

# Count labels
result = duckdb.sql("""
    SELECT label, count(*) as count
    FROM 'dataset.parquet'
    GROUP BY label
    ORDER BY count DESC
""")
print(result)

# Filter by score
result = duckdb.sql("""
    SELECT name, label, box2d, box2d_score
    FROM 'dataset.parquet'
    WHERE box2d_score > 0.8
""")

JSON to DataFrame Conversion (2026.04)

Column Name Mapping

| Arrow / Parquet column | JSON field | Notes |
|---|---|---|
| label | label_name | Historical naming difference |
| group | group_name | Historical naming difference |
| object_id | object_id | 2026.04 uses object_id (not legacy object_reference) |
| polygon | polygon | JSON: [[x,y], ...] pairs; Arrow: interleaved [x,y,x,y,...] |
| mask | mask | Arrow: Binary (PNG bytes); JSON: base64-encoded PNG string |
| iscrowd | iscrowd | Boolean (true/false) in both formats |
| category_frequency | category_frequency | Same in both formats ("f", "c", "r") |
| neg_label_indices | neg_label_indices | Arrow: List\<UInt32\>; JSON: array of integers |
| not_exhaustive_label_indices | not_exhaustive_label_indices | Arrow: List\<UInt32\>; JSON: array of integers |
| pose | sensors.imu | Arrow: [yaw, pitch, roll]; JSON: {yaw, pitch, roll} object |
| location | sensors.gps | Arrow: [lat, lon]; JSON: {latitude, longitude} object |

File-level metadata keys (mask_interpretation, category_metadata, box2d_format, etc.) are not per-row columns. They are stored in the Arrow/Parquet schema metadata or in the JSON top-level object. See File-Level Metadata for the full list.

Full Conversion Example

import polars as pl
import json, base64

with open("annotations.json") as f:
    data = json.load(f)

if isinstance(data, list):
    samples = data
    box2d_format = "ltwh"       # JSON default is ltwh (COCO convention)
else:
    samples = data["samples"]
    box2d_format = data.get("box2d_format", "ltwh")  # Arrow default is cxcywh; JSON default is ltwh

rows = []
for sample in samples:
    size = [sample.get("width"), sample.get("height")]
    for ann in sample.get("annotations", []):
        row = {
            "name": sample["image_name"].rsplit(".", 1)[0],
            "frame": sample.get("frame_number"),
            "object_id": ann.get("object_id"),
            "label": ann["label_name"],
            "label_index": ann.get("label_index"),
            "group": sample.get("group_name"),
        }

        # Polygon: JSON [[x,y],...] per ring -> DataFrame [x,y,x,y,...] per ring
        if ann.get("polygon"):
            row["polygon"] = [
                [coord for pt in ring for coord in pt]
                for ring in ann["polygon"]
            ]
            row["polygon_score"] = ann.get("polygon_score")

        # Mask: JSON base64 PNG -> DataFrame Binary (PNG bytes)
        if ann.get("mask") and isinstance(ann["mask"], str):
            row["mask"] = base64.b64decode(ann["mask"])  # PNG bytes
            row["mask_score"] = ann.get("mask_score")

        # Box2D: convert based on format metadata
        if ann.get("box2d"):
            b = ann["box2d"]
            if box2d_format == "ltwh":
                row["box2d"] = [b["x"] + b["w"]/2, b["y"] + b["h"]/2, b["w"], b["h"]]
            elif box2d_format == "cxcywh":
                row["box2d"] = [b["cx"], b["cy"], b["w"], b["h"]]
            row["box2d_score"] = ann.get("box2d_score")

        # Box3D: x,y,z are center coordinates (not corner)
        if ann.get("box3d"):
            b3 = ann["box3d"]
            row["box3d"] = [b3["x"], b3["y"], b3["z"], b3["w"], b3["h"], b3["l"]]
            row["box3d_score"] = ann.get("box3d_score")

        # Annotation metadata (COCO/LVIS extensions)
        if "iscrowd" in ann:
            row["iscrowd"] = bool(ann["iscrowd"])  # ensure Boolean (handles legacy 0/1)
        if "category_frequency" in ann:
            row["category_frequency"] = ann["category_frequency"]

        # Sample-level LVIS fields (repeated per annotation row)
        if "neg_label_indices" in sample:
            row["neg_label_indices"] = sample["neg_label_indices"]
        if "not_exhaustive_label_indices" in sample:
            row["not_exhaustive_label_indices"] = sample["not_exhaustive_label_indices"]

        row["size"] = size
        rows.append(row)

df = pl.DataFrame(rows)
df.write_ipc("annotations.arrow")       # Arrow IPC
# df.write_parquet("annotations.parquet")  # or Parquet
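For custom pipelines that also export back to JSON, the polygon conversion has to be inverted. A small sketch of both directions (helper names are hypothetical):

```python
def interleaved_to_pairs(ring):
    """DataFrame [x1, y1, x2, y2, ...] -> JSON [[x1, y1], [x2, y2], ...]."""
    return [[x, y] for x, y in zip(ring[0::2], ring[1::2])]

def pairs_to_interleaved(ring):
    """JSON [[x, y], ...] -> DataFrame [x1, y1, x2, y2, ...]."""
    return [coord for pt in ring for coord in pt]

ring = [10.0, 20.0, 30.0, 20.0, 20.0, 40.0]
pairs = interleaved_to_pairs(ring)
print(pairs)                        # [[10.0, 20.0], [30.0, 20.0], [20.0, 40.0]]
print(pairs_to_interleaved(pairs))  # round-trips back to the interleaved form
```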

Key Conversions Summary

| # | Conversion | Direction |
|---|---|---|
| 1 | Unnest: one row per annotation | JSON to DataFrame |
| 2 | Column names: label_name to label, group_name to group | JSON to DataFrame |
| 3 | Polygon: [[x,y],...] point pairs to [x,y,x,y,...] interleaved | JSON to DataFrame |
| 4 | Mask: base64 PNG string to Binary (PNG bytes) | JSON to DataFrame |
| 5 | Box2D: check box2d_format; convert ltwh to cxcywh if needed | JSON to DataFrame |
| 6 | Box3D: {x,y,z,w,h,l} to [cx,cy,cz,w,h,l] | JSON to DataFrame |
| 7 | GPS: {latitude, longitude} to [lat, lon] | JSON to DataFrame |
| 8 | IMU: {yaw, pitch, roll} to [yaw, pitch, roll] | JSON to DataFrame |
| 9 | Score columns: omit entirely for ground truth files | Both |
| 10 | neg_label_indices / not_exhaustive_label_indices: sample-level, repeated per annotation row | JSON to DataFrame |
| 11 | label_index: preserved as-is (source-faithful, non-contiguous) | Both |
| 12 | mask_interpretation: file-level metadata ("binary", "confidence", "sigmoid", "logits"); set on the Arrow schema, not per-row | Both |
| 13 | category_metadata: file-level metadata; JSON-encoded string of per-label synset/synonyms/definition. Extract from LVIS categories array when importing; attach to Arrow schema metadata when writing. | Both |
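Conversions 7 and 8 are not covered by the conversion example above; one way to flatten the sensor objects is sketched below. The helper name and sample values are hypothetical; the field names come from the column mapping table.

```python
def convert_sensors(sample):
    """Flatten JSON sensors.gps / sensors.imu objects into list columns."""
    out = {}
    gps = sample.get("sensors", {}).get("gps")
    if gps:
        out["location"] = [gps["latitude"], gps["longitude"]]
    imu = sample.get("sensors", {}).get("imu")
    if imu:
        out["pose"] = [imu["yaw"], imu["pitch"], imu["roll"]]
    return out

sample = {"sensors": {"gps": {"latitude": 45.5, "longitude": -73.6},
                      "imu": {"yaw": 0.1, "pitch": 0.0, "roll": -0.2}}}
print(convert_sensors(sample))
# {'location': [45.5, -73.6], 'pose': [0.1, 0.0, -0.2]}
```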

Use the EdgeFirst Client SDK

The SDK handles all conversions automatically, including version detection and backward compatibility. Direct conversion code is shown here for reference and for users who need custom pipelines.