# Conversion Guidelines
> **Warning: 2025.10 code is incompatible with 2026.04 files.**
> Code written for the 2025.10 schema (NaN-separated masks, `mask: List<Float32>`)
> will produce corrupt data when applied to 2026.04 files. Always check the
> schema version before processing. See the Migration Guide for
> upgrade instructions.
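For context on the old layout: a 2025.10 `mask` value is a flat `List<Float32>` of interleaved coordinates with NaN sentinels between polygon rings. A minimal sketch of splitting such a list into rings, assuming that layout:

```python
import math

def split_nan_separated(coords):
    """Split a 2025.10 NaN-separated flat coordinate list into rings.

    Assumes interleaved [x, y, x, y, ...] values with a NaN sentinel
    between polygon rings, as described above.
    """
    rings, current = [], []
    for v in coords:
        if math.isnan(v):
            if current:
                rings.append(current)
            current = []
        else:
            current.append(v)
    if current:
        rings.append(current)
    return rings

# Two rings separated by a NaN sentinel
flat = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, float("nan"), 5.0, 5.0, 6.0, 5.0]
print(split_nan_separated(flat))
# [[0.0, 0.0, 1.0, 0.0, 1.0, 1.0], [5.0, 5.0, 6.0, 5.0]]
```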
## Version Detection
Always detect the schema version before reading annotation data.
### Arrow / Parquet Files
```python
import pyarrow.ipc as ipc
import pyarrow.parquet as pq
import polars as pl

# Method 1 (preferred): check the schema_version metadata
def get_schema_version(path: str) -> str:
    """Read schema_version from Arrow IPC or Parquet file metadata."""
    if path.endswith(".parquet"):
        metadata = pq.read_schema(path).metadata or {}
    else:
        with open(path, "rb") as f:
            metadata = ipc.open_file(f).schema.metadata or {}
    return metadata.get(b"schema_version", b"").decode()

schema_version = get_schema_version("dataset.arrow")
if schema_version:
    version = schema_version  # e.g. "2025.10" or "2026.04"
else:
    # Method 2 (fallback): inspect column presence and types
    df = pl.read_ipc("dataset.arrow")  # or pl.read_parquet(...)
    if "polygon" in df.columns:
        version = "2026.04"
    elif "mask" in df.columns:
        mask_dtype = str(df["mask"].dtype)
        if mask_dtype.startswith("List(Float32"):
            version = "2025.10"  # NaN-separated polygon coordinates
        elif mask_dtype == "Binary":
            version = "2026.04"  # PNG-encoded raster pixels
        else:
            version = "unknown"
    else:
        version = "2025.10"  # no geometry columns, no metadata
```
### JSON Files
```python
import json

with open("annotations.json") as f:
    data = json.load(f)

if isinstance(data, list):
    # 2025.10: bare array of samples
    samples = data
    version = "2025.10"
else:
    # 2026.04: object wrapper with metadata
    samples = data["samples"]
    version = data.get("schema_version", "2025.10")
```
## Reading 2026.04 Files

### Arrow IPC / Parquet
```python
import polars as pl

# Arrow IPC
df = pl.read_ipc("dataset.arrow")
# Parquet
df = pl.read_parquet("dataset.parquet")

# Access polygon data
if "polygon" in df.columns:
    for row in df.iter_rows(named=True):
        if row["polygon"] is not None:
            for ring in row["polygon"]:
                # ring is [x1, y1, x2, y2, ...] interleaved
                points = list(zip(ring[0::2], ring[1::2]))

# Access raster mask data
if "mask" in df.columns:
    for row in df.iter_rows(named=True):
        if row["mask"] is not None and row["size"] is not None:
            width, height = row["size"]
            png_bytes = row["mask"]  # bytes (PNG-encoded raster pixels)

# Access box2d -- check format metadata
# (Schema metadata access depends on Polars version; prefer the EdgeFirst Client SDK)
if "box2d" in df.columns:
    for row in df.iter_rows(named=True):
        if row["box2d"] is not None:
            # Default: [cx, cy, w, h] -- check box2d_format metadata if available
            cx, cy, w, h = row["box2d"]

# Access timing instrumentation (nanoseconds -> milliseconds)
if "timing" in df.columns:
    for row in df.iter_rows(named=True):
        if row["timing"] is not None:
            t = row["timing"]
            load_ms = t["load"] / 1_000_000
            inference_ms = t["inference"] / 1_000_000
```
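The PNG-encoded mask bytes can be decoded to a pixel array with any PNG decoder. A minimal sketch using Pillow and NumPy (assumed available here, not required by the schema); the single-channel layout and the `[width, height]` order of the `size` column follow the description above:

```python
import io

import numpy as np
from PIL import Image

def decode_mask(png_bytes: bytes, size) -> np.ndarray:
    """Decode a 2026.04 Binary mask value (PNG bytes) into a 2-D array.

    Assumes a single-channel PNG whose dimensions match the sample's
    `size` column ([width, height]).
    """
    arr = np.asarray(Image.open(io.BytesIO(png_bytes)))
    width, height = size
    assert arr.shape == (height, width), "mask/size mismatch"
    return arr

# Round-trip demo: encode a tiny binary mask, then decode it back.
mask = np.zeros((4, 6), dtype=np.uint8)  # height=4, width=6
mask[1:3, 2:5] = 255
buf = io.BytesIO()
Image.fromarray(mask).save(buf, format="PNG")
decoded = decode_mask(buf.getvalue(), size=(6, 4))
```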
### Reading Parquet with DuckDB
```python
import duckdb

# Count labels
result = duckdb.sql("""
    SELECT label, count(*) AS count
    FROM 'dataset.parquet'
    GROUP BY label
    ORDER BY count DESC
""")
print(result)

# Filter by score
result = duckdb.sql("""
    SELECT name, label, box2d, box2d_score
    FROM 'dataset.parquet'
    WHERE box2d_score > 0.8
""")
```
## JSON to DataFrame Conversion (2026.04)

### Column Name Mapping
| Arrow / Parquet column | JSON field | Notes |
|---|---|---|
| `label` | `label_name` | Historical naming difference |
| `group` | `group_name` | Historical naming difference |
| `object_id` | `object_id` | 2026.04 uses `object_id` (not legacy `object_reference`) |
| `polygon` | `polygon` | JSON: `[[x,y], ...]` pairs; Arrow: interleaved `[x,y,x,y,...]` |
| `mask` | `mask` | Arrow: Binary (PNG bytes); JSON: base64-encoded PNG string |
| `iscrowd` | `iscrowd` | Boolean (`true`/`false`) in both formats |
| `category_frequency` | `category_frequency` | Same in both formats (`"f"`, `"c"`, `"r"`) |
| `neg_label_indices` | `neg_label_indices` | Arrow: `List<UInt32>`; JSON: array of integers |
| `not_exhaustive_label_indices` | `not_exhaustive_label_indices` | Arrow: `List<UInt32>`; JSON: array of integers |
| `pose` | `sensors.imu` | Arrow: `[yaw, pitch, roll]`; JSON: `{yaw, pitch, roll}` object |
| `location` | `sensors.gps` | Arrow: `[lat, lon]`; JSON: `{latitude, longitude}` object |
File-level metadata keys (`mask_interpretation`, `category_metadata`, `box2d_format`, etc.)
are not per-row columns. They are stored in the Arrow/Parquet schema metadata or in the
JSON top-level object. See File-Level Metadata for the full list.
### Full Conversion Example
```python
import base64
import json

import polars as pl

with open("annotations.json") as f:
    data = json.load(f)

if isinstance(data, list):
    samples = data
    box2d_format = "ltwh"  # JSON default is ltwh (COCO convention)
else:
    samples = data["samples"]
    box2d_format = data.get("box2d_format", "ltwh")  # Arrow default is cxcywh; JSON default is ltwh

rows = []
for sample in samples:
    size = [sample.get("width"), sample.get("height")]
    for ann in sample.get("annotations", []):
        row = {
            "name": sample["image_name"].rsplit(".", 1)[0],
            "frame": sample.get("frame_number"),
            "object_id": ann.get("object_id"),
            "label": ann["label_name"],
            "label_index": ann.get("label_index"),
            "group": sample.get("group_name"),
        }

        # Polygon: JSON [[x,y],...] per ring -> DataFrame [x,y,x,y,...] per ring
        if ann.get("polygon"):
            row["polygon"] = [
                [coord for pt in ring for coord in pt]
                for ring in ann["polygon"]
            ]
            row["polygon_score"] = ann.get("polygon_score")

        # Mask: JSON base64 PNG -> DataFrame Binary (PNG bytes)
        if ann.get("mask") and isinstance(ann["mask"], str):
            row["mask"] = base64.b64decode(ann["mask"])  # PNG bytes
            row["mask_score"] = ann.get("mask_score")

        # Box2D: convert based on format metadata
        if ann.get("box2d"):
            b = ann["box2d"]
            if box2d_format == "ltwh":
                row["box2d"] = [b["x"] + b["w"] / 2, b["y"] + b["h"] / 2, b["w"], b["h"]]
            elif box2d_format == "cxcywh":
                row["box2d"] = [b["cx"], b["cy"], b["w"], b["h"]]
            row["box2d_score"] = ann.get("box2d_score")

        # Box3D: x, y, z are center coordinates (not corner)
        if ann.get("box3d"):
            b3 = ann["box3d"]
            row["box3d"] = [b3["x"], b3["y"], b3["z"], b3["w"], b3["h"], b3["l"]]
            row["box3d_score"] = ann.get("box3d_score")

        # Annotation metadata (COCO/LVIS extensions)
        if "iscrowd" in ann:
            row["iscrowd"] = bool(ann["iscrowd"])  # ensure Boolean (handles legacy 0/1)
        if "category_frequency" in ann:
            row["category_frequency"] = ann["category_frequency"]

        # Sample-level LVIS fields (repeated per annotation row)
        if "neg_label_indices" in sample:
            row["neg_label_indices"] = sample["neg_label_indices"]
        if "not_exhaustive_label_indices" in sample:
            row["not_exhaustive_label_indices"] = sample["not_exhaustive_label_indices"]

        row["size"] = size
        rows.append(row)

df = pl.DataFrame(rows)
df.write_ipc("annotations.arrow")  # Arrow IPC
# df.write_parquet("annotations.parquet")  # or Parquet
```
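The example above omits the sensor fields. A sketch of the `sensors.imu`/`sensors.gps` mapping from the column table, assuming the field shapes shown there:

```python
def sensors_to_columns(sample: dict) -> dict:
    """Map 2026.04 JSON sensor objects to the Arrow column layout.

    sensors.imu {yaw, pitch, roll}      -> pose     [yaw, pitch, roll]
    sensors.gps {latitude, longitude}   -> location [lat, lon]
    """
    cols = {}
    sensors = sample.get("sensors", {})
    if "imu" in sensors:
        imu = sensors["imu"]
        cols["pose"] = [imu["yaw"], imu["pitch"], imu["roll"]]
    if "gps" in sensors:
        gps = sensors["gps"]
        cols["location"] = [gps["latitude"], gps["longitude"]]
    return cols

sample = {"sensors": {"imu": {"yaw": 1.0, "pitch": 0.5, "roll": 0.0},
                      "gps": {"latitude": 45.0, "longitude": -75.7}}}
print(sensors_to_columns(sample))
# {'pose': [1.0, 0.5, 0.0], 'location': [45.0, -75.7]}
```

In the full conversion loop these columns would be merged into each annotation `row`, repeated per annotation like the other sample-level fields.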
### Key Conversions Summary
| # | Conversion | Direction |
|---|---|---|
| 1 | Unnest: one row per annotation | JSON to DataFrame |
| 2 | Column names: `label_name` to `label`, `group_name` to `group` | JSON to DataFrame |
| 3 | Polygon: `[[x,y],...]` point pairs to `[x,y,x,y,...]` interleaved | JSON to DataFrame |
| 4 | Mask: base64 PNG string to Binary (PNG bytes) | JSON to DataFrame |
| 5 | Box2D: check `box2d_format`; convert ltwh to cxcywh if needed | JSON to DataFrame |
| 6 | Box3D: `{x,y,z,w,h,l}` to `[cx,cy,cz,w,h,l]` | JSON to DataFrame |
| 7 | GPS: `{latitude, longitude}` to `[lat, lon]` | JSON to DataFrame |
| 8 | IMU: `{yaw, pitch, roll}` to `[yaw, pitch, roll]` | JSON to DataFrame |
| 9 | Score columns: omit entirely for ground truth files | Both |
| 10 | `neg_label_indices` / `not_exhaustive_label_indices`: sample-level, repeated per annotation row | JSON to DataFrame |
| 11 | `label_index`: preserved as-is (source-faithful, non-contiguous) | Both |
| 12 | `mask_interpretation`: file-level metadata (`"binary"`, `"confidence"`, `"sigmoid"`, `"logits"`); set on the Arrow schema, not per-row | Both |
| 13 | `category_metadata`: file-level metadata; JSON-encoded string of per-label synset/synonyms/definition. Extract from the LVIS `categories` array when importing; attach to Arrow schema metadata when writing. | Both |
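Conversion 5 can be sketched as a pair of helpers, following the ltwh/cxcywh arithmetic used in the conversion example above:

```python
def ltwh_to_cxcywh(b):
    """COCO-style [left, top, width, height] -> center-based [cx, cy, w, h]."""
    x, y, w, h = b
    return [x + w / 2, y + h / 2, w, h]

def cxcywh_to_ltwh(b):
    """Center-based [cx, cy, w, h] -> COCO-style [left, top, width, height]."""
    cx, cy, w, h = b
    return [cx - w / 2, cy - h / 2, w, h]

box = [10.0, 20.0, 4.0, 8.0]       # ltwh
print(ltwh_to_cxcywh(box))
# [12.0, 24.0, 4.0, 8.0]
```

The two helpers are exact inverses, so a round trip returns the original box.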
## Use the EdgeFirst Client SDK
The SDK handles all conversions automatically, including version detection and backward compatibility. Direct conversion code is shown here for reference and for users who need custom pipelines.