Storage Formats

EdgeFirst 2026.04 supports three annotation storage formats. All three express the same logical schema — the choice is a deployment / transfer concern, not a data model concern.

Format Tiers

|                  | Arrow IPC                              | Parquet                                | JSON                                   |
|------------------|----------------------------------------|----------------------------------------|----------------------------------------|
| Extension        | `.arrow`                               | `.parquet`                             | `.json`                                |
| Use Case         | Local ML training, fast random access  | Transfer, cloud storage, interop       | Human-readable, API exchange           |
| Structure        | Flat columnar (one row per annotation) | Flat columnar (one row per annotation) | Nested (sample with annotations array) |
| Compression      | None (memory-mapped)                   | ZSTD columnar compression              | None (text)                            |
| File Size        | Medium                                 | Smallest                               | Largest                                |
| Read Performance | Fastest (zero-copy)                    | Fast (decompression overhead)          | Moderate (parse overhead)              |
| Readability      | Requires viewer / library              | Requires viewer / library              | Human-readable text                    |
| Metadata         | Schema-level key-value pairs           | File-level key-value pairs             | Top-level JSON fields                  |
| Ecosystem        | Polars, PyArrow, Arrow-rs              | DuckDB, Spark, pandas, BigQuery        | Any JSON parser                        |
| Best For         | Training pipelines, analysis           | Distribution, archival, cloud queries  | Editing, API, documentation            |

Format Relationship

```mermaid
graph LR
    subgraph Formats["Dataset Formats"]
        direction TB
        Studio["EdgeFirst Studio<br/>(JSON-RPC API)"]
        Client["EdgeFirst Client<br/>(Python/Rust SDK)"]
        JSON["JSON Format<br/>(Nested Structure)"]
        Arrow["Arrow IPC<br/>(Local Performance)"]
        Parquet["Parquet<br/>(Transfer/Interop)"]
    end

    Studio -->|"JSON-RPC"| Client
    Client -->|"Export"| JSON
    Client -->|"Export"| Arrow
    Client -->|"Export"| Parquet

    JSON <-->|"Unnest/Nest"| Arrow
    Arrow <-->|"Same Schema"| Parquet

    style Studio fill:#bbdefb,stroke:#1976d2,stroke-width:2px
    style Client fill:#c5e1a5,stroke:#689f38,stroke-width:2px
    style JSON fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    style Arrow fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style Parquet fill:#e1bee7,stroke:#7b1fa2,stroke-width:2px
    style Formats fill:#f5f5f5,stroke:#616161,stroke-width:3px
```

Arrow IPC

Technology: Apache Arrow IPC with Polars interface

Extension: .arrow

Arrow IPC files use zero-copy memory mapping for maximum local read performance. This is the default output format for the EdgeFirst Client SDK.

```python
import polars as pl

df = pl.read_ipc("dataset.arrow")
```

Characteristics:

  • Zero-copy memory mapping — fastest local performance
  • Efficient querying and filtering via Polars
  • Multi-language support (Python, Rust, JavaScript)
  • Schema-level metadata preserves schema_version, box format descriptors, etc.

When to use: Local ML training, data analysis, batch processing, any pipeline where read speed is critical.

Parquet

Technology: Apache Parquet columnar format

Extension: .parquet

Parquet provides ZSTD-compressed columnar storage optimized for transfer, cloud storage, and interoperability with the broader data ecosystem.

```python
import polars as pl

df = pl.read_parquet("dataset.parquet")

# Or use DuckDB for SQL queries
import duckdb
result = duckdb.sql("SELECT label, count(*) FROM 'dataset.parquet' GROUP BY label")
```

Characteristics:

  • ZSTD compression for smaller file sizes
  • Column statistics enable predicate pushdown (DuckDB, Spark)
  • Widely supported across data tooling (pandas, BigQuery, Spark, DuckDB)
  • File-level metadata preserves schema_version, box format descriptors, etc.

When to use: Distributing datasets, cloud storage, querying with DuckDB/Spark, archival, bandwidth-constrained transfers.

Parquet Configuration

EdgeFirst uses the following Parquet defaults:

| Setting        | Value                        | Rationale                                      |
|----------------|------------------------------|------------------------------------------------|
| Compression    | ZSTD (level 3)               | Best compression/speed trade-off for transfer  |
| Row group size | 64K rows or 256 MB           | Balance between random access and compression  |
| Page encoding  | Parquet v2 (data page v2)    | Better compression, widely supported           |
| Statistics     | Enabled (min/max per column) | Enables predicate pushdown in DuckDB/Spark     |

Categorical column encoding

Categorical columns (label, group) are written as Dictionary-encoded string columns in Parquet. Dictionary index ordering in Parquet is not guaranteed to match label_index values. Always use the explicit label_index column for numeric class indices.

JSON

Structure: Nested format — one object per sample with an annotations array.

Extension: .json

JSON files are human-readable and compatible with the EdgeFirst Studio JSON-RPC API.

Version detection:

  • 2025.10 (legacy): Top-level is a bare JSON array [...]
  • 2026.04: Top-level is an object with schema_version: {"schema_version": "2026.04", ...}
```json
{
  "schema_version": "2026.04",
  "box2d_format": "cxcywh",
  "box2d_normalized": true,
  "mask_interpretation": "binary",
  "samples": [
    {
      "image_name": "deer_001.camera.jpeg",
      "width": 1920,
      "height": 1080,
      "annotations": [
        {
          "label_name": "deer",
          "label_index": 0,
          "box2d": {"cx": 0.691, "cy": 0.368, "w": 0.015, "h": 0.051},
          "box2d_score": 0.97
        }
      ]
    }
  ]
}
```

When to use: Manual editing, API communication, dataset documentation, distribution to consumers who prefer text formats.

JSON field names differ from Arrow columns

Some fields have different names in JSON vs. Arrow/Parquet. See Conversion for the full mapping table.

Format Selection Decision Tree

```mermaid
graph TD
    Start["Choose a format"] --> Q1{"Primary use?"}
    Q1 -->|"Local training<br/>or analysis"| Arrow["Arrow IPC (.arrow)"]
    Q1 -->|"Transfer, cloud,<br/>or interop"| Parquet["Parquet (.parquet)"]
    Q1 -->|"Human editing<br/>or API"| JSON["JSON (.json)"]

    Arrow --> Note1["Fastest reads<br/>Zero-copy memory map<br/>Polars/PyArrow"]
    Parquet --> Note2["Smallest files<br/>ZSTD compression<br/>DuckDB/Spark/pandas"]
    JSON --> Note3["Human-readable<br/>Studio API compatible<br/>Manual editing"]

    style Arrow fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style Parquet fill:#e1bee7,stroke:#7b1fa2,stroke-width:2px
    style JSON fill:#fff9c4,stroke:#f57f17,stroke-width:2px
```