Storage Formats
EdgeFirst 2026.04 supports three annotation storage formats. All three express the same logical schema — the choice is a deployment / transfer concern, not a data model concern.
Format Tiers
| | Arrow IPC | Parquet | JSON |
|---|---|---|---|
| Extension | .arrow | .parquet | .json |
| Use Case | Local ML training, fast random access | Transfer, cloud storage, interop | Human-readable, API exchange |
| Structure | Flat columnar (one row per annotation) | Flat columnar (one row per annotation) | Nested (sample with annotations array) |
| Compression | None (memory-mapped) | ZSTD columnar compression | None (text) |
| File Size | Medium | Smallest | Largest |
| Read Performance | Fastest (zero-copy) | Fast (decompression overhead) | Moderate (parse overhead) |
| Readability | Requires viewer / library | Requires viewer / library | Human-readable text |
| Metadata | Schema-level key-value pairs | File-level key-value pairs | Top-level JSON fields |
| Ecosystem | Polars, PyArrow, Arrow-rs | DuckDB, Spark, pandas, BigQuery | Any JSON parser |
| Best For | Training pipelines, analysis | Distribution, archival, cloud queries | Editing, API, documentation |
Format Relationship
```mermaid
graph LR
    subgraph Formats["Dataset Formats"]
        direction TB
        Studio["EdgeFirst Studio<br/>(JSON-RPC API)"]
        Client["EdgeFirst Client<br/>(Python/Rust SDK)"]
        JSON["JSON Format<br/>(Nested Structure)"]
        Arrow["Arrow IPC<br/>(Local Performance)"]
        Parquet["Parquet<br/>(Transfer/Interop)"]
    end
    Studio -->|"JSON-RPC"| Client
    Client -->|"Export"| JSON
    Client -->|"Export"| Arrow
    Client -->|"Export"| Parquet
    JSON <-->|"Unnest/Nest"| Arrow
    Arrow <-->|"Same Schema"| Parquet
    style Studio fill:#bbdefb,stroke:#1976d2,stroke-width:2px
    style Client fill:#c5e1a5,stroke:#689f38,stroke-width:2px
    style JSON fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    style Arrow fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style Parquet fill:#e1bee7,stroke:#7b1fa2,stroke-width:2px
    style Formats fill:#f5f5f5,stroke:#616161,stroke-width:3px
```
Arrow IPC
Technology: Apache Arrow IPC with Polars interface
Extension: .arrow
Arrow IPC files use zero-copy memory mapping for maximum local read performance. This is the default output format for the EdgeFirst Client SDK.
```python
import polars as pl

df = pl.read_ipc("dataset.arrow")
```
Characteristics:
- Zero-copy memory mapping — fastest local performance
- Efficient querying and filtering via Polars
- Multi-language support (Python, Rust, JavaScript)
- Schema-level metadata preserves `schema_version`, box format descriptors, etc.
When to use: Local ML training, data analysis, batch processing, any pipeline where read speed is critical.
Parquet
Technology: Apache Parquet columnar format
Extension: .parquet
Parquet provides ZSTD-compressed columnar storage optimized for transfer, cloud storage, and interoperability with the broader data ecosystem.
```python
import polars as pl

df = pl.read_parquet("dataset.parquet")

# Or use DuckDB for SQL queries
import duckdb

result = duckdb.sql("SELECT label, count(*) FROM 'dataset.parquet' GROUP BY label")
```
Characteristics:
- ZSTD compression for smaller file sizes
- Column statistics enable predicate pushdown (DuckDB, Spark)
- Widely supported across data tooling (pandas, BigQuery, Spark, DuckDB)
- File-level metadata preserves `schema_version`, box format descriptors, etc.
When to use: Distributing datasets, cloud storage, querying with DuckDB/Spark, archival, bandwidth-constrained transfers.
Parquet Configuration
EdgeFirst uses the following Parquet defaults:
| Setting | Value | Rationale |
|---|---|---|
| Compression | ZSTD (level 3) | Best compression/speed trade-off for transfer |
| Row group size | 64K rows or 256 MB | Balance between random access and compression |
| Page encoding | Parquet v2 (data page v2) | Better compression, widely supported |
| Statistics | Enabled (min/max per column) | Enables predicate pushdown in DuckDB/Spark |
Categorical column encoding
Categorical columns (`label`, `group`) are written as dictionary-encoded string columns in Parquet. Dictionary index ordering in Parquet is not guaranteed to match `label_index` values. Always use the explicit `label_index` column for numeric class indices.
JSON
Structure: Nested format — one object per sample with an annotations array.
Extension: .json
JSON files are human-readable and compatible with the EdgeFirst Studio JSON-RPC API.
Version detection:
- 2025.10 (legacy): the top level is a bare JSON array (`[...]`)
- 2026.04: the top level is an object with a `schema_version` field (`{"schema_version": "2026.04", ...}`)
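The check above is a one-liner in practice. A sketch (the function name is ours, not an SDK API):

```python
import json

def detect_schema_version(text: str) -> str:
    """Apply the documented version check to a JSON document string."""
    doc = json.loads(text)
    if isinstance(doc, list):   # legacy 2025.10: bare top-level array
        return "2025.10"
    return doc.get("schema_version", "unknown")

print(detect_schema_version("[]"))                             # → 2025.10
print(detect_schema_version('{"schema_version": "2026.04"}'))  # → 2026.04
```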
```json
{
  "schema_version": "2026.04",
  "box2d_format": "cxcywh",
  "box2d_normalized": true,
  "mask_interpretation": "binary",
  "samples": [
    {
      "image_name": "deer_001.camera.jpeg",
      "width": 1920,
      "height": 1080,
      "annotations": [
        {
          "label_name": "deer",
          "label_index": 0,
          "box2d": {"cx": 0.691, "cy": 0.368, "w": 0.015, "h": 0.051},
          "box2d_score": 0.97
        }
      ]
    }
  ]
}
```
When to use: Manual editing, API communication, dataset documentation, distribution to consumers who prefer text formats.
JSON field names differ from Arrow columns
Some fields have different names in JSON vs. Arrow/Parquet. See Conversion for the full mapping table.
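The "Unnest/Nest" relationship between the nested JSON layout and the flat columnar layouts can be sketched in a few lines of plain Python. This is illustrative only; the SDK's real converter also renames fields per the mapping table on the Conversion page:

```python
import json

# A minimal nested document in the 2026.04 shape.
doc = json.loads("""
{
  "schema_version": "2026.04",
  "samples": [
    {"image_name": "deer_001.camera.jpeg",
     "annotations": [{"label_name": "deer", "label_index": 0, "box2d_score": 0.97}]}
  ]
}
""")

# Unnest: one flat row per annotation, carrying sample fields along.
rows = [
    {"image_name": sample["image_name"], **ann}
    for sample in doc["samples"]
    for ann in sample["annotations"]
]
print(rows)
```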
Format Selection Decision Tree
```mermaid
graph TD
    Start["Choose a format"] --> Q1{"Primary use?"}
    Q1 -->|"Local training<br/>or analysis"| Arrow["Arrow IPC (.arrow)"]
    Q1 -->|"Transfer, cloud,<br/>or interop"| Parquet["Parquet (.parquet)"]
    Q1 -->|"Human editing<br/>or API"| JSON["JSON (.json)"]
    Arrow --> Note1["Fastest reads<br/>Zero-copy memory map<br/>Polars/PyArrow"]
    Parquet --> Note2["Smallest files<br/>ZSTD compression<br/>DuckDB/Spark/pandas"]
    JSON --> Note3["Human-readable<br/>Studio API compatible<br/>Manual editing"]
    style Arrow fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style Parquet fill:#e1bee7,stroke:#7b1fa2,stroke-width:2px
    style JSON fill:#fff9c4,stroke:#f57f17,stroke-width:2px
```