
Dataset Import

Import annotated or raw data into EdgeFirst Studio using edgefirst-client. Choose the pathway that matches your source format.

flowchart TD
  start[Import data into Studio]
  start --> cocoPath{Source is COCO or LVIS JSON?}
  start --> efPath{Source is EdgeFirst ZIP/Arrow?}
  start --> customPath{Custom source e.g. TFDS?}
  cocoPath -->|Yes| nativeCoco["CLI: import-coco / coco-to-arrow"]
  efPath -->|Yes| efFormat["CLI: upload-dataset / create-snapshot"]
  customPath -->|Yes| pythonApi["Python API: populate_samples workflow"]
  nativeCoco --> studio[EdgeFirst Studio dataset]
  efFormat --> studio
  pythonApi --> studio

For Darknet/YOLO imports through the Studio web UI, see the dataset import tutorial.

Native COCO and LVIS support

edgefirst-client includes built-in COCO interchange commands. LVIS v1 annotations go through the same pipeline: coco-to-arrow accepts LVIS JSON (including filenames derived from coco_url) and preserves the LVIS-specific columns documented in the format schema.

Command         Purpose
import-coco     Upload COCO annotations and images directly into Studio
export-coco     Export a Studio dataset to COCO JSON or ZIP
coco-to-arrow   Convert COCO/LVIS JSON to EdgeFirst Arrow
arrow-to-coco   Convert EdgeFirst Arrow to COCO JSON

import-coco

Import an extracted COCO directory or annotation JSON file. ZIP archives are not supported; extract the images and annotations first.
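
Since import-coco expects an extracted directory, a downloaded archive has to be unpacked before the import. The following is a minimal sketch using only the Python standard library; the function name extract_coco_archive and the directory layout are illustrative assumptions, not part of edgefirst-client.

```python
import zipfile
from pathlib import Path


def extract_coco_archive(archive: Path, dest: Path) -> list[Path]:
    """Unpack a COCO ZIP into dest and return the annotation JSON files found."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    # Annotation files are JSON; search recursively because archive layouts vary.
    return sorted(dest.rglob("*.json"))
```

Either the destination directory or one of the returned JSON paths can then be passed to import-coco, as in the commands that follow.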

# Create a new dataset in a project
edgefirst-client import-coco ./coco --project p-123 --name "COCO 2017"

# Import into an existing dataset and annotation set
edgefirst-client import-coco ./coco --dataset ds-123 --annotation-set as-456

# Bounding boxes only (no segmentation masks)
edgefirst-client import-coco ./coco/annotations/instances_train2017.json \
    --dataset ds-123 --annotation-set as-456 --masks=false

Group assignment

Standard COCO JSON references images by bare filename (e.g. 000000397133.jpg) with no detectable train/val group. To assign splits, convert with coco-to-arrow --group train and upload with upload-dataset instead.
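
When a dataset ships one annotation file per split, the conversion can be scripted so that each file is tagged with its group. This sketch only builds the command lines from the flags shown on this page (coco-to-arrow, -o, --group); the file names are hypothetical, and actually running the commands requires edgefirst-client on the PATH.

```python
import shlex


def coco_to_arrow_command(json_path: str, arrow_path: str, group: str) -> list[str]:
    """Build the argv for one coco-to-arrow conversion tagged with a group."""
    return [
        "edgefirst-client", "coco-to-arrow", json_path,
        "-o", arrow_path, "--group", group,
    ]


# Hypothetical file names: one COCO annotation file per split.
splits = {
    "train": "instances_train2017.json",
    "val": "instances_val2017.json",
}
commands = [
    coco_to_arrow_command(path, f"{group}.arrow", group)
    for group, path in splits.items()
]
for argv in commands:
    # To execute instead of printing: subprocess.run(argv, check=True)
    print(shlex.join(argv))
```

Building argv lists rather than shell strings avoids quoting problems when paths contain spaces.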

See the CLI reference for --verify, --update, and batch options.

coco-to-arrow and arrow-to-coco

Convert between COCO/LVIS and EdgeFirst Dataset Format without uploading:

# COCO/LVIS JSON to Arrow (preserves category_id and object_id)
edgefirst-client coco-to-arrow instances.json -o dataset.arrow --group train

# Arrow back to COCO JSON
edgefirst-client arrow-to-coco dataset.arrow -o instances.json --groups train,val

LVIS taxonomies with more than 255 categories are supported (Arrow label_index uses U16).
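
As a quick sanity check before converting a large taxonomy, the category count can be compared against the U16 range. This is a sketch under the assumption that one index is needed per entry in the COCO/LVIS "categories" list; the bound of 65,535 is deliberately conservative so it holds whether indices start at 0 or 1. The helper name is illustrative.

```python
U16_MAX = 2**16 - 1  # largest value an unsigned 16-bit integer can hold (65535)


def category_count_fits_u16(coco: dict) -> bool:
    """Return True if every category in a COCO/LVIS dict can get a U16 index."""
    return len(coco.get("categories", [])) <= U16_MAX


# Synthetic taxonomy with more than 255 categories (too large for a U8 index).
coco = {"categories": [{"id": i, "name": f"class_{i}"} for i in range(1, 1204)]}
fits = category_count_fits_u16(coco)
```

In practice the dict would come from json.load on the annotation file.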

export-coco

Download a Studio dataset as COCO:

edgefirst-client export-coco ds-123 as-456 -o instances.json
edgefirst-client export-coco ds-123 as-456 -o coco.zip --images --groups train,val

When restoring MCAP snapshots with auto-annotation, --autolabel accepts COCO labels.

EdgeFirst Dataset Format

Import data natively in the EdgeFirst Dataset Format (ZIP + Arrow pairs).

upload-dataset

# Images only
edgefirst-client upload-dataset ds-123 --images ./photos/

# Arrow annotations with auto-discovered images
edgefirst-client upload-dataset ds-123 \
    --annotations dataset.arrow \
    --annotation-set-id as-456

Arrow files must conform to the current schema (2026.04). Use edgefirst-client migrate to upgrade legacy 2025.10 Arrow files (see the migration guide).

Snapshots

Upload a local directory or MCAP file as a snapshot, then restore into a project:

edgefirst-client create-snapshot ./sensor_data/
edgefirst-client restore-snapshot p-123 ss-abc --dataset-name "Imported" --monitor

Preparation utilities:

edgefirst-client generate-arrow ./images --output dataset.arrow
edgefirst-client migrate dataset.arrow --output dataset-2026.arrow
edgefirst-client validate-snapshot ./my_dataset

Arrow schema version

generate-arrow produces a 2025.10 Arrow file. Before upload-dataset, migrate to 2026.04 with edgefirst-client migrate (see the migration guide).

See CLI: MCAP snapshot workflow and Studio Snapshots.

Custom imports via Python API

For sources without a native CLI importer (TensorFlow Datasets, Hugging Face, proprietary formats), use the Python API to transform data into Studio samples programmatically.

General workflow:

  1. Authenticate: Client() reuses the CLI token (Tutorial 1)
  2. Create or target a dataset: create_dataset(project_id, name, description)
  3. Create an annotation set: create_annotation_set(dataset_id, name, description)
  4. Define labels: add_label / add_labels, with explicit indices if needed (Tutorial 7)
  5. Upload samples: populate_samples(dataset_id, annotation_set_id, samples, progress=...)

Tutorial 6: Create annotations demonstrates the minimal write path: create a sandbox dataset, build Sample objects with Annotation and Box2d, and call populate_samples.

Typical custom pipeline:

flowchart LR
  source[External source] --> transform[Your converter]
  transform --> samples[List of Sample objects]
  samples --> populate[populate_samples]
  populate --> studio[Studio dataset]
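
The shape of that pipeline can be sketched without the Studio client by injecting the upload step as a callable. Everything here is an assumption for illustration: SourceRecord and ConvertedSample are hypothetical stand-ins (a real converter would build Sample objects with Annotation and Box2d as in Tutorial 6, whose constructors are not described on this page), and the box is carried through unchanged rather than converted to any particular coordinate convention.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class SourceRecord:
    """Hypothetical record from an external source (not an edgefirst-client type)."""
    image_path: str
    label: str
    box_xywh: Tuple[float, float, float, float]


@dataclass
class ConvertedSample:
    """Stand-in for the Sample object a real converter would build."""
    image_path: str
    annotations: List[dict]


def convert(record: SourceRecord) -> ConvertedSample:
    """Transform one external record into the upload representation."""
    return ConvertedSample(
        image_path=record.image_path,
        annotations=[{"label": record.label, "box": record.box_xywh}],
    )


def run_import(
    records: Iterable[SourceRecord],
    upload: Callable[[List[ConvertedSample]], None],
) -> int:
    """Convert every record, hand the list to upload, and return the count."""
    samples = [convert(r) for r in records]
    upload(samples)
    return len(samples)
```

In a real import, upload would wrap the populate_samples(dataset_id, annotation_set_id, samples, progress=...) call named in the workflow. Keeping it as a parameter lets the converter be tested locally before anything is sent to Studio.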

Future examples

Practical tutorials for importing from TensorFlow Datasets, Hugging Face Datasets, and similar sources are planned. Until then, use Tutorial 6 as the reference implementation for the upload API.

See also