Dataset Organization
EdgeFirst datasets support three organizational patterns based on your data type. This page explains the differences and shows you how to structure your files correctly.
How Datasets Are Created
There are several ways to get data into EdgeFirst Studio:
1. From EdgeFirst Platforms (MCAP Recordings)
The primary workflow for EdgeFirst Perception users:
flowchart LR
    subgraph Device["Recording"]
        A["MCAP File"]
    end
    subgraph Studio["EdgeFirst Studio"]
        B["Snapshot<br>(ZIP + Arrow)"]
        C["Dataset"]
    end
    subgraph Local["Local"]
        D["Download<br>(ZIP + Arrow)"]
    end
    A -->|"Upload"| B
    B -->|"Restore"| C
    C -->|"Create Snapshot"| B
    B -->|"Download"| D
    D -->|"Import"| B
- MCAP files are recorded on devices and uploaded to Studio
- Snapshots are the portable format: a ZIP file (sensor data) paired with an Arrow file (annotations)
- Datasets are expanded snapshots that you can browse, annotate, and train on
- When you create a snapshot from a dataset, Studio generates the ZIP+Arrow pair for download and sharing
2. From Pre-Annotated Datasets
If you have existing annotated datasets (e.g., from COCO, custom collections, or other tools), you can convert them directly to the EdgeFirst Dataset Format:
flowchart LR
    A["Existing Dataset<br>(COCO, custom, etc.)"] -->|"Convert"| B["ZIP + Arrow"]
    B -->|"Import"| C["Snapshot"]
    C -->|"Restore"| D["Dataset"]
See Format Conversion for details on converting existing datasets.
3. Simple Videos and Images
For quick experimentation, you can also upload videos or images directly to Studio for auto-annotation, with no format conversion required. This is covered in the Capture Data tutorial.
ZIP + Arrow = EdgeFirst Dataset Format
Whether you're downloading a snapshot or sharing a dataset, the format is always the same:
- ZIP file: Contains sensor data organized by sequence/frame
- Arrow file: Contains annotations in columnar format
See Format Overview for details.
The Three Patterns
graph TB
subgraph SeqBased["Sequence-Based"]
S["Multiple sequences<br/>with temporal frames"]
end
subgraph ImgBased["Image-Based"]
I["Independent images<br/>no specific order"]
end
subgraph MixedBased["Mixed"]
M["Sequences + standalone<br/>images together"]
end
style SeqBased fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
style ImgBased fill:#ffccbc,stroke:#d84315,stroke-width:2px
style MixedBased fill:#e1bee7,stroke:#6a1b9a,stroke-width:2px
When to Use Each
| Pattern | When | Example |
|---|---|---|
| Sequence-Based | You have video recordings or MCAP files | MCAP from EdgeFirst Platform, MP4 videos |
| Image-Based | You have individual images with no order | COCO dataset, photos from mobile device |
| Mixed | You have both sequences and loose images | MCAP recordings + calibration images |
1. Sequence-Based Datasets
Use this pattern when your data comes from video recordings (MCAP files, MP4, etc.) where frames have temporal order.
Directory Structure
my_video_dataset/
├── my_video_dataset.arrow                  # Annotations
└── my_video_dataset/                       # Sensor container
    ├── hostname_date_time_001/             # First sequence
    │   ├── hostname_date_time_001_001.camera.jpeg
    │   ├── hostname_date_time_001_002.camera.jpeg
    │   ├── hostname_date_time_001_003.camera.jpeg
    │   ├── hostname_date_time_001_001.radar.pcd
    │   └── hostname_date_time_001_001.lidar.pcd
    │
    ├── hostname_date_time_002/             # Second sequence
    │   ├── hostname_date_time_002_001.camera.jpeg
    │   ├── hostname_date_time_002_002.camera.jpeg
    │   └── ...
    │
    └── hostname_date_time_003/
        └── ...
File Naming Convention
{sequence_name}_{frame_number}.{sensor}.{extension}
Where:
- `sequence_name`: Usually `hostname_date_time` (from the MCAP filename)
- `frame_number`: Sequential frame index (001, 002, 003, ...) padded with zeros
- `sensor`: Type of sensor (`camera`, `radar`, `lidar`, `depth`)
- `extension`: File format (`jpeg`, `png`, `pcd`)
Examples
9331381uhd_2025_01_15_143022_001.camera.jpeg
9331381uhd_2025_01_15_143022_002.camera.jpeg
9331381uhd_2025_01_15_143022_001.radar.pcd
9331381uhd_2025_01_15_143022_001.lidar.pcd
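The convention above can be parsed mechanically. The following is a minimal sketch; the `parse_frame_name` helper and its regex are illustrative, not part of the EdgeFirst API:

```python
import re

# Matches "{sequence_name}_{frame_number}.{sensor}.{extension}"
NAME_RE = re.compile(
    r"^(?P<sequence>.+)_(?P<frame>\d+)"
    r"\.(?P<sensor>camera|radar|lidar|depth)"
    r"\.(?P<ext>jpeg|png|pcd)$"
)

def parse_frame_name(filename):
    """Return (sequence_name, frame_number, sensor, extension), or None."""
    m = NAME_RE.match(filename)
    if m is None:
        return None  # standalone image or unrecognized name
    return m.group("sequence"), int(m.group("frame")), m.group("sensor"), m.group("ext")

print(parse_frame_name("9331381uhd_2025_01_15_143022_001.camera.jpeg"))
# → ('9331381uhd_2025_01_15_143022', 1, 'camera', 'jpeg')
```

Standalone image names (e.g. `dog_standing.png`) don't match the pattern and return `None`, which mirrors how they are treated in the annotation schema.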
Important Notes
- Frame numbers don't need to be continuous: MCAP files can be cropped or downsampled
- All frames must have matching frame numbers across sensors: if you have `frame_001.camera.jpeg`, you should also have `frame_001.radar.pcd` and `frame_001.lidar.pcd`
- Sequential ordering is preserved: frame 001 comes before frame 002 in the dataset
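The matching-frames rule lends itself to a quick sanity check. This is a sketch only; `missing_sensor_frames` is a hypothetical helper, not an EdgeFirst function:

```python
# Report camera frames that lack a counterpart from the other sensors.
def missing_sensor_frames(filenames, sensors=("camera", "radar", "lidar")):
    frames = {s: set() for s in sensors}
    for name in filenames:
        stem, _, rest = name.partition(".")      # "seq_001_001" / "camera.jpeg"
        sensor = rest.split(".")[0]
        if sensor in frames:
            frames[sensor].add(stem)
    reference = frames[sensors[0]]               # camera frames as the baseline
    return {s: sorted(reference - frames[s]) for s in sensors[1:]}

files = [
    "seq_001_001.camera.jpeg", "seq_001_002.camera.jpeg",
    "seq_001_001.radar.pcd", "seq_001_001.lidar.pcd",
]
print(missing_sensor_frames(files))
# → {'radar': ['seq_001_002'], 'lidar': ['seq_001_002']}
```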
2. Image-Based Datasets
Use this pattern when you have standalone images without temporal ordering, like datasets downloaded from COCO or photos taken with a mobile device.
Directory Structure
my_image_dataset/
├── my_image_dataset.arrow                  # Annotations
└── my_image_dataset/                       # Sensor container
    ├── image_001.jpg
    ├── image_002.jpg
    ├── image_003.png
    ├── street_scene_24.jpg
    ├── parking_lot_156.jpg
    └── ...
File Naming Convention
Any descriptive filename works:
{descriptive_name}.{extension}
Examples:
person_001.jpg
dog_standing.png
traffic_scene_morning.jpg
beach_sunset.jpg
Key Characteristics
- No `sequence_prefix` required
- No frame numbers
- Files can be in any order (annotations will have `frame: null`)
- Can mix different image sources in the same dataset
3. Mixed Datasets
Use this pattern when you have both sequences and standalone images in the same dataset.
Directory Structure
my_mixed_dataset/
├── my_mixed_dataset.arrow                  # Annotations
└── my_mixed_dataset/                       # Sensor container
    │
    ├── video_sequence_001/                 # Video sequences
    │   ├── video_sequence_001_001.camera.jpeg
    │   ├── video_sequence_001_002.camera.jpeg
    │   └── video_sequence_001_001.radar.pcd
    │
    ├── video_sequence_002/
    │   └── ...
    │
    ├── calibration_image_001.jpg           # Standalone images
    ├── reference_scene.png
    ├── test_pattern.jpg
    └── ...
Organization Strategy
- Sequences: in subdirectories (same as the sequence-based pattern)
- Images: directly in the dataset root (same as the image-based pattern)
- Mixed annotations: the Arrow file has `frame: {number}` for sequence frames and `frame: null` for standalone images
Example Use Cases
- Training set includes video sequences + manually curated reference images
- Calibration images stored alongside operational video data
- Augmented dataset combining MCAP recordings + external image sources
Understanding the Arrow File Location
The Arrow file always lives at the dataset root level:
my_dataset/
├── my_dataset.arrow                        # ← Always here
└── my_dataset/
    └── ... sensor data ...
This centralized location makes it easy to find and load annotations for any dataset structure (sequence, image, or mixed).
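Because the Arrow file shares the dataset's name and always sits at the root, its path can be derived from the root alone. A minimal sketch (the `annotation_path` helper is hypothetical):

```python
from pathlib import Path

# The Arrow file is named after the dataset and lives at the root,
# next to the sensor-data directory of the same name.
def annotation_path(dataset_root):
    root = Path(dataset_root)
    return root / f"{root.name}.arrow"

print(annotation_path("my_dataset").as_posix())  # → my_dataset/my_dataset.arrow
```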
File Container Options
Sensor data can be stored in two ways:
1. Directory (Recommended for Development)
my_dataset/
├── my_dataset.arrow
└── my_dataset/                             # Regular directory
    ├── sequence_001/
    │   └── ...
    └── image_001.jpg
Pros: Easy to add/remove files, no extraction needed
Cons: Harder to transfer and to manage permissions for
2. ZIP File (Recommended for Distribution)
my_dataset.zip                              # Contains everything inside
├── my_dataset/
│   ├── sequence_001/
│   │   └── ...
│   └── image_001.jpg
└── my_dataset.arrow
Pros: Single file, easy to transfer, automatic compression
Cons: Need to extract to access files
Both formats work identically; EdgeFirst automatically handles both.
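The "directory or ZIP" equivalence can be handled with the standard library alone. This is a hedged sketch of the idea, not how EdgeFirst itself loads data; `list_dataset_files` is a hypothetical helper:

```python
import zipfile
from pathlib import Path

# List sensor files whether the dataset is a plain directory
# or a ZIP archive -- both layouts hold the same content.
def list_dataset_files(path):
    p = Path(path)
    if p.is_dir():
        return sorted(str(f.relative_to(p)) for f in p.rglob("*") if f.is_file())
    with zipfile.ZipFile(p) as zf:
        return sorted(n for n in zf.namelist() if not n.endswith("/"))
```

Callers never need to know which container they were given, which is what lets the two formats stay interchangeable.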
Data Flow Example
Here's how data flows from MCAP to your final dataset:
graph TD
    MCAP["MCAP File - hostname_date_time.mcap"]
    MCAP -->|"Convert & Extract Frames"| Frames["Sequence Folder - hostname_date_time/"]
    Frames -->|"Create Arrow annotations"| Arrow["Arrow File - dataset.arrow"]
    Frames --> FinalDir["Final Dataset"]
    Arrow --> FinalDir
    FinalDir -->|"Optional:"| ZIP["ZIP File - for sharing"]
style MCAP fill:#fff9c4,stroke:#f57f17,stroke-width:2px
style Frames fill:#bbdefb,stroke:#1976d2,stroke-width:2px
style Arrow fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
style FinalDir fill:#e1f5ff,stroke:#0277bd,stroke-width:3px
style ZIP fill:#f8bbd0,stroke:#c2185b,stroke-width:2px
Naming Best Practices
For Sequences
# Good: Follows MCAP convention
hostname_date_time_001
system_2025_01_15_143022
# Avoid: Ambiguous
sequence_1
video_1
data
For Standalone Images
# Good: Descriptive
person_walking_001
street_intersection_morning_02
reference_calibration
# Okay: Generic but clear
image_001
photo_156
# Avoid: Unclear
img1
pic
data_new
Checking Your Dataset Structure
You can verify your dataset is organized correctly:
import polars as pl
import os
# Load annotations
df = pl.read_ipc("path/to/dataset.arrow")
# Check structure
print(f"Total annotations: {len(df)}")
print(f"Unique samples: {df['name'].n_unique()}")
print(f"Sequences: {df.filter(pl.col('frame').is_not_null())['name'].n_unique()}")
print(f"Images: {df.filter(pl.col('frame').is_null())['name'].n_unique()}")
print(f"Splits: {df['group'].unique().to_list()}")
# List all sensor files
sensor_dir = "path/to/dataset/dataset"
for root, dirs, files in os.walk(sensor_dir):
    for file in files:
        if file.endswith(('.jpeg', '.jpg', '.png', '.pcd')):
            print(f"  {file}")
Further Reading
- Annotation Schema β Understand what data is in your Arrow file
- Bounding Box Formats β Learn coordinate systems and conversions
- Sensor Data β Details on camera, radar, and LiDAR formats
- Snapshots Dashboard β Download and restore snapshots in Studio
- Publishing Workflows β Upload MCAP recordings as snapshots