Automatic Ground Truth Generation (AGTG)

This page describes the Automatic Ground Truth Generation (AGTG) capabilities available in EdgeFirst Studio, covering the context, navigation, and features of the AGTG. For instructions on running the AGTG workflow, please see this tutorial.

The AGTG populates annotations across a dataset with minimal human interaction. The AGTG pipeline drives SAM-2 to auto-segment objects throughout the frames. The segmentation masks generated by SAM-2 are then processed to formulate 2D and 3D (from LiDAR or Radar PCDs) bounding boxes to complete each object's annotations as described in the EdgeFirst Dataset Format. There are two modes of operation.

  1. Fully Automatic: This is invoked at the time of importing the dataset and runs as a background process that deploys a detection model to drive SAM-2.
  2. Semi-Automatic: This is invoked when users trigger the AI-assisted annotations in the dataset gallery. Users can select portions of the dataset to auto-annotate, but SAM-2 requires initial annotations from the user as prompts.

A complete annotation set has both 2D and 3D annotations, as shown in the sample snapshot from the AGTG process below. The 2D annotations, shown on the left, are pixel-based bounding boxes and masks on the image. The 3D annotations, shown on the right, are world-based 3D boxes surrounding the object in meters.

Sample Annotations
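As a rough mental model only (field names here are hypothetical and not the actual EdgeFirst Dataset Format, which is documented separately), the two annotation types carry different coordinate systems:

```python
from dataclasses import dataclass, field


@dataclass
class Annotation2D:
    """Image-based annotation in pixel coordinates (hypothetical fields)."""
    label: str
    bbox: tuple[int, int, int, int]                  # [x0, y0, x1, y1] in pixels
    polygons: list[list[tuple[int, int]]] = field(default_factory=list)  # mask contours


@dataclass
class Annotation3D:
    """Spatial annotation in world coordinates (hypothetical fields)."""
    label: str
    center: tuple[float, float, float]               # (x, y, z) in meters
    size: tuple[float, float, float]                 # width, depth, height in meters
```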

Fully Automatic Ground Truth Generation

This functionality is available at the time of restoring a snapshot. To invoke it, import or create a snapshot and then enable AGTG while restoring the snapshot. All the details for this type of AGTG are available under the Snapshot Dashboard. The tutorial for this workflow is found under Dataset Annotations.

Semi-Automatic Ground Truth Generation

This functionality is available after importing the dataset into EdgeFirst Studio. This type of AGTG requires user annotations in the starting frame to give SAM-2 context on which objects to annotate throughout the rest of the frames. This section describes this type of AGTG; the tutorial for this workflow is found under Dataset Annotations.

This AGTG feature can be found in the dataset gallery. The dataset gallery can contain sequences or images, which are distinguished by the presence of the sequence icon on the image card as shown below.

Dataset Sequence

The SAM-2 propagation step is only available for sequences. Images can still be annotated using SAM-2, but each image must be annotated individually, which requires more effort than sequences, as shown in Manual Annotations.

There are three steps involved in this process: Initialize AGTG Server, Annotate Starting Frame, Propagate. This AGTG feature can be invoked by clicking on the "AI Segment Tool" button inside the dataset gallery as shown below.

Select the AI Segment Tool

Clicking on this feature will prompt you to start an AGTG server. This is a cloud-based server that hosts SAM-2 and the AGTG backend. As indicated, the server takes some time (~3 minutes) to initialize, and once initialized, it will automatically terminate after 15 minutes of inactivity. This is a safety mechanism to prevent excessive usage of the credits available in your account. As a precaution, ensure that all unused servers are terminated to prevent unnecessary server costs.

Launch AGTG Server

Once the AGTG server has been initialized, you can proceed to the next step, which is to annotate the starting frame in the sequence. This brings up the extended sidebar, which contains the AGTG features for controlling the annotation process. This step is necessary to give SAM-2 context on which objects to track in the current frame.

Below is a detailed breakdown of the sidebar.

AGTG Sidebar
AGTG Object Card

An object is a single annotation, or a single instance of an object in the image. The number of object cards should equal the number of objects in the image. For each object, annotate by either drawing a bounding box (click and drag) around the object or placing markers (mouse clicks) to specify the region that contains the object. To draw a bounding box, click anywhere on the frame and then drag the mouse to expand the box. The bounding box should cover the object to annotate in the frame.

Markers Boxes
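For context, the sketch below shows roughly how a drawn bounding box becomes a SAM-2 prompt, using the publicly released sam2 package rather than Studio's hosted backend. The config name, checkpoint path, frames directory, and coordinates are assumptions for illustration, and exact function names may vary between sam2 versions.

```python
# Minimal sketch using the publicly released sam2 package (facebookresearch/sam2),
# not Studio's hosted backend. Config name, checkpoint, frames directory, and the
# box coordinates are assumptions for illustration.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",    # assumed config name
    "checkpoints/sam2.1_hiera_large.pt",     # assumed checkpoint path
)

with torch.inference_mode():
    # Directory of frames for the sequence being annotated (assumed layout).
    state = predictor.init_state(video_path="frames/")

    # A click-and-drag bounding box becomes a [x0, y0, x1, y1] pixel prompt
    # attached to the starting frame and the first object card.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        box=np.array([120, 80, 340, 290], dtype=np.float32),
    )
```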

Multiple Objects

To add subsequent objects, press the "+" button beside "Select Objects". The object class (label) should also be selected from the object label drop-down.

The figure below shows multiple instances of a coffee cup annotated using the AI-assisted annotations as described above.

AGTG Initial Prompts
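Continuing the assumed sketch above, each additional object card corresponds to a new object ID passed to SAM-2, here prompted with marker clicks instead of a box (coordinates are hypothetical):

```python
# Second object card: prompted with marker clicks only, under a new object ID.
# Coordinates are hypothetical; label 1 marks a foreground click, 0 a background click.
predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=2,
    points=np.array([[420, 260], [455, 300]], dtype=np.float32),
    labels=np.array([1, 1], dtype=np.int32),
)
```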

Once the current frame has been annotated, you can move on to the last step, which is to propagate. To propagate (track) the selected objects from the current frame to subsequent frames, select the ending frames. Please note that the starting frame is fixed to the current frame. Click "Reverse Propagate" if you need to track objects from the current frame to previous frames.

Click the PROPAGATE button for SAM-2 to start tracking and annotating the objects throughout the frames. During propagation, the frame and the counter will update as shown below. Optionally, you can stop the propagation by clicking the "Stop Propagation" button.

Propagation Process
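Under the hood, propagation is SAM-2's video tracking loop. Continuing the assumed sketch above, forward and reverse propagation could look roughly like this (the reverse flag and exact signature may differ between sam2 versions):

```python
# Forward propagation from the starting frame; reverse=True would track from the
# current frame back toward earlier frames instead.
with torch.inference_mode():
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state, reverse=False):
        # Threshold the logits to obtain one binary mask per tracked object.
        masks = (mask_logits > 0.0).cpu().numpy()
        for obj_id, mask in zip(obj_ids, masks):
            # mask has shape (1, H, W); downstream steps convert it into polygons
            # and a 2D bounding box for this object in this frame.
            pass
```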

Once the propagation completes, click "SAVE ANNOTATIONS" to save the edited, deleted, or created annotations for this image. Otherwise, moving to another image or page will discard the changes. A completed propagation shows the 2D annotations with masks and 2D bounding boxes for each object across the video frames. If LiDAR or Radar readings are available in the dataset, the 3D annotations will also be generated.

Tip

For cases where the object exits and then re-enters the frame, the object might not be tracked properly. Repeat the steps as necessary to annotate the objects that were missed.

The AGTG Algorithm

The AGTG algorithm is designed to generate ground truth annotations for a sequential dataset. The annotations are categorized as image-based 2D annotations and spatial-based 3D annotations. Image-based annotations are 2D bounding boxes and segmentation masks marking the objects in an image in pixel coordinates. Spatial-based annotations are 3D bounding boxes around objects in world coordinates such as meters. More information regarding these annotation types is found here.

Image-based annotations are generated using Vision models. A YOLOx ONNX model is used to generate 2D bounding boxes. The large SAM-2 PyTorch model is used to generate 2D segmentation masks by taking advantage of the model’s object tracking and propagation capabilities.
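As a rough illustration of the detection stage only, a YOLOx-style ONNX model can be run with onnxruntime. The model filename, input resolution, and simplified preprocessing below are assumptions; the exact model and post-processing used by Studio are not exposed here.

```python
# Rough sketch of running a YOLOx-style ONNX detector with onnxruntime.
# Model filename, 640x640 input size, and preprocessing are assumptions; real
# YOLOx preprocessing (letterboxing) and output decoding/NMS are omitted.
import cv2
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("yolox_coco.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

def detect(image_bgr: np.ndarray) -> np.ndarray:
    """Return the model's raw output tensor for one image (decoding omitted)."""
    resized = cv2.resize(image_bgr, (640, 640)).astype(np.float32)
    blob = resized.transpose(2, 0, 1)[None, ...]          # HWC -> NCHW, batch of 1
    outputs = session.run(None, {input_name: blob})
    return outputs[0]  # post-processing would yield [x0, y0, x1, y1, score, class] rows
```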

Spatial-based annotations are generated using sensor information (LiDAR and Radar point clouds) together with model predictions such as the 2D annotations (bounding boxes and masks) and depth estimations from a large ONNX depth model from Metrics3D. These annotations will be generated for recordings captured using the Raivin with either a Radar or a LiDAR module attached.

The AGTG Algorithm can be visualized using the following flow chart.

%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%
flowchart TB
    %% Definitions
    lidar{Has LiDAR PCDs?}
    yolox[YOLOx COCO Detections]
    propagation_type[Set SAM-2 Propagation Direction]
    propagation[SAM-2 Propagation]
    2D[SAM-2 Masks to 2D Bounding Boxes]

    filter[Mask and LiDAR PCD Filter]
    lidar_cluster[LiDAR PCD DBSCAN Clustering]
    specify_cluster[PCD Cluster Specification]
    3D[3D Box Formulation]

    depth_map[Depth Map Localization]
    projection[2D Annotations to 3D Projections]
    radar_cluster[Radar PCD DBSCAN Clustering ]

    %% Flowchart
    yolox --> propagation_type--> propagation --> 2D --> lidar
    lidar -- Yes --> filter --> lidar_cluster --> specify_cluster --> 3D
    lidar -- No --> radar_cluster --> projection --> depth_map --> 3D

The logic starts with object detection using the YOLOx model. These detections drive the SAM-2 propagation, which relies on the input frames and bounding box prompts around the objects in these frames. The segmentation masks are then converted into arrays of polygons, and a 2D bounding box is formulated for each mask, completing the set of 2D annotations.
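A minimal sketch of this mask post-processing step, converting a binary SAM-2 mask into polygon contours and a 2D pixel bounding box (using OpenCV here; the actual implementation may differ):

```python
import cv2
import numpy as np

def mask_to_2d_annotation(mask: np.ndarray):
    """Convert a binary (H, W) mask into polygon contours and a 2D pixel bounding box."""
    contours, _ = cv2.findContours(
        (mask > 0).astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    polygons = [c.reshape(-1, 2) for c in contours if len(c) >= 3]

    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return polygons, None
    bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))  # [x0, y0, x1, y1]
    return polygons, bbox
```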

The spatial-based annotations are formulated from one of the following sensor and 2D annotation combinations.

  1. Radar Point Clouds + Depth Map Estimations + Segmentation Masks
  2. LiDAR Point Clouds + Segmentation Masks

For Raivin recordings without LiDAR PCDs, the first combination is applied to formulate 3D bounding box annotations using Radar PCDs instead. This process applies a DBSCAN clustering algorithm to the Radar PCDs to cluster groups of points belonging to a single object. Next, the process finds the Radar PCD cluster that corresponds to the object segmented in the image. The logic here is to project the 2D bounding box into world coordinates, take the depth estimations enclosed by the segmentation mask to estimate the object's position, and compute the distance from those estimated coordinates to each cluster's centroid. The cluster with the smallest distance from the estimated coordinates is the PCD cluster that represents the object. The 3D bounding box is formulated by taking the x, y, z center coordinates (centroid) of the cluster, while the depth, width, and height of the bounding box are taken from both the 2D image projections and the dimensions of the cluster.
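A simplified sketch of this Radar branch, assuming the points are already in a common world frame: cluster the Radar PCD with DBSCAN, then pick the cluster whose centroid is nearest to the position estimated from the 2D projection and masked depth values. The DBSCAN parameters and helper inputs are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def nearest_radar_cluster(radar_points: np.ndarray, estimated_xyz: np.ndarray):
    """radar_points: (N, 3) world-frame Radar PCD; estimated_xyz: (3,) position
    derived from the projected 2D box and the depth values under the mask."""
    labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(radar_points)  # assumed parameters

    best_cluster, best_dist = None, np.inf
    for label in set(labels) - {-1}:                       # -1 marks DBSCAN noise
        cluster = radar_points[labels == label]
        dist = np.linalg.norm(cluster.mean(axis=0) - estimated_xyz)
        if dist < best_dist:
            best_cluster, best_dist = cluster, dist

    if best_cluster is None:
        return None
    centroid = best_cluster.mean(axis=0)                   # 3D box center (x, y, z)
    extents = best_cluster.max(axis=0) - best_cluster.min(axis=0)  # rough cluster dimensions
    return centroid, extents
```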

For Raivin recordings with the LiDAR module, the second combination is applied to formulate the 3D bounding box annotations. Since the LiDAR PCDs have higher resolution than Radar, no clustering needs to be applied to the PCDs at the start. The process starts by filtering the LiDAR PCDs to keep only those that intersect with the segmentation mask. However, because the PCDs that intersect with the mask are not always guaranteed to belong to the object, a DBSCAN clustering algorithm is applied to the set of filtered PCDs. The cluster with the highest number of points is taken as the cluster that represents the object segmented in the image. Finally, the 3D bounding box is formulated by taking the x, y, z center coordinates (centroid) of the cluster and the depth, width, and height dimensions of the cluster.
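A simplified sketch of this LiDAR branch: keep only the LiDAR points whose image projections fall inside the segmentation mask, cluster them with DBSCAN, take the most populated cluster as the object, and build the 3D box from its centroid and extents. The projection function and DBSCAN parameters are placeholders for the calibrated pipeline.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def lidar_box_from_mask(lidar_points: np.ndarray, mask: np.ndarray, project_to_image):
    """lidar_points: (N, 3) world-frame PCD; mask: (H, W) binary mask;
    project_to_image: calibrated projection returning (N, 2) pixel coordinates."""
    uv = np.round(project_to_image(lidar_points)).astype(int)
    h, w = mask.shape
    in_image = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    hits = mask[uv[in_image, 1], uv[in_image, 0]] > 0      # points landing on the mask
    candidates = lidar_points[in_image][hits]
    if candidates.shape[0] == 0:
        return None

    # Points under the mask can still belong to the background, so cluster and
    # keep the most populated cluster as the object.
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(candidates)  # assumed parameters
    valid = labels[labels != -1]
    if valid.size == 0:
        return None
    cluster = candidates[labels == np.bincount(valid).argmax()]
    centroid = cluster.mean(axis=0)
    extents = cluster.max(axis=0) - cluster.min(axis=0)    # width, depth, height
    return centroid, extents
```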

Next Steps

Now that you have been introduced to the auto-annotation features in EdgeFirst Studio, proceed to the Datasets section to learn more about managing your own datasets by following the capture and annotation workflows.