Deploying Models on the Target
This section demonstrates how to deploy a quantized ONNX model from the ONNX Model Zoo and a quantized TFLite model exported from Ultralytics on the NPU of the i.MX 8M Plus EVK.
Deploying Quantized ONNX
In this example, we use a pretrained quantized ONNX model from the ONNX Model Zoo; specifically, SSD-MobilenetV1-12-int8 was downloaded.
ONNX Deployment
When deploying ONNX models on target, it is recommended to quantize the model to run on the NPU using the providers ['NnapiExecutionProvider', 'VsiNpuExecutionProvider'], or to convert it to FP16 to run on the GPU using the provider ['CUDAExecutionProvider'].
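As a reference, the sketch below shows one way these conversions could be done on a host machine before copying the model to the target. It is not part of the provided example: it uses onnxruntime.quantization for int8 quantization and the onnxconverter-common package for the FP16 conversion, and the file names are placeholders for your own model. For NPU deployment, static quantization with a calibration dataset (quantize_static) generally gives better accuracy than the dynamic quantization shown here.

import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType
from onnxconverter_common import float16

# Dynamic int8 quantization of the weights (placeholder file names).
quantize_dynamic("model_fp32.onnx", "model_int8.onnx", weight_type=QuantType.QUInt8)

# FP16 conversion for GPU deployment.
model = onnx.load("model_fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")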
Download our Python Script for running the example.
Lastly, you can try this sample image 000000000064.jpg taken from COCO128.

Once the files have been downloaded, copy them onto the embedded platform with scp.
Run the script with the command python3 run-onnx.py. The script should print the inference time in milliseconds and the model detections as follows:
2025-08-21 00:02:49.820540730 [W:onnxruntime:, transformer_memcpy.cc:83 ApplyImpl] 19 Memcpy nodes are added to the graph tf2onnx for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2025-08-21 00:02:50.229739399 [W:onnxruntime:, transformer_memcpy.cc:83 ApplyImpl] 5 Memcpy nodes are added to the graph tf2onnx__44 for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
Time: 89 ms
Found objects:
85 clock 0.9623478 [0.04242745 0.23638025 0.32066926 0.566869 ]
3 car 0.6170904 [0.6252131 0.11427537 0.8362978 0.40256432]
3 car 0.35680234 [0.60697377 0.10544738 0.8504381 0.52560043]
3 car 0.34608585 [0.6352387 0.10221773 0.83975834 0.29047823]
A new image, img_vis.jpg, should be saved showing the model output visualizations.

The following breakdown describes the inference steps performed by the script.
- Load the model, specifying the execution provider for the device's inference engine.
model_path = "ssd_mobilenet_v1_12-int8.onnx" session = ort.InferenceSession(model_path, providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
Execution Providers
The device NPU can be specified with ['NnapiExecutionProvider', 'VsiNpuExecutionProvider']. These NPU execution providers are not available in later BSPs such as 6.12, but they are available in earlier BSPs such as 5.15. The device GPU can be specified with ['CUDAExecutionProvider']. A short sketch after this list shows how to check which providers are available at runtime.
- Preprocess the input image by resizing it to the model's input shape and type-casting the values to the model's input data type.
image_path = "000000000064.jpg" input_name = session.get_inputs()[0].name input_shape = session.get_inputs()[0].shape dtype = session.get_inputs()[0].type input_dtype = np.uint8 if "uint8" in dtype else np.float32 height, width = 640, 640 image = Image.open(image_path) img = np.array(image.resize((width, height))) img = np.expand_dims(img, axis=0).astype(input_dtype)
- Run model inference by calling the run() function.
outputs = session.run(None, {input_name: img})
- Query the model outputs and filter them by score.
boxes, classes, scores, _ = outputs
boxes = boxes.squeeze()
classes = classes.squeeze()
scores = scores.squeeze()
mask = scores >= score_threshold
boxes = boxes[mask]
classes = classes[mask]
scores = scores[mask]
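Because the NPU execution providers are only present on some BSPs (see the Execution Providers note above), a small runtime check can be used to pick the best provider that is actually available. The snippet below is a sketch, not part of the provided script, and assumes the same model file as above.

import onnxruntime as ort

# Providers in order of preference: NPU first, then GPU, then CPU fallback.
preferred = ["NnapiExecutionProvider", "VsiNpuExecutionProvider",
             "CUDAExecutionProvider", "CPUExecutionProvider"]
available = ort.get_available_providers()
providers = [p for p in preferred if p in available]
print("Using providers:", providers)

session = ort.InferenceSession("ssd_mobilenet_v1_12-int8.onnx", providers=providers)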
These outputs can then be taken and visualized as shown in the Python script.
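If you want a starting point for your own visualization, a minimal sketch using PIL is shown below. It is not the exact code from the provided script, and it assumes the boxes are normalized [y1, x1, y2, x2] coordinates, as printed in the sample output above.

from PIL import Image, ImageDraw

image = Image.open("000000000064.jpg")
draw = ImageDraw.Draw(image)

# boxes, classes, scores come from the filtering step above.
for box, cls, score in zip(boxes, classes, scores):
    y1, x1, y2, x2 = box
    draw.rectangle([x1 * image.width, y1 * image.height,
                    x2 * image.width, y2 * image.height],
                   outline="red", width=2)
    draw.text((x1 * image.width, y1 * image.height), f"{int(cls)}: {score:.2f}", fill="red")

image.save("img_vis.jpg")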
Deploying Quantized TFLite
In this example, we take the small PyTorch segmentation model from Ultralytics and export it to a quantized TFLite model using the command yolo export model=yolo11s-seg.pt format=tflite int8=True. You can find more information on exporting models in Ultralytics, or you can follow these steps for quantizing ONNX to TFLite. Once the TFLite model is ready, it can be deployed on target as shown below.
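If you prefer to export from Python rather than the CLI, the equivalent call through the Ultralytics API is sketched below; int8 export uses a calibration dataset, which can usually be set with the data argument, and the exact output path of the generated TFLite file may vary.

from ultralytics import YOLO

# Load the pretrained segmentation model and export a quantized TFLite file.
model = YOLO("yolo11s-seg.pt")
model.export(format="tflite", int8=True)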
Download our Python Script for running the example.
Next, you can also download this sample TFLite model for running the demo; otherwise, update the path in the Python script to point to your own YOLO segmentation TFLite model.
Lastly, you can try this sample image 000000000064.jpg taken from COCO128.

Once the files have been downloaded, copy them onto the embedded platform with scp.
Run the script with the command python3 run-tflite.py. The script should print the inference time in milliseconds and the model detections as follows:
$ python3 run-tflite.py
INFO: Vx delegate: allowed_cache_mode set to 0.
INFO: Vx delegate: device num set to 0.
INFO: Vx delegate: allowed_builtin_code set to 0.
INFO: Vx delegate: error_during_init set to 0.
INFO: Vx delegate: error_during_prepare set to 0.
INFO: Vx delegate: error_during_invoke set to 0.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
warning at CreateOutputsTensor, #90
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 56: default layout inference pass.
W [HandleLayoutInfer:332]Op 56: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
Time: 203 ms
Found objects:
74 label 0.66109437 [0.22954665 0.06427306 0.5417301 0.30300158]
2 label 0.33054718 [0.11018239 0.550912 0.25709224 0.58763945]
2 label 0.25709227 [0.1101824 0.615185 0.47745705 0.8171861 ]
A new image, img_vis.jpg, should be saved showing the model output visualizations.

The following breakdown describes the inference steps performed by the script.
- Load the model, specifying the external delegate to use the device's NPU.
model_path = "yolo11s-seg.tflite" delegate = "/usr/lib/libvx_delegate.so" ext_delegate = load_delegate(delegate, {}) ip = Interpreter(model_path=model_path, experimental_delegates=[ext_delegate])
OpenVX Delegate
The OpenVX delegate is specified with experimental_delegates=[ext_delegate]. To run on the CPU instead, remove this argument.
- Allocate tensors to allocate memory and set up the input/output tensor bindings.
ip.allocate_tensors()
- Call invoke() once at the start as a model warmup, since the first call may take up to 9 seconds to run.
ip.invoke()
- Preprocess the input image by resizing it to the model's input shape and type-casting the values to the model's input data type.
image_path = "000000000064.jpg" input_det = ip.get_input_details()[0] _, height, width, _ = input_det.get("shape") image = Image.open(image_path) size = (image.height, image.width) img = np.array(image.resize((width, height))) # is TFLite quantized int8 model int8 = input_det["dtype"] == np.int8 # is TFLite quantized uint8 model uint8 = input_det["dtype"] == np.uint8 if int8 or uint8: img = img.astype(np.uint8) if uint8 else img.astype(np.int8) else: img = img.astype(np.float32) img = np.array([img])
- Query the input index from the model's input details and set the input tensor.
inp_id = ip.get_input_details()[0]["index"]
ip.set_tensor(inp_id, img)
- Run model inference by calling the invoke() function.
ip.invoke()
- Query and dequantize the model outputs.
out_det = ip.get_output_details()  # output tensor details
box_id, mask_id = None, None
outputs = []
for i, out in enumerate(out_det):
    x = ip.get_tensor(out["index"])
    if (int8 or uint8) and x.dtype != np.float32:
        scale, zero_point = out["quantization"]
        x = (x.astype(np.float32) - zero_point) * scale  # re-scale
    outputs.append(x)
    if len(out.get("shape")) > 3:
        mask_id = i  # 4D output holds the mask prototypes
    else:
        box_id = i   # remaining output holds the box predictions
- Decode the bounding box outputs and apply NMS (a sketch of a typical numpy_nms implementation is shown after this list).
score_threshold = 0.25
iou_threshold = 0.70
boxes, scores, classes, masks = decode_boxes(outputs[box_id], nc)
# Filter By Score
scores_masks = scores > score_threshold
boxes = boxes[scores_masks]
classes = classes[scores_masks]
scores = scores[scores_masks]
masks = masks[scores_masks]
# Filter By NMS
keep = numpy_nms(boxes, scores, thresh=iou_threshold)
boxes = boxes[keep]
classes = classes[keep]
scores = scores[keep]
masks = masks[keep]
- Decode the mask outputs.
masks = decode_masks(masks, np.array(outputs[mask_id], dtype=np.float32))
masks = resize_mask(masks, size)
masks = crop_mask(masks, boxes)
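The numpy_nms helper used in the box-decoding step is defined in the provided script. As a reference, a typical pure-NumPy greedy NMS over [x1, y1, x2, y2] boxes looks roughly like the sketch below; the provided implementation may differ in detail.

import numpy as np

def numpy_nms(boxes, scores, thresh):
    # Greedy non-maximum suppression on [x1, y1, x2, y2] boxes.
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Overlap of the highest-scoring box with the remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Keep boxes whose IoU with the selected box is below the threshold.
        order = order[np.where(iou <= thresh)[0] + 1]
    return np.array(keep)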
The decoded outputs can then be taken and visualized as shown above.
Output Decoding
For the last two decoding steps (box and mask decoding), see the Python script provided above for the decode_boxes and decode_masks functions.
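As a rough illustration of the mask-decoding step (the actual decode_masks implementation is in the provided script and may differ), YOLO segmentation models typically produce a set of prototype masks plus per-detection mask coefficients; the final masks are the sigmoid of the coefficients multiplied with the prototypes. The sketch below assumes an NHWC prototype tensor with 32 channels, which is an assumption rather than something taken from the script.

import numpy as np

def decode_masks_sketch(coeffs, protos):
    # coeffs: (num_detections, 32) mask coefficients from the box output.
    # protos: (1, mh, mw, 32) prototype masks from the second output (NHWC, assumed).
    _, mh, mw, c = protos.shape
    protos = protos.reshape(mh * mw, c)        # (mh*mw, 32)
    masks = coeffs @ protos.T                  # (num_detections, mh*mw)
    masks = 1.0 / (1.0 + np.exp(-masks))       # sigmoid
    return masks.reshape(-1, mh, mw)           # (num_detections, mh, mw)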
Next Steps
You can find more examples of deploying models on various platforms here.