
Qualcomm QNN Export for Ultralytics YOLO Models

Deploying computer vision models on Qualcomm Snapdragon devices requires a model format tuned for the Qualcomm AI Engine Direct (QNN) runtime. Exporting Ultralytics YOLO models to the QNN format lets you run accelerated, on-device inference across Snapdragon CPU, Adreno GPU, and Hexagon NPU hardware found in billions of mobile phones, laptops, automotive systems, and IoT devices. This guide walks through how to export YOLO to Qualcomm QNN and deploy it for fast, low-power inference on Snapdragon hardware.

What is Qualcomm QNN?

Qualcomm QNN on-device inference

Qualcomm AI Engine Direct — commonly referred to as QNN and distributed as part of the Qualcomm AI Runtime (QAIRT) SDK — is Qualcomm's low-level inference stack for Snapdragon processors. It provides a unified API with backend-specific libraries that target the Snapdragon CPU, the Adreno GPU, and the Hexagon Tensor Processor (HTP), the dedicated neural network processing unit (NPU) inside modern Snapdragon SoCs. QNN gives developers full-stack access to these Snapdragon AI accelerators and is the modern successor to the older Snapdragon Neural Processing Engine (SNPE) SDK. It powers on-device AI across the Snapdragon 8 Gen 2, 8 Gen 3, and 8 Elite mobile platforms, Snapdragon X laptops, and automotive and XR products.

Why Export to Qualcomm QNN?

Snapdragon is one of the most widely deployed mobile compute platforms in the world. Exporting Ultralytics YOLO to the Qualcomm QNN format unlocks the dedicated AI hardware on these devices:

  • Hexagon NPU acceleration: Running YOLO on the Hexagon Tensor Processor delivers dramatically higher throughput and lower power than CPU inference — ideal for real-time inference and always-on computer vision on Snapdragon.
  • On-device and offline: QNN inference runs entirely on the Snapdragon device, so there are no cloud round-trips, latency stays low, and data never leaves the device.
  • INT8 efficiency: QNN export quantizes YOLO to INT8, the Hexagon NPU's native precision, shrinking model size and maximizing frames per second on battery-powered hardware.
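  • One workflow, many devices: The same export call covers Snapdragon platforms from the Snapdragon 865 through the 8 Gen 2, 8 Gen 3, and 8 Elite families. You pick the Hexagon architecture for each build with the name argument, and the QNN runtime exposes Snapdragon CPU, Adreno GPU, and Hexagon NPU backends.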
  • Production-ready Qualcomm AI stack: QNN (Qualcomm AI Engine Direct / QAIRT) is Qualcomm's current, actively maintained on-device AI runtime and the recommended replacement for SNPE.

QNN Export Format

Ultralytics compiles YOLO models to QNN locally using the ONNX Runtime QNN Execution Provider (the pip-installable onnxruntime-qnn package, which bundles the QAIRT libraries). The exporter converts your model to ONNX, INT8-quantizes it with calibration data (the Hexagon NPU is an int8 accelerator), then initializes an ONNX Runtime session with context-binary caching enabled — this compiles the quantized graph into a QNN context binary embedded in <model>_qnn.onnx. No Qualcomm account, cloud upload, or separate SDK download is required.
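The two ONNX Runtime steps described above (INT8 QDQ quantization, then a session that caches the compiled context binary) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Ultralytics exporter: it assumes the ONNX Runtime QNN quantization helpers (`get_qnn_qdq_config`, `quantize`) and the `onnxruntime_qnn` plugin-registration helpers used in the deployment example of this guide are available. The function names and the `htp_arch` provider option are assumptions for illustration. ONNX Runtime imports are deferred so the config helper can be inspected without the package installed.

```python
"""Sketch of the QNN export steps: INT8 QDQ quantization, then context-binary caching.

Assumptions (not the Ultralytics implementation): onnxruntime-qnn is installed and
provides the ONNX Runtime quantization tools plus the `onnxruntime_qnn` helpers shown
in this guide's deployment example. All function names here are hypothetical.
"""


def context_cache_entries(output_path: str) -> dict:
    """Session config entries that ask ONNX Runtime to write an EP-context model."""
    return {
        "ep.context_enable": "1",  # dump the compiled QNN graph as an EP-context model
        "ep.context_embed_mode": "1",  # embed the context binary inside the .onnx file
        "ep.context_file_path": output_path,  # e.g. "yolo26n_qnn.onnx"
    }


def quantize_for_htp(fp32_model: str, qdq_model: str, calibration_reader) -> None:
    """Insert QuantizeLinear/DequantizeLinear pairs calibrated on sample images."""
    from onnxruntime.quantization import quantize
    from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config

    config = get_qnn_qdq_config(fp32_model, calibration_reader)  # library-default quant types
    quantize(fp32_model, qdq_model, config)


def compile_context_binary(qdq_model: str, output_path: str, htp_arch: str = "73") -> None:
    """Create a session once so the QNN EP compiles and caches the context binary."""
    import onnxruntime as ort
    import onnxruntime_qnn as qnn_ep

    ort.register_execution_provider_library("QNNExecutionProvider", qnn_ep.get_library_path())
    devices = [d for d in ort.get_ep_devices() if d.ep_name == "QNNExecutionProvider"]
    if not devices:
        raise RuntimeError("QNN Execution Provider is not available on this host")

    options = ort.SessionOptions()
    for key, value in context_cache_entries(output_path).items():
        options.add_session_config_entry(key, value)
    # 'htp_arch' is assumed here to select the target Hexagon HTP version
    provider_options = {"backend_path": qnn_ep.get_qnn_htp_path(), "htp_arch": htp_arch}
    options.add_provider_for_devices(devices, provider_options)
    ort.InferenceSession(qdq_model, sess_options=options)  # expected to write output_path
```

In this sketch the calibration reader is any object whose `get_next()` returns `{"images": batch}` dicts and then `None`, which is the interface ONNX Runtime's `CalibrationDataReader` defines.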

Unlike the cloud-based Qualcomm AI Hub, which compiles and profiles models on Qualcomm-hosted Snapdragon devices and requires a Qualcomm account, the Ultralytics QNN export runs entirely on your own machine with a single export(format="qnn") call. You get the same QNN/QAIRT runtime target — Snapdragon CPU, Adreno GPU, and Hexagon NPU — without sign-up, upload limits, or queue times, and it drops straight into the standard YOLO export workflow.

The exported <model>_qnn_model/ directory bundles the context-binary ONNX and a metadata.yaml describing class names, image size, and task.

Key Features of QNN Models

  • INT8 Quantization: The model is quantized to INT8 with the ONNX Runtime QNN QDQ flow and a calibration dataset, matching the Hexagon NPU's native precision for maximum throughput and minimal size. Learn more about model quantization.
  • Fully Local Compilation: The context binary is generated entirely on your host machine — no Qualcomm account, API token, or cloud upload.
  • Full Snapdragon Acceleration: Run inference on the Hexagon NPU (HTP), Adreno GPU, or CPU through a single unified runtime.
  • Broad Device Reach: Target the wide range of Snapdragon platforms shipping in phones, PCs (Windows on Snapdragon), automotive, XR, and embedded products.
  • Precompiled Context Binary: Shipping a context binary minimizes on-device graph compilation, reducing model load latency on the target.
  • Self-Contained Output: The exported directory includes the context-binary ONNX and metadata for straightforward deployment.

Supported Tasks

QNN export supports the standard task set available in each model family, including YOLO26 semantic segmentation.

Export to QNN: Converting Your YOLO Model

Export an Ultralytics YOLO model to QNN format for deployment on Snapdragon hardware. The context binary is finalized for a target Hexagon Tensor Processor (HTP) architecture, which you select with the name argument — the same argument used to target a chip in RKNN export.

Supported HTP Architectures

Pass the target architecture via name (e.g. name="73"). Valid values:

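| name | Hexagon HTP | Snapdragon platform |
|------|-------------|---------------------|
| 68 | v68 | Snapdragon 865 |
| 69 | v69 | Snapdragon 888 / 8 Gen 1 |
| 73 | v73 | Snapdragon 8 Gen 2 (default) |
| 75 | v75 | Snapdragon 8 Gen 3 |
| 79 | v79 | Snapdragon 8 Elite |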
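A small helper can catch a mistyped name value before a long export starts. This sketch simply mirrors the table above; `HTP_ARCHITECTURES` and `resolve_htp_arch` are hypothetical names, not part of the Ultralytics API.

```python
# Hypothetical helper mirroring the supported-architecture table above.
HTP_ARCHITECTURES = {
    "68": "Snapdragon 865",
    "69": "Snapdragon 888 / 8 Gen 1",
    "73": "Snapdragon 8 Gen 2",
    "75": "Snapdragon 8 Gen 3",
    "79": "Snapdragon 8 Elite",
}
DEFAULT_HTP_ARCH = "73"


def resolve_htp_arch(name=None) -> str:
    """Return a validated HTP architecture string, falling back to the default."""
    arch = DEFAULT_HTP_ARCH if name is None else str(name)
    if arch not in HTP_ARCHITECTURES:
        valid = ", ".join(HTP_ARCHITECTURES)
        raise ValueError(f"Unsupported HTP architecture {name!r}; expected one of: {valid}")
    return arch
```

For example, `model.export(format="qnn", name=resolve_htp_arch("75"))` fails fast on a typo instead of partway through compilation.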
Platform support

QNN export uses the onnxruntime-qnn package. Stable wheels are published for Windows (x64 and ARM64) and Linux ARM64 (aarch64); a Linux x86-64 wheel is available on the ONNX Runtime nightly feed. There is no macOS wheel, so on macOS either build ONNX Runtime from source with --use_qnn or run the export on a supported platform. Context-binary generation works on an x64 host, so no Snapdragon device is required for the export step.
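To check in advance which onnxruntime-qnn channel applies to the current machine, a script can classify `platform.system()` and `platform.machine()` against the availability described above. The function name and its return labels are hypothetical; the machine strings are common values that vary by OS (Windows typically reports `AMD64` or `ARM64`, Linux reports `x86_64` or `aarch64`).

```python
import platform
from typing import Optional


def qnn_wheel_channel(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Classify a host as 'stable', 'nightly', or 'unsupported' for onnxruntime-qnn wheels."""
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    if system == "Windows" and machine in {"amd64", "x86_64", "arm64", "aarch64"}:
        return "stable"  # Windows x64 and ARM64
    if system == "Linux" and machine in {"aarch64", "arm64"}:
        return "stable"  # Linux ARM64
    if system == "Linux" and machine in {"x86_64", "amd64"}:
        return "nightly"  # Linux x86-64 via the ONNX Runtime nightly feed
    return "unsupported"  # e.g. macOS: build ONNX Runtime from source with --use_qnn
```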

Installation

To install the required packages, run:

# Install the required package for YOLO
pip install ultralytics

The onnxruntime-qnn package (which provides the ONNX Runtime QNN Execution Provider and bundles the QAIRT libraries) is installed automatically on first export. For detailed instructions and best practices related to the installation process, check our Ultralytics Installation guide. While installing the required packages for YOLO, if you encounter any difficulties, consult our Common Issues guide for solutions and tips.

Usage

The QNN format supports the Export, Predict, and Validate modes. Inference and validation run on Qualcomm Snapdragon hardware through ONNX Runtime's QNN Execution Provider (the same onnxruntime-qnn package used for export). Export your model, then load the exported model on a Snapdragon device to run inference or validate its accuracy.

Export
from ultralytics import YOLO

# Load a YOLO26 model
model = YOLO("yolo26n.pt")

# Export to Qualcomm QNN format (INT8, enforced automatically), targeting an HTP architecture via 'name'
# 'name' can be one of 68, 69, 73, 75, 79 (Snapdragon 865, 888/8 Gen 1, 8 Gen 2, 8 Gen 3, 8 Elite)
model.export(format="qnn", name="73")  # creates 'yolo26n_qnn_model/'
Predict
from ultralytics import YOLO

# Load the exported QNN model (on a Snapdragon device with onnxruntime-qnn)
model = YOLO("yolo26n_qnn_model")

# Run inference
results = model("https://ultralytics.com/images/bus.jpg")
Validate
from ultralytics import YOLO

# Load the exported QNN model (on a Snapdragon device with onnxruntime-qnn)
model = YOLO("yolo26n_qnn_model")

# Validate accuracy on the COCO8 dataset
metrics = model.val(data="coco8.yaml")

Export Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| format | str | 'qnn' | Target format for the exported model, defining compatibility with the Qualcomm QNN runtime. |
| imgsz | int or tuple | 640 | Desired image size for the model input. Can be an integer for square images or a tuple (height, width). |
| batch | int | 1 | Specifies the export model batch size, which is baked into the generated QNN context binary. |
| name | str | '73' | Target Hexagon HTP architecture version: 68, 69, 73, 75, or 79 (Snapdragon 865, 888/8 Gen 1, 8 Gen 2, 8 Gen 3, 8 Elite). The context binary is finalized for this architecture. |
| int8 | bool | True | Enables INT8 quantization. Required for QNN HTP export; automatically set to True if not specified. |
| data | str | 'coco8.yaml' | Dataset configuration file used for INT8 calibration. Specifies the calibration image source. |
| fraction | float | 1.0 | Fraction of the calibration dataset to use for INT8 quantization. |
| device | str | None | Specifies the device for the ONNX export step: GPU (device=0) or CPU (device=cpu). |
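The data and fraction arguments control which images feed INT8 calibration. The sketch below shows the general shape of a calibration feed in ONNX Runtime terms: an object whose `get_next()` returns one `{"images": batch}` dict per call and `None` when exhausted. The class name and the way `fraction` subsamples the image list are assumptions for illustration, not how Ultralytics implements it. A production reader would typically subclass `onnxruntime.quantization.CalibrationDataReader` and load real preprocessed images.

```python
import math

import numpy as np


class FractionCalibrationReader:
    """Yields float32 NCHW batches for calibration, using only a fraction of the images."""

    def __init__(self, images, fraction=1.0, input_name="images"):
        if not 0.0 < fraction <= 1.0:
            raise ValueError("fraction must be in (0, 1]")
        self.count = max(1, math.ceil(len(images) * fraction))  # keep at least one image
        self._batches = iter(images[: self.count])
        self.input_name = input_name

    def get_next(self):
        """Return the next feed dict, or None once the subset is exhausted."""
        batch = next(self._batches, None)
        if batch is None:
            return None
        return {self.input_name: batch.astype(np.float32)}
```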
Precision

The Hexagon NPU (HTP) is an int8 accelerator, so QNN export quantizes the model to INT8 using the ONNX Runtime QDQ quantization flow with calibration images from data. int8=True is enforced automatically.
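For intuition about what calibration produces, the sketch below shows asymmetric uint8 affine quantization with min/max calibration: the scale and zero point that a QDQ QuantizeLinear/DequantizeLinear pair stores for a tensor. It assumes uint8 values and plain min/max ranges; the quantization types and calibration method the exporter actually uses may differ.

```python
import numpy as np


def minmax_qparams(calibration_values, qmin=0, qmax=255):
    """Compute a uint8 scale and zero point from observed calibration values."""
    rmin = min(float(np.min(calibration_values)), 0.0)  # the range must contain 0
    rmax = max(float(np.max(calibration_values)), 0.0)
    scale = (rmax - rmin) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero data
    zero_point = int(np.clip(round(qmin - rmin / scale), qmin, qmax))
    return scale, zero_point


def quantize_linear(x, scale, zero_point, qmin=0, qmax=255):
    """QuantizeLinear: real values -> uint8 codes."""
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)


def dequantize_linear(q, scale, zero_point):
    """DequantizeLinear: uint8 codes -> approximate real values."""
    return (q.astype(np.float32) - zero_point) * scale
```

Values inside the calibrated range round-trip with an error of at most half a quantization step (scale / 2); values outside it are clipped, which is why representative calibration images matter.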

For more details about the export process, visit the Ultralytics documentation page on exporting.

Output Structure

After a successful export, a model directory is created with the following layout:

yolo26n_qnn_model/
├── yolo26n_qnn.onnx   # ONNX wrapping the precompiled QNN context binary
└── metadata.yaml      # Model metadata (classes, image size, task, etc.)

The yolo26n_qnn.onnx file embeds the QNN context binary and is loaded by ONNX Runtime with the QNN Execution Provider on the Snapdragon device. The metadata.yaml contains class names, image size, and other information used by the Ultralytics pipeline.
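Before copying an export to a device, a short check can confirm the directory has the layout shown above. `find_qnn_files` is a hypothetical helper that only assumes the `*_qnn.onnx` plus `metadata.yaml` naming described in this section.

```python
from pathlib import Path


def find_qnn_files(model_dir):
    """Return (onnx_path, metadata_path) for an exported *_qnn_model directory."""
    model_dir = Path(model_dir)
    onnx_files = sorted(model_dir.glob("*_qnn.onnx"))
    metadata = model_dir / "metadata.yaml"
    if len(onnx_files) != 1:
        raise FileNotFoundError(f"Expected one *_qnn.onnx in {model_dir}, found {len(onnx_files)}")
    if not metadata.is_file():
        raise FileNotFoundError(f"Missing metadata.yaml in {model_dir}")
    return onnx_files[0], metadata
```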

Deploying Exported YOLO QNN Models

QNN models run on Qualcomm Snapdragon hardware, making on-device model deployment straightforward. On a Snapdragon device with onnxruntime-qnn installed, run the exported model directly with the Ultralytics API (yolo predict/yolo val, see Usage above) — Ultralytics loads the context binary through the ONNX Runtime QNN Execution Provider and selects the HTP (NPU), GPU, or CPU backend.

For custom pipelines, you can also load the context-binary ONNX directly with ONNX Runtime. onnxruntime-qnn is a plugin Execution Provider, so register it at runtime:

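import numpy as np
import onnxruntime as ort
import onnxruntime_qnn as qnn_ep

# On the Snapdragon device, register the QNN plugin EP and select its device(s)
ort.register_execution_provider_library("QNNExecutionProvider", qnn_ep.get_library_path())
devices = [d for d in ort.get_ep_devices() if d.ep_name == "QNNExecutionProvider"]
if not devices:
    raise RuntimeError("No QNN Execution Provider device found; run this on a Snapdragon device")

# Point the QNN EP at the HTP (Hexagon NPU) backend library
options = ort.SessionOptions()
options.add_provider_for_devices(devices, {"backend_path": qnn_ep.get_qnn_htp_path()})
session = ort.InferenceSession("yolo26n_qnn_model/yolo26n_qnn.onnx", sess_options=options)

# Placeholder input: replace with a preprocessed float32 NCHW batch matching the export imgsz
input_tensor = np.zeros((1, 3, 640, 640), dtype=np.float32)
outputs = session.run(None, {"images": input_tensor})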

Because the QNN context binary is precompiled, the session loads quickly without recompiling the graph on-device.
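A custom pipeline also has to build input_tensor itself. The sketch below letterboxes an HWC uint8 image to a square input, converts BGR to RGB, scales to [0, 1], and returns a float32 NCHW batch. It assumes the default imgsz=640, a BGR array as returned by OpenCV, and the gray padding value 114 used by Ultralytics letterboxing. To stay NumPy-only it resizes with nearest-neighbor index sampling, whereas Ultralytics uses OpenCV interpolation, so pixel values will differ slightly.

```python
import numpy as np


def letterbox_to_nchw(image_bgr, size=640, pad_value=114):
    """Letterbox an HxWx3 uint8 BGR image; return (tensor, scale, (pad_x, pad_y))."""
    h, w = image_bgr.shape[:2]
    scale = min(size / h, size / w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Nearest-neighbor resize via index sampling (keeps the sketch NumPy-only)
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = image_bgr[rows][:, cols]
    # Center the resized image on a gray square canvas
    canvas = np.full((size, size, 3), pad_value, dtype=np.uint8)
    pad_y, pad_x = (size - new_h) // 2, (size - new_w) // 2
    canvas[pad_y : pad_y + new_h, pad_x : pad_x + new_w] = resized
    rgb = canvas[..., ::-1]  # BGR -> RGB
    tensor = rgb.transpose(2, 0, 1)[None].astype(np.float32) / 255.0  # HWC -> 1x3xHxW
    return np.ascontiguousarray(tensor), scale, (pad_x, pad_y)
```

Pass the tensor as `session.run(None, {"images": tensor})`, and keep `scale` and the padding offsets to map predicted boxes back onto the original image.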

  1. Train your model using Ultralytics Train Mode
  2. Export to QNN format using model.export(format="qnn") on a supported platform (Windows, Linux ARM64, or Linux x86-64 with the nightly onnxruntime-qnn wheel)
  3. Deploy the exported <model>_qnn_model/ directory to your Snapdragon device
  4. Run inference with the Ultralytics API, or directly with ONNX Runtime and the QNN Execution Provider, selecting the HTP, GPU, or CPU backend

Real-World Applications

YOLO models running on Qualcomm Snapdragon hardware are well suited for a wide range of edge AI applications:

  • Smartphones: Real-time object detection and scene understanding in camera and photo apps with NPU acceleration.
  • Windows on Snapdragon: On-device computer vision in Copilot+ PCs without offloading to the cloud.
  • Automotive: Driver monitoring, occupant detection, and ADAS features on Snapdragon Digital Chassis platforms.
  • XR and Wearables: Low-power, low-latency perception for AR/VR headsets and smart glasses.
  • IoT and Robotics: Efficient vision inference on Snapdragon-powered cameras, drones, and embedded systems.

Summary

In this guide, you've learned how to export Ultralytics YOLO models to the Qualcomm QNN format locally with the ONNX Runtime QNN Execution Provider. The export pipeline converts your model to ONNX, quantizes it to INT8 with calibration data, and then compiles it into a QNN context binary on your host machine, with no Qualcomm account or cloud service required. The result is a <model>_qnn.onnx file optimized for Snapdragon CPU, Adreno GPU, and Hexagon NPU hardware via the QNN/QAIRT runtime.

The combination of Ultralytics YOLO and Qualcomm's on-device AI stack provides an effective solution for running advanced computer vision workloads across the broad Snapdragon ecosystem.

For other on-device and mobile deployment targets, see the related ONNX, CoreML, NCNN, TFLite, ExecuTorch, RKNN, Sony IMX500, and TensorRT export guides. To compare formats before shipping, use Benchmark mode. For the full list of formats and options, visit the Export mode documentation and the integrations guide page.

FAQ

How do I export my Ultralytics YOLO model to QNN format?

You can export your model using the export() method in Python or via the CLI with format="qnn". The export first creates an ONNX model, then compiles it locally into a QNN context binary using the ONNX Runtime QNN Execution Provider. The onnxruntime-qnn package is installed automatically on first export.

Example
from ultralytics import YOLO

model = YOLO("yolo26n.pt")
model.export(format="qnn")

Do I need a Qualcomm account or cloud access?

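No. QNN compilation runs entirely on your local machine using the onnxruntime-qnn package, which bundles the QAIRT libraries. No Qualcomm account or API token is required and nothing is uploaded; network access is only needed to install onnxruntime-qnn on the first export (and to download model weights if they are not already cached).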

How does Ultralytics QNN export compare to Qualcomm AI Hub?

Qualcomm AI Hub is Qualcomm's cloud service for compiling, profiling, and benchmarking models on hosted Snapdragon devices, and it requires a Qualcomm account. Ultralytics QNN export targets the same QNN/QAIRT runtime (Snapdragon CPU, Adreno GPU, and Hexagon NPU) but compiles the context binary locally with the ONNX Runtime QNN Execution Provider, with no account, no upload, and no queue. It offers a direct path from a .pt model to a Snapdragon-ready build inside the standard YOLO export workflow.

Which platforms can I export on?

onnxruntime-qnn provides stable wheels for Windows (x64 and ARM64) and Linux ARM64 (aarch64), plus a Linux x86-64 wheel on the ONNX Runtime nightly feed. macOS has no wheel — build ONNX Runtime from source with --use_qnn or export on a supported platform. Context-binary generation runs on an x64 host and does not require a physical Snapdragon device.

How do I run YOLO on a Qualcomm Snapdragon NPU?

Export with model.export(format="qnn"), copy the resulting yolo26n_qnn_model directory to your Snapdragon device, and run yolo predict model=yolo26n_qnn_model source=image.jpg (or yolo val). Ultralytics loads the context binary through the ONNX Runtime QNN Execution Provider and runs it on the Hexagon NPU — see Deploying Exported YOLO QNN Models.

What is the difference between QNN and SNPE?

QNN (Qualcomm AI Engine Direct, part of the QAIRT SDK) is Qualcomm's current inference stack and the recommended replacement for the older Snapdragon Neural Processing Engine (SNPE) SDK. New deployments should target QNN.

Can I run a QNN model with yolo predict and yolo val?

Yes, on a Qualcomm Snapdragon device with onnxruntime-qnn installed — YOLO("yolo26n_qnn_model") loads the context binary through the QNN Execution Provider and runs predict/val like any other format. On an x86 host without QNN hardware the model cannot execute, since the context binary targets the Snapdragon NPU.

What is the output of a QNN export?

The export creates a directory (e.g., yolo26n_qnn_model/) containing the context-binary ONNX (yolo26n_qnn.onnx) and a metadata.yaml with class names, image size, and task information.
