Link to this sectionQualcomm QNN Export for Ultralytics YOLO Models#

Q: How do I export my Ultralytics YOLO model to QNN format?

You can export your model using export(format="qnn", imgsz=640) (imgsz=224 for classification) or the equivalent CLI arguments. The export first creates an ONNX model, then compiles it locally into a QNN context binary using the ONNX Runtime QNN Execution Provider. The onnxruntime-qnn package is installed automatically on first export.

Q: Which platforms can I export on?

onnxruntime-qnn provides prebuilt wheels for Windows (x64 and ARM64) and Linux ARM64 (aarch64); on Linux x86-64 build ONNX Runtime from source with --use_qnn (no prebuilt wheel is published, and macOS is not a supported QNN host). Context-binary generation runs on an x64 host — Windows x64 or Linux x86-64 — and does not require a physical Snapdragon device.

Q: How do I run YOLO on a Qualcomm Snapdragon NPU?

Export with model.export(format="qnn", imgsz=640) (imgsz=224 for classification), copy the resulting yolo26n_qnn.onnx file to your Snapdragon device, and run yolo predict model=yolo26n_qnn.onnx source=image.jpg (or yolo val). Ultralytics loads the context binary through the ONNX Runtime QNN Execution Provider and runs it on the Hexagon NPU — see Deploying Exported YOLO QNN Models.

Q: Can I run a QNN model with yolo predict and yolo val?

Yes, on a Qualcomm Snapdragon device with onnxruntime-qnn installed — YOLO("yolo26n_qnn.onnx") loads the context binary through the QNN Execution Provider and runs predict/val like any other format. On an x86 host without QNN hardware the model cannot execute, since the context binary targets the Snapdragon NPU.

Deploying computer vision models on Qualcomm Snapdragon devices requires a model format tuned for the Qualcomm AI Engine Direct (QNN) runtime. Exporting Ultralytics YOLO models to the QNN format lets you run accelerated, on-device inference across Snapdragon CPU, Adreno GPU, and Hexagon NPU hardware found in billions of mobile phones, laptops, automotive systems, and IoT devices. This guide walks through how to export YOLO to Qualcomm QNN and deploy it for fast, low-power inference on Snapdragon hardware.

Run YOLO on Snapdragon NPUs today with the official mobile apps

The official Ultralytics Flutter plugin provides opt-in QNN support for real-time camera inference and single-image prediction across all seven YOLO26 tasks. Enable the QNN runtime and add its ONNX Runtime dependency as described in the plugin README. For iOS deployment, see the Ultralytics YOLO iOS SDK and the CoreML integration.

Official mobile input sizes

Export classification models at imgsz=224. Export detect, segment, semantic, depth, pose, and OBB models at imgsz=640. This 224/640 standard is shared by the official QNN, LiteRT, and CoreML mobile assets. Ready-to-run v73 and v81 assets for all seven nano tasks are published in the yolo-flutter-app v0.6.6 release.

Link to this sectionWhat is Qualcomm QNN?#

Qualcomm QNN on-device inference

Qualcomm AI Engine Direct — commonly referred to as QNN and distributed as part of the Qualcomm AI Runtime (QAIRT) SDK — is Qualcomm's low-level inference stack for Snapdragon processors. It provides a unified API with backend-specific libraries that target the Snapdragon CPU, the Adreno GPU, and the Hexagon Tensor Processor (HTP), the dedicated neural network processing unit (NPU) inside modern Snapdragon SoCs. QNN gives developers full-stack access to these Snapdragon AI accelerators and is the modern successor to the older Snapdragon Neural Processing Engine (SNPE) SDK. It powers on-device AI across the Snapdragon 8 Gen 2, 8 Gen 3, and 8 Elite mobile platforms, Snapdragon X laptops, and automotive and XR products.

Link to this sectionWhy Export to Qualcomm QNN?#

Snapdragon is the most widely deployed mobile compute platform in the world. Exporting Ultralytics YOLO to the Qualcomm QNN format unlocks the dedicated AI hardware on these devices:

Hexagon NPU acceleration: Running YOLO on the Hexagon Tensor Processor delivers dramatically higher throughput and lower power than CPU inference — ideal for real-time inference and always-on computer vision on Snapdragon.
On-device and offline: QNN inference runs entirely on the Snapdragon device, so there are no cloud round-trips, latency stays low, and data never leaves the device.
Quantized efficiency: QNN export quantizes YOLO to INT8 weights with 16-bit activations, the Hexagon NPU's preferred accuracy/performance balance, shrinking model size and maximizing frames per second on battery-powered hardware.
One format, many devices: A single Qualcomm QNN export targets Snapdragon CPU, Adreno GPU, and Hexagon NPU across the Snapdragon 8 Gen 2, 8 Gen 3, and 8 Elite families and beyond.
Production-ready Qualcomm AI stack: QNN (Qualcomm AI Engine Direct / QAIRT) is Qualcomm's current, actively maintained on-device AI runtime and the recommended replacement for SNPE.

Link to this sectionQNN Export Format#

Ultralytics compiles YOLO models to QNN locally using the ONNX Runtime QNN Execution Provider (the pip-installable onnxruntime-qnn package, which bundles the QAIRT libraries). The exporter converts your model to ONNX, quantizes it with calibration data to 16-bit activations and INT8 weights (the recommended balance for the Hexagon NPU), then initializes an ONNX Runtime session with context-binary caching enabled — this compiles the quantized graph into a QNN context binary embedded in <model>_qnn.onnx. No Qualcomm account, cloud upload, or separate SDK download is required.

Unlike the cloud-based Qualcomm AI Hub, which compiles and profiles models on Qualcomm-hosted Snapdragon devices and requires a Qualcomm account, the Ultralytics QNN export runs entirely on your own machine with a single export(format="qnn", imgsz=640) call (imgsz=224 for classification). You get the same QNN/QAIRT runtime target — Snapdragon CPU, Adreno GPU, and Hexagon NPU — without sign-up, upload limits, or queue times, and it drops straight into the standard YOLO export workflow.

The exported *_qnn.onnx file is self-contained: it embeds the QNN context binary and ONNX metadata such as class names, image size, and task.

Link to this sectionKey Features of QNN Models#

Quantization: The model is quantized to 16-bit activations and INT8 weights with the ONNX Runtime QNN QDQ flow and a calibration dataset, the Hexagon NPU's recommended accuracy/performance balance. Learn more about model quantization.
Fully Local Compilation: The context binary is generated entirely on your host machine — no Qualcomm account, API token, or cloud upload.
Full Snapdragon Acceleration: Run inference on the Hexagon NPU (HTP), Adreno GPU, or CPU through a single unified runtime.
Broad Device Reach: Target the wide range of Snapdragon platforms shipping in phones, PCs (Windows on Snapdragon), automotive, XR, and embedded products.
Precompiled Context Binary: Shipping a context binary minimizes on-device graph compilation, reducing model load latency on the target.
Self-Contained Output: The exported ONNX file includes the precompiled QNN context binary and metadata for straightforward deployment.

Link to this sectionMeasured Performance#

Link to this sectionAndroid Phone#

Hardware: Xiaomi 17 with 12 GB LPDDR5X memory and Android 16 / API 36. Its 3 nm Snapdragon 8 Elite Gen 5 (SM8850) has an 8-core Qualcomm Oryon CPU (2 Prime cores up to 4.6 GHz and 6 Performance cores up to 3.62 GHz), Adreno GPU, and Hexagon NPU (HTP v81).

Model	Task	size ^(pixels)	CPU ^{w8a32 LiteRT (ms)}	GPU ^{w8a32 LiteRT (ms)}	NPU ^{QNN W8A16 (ms)}
YOLO26n	Detect	640	52.2 ^{1.8 / 48.1 / 2.4}	15.8 ^{2.3 / 8.9 / 4.6}	10.7 ^{1.8 / 6.7 / 2.2}
YOLO26n-seg	Segment	640	73.4 ^{1.8 / 65.6 / 6.0}	33.2 ^{1.8 / 23.8 / 7.6}	17.4 ^{1.8 / 9.9 / 5.7}
YOLO26n-sem	Semantic	640	61.2 ^{1.8 / 51.1 / 8.3}	34.2 ^{1.8 / 24.0 / 8.3}	11.5 ^{1.8 / 7.1 / 2.6}
YOLO26n-depth	Depth	640	124.4 ^{1.9 / 115.1 / 7.4}	23.0 ^{1.8 / 13.5 / 7.7}	35.2 ^{1.8 / 26.1 / 7.3}
YOLO26n-cls	Classify	224	4.4 ^{0.4 / 4.0 / 0.0}	3.1 ^{0.8 / 2.1 / 0.2}	1.2 ^{0.6 / 0.6 / 0.0}
YOLO26n-pose	Pose	640	57.4 ^{1.8 / 53.8 / 1.8}	16.6 ^{2.7 / 10.1 / 3.9}	10.9 ^{1.8 / 7.0 / 2.0}
YOLO26n-obb	OBB	640	50.3 ^{1.8 / 47.2 / 1.4}	11.7 ^{1.8 / 7.8 / 2.0}	8.6 ^{1.8 / 5.7 / 1.1}

Speed values are single-image burst latencies — the mean of 15 runs after 3 warmup runs on bus.jpg, measured with the Flutter plugin's 0.6.10 on-device benchmark harness and the standardized v0.6.6 assets. Backend order rotated between tasks in one sequential sweep. Native logs confirmed that every CPU row used LiteRT CPU/XNNPACK, every GPU row delegated the complete graph to LiteRT OpenCL (LITERT_CL), and every NPU row used the QNN Hexagon HTP backend.
The detailed benchmark record is in the Flutter performance doc.
Compare other Android devices in the LiteRT integration and Apple devices in the CoreML integration.

Link to this sectionWindows on Snapdragon Laptop#

This historical sweep used pre-standard v73 QNN binaries; semantic and OBB used 1024px inputs. It ran on a Lenovo laptop with 32 GB memory and Windows 11. Its Snapdragon X Elite (X1E78100) has a 12-core Qualcomm Oryon CPU, Adreno GPU, and Hexagon NPU (HTP v73); the exact Lenovo model was not recorded. This Windows-on-Snapdragon comparison runs the native PyTorch FP32 CPU baseline that most desktop developers start from against the ONNX Runtime QNN Hexagon HTP path. Each cell shows the full model.predict() wall time with the reported preprocessing / inference / postprocessing timings beneath it; the total can include framework overhead outside those three stages. CPU numbers are PyTorch FP32 (torch==2.10.0+cpu) and NPU numbers are ONNX Runtime QNN (onnxruntime-qnn==2.2.0, INT8 weights / 16-bit activations).

Model	Task	size ^(pixels)	CPU ^{PT FP32 (ms)}	NPU Hexagon ^{QNN W8A16 (ms)}
YOLO26n	Detect	640	91.4 ^{4.3 / 75.2 / 0.1}	27.2 ^{4.9 / 19.4 / 0.9}
YOLO26n-seg	Segment	640	138.8 ^{4.5 / 127.1 / 2.8}	34.3 ^{5.0 / 24.0 / 5.1}
YOLO26n-sem	Semantic	1024	295.8 ^{9.1 / 189.2 / 94.8}	133.0 ^{8.8 / 37.4 / 83.9}
YOLO26n-cls	Classify	224	15.4 ^{3.0 / 9.8 / 0.0}	11.7 ^{2.7 / 5.5 / 0.0}
YOLO26n-pose	Pose	640	109.6 ^{4.6 / 102.9 / 0.2}	28.9 ^{5.3 / 23.3 / 0.6}
YOLO26n-obb	OBB	1024	267.8 ^{8.1 / 254.6 / 0.1}	64.8 ^{8.9 / 54.7 / 0.6}

Speed values are single-image burst latencies — the mean of 100 runs after 10 warmup runs on bus.jpg, measured with time.perf_counter() around the full model.predict() call on a thermally rested device (ultralytics==8.4.67, Python 3.12.10).
The Hexagon NPU runs roughly 2-4x faster than the PyTorch CPU baseline across the 640-1024 px tasks (detection ~3.4x), narrowing to ~1.3x on the 224 px classifier where fixed preprocessing overhead dominates the tiny workload.

Link to this sectionSupported Tasks#

QNN export supports the standard task set available in each model family, including YOLO26 semantic segmentation.

Task	Supported
Object Detection	✅
Instance Segmentation	✅
Semantic Segmentation	✅
Pose Estimation	✅
OBB Detection	✅
Classification	✅
Depth Estimation	✅

Link to this sectionExport to QNN: Converting Your YOLO Model#

Export an Ultralytics YOLO model to QNN format for deployment on Snapdragon hardware. The context binary is finalized for a target Hexagon Tensor Processor (HTP) architecture, which you select with the name argument — the same argument used to target a chip in RKNN export.

Link to this sectionSupported HTP Architectures#

Pass the target architecture via name (e.g. name="73"). Valid values:

`name`	Hexagon HTP	Snapdragon platform
`68`	v68	Snapdragon 888
`69`	v69	Snapdragon 8 Gen 1 / 8+ Gen 1
`73`	v73	Snapdragon 8 Gen 2, X Elite (default)
`75`	v75	Snapdragon 8 Gen 3
`79`	v79	Snapdragon 8 Elite
`81`	v81	Snapdragon 8 Elite Gen 5

Platform support

QNN export uses the onnxruntime-qnn package. Prebuilt wheels are published for Windows (x64 and ARM64) and Linux ARM64 (aarch64); on Linux x86-64 build ONNX Runtime from source with --use_qnn (no prebuilt wheel is published, and macOS is not a supported QNN host). QNN context-binary generation runs on an x64 host — Windows x64 or Linux x86-64 — and does not require a Snapdragon device for the export step.

Link to this sectionInstallation#

To install the required packages, run:

Installation

# Install the required package for YOLO
pip install ultralytics

The onnxruntime-qnn package (which provides the ONNX Runtime QNN Execution Provider and bundles the QAIRT libraries) is installed automatically on first export. For detailed instructions and best practices related to the installation process, check our Ultralytics Installation guide. While installing the required packages for YOLO, if you encounter any difficulties, consult our Common Issues guide for solutions and tips.

Link to this sectionUsage#

The QNN format supports the Export, Predict, and Validate modes. Inference and validation run on Qualcomm Snapdragon hardware through ONNX Runtime's QNN Execution Provider (the same onnxruntime-qnn package used for export). Export your model, then load the exported model on a Snapdragon device to run inference or validate its accuracy.

Export

from ultralytics import YOLO

# Load a YOLO26 model
model = YOLO("yolo26n.pt")

# Export to Qualcomm QNN format (INT8, enforced automatically), targeting an HTP architecture via 'name'
# 'name' can be one of 68, 69, 73, 75, 79, 81 (Snapdragon 888, 8 Gen 1, 8 Gen 2, 8 Gen 3, 8 Elite, 8 Elite Gen 5)
model.export(format="qnn", name="73", imgsz=640)  # use imgsz=224 for classification

Predict

from ultralytics import YOLO

# Load the exported QNN model (on a Snapdragon device with onnxruntime-qnn)
model = YOLO("yolo26n_qnn.onnx")

# Run inference
results = model("https://ultralytics.com/images/bus.jpg")

Validate

from ultralytics import YOLO

# Load the exported QNN model (on a Snapdragon device with onnxruntime-qnn)
model = YOLO("yolo26n_qnn.onnx")

# Validate accuracy on the COCO8 dataset
metrics = model.val(data="coco8.yaml")

Link to this sectionExport Arguments#

Argument	Type	Default	Description
`format`	`str`	`'qnn'`	Target format for the exported model, defining compatibility with the Qualcomm QNN runtime.
`imgsz`	`int` or `tuple`	`640`	Desired image size for the model input. Can be an integer for square images or a tuple `(height, width)`.
`batch`	`int`	`1`	Specifies the export model batch size, which is baked into the generated QNN context binary.
`name`	`str`	`'73'`	Target Hexagon HTP architecture version: `68`, `69`, `73`, `75`, `79`, or `81` (Snapdragon 888, 8 Gen 1, 8 Gen 2, 8 Gen 3, 8 Elite, 8 Elite Gen 5). The context binary is finalized for this architecture.
`quantize`	`int` or `str`	`'w8a16'`/auto	Quantization precision. QNN HTP export is quantized to INT8 weights with 16-bit activations (`'w8a16'`) and is auto-enabled if not specified. Replaces the deprecated `half`/`int8` flags.
`simplify`	`bool`	`True`	Simplifies the intermediate ONNX graph with `onnxslim`.
`opset`	`int`	`None`	Specifies the ONNX opset version for the intermediate ONNX graph. If not set, uses the latest supported version.
`data`	`str`	`'coco8.yaml'`	Dataset configuration file used for INT8 calibration. Specifies the calibration image source.
`fraction`	`float`	`1.0`	Fraction of the calibration dataset to use for INT8 quantization.
`device`	`str`	`None`	Specifies the device for the ONNX export step: GPU (`device=0`) or CPU (`device=cpu`).

Precision

QNN export quantizes the model to 16-bit activations and INT8 weights — the recommended accuracy/performance balance for the Hexagon NPU — using the ONNX Runtime QDQ quantization flow with calibration images from data. quantize='w8a16' is enforced automatically.

For more details about the export process, visit the Ultralytics documentation page on exporting.

Link to this sectionOutput Structure#

After a successful export, a self-contained ONNX file is created:

yolo26n_qnn.onnx   # ONNX wrapping the precompiled QNN context binary and metadata

The yolo26n_qnn.onnx file embeds the QNN context binary and is loaded by ONNX Runtime with the QNN Execution Provider on the Snapdragon device. It also carries model metadata such as class names, image size, and task in ONNX metadata_props.

Link to this sectionDeploying Exported YOLO QNN Models#

QNN models run on Qualcomm Snapdragon hardware, making on-device model deployment straightforward. On a Snapdragon device with onnxruntime-qnn installed, run the exported model directly with the Ultralytics API (yolo predict/yolo val, see Usage above) — Ultralytics loads the context binary through the ONNX Runtime QNN Execution Provider and selects the HTP (NPU), GPU, or CPU backend.

For custom pipelines, you can also load the context-binary ONNX directly with ONNX Runtime. onnxruntime-qnn is a plugin Execution Provider, so register it at runtime:

import onnxruntime as ort
import onnxruntime_qnn as qnn_ep

# On the Snapdragon device, register the QNN plugin EP and select its device(s)
ort.register_execution_provider_library("QNNExecutionProvider", qnn_ep.get_library_path())
devices = [d for d in ort.get_ep_devices() if d.ep_name == "QNNExecutionProvider"]

options = ort.SessionOptions()
options.add_provider_for_devices(devices, {"backend_path": qnn_ep.get_qnn_htp_path()})
session = ort.InferenceSession("yolo26n_qnn.onnx", sess_options=options)
input_info = session.get_inputs()[0]
outputs = session.run(None, {input_info.name: input_tensor})  # input_tensor: float32 NHWC

Because the QNN context binary is precompiled, the session loads quickly without recompiling the graph on-device.

Link to this sectionRecommended Workflow#

Train your model using Ultralytics Train Mode
Export to QNN format using model.export(format="qnn", imgsz=640) on a supported platform (use imgsz=224 for classification)
Deploy the exported *_qnn.onnx file to your Snapdragon device
Run inference with ONNX Runtime and the QNN Execution Provider, selecting the HTP, GPU, or CPU backend

Link to this sectionReal-World Applications#

YOLO models running on Qualcomm Snapdragon hardware are well suited for a wide range of edge AI applications:

Smartphones: Real-time object detection and scene understanding in camera and photo apps with NPU acceleration.
Windows on Snapdragon: On-device computer vision in Copilot+ PCs without offloading to the cloud.
Automotive: Driver monitoring, occupant detection, and ADAS features on Snapdragon Digital Chassis platforms.
XR and Wearables: Low-power, low-latency perception for AR/VR headsets and smart glasses.
IoT and Robotics: Efficient vision inference on Snapdragon-powered cameras, drones, and embedded systems.

Link to this sectionSummary#

In this guide, you've learned how to export Ultralytics YOLO models to the Qualcomm QNN format locally with the ONNX Runtime QNN Execution Provider. The export pipeline converts your model to ONNX, then compiles it into a QNN context binary on your host machine — no Qualcomm account or cloud required — producing a *_qnn.onnx file optimized for Snapdragon CPU, Adreno GPU, and Hexagon NPU hardware via the QNN/QAIRT runtime.

The combination of Ultralytics YOLO and Qualcomm's on-device AI stack provides an effective solution for running advanced computer vision workloads across the broad Snapdragon ecosystem.

For other on-device and mobile deployment targets, see the related ONNX, CoreML, NCNN, LiteRT, ExecuTorch, RKNN, Sony IMX500, and TensorRT export guides. To compare formats before shipping, use Benchmark mode. For the full list of formats and options, visit the Export mode documentation and the integrations guide page.

Link to this sectionFAQ#

Link to this sectionHow do I export my Ultralytics YOLO model to QNN format?#

You can export your model using export(format="qnn", imgsz=640) (imgsz=224 for classification) or the equivalent CLI arguments. The export first creates an ONNX model, then compiles it locally into a QNN context binary using the ONNX Runtime QNN Execution Provider. The onnxruntime-qnn package is installed automatically on first export.

Example

from ultralytics import YOLO

model = YOLO("yolo26n.pt")
model.export(format="qnn", imgsz=640)  # use imgsz=224 for classification

Link to this sectionDo I need a Qualcomm account or cloud access?#

No. QNN export runs entirely on your local machine using the onnxruntime-qnn package, which bundles the QAIRT libraries. No Qualcomm account, API token, or network access is required.

Link to this sectionHow does Ultralytics QNN export compare to Qualcomm AI Hub?#

Qualcomm AI Hub is Qualcomm's cloud service for compiling, profiling, and benchmarking models on hosted Snapdragon devices, and it requires a Qualcomm account. Ultralytics QNN export targets the same QNN/QAIRT runtime (Snapdragon CPU, Adreno GPU, and Hexagon NPU) but compiles the context binary locally with the ONNX Runtime QNN Execution Provider — no account, no upload, and no queue. It is the fastest way to go from a .pt model to a Snapdragon-ready build directly inside the standard YOLO export workflow.

Link to this sectionWhich platforms can I export on?#

onnxruntime-qnn provides prebuilt wheels for Windows (x64 and ARM64) and Linux ARM64 (aarch64); on Linux x86-64 build ONNX Runtime from source with --use_qnn (no prebuilt wheel is published, and macOS is not a supported QNN host). Context-binary generation runs on an x64 host — Windows x64 or Linux x86-64 — and does not require a physical Snapdragon device.

Link to this sectionHow do I run YOLO on a Qualcomm Snapdragon NPU?#

Export with model.export(format="qnn", imgsz=640) (imgsz=224 for classification), copy the resulting yolo26n_qnn.onnx file to your Snapdragon device, and run yolo predict model=yolo26n_qnn.onnx source=image.jpg (or yolo val). Ultralytics loads the context binary through the ONNX Runtime QNN Execution Provider and runs it on the Hexagon NPU — see Deploying Exported YOLO QNN Models.

Link to this sectionWhat is the difference between QNN and SNPE?#

QNN (Qualcomm AI Engine Direct, part of the QAIRT SDK) is Qualcomm's current inference stack and the recommended replacement for the older Snapdragon Neural Processing Engine (SNPE) SDK. New deployments should target QNN.

Link to this sectionCan I run a QNN model with `yolo predict` and `yolo val`?#

Yes, on a Qualcomm Snapdragon device with onnxruntime-qnn installed — YOLO("yolo26n_qnn.onnx") loads the context binary through the QNN Execution Provider and runs predict/val like any other format. On an x86 host without QNN hardware the model cannot execute, since the context binary targets the Snapdragon NPU.

Link to this sectionWhat is the output of a QNN export?#

The export creates a self-contained context-binary ONNX file (e.g., yolo26n_qnn.onnx) with class names, image size, task, and other model metadata embedded in ONNX metadata_props.

Contributors

GLglenn-jocher¹² AMamanharshx¹ AMambitious-octopus¹ ONonuralpszr¹ RAraimbekovm¹ SHShuaiLYU¹

Created 2 months agoUpdated 10 hours ago