Link to this sectionTFLite Model Export for Deployment (Deprecated)#

Deprecated — replaced by LiteRT

As of Ultralytics 8.4.83, the standalone tflite export format has been removed and replaced by the unified Google LiteRT format. LiteRT (Lite Runtime) is the next generation and new name for TensorFlow Lite, and it exports the same .tflite model — now covering mobile, embedded, edge, and browser deployment in one format.

format="tflite" still works but emits a deprecation warning and exports a LiteRT model instead. Use format="litert" going forward; for current export instructions and options, see the LiteRT export guide.

Deploying computer vision models on edge devices or embedded devices requires a format that can ensure seamless performance.

The former TensorFlow Lite or TFLite export format optimized Ultralytics YOLO26 models for tasks like object detection and image classification in edge applications. This guide preserves the legacy TFLite deployment context; use LiteRT for new exports.

Link to this sectionWhy Was TFLite Used for Export?#

Introduced by Google in May 2017 as part of their TensorFlow framework, TensorFlow Lite, or TFLite for short, was an open-source deep learning framework designed for on-device inference, also known as edge computing. It gave developers tools to execute trained models on mobile, embedded, and IoT devices, as well as traditional computers.

TensorFlow Lite supported a wide range of platforms, including embedded Linux, Android, iOS, and microcontrollers (MCUs). TFLite exports enabled applications to run models locally and offline.

Link to this sectionKey Features of TFLite Models#

TFLite models offer a wide range of key features that enable on-device machine learning by helping developers run their models on mobile, embedded, and edge devices:

On-device Optimization: TFLite optimizes for on-device ML, reducing latency by processing data locally, enhancing privacy by not transmitting personal data, and minimizing model size to save space.
Multiple Platform Support: TFLite offers extensive platform compatibility, supporting Android, iOS, embedded Linux, and microcontrollers.
Diverse Language Support: TFLite is compatible with various programming languages, including Java, Swift, Objective-C, C++, and Python.
High Performance: Achieves superior performance through hardware acceleration and model optimization.

Link to this sectionMeasured Performance (historical)#

Before/after reference for the format migration

These TFLite numbers are kept as a historical before/after record for the onnx2tf-TFLite → LiteRT migration: the legacy onnx2tf INT8 TFLite export below versus the new LiteRT w8a32 export (see the LiteRT Measured Performance table). They are shared with the Google LiteRT team to show where the new litert-torch format still regresses against the format it replaced — see Format regressions below.

Per-task before/after on the Adreno GPU of a Xiaomi 17 (Qualcomm Snapdragon 8 Elite Gen 5, SM8850), measured through the Ultralytics Flutter plugin 0.6.8: the legacy onnx2tf INT8 TFLite assets (NHWC, input images) versus the new w8a32 LiteRT assets (NCHW, input args_0), both run on LiteRT 2.x in the same back-to-back sweep at the shipped Android imgsz. Each cell is the total time (preprocessing + inference + postprocessing) with the per-stage split beneath it; both formats compiled fully on the GPU.

Model	Task	size ^(pixels)	Before ^{onnx2tf INT8 TFLite (ms)}	After ^{w8a32 LiteRT (ms)}
YOLO26n	Detect	640	14.0 ^{1.8 / 8.1 / 4.2}	13.5 ^{1.9 / 8.1 / 3.5}
YOLO26n-seg	Segment	640	30.1 ^{1.9 / 20.3 / 8.0}	28.6 ^{1.8 / 20.1 / 6.7}
YOLO26n-sem	Semantic	640	26.4 ^{1.9 / 16.4 / 8.1}	32.9 ^{1.8 / 23.0 / 8.2}
YOLO26n-cls	Classify	224	3.5 ^{0.9 / 2.2 / 0.4}	3.2 ^{1.0 / 2.2 / 0.1}
YOLO26n-pose	Pose	640	17.4 ^{2.4 / 9.9 / 5.1}	14.0 ^{1.9 / 9.3 / 2.8}
YOLO26n-obb	OBB	640	13.9 ^{3.0 / 8.3 / 2.7}	13.0 ^{2.9 / 7.9 / 2.3}

w8a32 LiteRT matches or beats the legacy onnx2tf INT8 format on five of six tasks in total latency. Semantic remains the format regression because the w8a32 NCHW logits cost more inference time than the legacy NHWC logits, even after preprocessing cleanup. The legacy onnx2tf models run unchanged on LiteRT 2.x alongside the new NCHW exports. The official Android LiteRT assets are hosted on the yolo-flutter-app v0.6.6 release, with the detailed benchmark record in the Flutter performance doc.

Link to this sectionFormat regressions vs LiteRT#

Same-device YOLO26n detect on the Adreno GPU of a Xiaomi 17 — legacy onnx2tf INT8 TFLite versus the four LiteRT quantization formats, all measured in one sustained run (so inference is the comparable, format-dependent metric):

Android format	GPU inference (ms)	GPU-compiles
onnx2tf INT8 (legacy TFLite)	8.6	yes
LiteRT w8a32 (new official)	8.4	yes
LiteRT INT8 (`quantize=8`)	11.0	yes
LiteRT FP32	8.8	yes
LiteRT w8a16 (`quantize="w8a16"`)	(CPU fallback)	no — fails

Issues for the Google LiteRT / litert-torch team, surfaced migrating production Android assets from onnx2tf TFLite to LiteRT:

NCHW layout makes consumers layout-aware. litert-torch traces the PyTorch model and emits NCHW [1,3,H,W] with a float input, whereas the onnx2tf TFLite export was NHWC [1,H,W,3] — matching the camera/bitmap layout. The current Flutter plugin writes planar CHW directly during RGB packing, avoiding a separate HWC→CHW transpose, but simpler consumers still need either direct planar packing or an extra transpose.
quantize="w8a16" does not compile on the GPU (OpenCL) delegate and silently falls back to a CPU path that is ~40× slower (~660 ms vs ~17 ms), making the int16-activation format unusable for GPU deployment.
Static INT8 (quantize=8) is the slowest GPU format — ~11 ms vs ~8.6 ms for the equivalent legacy onnx2tf INT8 model, i.e. LiteRT's own INT8 path regresses against the format it replaced. Dynamic-range w8a32 is the only LiteRT format that matches the old INT8 speed, which is why it is now shipped.
Semantic models export as raw NCHW logits with no in-graph ArgMax option, forcing a cache-unfriendly host-side argmax over [1, C, H, W] (each class plane is a full H×W apart). The onnx2tf, CoreML, and QNN paths can emit a compact class map instead.
Output tensors were renamed output_0, output_1, … (vs onnx2tf Identity, Identity_1, …), which silently broke runtime output-shape lookup until the consumer added the new names.

The corresponding LiteRT w8a32 numbers (the format now shipped) are on the LiteRT page.

Link to this sectionDeployment Options in TFLite#

Before we look at the LiteRT replacement export example, let's understand how TFLite models are normally used.

TFLite offers various on-device deployment options for machine learning models, including:

Deploying with Android and iOS: Both Android and iOS applications with TFLite can analyze edge-based camera feeds and sensors to detect and identify objects. TFLite also offers native iOS libraries written in Swift and Objective-C. The architecture diagram below shows the process of deploying a trained model onto Android and iOS platforms using TensorFlow Lite.

TensorFlow Lite deployment architecture for mobile

Implementing with Embedded Linux: If running inferences on a Raspberry Pi using the Ultralytics Guide does not meet the speed requirements for your use case, you can use an exported TFLite model to accelerate inference times. Additionally, it's possible to further improve performance by utilizing a Coral Edge TPU device.
Deploying with Microcontrollers: TFLite models can also be deployed on microcontrollers and other devices with only a few kilobytes of memory. The core runtime just fits in 16 KB on an Arm Cortex M3 and can run many basic models. It doesn't require operating system support, any standard C or C++ libraries, or dynamic memory allocation.

Link to this sectionReplace TFLite Export with LiteRT#

For new exports, convert your model to LiteRT. The resulting model keeps the .tflite file extension.

Link to this sectionInstallation#

To install the required packages, run:

Installation

# Install the required package for YOLO26
pip install ultralytics

For detailed instructions and best practices related to the installation process, check our Ultralytics Installation guide. While installing the required packages for YOLO26, if you encounter any difficulties, consult our Common Issues guide for solutions and tips.

Link to this sectionUsage#

All Ultralytics YOLO26 models are designed to support export out of the box, making it easy to integrate them into your preferred deployment workflow. You can view the full list of supported export formats and configuration options to choose the best setup for your application.

The replacement LiteRT format supports the Export, Predict, and Validate modes. Export your model, then load the exported .tflite model to run inference or validate its accuracy.

Export

from ultralytics import YOLO

# Load a YOLO26 model
model = YOLO("yolo26n.pt")

# Export the model to LiteRT format
model.export(format="litert")  # creates 'yolo26n.tflite'

Predict

from ultralytics import YOLO

# Load the exported TFLite model
model = YOLO("yolo26n.tflite")

# Run inference
results = model("https://ultralytics.com/images/bus.jpg")

Validate

from ultralytics import YOLO

# Load the exported TFLite model
model = YOLO("yolo26n.tflite")

# Validate accuracy on the COCO8 dataset
metrics = model.val(data="coco8.yaml")

Link to this sectionExport Arguments#

Argument	Type	Default	Description
`format`	`str`	`'litert'`	Target format for the exported model, defining compatibility with various deployment environments.
`imgsz`	`int` or `tuple`	`640`	Desired image size for the model input. Can be an integer for square images or a tuple `(height, width)` for specific dimensions.
`quantize`	`int` or `str`	`None`	Quantization precision: `8` (static INT8, int8 weights + int8 activations; needs calibration `data`/`fraction`), `'w8a16'` (static, int8 weights + int16 activations; needs calibration `data`/`fraction`), `'w8a32'` (dynamic INT8, int8 weights + FP32 activations; no calibration needed), or `32`/unset (FP32). FP16 is not exported separately — an FP32 model runs in FP16 automatically on GPU delegates. Replaces the deprecated `half`/`int8` flags.
`batch`	`int`	`1`	Specifies export model batch inference size or the max number of images the exported model will process concurrently in `predict` mode.
`data`	`str`	`'coco8.yaml'`	Path to the dataset configuration file (default: `coco8.yaml`), essential for quantization.
`fraction`	`float`	`1.0`	Specifies the fraction of the dataset to use for INT8 quantization calibration. Allows for calibrating on a subset of the full dataset, useful for experiments or when resources are limited. If not specified with INT8 enabled, the full dataset will be used.
`device`	`str`	`None`	Specifies the device for exporting: CPU (`device=cpu`), MPS for Apple silicon (`device=mps`).

For more details about the export process, visit the Ultralytics documentation page on exporting.

Link to this sectionDeploying Exported YOLO26 TFLite Models#

After exporting your Ultralytics YOLO26 model to LiteRT format, you can deploy the resulting .tflite model. The primary and recommended first step for running a TFLite model is to use the YOLO("model.tflite") method, as outlined in the previous usage code snippet. However, for in-depth instructions on deploying your TFLite models in various other settings, take a look at the following resources:

Android: A quick start guide for integrating TensorFlow Lite into Android applications, providing easy-to-follow steps for setting up and running machine learning models.
iOS: Check out this detailed guide for developers on integrating and deploying TensorFlow Lite models in iOS applications, offering step-by-step instructions and resources.
End-To-End Examples: This page provides an overview of various TensorFlow Lite examples, showcasing practical applications and tutorials designed to help developers implement TensorFlow Lite in their machine learning projects on mobile and edge devices.

Link to this sectionSummary#

This guide preserves the legacy TFLite deployment workflow. For new exports, use LiteRT to create .tflite models for edge computing environments.

For further details on usage, visit the TFLite official documentation.

Also, if you're curious about other Ultralytics YOLO26 integrations, check out our integration guide page. You'll find plenty of helpful information and insights there.

Link to this sectionFAQ#

Link to this sectionHow do I replace a TFLite export with LiteRT?#

For a new export, use the LiteRT format. First, install the required package using:

pip install ultralytics

Then, use the following code snippet to export your model:

from ultralytics import YOLO

# Load a YOLO26 model
model = YOLO("yolo26n.pt")

# Export the model to LiteRT format
model.export(format="litert")  # creates 'yolo26n.tflite'

For CLI users, you can achieve this with:

yolo export model=yolo26n.pt format=litert # creates 'yolo26n.tflite'

For more details, visit the Ultralytics export guide.

Link to this sectionWhat are the benefits of using TensorFlow Lite for YOLO26 model deployment?#

TensorFlow Lite (TFLite) is an open-source deep learning framework designed for on-device inference, making it ideal for deploying YOLO26 models on mobile, embedded, and IoT devices. Key benefits include:

On-device optimization: Minimize latency and enhance privacy by processing data locally.
Platform compatibility: Supports Android, iOS, embedded Linux, and MCU.
Performance: Utilizes hardware acceleration to optimize model speed and efficiency.

To learn more, check out the TFLite guide.

Link to this sectionIs it possible to run YOLO26 TFLite models on Raspberry Pi?#

Yes, you can run YOLO26 TFLite models on Raspberry Pi to improve inference speeds. First, export your model to LiteRT format as explained above. Then, use a tool like TensorFlow Lite Interpreter to execute the model on your Raspberry Pi.

For further optimizations, you might consider using Coral Edge TPU. For detailed steps, refer to our Raspberry Pi deployment guide and the Edge TPU integration guide.

Link to this sectionCan I use TFLite models on microcontrollers for YOLO26 predictions?#

Yes, TFLite supports deployment on microcontrollers with limited resources. TFLite's core runtime requires only 16 KB of memory on an Arm Cortex M3 and can run basic YOLO26 models. This makes it suitable for deployment on devices with minimal computational power and memory.

To get started, visit the TFLite Micro for Microcontrollers guide.

Link to this sectionWhat platforms are compatible with TFLite exported YOLO26 models?#

TensorFlow Lite provides extensive platform compatibility, allowing you to deploy YOLO26 models on a wide range of devices, including:

Android and iOS: Native support through TFLite Android and iOS libraries.
Embedded Linux: Ideal for single-board computers such as Raspberry Pi.
Microcontrollers: Suitable for MCUs with constrained resources.

For more information on deployment options, see our detailed deployment guide.

Link to this sectionHow do I troubleshoot common issues during YOLO26 model export to LiteRT?#

If you encounter errors while exporting YOLO26 models to LiteRT, common solutions include:

Check package compatibility: Ensure you're using compatible versions of Ultralytics, litert-torch, and ai-edge-litert. Refer to our installation guide.
Model support: Verify that the specific YOLO26 model supports LiteRT export by checking the Ultralytics export documentation page.
Quantization issues: When using INT8 quantization, make sure your dataset path is correctly specified in the data parameter.

For additional troubleshooting tips, visit our Common Issues guide.

Contributors

GLglenn-jocher²⁴ LAlakshanthad⁴ AMambitious-octopus² ONonuralpszr¹ RAraimbekovm¹ PDpderrenger¹ MAMatthewNoyce¹ RIRizwanMunawar¹ ABabirami-vina¹

Created Feb 29, 2024Updated 5 days ago