OpenVINO Inference Optimization for YOLO

Introduction

When deploying deep learning models, particularly object detection models such as Ultralytics YOLO, inference performance is crucial. This guide covers how to use Intel's OpenVINO toolkit to optimize inference for latency and throughput. Whether you are building consumer-grade applications or large-scale deployments, applying these strategies helps your models run efficiently on a range of devices.

Optimizing for Latency

Latency optimization is vital for applications requiring immediate response from a single model given a single input, typical in consumer scenarios. The goal is to minimize the delay between input and inference result. However, achieving low latency involves careful consideration, especially when running concurrent inferences or managing multiple models.

Key Strategies for Latency Optimization:

Single Inference per Device: The simplest way to achieve low latency is to run only one inference at a time per device. Additional concurrency often increases latency.
Leveraging Sub-Devices: Devices like multi-socket CPUs or multi-tile GPUs can execute multiple requests with minimal latency increase by utilizing their internal sub-devices.
OpenVINO Performance Hints: Setting the ov::hint::performance_mode property to ov::hint::PerformanceMode::LATENCY during model compilation simplifies performance tuning, offering a device-agnostic and future-proof approach.
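
As a minimal sketch of the hint-based strategy above, the snippet below compiles a model with the latency hint, using the string form of the property (`"PERFORMANCE_HINT": "LATENCY"`). The helper names, the model path, and the `CPU` default are illustrative assumptions rather than part of the OpenVINO API.

```python
def latency_config() -> dict:
    """Return a compile-time config asking the device plugin to favor low latency."""
    # String form of ov::hint::performance_mode set to PerformanceMode::LATENCY.
    return {"PERFORMANCE_HINT": "LATENCY"}


def compile_for_latency(model_path: str, device: str = "CPU"):
    """Compile the model at `model_path` (a hypothetical path) for low latency."""
    # Imported lazily so latency_config() can be used without OpenVINO installed.
    import openvino as ov

    core = ov.Core()
    return core.compile_model(model_path, device, latency_config())
```

To follow the single-inference strategy, create one request with `compiled.create_infer_request()` and call `request.infer(inputs)` for each input, waiting for it to return before submitting the next.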

Managing First-Inference Latency:

Model Caching: To mitigate model load and compile times impacting latency, use model caching where possible. For scenarios where caching isn't viable, CPUs generally offer the fastest model load times.
Model Mapping vs. Reading: To reduce load times, OpenVINO replaced model reading with mapping. However, if the model is on a removable or network drive, consider using ov::enable_mmap(false) to switch back to reading.
AUTO Device Selection: This mode begins inference on the CPU, shifting to an accelerator once ready, seamlessly reducing first-inference latency.
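
The three practices above can be combined in one place. The sketch below assumes the string property names `CACHE_DIR` and `ENABLE_MMAP` and the `AUTO` device name; the helper names, the `ov_cache` directory, and the model path are hypothetical choices for illustration.

```python
def first_inference_config(cache_dir: str, use_mmap: bool = True) -> dict:
    """Build Core-level properties that reduce model load and compile time."""
    config = {"CACHE_DIR": cache_dir}  # reuse compiled blobs across runs
    if not use_mmap:
        # Fall back to reading the file, e.g. for removable or network drives.
        config["ENABLE_MMAP"] = False
    return config


def compile_with_fast_startup(model_path: str, cache_dir: str = "ov_cache",
                              use_mmap: bool = True):
    """Compile on AUTO so inference can start on CPU while an accelerator loads."""
    # Imported lazily so the config helper can be used without OpenVINO installed.
    import openvino as ov

    core = ov.Core()
    core.set_property(first_inference_config(cache_dir, use_mmap))
    return core.compile_model(model_path, "AUTO")
```

Caching only pays off from the second run onward, since the first run still has to compile the model and write the cache.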

Optimizing for Throughput

Throughput optimization is crucial when serving many inference requests simultaneously. The goal is to maximize resource utilization without significantly sacrificing the performance of individual requests.

Approaches to Throughput Optimization:

OpenVINO Performance Hints: A high-level, future-proof method to enhance throughput across devices using performance hints.

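import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model("model.xml")  # placeholder path to your exported IR file
config = {hints.performance_mode: hints.PerformanceMode.THROUGHPUT}  # tune for throughput
compiled_model = core.compile_model(model, "GPU", config)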

Explicit Batching and Streams: A more granular approach involving explicit batching and the use of streams for advanced performance tuning.
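
For the explicit route, a rough sketch might pair a fixed stream count with manual batching of inputs. It assumes the string property name `NUM_STREAMS`; the helper names and any specific stream count are illustrative, and the model must be exported or reshaped to accept the batch size you choose.

```python
from typing import Iterator, List, Sequence


def streams_config(num_streams: int) -> dict:
    """Return a compile-time config that requests a fixed number of streams."""
    if num_streams < 1:
        raise ValueError("num_streams must be at least 1")
    return {"NUM_STREAMS": str(num_streams)}


def make_batches(items: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

The dictionary from `streams_config` can be passed as the third argument to `core.compile_model`. The best stream count and batch size are hardware-dependent, so treat any value as a starting point for measurement.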

Designing Throughput-Oriented Applications:

To maximize throughput, applications should:

Process inputs in parallel, making full use of the device's capabilities.
Decompose data flow into concurrent inference requests, scheduled for parallel execution.
Utilize the Async API with callbacks to maintain efficiency and avoid device starvation.
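
One way to put these points into practice is OpenVINO's `AsyncInferQueue`, which keeps several requests in flight and invokes a callback as each finishes. The sketch below is an assumption-laden outline: the helper names are hypothetical, the model is assumed to have a single input and to expose the result of interest as output 0, and `inputs` is assumed to be a sequence of preprocessed arrays.

```python
from typing import Callable, Dict, List, Sequence


def make_callback(results: Dict[int, object]) -> Callable:
    """Build a completion callback that stores each job's first output by job id."""

    def on_done(request, job_id: int) -> None:
        # Copy the data because the request and its buffers are reused for later jobs.
        results[job_id] = request.get_output_tensor(0).data.copy()

    return on_done


def run_async(model_path: str, inputs: Sequence, device: str = "CPU") -> List:
    """Run all inputs through an async queue and return outputs in input order."""
    # Imported lazily so make_callback() can be used without OpenVINO installed.
    import openvino as ov

    core = ov.Core()
    compiled = core.compile_model(model_path, device, {"PERFORMANCE_HINT": "THROUGHPUT"})
    queue = ov.AsyncInferQueue(compiled)  # default lets OpenVINO pick the job count
    results: Dict[int, object] = {}
    queue.set_callback(make_callback(results))
    for job_id, tensor in enumerate(inputs):
        # Blocks only when every request in the queue is busy.
        queue.start_async({0: tensor}, userdata=job_id)
    queue.wait_all()
    return [results[i] for i in range(len(inputs))]
```

Callbacks may fire out of submission order, which is why results are keyed by job id and reassembled at the end.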

Multi-Device Execution:

OpenVINO's multi-device mode simplifies scaling throughput by automatically balancing inference requests across devices without requiring application-level device management.
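
As a hedged illustration, multi-device mode is selected through the device string passed at compile time, for example `MULTI:GPU,CPU`. The helper names and the preferred device order below are assumptions, and whether the `MULTI` virtual device is available depends on your OpenVINO release.

```python
from typing import List, Sequence


def multi_device_name(devices: Sequence[str]) -> str:
    """Build a MULTI device string such as 'MULTI:GPU,CPU' in priority order."""
    if not devices:
        raise ValueError("at least one device is required")
    return "MULTI:" + ",".join(devices)


def pick_devices(preferred: Sequence[str], available: Sequence[str]) -> List[str]:
    """Keep the preferred device types that appear among the available devices."""
    # Available names may carry an index suffix such as 'GPU.0', so match by prefix.
    return [d for d in preferred if any(a.startswith(d) for a in available)]


def compile_multi(model_path: str, preferred: Sequence[str] = ("GPU", "CPU")):
    """Compile the model across whichever preferred devices this machine has."""
    # Imported lazily so the string helpers can be used without OpenVINO installed.
    import openvino as ov

    core = ov.Core()
    devices = pick_devices(preferred, core.available_devices)
    return core.compile_model(model_path, multi_device_name(devices))
```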

Real-World Performance Gains

Implementing OpenVINO optimizations with Ultralytics YOLO models can yield significant performance improvements. As demonstrated in benchmarks, users can experience up to 3x faster inference speeds on Intel CPUs, with even greater accelerations possible across Intel's hardware spectrum including integrated GPUs, dedicated GPUs, and VPUs.

For example, when running YOLO26 models on Intel Xeon CPUs, the OpenVINO-optimized versions consistently outperform their PyTorch counterparts in terms of inference time per image, without compromising on accuracy.
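
Because speedups vary by hardware, it is worth measuring on your own machine. The helper below is a generic timing sketch using only the standard library; its name and defaults are illustrative and it is not part of Ultralytics or OpenVINO.

```python
import statistics
import time
from typing import Callable


def time_per_call_ms(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> float:
    """Return the median wall-clock time of `fn()` in milliseconds."""
    # Warm-up calls absorb one-off costs such as model compilation and caching.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

For example, comparing `time_per_call_ms(lambda: pt_model(img, verbose=False))` with the same call on the exported OpenVINO model gives a like-for-like per-image figure, where `pt_model` and `img` are your own loaded model and input.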

Practical Implementation

To export and optimize your Ultralytics YOLO model for OpenVINO, you can use the export functionality:

from ultralytics import YOLO

# Load a model
model = YOLO("yolo26n.pt")

# Export the model to OpenVINO format
model.export(format="openvino", half=True)  # Export with FP16 precision

After exporting, you can run inference with the optimized model:

from ultralytics import YOLO

# Load the OpenVINO model
ov_model = YOLO("yolo26n_openvino_model/")

# Run inference (Ultralytics auto-selects OpenVINO LATENCY mode for batch=1)
results = ov_model("https://ultralytics.com/images/bus.jpg", verbose=True)

Conclusion

Optimizing Ultralytics YOLO models for latency and throughput with OpenVINO can significantly enhance your application's performance. By carefully applying the strategies outlined in this guide, developers can ensure their models run efficiently, meeting the demands of various deployment scenarios. Remember, the choice between optimizing for latency or throughput depends on your specific application needs and the characteristics of the deployment environment.

For more detailed technical information and the latest updates, refer to the OpenVINO documentation and Ultralytics YOLO repository. These resources provide in-depth guides, tutorials, and community support to help you get the most out of your deep learning models.

Ensuring your models achieve optimal performance is not just about tweaking configurations; it's about understanding your application's needs and making informed decisions. Whether you're optimizing for real-time responses or maximizing throughput for large-scale processing, the combination of Ultralytics YOLO models and OpenVINO offers a powerful toolkit for developers to deploy high-performance AI solutions.

FAQ

How do I optimize Ultralytics YOLO models for low latency using OpenVINO?

Optimizing Ultralytics YOLO models for low latency involves several key strategies:

Single Inference per Device: Limit inferences to one at a time per device to minimize delays.
Leveraging Sub-Devices: Utilize devices like multi-socket CPUs or multi-tile GPUs which can handle multiple requests with minimal latency increase.
OpenVINO Performance Hints: Use OpenVINO's ov::hint::PerformanceMode::LATENCY hint during model compilation for simplified, device-agnostic tuning.

For more practical tips on optimizing latency, check out the Latency Optimization section of our guide.

Why should I use OpenVINO for optimizing Ultralytics YOLO throughput?

OpenVINO enhances Ultralytics YOLO model throughput by maximizing device resource utilization without significantly sacrificing the performance of individual requests. Key benefits include:

Performance Hints: Simple, high-level performance tuning across devices.
Explicit Batching and Streams: Fine-tuning for advanced performance.
Multi-Device Execution: Automated inference load balancing, easing application-level management.

Example configuration:

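import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model("model.xml")  # placeholder path to your exported IR file
config = {hints.performance_mode: hints.PerformanceMode.THROUGHPUT}  # tune for throughput
compiled_model = core.compile_model(model, "GPU", config)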

Learn more about throughput optimization in the Throughput Optimization section of our detailed guide.

What is the best practice for reducing first-inference latency in OpenVINO?

To reduce first-inference latency, consider these practices:

Model Caching: Use model caching to decrease load and compile times.
Model Mapping vs. Reading: Use mapping (ov::enable_mmap(true)) by default but switch to reading (ov::enable_mmap(false)) if the model is on a removable or network drive.
AUTO Device Selection: Utilize AUTO mode to start with CPU inference and transition to an accelerator seamlessly.

For detailed strategies on managing first-inference latency, refer to the Managing First-Inference Latency section.

How do I balance optimizing for latency and throughput with Ultralytics YOLO and OpenVINO?

Balancing latency and throughput optimization requires understanding your application needs:

Latency Optimization: Ideal for real-time applications requiring immediate responses (e.g., consumer-grade apps).
Throughput Optimization: Best for scenarios with many concurrent inferences, maximizing resource use (e.g., large-scale deployments).

OpenVINO's high-level performance hints and multi-device modes can help strike the right balance. Choose the LATENCY hint when response time per request matters most and the THROUGHPUT hint when total requests served matters most.

Can I use Ultralytics YOLO models with other AI frameworks besides OpenVINO?

Yes, Ultralytics YOLO models are highly versatile and can be integrated with various AI frameworks. Options include:

TensorRT: For NVIDIA GPU optimization, follow the TensorRT integration guide.
CoreML: For Apple devices, refer to our CoreML export instructions.
LiteRT.js: For web and Node.js apps, see the LiteRT integration guide and the LiteRT.js web runtime.

Explore more integrations on the Ultralytics Integrations page.

Contributors

glenn-jocher, raimbekovm, RizwanMunawar, ambitious-octopus, onuralpszr

Created Mar 17, 2024. Updated 3 weeks ago.