Skip to content

Optimizing OpenVINO Inference for Ultralytics YOLO Models: A Comprehensive Guide

OpenVINO Ecosystem


When deploying deep learning models, particularly those for object detection such as Ultralytics YOLO models, achieving optimal performance is crucial. This guide delves into leveraging Intel's OpenVINO toolkit to optimize inference, focusing on latency and throughput. Whether you're working on consumer-grade applications or large-scale deployments, understanding and applying these optimization strategies will ensure your models run efficiently on various devices.

Optimizing for Latency

Latency optimization is vital for applications requiring immediate response from a single model given a single input, typical in consumer scenarios. The goal is to minimize the delay between input and inference result. However, achieving low latency involves careful consideration, especially when running concurrent inferences or managing multiple models.

Key Strategies for Latency Optimization:

  • Single Inference per Device: The simplest way to achieve low latency is by limiting to one inference at a time per device. Additional concurrency often leads to increased latency.
  • Leveraging Sub-Devices: Devices like multi-socket CPUs or multi-tile GPUs can execute multiple requests with minimal latency increase by utilizing their internal sub-devices.
  • OpenVINO Performance Hints: Utilizing OpenVINO's ov::hint::PerformanceMode::LATENCY for the ov::hint::performance_mode property during model compilation simplifies performance tuning, offering a device-agnostic and future-proof approach.

Managing First-Inference Latency:

  • Model Caching: To mitigate model load and compile times impacting latency, use model caching where possible. For scenarios where caching isn't viable, CPUs generally offer the fastest model load times.
  • Model Mapping vs. Reading: To reduce load times, OpenVINO replaced model reading with mapping. However, if the model is on a removable or network drive, consider using ov::enable_mmap(false) to switch back to reading.
  • AUTO Device Selection: This mode begins inference on the CPU, shifting to an accelerator once ready, seamlessly reducing first-inference latency.

Optimizing for Throughput

Throughput optimization is crucial for scenarios serving numerous inference requests simultaneously, maximizing resource utilization without significantly sacrificing individual request performance.

Approaches to Throughput Optimization:

  1. OpenVINO Performance Hints: A high-level, future-proof method to enhance throughput across devices using performance hints.

    import as hints
    config = {hints.performance_mode: hints.PerformanceMode.THROUGHPUT}
    compiled_model = core.compile_model(model, "GPU", config)
  2. Explicit Batching and Streams: A more granular approach involving explicit batching and the use of streams for advanced performance tuning.

Designing Throughput-Oriented Applications:

To maximize throughput, applications should:

  • Process inputs in parallel, making full use of the device's capabilities.
  • Decompose data flow into concurrent inference requests, scheduled for parallel execution.
  • Utilize the Async API with callbacks to maintain efficiency and avoid device starvation.

Multi-Device Execution:

OpenVINO's multi-device mode simplifies scaling throughput by automatically balancing inference requests across devices without requiring application-level device management.


Optimizing Ultralytics YOLO models for latency and throughput with OpenVINO can significantly enhance your application's performance. By carefully applying the strategies outlined in this guide, developers can ensure their models run efficiently, meeting the demands of various deployment scenarios. Remember, the choice between optimizing for latency or throughput depends on your specific application needs and the characteristics of the deployment environment.

For more detailed technical information and the latest updates, refer to the OpenVINO documentation and Ultralytics YOLO repository. These resources provide in-depth guides, tutorials, and community support to help you get the most out of your deep learning models.

Ensuring your models achieve optimal performance is not just about tweaking configurations; it's about understanding your application's needs and making informed decisions. Whether you're optimizing for real-time responses or maximizing throughput for large-scale processing, the combination of Ultralytics YOLO models and OpenVINO offers a powerful toolkit for developers to deploy high-performance AI solutions.

Created 2024-03-17, Updated 2024-06-10
Authors: glenn-jocher (4)