Triton Inference Server with Ultralytics YOLO11

The Triton Inference Server (formerly known as TensorRT Inference Server) is an open-source software solution developed by NVIDIA. It provides a cloud inference solution optimized for NVIDIA GPUs. Triton simplifies the deployment of AI models at scale in production. Integrating Ultralytics YOLO11 with Triton Inference Server allows you to deploy scalable, high-performance deep learning inference workloads. This guide provides steps to set up and test the integration.

Watch: Getting Started with NVIDIA Triton Inference Server.

What is Triton Inference Server?

Triton Inference Server is designed to deploy a variety of AI models in production. It supports a wide range of deep learning and machine learning frameworks, including TensorFlow, PyTorch, ONNX Runtime, and many others. Its primary use cases are:

Serving multiple models from a single server instance
Dynamic model loading and unloading without server restart
Ensemble inference, allowing multiple models to be used together to achieve results
Model versioning for A/B testing and rolling updates

Key Benefits of Triton Inference Server

Using Triton Inference Server with Ultralytics YOLO11 provides several advantages:

Automatic batching: Groups multiple AI requests together before processing them, reducing latency and improving inference speed
Kubernetes integration: Cloud-native design works seamlessly with Kubernetes for managing and scaling AI applications
Hardware-specific optimizations: Takes full advantage of NVIDIA GPUs for maximum performance
Framework flexibility: Supports multiple AI frameworks including TensorFlow, PyTorch, ONNX, and TensorRT
Open-source and customizable: Can be modified to fit specific needs, ensuring flexibility for various AI applications

Prerequisites

Ensure you have the following prerequisites before proceeding:

Docker installed on your machine
Install tritonclient:
```
pip install tritonclient[all]
```

Exporting YOLO11 to ONNX Format

Before deploying the model on Triton, it must be exported to the ONNX format. ONNX (Open Neural Network Exchange) is a format that allows models to be transferred between different deep learning frameworks. Use the export function from the YOLO class:

from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load an official model

# Retrieve metadata during export. Metadata needs to be added to config.pbtxt. See next section.
metadata = []


def export_cb(exporter):
    metadata.append(exporter.metadata)


model.add_callback("on_export_end", export_cb)

# Export the model
onnx_file = model.export(format="onnx", dynamic=True)

Setting Up Triton Model Repository

The Triton Model Repository is a storage location where Triton can access and load models.

Create the necessary directory structure:

from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)

Move the exported ONNX model to the Triton repository:

from pathlib import Path

# Move ONNX model to Triton Model path
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")

# Create config file
(triton_model_path / "config.pbtxt").touch()

data = """
# Add metadata
parameters {
  key: "metadata"
  value {
    string_value: "%s"
  }
}

# (Optional) Enable TensorRT for GPU inference
# First run will be slow due to TensorRT engine conversion
optimization {
  execution_accelerators {
    gpu_execution_accelerator {
      name: "tensorrt"
      parameters {
        key: "precision_mode"
        value: "FP16"
      }
      parameters {
        key: "max_workspace_size_bytes"
        value: "3221225472"
      }
      parameters {
        key: "trt_engine_cache_enable"
        value: "1"
      }
      parameters {
        key: "trt_engine_cache_path"
        value: "/models/yolo/1"
      }
    }
  }
}
""" % metadata[0]  # noqa

with open(triton_model_path / "config.pbtxt", "w") as f:
    f.write(data)

Running Triton Inference Server

Run the Triton Inference Server using Docker:

import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:24.09-py3"  # 8.57 GB

# Pull the image
subprocess.call(f"docker pull {tag}", shell=True)

# Run the Triton server and capture the container ID
container_id = (
    subprocess.check_output(
        f"docker run -d --rm --gpus 0 -v {triton_repo_path}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)

# Wait for the Triton server to start
triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

# Wait until model is ready
for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)

Then run inference using the Triton Server model:

from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")

Cleanup the container:

# Kill and remove the container at the end of the test
subprocess.call(f"docker kill {container_id}", shell=True)

TensorRT Optimization (Optional)

For even greater performance, you can use TensorRT with Triton Inference Server. TensorRT is a high-performance deep learning optimizer built specifically for NVIDIA GPUs that can significantly increase inference speed.

Key benefits of using TensorRT with Triton include:

Up to 36x faster inference compared to unoptimized models
Hardware-specific optimizations for maximum GPU utilization
Support for reduced precision formats (INT8, FP16) while maintaining accuracy
Layer fusion to reduce computational overhead

To use TensorRT directly, you can export your YOLO11 model to TensorRT format:

from ultralytics import YOLO

# Load the YOLO11 model
model = YOLO("yolo11n.pt")

# Export the model to TensorRT format
model.export(format="engine")  # creates 'yolo11n.engine'

For more information on TensorRT optimization, see the TensorRT integration guide.

By following the above steps, you can deploy and run Ultralytics YOLO11 models efficiently on Triton Inference Server, providing a scalable and high-performance solution for deep learning inference tasks. If you face any issues or have further queries, refer to the official Triton documentation or reach out to the Ultralytics community for support.

FAQ

How do I set up Ultralytics YOLO11 with NVIDIA Triton Inference Server?

Setting up Ultralytics YOLO11 with NVIDIA Triton Inference Server involves a few key steps:

Export YOLO11 to ONNX format:

from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load an official model

# Export the model to ONNX format
onnx_file = model.export(format="onnx", dynamic=True)

Set up Triton Model Repository:

from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")
(triton_model_path / "config.pbtxt").touch()

Run the Triton Server:

import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:24.09-py3"

subprocess.call(f"docker pull {tag}", shell=True)

container_id = (
    subprocess.check_output(
        f"docker run -d --rm --gpus 0 -v {triton_repo_path}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)

triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)

This setup can help you efficiently deploy YOLO11 models at scale on Triton Inference Server for high-performance AI model inference.

What benefits does using Ultralytics YOLO11 with NVIDIA Triton Inference Server offer?

Integrating Ultralytics YOLO11 with NVIDIA Triton Inference Server provides several advantages:

Scalable AI Inference: Triton allows serving multiple models from a single server instance, supporting dynamic model loading and unloading, making it highly scalable for diverse AI workloads.
High Performance: Optimized for NVIDIA GPUs, Triton Inference Server ensures high-speed inference operations, perfect for real-time applications such as object detection.
Ensemble and Model Versioning: Triton's ensemble mode enables combining multiple models to improve results, and its model versioning supports A/B testing and rolling updates.
Automatic Batching: Triton automatically groups multiple inference requests together, significantly improving throughput and reducing latency.
Simplified Deployment: Gradual optimization of AI workflows without requiring complete system overhauls, making it easier to scale efficiently.

For detailed instructions on setting up and running YOLO11 with Triton, you can refer to the setup guide.

Why should I export my YOLO11 model to ONNX format before using Triton Inference Server?

Using ONNX (Open Neural Network Exchange) format for your Ultralytics YOLO11 model before deploying it on NVIDIA Triton Inference Server offers several key benefits:

Interoperability: ONNX format supports transfer between different deep learning frameworks (such as PyTorch, TensorFlow), ensuring broader compatibility.
Optimization: Many deployment environments, including Triton, optimize for ONNX, enabling faster inference and better performance.
Ease of Deployment: ONNX is widely supported across frameworks and platforms, simplifying the deployment process in various operating systems and hardware configurations.
Framework Independence: Once converted to ONNX, your model is no longer tied to its original framework, making it more portable.
Standardization: ONNX provides a standardized representation that helps overcome compatibility issues between different AI frameworks.

To export your model, use:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
onnx_file = model.export(format="onnx", dynamic=True)

You can follow the steps in the ONNX integration guide to complete the process.

Can I run inference using the Ultralytics YOLO11 model on Triton Inference Server?

Yes, you can run inference using the Ultralytics YOLO11 model on NVIDIA Triton Inference Server. Once your model is set up in the Triton Model Repository and the server is running, you can load and run inference on your model as follows:

from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")

This approach allows you to leverage Triton's optimizations while using the familiar Ultralytics YOLO interface. For an in-depth guide on setting up and running Triton Server with YOLO11, refer to the running triton inference server section.

How does Ultralytics YOLO11 compare to TensorFlow and PyTorch models for deployment?

Ultralytics YOLO11 offers several unique advantages compared to TensorFlow and PyTorch models for deployment:

Real-time Performance: Optimized for real-time object detection tasks, YOLO11 provides state-of-the-art accuracy and speed, making it ideal for applications requiring live video analytics.
Ease of Use: YOLO11 integrates seamlessly with Triton Inference Server and supports diverse export formats (ONNX, TensorRT, CoreML), making it flexible for various deployment scenarios.
Advanced Features: YOLO11 includes features like dynamic model loading, model versioning, and ensemble inference, which are crucial for scalable and reliable AI deployments.
Simplified API: The Ultralytics API provides a consistent interface across different deployment targets, reducing the learning curve and development time.
Edge Optimization: YOLO11 models are designed with edge deployment in mind, offering excellent performance even on resource-constrained devices.

For more details, compare the deployment options in the model export guide.

📅 Created 1 year ago ✏️ Updated 22 days ago