Triton Inference Server with Ultralytics YOLO11

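Triton Inference Server (formerly known as TensorRT Inference Server) is an open-source software solution developed by NVIDIA. It provides a cloud inference solution optimized for NVIDIA GPUs. Triton simplifies the deployment of AI models at scale in production. Integrating Ultralytics YOLO11 with Triton Inference Server lets you deploy scalable, high-performance deep learning inference workloads. This guide provides steps to set up and test the integration.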

Watch: Getting Started with NVIDIA Triton Inference Server.

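What is Triton Inference Server?

Triton Inference Server is designed to deploy a variety of AI models in production. It supports a wide range of deep learning and machine learning frameworks, including TensorFlow, PyTorch, ONNX Runtime, and others. Its primary use cases include: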

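Serving multiple models from a single server instance.
Dynamic model loading and unloading without restarting the server.
Ensemble inference, which allows multiple models to be used together to produce a result.
Model versioning for A/B testing and rolling updates.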
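
To make the versioning point concrete: a Triton model repository keeps each model version in a numerically named subdirectory, and the default version policy serves only the highest-numbered one. The sketch below mimics that selection with the standard library only. The function name latest_version and the temporary directory layout are illustrative assumptions, not part of Triton's API.

```python
import tempfile
from pathlib import Path


def latest_version(model_dir):
    """Return the highest numeric version subdirectory name, or None if there is none."""
    versions = [
        int(p.name)
        for p in Path(model_dir).iterdir()
        if p.is_dir() and p.name.isdigit()
    ]
    return str(max(versions)) if versions else None


# Illustrative repository holding two versions of a hypothetical "yolo" model
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "yolo"
    for version in ("1", "2"):
        (model_dir / version).mkdir(parents=True)
    print(latest_version(model_dir))
```

Versions are compared as integers, so a directory named 10 ranks above one named 9, which a plain string sort would get wrong.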

Prerequisites

Make sure you have the following prerequisites before proceeding:

Docker installed on your machine.
Install tritonclient:
```
pip install tritonclient[all]
```

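Exporting YOLO11 to ONNX Format

Before deploying the model on Triton, it must be exported to ONNX format. ONNX (Open Neural Network Exchange) is a format that allows models to be transferred between different deep learning frameworks. Use the export function from the YOLO class: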

from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load an official model

# Retrieve metadata during export
metadata = []


def export_cb(exporter):
    metadata.append(exporter.metadata)


model.add_callback("on_export_end", export_cb)

# Export the model
onnx_file = model.export(format="onnx", dynamic=True)

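Setting Up Triton Model Repository

The Triton Model Repository is a storage location where Triton can access and load models.

Create the necessary directory structure: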

from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)

Move the exported ONNX model to the Triton repository:

from pathlib import Path

# Move ONNX model to Triton Model path
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")

# Create config file
(triton_model_path / "config.pbtxt").touch()

# (Optional) Enable TensorRT for GPU inference
# First run will be slow due to TensorRT engine conversion
data = """
optimization {
  execution_accelerators {
    gpu_execution_accelerator {
      name: "tensorrt"
      parameters {
        key: "precision_mode"
        value: "FP16"
      }
      parameters {
        key: "max_workspace_size_bytes"
        value: "3221225472"
      }
      parameters {
        key: "trt_engine_cache_enable"
        value: "1"
      }
      parameters {
        key: "trt_engine_cache_path"
        value: "/models/yolo/1"
      }
    }
  }
}
parameters {
  key: "metadata"
  value: {
    string_value: "%s"
  }
}
""" % metadata[0]

with open(triton_model_path / "config.pbtxt", "w") as f:
    f.write(data)
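
The % metadata[0] substitution above writes the Python repr of the export metadata dictionary into the string_value field. A minimal sketch of that round trip follows, assuming the consumer parses the field back as a Python literal (an assumption about the client side, not something this guide states). The sample_metadata values and the regular expression are illustrative only.

```python
import ast
import re

# Illustrative metadata; real values come from the export callback
sample_metadata = {"task": "detect", "stride": 32, "names": {0: "person", 1: "bicycle"}}

config_text = """
parameters {
  key: "metadata"
  value: {
    string_value: "%s"
  }
}
""" % sample_metadata

# Pull the quoted payload back out and parse it as a Python literal
payload = re.search(r'string_value: "(.*)"', config_text).group(1)
recovered = ast.literal_eval(payload)
print(recovered == sample_metadata)
```

This works because repr normally quotes strings with single quotes, which do not clash with the surrounding double quotes. A value containing an apostrophe would be rendered with double quotes and would need escaping before being written to the config.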

Running Triton Inference Server

Run the Triton Inference Server using Docker:

import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:24.09-py3"  # 8.57 GB

# Pull the image
subprocess.call(f"docker pull {tag}", shell=True)

# Run the Triton server and capture the container ID
# Docker bind mounts need an absolute host path, so resolve the relative repo path
container_id = (
    subprocess.check_output(
        f"docker run -d --rm --gpus 0 -v {triton_repo_path.resolve()}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)

# Wait for the Triton server to start
triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

# Wait until model is ready
for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)
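
The loop above gives the server roughly ten seconds, which may be too short when the first start also has to build a TensorRT engine. A minimal sketch of a deadline-based wait follows. The helper name wait_until_ready, its default timeout, and the fake_check stand-in are assumptions for illustration; in this guide the check would be a call to triton_client.is_model_ready(model_name).

```python
import time


def wait_until_ready(check, timeout=120.0, interval=1.0):
    """Poll check() until it returns True or the timeout elapses; exceptions count as not ready."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # server not reachable yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for the real readiness call: raises once, reports not ready once, then succeeds
attempts = iter([ConnectionError("starting"), False, True])


def fake_check():
    outcome = next(attempts)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


print(wait_until_ready(fake_check, timeout=5.0, interval=0.01))
```

With the client from this guide, the call would look like wait_until_ready(lambda: triton_client.is_model_ready(model_name)). The return value lets the caller fail loudly instead of running inference against a server that never became ready.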

Then run inference using the Triton Server model:

from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")
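
The model string passed to YOLO above is a URL made of a scheme, the server address, and the model name from the repository. The sketch below assembles and inspects such a URL. The helper triton_model_url is hypothetical, and the grpc scheme with port 8001 is an assumption here: the docker run command in this guide only publishes the HTTP port 8000.

```python
from urllib.parse import urlsplit


def triton_model_url(model_name, host="localhost", port=8000, scheme="http"):
    """Build a model URL of the form scheme://host:port/model_name."""
    if scheme not in ("http", "grpc"):
        raise ValueError(f"unsupported scheme: {scheme}")
    return f"{scheme}://{host}:{port}/{model_name}"


url = triton_model_url("yolo")
parts = urlsplit(url)
# The last path component must match the model directory name in the repository
print(url, parts.scheme, parts.netloc, parts.path.strip("/"))
```

Keeping the model name in one variable and deriving both the repository directory and the URL from it avoids a mismatch between the two.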

Clean up the container:

# Kill and remove the container at the end of the test
subprocess.call(f"docker kill {container_id}", shell=True)
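
If inference raises an exception, the docker kill line above is never reached and the container keeps running. One way to guard against that is a context manager that always stops the container. The sketch below is an illustration under assumptions: managed_container is a hypothetical helper, and the start and stop callables stand in for the docker run and docker kill commands used in this guide.

```python
import contextlib


@contextlib.contextmanager
def managed_container(start, stop):
    """Start a container, yield its ID, and always stop it afterwards."""
    container_id = start()
    try:
        yield container_id
    finally:
        stop(container_id)


# Stand-ins for the docker run / docker kill calls; they only record what happened
log = []
with managed_container(lambda: "abc123", log.append) as cid:
    log.append(f"inference against {cid}")
print(log)
```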

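By following the steps above, you can deploy and run Ultralytics YOLO11 models efficiently on Triton Inference Server, providing a scalable, high-performance solution for deep learning inference tasks. If you encounter any issues or have further questions, refer to the official Triton documentation or reach out to the Ultralytics community for support.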

FAQ

How do I set up Ultralytics YOLO11 with NVIDIA Triton Inference Server?

Setting up Ultralytics YOLO11 with NVIDIA Triton Inference Server involves a few key steps:

Export YOLO11 to ONNX format:

from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load an official model

# Export the model to ONNX format
onnx_file = model.export(format="onnx", dynamic=True)

Set up the Triton model repository:

from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")
(triton_model_path / "config.pbtxt").touch()
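
After these steps the repository should contain yolo/config.pbtxt and yolo/1/model.onnx. A small check like the following can catch a misplaced file before the server is started. It is a sketch only: missing_repo_entries is a hypothetical helper, and the temporary directory stands in for the real repository path.

```python
import tempfile
from pathlib import Path


def missing_repo_entries(repo_path, model_name, version="1"):
    """Return the expected repository files (relative paths) that do not exist."""
    expected = [
        Path(model_name) / "config.pbtxt",
        Path(model_name) / version / "model.onnx",
    ]
    return [p.as_posix() for p in expected if not (Path(repo_path) / p).is_file()]


# Illustrative repository where the ONNX file has not been moved into place yet
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "yolo" / "1").mkdir(parents=True)
    (Path(tmp) / "yolo" / "config.pbtxt").touch()
    print(missing_repo_entries(tmp, "yolo"))
```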

Run the Triton server:

import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:24.09-py3"

subprocess.call(f"docker pull {tag}", shell=True)

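# Mount the repository at /models (host_path:container_path); Docker needs an absolute host path
container_id = (
    subprocess.check_output(
        f"docker run -d --rm --gpus 0 -v {triton_repo_path.resolve()}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)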

triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)

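This setup helps you deploy YOLO11 models efficiently at scale on Triton Inference Server for high-performance AI model inference.

What benefits does using Ultralytics YOLO11 with NVIDIA Triton Inference Server offer?

Integrating Ultralytics YOLO11 with NVIDIA Triton Inference Server provides several advantages: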

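Scalable AI inference: Triton serves multiple models from a single server instance and supports dynamic model loading and unloading, which makes it highly scalable for diverse AI workloads.
High performance: Triton Inference Server is optimized for NVIDIA GPUs and delivers high-speed inference, which suits real-time applications such as object detection.
Ensemble and model versioning: Triton's ensemble mode combines multiple models to improve results, and its model versioning supports A/B testing and rolling updates.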

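For detailed instructions on setting up and running YOLO11 with Triton, refer to the setup guide.

Why should I export my YOLO11 model to ONNX format before using Triton Inference Server?

Exporting your Ultralytics YOLO11 model to ONNX (Open Neural Network Exchange) format before deploying it on NVIDIA Triton Inference Server offers several key benefits: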

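Interoperability: ONNX supports transfer between different deep learning frameworks (such as PyTorch and TensorFlow), ensuring broader compatibility.
Optimization: Many deployment environments, including Triton, optimize for ONNX, enabling faster inference and better performance.
Ease of deployment: ONNX is widely supported across frameworks and platforms, which simplifies deployment across operating systems and hardware configurations.

To export your model, use: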

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
onnx_file = model.export(format="onnx", dynamic=True)

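You can follow the steps in the export guide to complete the process.

Can I run inference using the Ultralytics YOLO11 model on Triton Inference Server?

Yes, you can run inference using the Ultralytics YOLO11 model on NVIDIA Triton Inference Server. Once your model is set up in the Triton model repository and the server is running, you can load the model and run inference as follows: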

from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")

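For an in-depth guide on setting up and running Triton Server with YOLO11, refer to the Running Triton Inference Server section.

How does Ultralytics YOLO11 compare to TensorFlow and PyTorch models for deployment?

Ultralytics YOLO11 offers several distinct advantages over TensorFlow and PyTorch models for deployment: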

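Real-time performance: YOLO11 is optimized for real-time object detection, offering state-of-the-art accuracy and speed, which makes it well suited to applications that require live video analytics.
Ease of use: YOLO11 integrates with Triton Inference Server and supports multiple export formats (ONNX, TensorRT, CoreML), making it flexible across deployment scenarios.
Advanced features: When served through Triton, YOLO11 deployments gain dynamic model loading, model versioning, and ensemble inference, which are crucial for scalable and reliable AI deployments.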

For more details, compare the deployment options in the model deployment guide.
