Triton Inference Server with Ultralytics YOLO11

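Triton Inference Server (formerly known as TensorRT Inference Server) is an open-source software solution developed by NVIDIA. It provides a cloud inference solution optimized for NVIDIA GPUs. Triton simplifies the deployment of AI models at scale in production. Integrating Ultralytics YOLO11 with Triton Inference Server lets you deploy scalable, high-performance deep learning inference workloads. This guide provides the steps to set up and test the integration.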

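What is Triton Inference Server?

Triton Inference Server is designed to deploy a variety of AI models in production. It supports a wide range of deep learning and machine learning frameworks, including TensorFlow, PyTorch, ONNX Runtime, and others. Its primary use cases are: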

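Serving multiple models from a single server instance
Dynamic model loading and unloading without restarting the server
Ensemble inference, which lets multiple models be used together to produce a result
Model versioning for A/B testing and rolling updates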

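Key Benefits of Triton Inference Server

Using Triton Inference Server with Ultralytics YOLO11 provides several advantages:

Automatic batching: Groups multiple AI requests together before processing them, reducing latency and improving inference speed
Kubernetes integration: Cloud-native design works seamlessly with Kubernetes for managing and scaling AI applications
Hardware-specific optimizations: Takes full advantage of NVIDIA GPUs for maximum performance
Framework flexibility: Supports multiple AI frameworks, including TensorFlow, PyTorch, ONNX, and TensorRT
Open-source and customizable: Can be modified to fit specific needs, ensuring flexibility for various AI applications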
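
To make the automatic batching idea concrete, here is a toy sketch in plain Python. It is not Triton's scheduler. The function name `group_into_batches` and the `max_batch_size` parameter are hypothetical and only illustrate how requests that arrive close together could be grouped so that the model runs once per group instead of once per request.

```python
from typing import Iterator, List, Sequence


def group_into_batches(requests: Sequence[str], max_batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most max_batch_size queued requests."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(requests), max_batch_size):
        yield list(requests[start : start + max_batch_size])


# Five hypothetical image requests that arrived close together
queued = ["img0.jpg", "img1.jpg", "img2.jpg", "img3.jpg", "img4.jpg"]
batches = list(group_into_batches(queued, max_batch_size=2))
print(batches)
```

A real server also has to decide how long to wait for a batch to fill, which this sketch leaves out.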

Prerequisites

Ensure you have the following prerequisites before proceeding:

Docker installed on your machine
Install tritonclient:
```
pip install tritonclient[all]
```
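
Before continuing, it can help to confirm that both prerequisites are visible from Python. The sketch below uses only the standard library; the helper name `check_prerequisites` is hypothetical. It only checks that the `docker` executable is on the PATH and that a `tritonclient` package can be located, not that either one works correctly.

```python
import importlib.util
import shutil


def check_prerequisites() -> dict:
    """Report whether the Docker CLI and the tritonclient package can be found."""
    return {
        "docker": shutil.which("docker") is not None,
        "tritonclient": importlib.util.find_spec("tritonclient") is not None,
    }


status = check_prerequisites()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```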

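Exporting YOLO11 to ONNX Format

Before deploying the model on Triton, it must be exported to the ONNX format. ONNX (Open Neural Network Exchange) is a format that allows models to be transferred between different deep learning frameworks. Use the export function from the YOLO class: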

from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load an official model

# Retrieve metadata during export. Metadata needs to be added to config.pbtxt. See next section.
metadata = []


def export_cb(exporter):
    metadata.append(exporter.metadata)


model.add_callback("on_export_end", export_cb)

# Export the model
onnx_file = model.export(format="onnx", dynamic=True)

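Setting Up the Triton Model Repository

The Triton Model Repository is a storage location where Triton can access and load models.

Create the necessary directory structure: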

from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = (Path("tmp") / "triton_repo").resolve()  # absolute path, so Docker treats it as a bind mount
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)
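
The layout this guide builds follows a fixed pattern: one directory per model, a numeric version directory inside it, and the configuration file next to the version directories. The sketch below recreates that layout in a temporary directory with empty placeholder files and lists it. The helper `list_repo_layout` is hypothetical and exists only for illustration.

```python
import tempfile
from pathlib import Path


def list_repo_layout(repo: Path) -> list:
    """Return every path under repo, relative to it, in sorted order."""
    return sorted(p.relative_to(repo).as_posix() for p in repo.rglob("*"))


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "triton_repo"
    version_dir = repo / "yolo" / "1"  # model name, then a numeric version directory
    version_dir.mkdir(parents=True)
    (version_dir / "model.onnx").touch()  # empty placeholder, not a real model
    (repo / "yolo" / "config.pbtxt").touch()  # configuration sits beside the version directories
    layout = list_repo_layout(repo)

print("\n".join(layout))
```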

Move the exported ONNX model to the Triton repository:

from pathlib import Path

# Move ONNX model to Triton Model path
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")

# Create config file
(triton_model_path / "config.pbtxt").touch()

data = """
# Add metadata
parameters {
  key: "metadata"
  value {
    string_value: "%s"
  }
}

# (Optional) Enable TensorRT for GPU inference
# First run will be slow due to TensorRT engine conversion
optimization {
  execution_accelerators {
    gpu_execution_accelerator {
      name: "tensorrt"
      parameters {
        key: "precision_mode"
        value: "FP16"
      }
      parameters {
        key: "max_workspace_size_bytes"
        value: "3221225472"
      }
      parameters {
        key: "trt_engine_cache_enable"
        value: "1"
      }
      parameters {
        key: "trt_engine_cache_path"
        value: "/models/yolo/1"
      }
    }
  }
}
""" % metadata[0]  # noqa

with open(triton_model_path / "config.pbtxt", "w") as f:
    f.write(data)
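
The configuration above embeds the metadata dictionary as text through `%s` formatting. The sketch below shows why that text can be turned back into a dictionary later. The metadata values are hypothetical stand-ins for what the export callback collects, and the regular expression is a simplification for illustration, not a parser for the configuration format.

```python
import ast
import re

# Hypothetical metadata; the real dictionary comes from the export callback
example_metadata = {"task": "detect", "stride": 32, "names": {0: "person", 1: "bicycle"}}

config_text = """
parameters {
  key: "metadata"
  value {
    string_value: "%s"
  }
}
""" % example_metadata

# Pull the quoted string back out and rebuild the dictionary from its repr
match = re.search(r'string_value: "(.*)"', config_text)
recovered = ast.literal_eval(match.group(1))
print(recovered == example_metadata)
```

This round trip relies on the dictionary's text form using single quotes inside the double-quoted `string_value`. A metadata value that itself contains quote characters could break this naive embedding.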

Running Triton Inference Server

Run the Triton Inference Server using Docker:

import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:24.09-py3"  # 8.57 GB

# Pull the image
subprocess.call(f"docker pull {tag}", shell=True)

# Run the Triton server and capture the container ID
container_id = (
    subprocess.check_output(
        f"docker run -d --rm --gpus 0 -v {triton_repo_path}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)

# Wait for the Triton server to start
triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

# Wait until model is ready
for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)
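
The loop above stops after ten attempts without reporting whether the model ever became ready. A small polling helper makes that outcome explicit. This is a sketch using only the standard library; the names `wait_until` and `fake_ready` are hypothetical, and the fake check stands in for a real readiness call.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll is_ready until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if is_ready():
                return True
        except Exception:
            pass  # the server may still be refusing connections
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a readiness check that succeeds on the third attempt
attempts = {"count": 0}


def fake_ready() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


ready = wait_until(fake_ready, timeout=5.0, interval=0.01)
print(ready, attempts["count"])
```

With the client created in this guide, the check could be `lambda: triton_client.is_model_ready(model_name)`, and a `False` return value signals that inference should not be attempted yet.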

Then run inference using the Triton Server model:

from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")
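
The URL passed to `YOLO` carries three pieces of information: the protocol, the server address (the port published by the Docker command), and the model name, which matches the model directory in the repository. The sketch below splits such a URL with the standard library. The function name `split_triton_url` is hypothetical, and the code illustrates the URL's structure rather than how Ultralytics parses it internally.

```python
from urllib.parse import urlsplit


def split_triton_url(url: str) -> dict:
    """Break a Triton model URL into scheme, server address, and model name."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "server": parts.netloc,
        "model_name": parts.path.strip("/").split("/")[0],
    }


info = split_triton_url("http://localhost:8000/yolo")
print(info)
```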

Clean up the container:

# Kill and remove the container at the end of the test
subprocess.call(f"docker kill {container_id}", shell=True)
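
If the test code raises an exception before the cleanup line runs, the container keeps running. One way to guard against that is a context manager that always kills the container on exit. This is a sketch, not part of the Ultralytics or Triton APIs: `triton_container` and its `run` parameter are hypothetical, and `run` exists only so the example can be tried with a fake runner instead of Docker. The Docker flags are the ones used earlier in this guide.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def triton_container(repo_path, tag, run=subprocess.check_output):
    """Start a detached Triton container and always kill it on exit."""
    start_cmd = [
        "docker", "run", "-d", "--rm", "--gpus", "0",
        "-v", f"{repo_path}:/models",
        "-p", "8000:8000",
        tag, "tritonserver", "--model-repository=/models",
    ]
    container_id = run(start_cmd).decode("utf-8").strip()
    try:
        yield container_id
    finally:
        run(["docker", "kill", container_id])


# Dry run with a fake runner so the sketch can be tried without Docker
calls = []


def fake_run(cmd):
    calls.append(cmd)
    return b"abc123\n"


with triton_container("/abs/path/triton_repo", "nvcr.io/nvidia/tritonserver:24.09-py3", run=fake_run) as cid:
    print("container:", cid)
```

Passing the command as a list also avoids `shell=True`, so paths containing spaces do not need extra quoting.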

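TensorRT Optimization (Optional)

For even greater performance, you can use TensorRT with Triton Inference Server. TensorRT is a high-performance deep learning optimizer built specifically for NVIDIA GPUs that can significantly increase inference speed.

Key benefits of using TensorRT with Triton include:

Up to 36x faster inference compared to unoptimized models
Hardware-specific optimizations for maximum GPU utilization
Support for reduced-precision formats (INT8, FP16) while maintaining accuracy
Layer fusion to reduce computational overhead

To use TensorRT directly, you can export your YOLO11 model to TensorRT format: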

from ultralytics import YOLO

# Load the YOLO11 model
model = YOLO("yolo11n.pt")

# Export the model to TensorRT format
model.export(format="engine")  # creates 'yolo11n.engine'

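For more information on TensorRT optimization, see the TensorRT integration guide.

By following the steps above, you can deploy and run Ultralytics YOLO11 models efficiently on Triton Inference Server, giving you a scalable, high-performance solution for deep learning inference tasks. If you run into issues or have further questions, refer to the official Triton documentation or reach out to the Ultralytics community for support.

FAQ

How do I set up Ultralytics YOLO11 with NVIDIA Triton Inference Server?

Setting up Ultralytics YOLO11 with NVIDIA Triton Inference Server involves a few key steps:

Export YOLO11 to ONNX format: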

from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load an official model

# Export the model to ONNX format
onnx_file = model.export(format="onnx", dynamic=True)

Set up the Triton Model Repository:

from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = (Path("tmp") / "triton_repo").resolve()  # absolute path, so Docker treats it as a bind mount
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")
(triton_model_path / "config.pbtxt").touch()

Run the Triton Server:

import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:24.09-py3"

subprocess.call(f"docker pull {tag}", shell=True)

container_id = (
    subprocess.check_output(
        f"docker run -d --rm --gpus 0 -v {triton_repo_path}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)

triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)

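This setup helps you deploy YOLO11 models efficiently and at scale on Triton Inference Server for high-performance AI model inference.

What benefits does using Ultralytics YOLO11 with NVIDIA Triton Inference Server offer?

Integrating Ultralytics YOLO11 with NVIDIA Triton Inference Server provides several advantages:

Scalable AI inference: Triton can serve multiple models from a single server instance and supports dynamic model loading and unloading, making it highly scalable for diverse AI workloads.
High performance: Triton Inference Server is optimized for NVIDIA GPUs and delivers high-speed inference, which suits real-time applications such as object detection.
Ensemble and model versioning: Triton's ensemble mode combines multiple models to improve results, and its model versioning supports A/B testing and rolling updates.
Automatic batching: Triton automatically groups multiple inference requests together, significantly improving throughput and reducing latency.
Simplified deployment: AI workflows can be optimized gradually without a complete system overhaul, which makes it easier to scale efficiently.

For detailed instructions on setting up and running YOLO11 with Triton, refer to the setup guide.

Why should I export my YOLO11 model to ONNX format before using Triton Inference Server?

Using the ONNX (Open Neural Network Exchange) format for your Ultralytics YOLO11 model before deploying it on NVIDIA Triton Inference Server offers several key benefits:

Interoperability: The ONNX format supports transfer between different deep learning frameworks (such as PyTorch and TensorFlow), ensuring broader compatibility.
Optimization: Many deployment environments, including Triton, are optimized for ONNX, enabling faster inference and better performance.
Ease of deployment: ONNX is widely supported across frameworks and platforms, which simplifies deployment on various operating systems and hardware configurations.
Framework independence: Once converted to ONNX, your model is no longer tied to its original framework, making it more portable.
Standardization: ONNX provides a standardized representation that helps overcome compatibility issues between different AI frameworks.

To export your model, use: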

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
onnx_file = model.export(format="onnx", dynamic=True)

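You can follow the steps in the ONNX integration guide to complete the process.

Can I run inference using the Ultralytics YOLO11 model on Triton Inference Server?

Yes, you can run inference using the Ultralytics YOLO11 model on NVIDIA Triton Inference Server. Once your model is set up in the Triton Model Repository and the server is running, you can load the model and run inference as follows: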

from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")

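This approach lets you take advantage of Triton's optimizations while using the familiar Ultralytics YOLO interface. For an in-depth guide on setting up and running Triton Server with YOLO11, refer to the Running Triton Inference Server section.

How does Ultralytics YOLO11 compare to TensorFlow and PyTorch models for deployment?

Compared with TensorFlow and PyTorch models for deployment, Ultralytics YOLO11 offers several unique advantages:

Real-time performance: YOLO11 is optimized for real-time object detection tasks and provides state-of-the-art accuracy and speed, making it ideal for applications that require live video analytics.
Ease of use: YOLO11 integrates seamlessly with Triton Inference Server and supports multiple export formats (ONNX, TensorRT, CoreML), making it flexible for various deployment scenarios.
Advanced features: When served through Triton, YOLO11 deployments gain dynamic model loading, model versioning, and ensemble inference, which are crucial for scalable and reliable AI deployments.
Simplified API: The Ultralytics API provides a consistent interface across different deployment targets, reducing the learning curve and development time.
Edge optimization: YOLO11 models are designed with edge deployment in mind and perform well even on resource-constrained devices.

For more details, compare the deployment options in the model export guide.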
