Triton Inference Server with Ultralytics YOLO11

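Triton Inference Server (formerly known as TensorRT Inference Server) is an open-source software solution developed by NVIDIA. It provides a cloud inference solution optimized for NVIDIA GPUs. Triton simplifies deploying AI models at scale in production. Integrating Ultralytics YOLO11 with Triton Inference Server lets you deploy scalable, high-performance deep learning inference workloads. This guide walks through the steps to set up and test the integration.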

Watch: Getting Started with NVIDIA Triton Inference Server.

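What is Triton Inference Server?

Triton Inference Server is designed to deploy a variety of AI models in production. It supports a wide range of deep learning and machine learning frameworks, including TensorFlow, PyTorch, ONNX Runtime, and many others. Its primary use cases are:

Serving multiple models from a single server instance.
Dynamic model loading and unloading without restarting the server.
Ensemble inference, where multiple models are used together to produce a result.
Model versioning for A/B testing and rolling updates.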

Prerequisites

Before proceeding, make sure the following prerequisites are in place:

Docker installed on your machine.
The tritonclient package installed:
```
pip install tritonclient[all]
```
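
Before going further, it can help to confirm that both prerequisites are visible from Python. The following is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper name, and it only checks that the `docker` executable is on `PATH` and that `tritonclient` is importable, not that the Docker daemon is running.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report whether the Docker CLI and the tritonclient package can be found."""
    return {
        # shutil.which returns None when the executable is not on PATH
        "docker": shutil.which("docker") is not None,
        # find_spec returns None when the package is not installed
        "tritonclient": importlib.util.find_spec("tritonclient") is not None,
    }


for name, found in check_prerequisites().items():
    print(f"{name}: {'found' if found else 'missing'}")
```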

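Exporting YOLO11 to ONNX Format

Before deploying the model on Triton, it must be exported to the ONNX format. ONNX (Open Neural Network Exchange) is a format that allows models to be transferred between different deep learning frameworks. Use the export function from the YOLO class: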

from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load an official model

# Retrieve metadata during export
metadata = []


def export_cb(exporter):
    metadata.append(exporter.metadata)


model.add_callback("on_export_end", export_cb)

# Export the model
onnx_file = model.export(format="onnx", dynamic=True)

Setting Up Triton Model Repository

The Triton Model Repository is a storage location where Triton can access and load models.

Create the necessary directory structure:

from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)
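
Triton expects each model to live at `<repository>/<model_name>/<version>/<model_file>`, with integer version directory names. As a rough sanity check, the sketch below reports what is missing from such a layout. `check_layout` is a hypothetical helper written for this guide, and the default file name `model.onnx` is an assumption that matches the name used in this guide.

```python
import tempfile
from pathlib import Path


def check_layout(repo_path, model_name, model_file="model.onnx"):
    """Return a list of problems found in <repo>/<model>/<version>/<model_file>."""
    model_dir = Path(repo_path) / model_name
    if not model_dir.is_dir():
        return [f"missing model directory: {model_dir}"]
    # Version directories are named with integers, e.g. "1", "2"
    versions = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
    if not versions:
        return [f"no numeric version directory under {model_dir}"]
    return [f"missing {v / model_file}" for v in sorted(versions) if not (v / model_file).is_file()]


# Demonstrate on a throwaway repository
with tempfile.TemporaryDirectory() as tmp:
    version_dir = Path(tmp) / "yolo" / "1"
    version_dir.mkdir(parents=True)
    print(check_layout(tmp, "yolo"))  # one problem: the model file is not there yet
    (version_dir / "model.onnx").touch()
    print(check_layout(tmp, "yolo"))  # empty list: the layout is complete
```

Running the same check against `triton_repo_path` after the model file has been moved into place should return an empty list.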

Move the exported ONNX model to the Triton repository and write its config file:

from pathlib import Path

# Move ONNX model to Triton Model path
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")

# Create config file
(triton_model_path / "config.pbtxt").touch()

# (Optional) Enable TensorRT for GPU inference
# First run will be slow due to TensorRT engine conversion
data = """
optimization {
  execution_accelerators {
    gpu_execution_accelerator {
      name: "tensorrt"
      parameters {
        key: "precision_mode"
        value: "FP16"
      }
      parameters {
        key: "max_workspace_size_bytes"
        value: "3221225472"
      }
      parameters {
        key: "trt_engine_cache_enable"
        value: "1"
      }
      parameters {
        key: "trt_engine_cache_path"
        value: "/models/yolo/1"
      }
    }
  }
}
parameters {
  key: "metadata"
  value: {
    string_value: "%s"
  }
}
""" % metadata[0]

with open(triton_model_path / "config.pbtxt", "w") as f:
    f.write(data)
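
The optimization block above is hand-written text, which makes it easy to unbalance a brace when changing a parameter. One possible alternative, sketched here, is to render the block from a dictionary. `render_accelerator` is a hypothetical helper, not part of Triton or Ultralytics, and it covers only the `gpu_execution_accelerator` section with the same four parameters used above.

```python
def render_accelerator(name, parameters):
    """Render an optimization block for one GPU execution accelerator."""
    lines = [
        "optimization {",
        "  execution_accelerators {",
        "    gpu_execution_accelerator {",
        f'      name: "{name}"',
    ]
    # Each parameter becomes its own key/value sub-block, with string values
    for key, value in parameters.items():
        lines += [
            "      parameters {",
            f'        key: "{key}"',
            f'        value: "{value}"',
            "      }",
        ]
    lines += ["    }", "  }", "}"]
    return "\n".join(lines) + "\n"


trt_block = render_accelerator(
    "tensorrt",
    {
        "precision_mode": "FP16",
        "max_workspace_size_bytes": 3 * 1024**3,  # 3221225472 bytes = 3 GiB
        "trt_engine_cache_enable": 1,
        "trt_engine_cache_path": "/models/yolo/1",
    },
)
print(trt_block)
```

The rendered text can then be concatenated with the metadata `parameters` block before writing `config.pbtxt`.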

Running Triton Inference Server

Run the Triton Inference Server using Docker:

import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:24.09-py3"  # 8.57 GB

# Pull the image
subprocess.call(f"docker pull {tag}", shell=True)

# Run the Triton server and capture the container ID
container_id = (
    subprocess.check_output(
        # resolve(): Docker bind mounts require an absolute host path
        f"docker run -d --rm --gpus 0 -v {triton_repo_path.resolve()}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)

# Wait for the Triton server to start
triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

# Wait until model is ready
for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)
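
The loop above stops quietly after ten attempts, so a model that never becomes ready only shows up later as an inference error. A possible variant, sketched below with the standard library only, raises `TimeoutError` instead. `wait_until` is a hypothetical helper, and the stand-in predicate `fake_ready` replaces the real readiness call so the example runs without a server.

```python
import time


def wait_until(predicate, timeout=30.0, interval=1.0):
    """Poll predicate() until it returns True; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            if predicate():
                return
        except Exception as exc:  # e.g. connection refused while the server boots
            last_error = exc
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout} s") from last_error


# Stand-in predicate that becomes true on the third call
calls = {"n": 0}


def fake_ready():
    calls["n"] += 1
    return calls["n"] >= 3


wait_until(fake_ready, timeout=5, interval=0.01)
print(f"ready after {calls['n']} checks")  # ready after 3 checks
```

With the client from this section, the call would look like `wait_until(lambda: triton_client.is_model_ready(model_name))`.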

Then run inference using the Triton Server model:

from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")
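
The model URL in this example reads as `<scheme>://<host>:<port>/<model_name>`, where the last part matches the model directory name in the repository (`yolo` here). The sketch below only illustrates that breakdown with the standard library. `split_triton_url` is a hypothetical helper and is not claimed to be how Ultralytics parses the URL internally.

```python
from urllib.parse import urlsplit


def split_triton_url(url):
    """Split '<scheme>://<host>:<port>/<model_name>' into its three parts."""
    parts = urlsplit(url)
    # The first path segment is taken as the model name
    model_name = parts.path.strip("/").split("/")[0]
    return parts.scheme, parts.netloc, model_name


print(split_triton_url("http://localhost:8000/yolo"))  # ('http', 'localhost:8000', 'yolo')
```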

Clean up the container:

# Kill and remove the container at the end of the test
subprocess.call(f"docker kill {container_id}", shell=True)
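
If any step between starting the server and this cleanup raises an exception, the `docker kill` call is never reached and the container keeps running. One way to guard against that, sketched here, is a small context manager that always runs the cleanup. `running_container` and its `start`/`stop` arguments are hypothetical names, and stand-in callables are used so the example runs without Docker.

```python
import contextlib


@contextlib.contextmanager
def running_container(start, stop):
    """Call start() to get a container ID and always call stop(id) on exit."""
    container_id = start()
    try:
        yield container_id
    finally:
        stop(container_id)


# Stand-in callables; real ones would wrap `docker run -d ...` and `docker kill`
events = []

with contextlib.suppress(RuntimeError):
    with running_container(lambda: "abc123", lambda cid: events.append(f"kill {cid}")) as cid:
        events.append(f"using {cid}")
        raise RuntimeError("inference failed")

print(events)  # ['using abc123', 'kill abc123']
```

The cleanup entry appears even though the body raised, which is the behavior wanted for the real container.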

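By following the steps above, you can deploy and run Ultralytics YOLO11 models on Triton Inference Server, giving you a scalable, high-performance setup for deep learning inference. If you run into problems or have further questions, refer to the official Triton documentation or ask the Ultralytics community for support.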

FAQ

How do I set up Ultralytics YOLO11 with NVIDIA Triton Inference Server?

Setting up Ultralytics YOLO11 with NVIDIA Triton Inference Server involves a few key steps:

Export YOLO11 to ONNX format:

from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load an official model

# Export the model to ONNX format
onnx_file = model.export(format="onnx", dynamic=True)

Set up the Triton Model Repository:

from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")
(triton_model_path / "config.pbtxt").touch()

Run the Triton Server:

import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:24.09-py3"

subprocess.call(f"docker pull {tag}", shell=True)

container_id = (
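    subprocess.check_output(
        # Mount syntax is <host_path>:<container_path>; resolve() gives Docker an absolute host path
        f"docker run -d --rm --gpus 0 -v {triton_repo_path.resolve()}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",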
        shell=True,
    )
    .decode("utf-8")
    .strip()
)

triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)

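This setup lets you deploy YOLO11 models at scale on Triton Inference Server for high-performance AI model inference.

What benefits does using Ultralytics YOLO11 with NVIDIA Triton Inference Server offer?

Integrating Ultralytics YOLO11 with NVIDIA Triton Inference Server provides several advantages:

Scalable AI inference: Triton can serve multiple models from a single server instance and supports dynamic model loading and unloading, which makes it highly scalable for diverse AI workloads.
High performance: Triton Inference Server is optimized for NVIDIA GPUs and delivers fast inference, which suits real-time applications such as object detection.
Ensembles and model versioning: Triton's ensemble mode can combine multiple models to improve results, and its model versioning supports A/B testing and rolling updates.

For detailed instructions on setting up and running YOLO11 with Triton, see the setup guide.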

Why should I export my YOLO11 model to ONNX format before using Triton Inference Server?

Using the ONNX (Open Neural Network Exchange) format for your Ultralytics YOLO11 model before deploying it on NVIDIA Triton Inference Server offers several key benefits:

Interoperability: The ONNX format supports transfer between different deep learning frameworks (such as PyTorch and TensorFlow), ensuring broader compatibility.
Optimization: Many deployment environments, including Triton, are optimized for ONNX, enabling faster inference and better performance.
Ease of deployment: ONNX is widely supported across frameworks and platforms, which simplifies deployment on different operating systems and hardware configurations.

To export your model, use:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
onnx_file = model.export(format="onnx", dynamic=True)

You can follow the steps in the export guide to complete the process.

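Can I run inference using the Ultralytics YOLO11 model on Triton Inference Server?

Yes, you can run inference with an Ultralytics YOLO11 model on NVIDIA Triton Inference Server. Once the model is set up in the Triton Model Repository and the server is running, you can load the model and run inference as follows: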

from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")

For an in-depth guide on setting up and running Triton Server with YOLO11, see the Running Triton Inference Server section.

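How does Ultralytics YOLO11 compare to TensorFlow and PyTorch models for deployment?

Ultralytics YOLO11 offers several advantages for deployment compared with TensorFlow and PyTorch models:

Real-time performance: YOLO11 is optimized for real-time object detection tasks and combines state-of-the-art accuracy with speed, which suits applications that need live video analytics.
Ease of use: YOLO11 integrates with Triton Inference Server and supports a range of export formats (ONNX, TensorRT, CoreML), so it adapts to different deployment scenarios.
Advanced features: When served through Triton, YOLO11 deployments can use dynamic model loading, model versioning, and ensemble inference, which are important for scalable and reliable AI deployments.

For more details, compare the deployment options in the model deployment guide.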

📅 Created 1 year ago ✏️ Updated 6 days ago