Triton Inference Server with Ultralytics YOLO11

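Triton Inference Server (formerly known as TensorRT Inference Server) is an open-source software solution developed by NVIDIA. It provides a cloud inference solution optimized for NVIDIA GPUs and simplifies the deployment of AI models at scale in production. Integrating Ultralytics YOLO11 with Triton Inference Server lets you deploy scalable, high-performance deep learning inference workloads. This guide provides the steps to set up and test the integration.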

Watch: Getting Started with NVIDIA Triton Inference Server.

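What is Triton Inference Server?

Triton Inference Server is designed to deploy a variety of AI models in production. It supports a wide range of deep learning and machine learning frameworks, including TensorFlow, PyTorch, ONNX Runtime, and many others. Its primary use cases are:

Serving multiple models from a single server instance.
Dynamic model loading and unloading without a server restart.
Ensemble inference, which lets multiple models be used together to produce a result.
Model versioning for A/B testing and rolling updates.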
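
Model versioning relies on numbered subdirectories under each model directory. As a minimal sketch, assuming the server's default policy of serving the highest-numbered version, the helper below picks that version from a throwaway directory. The name `latest_version` is hypothetical and is not part of Triton or Ultralytics.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_version(model_dir: Path) -> Optional[int]:
    """Return the highest numeric version subdirectory, or None if there is none."""
    versions = [int(p.name) for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    return max(versions, default=None)


# Build a throwaway model directory with versions 1, 2 and 10
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "yolo"
    for version in ("1", "2", "10"):
        (model_dir / version).mkdir(parents=True)
    newest = latest_version(model_dir)
    print(newest)
```

Versions are compared as integers rather than strings, so "10" ranks above "2".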

Prerequisites

Ensure you have the following prerequisites before proceeding:

Docker installed on your machine.
Install tritonclient:
```
pip install tritonclient[all]
```

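Exporting YOLO11 to ONNX Format

Before deploying the model on Triton, it must be exported to the ONNX format. ONNX (Open Neural Network Exchange) is a format that allows models to be transferred between different deep learning frameworks. Use the export function from the YOLO class: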

from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load an official model

# Retrieve metadata during export
metadata = []


def export_cb(exporter):
    metadata.append(exporter.metadata)


model.add_callback("on_export_end", export_cb)

# Export the model
onnx_file = model.export(format="onnx", dynamic=True)

Setting Up the Triton Model Repository

The Triton Model Repository is a storage location where Triton can access and load models.

Create the necessary directory structure:

from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)
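
The layout this section builds is `<repo>/<model_name>/<version>/model.onnx`, with `config.pbtxt` next to the version directory. The following is a minimal sketch for checking that layout before starting the server. The helper `missing_files` is hypothetical, and it only checks for the two files this guide places in the repository.

```python
import tempfile
from pathlib import Path


def missing_files(repo: Path, model_name: str, version: str = "1") -> list:
    """List the expected files that are absent from a Triton-style model directory."""
    expected = [
        repo / model_name / "config.pbtxt",
        repo / model_name / version / "model.onnx",
    ]
    return [p.relative_to(repo).as_posix() for p in expected if not p.is_file()]


# Demonstrate on a throwaway repository: first empty, then populated
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "triton_repo"
    (repo / "yolo" / "1").mkdir(parents=True)
    before = missing_files(repo, "yolo")
    (repo / "yolo" / "config.pbtxt").touch()
    (repo / "yolo" / "1" / "model.onnx").touch()
    after = missing_files(repo, "yolo")

print(before, after)
```

An empty result means both files are where the server is expected to look for them.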

Move the exported ONNX model to the Triton repository:

from pathlib import Path

# Move ONNX model to Triton Model path
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")

# Create config file
(triton_model_path / "config.pbtxt").touch()

# (Optional) Enable TensorRT for GPU inference
# First run will be slow due to TensorRT engine conversion
data = """
optimization {
  execution_accelerators {
    gpu_execution_accelerator {
      name: "tensorrt"
      parameters {
        key: "precision_mode"
        value: "FP16"
      }
      parameters {
        key: "max_workspace_size_bytes"
        value: "3221225472"
      }
      parameters {
        key: "trt_engine_cache_enable"
        value: "1"
      }
      parameters {
        key: "trt_engine_cache_path"
        value: "/models/yolo/1"
      }
    }
  }
}
parameters {
  key: "metadata"
  value: {
    string_value: "%s"
  }
}
""" % metadata[0]

with open(triton_model_path / "config.pbtxt", "w") as f:
    f.write(data)
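
The TensorRT accelerator settings above are repeated `parameters` blocks of key/value strings. As a sketch only, such blocks could be generated from a dictionary instead of being written by hand. The helper `render_parameters` is hypothetical, and the values are the ones used in this guide. The workspace size `3221225472` is 3 GiB expressed in bytes.

```python
def render_parameters(params: dict, indent: str = "      ") -> str:
    """Render key/value pairs as repeated pbtxt `parameters { ... }` blocks."""
    blocks = []
    for key, value in params.items():
        blocks.append(
            f"{indent}parameters {{\n"
            f'{indent}  key: "{key}"\n'
            f'{indent}  value: "{value}"\n'
            f"{indent}}}"
        )
    return "\n".join(blocks)


trt_params = {
    "precision_mode": "FP16",
    "max_workspace_size_bytes": str(3 * 1024**3),  # 3 GiB in bytes
    "trt_engine_cache_enable": "1",
    "trt_engine_cache_path": "/models/yolo/1",
}
rendered = render_parameters(trt_params)
print(rendered)
```

The rendered text is meant to be pasted inside the `gpu_execution_accelerator` block. It does not validate key names against what the TensorRT accelerator accepts.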

Running Triton Inference Server

Run the Triton Inference Server using Docker:

import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:24.09-py3"  # 8.57 GB

# Pull the image
subprocess.call(f"docker pull {tag}", shell=True)

# Run the Triton server and capture the container ID
container_id = (
    subprocess.check_output(
        f"docker run -d --rm --gpus 0 -v {triton_repo_path.resolve()}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)

# Wait for the Triton server to start
triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

# Wait until model is ready
for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)
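
The loop above gives up silently after about ten seconds, which may be too short when TensorRT builds its engine on the first run. A minimal sketch of a polling helper with an explicit timeout and a return value follows. The name `wait_until` is hypothetical, and `fake_ready` stands in for a call such as `triton_client.is_model_ready(model_name)`.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # treat errors (e.g. connection refused) as "not ready yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in readiness check: raises twice, then reports ready
attempts = {"n": 0}


def fake_ready() -> bool:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("server not up yet")
    return True


ready = wait_until(fake_ready, timeout=5.0, interval=0.01)
print(ready, attempts["n"])
```

With a real client, the call might look like `wait_until(lambda: triton_client.is_model_ready(model_name), timeout=300)`, where the timeout value is an assumption to tune for your hardware.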

Then run inference using the Triton Server model:

from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")
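
The URL passed to `YOLO` carries the protocol, the server address, and the model name, and the model name has to match the directory created in the model repository ("yolo" in this guide). The sketch below shows one way to decompose such a URL with the standard library. It is an illustration, not necessarily how Ultralytics parses the URL internally, and `split_triton_url` is a hypothetical name.

```python
from urllib.parse import urlsplit


def split_triton_url(url: str) -> tuple:
    """Split a URL such as http://localhost:8000/yolo into (scheme, host:port, model name)."""
    parts = urlsplit(url)
    model = parts.path.strip("/").split("/")[0]
    return parts.scheme, parts.netloc, model


scheme, server, endpoint = split_triton_url("http://localhost:8000/yolo")
print(scheme, server, endpoint)
```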

Clean up the container:

# Kill and remove the container at the end of the test
subprocess.call(f"docker kill {container_id}", shell=True)
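
If inference raises an exception, the cleanup line above is never reached and the container keeps running. One possible approach, sketched here under that assumption, is a context manager that always issues `docker kill`. The name `running_container` is hypothetical, and the demonstration uses a recording stand-in instead of a real Docker call.

```python
import contextlib
import subprocess


@contextlib.contextmanager
def running_container(container_id: str, run=subprocess.call):
    """Yield the container ID and kill the container on exit, even after an error."""
    try:
        yield container_id
    finally:
        run(["docker", "kill", container_id])


# Demonstration: record the command instead of executing it
calls = []
with contextlib.suppress(RuntimeError):
    with running_container("abc123", run=calls.append):
        raise RuntimeError("inference failed")

print(calls)
```

In real use, the default `run=subprocess.call` would execute the command, and the inference code would go inside the `with` block.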

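By following the steps above, you can deploy and run Ultralytics YOLO11 models on Triton Inference Server, giving you a scalable, high-performance solution for deep learning inference. If you run into problems or have further questions, refer to the official Triton documentation or reach out to the Ultralytics community for support.

FAQ

How do I set up Ultralytics YOLO11 with NVIDIA Triton Inference Server?

Setting up Ultralytics YOLO11 with NVIDIA Triton Inference Server involves a few key steps:

Export YOLO11 to ONNX format: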

from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load an official model

# Export the model to ONNX format
onnx_file = model.export(format="onnx", dynamic=True)

Set up the Triton Model Repository:

from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")
(triton_model_path / "config.pbtxt").touch()

Run the Triton Server:

import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:24.09-py3"

subprocess.call(f"docker pull {tag}", shell=True)

container_id = (
    subprocess.check_output(
        f"docker run -d --rm --gpus 0 -v {triton_repo_path.resolve()}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)

triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)

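This setup lets you deploy YOLO11 models at scale on Triton Inference Server for high-performance AI model inference.

What benefits does using Ultralytics YOLO11 with NVIDIA Triton Inference Server offer?

Integrating Ultralytics YOLO11 with NVIDIA Triton Inference Server provides several advantages:

Scalable AI inference: Triton can serve multiple models from a single server instance and supports dynamic model loading and unloading, so it scales across diverse AI workloads.
High performance: Triton Inference Server is optimized for NVIDIA GPUs and delivers high-speed inference, which suits real-time applications such as object detection.
Ensembles and model versioning: Triton's ensemble mode lets you combine multiple models to improve results, and its model versioning supports A/B testing and rolling updates.

For detailed instructions on setting up and running YOLO11 with Triton, refer to the setup guide.

Why should I export my YOLO11 model to ONNX format before using Triton Inference Server?

Using the ONNX (Open Neural Network Exchange) format for your Ultralytics YOLO11 model before deploying it on NVIDIA Triton Inference Server offers several key benefits:

Interoperability: The ONNX format supports transfer between different deep learning frameworks (such as PyTorch and TensorFlow), which ensures broader compatibility.
Optimization: Many deployment environments, including Triton, are optimized for ONNX, enabling faster inference and better performance.
Ease of deployment: ONNX is widely supported across frameworks and platforms, which simplifies deployment on different operating systems and hardware configurations.

To export your model, use: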

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
onnx_file = model.export(format="onnx", dynamic=True)

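You can follow the steps in the export guide to complete the process.

Can I run inference using the Ultralytics YOLO11 model on Triton Inference Server?

Yes, you can run inference using the Ultralytics YOLO11 model on NVIDIA Triton Inference Server. Once the model is set up in the Triton Model Repository and the server is running, you can load the model and run inference as follows: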

from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")

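For an in-depth guide on setting up and running Triton Server with YOLO11, refer to the Running Triton Inference Server section.

How does Ultralytics YOLO11 compare to TensorFlow and PyTorch models for deployment?

Ultralytics YOLO11 offers several advantages over TensorFlow and PyTorch models for deployment:

Real-time performance: YOLO11 is optimized for real-time object detection and offers state-of-the-art accuracy and speed, which makes it well suited to applications that need live video analytics.
Ease of use: YOLO11 integrates with Triton Inference Server and supports diverse export formats (ONNX, TensorRT, CoreML), which makes it flexible across deployment scenarios.
Advanced features: When served through Triton, YOLO11 deployments can use dynamic model loading, model versioning, and ensemble inference, which are important for scalable and reliable AI deployments.

For more details, compare the deployment options in the model deployment guide.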
