Triton Inference Server with Ultralytics YOLO11
The Triton Inference Server (formerly known as TensorRT Inference Server) is an open-source software solution developed by NVIDIA. It provides a cloud inference solution optimized for NVIDIA GPUs. Triton simplifies the deployment of AI models at scale in production. Integrating Ultralytics YOLO11 with Triton Inference Server allows you to deploy scalable, high-performance deep learning inference workloads. This guide provides steps to set up and test the integration.
Watch: Getting Started with NVIDIA Triton Inference Server.
What is Triton Inference Server?
Triton Inference Server is designed to deploy a variety of AI models in production. It supports a wide range of deep learning and machine learning frameworks, including TensorFlow, PyTorch, ONNX Runtime, and many others. Its primary use cases are:
- Serving multiple models from a single server instance.
- Dynamic model loading and unloading without a server restart.
- Ensemble inference, which lets multiple models be used together to produce a result.
- Model versioning for A/B testing and rolling updates.
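Model versioning in Triton is based on numbered subdirectories inside each model's directory (the "1" directory used later in this guide is version 1). As far as the default version policy goes, Triton serves the highest-numbered version. The sketch below only illustrates that selection rule on a local directory tree; `latest_version` and the directory names are hypothetical helpers, not part of Triton or Ultralytics.

```python
import tempfile
from pathlib import Path


def latest_version(model_dir):
    """Return the highest numeric version subdirectory name, or None if there is none."""
    versions = [int(p.name) for p in Path(model_dir).iterdir() if p.is_dir() and p.name.isdigit()]
    return str(max(versions)) if versions else None


# Hypothetical model directory holding three versions
model_dir = Path(tempfile.mkdtemp()) / "yolo"
for v in ("1", "2", "10"):
    (model_dir / v).mkdir(parents=True)

print(latest_version(model_dir))
```

Versions are compared as integers rather than strings, so "10" ranks above "2".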
Prerequisites
Ensure you have the following prerequisites before proceeding:
- Docker installed on your machine.
- The tritonclient package installed, for example with pip install tritonclient[all].
Exporting YOLO11 to ONNX Format
Before deploying the model on Triton, it must be exported to the ONNX format. ONNX (Open Neural Network Exchange) is a format that allows models to be transferred between different deep learning frameworks. Use the export function of the YOLO class:
from ultralytics import YOLO
# Load a model
model = YOLO("yolo11n.pt") # load an official model
# Export the model
onnx_file = model.export(format="onnx", dynamic=True)
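Setting Up Triton Model Repository
The Triton Model Repository is a storage location where Triton can access and load models. The later snippets refer to the repository path as triton_repo_path and to the model name as model_name.
- Create the necessary directory structure: a directory named after the model that contains a numbered version subdirectory (for example yolo/1).
- Move the exported ONNX model into the version directory as model.onnx and add a config.pbtxt file next to the version directory. The FAQ at the end of this guide shows the corresponding code.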
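The two repository steps can be sketched as a small helper. This is a minimal illustration that mirrors the paths used in the FAQ snippet (model name "yolo", version "1", an empty config.pbtxt); `build_triton_repo` is a hypothetical name, and the demonstration uses a placeholder file in a temporary directory instead of a real ONNX export.

```python
import shutil
import tempfile
from pathlib import Path


def build_triton_repo(onnx_file, repo_root, model_name="yolo", version="1"):
    """Place an exported ONNX file at <repo_root>/<model_name>/<version>/model.onnx."""
    model_dir = Path(repo_root) / model_name
    version_dir = model_dir / version
    version_dir.mkdir(parents=True, exist_ok=True)
    # shutil.move also works across filesystems, unlike Path.rename
    shutil.move(str(onnx_file), str(version_dir / "model.onnx"))
    # config.pbtxt is created empty here, as in the FAQ snippet
    (model_dir / "config.pbtxt").touch()
    return model_dir


# Demonstration with a placeholder file standing in for the real export
workdir = Path(tempfile.mkdtemp())
fake_export = workdir / "yolo11n.onnx"
fake_export.write_bytes(b"placeholder")

model_dir = build_triton_repo(fake_export, workdir / "triton_repo")
layout = sorted(p.relative_to(model_dir).as_posix() for p in model_dir.rglob("*"))
print(layout)
```

In a real run, the value returned by model.export(...) would be passed in place of the placeholder file.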
Running Triton Inference Server
Run the Triton Inference Server using Docker:
import contextlib
import subprocess
import time
from tritonclient.http import InferenceServerClient
# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:23.09-py3" # 6.4 GB
# Pull the image
subprocess.call(f"docker pull {tag}", shell=True)
# Run the Triton server and capture the container ID
container_id = (
    subprocess.check_output(
        f"docker run -d --rm -v {triton_repo_path}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)
# Wait for the Triton server to start
triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)
# Wait until model is ready
for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)
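The readiness loop above carries on silently if the model never becomes ready. One possible variation, sketched here under the assumption that the caller wants an explicit result, wraps the polling in a helper that returns True or False. `wait_until_ready` and `fake_check` are hypothetical names; in practice `check` would be something like `lambda: triton_client.is_model_ready(model_name)`.

```python
import time


def wait_until_ready(check, attempts=10, delay=1.0):
    """Call check() up to `attempts` times; return True as soon as it returns a truthy value."""
    for _ in range(attempts):
        try:
            if check():
                return True
        except Exception:
            # The server may still be starting and refusing connections
            pass
        time.sleep(delay)
    return False


# Stand-in for the Triton readiness call: fails twice, then succeeds
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server not up yet")
    return True


ready = wait_until_ready(fake_check, attempts=5, delay=0.01)
print(ready, calls["n"])
```

Returning a boolean lets the calling code stop early, for example by raising an error, instead of sending inference requests to a server that is not ready.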
Then run inference using the Triton Server model:
from ultralytics import YOLO
# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")
# Run inference on the server
results = model("path/to/image.jpg")
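The string passed to YOLO above packs three pieces of information into one URL: the protocol, the server address, and the model name in the repository. The sketch below shows how such a URL decomposes using only the standard library. It illustrates the URL convention and is not the Ultralytics implementation; `split_triton_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def split_triton_url(url):
    """Split '<scheme>://<host>:<port>/<model_name>' into its three parts."""
    parts = urlsplit(url)
    scheme = parts.scheme  # "http" in this guide
    server = parts.netloc  # host and port, e.g. "localhost:8000"
    model_name = parts.path.strip("/").split("/")[0]
    return scheme, server, model_name


print(split_triton_url("http://localhost:8000/yolo"))
```

The last component has to match the model directory name in the Triton repository, which is "yolo" in this guide.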
Clean up the container:
# Kill and remove the container at the end of the test
subprocess.call(f"docker kill {container_id}", shell=True)
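If the inference step raises an exception, the kill command above is never reached and the container keeps running. A minimal sketch of one way to guard against that is a context manager that always issues docker kill on exit. `running_container` and its `run` parameter are hypothetical; `run` exists only so the example can be exercised without Docker.

```python
import contextlib
import subprocess


@contextlib.contextmanager
def running_container(container_id, run=subprocess.call):
    """Yield the container ID and always issue `docker kill` on exit."""
    try:
        yield container_id
    finally:
        # An argument list avoids shell quoting issues
        run(["docker", "kill", container_id])


# Demonstration with a recording stand-in instead of a real Docker call
issued = []
try:
    with running_container("abc123", run=issued.append):
        raise RuntimeError("inference failed")
except RuntimeError:
    pass

print(issued)
```

With the default `run`, the same block would call Docker directly, so inference code placed inside the `with` body is followed by cleanup whether or not it succeeds.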
By following the above steps, you can deploy and run Ultralytics YOLO11 models efficiently on Triton Inference Server, providing a scalable and high-performance solution for deep learning inference tasks. If you face any issues or have further queries, refer to the official Triton documentation or reach out to the Ultralytics community for support.
FAQ
How do I set up Ultralytics YOLO11 with NVIDIA Triton Inference Server?
Setting up Ultralytics YOLO11 with NVIDIA Triton Inference Server involves a few key steps:
- Export YOLO11 to ONNX format by calling model.export(format="onnx", dynamic=True), as shown in the export section above.
- Set up the Triton model repository:
from pathlib import Path
# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name
# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")
(triton_model_path / "config.pbtxt").touch()
- Run the Triton server:
import contextlib
import subprocess
import time
from tritonclient.http import InferenceServerClient
# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:23.09-py3"
subprocess.call(f"docker pull {tag}", shell=True)
# Docker bind mounts need an absolute host path, hence resolve()
container_id = (
    subprocess.check_output(
        f"docker run -d --rm -v {triton_repo_path.resolve()}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)
triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)
# Wait until model is ready
for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)
This setup can help you efficiently deploy YOLO11 models at scale on Triton Inference Server for high-performance AI model inference.
What benefits does using Ultralytics YOLO11 with NVIDIA Triton Inference Server offer?
Integrating Ultralytics YOLO11 with NVIDIA Triton Inference Server provides several advantages:
- Scalable AI Inference: Triton can serve multiple models from a single server instance and supports dynamic model loading and unloading, which makes it highly scalable for diverse AI workloads.
- High Performance: Optimized for NVIDIA GPUs, Triton Inference Server ensures high-speed inference operations, perfect for real-time applications such as object detection.
- Ensemble and Model Versioning: Triton's ensemble mode lets you combine multiple models to improve results, and its model versioning supports A/B testing and rolling updates.
For detailed instructions on setting up and running YOLO11 with Triton, you can refer to the setup guide.
Why should I export my YOLO11 model to ONNX format before using Triton Inference Server?
Using ONNX (Open Neural Network Exchange) format for your Ultralytics YOLO11 model before deploying it on NVIDIA Triton Inference Server offers several key benefits:
- Interoperability: The ONNX format supports transfer between different deep learning frameworks (such as PyTorch and TensorFlow), ensuring broader compatibility.
- Optimization: Many deployment environments, including Triton, are optimized for ONNX, enabling faster inference and better performance.
- Ease of Deployment: ONNX is widely supported across frameworks and platforms, which simplifies deployment on various operating systems and hardware configurations.
To export your model, use:
from ultralytics import YOLO
model = YOLO("yolo11n.pt")
onnx_file = model.export(format="onnx", dynamic=True)
You can follow the steps in the export guide to complete the process.
Can I run inference using the Ultralytics YOLO11 model on Triton Inference Server?
Yes, you can run inference using the Ultralytics YOLO11 model on NVIDIA Triton Inference Server. Once your model is set up in the Triton Model Repository and the server is running, you can load and run inference on your model as follows:
from ultralytics import YOLO
# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")
# Run inference on the server
results = model("path/to/image.jpg")
For an in-depth guide on setting up and running Triton Server with YOLO11, refer to the Running Triton Inference Server section above.
How does Ultralytics YOLO11 compare to TensorFlow and PyTorch models for deployment?
Ultralytics YOLO11 offers several unique advantages compared to TensorFlow and PyTorch models for deployment:
- Real-time Performance: Optimized for real-time object detection tasks, YOLO11 provides state-of-the-art accuracy and speed, making it ideal for applications requiring live video analytics.
- Ease of Use: YOLO11 integrates seamlessly with Triton Inference Server and supports diverse export formats (ONNX, TensorRT, CoreML), making it flexible for various deployment scenarios.
- Advanced Features: When served through Triton, YOLO11 deployments can use dynamic model loading, model versioning, and ensemble inference, which are crucial for scalable and reliable AI deployments.
For more details, compare the deployment options in the model deployment guide.