GPU-Accelerated Preprocessing with NVIDIA DALI

Introduction

When you deploy Ultralytics YOLO models in production, preprocessing often becomes the bottleneck. TensorRT can run model inference in just a few milliseconds, but CPU preprocessing (resize, pad, normalize) can take 2-10 ms per image, especially at high resolutions. NVIDIA DALI (Data Loading Library) addresses this by moving the entire preprocessing pipeline to the GPU.

This guide walks you through building DALI pipelines that accurately replicate Ultralytics YOLO preprocessing, integrating them with model.predict(), handling video streams, and deploying end-to-end with Triton Inference Server.

Who is this guide for?

This guide is for engineers deploying YOLO models in production environments where CPU preprocessing is a measurable bottleneck: typically TensorRT deployments on NVIDIA GPUs, high-throughput video pipelines, or Triton Inference Server setups. If you run standard inference with model.predict() and are not hitting a preprocessing bottleneck, the default CPU pipeline works well.

Quick summary

Building a DALI pipeline? Use fn.resize(mode="not_larger") + fn.crop(out_of_bounds_policy="pad") + fn.crop_mirror_normalize to replicate YOLO's letterbox preprocessing on the GPU.
Integrating with Ultralytics? Pass the DALI output to model.predict() as a torch.Tensor; Ultralytics then skips image preprocessing automatically.
Deploying with Triton? Use the DALI backend in an ensemble with TensorRT for zero CPU preprocessing.

Why Use DALI for YOLO Preprocessing

In a typical YOLO inference pipeline, the preprocessing steps run on the CPU:

Decode the image (JPEG/PNG)
Resize while preserving the aspect ratio
Pad to the target size (letterbox)
Normalize pixel values from [0, 255] to [0, 1]
Convert the layout from HWC to CHW

With DALI, all of these operations run on the GPU, which removes the CPU bottleneck. This is especially valuable in the following scenarios:

Scenario	Why DALI helps
Fast GPU inference	With TensorRT engines running inference in under a millisecond, CPU preprocessing becomes the dominant cost
High-resolution input	1080p and 4K video streams require expensive resize operations
Large batch sizes	Server-side inference processes many images at once
Limited CPU cores	Edge devices such as NVIDIA Jetson, or dense GPU servers with few CPU cores per GPU

Prerequisites

Linux only

NVIDIA DALI supports Linux only. It is not available on Windows or macOS.

Install the required packages:

pip install ultralytics
pip install --extra-index-url https://pypi.nvidia.com nvidia-dali-cuda130

Requirements:

NVIDIA GPU (compute capability 5.0+ / Maxwell architecture or newer)
CUDA 11.0+, 12.0+, or 13.0+
Python 3.10-3.14
Linux operating system

Understanding YOLO Preprocessing

Before building a DALI pipeline, it helps to understand exactly what Ultralytics does during preprocessing. The main class is LetterBox in ultralytics/data/augment.py:

from ultralytics.data.augment import LetterBox

letterbox = LetterBox(
    new_shape=(640, 640),  # Target size
    center=True,  # Center the image (pad equally on both sides)
    stride=32,  # Stride alignment
    padding_value=114,  # Gray padding (114, 114, 114)
)

The full preprocessing pipeline in ultralytics/engine/predictor.py performs the following steps:

Step	Operation	CPU function	DALI equivalent
1	Letterbox resize	`cv2.resize`	`fn.resize(mode="not_larger")`
2	Centered padding	`cv2.copyMakeBorder`	`fn.crop(out_of_bounds_policy="pad")`
3	BGR → RGB	`im[..., ::-1]`	`fn.decoders.image(output_type=types.RGB)`
4	HWC → CHW + normalize /255	`np.transpose` + `tensor / 255`	`fn.crop_mirror_normalize(std=[255,255,255])`

The letterbox operation preserves the aspect ratio by:

Computing the scale: r = min(target_h / h, target_w / w)
Resizing to (round(w * r), round(h * r))
Padding the remaining space with gray (114) to reach the target size
Centering the image so the padding is split evenly between both sides
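
The four steps above can be sketched in plain Python. This is a minimal illustration of the arithmetic, not the Ultralytics implementation: the function name `letterbox_geometry` is hypothetical, and splitting the padding with integer division is an assumption that may differ by one pixel from `LetterBox` when the leftover space is odd.

```python
def letterbox_geometry(h, w, target_h=640, target_w=640):
    """Return (scale, resized (w, h), padding (left, top, right, bottom))."""
    # Step 1: one scale factor for both axes keeps the aspect ratio
    r = min(target_h / h, target_w / w)
    # Step 2: size of the image after resizing, before padding
    new_w, new_h = round(w * r), round(h * r)
    # Steps 3-4: split the leftover space between the two sides of each axis
    pad_w, pad_h = target_w - new_w, target_h - new_h
    left, top = pad_w // 2, pad_h // 2
    return r, (new_w, new_h), (left, top, pad_w - left, pad_h - top)


scale, size, pads = letterbox_geometry(1080, 1920)
print(size, pads)  # (640, 360) (0, 140, 0, 140)
```

For a 1080p frame the width already fills the target, so all 280 rows of padding go above and below the image, 140 on each side.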

DALI Pipeline for YOLO

Use the centered pipeline below as the default reference. It matches the Ultralytics LetterBox(center=True) behavior, which is what standard YOLO inference uses.

Centered Pipeline (Recommended, Matches Ultralytics LetterBox)

This version replicates the default Ultralytics preprocessing with centered padding, matching LetterBox(center=True):

DALI pipeline with centered padding (recommended)

import nvidia.dali as dali
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@dali.pipeline_def(batch_size=8, num_threads=4, device_id=0)
def yolo_dali_pipeline_centered(image_dir, target_size=640):
    """DALI pipeline replicating YOLO preprocessing with centered padding.

    Matches Ultralytics LetterBox(center=True) behavior exactly.
    """
    # Read and decode images on GPU
    jpegs, _ = fn.readers.file(file_root=image_dir, random_shuffle=False, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)

    # Aspect-ratio-preserving resize
    resized = fn.resize(
        images,
        resize_x=target_size,
        resize_y=target_size,
        mode="not_larger",
        interp_type=types.INTERP_LINEAR,
        antialias=False,  # Match cv2.INTER_LINEAR (no antialiasing)
    )

    # Centered padding using fn.crop with out_of_bounds_policy
    # When crop size > image size, fn.crop centers the image and pads symmetrically
    padded = fn.crop(
        resized,
        crop=(target_size, target_size),
        out_of_bounds_policy="pad",
        fill_values=114,  # YOLO padding value
    )

    # Normalize and convert layout
    output = fn.crop_mirror_normalize(
        padded,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.0, 0.0, 0.0],
        std=[255.0, 255.0, 255.0],
    )
    return output

When is `fn.pad` enough?

If you do not need exact parity with LetterBox(center=True), you can simplify the padding step by using fn.pad(...) instead of fn.crop(..., out_of_bounds_policy="pad"). That variant pads only the right and bottom edges, which can be acceptable for custom deployment pipelines, but it does not match the default centered letterbox behavior of Ultralytics.

Why use `fn.crop` for centered padding?

DALI's fn.pad operator adds padding only to the right and bottom edges. For centered padding (matching the Ultralytics LetterBox(center=True)), use fn.crop with out_of_bounds_policy="pad". With the default crop_pos_x=0.5 and crop_pos_y=0.5, the image is centered automatically with symmetric padding.

Antialias mismatch

DALI's fn.resize enables antialiasing by default (antialias=True), whereas OpenCV's cv2.resize with INTER_LINEAR applies none. Always set antialias=False in DALI to match the CPU pipeline. Omitting it causes small numerical differences that can affect model accuracy.
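
To see why the antialias setting changes pixel values, the toy example below downscales a one-dimensional row with a triangle (linear) filter, once with a fixed one-pixel support and once with the support widened by the downscale factor. It is a simplified model written for illustration, not the code DALI or OpenCV run, and `resize_linear_1d` is a hypothetical helper.

```python
def resize_linear_1d(src, out_len, antialias):
    """Downscale a 1-D signal with a triangle (linear) filter."""
    scale = len(src) / out_len
    # Without antialiasing the filter reaches one source pixel to each side;
    # with antialiasing its support is widened by the downscale factor.
    support = max(scale, 1.0) if antialias else 1.0
    out = []
    for i in range(out_len):
        center = (i + 0.5) * scale - 0.5  # half-pixel-centered sampling
        acc = total = 0.0
        for j, value in enumerate(src):
            weight = max(0.0, 1.0 - abs(j - center) / support)
            acc += weight * value
            total += weight
        out.append(acc / total)
    return out


row = [0, 0, 0, 255, 0, 0, 0, 0]  # a single bright pixel
no_aa = resize_linear_1d(row, 2, antialias=False)
with_aa = resize_linear_1d(row, 2, antialias=True)
print(no_aa)    # the bright pixel falls between sample points and vanishes
print(with_aa)  # the wider filter spreads it into both output pixels
```

In this toy model the two settings give different outputs for the same input, which is the kind of mismatch the model would see if one pipeline antialiased and the other did not.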

Running the Pipeline

Build and run a DALI pipeline

# Build and run the pipeline
pipe = yolo_dali_pipeline_centered(image_dir="/path/to/images", target_size=640)
pipe.build()

# Get a batch of preprocessed images
(output,) = pipe.run()

# Convert to numpy or PyTorch tensors
batch_np = output.as_cpu().as_array()  # Shape: (batch_size, 3, 640, 640)
print(f"Output shape: {batch_np.shape}, dtype: {batch_np.dtype}")
print(f"Value range: [{batch_np.min():.4f}, {batch_np.max():.4f}]")

Using DALI with Ultralytics Predict

You can pass a preprocessed PyTorch tensor directly to model.predict(). When a torch.Tensor is passed in, Ultralytics skips image preprocessing (letterbox, BGR→RGB, HWC→CHW, and /255 normalization) and only performs the device transfer and dtype cast before sending it to the model.

Because Ultralytics has no access to the original image size in this case, detection box coordinates are returned in the 640×640 letterbox space. To map them back to original image coordinates, use scale_boxes, which handles the exact rounding logic used by LetterBox:

from ultralytics.utils.ops import scale_boxes

# boxes: tensor of shape (N, 4) in xyxy format, in 640x640 letterboxed coords
# Scale boxes from letterboxed (640, 640) back to original (orig_h, orig_w)
boxes = scale_boxes((640, 640), boxes, (orig_h, orig_w))

This applies to every external preprocessing path: direct tensor input, video streams, and Triton deployments.
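
To make the mapping concrete, the sketch below inverts a letterbox for a single box in plain Python. It illustrates the arithmetic only: `unletterbox_box` is a hypothetical helper, not an Ultralytics function, and it ignores the sub-pixel rounding that `scale_boxes` applies, so prefer `scale_boxes` in real code. The `centered` flag is an assumption added to show how the offset disappears when padding is applied only on the right and bottom.

```python
def unletterbox_box(box, orig_hw, lb_hw=(640, 640), centered=True):
    """Map an xyxy box from letterboxed coordinates to original image coordinates."""
    orig_h, orig_w = orig_hw
    lb_h, lb_w = lb_hw
    gain = min(lb_h / orig_h, lb_w / orig_w)
    # Centered padding places half of the leftover space before the image;
    # right/bottom-only padding places none before it.
    pad_x = (lb_w - orig_w * gain) / 2 if centered else 0.0
    pad_y = (lb_h - orig_h * gain) / 2 if centered else 0.0
    x1, y1, x2, y2 = box
    xs = [(x - pad_x) / gain for x in (x1, x2)]
    ys = [(y - pad_y) / gain for y in (y1, y2)]
    # Clip to the original image bounds
    xs = [min(max(x, 0.0), orig_w) for x in xs]
    ys = [min(max(y, 0.0), orig_h) for y in ys]
    return [xs[0], ys[0], xs[1], ys[1]]


# A box covering the whole image area inside a 640x640 letterbox of a 1080p frame
box = unletterbox_box([0, 140, 640, 500], orig_hw=(1080, 1920))
print([round(v) for v in box])  # [0, 0, 1920, 1080]
```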

DALI + Ultralytics predict

from nvidia.dali.plugin.pytorch import DALIGenericIterator

from ultralytics import YOLO

# Load model
model = YOLO("yolo26n.pt")

# Create DALI iterator
pipe = yolo_dali_pipeline_centered(image_dir="/path/to/images", target_size=640)
pipe.build()
dali_iter = DALIGenericIterator(pipe, ["images"], reader_name="Reader")

# Run inference with DALI-preprocessed tensors
for batch in dali_iter:
    images = batch[0]["images"]  # Already on GPU, shape (B, 3, 640, 640)
    results = model.predict(images, verbose=False)
    for result in results:
        print(f"Detected {len(result.boxes)} objects")

Zero preprocessing overhead

When you pass a torch.Tensor to model.predict(), the image preprocessing step takes ~0.004 ms (effectively zero), compared with ~1-10 ms for CPU preprocessing. The tensor must be in BCHW layout, float32 (or float16), and normalized to [0, 1]. Ultralytics still handles the device transfer and dtype cast automatically.

DALI with Video Streams

For real-time video processing, use fn.external_source to feed frames from any source, such as OpenCV, GStreamer, or custom capture libraries:

DALI pipeline for video stream preprocessing

import nvidia.dali as dali
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@dali.pipeline_def(batch_size=1, num_threads=4, device_id=0)
def yolo_video_pipeline(target_size=640):
    """DALI pipeline for processing video frames from external source."""
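    # External source for feeding frames from OpenCV, GStreamer, etc.
    # Frames are expected as uint8 HWC arrays in RGB order; OpenCV delivers BGR,
    # so convert before feeding (this pipeline does not swap channels).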
    frames = fn.external_source(device="cpu", name="input")
    frames = fn.reshape(frames, layout="HWC")

    # Move to GPU and preprocess
    frames_gpu = frames.gpu()
    resized = fn.resize(
        frames_gpu,
        resize_x=target_size,
        resize_y=target_size,
        mode="not_larger",
        interp_type=types.INTERP_LINEAR,
        antialias=False,
    )
    padded = fn.crop(
        resized,
        crop=(target_size, target_size),
        out_of_bounds_policy="pad",
        fill_values=114,
    )
    output = fn.crop_mirror_normalize(
        padded,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.0, 0.0, 0.0],
        std=[255.0, 255.0, 255.0],
    )
    return output

Triton Inference Server with DALI

For production deployment, combine DALI preprocessing with TensorRT inference in Triton Inference Server using an ensemble model. This removes CPU preprocessing entirely: raw JPEG bytes go in, detections come out, and everything in between runs on the GPU.

Model Repository Structure

model_repository/
├── dali_preprocessing/
│   ├── 1/
│   │   └── model.dali
│   └── config.pbtxt
├── yolo_trt/
│   ├── 1/
│   │   └── model.plan
│   └── config.pbtxt
└── ensemble_dali_yolo/
    ├── 1/                  # Empty directory (required by Triton)
    └── config.pbtxt

Step 1: Create the DALI Pipeline

Serialize the DALI pipeline for the Triton DALI backend:

Serialize the DALI pipeline for Triton

import nvidia.dali as dali
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@dali.pipeline_def(batch_size=8, num_threads=4, device_id=0)
def triton_dali_pipeline():
    """DALI preprocessing pipeline for Triton deployment."""
    # Input: raw encoded image bytes from Triton
    images = fn.external_source(device="cpu", name="DALI_INPUT_0")
    images = fn.decoders.image(images, device="mixed", output_type=types.RGB)

    resized = fn.resize(
        images,
        resize_x=640,
        resize_y=640,
        mode="not_larger",
        interp_type=types.INTERP_LINEAR,
        antialias=False,
    )
    padded = fn.crop(
        resized,
        crop=(640, 640),
        out_of_bounds_policy="pad",
        fill_values=114,
    )
    output = fn.crop_mirror_normalize(
        padded,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.0, 0.0, 0.0],
        std=[255.0, 255.0, 255.0],
    )
    return output

# Serialize pipeline to model repository
pipe = triton_dali_pipeline()
pipe.serialize(filename="model_repository/dali_preprocessing/1/model.dali")

Step 2: Export YOLO to TensorRT

Export the YOLO model to a TensorRT engine

from ultralytics import YOLO

model = YOLO("yolo26n.pt")
model.export(format="engine", imgsz=640, half=True, batch=8)
# Copy the .engine file to model_repository/yolo_trt/1/model.plan

Step 3: Configure Triton

dali_preprocessing/config.pbtxt:

name: "dali_preprocessing"
backend: "dali"
max_batch_size: 8
input [
  {
    name: "DALI_INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "DALI_OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]

yolo_trt/config.pbtxt:

name: "yolo_trt"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 300, 6 ]
  }
]

ensemble_dali_yolo/config.pbtxt:

name: "ensemble_dali_yolo"
platform: "ensemble"
max_batch_size: 8
input [
  {
    name: "INPUT"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ 300, 6 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "dali_preprocessing"
      model_version: -1
      input_map {
        key: "DALI_INPUT_0"
        value: "INPUT"
      }
      output_map {
        key: "DALI_OUTPUT_0"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "yolo_trt"
      model_version: -1
      input_map {
        key: "images"
        value: "preprocessed_image"
      }
      output_map {
        key: "output0"
        value: "OUTPUT"
      }
    }
  ]
}

How ensemble mapping works

The ensemble connects models through virtual tensor names. The output_map value "preprocessed_image" in the DALI step matches the input_map value "preprocessed_image" in the TensorRT step. These are arbitrary names that link one step's output to the next step's input; they do not need to match any model's internal tensor names.
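
The wiring rule can be checked with a few lines of Python. The sketch below is a toy validator, not part of Triton: `check_ensemble_wiring` is a hypothetical helper, the dictionaries mirror the ensemble config by hand, and it assumes the steps are listed in execution order, which is a simplification.

```python
def check_ensemble_wiring(ensemble_inputs, ensemble_outputs, steps):
    """Verify every tensor a step consumes was produced earlier in the ensemble."""
    available = set(ensemble_inputs)
    for step in steps:
        for model_tensor, ensemble_tensor in step["input_map"].items():
            if ensemble_tensor not in available:
                raise ValueError(
                    f"{step['model_name']}.{model_tensor} reads unknown tensor {ensemble_tensor!r}"
                )
        # Outputs of this step become available to all later steps
        available.update(step["output_map"].values())
    missing = set(ensemble_outputs) - available
    if missing:
        raise ValueError(f"ensemble outputs never produced: {sorted(missing)}")
    return True


steps = [
    {
        "model_name": "dali_preprocessing",
        "input_map": {"DALI_INPUT_0": "INPUT"},
        "output_map": {"DALI_OUTPUT_0": "preprocessed_image"},
    },
    {
        "model_name": "yolo_trt",
        "input_map": {"images": "preprocessed_image"},
        "output_map": {"output0": "OUTPUT"},
    },
]
print(check_ensemble_wiring(["INPUT"], ["OUTPUT"], steps))  # True
```

Renaming "preprocessed_image" in only one of the two steps makes the check fail, which is the most common way an ensemble config ends up broken.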

Step 4: Send Inference Requests

Why use tritonclient instead of YOLO("http://...")?

Ultralytics has [built-in Triton support](triton-inference-server.md#running-inference) that handles pre/postprocessing automatically. However, it won't work with the DALI ensemble because `YOLO()` sends a preprocessed float32 tensor while the ensemble expects raw JPEG bytes. Use `tritonclient` directly for DALI ensembles, and the [built-in integration](triton-inference-server.md) for standard deployments without DALI.

Send an image to the Triton ensemble

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Load image as raw bytes (JPEG/PNG encoded)
image_data = np.fromfile("image.jpg", dtype="uint8")
image_data = np.expand_dims(image_data, axis=0)  # Add batch dimension

# Create input
input_tensor = httpclient.InferInput("INPUT", image_data.shape, "UINT8")
input_tensor.set_data_from_numpy(image_data)

# Run inference through the ensemble
result = client.infer(model_name="ensemble_dali_yolo", inputs=[input_tensor])
detections = result.as_numpy("OUTPUT")  # Shape: (1, 300, 6) -> [x1, y1, x2, y2, conf, class_id]

# Filter by confidence (no NMS needed — YOLO26 is end-to-end)
detections = detections[0]  # First image
detections = detections[detections[:, 4] > 0.25]  # Confidence threshold
print(f"Detected {len(detections)} objects")

Batching JPEG images

When sending a batch of JPEG images to Triton, pad all encoded byte arrays to the same length (the maximum byte count in the batch). Triton requires uniform batch shapes for input tensors.
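
A minimal sketch of that padding step, using only the standard library. The helper name `pad_encoded_batch` is hypothetical, and zero is an assumed fill value; the advice relies on the decoder ignoring bytes after the end of the encoded image, so verify that with your own files.

```python
def pad_encoded_batch(encoded_images):
    """Zero-pad encoded image byte strings to a common length."""
    max_len = max(len(data) for data in encoded_images)
    # bytes.ljust appends the fill byte on the right until max_len is reached
    return [data.ljust(max_len, b"\x00") for data in encoded_images]


# Stand-in payloads; real inputs would be the raw contents of JPEG files
batch = pad_encoded_batch([b"\xff\xd8short\xff\xd9", b"\xff\xd8a-longer-payload\xff\xd9"])
print([len(item) for item in batch])
```

Each padded item can then be viewed as a uint8 array with np.frombuffer and stacked into a single (batch, max_len) input tensor.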

Supported Tasks

DALI preprocessing works with all YOLO tasks that use the standard LetterBox pipeline:

Task	Supported	Notes
Detection	✅	Standard letterbox preprocessing
Instance Segmentation	✅	Same preprocessing as detection
Semantic Segmentation	✅	Same image preprocessing as detection
Pose Estimation	✅	Same preprocessing as detection
Oriented Detection (OBB)	✅	Same preprocessing as detection
Classification	❌	Uses torchvision transforms (center crop), not letterbox

Limitations

Linux only: DALI does not support Windows or macOS
NVIDIA GPU required: there is no CPU-only fallback
Static pipeline: the pipeline structure is defined at build time and cannot change dynamically
fn.pad pads right/bottom only: use fn.crop with out_of_bounds_policy="pad" for centered padding
No rect mode: DALI pipelines produce fixed-size output (for example 640×640). The variable-size output of rect mode with auto=True (for example 384×640) is not supported. Although TensorRT does support dynamic input shapes, a fixed-size DALI pipeline pairs naturally with a fixed-size engine for maximum throughput
Memory with multiple instances: using instance_group with count > 1 in Triton can cause high memory usage. Use the default instance group for the DALI model
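
To show where a rect-mode shape such as 384×640 comes from, the sketch below pads only up to the next multiple of the stride instead of the full target size. It is an illustration under stated assumptions: `rect_shape` is a hypothetical helper, and the modulo-stride rule reflects how auto=True is generally understood to work rather than a copy of the Ultralytics code.

```python
def rect_shape(h, w, target=640, stride=32):
    """Output (height, width) when padding only up to the next stride multiple."""
    r = min(target / h, target / w)
    new_h, new_w = round(h * r), round(w * r)
    # Keep only the part of the padding needed to reach a multiple of stride
    pad_h = (target - new_h) % stride
    pad_w = (target - new_w) % stride
    return new_h + pad_h, new_w + pad_w


print(rect_shape(1080, 1920))  # (384, 640)
print(rect_shape(480, 640))    # (480, 640)
```

The output size depends on the input aspect ratio, which is why a fixed-size DALI pipeline cannot reproduce it.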

Frequently Asked Questions (FAQ)

How does DALI preprocessing compare to CPU preprocessing speed?

The benefit depends on your pipeline. When GPU inference is already fast with TensorRT, CPU preprocessing at 2-10 ms can become the dominant cost. DALI removes this bottleneck by running preprocessing on the GPU. The largest gains appear with high-resolution input (1080p, 4K), large batch sizes, and systems with few CPU cores per GPU.

Can I use DALI with PyTorch models (not just TensorRT)?

Yes. Use DALIGenericIterator to get preprocessed torch.Tensor outputs, then pass them to model.predict(). The largest performance gains still come with TensorRT models, where inference is already very fast and CPU preprocessing becomes the bottleneck.

What is the difference between `fn.pad` and `fn.crop` for padding?

fn.pad adds padding only to the right and bottom edges. fn.crop with out_of_bounds_policy="pad" centers the image and adds symmetric padding on all sides, matching the behavior of LetterBox(center=True) in Ultralytics.

Does DALI produce pixel-identical results to CPU preprocessing?

Nearly identical. Set antialias=False in fn.resize to match OpenCV's cv2.INTER_LINEAR. Small floating-point differences (< 0.001) can occur because GPU and CPU arithmetic differ, but they have no meaningful impact on detection accuracy.

What about CV-CUDA as an alternative to DALI?

CV-CUDA is another NVIDIA library for GPU-accelerated vision processing. It offers per-operator control (like OpenCV, but on the GPU) rather than DALI's pipeline approach. CV-CUDA's cvcuda.copymakeborder() supports explicit per-side padding, which makes centered letterboxing straightforward. Choose DALI for pipeline-based workflows (especially with Triton), and CV-CUDA for fine-grained, operator-level control in custom inference code.

Contributors: glenn-jocher, onuralpszr, raimbekovm

Created last month. Updated 5 days ago.