GPU-Accelerated Preprocessing with NVIDIA DALI

Introduction

When you deploy Ultralytics YOLO models in production, preprocessing often becomes the bottleneck. TensorRT can run model inference in a few milliseconds, but CPU-based preprocessing (resize, padding, normalization) can take 2-10 ms per image, especially at high resolutions. NVIDIA DALI (Data Loading Library) addresses this by moving the entire preprocessing pipeline to the GPU.

This guide walks you through building DALI pipelines that replicate Ultralytics YOLO preprocessing exactly, integrating them with model.predict(), processing video streams, and deploying end to end with Triton Inference Server.

Who is this guide for?

This guide is for engineers deploying YOLO models in production environments where CPU preprocessing is a measured bottleneck: typically TensorRT deployments on NVIDIA GPUs, high-throughput video pipelines, or Triton Inference Server setups. If you run standard inference with model.predict() and have no preprocessing bottleneck, the default CPU pipeline works fine.

Quick Summary

Building a DALI pipeline? Use fn.resize(mode="not_larger") + fn.crop(out_of_bounds_policy="pad") + fn.crop_mirror_normalize to replicate YOLO's letterbox preprocessing on the GPU.
Integrating with Ultralytics? Pass the DALI output to model.predict() as a torch.Tensor; Ultralytics skips image preprocessing automatically.
Deploying with Triton? Use the DALI backend together with a TensorRT ensemble for zero CPU preprocessing.

Why Use DALI for YOLO Preprocessing?

In a typical YOLO inference pipeline, the preprocessing steps run on the CPU:

Decode the image (JPEG/PNG)
Resize while preserving the aspect ratio
Pad to the target size (letterbox)
Normalize pixel values from [0, 255] to [0, 1]
Convert the layout from HWC to CHW

With DALI, all of these operations run on the GPU, which removes the CPU bottleneck. This is especially valuable in the following scenarios:

Scenario	Why DALI Helps
Fast GPU inference	TensorRT engines with sub-millisecond inference make CPU preprocessing the dominant cost
High-resolution inputs	1080p and 4K video streams require expensive resize operations
Large batch sizes	Server-side inference that processes many images in parallel
Limited CPU cores	Edge devices such as NVIDIA Jetson, or dense GPU servers with few CPU cores per GPU

Prerequisites

Linux Only

NVIDIA DALI supports Linux only. It is not available on Windows or macOS.

Install the required packages:

pip install ultralytics
pip install --extra-index-url https://pypi.nvidia.com nvidia-dali-cuda130

Requirements:

NVIDIA GPU (compute capability 5.0+ / Maxwell or newer)
CUDA 11.0+, 12.0+, or 13.0+
Python 3.10-3.14
Linux operating system

Understanding YOLO Preprocessing

Before building a DALI pipeline, it helps to understand exactly what Ultralytics does during preprocessing. The key class is LetterBox in ultralytics/data/augment.py:

from ultralytics.data.augment import LetterBox

letterbox = LetterBox(
    new_shape=(640, 640),  # Target size
    center=True,  # Center the image (pad equally on both sides)
    stride=32,  # Stride alignment
    padding_value=114,  # Gray padding (114, 114, 114)
)

The full preprocessing pipeline in ultralytics/engine/predictor.py performs the following steps:

Step	Operation	CPU Function	DALI Equivalent
1	Letterbox resize	`cv2.resize`	`fn.resize(mode="not_larger")`
2	Centered padding	`cv2.copyMakeBorder`	`fn.crop(out_of_bounds_policy="pad")`
3	BGR → RGB	`im[..., ::-1]`	`fn.decoders.image(output_type=types.RGB)`
4	HWC → CHW + normalization /255	`np.transpose` + `tensor / 255`	`fn.crop_mirror_normalize(std=[255,255,255])`
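Steps 3 and 4 of the table can be illustrated without any image library. The sketch below works on plain nested lists and is only meant to show what the channel swap, the layout change, and the /255 scaling do to the data. It is not the Ultralytics or DALI implementation, and the function name bgr_hwc_to_rgb_chw is hypothetical.

```python
def bgr_hwc_to_rgb_chw(image):
    """Convert an H x W x 3 BGR image (nested lists of 0-255 ints)
    into a 3 x H x W RGB image with float values in [0, 1]."""
    height = len(image)
    width = len(image[0])
    rgb_order = (2, 1, 0)  # position of R, G, B inside a BGR pixel
    return [
        [[image[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in rgb_order
    ]


# A 1 x 2 image in BGR order: one pure-blue pixel and one white pixel
tiny = [[[255, 0, 0], [255, 255, 255]]]
chw = bgr_hwc_to_rgb_chw(tiny)
print(chw)  # channels first: R plane, G plane, B plane
```

The output has one plane per channel, which is the BCHW layout (minus the batch dimension) that the model expects.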

The letterbox operation preserves the aspect ratio as follows:

Compute the scale: r = min(target_h / h, target_w / w)
Resize to (round(w * r), round(h * r))
Pad the remaining area with gray (114) to reach the target size
Center the image so the padding is split equally between both sides
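The four steps above can be sketched as a small geometry calculation. This is a minimal illustration, not the LetterBox source: the function name letterbox_geometry is hypothetical, and placing the extra pixel on the bottom/right edge when the padding is odd is an assumption of this sketch.

```python
def letterbox_geometry(h, w, target_h=640, target_w=640):
    """Return the resized size and per-edge padding for a centered letterbox."""
    r = min(target_h / h, target_w / w)        # scale that fits both dimensions
    new_w, new_h = round(w * r), round(h * r)  # resized size, aspect ratio preserved
    pad_w, pad_h = target_w - new_w, target_h - new_h
    # Assumption: with odd padding, the extra pixel goes to the bottom/right edge
    left, top = pad_w // 2, pad_h // 2
    right, bottom = pad_w - left, pad_h - top
    return {"size": (new_w, new_h), "pad": (left, top, right, bottom)}


geom = letterbox_geometry(1080, 1920)  # a 1080p frame
print(geom)
```

For a 1080p frame the width is the limiting dimension, so the image is resized to 640×360 and the remaining 280 rows are split into 140 rows of gray above and below.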

DALI Pipeline for YOLO

Use the centered pipeline below as the default reference. It matches the Ultralytics LetterBox(center=True) behavior used by standard YOLO inference.

Centered Pipeline (Recommended, Matches Ultralytics LetterBox)

This version matches LetterBox(center=True) and replicates the default Ultralytics preprocessing, including centered padding:

DALI pipeline with centered padding (recommended)

import nvidia.dali as dali
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@dali.pipeline_def(batch_size=8, num_threads=4, device_id=0)
def yolo_dali_pipeline_centered(image_dir, target_size=640):
    """DALI pipeline replicating YOLO preprocessing with centered padding.

    Matches Ultralytics LetterBox(center=True) behavior exactly.
    """
    # Read and decode images on GPU
    jpegs, _ = fn.readers.file(file_root=image_dir, random_shuffle=False, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)

    # Aspect-ratio-preserving resize
    resized = fn.resize(
        images,
        resize_x=target_size,
        resize_y=target_size,
        mode="not_larger",
        interp_type=types.INTERP_LINEAR,
        antialias=False,  # Match cv2.INTER_LINEAR (no antialiasing)
    )

    # Centered padding using fn.crop with out_of_bounds_policy
    # When crop size > image size, fn.crop centers the image and pads symmetrically
    padded = fn.crop(
        resized,
        crop=(target_size, target_size),
        out_of_bounds_policy="pad",
        fill_values=114,  # YOLO padding value
    )

    # Normalize and convert layout
    output = fn.crop_mirror_normalize(
        padded,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.0, 0.0, 0.0],
        std=[255.0, 255.0, 255.0],
    )
    return output

When is `fn.pad` enough?

If you do not need exact LetterBox(center=True) parity, you can simplify the padding step by using fn.pad(...) instead of fn.crop(..., out_of_bounds_policy="pad"). That variant pads only the right and bottom edges, which can be acceptable for custom deployment pipelines, but it will not match Ultralytics' default centered letterbox behavior exactly.

Why `fn.crop` for centered padding?

DALI's fn.pad operator only adds padding to the right and bottom edges. To get centered padding (matching Ultralytics LetterBox(center=True)), use fn.crop with out_of_bounds_policy="pad". With the default crop_pos_x=0.5 and crop_pos_y=0.5, the image is automatically centered with symmetric padding.

Antialias Mismatch

DALI's fn.resize enables antialiasing by default (antialias=True), while OpenCV's cv2.resize with INTER_LINEAR does not apply antialiasing. Always set antialias=False in DALI to match the CPU pipeline. Omitting this causes subtle numerical differences that can affect model accuracy.

Running the Pipeline

Build and run a DALI pipeline

# Build and run the pipeline
pipe = yolo_dali_pipeline_centered(image_dir="/path/to/images", target_size=640)
pipe.build()

# Get a batch of preprocessed images
(output,) = pipe.run()

# Convert to numpy or PyTorch tensors
batch_np = output.as_cpu().as_array()  # Shape: (batch_size, 3, 640, 640)
print(f"Output shape: {batch_np.shape}, dtype: {batch_np.dtype}")
print(f"Value range: [{batch_np.min():.4f}, {batch_np.max():.4f}]")

Using DALI with Ultralytics Predict

You can pass a preprocessed PyTorch tensor directly to model.predict(). When a torch.Tensor is passed, Ultralytics skips image preprocessing (letterbox, BGR→RGB, HWC→CHW, and /255 normalization) and performs only device transfer and dtype casting before sending the tensor to the model.

Since Ultralytics doesn't have access to the original image dimensions in this case, detection box coordinates are returned in the 640×640 letterboxed space. To map them back to original image coordinates, use scale_boxes which handles the exact rounding logic used by LetterBox:

from ultralytics.utils.ops import scale_boxes

# boxes: tensor of shape (N, 4) in xyxy format, in 640x640 letterboxed coords
# Scale boxes from letterboxed (640, 640) back to original (orig_h, orig_w)
boxes = scale_boxes((640, 640), boxes, (orig_h, orig_w))

This applies to all external preprocessing paths: direct tensor input, video streams, and Triton deployment.
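To make the coordinate mapping concrete, the sketch below reproduces the arithmetic for a single box in plain Python: remove the left/top padding, divide by the resize scale, and clip to the original image. It is an illustration only; scale_boxes remains the function to use in practice. The name unletterbox_box is hypothetical, and rounding the left/top padding down is an assumption of this sketch.

```python
def unletterbox_box(box, orig_h, orig_w, target_h=640, target_w=640):
    """Map an (x1, y1, x2, y2) box from centered-letterbox coordinates
    back to the original image, clipping to the image bounds."""
    gain = min(target_h / orig_h, target_w / orig_w)  # scale used for the resize
    # Assumption: left/top padding is half of the total padding, rounded down
    pad_x = (target_w - round(orig_w * gain)) // 2
    pad_y = (target_h - round(orig_h * gain)) // 2

    def clip(value, upper):
        return min(max(value, 0.0), float(upper))

    x1, y1, x2, y2 = box
    return (
        clip((x1 - pad_x) / gain, orig_w),
        clip((y1 - pad_y) / gain, orig_h),
        clip((x2 - pad_x) / gain, orig_w),
        clip((y2 - pad_y) / gain, orig_h),
    )


# A box covering the whole image area of a letterboxed 1080p frame
restored = unletterbox_box((0, 140, 640, 500), orig_h=1080, orig_w=1920)
print(restored)
```

The box that spans the non-padded region of the 640×640 input maps back to the full 1920×1080 frame, which is a quick sanity check for any external preprocessing path.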

DALI + Ultralytics predict

from nvidia.dali.plugin.pytorch import DALIGenericIterator

from ultralytics import YOLO

# Load model
model = YOLO("yolo26n.pt")

# Create DALI iterator
pipe = yolo_dali_pipeline_centered(image_dir="/path/to/images", target_size=640)
pipe.build()
dali_iter = DALIGenericIterator(pipe, ["images"], reader_name="Reader")

# Run inference with DALI-preprocessed tensors
for batch in dali_iter:
    images = batch[0]["images"]  # Already on GPU, shape (B, 3, 640, 640)
    results = model.predict(images, verbose=False)
    for result in results:
        print(f"Detected {len(result.boxes)} objects")

Zero Preprocessing Overhead

When you pass a torch.Tensor to model.predict(), the image preprocessing step takes ~0.004ms (essentially zero) compared to ~1-10ms with CPU preprocessing. The tensor must be in BCHW format, float32 (or float16), and normalized to [0, 1]. Ultralytics will still handle device transfer and dtype casting automatically.

DALI with Video Streams

For real-time video processing, use fn.external_source to feed frames from any source — OpenCV, GStreamer, or custom capture libraries:

DALI pipeline for video stream preprocessing

import nvidia.dali as dali
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@dali.pipeline_def(batch_size=1, num_threads=4, device_id=0)
def yolo_video_pipeline(target_size=640):
    """DALI pipeline for processing video frames from external source."""
    # External source for feeding frames from OpenCV, GStreamer, etc.
    frames = fn.external_source(device="cpu", name="input")
    frames = fn.reshape(frames, layout="HWC")

    # Move to GPU and preprocess
    frames_gpu = frames.gpu()
    resized = fn.resize(
        frames_gpu,
        resize_x=target_size,
        resize_y=target_size,
        mode="not_larger",
        interp_type=types.INTERP_LINEAR,
        antialias=False,
    )
    padded = fn.crop(
        resized,
        crop=(target_size, target_size),
        out_of_bounds_policy="pad",
        fill_values=114,
    )
    output = fn.crop_mirror_normalize(
        padded,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.0, 0.0, 0.0],
        std=[255.0, 255.0, 255.0],
    )
    return output

DALI with Triton Inference Server

For production deployment, combine DALI preprocessing with TensorRT inference in Triton Inference Server using an ensemble model. This eliminates CPU preprocessing entirely — raw JPEG bytes go in, detections come out, with everything processed on the GPU.

Model Repository Structure

model_repository/
├── dali_preprocessing/
│   ├── 1/
│   │   └── model.dali
│   └── config.pbtxt
├── yolo_trt/
│   ├── 1/
│   │   └── model.plan
│   └── config.pbtxt
└── ensemble_dali_yolo/
    ├── 1/                  # Empty directory (required by Triton)
    └── config.pbtxt

Step 1: Create the DALI Pipeline

Serialize the DALI pipeline for the Triton DALI backend:

Serialize the DALI pipeline for Triton

import nvidia.dali as dali
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@dali.pipeline_def(batch_size=8, num_threads=4, device_id=0)
def triton_dali_pipeline():
    """DALI preprocessing pipeline for Triton deployment."""
    # Input: raw encoded image bytes from Triton
    images = fn.external_source(device="cpu", name="DALI_INPUT_0")
    images = fn.decoders.image(images, device="mixed", output_type=types.RGB)

    resized = fn.resize(
        images,
        resize_x=640,
        resize_y=640,
        mode="not_larger",
        interp_type=types.INTERP_LINEAR,
        antialias=False,
    )
    padded = fn.crop(
        resized,
        crop=(640, 640),
        out_of_bounds_policy="pad",
        fill_values=114,
    )
    output = fn.crop_mirror_normalize(
        padded,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.0, 0.0, 0.0],
        std=[255.0, 255.0, 255.0],
    )
    return output

# Serialize pipeline to model repository
pipe = triton_dali_pipeline()
pipe.serialize(filename="model_repository/dali_preprocessing/1/model.dali")

Step 2: Export YOLO to TensorRT

Export the YOLO model to a TensorRT engine

from ultralytics import YOLO

model = YOLO("yolo26n.pt")
model.export(format="engine", imgsz=640, half=True, batch=8)
# Copy the .engine file to model_repository/yolo_trt/1/model.plan

Step 3: Configure Triton

dali_preprocessing/config.pbtxt:

name: "dali_preprocessing"
backend: "dali"
max_batch_size: 8
input [
  {
    name: "DALI_INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "DALI_OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]

yolo_trt/config.pbtxt:

name: "yolo_trt"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 300, 6 ]
  }
]

ensemble_dali_yolo/config.pbtxt:

name: "ensemble_dali_yolo"
platform: "ensemble"
max_batch_size: 8
input [
  {
    name: "INPUT"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ 300, 6 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "dali_preprocessing"
      model_version: -1
      input_map {
        key: "DALI_INPUT_0"
        value: "INPUT"
      }
      output_map {
        key: "DALI_OUTPUT_0"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "yolo_trt"
      model_version: -1
      input_map {
        key: "images"
        value: "preprocessed_image"
      }
      output_map {
        key: "output0"
        value: "OUTPUT"
      }
    }
  ]
}

How Ensemble Mapping Works

The ensemble connects models through virtual tensor names. The output_map value "preprocessed_image" in the DALI step matches the input_map value "preprocessed_image" in the TensorRT step. These are arbitrary names that link one step's output to the next step's input; they do not need to match any model's internal tensor names.
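The name matching described above can be checked by hand or with a few lines of Python before loading the repository. The sketch below is a hypothetical helper, not part of Triton or tritonclient: it models the ensemble steps as plain dictionaries and, as a simplifying assumption, treats them as running in the listed order (Triton itself schedules steps by data dependency and validates the config when the model loads).

```python
def check_ensemble_wiring(ensemble_inputs, ensemble_outputs, steps):
    """Return a list of problems found in an ensemble's tensor-name wiring."""
    available = set(ensemble_inputs)  # names that exist before any step runs
    problems = []
    for step in steps:
        for model_input, source in step["input_map"].items():
            if source not in available:
                problems.append(
                    f"{step['model_name']}.{model_input} reads unknown tensor '{source}'"
                )
        available.update(step["output_map"].values())
    for name in ensemble_outputs:
        if name not in available:
            problems.append(f"ensemble output '{name}' is never produced")
    return problems


# Mirrors the ensemble_dali_yolo config from Step 3
steps = [
    {
        "model_name": "dali_preprocessing",
        "input_map": {"DALI_INPUT_0": "INPUT"},
        "output_map": {"DALI_OUTPUT_0": "preprocessed_image"},
    },
    {
        "model_name": "yolo_trt",
        "input_map": {"images": "preprocessed_image"},
        "output_map": {"output0": "OUTPUT"},
    },
]
problems = check_ensemble_wiring(["INPUT"], ["OUTPUT"], steps)
print(problems)  # an empty list means every name is connected
```

A typo in either "preprocessed_image" entry would show up here as an unknown tensor, which is the most common wiring mistake in hand-written ensemble configs.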

Step 4: Send Inference Requests

Why tritonclient instead of YOLO("http://...")?

Ultralytics has built-in Triton support (triton-inference-server.md#running-inference) that handles pre/postprocessing automatically. However, it won't work with the DALI ensemble because YOLO() sends a preprocessed float32 tensor while the ensemble expects raw JPEG bytes. Use tritonclient directly for DALI ensembles, and the built-in integration (triton-inference-server.md) for standard deployments without DALI.

Send images to the Triton ensemble

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Load image as raw bytes (JPEG/PNG encoded)
image_data = np.fromfile("image.jpg", dtype="uint8")
image_data = np.expand_dims(image_data, axis=0)  # Add batch dimension

# Create input
input_tensor = httpclient.InferInput("INPUT", image_data.shape, "UINT8")
input_tensor.set_data_from_numpy(image_data)

# Run inference through the ensemble
result = client.infer(model_name="ensemble_dali_yolo", inputs=[input_tensor])
detections = result.as_numpy("OUTPUT")  # Shape: (1, 300, 6) -> [x1, y1, x2, y2, conf, class_id]

# Filter by confidence (no NMS needed — YOLO26 is end-to-end)
detections = detections[0]  # First image
detections = detections[detections[:, 4] > 0.25]  # Confidence threshold
print(f"Detected {len(detections)} objects")

Batching JPEG Images

When sending a batch of JPEG images to Triton, pad all encoded byte arrays to the same length (the largest byte count in the batch). Triton requires homogeneous batch shapes for the input tensor.
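The padding step can be sketched in a few lines of plain Python. The helper name pad_encoded_batch is hypothetical, and using zero as the filler byte is an assumption of this sketch; the byte strings in the example are short placeholders, not real JPEG files.

```python
def pad_encoded_batch(encoded_images):
    """Right-pad each encoded image with zero bytes to the longest length in the batch."""
    max_len = max(len(data) for data in encoded_images)
    return [data + bytes(max_len - len(data)) for data in encoded_images]


# Placeholder payloads of different lengths standing in for encoded JPEG files
raw = [b"\xff\xd8abc\xff\xd9", b"\xff\xd8abcdefgh\xff\xd9"]
batch = pad_encoded_batch(raw)
print([len(item) for item in batch])  # every entry now has the same length
```

Once all entries share one length, they can be stacked into a single uint8 array of shape (batch, max_len) and sent as the ensemble's INPUT tensor.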

Supported Tasks

DALI preprocessing works with all YOLO tasks that use the standard LetterBox pipeline:

Task	Supported	Notes
Detection	✅	Standard letterbox preprocessing
Instance Segmentation	✅	Same preprocessing as detection
Semantic Segmentation	✅	Same image preprocessing as detection
Pose Estimation	✅	Same preprocessing as detection
Oriented Detection (OBB)	✅	Same preprocessing as detection
Classification	❌	Uses torchvision transforms (center crop) instead of letterbox

Limitations

Linux only: DALI does not support Windows or macOS
NVIDIA GPU required: There is no CPU-only fallback
Static pipeline: The pipeline structure is defined at build time and cannot be changed dynamically
fn.pad is right/bottom only: Use fn.crop with out_of_bounds_policy="pad" for centered padding
No rect mode: DALI pipelines produce fixed-size outputs (e.g. 640×640). The auto=True rect mode, which produces variable-size outputs (e.g. 384×640), is not supported. Although TensorRT supports dynamic input shapes, note that a fixed-size DALI pipeline pairs well with a fixed-size engine for maximum throughput
Memory with multiple instances: Using instance_group with count > 1 in Triton can cause high memory usage. Use the default instance group for the DALI model
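For context on the rect-mode limitation, the sketch below estimates the output shape that a stride-aligned letterbox would produce, which shows why the size varies with the input. It assumes that rect mode pads each dimension only up to the next multiple of the stride; the function name rect_output_shape is hypothetical and this is not the Ultralytics implementation.

```python
def rect_output_shape(h, w, target=640, stride=32):
    """Estimate the (height, width) a stride-aligned rect letterbox would produce."""
    r = min(target / h, target / w)            # aspect-ratio-preserving scale
    new_h, new_w = round(h * r), round(w * r)
    # Assumption: pad each dimension only up to the next multiple of `stride`
    out_h = new_h + (target - new_h) % stride
    out_w = new_w + (target - new_w) % stride
    return out_h, out_w


shape = rect_output_shape(1080, 1920)  # a 1080p frame
print(shape)
```

A 1080p frame resizes to 640×360 and needs only 24 extra rows to reach a multiple of 32, giving 384×640 instead of the fixed 640×640 that the DALI pipelines in this guide emit.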

FAQ

How does DALI preprocessing compare to CPU preprocessing speed?

The benefit depends on your pipeline. When GPU inference is already fast with TensorRT, CPU preprocessing at 2-10 ms can become the dominant cost. DALI removes this bottleneck by running preprocessing on the GPU. The largest gains appear with high-resolution inputs (1080p, 4K), large batch sizes, and systems with limited CPU cores per GPU.

Can I use DALI with PyTorch models (not just TensorRT)?

Yes. Use DALIGenericIterator to get preprocessed torch.Tensor outputs, then pass them to model.predict(). However, the performance benefit is greatest with TensorRT models where inference is already very fast and CPU preprocessing becomes the bottleneck.

What is the difference between `fn.pad` and `fn.crop` for padding?

fn.pad adds padding only to the right and bottom edges. fn.crop with out_of_bounds_policy="pad" centers the image and adds padding symmetrically on all sides, matching Ultralytics LetterBox(center=True) behavior.

Does DALI produce pixel-identical results to CPU preprocessing?

Nearly identical. Set antialias=False in fn.resize to match OpenCV's cv2.INTER_LINEAR. Minor floating-point differences (< 0.001) may occur due to GPU vs CPU arithmetic, but these have no measurable impact on detection accuracy.

What is CV-CUDA as an alternative to DALI?

CV-CUDA is another NVIDIA library for GPU-accelerated image processing. Unlike DALI's pipeline approach, it provides per-operator control (like OpenCV on the GPU). CV-CUDA's cvcuda.copymakeborder() function supports explicit per-edge padding, which makes centered letterboxing straightforward. Choose DALI for pipeline-based workflows (especially with Triton), and CV-CUDA for fine-grained, operator-level control in custom inference code.

Contributors

glenn-jocher, onuralpszr, raimbekovm

Created last month, updated 5 days ago