Reference for ultralytics/data/build.py
Note
This file is available at https://github.com/ultralytics/ultralytics/blob/main/ultralytics/data/build.py. If you spot a problem please help fix it by contributing a Pull Request 🛠️. Thank you 🙏!
ultralytics.data.build.InfiniteDataLoader
InfiniteDataLoader(*args: Any, **kwargs: Any)
Bases: DataLoader
Dataloader that reuses workers for infinite iteration.
This dataloader extends the PyTorch DataLoader to provide infinite recycling of workers, which improves efficiency for training loops that need to iterate through the dataset multiple times without recreating workers.
Attributes:
Name | Type | Description |
---|---|---|
batch_sampler | _RepeatSampler | A sampler that repeats indefinitely. |
iterator | Iterator | The iterator from the parent DataLoader. |
Methods:
Name | Description |
---|---|
__len__ | Return the length of the batch sampler's sampler. |
__iter__ | Create a sampler that repeats indefinitely. |
__del__ | Ensure workers are properly terminated. |
reset | Reset the iterator, useful when modifying dataset settings during training. |
Examples:
Create an infinite dataloader for training
>>> dataset = YOLODataset(...)
>>> dataloader = InfiniteDataLoader(dataset, batch_size=16, shuffle=True)
>>> for batch in dataloader:  # Infinite iteration
...     train_step(batch)
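The worker-recycling idea can be illustrated without PyTorch. The sketch below is a toy analog, not the library's implementation: `ReusableLoader` and its `creations` counter are hypothetical names, and it assumes that each pass draws `len(loader)` batches from one long-lived iterator that is rebuilt only by `reset()`.

```python
class ReusableLoader:
    """Toy analog of a dataloader that keeps one iterator alive across epochs."""

    def __init__(self, batches):
        self.batches = list(batches)
        self.creations = 0  # how many times the "expensive" iterator was built
        self.iterator = self._make_iterator()

    def _make_iterator(self):
        # Stands in for spawning worker processes (assumption for illustration).
        self.creations += 1
        return self._endless()

    def _endless(self):
        # Cycle over the batches forever.
        while True:
            yield from self.batches

    def __len__(self):
        return len(self.batches)

    def __iter__(self):
        # One pass yields exactly len(self) batches from the shared iterator.
        for _ in range(len(self)):
            yield next(self.iterator)

    def reset(self):
        # Rebuild the iterator, e.g. after changing dataset settings.
        self.iterator = self._make_iterator()


loader = ReusableLoader([[0, 1], [2, 3], [4]])
epochs = [list(loader) for _ in range(3)]  # three passes, one iterator
```

In this sketch three epochs complete while `creations` stays at 1; only an explicit `reset()` pays the setup cost again.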
__del__
__del__()
Ensure that workers are properly terminated when the dataloader is deleted.
__iter__
__iter__() -> Iterator
Create an iterator that yields indefinitely from the underlying iterator.
__len__
__len__() -> int
Return the length of the batch sampler's sampler.
reset
reset()
Reset the iterator to allow modifications to the dataset during training.
ultralytics.data.build._RepeatSampler
_RepeatSampler(sampler: Any)
Sampler that repeats forever for infinite iteration.
This sampler wraps another sampler and yields its contents indefinitely, allowing for infinite iteration over a dataset without recreating the sampler.
Attributes:
Name | Type | Description |
---|---|---|
sampler | sampler | The sampler to repeat. |
__iter__
__iter__() -> Iterator
Iterate over the sampler indefinitely, yielding its contents.
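A minimal pure-Python sketch of the repeat-forever pattern described here. `RepeatSampler` below is a hypothetical stand-in written for illustration, and a plain list of batches plays the role of the wrapped sampler.

```python
from itertools import islice


class RepeatSampler:
    """Hypothetical stand-in: wraps any iterable and replays it endlessly."""

    def __init__(self, sampler):
        self.sampler = sampler

    def __iter__(self):
        # Restart the wrapped sampler each time it is exhausted.
        while True:
            yield from iter(self.sampler)


batches = [[0, 1], [2, 3], [4]]
# The iterator never ends, so take a finite slice to inspect it.
first_seven = list(islice(RepeatSampler(batches), 7))
```

Because the iterator never raises `StopIteration`, the consumer has to decide how many items make up one epoch.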
ultralytics.data.build.ContiguousDistributedSampler
ContiguousDistributedSampler(
    dataset, num_replicas=None, batch_size=None, rank=None, shuffle=False
)
Bases: Sampler
Distributed sampler that assigns contiguous batch-aligned chunks of the dataset to each GPU.
Unlike PyTorch's DistributedSampler, which distributes samples round-robin (with two GPUs, GPU 0 gets indices [0,2,4,...] and GPU 1 gets [1,3,5,...]), this sampler gives each GPU a contiguous run of batches (GPU 0 gets batches [0,1,2,...], GPU 1 gets batches [k,k+1,...], and so on). This preserves any ordering or grouping in the original dataset, which is critical when samples are organized by similarity, for example images sorted by size to enable efficient batching without padding when using rect=True.
The sampler handles uneven batch counts by distributing remainder batches to the first few ranks, ensuring all samples are covered exactly once across all GPUs.
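The chunk assignment can be sketched in a few lines. This is one way to compute batch-aligned contiguous ranges under the rules stated above, not necessarily how the sampler does it internally; `contiguous_chunk` is a hypothetical helper.

```python
import math


def contiguous_chunk(dataset_len, batch_size, num_replicas, rank):
    """Return the index range owned by `rank` (hypothetical helper)."""
    total_batches = math.ceil(dataset_len / batch_size)
    base, remainder = divmod(total_batches, num_replicas)
    # The first `remainder` ranks each take one extra batch.
    n_batches = base + (1 if rank < remainder else 0)
    start_batch = rank * base + min(rank, remainder)
    start = start_batch * batch_size
    # Clamp the end so the last, possibly short, batch stays in bounds.
    end = min((start_batch + n_batches) * batch_size, dataset_len)
    return range(start, end)


# 10 samples, batch size 2, 3 ranks: 5 batches split as 2 + 2 + 1.
chunks = [list(contiguous_chunk(10, 2, 3, r)) for r in range(3)]
# chunks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Every index appears in exactly one chunk, and each chunk starts on a batch boundary, so size-sorted batches stay intact on each rank.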
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | Dataset to sample from. Must implement __len__. | required |
num_replicas | int | Number of distributed processes. Defaults to world size. | None |
batch_size | int | Batch size used by dataloader. Defaults to dataset batch size. | None |
rank | int | Rank of current process. Defaults to current rank. | None |
shuffle | bool | Whether to shuffle indices within each rank's chunk. Defaults to False. When True, shuffling is deterministic and controlled by set_epoch() for reproducibility. | False |
Examples:
For validation with size-grouped images
>>> sampler = ContiguousDistributedSampler(val_dataset, batch_size=32, shuffle=False)
>>> loader = DataLoader(val_dataset, batch_size=32, sampler=sampler)
For training with shuffling
>>> sampler = ContiguousDistributedSampler(train_dataset, batch_size=32, shuffle=True)
>>> for epoch in range(num_epochs):
...     sampler.set_epoch(epoch)
...     for batch in loader:
...         ...
__iter__
__iter__()
Generate indices for this rank's contiguous chunk of the dataset.
__len__
__len__()
Return the number of samples in this rank's chunk.
set_epoch
set_epoch(epoch)
Set the epoch for this sampler to ensure different shuffling patterns across epochs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
epoch | int | Epoch number to use as the random seed for shuffling. | required |
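Epoch-seeded shuffling can be sketched with the standard library. This is an illustration under assumptions: `shuffled_chunk` is a hypothetical helper, and Python's `random.Random` stands in for whatever generator the sampler actually uses.

```python
import random


def shuffled_chunk(indices, epoch):
    """Shuffle one rank's indices using the epoch as the seed (hypothetical)."""
    rng = random.Random(epoch)  # same epoch -> same permutation
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return shuffled


chunk = range(8, 16)  # indices owned by one rank
run_a = shuffled_chunk(chunk, epoch=3)
run_b = shuffled_chunk(chunk, epoch=3)  # identical to run_a
```

The shuffle only reorders indices inside the rank's own chunk, so repeated runs with the same epoch number see the same order and no index leaves the chunk.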
ultralytics.data.build.seed_worker
seed_worker(worker_id: int)
Set dataloader worker seed for reproducibility across worker processes.
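The general pattern behind worker seeding is sketched below using only the standard library. PyTorch's reproducibility notes derive the per-worker seed from `torch.initial_seed()`; here a hypothetical `base_seed` argument stands in for it, and `seed_worker_sketch` is an illustrative name, not the library function.

```python
import random


def seed_worker_sketch(worker_id, base_seed=0):
    """Seed Python's RNG for one worker (illustrative, stdlib only)."""
    # Keep the seed within 32 bits, the range accepted by NumPy's legacy seeding.
    worker_seed = (base_seed + worker_id) % 2**32
    random.seed(worker_seed)
    return worker_seed


def sample_after_seeding(worker_id):
    # Draw a few numbers right after seeding to show the stream is repeatable.
    seed_worker_sketch(worker_id, base_seed=1234)
    return [random.random() for _ in range(3)]
```

Calling `sample_after_seeding(0)` twice returns the same three numbers, which is the property that makes augmentation randomness repeatable per worker.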
ultralytics.data.build.build_yolo_dataset
build_yolo_dataset(
    cfg: IterableSimpleNamespace,
    img_path: str,
    batch: int,
    data: dict[str, Any],
    mode: str = "train",
    rect: bool = False,
    stride: int = 32,
    multi_modal: bool = False,
)
Build and return a YOLO dataset based on configuration parameters.
ultralytics.data.build.build_grounding
build_grounding(
    cfg: IterableSimpleNamespace,
    img_path: str,
    json_file: str,
    batch: int,
    mode: str = "train",
    rect: bool = False,
    stride: int = 32,
    max_samples: int = 80,
)
Build and return a GroundingDataset based on configuration parameters.
ultralytics.data.build.build_dataloader
build_dataloader(
    dataset,
    batch: int,
    workers: int,
    shuffle: bool = True,
    rank: int = -1,
    drop_last: bool = False,
    pin_memory: bool = True,
)
Create and return an InfiniteDataLoader or DataLoader for training or validation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | Dataset to load data from. | required |
batch | int | Batch size for the dataloader. | required |
workers | int | Number of worker processes for loading data. | required |
shuffle | bool | Whether to shuffle the dataset. | True |
rank | int | Process rank in distributed training. -1 for single-GPU training. | -1 |
drop_last | bool | Whether to drop the last incomplete batch. | False |
pin_memory | bool | Whether to use pinned memory for dataloader. | True |
Returns:
Type | Description |
---|---|
InfiniteDataLoader | A dataloader that can be used for training or validation. |
Examples:
Create a dataloader for training
>>> dataset = YOLODataset(...)
>>> dataloader = build_dataloader(dataset, batch=16, workers=4, shuffle=True)
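The effect of `drop_last` on the number of batches per epoch can be worked out by hand. The helper below is a hypothetical illustration of the arithmetic for a map-style dataset; it does not call the library.

```python
import math


def num_batches(dataset_len, batch, drop_last=False):
    """Batches per epoch for a dataset of `dataset_len` samples (hypothetical helper)."""
    if drop_last:
        # The trailing partial batch is discarded.
        return dataset_len // batch
    # The trailing partial batch is kept as a smaller final batch.
    return math.ceil(dataset_len / batch)


kept = num_batches(100, 16)                     # 7 batches, the last has 4 samples
dropped = num_batches(100, 16, drop_last=True)  # 6 full batches
```

With 100 samples and a batch size of 16, keeping the last batch gives 7 batches, while dropping it gives 6 and leaves 4 samples unused in that epoch.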
ultralytics.data.build.check_source
check_source(source)
Check the type of input source and return corresponding flag values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | str \| int \| Path \| list \| tuple \| ndarray \| Image \| Tensor | The input source to check. | required |
Returns:
Name | Type | Description |
---|---|---|
source | str \| int \| Path \| list \| tuple \| ndarray \| Image \| Tensor | The processed source. |
webcam | bool | Whether the source is a webcam. |
screenshot | bool | Whether the source is a screenshot. |
from_img | bool | Whether the source is an image or list of images. |
in_memory | bool | Whether the source is an in-memory object. |
tensor | bool | Whether the source is a torch.Tensor. |
Examples:
Check a file path source
>>> source, webcam, screenshot, from_img, in_memory, tensor = check_source("image.jpg")
Check a webcam source
>>> source, webcam, screenshot, from_img, in_memory, tensor = check_source(0)
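Because the function returns six positional values, it is easy to mix up the flags when unpacking. One option is to wrap the tuple in a named structure. `SourceFlags` and `describe` below are hypothetical helpers, not part of the library, and the sample values are hand-written rather than produced by `check_source`.

```python
from typing import Any, NamedTuple


class SourceFlags(NamedTuple):
    """Hypothetical wrapper mirroring the documented return order."""

    source: Any
    webcam: bool
    screenshot: bool
    from_img: bool
    in_memory: bool
    tensor: bool


def describe(flags):
    """Return the name of the first flag that is set, or 'other' if none is."""
    for name in ("webcam", "screenshot", "from_img", "in_memory", "tensor"):
        if getattr(flags, name):
            return name
    return "other"


# Hand-written example values standing in for a real check_source(...) result.
flags = SourceFlags(0, True, False, False, False, False)
label = describe(flags)
```

A real result could be wrapped the same way with `SourceFlags(*check_source(source))`, assuming the return order matches the table above.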
ultralytics.data.build.load_inference_source
load_inference_source(
    source=None,
    batch: int = 1,
    vid_stride: int = 1,
    buffer: bool = False,
    channels: int = 3,
)
Load an inference source for object detection and apply necessary transformations.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | str \| Path \| Tensor \| Image \| ndarray | The input source for inference. | None |
batch | int | Batch size for dataloaders. | 1 |
vid_stride | int | The frame interval for video sources. | 1 |
buffer | bool | Whether stream frames will be buffered. | False |
channels | int | The number of input channels for the model. | 3 |
Returns:
Type | Description |
---|---|
Dataset | A dataset object for the specified input source with attached source_type attribute. |
Examples:
Load an image source for inference
>>> dataset = load_inference_source("image.jpg", batch=1)
Load a video stream source
>>> dataset = load_inference_source("rtsp://example.com/stream", vid_stride=2)
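As a rough illustration of what a frame interval means, the sketch below keeps every `vid_stride`-th frame from a sequence. It assumes that `vid_stride=2` corresponds to processing every second frame; `strided_frames` is a hypothetical helper and integers stand in for decoded frames.

```python
def strided_frames(frames, vid_stride=1):
    """Yield every `vid_stride`-th frame, starting with the first (hypothetical)."""
    for index, frame in enumerate(frames):
        if index % vid_stride == 0:
            yield frame


every_frame = list(strided_frames(range(10)))                 # stride 1 keeps all frames
every_third = list(strided_frames(range(10), vid_stride=3))   # [0, 3, 6, 9]
```

A larger stride reduces the number of frames passed to the model at the cost of temporal resolution.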