Machine Learning Best Practices and Tips for Model Training

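Introduction

One of the most important steps in a computer vision project is model training. Before reaching this step, you need to define your goals and collect and annotate your data. Once the data has been preprocessed so that it is clean and consistent, you can move on to training your model.

Watch: Model Training Tips | How to Handle Large Datasets | Batch Size, GPU Utilization and Mixed Precision

So, what is model training? Model training is the process of teaching your model to recognize visual patterns and make predictions based on your data. It directly affects the performance and accuracy of your application. In this guide, we cover best practices, optimization techniques, and troubleshooting tips to help you train computer vision models effectively.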

How to Train a Machine Learning Model

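A computer vision model is trained by adjusting its internal parameters to minimize errors. Initially, the model is fed a large set of labeled images. It makes predictions about what these images contain, and the predictions are compared with the actual labels to calculate errors. These errors show how far the model's predictions are from the true values.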

During training, the model iteratively makes predictions, calculates errors, and updates its parameters through a process called backpropagation. In this process, the model adjusts its internal parameters (weights and biases) to reduce the errors. By repeating this cycle many times, the model gradually improves its accuracy. Over time, it learns to recognize complex patterns such as shapes, colors, and textures.

What is Backpropagation?
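The predict, compare, update cycle described above can be sketched on a toy problem. This is a minimal illustration, assuming a one-weight linear model and hand-derived gradients; a real vision model has millions of parameters and relies on a framework to compute the backward pass. The data, learning rate, and variable names are all made up for the example.

```python
# Toy training loop: fit y = w * x + b to data generated with w=2, b=1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]

w, b = 0.0, 0.0          # internal parameters (weight and bias)
learning_rate = 0.05     # illustrative value, not a recommendation
losses = []

for epoch in range(500):
    grad_w = grad_b = loss = 0.0
    for x, y_true in data:
        y_pred = w * x + b               # forward pass: make a prediction
        error = y_pred - y_true          # compare the prediction with the label
        loss += error ** 2 / len(data)   # mean squared error
        # Backward pass: derivatives of the loss with respect to each parameter
        grad_w += 2 * error * x / len(data)
        grad_b += 2 * error / len(data)
    # Update step: move the parameters against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    losses.append(loss)

print(f"w={w:.3f}, b={b:.3f}, first loss={losses[0]:.3f}, last loss={losses[-1]:.6f}")
```

After enough repetitions the parameters settle close to the values that generated the data, which is the same behavior a larger model shows on its training set.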

This learning process makes it possible for the computer vision model to perform various tasks, including object detection, instance segmentation, and image classification. The ultimate goal is to create a model that can generalize its learning to new, unseen images so that it can accurately understand visual data in real-world applications.

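Now that we know what happens behind the scenes when a model is trained, let's look at the points to consider during training.

Training on Large Datasets

There are several aspects to consider when you plan to train a model on a large dataset. For example, you can adjust the batch size, control GPU utilization, or use multiscale training. Let's walk through each of these options in detail.

Batch Size and GPU Utilization

When training models on large datasets, using your GPU efficiently is key. Batch size is an important factor: it is the number of data samples a machine learning model processes in a single training iteration. Using the maximum batch size your GPU supports lets you take full advantage of its capabilities and shortens training time. However, you need to avoid running out of GPU memory. If you encounter memory errors, reduce the batch size step by step until training runs smoothly.

With respect to YOLO11, you can set the batch parameter in the training configuration to match your GPU's capacity. In addition, setting batch=-1 in your training script will automatically determine the batch size that can be efficiently processed based on your device's capabilities. By fine-tuning the batch size, you can make the most of your GPU resources and improve the overall training process.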
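The "reduce until it fits" advice can be sketched as a small search. This is an illustration only: halving is one possible reduction strategy, find_working_batch_size and fake_training_step are hypothetical names, and Python's built-in MemoryError stands in for whatever out-of-memory exception your framework raises.

```python
def find_working_batch_size(try_batch, start=64, minimum=1):
    """Halve the batch size until `try_batch(batch)` stops raising MemoryError."""
    batch = start
    while batch >= minimum:
        try:
            try_batch(batch)      # attempt one training step at this size
            return batch
        except MemoryError:
            batch //= 2           # too large: try half the size next
    raise RuntimeError("even the minimum batch size does not fit in memory")


# Hypothetical stand-in for a training step: pretend the device fits 24 samples.
def fake_training_step(batch, capacity=24):
    if batch > capacity:
        raise MemoryError(f"batch of {batch} exceeds capacity {capacity}")


best = find_working_batch_size(fake_training_step, start=64)
print(best)  # 16
```

With YOLO11 this manual search is usually unnecessary, since batch=-1 asks the trainer to pick a size for you.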

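Subset Training

Subset training is a smart strategy that involves training your model on a smaller set of data that represents the larger dataset. It can save time and resources, especially during initial model development and testing. If you are short on time or experimenting with different model configurations, subset training is a good option.

When it comes to YOLO11, you can easily implement subset training by using the fraction parameter. This parameter lets you specify what fraction of your dataset to use for training. For example, setting fraction=0.1 trains your model on 10% of the data. You can use this technique for quick iterations and tuning before committing to training on the full dataset. Subset training helps you make rapid progress and identify potential issues early on.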
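To make the idea of a fractional subset concrete, here is a minimal sketch that samples 10% of a list of image paths. It is not how YOLO11 implements fraction internally; take_fraction and the file paths are hypothetical, and a fixed seed is assumed so the subset is reproducible between runs.

```python
import random


def take_fraction(items, fraction, seed=0):
    """Return a reproducible random subset containing `fraction` of `items`."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, round(len(items) * fraction))
    rng = random.Random(seed)          # fixed seed keeps the subset stable
    return rng.sample(items, count)    # sampling without replacement


# Hypothetical dataset of 1000 image paths
image_paths = [f"images/train/{i:05d}.jpg" for i in range(1000)]
subset = take_fraction(image_paths, fraction=0.1)
print(len(subset))  # 100
```

Random sampling keeps the subset roughly representative of the full dataset, which is what makes early experiments on it meaningful.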

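Multi-scale Training

Multiscale training is a technique that improves your model's ability to generalize by training it on images of varying sizes. Your model can learn to detect objects at different scales and distances and become more robust.

For example, when you train YOLO11, you can enable multiscale training by setting the scale parameter. This parameter adjusts the size of training images by a specified factor, simulating objects at different distances. For example, setting scale=0.5 reduces the image size by half, while scale=2.0 doubles it. Configuring this parameter lets your model experience a variety of image scales and improves its detection capabilities across different object sizes and scenarios.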
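The general idea behind multiscale training can be sketched as choosing a different image size for each batch. This is a generic illustration and not the YOLO11 implementation of scale; pick_training_size, the 0.5 to 1.5 range, and the stride of 32 are assumptions chosen for the example (many detectors expect sizes that are multiples of their stride).

```python
import random


def pick_training_size(base=640, low=0.5, high=1.5, stride=32, rng=random):
    """Pick a random square image size near `base`, rounded to a multiple of `stride`."""
    factor = rng.uniform(low, high)                      # random scale factor
    size = int(round(base * factor / stride)) * stride   # snap to the stride grid
    return max(stride, size)


rng = random.Random(0)  # seeded generator so the sequence is reproducible
sizes = [pick_training_size(rng=rng) for _ in range(5)]
print(sizes)
```

Each batch would then be resized to the chosen size before being passed to the model, so the same object appears at several apparent scales over the course of training.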

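Caching

Caching is an important technique for improving the efficiency of training machine learning models. By storing preprocessed images in memory, caching reduces the time the GPU spends waiting for data to be loaded from disk. The model can receive data continuously without delays caused by disk I/O operations.

Caching can be controlled when training YOLO11 using the cache parameter:

cache=True: Stores dataset images in RAM, providing the fastest access at the cost of increased memory usage.
cache='disk': Stores the images on disk, slower than RAM but faster than loading fresh data each time.
cache=False: Disables caching and relies entirely on disk I/O, which is the slowest option.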
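The effect of an in-memory cache can be sketched with a loader that counts how often it goes to disk. This is a simplified illustration, assuming a plain dictionary as the cache; CachedLoader and fake_read are hypothetical names, and the "read" is simulated rather than touching real files.

```python
class CachedLoader:
    """Keeps decoded images in a dict so each file is read from disk only once."""

    def __init__(self, read_fn):
        self.read_fn = read_fn   # slow function that reads and decodes a file
        self.cache = {}
        self.disk_reads = 0

    def load(self, path):
        if path not in self.cache:          # cache miss: go to disk once
            self.disk_reads += 1
            self.cache[path] = self.read_fn(path)
        return self.cache[path]             # cache hit: served from memory


def fake_read(path):
    # Hypothetical stand-in for reading and decoding an image file
    return f"pixels of {path}"


loader = CachedLoader(fake_read)
paths = ["a.jpg", "b.jpg", "c.jpg"]
for epoch in range(3):
    for p in paths:
        loader.load(p)

print(loader.disk_reads)  # 3
```

Three epochs over three images cost three disk reads instead of nine; the trade-off is that every cached image stays in memory for the whole run.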

Mixed Precision Training

Mixed precision training uses both 16-bit (FP16) and 32-bit (FP32) floating-point types. The strengths of both FP16 and FP32 are leveraged by using FP16 for faster computation and FP32 to maintain precision where needed. Most of the neural network's operations are done in FP16 to benefit from faster computation and lower memory usage. However, a master copy of the model's weights is kept in FP32 to ensure accuracy during the weight update steps. You can handle larger models or larger batch sizes within the same hardware constraints.

Mixed Precision Training Overview

To implement mixed precision training, you'll need to modify your training scripts and ensure your hardware (like GPUs) supports it. Many modern deep learning frameworks, such as TensorFlow, offer built-in support for mixed precision.

Mixed precision training is straightforward when working with YOLO11. You can use the amp flag in your training configuration. Setting amp=True enables Automatic Mixed Precision (AMP) training. Mixed precision training is a simple yet effective way to optimize your model training process.
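The reason a master copy of the weights is kept in FP32 can be shown with a small numeric experiment. This sketch simulates FP16 and FP32 rounding with Python's struct module; the starting weight, update size, and step count are arbitrary values chosen for illustration, and real AMP implementations add further safeguards that are not modeled here.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest 16-bit (half precision) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest 32-bit (single precision) value."""
    return struct.unpack("f", struct.pack("f", x))[0]


update = 1e-4                 # a small weight update, as produced late in training
weight_fp16 = to_fp16(1.0)    # weight stored only in FP16
master_fp32 = to_fp32(1.0)    # FP32 master copy of the same weight

for _ in range(1000):
    weight_fp16 = to_fp16(weight_fp16 + update)   # update is rounded away each step
    master_fp32 = to_fp32(master_fp32 + update)   # update accumulates correctly

compute_copy = to_fp16(master_fp32)   # FP16 copy derived from the master weights

print(weight_fp16)              # 1.0
print(round(master_fp32, 4))    # 1.1
```

Near 1.0, neighboring FP16 values are about 0.001 apart, so an update of 0.0001 is lost entirely when applied directly in FP16. Accumulating in FP32 and casting down only for computation keeps the speed benefit without losing the updates.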

Pretrained Weights

Using pretrained weights is a smart way to speed up your model's training process. Pretrained weights come from models already trained on large datasets, giving your model a head start. Transfer learning adapts pretrained models to new, related tasks. Fine-tuning a pretrained model involves starting with these weights and then continuing training on your specific dataset. This approach results in faster training times and often better performance because the model starts with a solid understanding of basic features.

The pretrained parameter makes transfer learning easy with YOLO11. Setting pretrained=True will use default pretrained weights, or you can specify a path to a custom pretrained model. Using pretrained weights and transfer learning effectively boosts your model's capabilities and reduces training costs.

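Other Techniques to Consider When Handling a Large Dataset

There are a couple of other techniques to consider when handling a large dataset:

Learning Rate Schedulers: Implementing learning rate schedulers dynamically adjusts the learning rate during training. A well-tuned learning rate can prevent the model from overshooting minima and improve stability. When training YOLO11, the lrf parameter helps manage learning rate scheduling by setting the final learning rate as a fraction of the initial rate.
Distributed Training: For handling large datasets, distributed training can be a game changer. You can reduce the training time by spreading the training workload across multiple GPUs or machines.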
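The role of a final-rate fraction such as lrf can be sketched with a simple linear decay. This is one common schedule shape, assumed here for illustration; the function name linear_lr and the default values are this example's choices, with lr0 standing for the initial learning rate.

```python
def linear_lr(epoch, total_epochs, lr0=0.01, lrf=0.01):
    """Linearly decay the learning rate from lr0 down to lr0 * lrf."""
    progress = min(epoch / total_epochs, 1.0)            # 0.0 at start, 1.0 at end
    return lr0 * ((1.0 - progress) * (1.0 - lrf) + lrf)


schedule = [linear_lr(e, total_epochs=100) for e in range(101)]
print(f"start={schedule[0]:.6f}, middle={schedule[50]:.6f}, end={schedule[-1]:.6f}")
```

With lr0=0.01 and lrf=0.01, the rate starts at 0.01 and ends at 0.0001, so later epochs make much smaller parameter updates than early ones.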

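The Number of Epochs to Train For

When training a model, an epoch refers to one complete pass through the entire training dataset. During an epoch, the model processes each example in the training set once and updates its parameters based on the learning algorithm. Multiple epochs are usually needed for the model to learn and refine its parameters over time.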

A common question that comes up is how to determine the number of epochs to train the model for. A good starting point is 300 epochs. If the model overfits early, you can reduce the number of epochs. If overfitting does not occur after 300 epochs, you can extend the training to 600, 1200, or more epochs.

However, the ideal number of epochs can vary based on your dataset's size and project goals. Larger datasets might require more epochs for the model to learn effectively, while smaller datasets might need fewer epochs to avoid overfitting. With respect to YOLO11, you can set the epochs parameter in your training script.
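When budgeting a training run, it helps to translate epochs into parameter updates. The sketch below assumes one update per batch and a final partial batch that still counts as a step; total_iterations and the dataset size are hypothetical.

```python
import math


def total_iterations(num_images, batch_size, epochs):
    """Number of parameter updates, assuming one update per batch."""
    steps_per_epoch = math.ceil(num_images / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs


print(total_iterations(num_images=1000, batch_size=16, epochs=300))  # 18900
```

Doubling the epochs doubles the number of updates, while doubling the batch size roughly halves it, which is why the two settings are usually tuned together.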

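Early Stopping

Early stopping is a valuable technique for optimizing model training. By monitoring validation performance, you can halt training once the model stops improving. This saves computational resources and helps prevent overfitting.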

The process involves setting a patience parameter that determines how many epochs to wait for an improvement in validation metrics before stopping training. If the model's performance does not improve within these epochs, training is stopped to avoid wasting time and resources.

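Early Stopping Overview

For YOLO11, you can enable early stopping by setting the patience parameter in your training configuration. For example, patience=5 means training will stop if there is no improvement in validation metrics for 5 consecutive epochs. Using this method keeps the training process efficient and achieves optimal performance without excessive computation.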
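The patience rule can be sketched in a few lines. This is a minimal illustration, assuming a single validation metric where higher is better; EarlyStopper and the metric values are made up for the example and do not reflect the internals of any particular trainer.

```python
class EarlyStopper:
    """Signals a stop when the metric has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.best_epoch = 0

    def step(self, epoch, metric):
        if metric > self.best:           # strict improvement resets the counter
            self.best = metric
            self.best_epoch = epoch
        return (epoch - self.best_epoch) >= self.patience


# Hypothetical validation scores, one per epoch
metrics = [0.50, 0.55, 0.60, 0.59, 0.60, 0.58, 0.57, 0.59, 0.61, 0.62]

stopper = EarlyStopper(patience=5)
stopped_at = None
for epoch, metric in enumerate(metrics):
    if stopper.step(epoch, metric):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # 7 0.6
```

Training stops at epoch 7, five epochs after the best score at epoch 2. Note that the later scores of 0.61 and 0.62 are never reached, which shows the trade-off: a small patience value saves compute but can stop before a late improvement.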

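Choosing Between Cloud and Local Training

There are two options for training your model: cloud training and local training.

Cloud training offers scalability and powerful hardware and is ideal for handling large datasets and complex models. Platforms like Google Cloud, AWS, and Azure provide on-demand access to high-performance GPUs and TPUs, which speeds up training and makes experiments with larger models possible. However, cloud training can be expensive, especially over long periods, and data transfer can add to costs and latency.

Local training provides greater control and customization, letting you tailor your environment to specific needs and avoid ongoing cloud costs. It can be more economical for long-term projects, and since your data stays on-premises, it is more secure. However, local hardware may have resource limitations and require maintenance, which can lead to longer training times for large models.

Selecting an Optimizer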

An optimizer is an algorithm that adjusts the weights of your neural network to minimize the loss function, which measures how well the model is performing. In simpler terms, the optimizer helps the model learn by tweaking its parameters to reduce errors. Choosing the right optimizer directly affects how quickly and accurately the model learns.

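You can also fine-tune optimizer parameters to improve model performance. The learning rate sets the size of the steps taken when updating parameters. For stability, you might start with a moderate learning rate and gradually decrease it over time to improve long-term learning. The momentum setting determines how much influence past updates have on the current update. A common value for momentum is around 0.9, which generally provides a good balance.

Common Optimizers

Different optimizers have various strengths and weaknesses. Let's take a look at a few common optimizers.

SGD (Stochastic Gradient Descent):
- Updates model parameters using the gradient of the loss function with respect to the parameters.
- Simple and efficient, but can be slow to converge and might get stuck in local minima.
Adam (Adaptive Moment Estimation):
- Combines the benefits of SGD with momentum and RMSProp.
- Adjusts the learning rate for each parameter based on estimates of the first and second moments of the gradients.
- Well suited for noisy data and sparse gradients.
- Efficient and generally requires less tuning, making it a recommended optimizer for YOLO11.
RMSProp (Root Mean Square Propagation):
- Adjusts the learning rate for each parameter by dividing the gradient by a running average of the magnitudes of recent gradients.
- Helps in handling the vanishing gradient problem and is effective for recurrent neural networks.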
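The update rules behind two of these optimizers can be sketched on a one-dimensional toy loss. This is an illustration of the textbook formulas, assuming the loss f(w) = (w - 3)^2 and typical default hyperparameters; the function names are this example's own and a deep learning framework would normally provide these optimizers.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def sgd_momentum(w, steps, lr=0.1, momentum=0.9):
    """SGD with momentum: past gradients keep influencing the current step."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + grad(w)
        w -= lr * velocity
    return w


def adam(w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: step size adapted using first and second moment estimates."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (mean of squares)
        m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_sgd = sgd_momentum(0.0, steps=500)
w_adam = adam(0.0, steps=500)
print(f"SGD+momentum: {w_sgd:.4f}, Adam: {w_adam:.4f}")
```

Both runs start at w=0 and move toward the minimum at w=3. The difference lies in how the step is formed: momentum reuses past gradients directly, while Adam also rescales each step by the recent gradient magnitude.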

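For YOLO11, the optimizer parameter lets you choose from various optimizers, including SGD, Adam, AdamW, NAdam, RAdam, and RMSProp, or you can set it to auto for automatic selection based on the model configuration.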
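As a rough sketch, the training parameters covered in this guide could be gathered into a single set of arguments for the Ultralytics Python package. The values are illustrative rather than recommendations, the dataset file coco8.yaml and weights file yolo11n.pt are placeholders to swap for your own, and train_with is a hypothetical helper that requires the ultralytics package to be installed.

```python
# Illustrative training arguments; tune each value for your own project.
TRAIN_ARGS = {
    "data": "coco8.yaml",   # placeholder dataset configuration file
    "epochs": 300,          # starting point suggested in this guide
    "patience": 5,          # early stopping after 5 epochs without improvement
    "batch": -1,            # let the trainer pick a batch size for the device
    "fraction": 0.1,        # train on 10% of the dataset for quick iteration
    "cache": True,          # keep dataset images in RAM
    "amp": True,            # automatic mixed precision
    "pretrained": True,     # start from pretrained weights
    "lrf": 0.01,            # final learning rate as a fraction of the initial one
    "optimizer": "auto",    # automatic optimizer selection
    # "device": [0, 1],     # uncomment to spread training across two GPUs
}


def train_with(args, weights="yolo11n.pt"):
    """Hypothetical helper: load a model and train it with the given arguments."""
    from ultralytics import YOLO  # imported lazily; needs `pip install ultralytics`

    model = YOLO(weights)
    return model.train(**args)
```

Calling train_with(TRAIN_ARGS) would start a run with these settings. Keeping the arguments in one dictionary makes it easy to switch fraction back to 1.0 once a quick experiment looks promising.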

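Connect with the Community

Being part of a community of computer vision enthusiasts can help you solve problems and learn faster. Here are some ways to connect, get help, and share ideas.

Community Resources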

GitHub Issues: Visit the YOLO11 GitHub repository and use the Issues tab to ask questions, report bugs, and suggest new features. The community and maintainers are very active and ready to help.
Ultralytics Discord Server: Join the Ultralytics Discord server to chat with other users and developers, get support, and share your experiences.

Official Documentation

Ultralytics YOLO11 Documentation: Check out the official YOLO11 documentation for detailed guides and helpful tips on various computer vision projects.

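Using these resources will help you solve challenges and stay up to date with the latest trends and practices in the computer vision community.

Key Takeaways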

Training computer vision models involves following good practices, optimizing your strategies, and solving problems as they arise. Techniques like adjusting batch sizes, mixed precision training, and starting with pre-trained weights can make your models work better and train faster. Methods like subset training and early stopping help you save time and resources. Staying connected with the community and keeping up with new trends will help you keep improving your model training skills.

FAQ

How can I improve GPU utilization when training a large dataset with Ultralytics YOLO?

To improve GPU utilization, set the batch parameter in your training configuration to the maximum size supported by your GPU. This ensures that you make full use of the GPU's capabilities, reducing training time. If you encounter memory errors, incrementally reduce the batch size until training runs smoothly. For YOLO11, setting batch=-1 in your training script will automatically determine the optimal batch size for efficient processing. For further information, refer to the training configuration.

What is mixed precision training, and how do I enable it in YOLO11?

Mixed precision training utilizes both 16-bit (FP16) and 32-bit (FP32) floating-point types to balance computational speed and precision. This approach speeds up training and reduces memory usage without sacrificing model accuracy. To enable mixed precision training in YOLO11, set the amp parameter to True in your training configuration. This activates Automatic Mixed Precision (AMP) training. For more details on this optimization technique, see the training configuration.

How does multiscale training enhance YOLO11 model performance?

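Multiscale training enhances model performance by training on images of varying sizes, allowing the model to better generalize across different scales and distances. In YOLO11, you can enable multiscale training by setting the scale parameter. For example, scale=0.5 reduces the image size by half, while scale=2.0 doubles it. This technique simulates objects at different distances, making the model more robust across various scenarios. For settings and more details, check out the training configuration.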

How can I use pre-trained weights to speed up training in YOLO11?

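Using pretrained weights can significantly reduce training times and improve model performance by starting from a model that already understands basic features. In YOLO11, you can set the pretrained parameter to True or specify a path to custom pretrained weights in your training configuration. This approach, known as transfer learning, leverages knowledge from large datasets to adapt to your specific task. For more on pretrained weights and their advantages, see the Pretrained Weights section above.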

What is the recommended number of epochs for training a model, and how do I set this in YOLO11?

The number of epochs refers to the complete passes through the training dataset during model training. A typical starting point is 300 epochs. If your model overfits early, you can reduce the number. Alternatively, if overfitting isn't observed, you might extend training to 600, 1200, or more epochs. To set this in YOLO11, use the epochs parameter in your training script. For additional advice on determining the ideal number of epochs, refer to the section on the number of epochs to train for.
