Link to this sectionBaidu's RT-DETR: A Vision Transformer-Based Real-Time Object Detector#

Q: Can max_det make RT-DETR return more than 300 detections?

No. For RT-DETR, max_det caps how many predictions are returned after inference, but it does not increase the number of object queries produced by the decoder. The Ultralytics RT-DETR pretrained checkpoints use 300 object queries, so they cannot return more than 300 detections per image even if you set max_det to a larger value. Use max_det to reduce returned detections, for example max_det=100, when you only need fewer high-confidence predictions. If your dataset can contain more than 300 objects per image, train a custom RT-DETR model with a higher decoder query count (nq) in the model YAML; changing this value on a pretrained checkpoint after training is not equivalent and requires retraining to learn the additional queries.

Link to this sectionOverview#

Real-Time Detection Transformer (RT-DETR), developed by Baidu, is a cutting-edge end-to-end object detector that provides real-time performance while maintaining high accuracy. It is based on the idea of DETR (the NMS-free framework), meanwhile introducing conv-based backbone and an efficient hybrid encoder to gain real-time speed. RT-DETR efficiently processes multiscale features by decoupling intra-scale interaction and cross-scale fusion. The model is highly adaptable, supporting flexible adjustment of inference speed using different decoder layers without retraining. RT-DETR excels on accelerated backends like CUDA with TensorRT, outperforming many other real-time object detectors.

Watch: How to Use Baidu's RT-DETR for Object Detection | Inference and Benchmarking with Ultralytics 🚀

Baidu RT-DETR model architecture overview Overview of Baidu's RT-DETR. The RT-DETR model architecture diagram shows the last three stages of the backbone {S3, S4, S5} as the input to the encoder. The efficient hybrid encoder transforms multiscale features into a sequence of image features through intrascale feature interaction (AIFI) and cross-scale feature-fusion module (CCFM). The IoU-aware query selection is employed to select a fixed number of image features to serve as initial object queries for the decoder. Finally, the decoder with auxiliary prediction heads iteratively optimizes object queries to generate boxes and confidence scores (source).

Link to this sectionKey Features#

Efficient Hybrid Encoder: Baidu's RT-DETR uses an efficient hybrid encoder that processes multiscale features by decoupling intra-scale interaction and cross-scale fusion. This unique Vision Transformers-based design reduces computational costs and allows for real-time object detection.
IoU-aware Query Selection: Baidu's RT-DETR improves object query initialization by utilizing IoU-aware query selection. This allows the model to focus on the most relevant objects in the scene, enhancing the detection accuracy.
Adaptable Inference Speed: Baidu's RT-DETR supports flexible adjustments of inference speed by using different decoder layers without the need for retraining. This adaptability facilitates practical application in various real-time object detection scenarios.
NMS-Free Framework: Based on DETR, RT-DETR eliminates the need for non-maximum suppression post-processing, simplifying the detection pipeline and potentially improving efficiency.
Anchor-Free Detection: As an anchor-free detector, RT-DETR simplifies the detection process and may improve generalization across different datasets.

Link to this sectionPretrained Models#

The Ultralytics Python API provides pretrained PaddlePaddle RT-DETR models with different scales:

RT-DETR-L: 53.0% AP on COCO val2017, 114 FPS on T4 GPU
RT-DETR-X: 54.8% AP on COCO val2017, 74 FPS on T4 GPU

Additionally, Baidu has released RTDETRv2 in July 2024, which further improves upon the original architecture with enhanced performance metrics.

Link to this sectionUsage Examples#

This example provides simple RT-DETR training and inference examples. For full documentation on these and other modes see the Predict, Train, Val and Export docs pages. Models can also be trained on cloud GPUs through Ultralytics Platform.

Example

from ultralytics import RTDETR

# Load a COCO-pretrained RT-DETR-l model
model = RTDETR("rtdetr-l.pt")

# Display model information (optional)
model.info()

# Train the model on the COCO8 example dataset for 100 epochs
results = model.train(data="coco8.yaml", epochs=100, imgsz=640)

# Run inference with the RT-DETR-l model on the 'bus.jpg' image
results = model("path/to/bus.jpg")

Faster Inference Trade-Offs

RT-DETR pretrained weights support two inference-time settings to reduce latency without retraining:

eval_idx: Stop decoding early. For the default 6-layer decoder, use a zero-based index (0–5). eval_idx=5 uses all layers; eval_idx=3 uses 4 layers. On a T4 GPU with TensorRT v10.11, RT-DETR-L improves from 8.0 ms / 52.7 mAP to 7.4 ms / 52.5 mAP with 4 layers.
num_queries: Reduce object queries (default: 300). Lowering to 100 can reach 7.4 ms / 51.7 mAP on COCO in the same setup. On datasets with fewer objects per image the mAP drop is typically smaller, but keep the value above the maximum expected objects per image.

Both settings can lower mAP — validate the trade-off on your dataset before deployment.

from ultralytics import RTDETR

rtdetr = RTDETR("rtdetr-l.pt")
head = rtdetr.model.model[-1]

# Choose one or both settings after validating the speed/accuracy trade-off.
head.decoder.eval_idx = 3  # Use 4 of 6 decoder layers.
head.num_queries = 100  # Use fewer object queries.

results = rtdetr("path/to/image.jpg")

# Export uses the same decoder and query settings, including TensorRT exports.
rtdetr.export(format="engine", device=0, quantize=16)

Link to this sectionSupported Tasks and Modes#

This table presents the model types, the specific pretrained weights, the tasks supported by each model, and the various modes (Train , Val, Predict, Export) that are supported, indicated by ✅ emojis.

Model Type	Pretrained Weights	Tasks Supported	Inference	Validation	Training	Export
RT-DETR Large	rtdetr-l.pt	Object Detection	✅	✅	✅	✅
RT-DETR Extra-Large	rtdetr-x.pt	Object Detection	✅	✅	✅	✅

Architecture-only variants

rtdetr-resnet50.yaml and rtdetr-resnet101.yaml are shipped as YAML architectures only. Ultralytics releases pretrained weights only for rtdetr-l and rtdetr-x. Instantiate the ResNet variants from YAML (for example, RTDETR("rtdetr-resnet50.yaml")) and train or fine-tune them as needed.

Link to this sectionIdeal Use Cases#

RT-DETR is particularly well-suited for applications requiring both high accuracy and real-time performance:

Autonomous Driving: For reliable environmental perception in self-driving systems where both speed and accuracy are critical. Learn more about AI in self-driving cars.
Advanced Robotics: Enabling robots to perform complex tasks requiring accurate object recognition and interaction in dynamic environments. Explore AI's role in robotics.
Medical Imaging: For applications in healthcare where precision in object detection can be crucial for diagnostics. Discover AI in healthcare.
Surveillance Systems: For security applications requiring real-time monitoring with high detection accuracy. Learn about security alarm systems.
Satellite Image Analysis: For detailed analysis of high-resolution imagery where global context understanding is important. Read about computer vision in satellite imagery.

Link to this sectionCitations and Acknowledgments#

If you use Baidu's RT-DETR in your research or development work, please cite the original paper:

Quote

@misc{lv2023detrs,
      title={DETRs Beat YOLOs on Real-time Object Detection},
      author={Wenyu Lv and Shangliang Xu and Yian Zhao and Guanzhong Wang and Jinman Wei and Cheng Cui and Yuning Du and Qingqing Dang and Yi Liu},
      year={2023},
      eprint={2304.08069},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

For RTDETRv2, you can cite the 2024 paper:

Quote

@misc{lv2024rtdetrv2,
      title={RTDETRv2: All-in-One Detection Transformer Beats YOLO and DINO},
      author={Wenyu Lv and Yian Zhao and Qinyao Chang and Kui Huang and Guanzhong Wang and Yi Liu},
      year={2024},
      eprint={2407.17140},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

We would like to acknowledge Baidu and the PaddlePaddle team for creating and maintaining this valuable resource for the computer vision community. Their contribution to the field with the development of the Vision Transformers-based real-time object detector, RT-DETR, is greatly appreciated.

Link to this sectionFAQ#

Link to this sectionWhat is Baidu's RT-DETR model and how does it work?#

Baidu's RT-DETR (Real-Time Detection Transformer) is an advanced real-time object detector built upon the Vision Transformer architecture. It efficiently processes multiscale features by decoupling intra-scale interaction and cross-scale fusion through its efficient hybrid encoder. By employing IoU-aware query selection, the model focuses on the most relevant objects, enhancing detection accuracy. Its adaptable inference speed, achieved by adjusting decoder layers without retraining, makes RT-DETR suitable for various real-time object detection scenarios. Learn more about RT-DETR features in the RT-DETR Arxiv paper.

Link to this sectionHow can I use the pretrained RT-DETR models provided by Ultralytics?#

You can leverage Ultralytics Python API to use pretrained PaddlePaddle RT-DETR models. For instance, to load an RT-DETR-l model pretrained on COCO val2017 and achieve high FPS on T4 GPU, you can utilize the following example:

Example

from ultralytics import RTDETR

# Load a COCO-pretrained RT-DETR-l model
model = RTDETR("rtdetr-l.pt")

# Display model information (optional)
model.info()

# Train the model on the COCO8 example dataset for 100 epochs
results = model.train(data="coco8.yaml", epochs=100, imgsz=640)

# Run inference with the RT-DETR-l model on the 'bus.jpg' image
results = model("path/to/bus.jpg")

Link to this sectionWhy should I choose Baidu's RT-DETR over other real-time object detectors?#

Baidu's RT-DETR stands out due to its efficient hybrid encoder and IoU-aware query selection, which drastically reduce computational costs while maintaining high accuracy. Its unique ability to adjust inference speed by using different decoder layers without retraining adds significant flexibility. This makes it particularly advantageous for applications requiring real-time performance on accelerated backends like CUDA with TensorRT, outclassing many other real-time object detectors. The transformer architecture also provides better global context understanding compared to traditional CNN-based detectors.

Link to this sectionHow does RT-DETR support adaptable inference speed for different real-time applications?#

Baidu's RT-DETR allows flexible adjustments of inference speed by using different decoder layers without requiring retraining. This adaptability is crucial for scaling performance across various real-time object detection tasks. Whether you need faster processing for lower precision needs or slower, more accurate detections, RT-DETR can be tailored to meet your specific requirements. This feature is particularly valuable when deploying models across devices with varying computational capabilities.

Link to this sectionCan `max_det` make RT-DETR return more than 300 detections?#

No. For RT-DETR, max_det caps how many predictions are returned after inference, but it does not increase the number of object queries produced by the decoder. The Ultralytics RT-DETR pretrained checkpoints use 300 object queries, so they cannot return more than 300 detections per image even if you set max_det to a larger value.

Use max_det to reduce returned detections, for example max_det=100, when you only need fewer high-confidence predictions. If your dataset can contain more than 300 objects per image, train a custom RT-DETR model with a higher decoder query count (nq) in the model YAML; changing this value on a pretrained checkpoint after training is not equivalent and requires retraining to learn the additional queries.

Link to this sectionCan I use RT-DETR models with other Ultralytics modes, such as training, validation, and export?#

Yes, RT-DETR models are compatible with various Ultralytics modes including training, validation, prediction, and export. You can refer to the respective documentation for detailed instructions on how to utilize these modes: Train, Val, Predict, and Export. This ensures a comprehensive workflow for developing and deploying your object detection solutions. The Ultralytics framework provides a consistent API across different model architectures, making it easy to work with RT-DETR models.

Contributors

GLglenn-jocher²⁸ RIRizwanMunawar⁵ RAraimbekovm³ ARartest08² OAoaslananka¹ ONonuralpszr¹ PDpderrenger¹ LEleonnil¹ LALaughing-q¹ MAMatthewNoyce¹

Created Nov 12, 2023Updated 1 day ago