Link to this sectionKnowledge Distillation#
Link to this sectionQuickstart#
Train a smaller student model with guidance from a larger teacher model by adding the distill_model argument:
from ultralytics import YOLO
model = YOLO("yolo26n.pt")
model.train(data="coco8.yaml", epochs=100, distill_model="yolo26s.pt")Link to this sectionWhat is Knowledge Distillation?#
Knowledge distillation transfers knowledge from a large, accurate teacher model to a smaller student model. The student learns to mimic the teacher's internal feature representations, often achieving better accuracy than training from scratch.
Use distillation when:
- You need a smaller, faster model for deployment
- You have a high-accuracy teacher model trained on the same data
- You want better accuracy than standard training provides
Knowledge distillation is implemented for detect, segment, pose, and obb tasks. Only detect has been experimentally verified for accuracy improvements for now.
Link to this sectionPerformance#
Knowledge distillation improves student mAP across the entire YOLO26 family on COCO, with no added inference cost. The table below compares the standard YOLO26 models (baseline) against the same models trained with distillation from their recommended teacher.
| Model | size (pixels) | mAPval 50-95 baseline | mAPval 50-95 distilled | mAPval 50-95 (e2e) baseline | mAPval 50-95 (e2e) distilled |
|---|---|---|---|---|---|
| YOLO26n-distill | 640 | 40.9 | 41.5 | 40.1 | 40.9 |
| YOLO26s-distill | 640 | 48.6 | 49.2 | 47.8 | 48.6 |
| YOLO26m-distill | 640 | 53.1 | 53.9 | 52.5 | 53.3 |
| YOLO26l-distill | 640 | 55.0 | 56.0 | 54.4 | 55.5 |
| YOLO26x-distill | 640 | 57.5 | 57.9 | 56.9 | 57.4 |
- mAPval values are for single-model single-scale on the COCO val2017 dataset.
Reproduce byyolo val detect data=coco.yaml device=0 - e2e values use the default NMS-free inference path; non-e2e values use traditional NMS post-processing (
end2end=False). See End-to-End Detection for details.
Link to this sectionPrerequisites#
Before you begin, ensure you meet the following requirements:
- Trained Teacher Model: A pre-trained, high-accuracy teacher model from the same YOLO family as the student model (e.g., YOLO26).
- Matching Dataset and Task: Both the teacher and student models must use the exact same dataset and task configuration.
- GPU Resources: Sufficient GPU memory (VRAM) to load and run both models concurrently during training (refer to the FAQ for typical VRAM overhead).
Link to this sectionRecommended Model Pairs#
| Student | Recommended Teacher |
|---|---|
yolo26n.pt | yolo26s.pt |
yolo26s.pt | yolo26m.pt |
yolo26m.pt | yolo26x.pt |
yolo26l.pt | yolo26x.pt |
Cross-family distillation (e.g., YOLO11 teacher with YOLO26 student) is not supported.
Link to this sectionKey Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
distill_model | str | None | Path to the teacher model file (e.g., yolo26x.pt). Setting this enables knowledge distillation. |
dis | float | 6.0 | Distillation loss weight. Controls how much the distillation loss contributes to the total training loss. |
Link to this sectionHow It Works#
- The teacher model remains frozen in
evalmode and runs inference on each batch - The student model trains with standard task losses plus distillation guidance
- Features are extracted from both models at the three neck layers that feed the Detect-family head
- A projector network (lightweight MLP) aligns student feature dimensions to match the teacher
- A score-weighted L2 loss compares projected student features with teacher features, weighted by the teacher's classification confidence
- The distillation loss combines with standard losses using the
disweight
flowchart TD
A[Input Image Batch] --> T[Teacher Model<br/>frozen, eval mode]
A --> S[Student Model<br/>trainable]
T --> |Detect head inputs| TF[Teacher Features]
S --> |Detect head inputs| SF[Student Features]
SF --> P[1×1 Conv Projector<br/>with ReLU]
P --> AF[Aligned Student Features]
TF --> SW[Score-weighted L2 Loss]
AF --> SW
S --> D[Detection Head]
D --> DL[box_loss + cls_loss + dfl_loss]
SW --> |× dis| DIS[distillation loss]
DL --> TOTAL[Total Loss]
DIS --> TOTAL
TOTAL --> BP[Backpropagate<br/>Student + Projector only]Link to this sectionTask Support#
The distillation implementation extracts features from the three neck layers that feed the model's Detect-family head. Because the segment, pose, and obb heads inherit from the same Detect architecture, distillation is technically compatible with those tasks as well.
Only detect has been experimentally benchmarked and verified. You can run distillation for segment, pose, or obb, but accuracy improvements for those tasks are not yet validated.
from ultralytics import YOLO
# Segment
model = YOLO("yolo26n-seg.pt")
model.train(data="coco8-seg.yaml", epochs=100, distill_model="yolo26s-seg.pt")
# Pose
model = YOLO("yolo26n-pose.pt")
model.train(data="coco8-pose.yaml", epochs=100, distill_model="yolo26s-pose.pt")
# OBB
model = YOLO("yolo26n-obb.pt")
model.train(data="dota8.yaml", epochs=100, distill_model="yolo26s-obb.pt")Link to this sectionTraining#
Link to this sectionBasic Training#
Training with distillation is identical to standard training. Provide the distill_model path to enable it:
from ultralytics import YOLO
# Load a student model
student = YOLO("yolo26m.pt")
# Train with knowledge distillation from a larger teacher model
results = student.train(data="coco8.yaml", epochs=100, distill_model="yolo26x.pt")Link to this sectionAdjusting the Distillation Loss Weight#
The dis parameter (default: 6.0) controls distillation loss contribution:
from ultralytics import YOLO
student = YOLO("yolo26n.pt")
results = student.train(data="coco8.yaml", epochs=100, distill_model="yolo26s.pt", dis=10.0)Link to this sectionResuming Distillation Training#
Distillation training supports resuming from checkpoints. The teacher model is rebuilt automatically from the distill_model path:
from ultralytics import YOLO
student = YOLO("runs/detect/train/weights/last.pt")
results = student.train(resume=True)Link to this sectionTraining Output#
When distillation is enabled, an additional dis_loss column appears in training logs:
Epoch GPU_mem box_loss cls_loss dfl_loss dis_loss Instances Size
1/80 46.2G 1.566 5.404 0.003249 6.658 231 640The exported model contains only the student weights—file size and inference speed match a normally trained student model.
Link to this sectionFAQ#
Link to this sectionWhy is my distillation loss not decreasing?#
- Verify teacher and student are from the same YOLO generation
- Confirm
distill_modelpath is correct and the file loads - Try increasing
disif the loss value is very small - Ensure the teacher model is trained on the same dataset
Link to this sectionHow does distillation differ from standard training?#
Add the distill_model parameter—everything else works identically. An extra distillation loss computes during training, but the saved model is a standard YOLO model with no overhead.
Link to this sectionDoes knowledge distillation slow down training?#
Yes. Expect 1.2-1.5x slower training and ~1.1x more GPU memory because the teacher model runs inference on each batch. The teacher runs in eval mode without gradients, keeping overhead manageable. Use amp=True to reduce impact.
Link to this sectionWhich tasks and models are supported?#
Knowledge distillation works with detect, segment, pose, and obb tasks because it distills features from the three neck layers that feed the Detect-family head. Classify and semantic tasks are not supported.
Only detect has been experimentally verified for accuracy improvements. Segment, pose, and obb are technically compatible but not yet benchmarked.
The teacher and student must belong to the same YOLO family (e.g., YOLOv8, YOLO11, or YOLO26). Cross-family distillation (e.g., a YOLO11 teacher with a YOLO26 student) is not supported.