Link to this sectionKnowledge Distillation#

Q: Why is my distillation loss not decreasing?

Verify teacher and student are from the same YOLO generation; Confirm distill_model path is correct and the file loads; Try increasing dis if the loss value is very small; Ensure the teacher model is trained on the same dataset

Q: How does distillation differ from standard training?

Add the distill_model parameter—everything else works identically. An extra distillation loss computes during training, but the saved model is a standard YOLO model with no overhead.

Q: Does knowledge distillation slow down training?

Yes. Expect 1.2-1.5x slower training and 1.1x more GPU memory because the teacher model runs inference on each batch. The teacher runs in eval mode without gradients, keeping overhead manageable. Use amp=True to reduce impact.

Q: Which tasks and models are supported?

Knowledge distillation works with detect, segment, pose, and obb tasks because it distills features from the three neck layers that feed the Detect-family head. Classify and semantic tasks are not supported. Only detect has been experimentally verified for accuracy improvements. Segment, pose, and obb are technically compatible but not yet benchmarked. The teacher and student must belong to the same YOLO family (e.g., YOLOv8, YOLO11, or YOLO26). Cross-family distillation (e.g., a YOLO11 teacher with a YOLO26 student) is not supported.

Link to this sectionQuickstart#

Train a smaller student model with guidance from a larger teacher model by adding the distill_model argument:

Example

from ultralytics import YOLO

model = YOLO("yolo26n.pt")
model.train(data="coco8.yaml", epochs=100, distill_model="yolo26s.pt")

Link to this sectionWhat is Knowledge Distillation?#

Knowledge distillation transfers knowledge from a large, accurate teacher model to a smaller student model. The student learns to mimic the teacher's internal feature representations, often achieving better accuracy than training from scratch.

Knowledge distillation workflow image

Use distillation when:

You need a smaller, faster model for deployment
You have a high-accuracy teacher model trained on the same data
You want better accuracy than standard training provides

Note

Knowledge distillation is implemented for detect, segment, pose, and obb tasks. Only detect has been experimentally verified for accuracy improvements for now.

Link to this sectionPerformance#

Knowledge distillation improves student mAP across the entire YOLO26 family on COCO, with no added inference cost. The table below compares the standard YOLO26 models (baseline) against the same models trained with distillation from their recommended teacher.

Model	size ^(pixels)	mAP^val 50-95 baseline	mAP^val 50-95 distilled	mAP^{val 50-95 (e2e)} baseline	mAP^{val 50-95 (e2e)} distilled
YOLO26n-distill	640	40.9	41.5	40.1	40.9
YOLO26s-distill	640	48.6	49.2	47.8	48.6
YOLO26m-distill	640	53.1	53.9	52.5	53.3
YOLO26l-distill	640	55.0	56.0	54.4	55.5
YOLO26x-distill	640	57.5	57.9	56.9	57.4

mAP^val values are for single-model single-scale on the COCO val2017 dataset.
Reproduce by yolo val detect data=coco.yaml device=0
e2e values use the default NMS-free inference path; non-e2e values use traditional NMS post-processing (end2end=False). See End-to-End Detection for details.

Link to this sectionPrerequisites#

Before you begin, ensure you meet the following requirements:

Trained Teacher Model: A pre-trained, high-accuracy teacher model from the same YOLO family as the student model (e.g., YOLO26).
Matching Dataset and Task: Both the teacher and student models must use the exact same dataset and task configuration.
GPU Resources: Sufficient GPU memory (VRAM) to load and run both models concurrently during training (refer to the FAQ for typical VRAM overhead).

Link to this sectionRecommended Model Pairs#

Student	Recommended Teacher
`yolo26n.pt`	`yolo26s.pt`
`yolo26s.pt`	`yolo26m.pt`
`yolo26m.pt`	`yolo26x.pt`
`yolo26l.pt`	`yolo26x.pt`

Cross-family distillation (e.g., YOLO11 teacher with YOLO26 student) is not supported.

Link to this sectionKey Parameters#

Parameter	Type	Default	Description
`distill_model`	`str`	`None`	Path to the teacher model file (e.g., `yolo26x.pt`). Setting this enables knowledge distillation.
`dis`	`float`	`6.0`	Distillation loss weight. Controls how much the distillation loss contributes to the total training loss.

Link to this sectionHow It Works#

The teacher model remains frozen in eval mode and runs inference on each batch
The student model trains with standard task losses plus distillation guidance
Features are extracted from both models at the three neck layers that feed the Detect-family head
A projector network (lightweight MLP) aligns student feature dimensions to match the teacher
A score-weighted L2 loss compares projected student features with teacher features, weighted by the teacher's classification confidence
The distillation loss combines with standard losses using the dis weight

flowchart TD
    A[Input Image Batch]:::start --> T[Teacher Model<br/>frozen, eval mode]:::extern
    A --> S[Student Model<br/>trainable]:::proc

    T --> |Detect head inputs| TF[Teacher Features]:::extern
    S --> |Detect head inputs| SF[Student Features]:::proc

    SF --> P[1×1 Conv Projector<br/>with ReLU]:::decide
    P --> AF[Aligned Student Features]:::proc

    TF --> SW[Score-weighted L2 Loss]:::proc
    AF --> SW

    S --> D[Detection Head]:::proc
    D --> DL[box_loss + cls_loss + dfl_loss]:::proc

    SW --> |× dis| DIS[distillation loss]:::proc
    DL --> TOTAL[Total Loss]:::out
    DIS --> TOTAL

    TOTAL --> BP[Backpropagate<br/>Student + Projector only]:::out

    classDef start fill:#4CAF50,color:#fff
    classDef proc fill:#2196F3,color:#fff
    classDef decide fill:#FF9800,color:#fff
    classDef out fill:#9C27B0,color:#fff
    classDef extern fill:#607D8B,color:#fff

Link to this sectionTask Support#

The distillation implementation extracts features from the three neck layers that feed the model's Detect-family head. Because the segment, pose, and obb heads inherit from the same Detect architecture, distillation is technically compatible with those tasks as well.

Warning

Only detect has been experimentally benchmarked and verified. You can run distillation for segment, pose, or obb, but accuracy improvements for those tasks are not yet validated.

Knowledge Distillation for Other Tasks

from ultralytics import YOLO

# Segment
model = YOLO("yolo26n-seg.pt")
model.train(data="coco8-seg.yaml", epochs=100, distill_model="yolo26s-seg.pt")

# Pose
model = YOLO("yolo26n-pose.pt")
model.train(data="coco8-pose.yaml", epochs=100, distill_model="yolo26s-pose.pt")

# OBB
model = YOLO("yolo26n-obb.pt")
model.train(data="dota8.yaml", epochs=100, distill_model="yolo26s-obb.pt")

Link to this sectionTraining#

Link to this sectionBasic Training#

Training with distillation is identical to standard training. Provide the distill_model path to enable it:

Knowledge Distillation Training

from ultralytics import YOLO

# Load a student model
student = YOLO("yolo26m.pt")

# Train with knowledge distillation from a larger teacher model
results = student.train(data="coco8.yaml", epochs=100, distill_model="yolo26x.pt")

Link to this sectionAdjusting the Distillation Loss Weight#

The dis parameter (default: 6.0) controls distillation loss contribution:

Custom Distillation Weight

from ultralytics import YOLO

student = YOLO("yolo26n.pt")
results = student.train(data="coco8.yaml", epochs=100, distill_model="yolo26s.pt", dis=10.0)

Link to this sectionResuming Distillation Training#

Distillation training supports resuming from checkpoints. The teacher model is rebuilt automatically from the distill_model path:

Resume Distillation Training

from ultralytics import YOLO

student = YOLO("runs/detect/train/weights/last.pt")
results = student.train(resume=True)

Link to this sectionTraining Output#

When distillation is enabled, an additional dis_loss column appears in training logs:

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss   dis_loss  Instances       Size
      1/80      46.2G      1.566      5.404    0.003249      6.658        231        640

The exported model contains only the student weights—file size and inference speed match a normally trained student model.

Link to this sectionFAQ#

Link to this sectionWhy is my distillation loss not decreasing?#

Verify teacher and student are from the same YOLO generation
Confirm distill_model path is correct and the file loads
Try increasing dis if the loss value is very small
Ensure the teacher model is trained on the same dataset

Link to this sectionHow does distillation differ from standard training?#

Add the distill_model parameter—everything else works identically. An extra distillation loss computes during training, but the saved model is a standard YOLO model with no overhead.

Link to this sectionDoes knowledge distillation slow down training?#

Yes. Expect 1.2-1.5x slower training and ~1.1x more GPU memory because the teacher model runs inference on each batch. The teacher runs in eval mode without gradients, keeping overhead manageable. Use amp=True to reduce impact.

Link to this sectionWhich tasks and models are supported?#

Knowledge distillation works with detect, segment, pose, and obb tasks because it distills features from the three neck layers that feed the Detect-family head. Classify and semantic tasks are not supported.

Only detect has been experimentally verified for accuracy improvements. Segment, pose, and obb are technically compatible but not yet benchmarked.

The teacher and student must belong to the same YOLO family (e.g., YOLOv8, YOLO11, or YOLO26). Cross-family distillation (e.g., a YOLO11 teacher with a YOLO26 student) is not supported.

Contributors

GLglenn-jocher² RIRizwanMunawar¹ LMlmycross¹

Created 2 weeks agoUpdated 14 hours ago