Reference for ultralytics/models/sam/sam3/vl_combiner.py

Improvements

This page is sourced from https://github.com/ultralytics/ultralytics/blob/main/ultralytics/models/sam/sam3/vl_combiner.py. Have an improvement or example to add? Open a Pull Request — thank you! 🙏


Class ultralytics.models.sam.sam3.vl_combiner.SAM3VLBackbone

def __init__(
    self,
    visual: Sam3DualViTDetNeck,
    text,
    compile_visual: bool = False,
    act_ckpt_whole_vision_backbone: bool = False,
    act_ckpt_whole_language_backbone: bool = False,
    scalp=0,
)

Bases: nn.Module

This backbone combines a vision backbone and a language backbone without fusion. As such, it is more of a convenience wrapper to handle the two backbones together.

It adds support for activation checkpointing and compilation.

Args

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `visual` | `Sam3DualViTDetNeck` | The vision backbone to use. | required |
| `text` | `nn.Module` | The text encoder to use. | required |
| `compile_visual` | `bool` | Whether to `torch.compile` the vision backbone. | `False` |
| `act_ckpt_whole_vision_backbone` | `bool` | Whether to checkpoint activations for the whole vision backbone. | `False` |
| `act_ckpt_whole_language_backbone` | `bool` | Whether to checkpoint activations for the whole language backbone. | `False` |
| `scalp` | `int` | Number of trailing (lowest-resolution) feature levels to drop from the backbone output. | `0` |

Methods

| Name | Description |
| --- | --- |
| `forward` | Forward pass of the backbone combiner. |
| `forward_image` | Forward pass of the vision backbone and get both SAM3 and SAM2 features. |
| `forward_image_sam2` | Forward pass of the vision backbone to get SAM2 features only. |
| `forward_text` | Forward pass of the text encoder. |
| `set_imgsz` | Set the image size for the vision backbone. |

Source code in ultralytics/models/sam/sam3/vl_combiner.py

class SAM3VLBackbone(nn.Module):
    """This backbone combines a vision backbone and a language backbone without fusion. As such it is more of a
    convenience wrapper to handle the two backbones together.

    It adds support for activation checkpointing and compilation.
    """

    def __init__(
        self,
        visual: Sam3DualViTDetNeck,
        text,
        compile_visual: bool = False,
        act_ckpt_whole_vision_backbone: bool = False,
        act_ckpt_whole_language_backbone: bool = False,
        scalp=0,
    ):
        """Initialize the backbone combiner.

        Args:
            visual (Sam3DualViTDetNeck): The vision backbone to use.
            text (nn.Module): The text encoder to use.
            compile_visual (bool): Whether to `torch.compile` the vision backbone.
            act_ckpt_whole_vision_backbone (bool): Whether to checkpoint activations for the whole vision backbone.
            act_ckpt_whole_language_backbone (bool): Whether to checkpoint activations for the whole language backbone.
            scalp (int): Number of trailing (lowest-resolution) feature levels to drop from the backbone output.
        """
        super().__init__()
        self.vision_backbone: Sam3DualViTDetNeck = torch.compile(visual) if compile_visual else visual
        self.language_backbone = text
        self.scalp = scalp
        # allow running activation checkpointing on the entire vision and language backbones
        self.act_ckpt_whole_vision_backbone = act_ckpt_whole_vision_backbone
        self.act_ckpt_whole_language_backbone = act_ckpt_whole_language_backbone

Method ultralytics.models.sam.sam3.vl_combiner.SAM3VLBackbone.forward

def forward(
    self,
    samples: torch.Tensor,
    captions: list[str],
    input_boxes: torch.Tensor | None = None,
    additional_text: list[str] | None = None,
)

Forward pass of the backbone combiner.

Args

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `samples` | `torch.Tensor` | The input images. | required |
| `captions` | `list[str]` | The input captions. | required |
| `input_boxes` | `torch.Tensor`, optional | When the text contains box place-holders, the tensor containing their spatial features. | `None` |
| `additional_text` | `list[str]`, optional | Extra text (different from the captions) to encode in the same backbone forward pass. | `None` |

Returns

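| Type | Description |
| --- | --- |
| `dict` | Output dictionary with the following keys: `vision_features` (the output of the vision backbone), `language_features` (the output of the language backbone), `language_mask` (the attention mask of the language backbone), `vision_pos_enc` (the positional encoding of the vision backbone) and, when `additional_text` is provided, `additional_text_features` and `additional_text_mask` (the language backbone output and attention mask for the additional text). The dictionary also carries `backbone_fpn`, `sam2_backbone_out`, and `language_embeds`, as set in `forward_image` and `forward_text`. |

Source code in ultralytics/models/sam/sam3/vl_combiner.py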

def forward(
    self,
    samples: torch.Tensor,
    captions: list[str],
    input_boxes: torch.Tensor | None = None,
    additional_text: list[str] | None = None,
):
    """Forward pass of the backbone combiner.

    Args:
        samples (torch.Tensor): The input images.
        captions (list[str]): The input captions.
        input_boxes (torch.Tensor, optional): When the text contains box place-holders, the tensor containing their
            spatial features.
        additional_text (list[str], optional): Extra text (different from the captions) to encode in the same
            backbone forward pass.

    Returns:
        (dict): Output dictionary with the following keys: `vision_features` (the output of the vision backbone),
            `language_features` (the output of the language backbone), `language_mask` (the attention mask of the
            language backbone), `vision_pos_enc` (the positional encoding of the vision
            backbone) and, when `additional_text` is provided, `additional_text_features` and
            `additional_text_mask` (the language backbone output and attention mask for the additional text).
    """
    output = self.forward_image(samples)
    output.update(self.forward_text(captions, input_boxes, additional_text))
    return output
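
As a rough illustration of how `forward` merges the two result dictionaries, the sketch below uses hypothetical stand-in functions (`fake_forward_image`, `fake_forward_text`, `fake_forward`) that return placeholder strings instead of tensors. The key names mirror those in the source on this page; everything else is an assumption made for illustration, not library code.

```python
def fake_forward_image(samples):
    """Hypothetical stand-in for forward_image: returns the vision-side keys."""
    return {
        "vision_features": "sam3_src",
        "vision_pos_enc": "sam3_pos",
        "backbone_fpn": "sam3_features",
        "sam2_backbone_out": None,
    }


def fake_forward_text(captions, input_boxes=None, additional_text=None):
    """Hypothetical stand-in for forward_text: returns the language-side keys."""
    output = {
        "language_features": "text_memory",
        "language_mask": "text_attention_mask",
        "language_embeds": "text_embeds",
    }
    if additional_text is not None:
        output["additional_text_features"] = "extra_memory"
        output["additional_text_mask"] = "extra_mask"
    return output


def fake_forward(samples, captions, input_boxes=None, additional_text=None):
    """Same merge pattern as forward: vision dict first, then dict.update with the text dict."""
    output = fake_forward_image(samples)
    output.update(fake_forward_text(captions, input_boxes, additional_text))
    return output


merged = fake_forward(samples=None, captions=["a cat"])
merged_extra = fake_forward(samples=None, captions=["a cat"], additional_text=["a dog"])

print(sorted(merged))
# Keys that only appear when additional_text is passed:
print(sorted(set(merged_extra) - set(merged)))
```

Because the vision-side and language-side key names do not overlap, `dict.update` only adds entries here and never overwrites a vision result.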

Method ultralytics.models.sam.sam3.vl_combiner.SAM3VLBackbone.forward_image

def forward_image(self, samples: torch.Tensor)

Forward pass of the vision backbone and get both SAM3 and SAM2 features.

Args

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `samples` | `torch.Tensor` | The input images. | required |

Source code in ultralytics/models/sam/sam3/vl_combiner.py

def forward_image(self, samples: torch.Tensor):
    """Forward pass of the vision backbone and get both SAM3 and SAM2 features."""
    # Forward through backbone
    sam3_features, sam3_pos, sam2_features, sam2_pos = self.vision_backbone.forward(samples)
    if self.scalp > 0:
        # Discard the lowest resolution features
        sam3_features, sam3_pos = (
            sam3_features[: -self.scalp],
            sam3_pos[: -self.scalp],
        )
        if sam2_features is not None and sam2_pos is not None:
            sam2_features, sam2_pos = (
                sam2_features[: -self.scalp],
                sam2_pos[: -self.scalp],
            )

    sam2_output = None

    if sam2_features is not None and sam2_pos is not None:
        sam2_src = sam2_features[-1]
        sam2_output = {
            "vision_features": sam2_src,
            "vision_pos_enc": sam2_pos,
            "backbone_fpn": sam2_features,
        }

    sam3_src = sam3_features[-1]
    return {
        "vision_features": sam3_src,
        "vision_pos_enc": sam3_pos,
        "backbone_fpn": sam3_features,
        "sam2_backbone_out": sam2_output,
    }
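
The `scalp` trimming in `forward_image` can be sketched with plain lists. The helper `trim_levels` and the level names below are hypothetical placeholders; the only assumption carried over from the source is that the last entries of each list are the lowest-resolution levels.

```python
def trim_levels(features, pos, scalp):
    """Drop the last `scalp` entries from both lists, mirroring the slicing in forward_image."""
    if scalp > 0:
        features, pos = features[:-scalp], pos[:-scalp]
    return features, pos


# Placeholder names standing in for per-level feature maps and positional encodings.
levels = ["level0", "level1", "level2", "level3"]
pos = ["pos0", "pos1", "pos2", "pos3"]

kept, kept_pos = trim_levels(levels, pos, scalp=1)
print(kept)      # ['level0', 'level1', 'level2']
print(kept[-1])  # 'level2' is what would become "vision_features"

untouched, _ = trim_levels(levels, pos, scalp=0)
print(untouched == levels)  # True

# Without the `scalp > 0` guard, a zero value would empty the list:
print(levels[:-0])  # []
```

The guard matters because `-0` is just `0` in Python, so `levels[:-0]` is the empty slice `levels[:0]` rather than the full list.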

Method ultralytics.models.sam.sam3.vl_combiner.SAM3VLBackbone.forward_image_sam2

def forward_image_sam2(self, samples: torch.Tensor)

Forward pass of the vision backbone to get SAM2 features only.

Args

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `samples` | `torch.Tensor` | The input images. | required |

Source code in ultralytics/models/sam/sam3/vl_combiner.py

def forward_image_sam2(self, samples: torch.Tensor):
    """Forward pass of the vision backbone to get SAM2 features only."""
    xs = self.vision_backbone.trunk(samples)
    x = xs[-1]  # simpleFPN

    assert self.vision_backbone.sam2_convs is not None, "SAM2 neck is not available."
    sam2_features, sam2_pos = self.vision_backbone.sam_forward_feature_levels(x, self.vision_backbone.sam2_convs)

    if self.scalp > 0:
        # Discard the lowest resolution features
        sam2_features, sam2_pos = (
            sam2_features[: -self.scalp],
            sam2_pos[: -self.scalp],
        )

    return {
        "vision_features": sam2_features[-1],
        "vision_pos_enc": sam2_pos,
        "backbone_fpn": sam2_features,
    }

Method ultralytics.models.sam.sam3.vl_combiner.SAM3VLBackbone.forward_text

def forward_text(self, captions, input_boxes = None, additional_text = None)

Forward pass of the text encoder.

Args

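| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `captions` | `list[str]` | The input captions. | required |
| `input_boxes` | `torch.Tensor`, optional | When the text contains box place-holders, the tensor containing their spatial features. | `None` |
| `additional_text` | `list[str]`, optional | Extra text (different from the captions) to encode in the same backbone forward pass. | `None` |

Source code in ultralytics/models/sam/sam3/vl_combiner.py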

def forward_text(self, captions, input_boxes=None, additional_text=None):
    """Forward pass of the text encoder."""
    output = {}

    # Forward through text_encoder
    text_to_encode = copy(captions)
    if additional_text is not None:
        # if there are additional_text, we piggy-back them into this forward.
        # They'll be used later for output alignment
        text_to_encode += additional_text

    with sdpa_kernel([SDPBackend.MATH, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.FLASH_ATTENTION]):
        text_attention_mask, text_memory, text_embeds = self.language_backbone(text_to_encode, input_boxes)

    if additional_text is not None:
        output["additional_text_features"] = text_memory[:, -len(additional_text) :]
        output["additional_text_mask"] = text_attention_mask[-len(additional_text) :]

    text_memory = text_memory[:, : len(captions)]
    text_attention_mask = text_attention_mask[: len(captions)]
    text_embeds = text_embeds[:, : len(captions)]
    output["language_features"] = text_memory
    output["language_mask"] = text_attention_mask
    output["language_embeds"] = text_embeds  # Text embeddings before forward to the encoder

    return output
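
The "piggy-back" pattern in `forward_text` (encode captions and additional text in one call, then split the results by count) can be sketched without tensors. Here `fake_text_encoder` and `encode` are hypothetical; the nested lists only imitate the layout implied by the slicing in the source, where the mask is indexed by text first and the memory by sequence position first. The `language_embeds` output is left out for brevity.

```python
from copy import copy


def fake_text_encoder(texts):
    """Hypothetical encoder: one mask entry per text, and a memory laid out as memory[position][text]."""
    mask = [f"mask:{t}" for t in texts]
    memory = [[f"tok{p}:{t}" for t in texts] for p in range(2)]
    return mask, memory


def encode(captions, additional_text=None):
    text_to_encode = copy(captions)  # shallow copy, so `+=` below leaves `captions` untouched
    if additional_text is not None:
        text_to_encode += additional_text

    mask, memory = fake_text_encoder(text_to_encode)

    output = {}
    if additional_text is not None:
        n_extra = len(additional_text)
        # List equivalents of text_memory[:, -n_extra:] and text_attention_mask[-n_extra:].
        output["additional_text_features"] = [step[-n_extra:] for step in memory]
        output["additional_text_mask"] = mask[-n_extra:]

    n = len(captions)
    output["language_features"] = [step[:n] for step in memory]
    output["language_mask"] = mask[:n]
    return output


captions = ["a cat", "a dog"]
result = encode(captions, additional_text=["a red car"])

print(captions)                        # ['a cat', 'a dog']
print(result["language_mask"])         # ['mask:a cat', 'mask:a dog']
print(result["additional_text_mask"])  # ['mask:a red car']
```

Note that with an empty `additional_text` list, `-len(additional_text)` is `0`, and a `[-0:]` slice selects everything rather than nothing; the sketch, like the source shown, does not special-case this.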

Method ultralytics.models.sam.sam3.vl_combiner.SAM3VLBackbone.set_imgsz

def set_imgsz(self, imgsz: list[int] = [1008, 1008])

Set the image size for the vision backbone.

Args

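| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `imgsz` | `list[int]` | The image size to set on the vision backbone. | `[1008, 1008]` |

Source code in ultralytics/models/sam/sam3/vl_combiner.py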

def set_imgsz(self, imgsz: list[int] = [1008, 1008]):
    """Set the image size for the vision backbone."""
    self.vision_backbone.set_imgsz(imgsz)
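
The signature above uses a list as a default argument. Whether that matters here depends on whether the vision backbone ever mutates the list it receives, which this page does not show. As a general Python illustration only, the hypothetical functions below contrast a shared mutable default with a `None` default.

```python
def set_size_shared(imgsz=[1008, 1008]):
    """Mutable default: the same list object is reused across calls."""
    return imgsz


def set_size_fresh(imgsz=None):
    """None default: a new list is built on every call that omits the argument."""
    return [1008, 1008] if imgsz is None else imgsz


a = set_size_shared()
a[0] = 512  # mutating the returned list also changes the default
print(set_size_shared())  # [512, 1008]

b = set_size_fresh()
b[0] = 512  # only this call's list is changed
print(set_size_fresh())  # [1008, 1008]
```

Passing the size explicitly on every call sidesteps the question entirely.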