Reference for ultralytics/models/sam/modules/decoders.py
Note
This file is available at https://github.com/ultralytics/ultralytics/blob/main/ultralytics/models/sam/modules/decoders.py. If you spot a problem, please help fix it by contributing a Pull Request 🛠️. Thank you 🙏!
ultralytics.models.sam.modules.decoders.MaskDecoder
MaskDecoder(
    transformer_dim: int,
    transformer: nn.Module,
    num_multimask_outputs: int = 3,
    activation: Type[nn.Module] = nn.GELU,
    iou_head_depth: int = 3,
    iou_head_hidden_dim: int = 256,
)
Bases: Module
Decoder module for generating masks and their associated quality scores using a transformer architecture.
Given image and prompt embeddings, this class runs them through a transformer and predicts masks together with a quality score for each mask.
Attributes:
Name | Type | Description |
---|---|---|
transformer_dim | int | Channel dimension for the transformer module. |
transformer | Module | Transformer module used for mask prediction. |
num_multimask_outputs | int | Number of masks to predict for disambiguating masks. |
iou_token | Embedding | Embedding for the IoU token. |
num_mask_tokens | int | Number of mask tokens. |
mask_tokens | Embedding | Embedding for the mask tokens. |
output_upscaling | Sequential | Neural network sequence for upscaling the output. |
output_hypernetworks_mlps | ModuleList | Hypernetwork MLPs for generating masks. |
iou_prediction_head | Module | MLP for predicting mask quality. |
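The way the token attributes above fit together can be illustrated with a short sketch. It assumes, as in the original SAM decoder, that `num_mask_tokens` equals `num_multimask_outputs + 1` and that the learned IoU and mask tokens are prepended to the sparse prompt embeddings before the transformer runs. The tensor sizes are arbitrary illustration values.

```python
import torch
from torch import nn

transformer_dim = 256
num_multimask_outputs = 3
# Assumption: one extra token is reserved for the single-mask output.
num_mask_tokens = num_multimask_outputs + 1

iou_token = nn.Embedding(1, transformer_dim)
mask_tokens = nn.Embedding(num_mask_tokens, transformer_dim)

# Illustrative prompt embeddings with shape (B, N_prompts, C).
sparse_prompt_embeddings = torch.rand(2, 3, transformer_dim)

# Stack the learned tokens, broadcast them over the batch,
# then prepend them to the prompt tokens.
output_tokens = torch.cat([iou_token.weight, mask_tokens.weight], dim=0)
output_tokens = output_tokens.unsqueeze(0).expand(sparse_prompt_embeddings.shape[0], -1, -1)
tokens = torch.cat((output_tokens, sparse_prompt_embeddings), dim=1)

# After the transformer, the same positions are read back out.
hs = tokens  # stand-in for the transformer output, same shape
iou_token_out = hs[:, 0, :]
mask_tokens_out = hs[:, 1 : 1 + num_mask_tokens, :]
print(tokens.shape, iou_token_out.shape, mask_tokens_out.shape)
```

Keeping the learned tokens at fixed leading positions is what lets the decoder recover them by index after the transformer has mixed them with the prompt tokens.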
Methods:
Name | Description |
---|---|
forward | Predicts masks given image and prompt embeddings. |
predict_masks | Internal method for mask prediction. |
Examples:
>>> decoder = MaskDecoder(transformer_dim=256, transformer=transformer_module)
>>> masks, iou_pred = decoder(
...     image_embeddings, image_pe, sparse_prompt_embeddings, dense_prompt_embeddings, multimask_output=True
... )
>>> print(f"Predicted masks shape: {masks.shape}, IoU predictions shape: {iou_pred.shape}")
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transformer_dim | int | Channel dimension for the transformer module. | required |
transformer | Module | Transformer module used for mask prediction. | required |
num_multimask_outputs | int | Number of masks to predict for disambiguating masks. | 3 |
activation | Type[Module] | Type of activation to use when upscaling masks. | GELU |
iou_head_depth | int | Depth of the MLP used to predict mask quality. | 3 |
iou_head_hidden_dim | int | Hidden dimension of the MLP used to predict mask quality. | 256 |
Examples:
>>> transformer = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=256, nhead=8), num_layers=6)
>>> decoder = MaskDecoder(transformer_dim=256, transformer=transformer)
>>> print(decoder)
Source code in ultralytics/models/sam/modules/decoders.py
forward
forward(
    image_embeddings: torch.Tensor,
    image_pe: torch.Tensor,
    sparse_prompt_embeddings: torch.Tensor,
    dense_prompt_embeddings: torch.Tensor,
    multimask_output: bool,
) -> Tuple[torch.Tensor, torch.Tensor]
Predicts masks given image and prompt embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
image_embeddings | Tensor | Embeddings from the image encoder. | required |
image_pe | Tensor | Positional encoding with the shape of image_embeddings. | required |
sparse_prompt_embeddings | Tensor | Embeddings of the points and boxes. | required |
dense_prompt_embeddings | Tensor | Embeddings of the mask inputs. | required |
multimask_output | bool | Whether to return multiple masks or a single mask. | required |
Returns:
Type | Description |
---|---|
Tuple[Tensor, Tensor] | A tuple (masks, iou_pred): masks (torch.Tensor) holds the batched predicted masks; iou_pred (torch.Tensor) holds the batched predictions of mask quality. |
Examples:
>>> decoder = MaskDecoder(transformer_dim=256, transformer=transformer_module)
>>> image_emb = torch.rand(1, 256, 64, 64)
>>> image_pe = torch.rand(1, 256, 64, 64)
>>> sparse_emb = torch.rand(1, 2, 256)
>>> dense_emb = torch.rand(1, 256, 64, 64)
>>> masks, iou_pred = decoder(image_emb, image_pe, sparse_emb, dense_emb, multimask_output=True)
>>> print(f"Masks shape: {masks.shape}, IoU predictions shape: {iou_pred.shape}")
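How `multimask_output` picks among the predicted masks can be sketched as below. The sketch assumes the decoder produces `num_multimask_outputs + 1` candidates with index 0 reserved for the single-mask output, as in the original SAM implementation. `select_masks` is a hypothetical helper written for illustration, not part of the module.

```python
import torch


def select_masks(masks: torch.Tensor, iou_pred: torch.Tensor, multimask_output: bool):
    """Keep candidates 1..N for multimask output, or only candidate 0 otherwise."""
    mask_slice = slice(1, None) if multimask_output else slice(0, 1)
    return masks[:, mask_slice, :, :], iou_pred[:, mask_slice]


# Illustrative decoder outputs: 4 candidate masks per image.
all_masks = torch.rand(2, 4, 32, 32)
all_iou = torch.rand(2, 4)

multi_masks, multi_iou = select_masks(all_masks, all_iou, multimask_output=True)
single_mask, single_iou = select_masks(all_masks, all_iou, multimask_output=False)
print(multi_masks.shape, single_mask.shape)
```

Slicing with `slice(0, 1)` rather than indexing with `0` keeps the mask dimension, so both branches return tensors of the same rank.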
predict_masks
predict_masks(
    image_embeddings: torch.Tensor,
    image_pe: torch.Tensor,
    sparse_prompt_embeddings: torch.Tensor,
    dense_prompt_embeddings: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]
Predicts masks and quality scores using image and prompt embeddings via transformer architecture.
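The final step of mask prediction, where each mask token is turned into a mask by a hypernetwork head, can be sketched roughly as follows. This is a simplified illustration rather than the module's code: it uses plain `nn.Linear` layers as stand-ins for `output_hypernetworks_mlps` and `iou_prediction_head`, omits the normalization layer inside `output_upscaling`, and feeds random tensors in place of the transformer outputs.

```python
import torch
from torch import nn

b, c, h, w = 1, 256, 16, 16
num_mask_tokens = 4

# Stand-in for output_upscaling: 4x spatial upscale, channels C -> C/8.
upscale = nn.Sequential(
    nn.ConvTranspose2d(c, c // 4, kernel_size=2, stride=2),
    nn.GELU(),
    nn.ConvTranspose2d(c // 4, c // 8, kernel_size=2, stride=2),
    nn.GELU(),
)
# Stand-ins for the per-token hypernetwork MLPs and the IoU head.
hyper_heads = nn.ModuleList([nn.Linear(c, c // 8) for _ in range(num_mask_tokens)])
iou_head = nn.Linear(c, num_mask_tokens)

# Random placeholders for what the transformer would return.
src = torch.rand(b, c, h, w)                     # image tokens reshaped to a feature map
iou_token_out = torch.rand(b, c)                 # output at the IoU token position
mask_tokens_out = torch.rand(b, num_mask_tokens, c)

with torch.no_grad():
    upscaled = upscale(src)                      # (b, c/8, 4h, 4w)
    # Each head maps its mask token to a vector of per-channel weights.
    hyper_in = torch.stack(
        [head(mask_tokens_out[:, i, :]) for i, head in enumerate(hyper_heads)], dim=1
    )                                            # (b, num_mask_tokens, c/8)
    bb, cc, hh, ww = upscaled.shape
    # A mask is the dot product of those weights with every pixel's feature vector.
    masks = (hyper_in @ upscaled.flatten(2)).reshape(bb, -1, hh, ww)
    iou_pred = iou_head(iou_token_out)           # one quality score per mask

print(masks.shape, iou_pred.shape)
```

Because the weights come from the mask tokens, each token yields a different mask from the same upscaled feature map.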
ultralytics.models.sam.modules.decoders.SAM2MaskDecoder
SAM2MaskDecoder(
    transformer_dim: int,
    transformer: nn.Module,
    num_multimask_outputs: int = 3,
    activation: Type[nn.Module] = nn.GELU,
    iou_head_depth: int = 3,
    iou_head_hidden_dim: int = 256,
    use_high_res_features: bool = False,
    iou_prediction_use_sigmoid=False,
    dynamic_multimask_via_stability=False,
    dynamic_multimask_stability_delta=0.05,
    dynamic_multimask_stability_thresh=0.98,
    pred_obj_scores: bool = False,
    pred_obj_scores_mlp: bool = False,
    use_multimask_token_for_obj_ptr: bool = False,
)
Bases: Module
Transformer-based decoder for predicting instance segmentation masks from image and prompt embeddings.
This class extends the functionality of MaskDecoder with high-resolution feature processing, dynamic multimask output, and object score prediction.
Attributes:
Name | Type | Description |
---|---|---|
transformer_dim | int | Channel dimension of the transformer. |
transformer | Module | Transformer used to predict masks. |
num_multimask_outputs | int | Number of masks to predict when disambiguating masks. |
iou_token | Embedding | Embedding for the IoU token. |
num_mask_tokens | int | Total number of mask tokens. |
mask_tokens | Embedding | Embedding for the mask tokens. |
pred_obj_scores | bool | Whether to predict object scores. |
obj_score_token | Embedding | Embedding for the object score token. |
use_multimask_token_for_obj_ptr | bool | Whether to use the multimask token for the object pointer. |
output_upscaling | Sequential | Upscaling layers for the output. |
use_high_res_features | bool | Whether to use high-resolution features. |
conv_s0 | Conv2d | Convolutional layer for high-resolution features (s0). |
conv_s1 | Conv2d | Convolutional layer for high-resolution features (s1). |
output_hypernetworks_mlps | ModuleList | List of MLPs for the output hypernetworks. |
iou_prediction_head | MLP | MLP for IoU prediction. |
pred_obj_score_head | Linear or MLP | Linear layer or MLP for object score prediction. |
dynamic_multimask_via_stability | bool | Whether to select the output mask dynamically based on mask stability. |
dynamic_multimask_stability_delta | float | Delta applied around the mask-logit threshold when computing stability scores. |
dynamic_multimask_stability_thresh | float | Stability score threshold used for dynamic multimask selection. |
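How high-resolution features might enter the upscaling path can be sketched as follows. This assumes the scheme used in the reference SAM 2 decoder: the s1 features are added after the first transposed convolution and the s0 features after the second. The layers here are local stand-ins (with `nn.GroupNorm` replacing the decoder's channel-wise layer norm), and the feature tensors are random placeholders for the outputs of `conv_s1` and `conv_s0`.

```python
import torch
from torch import nn

c = 256  # transformer_dim

# Local stand-ins for the layers inside output_upscaling.
dc1 = nn.ConvTranspose2d(c, c // 4, kernel_size=2, stride=2)
norm = nn.GroupNorm(1, c // 4)  # assumption: stand-in for a channel-wise layer norm
act = nn.GELU()
dc2 = nn.ConvTranspose2d(c // 4, c // 8, kernel_size=2, stride=2)

src = torch.rand(1, c, 16, 16)            # low-resolution decoder features
feat_s1 = torch.rand(1, c // 4, 32, 32)   # placeholder for projected 2x features
feat_s0 = torch.rand(1, c // 8, 64, 64)   # placeholder for projected 4x features

with torch.no_grad():
    # Fuse each high-resolution map at the matching upscaling stage.
    x = act(norm(dc1(src) + feat_s1))
    upscaled = act(dc2(x) + feat_s0)

print(upscaled.shape)
```

The additions only work if the projected features already match the channel count and spatial size of each stage, which is the role the two 1x1 convolution attributes play under this assumption.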
Methods:
Name | Description |
---|---|
forward | Predicts masks given image and prompt embeddings. |
predict_masks | Predicts instance segmentation masks from image and prompt embeddings. |
_get_stability_scores | Computes mask stability scores based on IoU between thresholds. |
_dynamic_multimask_via_stability | Dynamically selects the most stable mask output. |
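The two private methods above can be approximated with the following sketch. It assumes the stability score is the ratio between the mask area above `+delta` and the area above `-delta`, and that an unstable single-mask output is replaced by the multimask candidate with the highest predicted IoU. The function names are illustrative and the code is not taken from the module.

```python
import torch


def stability_scores(mask_logits: torch.Tensor, delta: float = 0.05) -> torch.Tensor:
    """IoU between the masks obtained by thresholding logits at +delta and -delta."""
    flat = mask_logits.flatten(-2)
    area_i = torch.sum(flat > delta, dim=-1).float()
    area_u = torch.sum(flat > -delta, dim=-1).float()
    # An empty mask at both thresholds is treated as perfectly stable.
    return torch.where(area_u > 0, area_i / area_u, torch.ones_like(area_u))


def dynamic_select(all_logits, all_iou, delta=0.05, thresh=0.98):
    """Keep candidate 0 where it is stable, else the best-scoring multimask candidate."""
    multi_logits, multi_iou = all_logits[:, 1:], all_iou[:, 1:]
    best = torch.argmax(multi_iou, dim=-1)
    batch = torch.arange(multi_iou.size(0), device=all_iou.device)
    best_logits = multi_logits[batch, best].unsqueeze(1)
    best_iou = multi_iou[batch, best].unsqueeze(1)

    single_logits, single_iou = all_logits[:, 0:1], all_iou[:, 0:1]
    is_stable = stability_scores(single_logits, delta) >= thresh  # shape (B, 1)

    mask_out = torch.where(is_stable[..., None, None].expand_as(single_logits), single_logits, best_logits)
    iou_out = torch.where(is_stable, single_iou, best_iou)
    return mask_out, iou_out


# Demo: candidate k (k = 1..3) is filled with the value k so it is easy to identify.
demo_logits = torch.zeros(2, 4, 4, 4)
for k in range(1, 4):
    demo_logits[:, k] = float(k)
demo_logits[0, 0] = 1.0        # image 0: confident single mask (stable)
demo_logits[1, 0, :2] = 1.0    # image 1: half the pixels sit at logit 0 (unstable)
demo_iou = torch.tensor([[0.9, 0.1, 0.8, 0.3], [0.9, 0.1, 0.8, 0.3]])

demo_masks, demo_scores = dynamic_select(demo_logits, demo_iou)
print(demo_masks[:, 0, 0, 0], demo_scores[:, 0])
```

In the demo, the first image keeps its single-mask output, while the second falls back to the multimask candidate with the highest predicted IoU.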
Examples:
>>> image_embeddings = torch.rand(1, 256, 64, 64)
>>> image_pe = torch.rand(1, 256, 64, 64)
>>> sparse_prompt_embeddings = torch.rand(1, 2, 256)
>>> dense_prompt_embeddings = torch.rand(1, 256, 64, 64)
>>> decoder = SAM2MaskDecoder(256, transformer)
>>> masks, iou_pred, sam_tokens_out, obj_score_logits = decoder.forward(
...     image_embeddings, image_pe, sparse_prompt_embeddings, dense_prompt_embeddings, True, False
... )
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transformer_dim | int | Channel dimension of the transformer. | required |
transformer | Module | Transformer used to predict masks. | required |
num_multimask_outputs | int | Number of masks to predict when disambiguating masks. | 3 |
activation | Type[Module] | Type of activation to use when upscaling masks. | GELU |
iou_head_depth | int | Depth of the MLP used to predict mask quality. | 3 |
iou_head_hidden_dim | int | Hidden dimension of the MLP used to predict mask quality. | 256 |
use_high_res_features | bool | Whether to use high-resolution features. | False |
iou_prediction_use_sigmoid | bool | Whether to apply a sigmoid to the IoU prediction. | False |
dynamic_multimask_via_stability | bool | Whether to select the output mask dynamically based on mask stability. | False |
dynamic_multimask_stability_delta | float | Delta applied around the mask-logit threshold when computing stability scores. | 0.05 |
dynamic_multimask_stability_thresh | float | Stability score threshold used for dynamic multimask selection. | 0.98 |
pred_obj_scores | bool | Whether to predict object scores. | False |
pred_obj_scores_mlp | bool | Whether to use an MLP for object score prediction. | False |
use_multimask_token_for_obj_ptr | bool | Whether to use the multimask token for the object pointer. | False |
Examples:
>>> transformer = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=256, nhead=8), num_layers=6)
>>> decoder = SAM2MaskDecoder(transformer_dim=256, transformer=transformer)
>>> print(decoder)
forward
forward(
    image_embeddings: torch.Tensor,
    image_pe: torch.Tensor,
    sparse_prompt_embeddings: torch.Tensor,
    dense_prompt_embeddings: torch.Tensor,
    multimask_output: bool,
    repeat_image: bool,
    high_res_features: Optional[List[torch.Tensor]] = None,
) -> Tuple[torch.Tensor, torch.Tensor]
Predicts masks given image and prompt embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
image_embeddings | Tensor | Embeddings from the image encoder with shape (B, C, H, W). | required |
image_pe | Tensor | Positional encoding with the shape of image_embeddings (B, C, H, W). | required |
sparse_prompt_embeddings | Tensor | Embeddings of the points and boxes with shape (B, N, C). | required |
dense_prompt_embeddings | Tensor | Embeddings of the mask inputs with shape (B, C, H, W). | required |
multimask_output | bool | Whether to return multiple masks or a single mask. | required |
repeat_image | bool | Flag to repeat the image embeddings. | required |
high_res_features | Optional[List[Tensor]] | Optional high-resolution features. | None |
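The effect of `repeat_image` can be illustrated with a small sketch. It assumes the flag is meant for the case where one image embedding is shared by several prompt sets, so the embedding is repeated along the batch dimension to match the tokens. `match_image_batch` is a hypothetical helper, not a function of the module.

```python
import torch


def match_image_batch(image_embeddings: torch.Tensor, tokens: torch.Tensor, repeat_image: bool):
    """Return image embeddings whose batch size matches the prompt tokens."""
    if repeat_image:
        # One image, many prompt sets: repeat the embedding once per prompt set.
        return torch.repeat_interleave(image_embeddings, tokens.shape[0], dim=0)
    # Otherwise the caller must already provide one embedding per prompt set.
    assert image_embeddings.shape[0] == tokens.shape[0]
    return image_embeddings


image_embeddings = torch.rand(1, 256, 8, 8)         # a single image
tokens = torch.rand(3, 7, 256)                      # three prompt sets for that image
dense_prompt_embeddings = torch.rand(3, 256, 8, 8)  # one dense prompt per prompt set

src = match_image_batch(image_embeddings, tokens, repeat_image=True)
src = src + dense_prompt_embeddings
print(src.shape)
```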
Returns:
Type | Description |
---|---|
Tuple[Tensor, Tensor, Tensor, Tensor] | A tuple (masks, iou_pred, sam_tokens_out, object_score_logits): masks (torch.Tensor) holds the batched predicted masks with shape (B, N, H, W); iou_pred (torch.Tensor) holds the batched predictions of mask quality with shape (B, N); sam_tokens_out (torch.Tensor) holds the batched SAM tokens for mask output with shape (B, N, C); object_score_logits (torch.Tensor) holds the batched object score logits with shape (B, 1). |
Examples:
>>> image_embeddings = torch.rand(1, 256, 64, 64)
>>> image_pe = torch.rand(1, 256, 64, 64)
>>> sparse_prompt_embeddings = torch.rand(1, 2, 256)
>>> dense_prompt_embeddings = torch.rand(1, 256, 64, 64)
>>> decoder = SAM2MaskDecoder(256, transformer)
>>> masks, iou_pred, sam_tokens_out, obj_score_logits = decoder.forward(
...     image_embeddings, image_pe, sparse_prompt_embeddings, dense_prompt_embeddings, True, False
... )
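One possible way to consume the four outputs is sketched below. It assumes the returned masks are logits that can be thresholded at 0, and that a positive object score logit means the object is present. Both the gating rule and the helper `pick_best_mask` are illustrative assumptions, not behavior documented for this module.

```python
import torch


def pick_best_mask(masks, iou_pred, object_score_logits, mask_threshold: float = 0.0):
    """Select the highest-scoring mask per image and blank it if no object is predicted."""
    best = iou_pred.argmax(dim=1)                       # (B,)
    batch = torch.arange(masks.shape[0], device=masks.device)
    binary = masks[batch, best] > mask_threshold        # (B, H, W) boolean masks
    present = object_score_logits.squeeze(1) > 0        # (B,) assumed presence test
    return binary & present[:, None, None], best


# Illustrative outputs for a batch of two images with three candidate masks each.
out_masks = torch.ones(2, 3, 4, 4)
out_iou = torch.tensor([[0.2, 0.9, 0.1], [0.7, 0.3, 0.5]])
out_obj = torch.tensor([[5.0], [-5.0]])

binary_masks, best_idx = pick_best_mask(out_masks, out_iou, out_obj)
print(best_idx, binary_masks.sum(dim=(1, 2)))
```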
predict_masks
predict_masks(
    image_embeddings: torch.Tensor,
    image_pe: torch.Tensor,
    sparse_prompt_embeddings: torch.Tensor,
    dense_prompt_embeddings: torch.Tensor,
    repeat_image: bool,
    high_res_features: Optional[List[torch.Tensor]] = None,
) -> Tuple[torch.Tensor, torch.Tensor]
Predicts instance segmentation masks from image and prompt embeddings using a transformer.
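When object score prediction is enabled, the token layout read back from the transformer output shifts by one position. The sketch below assumes the layout of the reference SAM 2 decoder: an optional object score token first, then the IoU token, then the mask tokens. It also assumes a constant positive logit is returned when `pred_obj_scores` is off. `split_decoder_tokens` and the `nn.Linear` head are illustrative stand-ins.

```python
import torch
from torch import nn


def split_decoder_tokens(hs: torch.Tensor, num_mask_tokens: int, pred_obj_scores: bool):
    """Slice the transformer output into object, IoU, and mask token outputs."""
    s = 1 if pred_obj_scores else 0  # offset introduced by the object score token
    obj_token_out = hs[:, 0, :] if pred_obj_scores else None
    iou_token_out = hs[:, s, :]
    mask_tokens_out = hs[:, s + 1 : s + 1 + num_mask_tokens, :]
    return obj_token_out, iou_token_out, mask_tokens_out


c, num_mask_tokens, num_prompts = 256, 4, 3
pred_obj_score_head = nn.Linear(c, 1)  # stand-in; an MLP is the documented alternative

# With object scores: 1 object token + 1 IoU token + mask tokens + prompt tokens.
hs_with_obj = torch.rand(2, 1 + 1 + num_mask_tokens + num_prompts, c)
obj_out, iou_out, mask_out = split_decoder_tokens(hs_with_obj, num_mask_tokens, True)
with torch.no_grad():
    object_score_logits = pred_obj_score_head(obj_out)  # (B, 1)

# Without object scores: no leading object token.
hs_plain = torch.rand(2, 1 + num_mask_tokens + num_prompts, c)
no_obj, iou_plain, mask_plain = split_decoder_tokens(hs_plain, num_mask_tokens, False)
# Assumption: a fixed large logit stands for "object present" in this case.
default_logits = 10.0 * torch.ones(hs_plain.shape[0], 1)

print(object_score_logits.shape, mask_out.shape, mask_plain.shape)
```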