Reference for ultralytics/models/sam/modules/sam.py
Note
This file is available at https://github.com/ultralytics/ultralytics/blob/main/ultralytics/models/sam/modules/sam.py. If you spot a problem, please help fix it by contributing a Pull Request 🛠️. Thank you 🙏!
ultralytics.models.sam.modules.sam.SAMModel
SAMModel(
image_encoder: ImageEncoderViT,
prompt_encoder: PromptEncoder,
mask_decoder: MaskDecoder,
pixel_mean: List[float] = (123.675, 116.28, 103.53),
pixel_std: List[float] = (58.395, 57.12, 57.375),
)
Bases: Module
Segment Anything Model (SAM) for object segmentation tasks.
This class combines image encoders, prompt encoders, and mask decoders to predict object masks from images and input prompts.
Attributes:

Name | Type | Description |
---|---|---|
`mask_threshold` | `float` | Threshold value for mask prediction. |
`image_encoder` | `ImageEncoderViT` | Backbone for encoding images into embeddings. |
`prompt_encoder` | `PromptEncoder` | Encoder for various types of input prompts. |
`mask_decoder` | `MaskDecoder` | Predicts object masks from image and prompt embeddings. |
Methods:

Name | Description |
---|---|
`set_imgsz` | Set image size to make model compatible with different image sizes. |
Examples:
>>> image_encoder = ImageEncoderViT(...)
>>> prompt_encoder = PromptEncoder(...)
>>> mask_decoder = MaskDecoder(...)
>>> sam_model = SAMModel(image_encoder, prompt_encoder, mask_decoder)
>>> # Further usage depends on SAMPredictor class
Notes
All forward() operations are implemented in the SAMPredictor class.
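Because inference is driven by the predictor rather than by this module's `forward()`, the usual entry point is the high-level `SAM` wrapper. A minimal sketch, assuming a local `sam_b.pt` checkpoint and an `image.jpg` file (both hypothetical paths):

```python
from ultralytics import SAM

# Load a SAM checkpoint through the high-level wrapper; the predictor,
# not SAMModel.forward(), runs inference under the hood.
model = SAM("sam_b.pt")

# Segment with a single foreground point prompt (x, y); label 1 = foreground.
results = model("image.jpg", points=[[500, 375]], labels=[1])
```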
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`image_encoder` | `ImageEncoderViT` | The backbone used to encode the image into image embeddings. | required |
`prompt_encoder` | `PromptEncoder` | Encodes various types of input prompts. | required |
`mask_decoder` | `MaskDecoder` | Predicts masks from the image embeddings and encoded prompts. | required |
`pixel_mean` | `List[float]` | Mean values for normalizing pixels in the input image. | `(123.675, 116.28, 103.53)` |
`pixel_std` | `List[float]` | Std values for normalizing pixels in the input image. | `(58.395, 57.12, 57.375)` |
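The default `pixel_mean` and `pixel_std` are the standard ImageNet RGB statistics scaled to the 0-255 range. A minimal sketch (not the library's own preprocessing code) of the per-channel normalization these defaults imply:

```python
import torch

# Defaults from the table above, reshaped for (C, H, W) broadcasting.
pixel_mean = torch.tensor([123.675, 116.28, 103.53]).view(-1, 1, 1)
pixel_std = torch.tensor([58.395, 57.12, 57.375]).view(-1, 1, 1)

image = torch.rand(3, 1024, 1024) * 255  # dummy RGB image in [0, 255]
normalized = (image - pixel_mean) / pixel_std  # roughly zero-mean, unit-variance per channel
```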
set_imgsz
Set image size to make model compatible with different image sizes.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`imgsz` | `Tuple[int, int]` | The size of the input image. | required |
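A hedged usage sketch, reusing the `sam_model` constructed in the Examples above:

```python
>>> sam_model.set_imgsz((1024, 1024))  # reconfigure the model for 1024x1024 inputs
```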
ultralytics.models.sam.modules.sam.SAM2Model
SAM2Model(
image_encoder,
memory_attention,
memory_encoder,
num_maskmem=7,
image_size=512,
backbone_stride=16,
sigmoid_scale_for_mem_enc=1.0,
sigmoid_bias_for_mem_enc=0.0,
binarize_mask_from_pts_for_mem_enc=False,
use_mask_input_as_output_without_sam=False,
max_cond_frames_in_attn=-1,
directly_add_no_mem_embed=False,
use_high_res_features_in_sam=False,
multimask_output_in_sam=False,
multimask_min_pt_num=1,
multimask_max_pt_num=1,
multimask_output_for_tracking=False,
use_multimask_token_for_obj_ptr: bool = False,
iou_prediction_use_sigmoid=False,
memory_temporal_stride_for_eval=1,
non_overlap_masks_for_mem_enc=False,
use_obj_ptrs_in_encoder=False,
max_obj_ptrs_in_encoder=16,
add_tpos_enc_to_obj_ptrs=True,
proj_tpos_enc_in_obj_ptrs=False,
use_signed_tpos_enc_to_obj_ptrs=False,
only_obj_ptrs_in_the_past_for_eval=False,
pred_obj_scores: bool = False,
pred_obj_scores_mlp: bool = False,
fixed_no_obj_ptr: bool = False,
soft_no_obj_ptr: bool = False,
use_mlp_for_obj_ptr_proj: bool = False,
no_obj_embed_spatial: bool = False,
sam_mask_decoder_extra_args=None,
compile_image_encoder: bool = False,
)
Bases: Module
SAM2Model class for Segment Anything Model 2 with memory-based video object segmentation capabilities.
This class extends the functionality of SAM to handle video sequences, incorporating memory mechanisms for temporal consistency and efficient tracking of objects across frames.
Attributes:

Name | Type | Description |
---|---|---|
`mask_threshold` | `float` | Threshold value for mask prediction. |
`image_encoder` | `ImageEncoderViT` | Visual encoder for extracting image features. |
`memory_attention` | `Module` | Module for attending to memory features. |
`memory_encoder` | `Module` | Encoder for generating memory representations. |
`num_maskmem` | `int` | Number of accessible memory frames. |
`image_size` | `int` | Size of input images. |
`backbone_stride` | `int` | Stride of the backbone network output. |
`sam_prompt_embed_dim` | `int` | Dimension of SAM prompt embeddings. |
`sam_image_embedding_size` | `int` | Size of SAM image embeddings. |
`sam_prompt_encoder` | `PromptEncoder` | Encoder for processing input prompts. |
`sam_mask_decoder` | `SAM2MaskDecoder` | Decoder for generating object masks. |
`obj_ptr_proj` | `Module` | Projection layer for object pointers. |
`obj_ptr_tpos_proj` | `Module` | Projection for temporal positional encoding in object pointers. |
Methods:

Name | Description |
---|---|
`forward_image` | Processes image batch through encoder to extract multi-level features. |
`track_step` | Performs a single tracking step, updating object masks and memory features. |
Examples:
>>> model = SAM2Model(image_encoder, memory_attention, memory_encoder)
>>> image_batch = torch.rand(1, 3, 512, 512)
>>> features = model.forward_image(image_batch)
>>> track_results = model.track_step(0, True, features, None, None, None, {})
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`image_encoder` | `Module` | Visual encoder for extracting image features. | required |
`memory_attention` | `Module` | Module for attending to memory features. | required |
`memory_encoder` | `Module` | Encoder for generating memory representations. | required |
`num_maskmem` | `int` | Number of accessible memory frames. Default is 7 (1 input frame + 6 previous frames). | `7` |
`image_size` | `int` | Size of input images. | `512` |
`backbone_stride` | `int` | Stride of the image backbone output. | `16` |
`sigmoid_scale_for_mem_enc` | `float` | Scale factor for mask sigmoid probability. | `1.0` |
`sigmoid_bias_for_mem_enc` | `float` | Bias factor for mask sigmoid probability. | `0.0` |
`binarize_mask_from_pts_for_mem_enc` | `bool` | Whether to binarize sigmoid mask logits on interacted frames with clicks during evaluation. | `False` |
`use_mask_input_as_output_without_sam` | `bool` | Whether to directly output the input mask without using the SAM prompt encoder and mask decoder on frames with mask input. | `False` |
`max_cond_frames_in_attn` | `int` | Maximum number of conditioning frames to participate in memory attention; -1 means no limit. | `-1` |
`directly_add_no_mem_embed` | `bool` | Whether to directly add the no-memory embedding to the image feature on the first frame. | `False` |
`use_high_res_features_in_sam` | `bool` | Whether to use high-resolution feature maps in the SAM mask decoder. | `False` |
`multimask_output_in_sam` | `bool` | Whether to output multiple (3) masks for the first click on initial conditioning frames. | `False` |
`multimask_min_pt_num` | `int` | Minimum number of clicks to use multimask output in SAM. | `1` |
`multimask_max_pt_num` | `int` | Maximum number of clicks to use multimask output in SAM. | `1` |
`multimask_output_for_tracking` | `bool` | Whether to use multimask output for tracking. | `False` |
`use_multimask_token_for_obj_ptr` | `bool` | Whether to use multimask tokens for object pointers. | `False` |
`iou_prediction_use_sigmoid` | `bool` | Whether to use sigmoid to restrict IoU prediction to [0, 1]. | `False` |
`memory_temporal_stride_for_eval` | `int` | Memory bank's temporal stride during evaluation. | `1` |
`non_overlap_masks_for_mem_enc` | `bool` | Whether to apply non-overlapping constraints on object masks in the memory encoder during evaluation. | `False` |
`use_obj_ptrs_in_encoder` | `bool` | Whether to cross-attend to object pointers from other frames in the encoder. | `False` |
`max_obj_ptrs_in_encoder` | `int` | Maximum number of object pointers from other frames in encoder cross-attention. | `16` |
`add_tpos_enc_to_obj_ptrs` | `bool` | Whether to add temporal positional encoding to object pointers in the encoder. | `True` |
`proj_tpos_enc_in_obj_ptrs` | `bool` | Whether to add an extra linear projection layer for temporal positional encoding in object pointers. | `False` |
`use_signed_tpos_enc_to_obj_ptrs` | `bool` | Whether to use signed distance (instead of unsigned absolute distance) in the temporal positional encoding of object pointers; only relevant when both `use_obj_ptrs_in_encoder` and `add_tpos_enc_to_obj_ptrs` are True. | `False` |
`only_obj_ptrs_in_the_past_for_eval` | `bool` | Whether to only attend to object pointers in the past during evaluation. | `False` |
`pred_obj_scores` | `bool` | Whether to predict if there is an object in the frame. | `False` |
`pred_obj_scores_mlp` | `bool` | Whether to use an MLP to predict object scores. | `False` |
`fixed_no_obj_ptr` | `bool` | Whether to use a fixed no-object pointer when there is no object present. | `False` |
`soft_no_obj_ptr` | `bool` | Whether to mix in the no-object pointer softly for easier recovery and error mitigation. | `False` |
`use_mlp_for_obj_ptr_proj` | `bool` | Whether to use an MLP for object pointer projection. | `False` |
`no_obj_embed_spatial` | `bool` | Whether to add a no-object embedding to spatial frames. | `False` |
`sam_mask_decoder_extra_args` | `Dict \| None` | Extra arguments for constructing the SAM mask decoder. | `None` |
`compile_image_encoder` | `bool` | Whether to compile the image encoder for faster inference. | `False` |
Examples:
>>> image_encoder = ImageEncoderViT(...)
>>> memory_attention = SAM2TwoWayTransformer(...)
>>> memory_encoder = nn.Sequential(...)
>>> model = SAM2Model(image_encoder, memory_attention, memory_encoder)
>>> image_batch = torch.rand(1, 3, 512, 512)
>>> features = model.forward_image(image_batch)
>>> track_results = model.track_step(0, True, features, None, None, None, {})
forward
Processes image and prompt inputs to generate object masks and scores in video sequences.
forward_image
Processes image batch through encoder to extract multi-level features for SAM model.
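Usage mirrors the class Examples above; the input's spatial size should match the model's `image_size` (512 in this configuration):

```python
>>> image_batch = torch.rand(1, 3, 512, 512)  # (batch, channels, height, width)
>>> features = model.forward_image(image_batch)  # multi-level backbone features
```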
set_binarize
Set binarize for VideoPredictor.
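A hedged usage sketch; the single boolean argument is an assumption based on the flag this method toggles:

```python
>>> model.set_binarize(True)  # binarize mask logits before they are encoded into memory
```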
set_imgsz
Set image size to make model compatible with different image sizes.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`imgsz` | `Tuple[int, int]` | The size of the input image. | required |
track_step
track_step(
frame_idx,
is_init_cond_frame,
current_vision_feats,
current_vision_pos_embeds,
feat_sizes,
point_inputs,
mask_inputs,
output_dict,
num_frames,
track_in_reverse=False,
run_mem_encoder=True,
prev_sam_mask_logits=None,
)
Performs a single tracking step, updating object masks and memory features based on current frame inputs.
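A hedged sketch of a first tracking step, spelling out by name the positional layout from the signature above (argument values follow the class Examples; this is not library-verified usage):

```python
>>> features = model.forward_image(torch.rand(1, 3, 512, 512))
>>> track_results = model.track_step(
...     frame_idx=0,
...     is_init_cond_frame=True,        # first user-interacted frame in the sequence
...     current_vision_feats=features,
...     current_vision_pos_embeds=None,
...     feat_sizes=None,
...     point_inputs=None,              # no click prompts in this sketch
...     mask_inputs=None,               # no mask prompt either
...     output_dict={},                 # memory store, populated across frames
...     num_frames=1,
... )
```

Subsequent frames would pass `is_init_cond_frame=False` and the `output_dict` populated by earlier steps, so the memory attention can condition on previous frames.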