Reference for ultralytics/models/sam/modules/sam.py
Note
This file is available at https://github.com/ultralytics/ultralytics/blob/main/ultralytics/models/sam/modules/sam.py. If you spot a problem please help fix it by contributing a Pull Request 🛠️. Thank you 🙏!
ultralytics.models.sam.modules.sam.SAMModel
SAMModel(
image_encoder: ImageEncoderViT,
prompt_encoder: PromptEncoder,
mask_decoder: MaskDecoder,
pixel_mean: List[float] = (123.675, 116.28, 103.53),
pixel_std: List[float] = (58.395, 57.12, 57.375),
)
Bases: Module
Segment Anything Model (SAM) for object segmentation tasks.
This class combines image encoders, prompt encoders, and mask decoders to predict object masks from images and input prompts.
Attributes:
Name | Type | Description |
---|---|---|
mask_threshold | float | Threshold value for mask prediction. |
image_encoder | ImageEncoderViT | Backbone for encoding images into embeddings. |
prompt_encoder | PromptEncoder | Encoder for various types of input prompts. |
mask_decoder | MaskDecoder | Predicts object masks from image and prompt embeddings. |
pixel_mean | Tensor | Mean pixel values for image normalization, shape (3, 1, 1). |
pixel_std | Tensor | Standard deviation values for image normalization, shape (3, 1, 1). |
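The pixel_mean and pixel_std attributes describe a per-channel normalization. The following is a minimal sketch of that arithmetic in plain Python, assuming pixel values in the 0-255 range and RGB channel order. The helper name normalize_pixel is hypothetical and not part of the module, which stores these statistics as tensors of shape (3, 1, 1) so that they broadcast over image height and width.

```python
# Default per-channel statistics copied from the SAMModel signature (assumed RGB order).
PIXEL_MEAN = (123.675, 116.28, 103.53)
PIXEL_STD = (58.395, 57.12, 57.375)


def normalize_pixel(rgb, mean=PIXEL_MEAN, std=PIXEL_STD):
    """Hypothetical helper: return (value - mean) / std for each channel of one pixel."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


# A pixel equal to the channel means maps to zero in every channel.
centered = normalize_pixel(PIXEL_MEAN)

# A pixel one standard deviation above the mean maps to roughly 1.0 per channel.
bright = normalize_pixel(tuple(m + s for m, s in zip(PIXEL_MEAN, PIXEL_STD)))
```

Applying the same expression to a whole image tensor gives the same result per pixel, since the (3, 1, 1) shape lines up with the channel dimension.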
Methods:
Name | Description |
---|---|
set_imgsz | Set image size to make model compatible with different image sizes. |
Examples:
>>> image_encoder = ImageEncoderViT(...)
>>> prompt_encoder = PromptEncoder(...)
>>> mask_decoder = MaskDecoder(...)
>>> sam_model = SAMModel(image_encoder, prompt_encoder, mask_decoder)
>>> # Further usage depends on SAMPredictor class
Notes
All forward() operations are implemented in the SAMPredictor class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
image_encoder | ImageEncoderViT | The backbone used to encode the image into image embeddings. | required |
prompt_encoder | PromptEncoder | Encodes various types of input prompts. | required |
mask_decoder | MaskDecoder | Predicts masks from the image embeddings and encoded prompts. | required |
pixel_mean | List[float] | Mean values for normalizing pixels in the input image. | (123.675, 116.28, 103.53) |
pixel_std | List[float] | Std values for normalizing pixels in the input image. | (58.395, 57.12, 57.375) |
set_imgsz
Sets the input image size so the model can run on images of different sizes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
imgsz | Tuple[int, int] | The size of the input image. | required |
ultralytics.models.sam.modules.sam.SAM2Model
SAM2Model(
image_encoder,
memory_attention,
memory_encoder,
num_maskmem=7,
image_size=512,
backbone_stride=16,
sigmoid_scale_for_mem_enc=1.0,
sigmoid_bias_for_mem_enc=0.0,
binarize_mask_from_pts_for_mem_enc=False,
use_mask_input_as_output_without_sam=False,
max_cond_frames_in_attn=-1,
directly_add_no_mem_embed=False,
use_high_res_features_in_sam=False,
multimask_output_in_sam=False,
multimask_min_pt_num=1,
multimask_max_pt_num=1,
multimask_output_for_tracking=False,
use_multimask_token_for_obj_ptr: bool = False,
iou_prediction_use_sigmoid=False,
memory_temporal_stride_for_eval=1,
add_all_frames_to_correct_as_cond=False,
non_overlap_masks_for_mem_enc=False,
use_obj_ptrs_in_encoder=False,
max_obj_ptrs_in_encoder=16,
add_tpos_enc_to_obj_ptrs=True,
proj_tpos_enc_in_obj_ptrs=False,
only_obj_ptrs_in_the_past_for_eval=False,
pred_obj_scores: bool = False,
pred_obj_scores_mlp: bool = False,
fixed_no_obj_ptr: bool = False,
soft_no_obj_ptr: bool = False,
use_mlp_for_obj_ptr_proj: bool = False,
sam_mask_decoder_extra_args=None,
compile_image_encoder: bool = False,
)
Bases: Module
SAM2Model class for Segment Anything Model 2 with memory-based video object segmentation capabilities.
This class extends the functionality of SAM to handle video sequences, incorporating memory mechanisms for temporal consistency and efficient tracking of objects across frames.
Attributes:
Name | Type | Description |
---|---|---|
mask_threshold | float | Threshold value for mask prediction. |
image_encoder | ImageEncoderViT | Visual encoder for extracting image features. |
memory_attention | Module | Module for attending to memory features. |
memory_encoder | Module | Encoder for generating memory representations. |
num_maskmem | int | Number of accessible memory frames. |
image_size | int | Size of input images. |
backbone_stride | int | Stride of the backbone network output. |
sam_prompt_embed_dim | int | Dimension of SAM prompt embeddings. |
sam_image_embedding_size | int | Size of SAM image embeddings. |
sam_prompt_encoder | PromptEncoder | Encoder for processing input prompts. |
sam_mask_decoder | SAM2MaskDecoder | Decoder for generating object masks. |
obj_ptr_proj | Module | Projection layer for object pointers. |
obj_ptr_tpos_proj | Module | Projection for temporal positional encoding in object pointers. |
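The image_size and backbone_stride attributes together determine how coarse the feature grid is. The sketch below assumes the grid side length is image_size divided by backbone_stride using integer division, which is the usual relationship for a strided backbone. Whether sam_image_embedding_size is computed in exactly this way is an assumption, and embedding_grid_size is a hypothetical helper, not part of the module.

```python
def embedding_grid_size(image_size: int = 512, backbone_stride: int = 16) -> int:
    """Hypothetical helper: side length of the feature grid for a square input."""
    # Rejecting non-divisible sizes is a choice made for this sketch only.
    if image_size % backbone_stride != 0:
        raise ValueError("image_size should be divisible by backbone_stride")
    return image_size // backbone_stride


grid = embedding_grid_size()  # uses the defaults shown in the SAM2Model signature
tokens = grid * grid          # number of spatial positions in the grid
```

Under these assumptions, doubling image_size while keeping the stride fixed quadruples the number of spatial positions that downstream attention layers see.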
Methods:
Name | Description |
---|---|
forward_image | Processes image batch through encoder to extract multi-level features. |
track_step | Performs a single tracking step, updating object masks and memory features. |
Examples:
>>> model = SAM2Model(image_encoder, memory_attention, memory_encoder)
>>> image_batch = torch.rand(1, 3, 512, 512)
>>> features = model.forward_image(image_batch)
>>> track_results = model.track_step(0, True, features, None, None, None, {})
Parameters:
Name | Type | Description | Default |
---|---|---|---|
image_encoder | Module | Visual encoder for extracting image features. | required |
memory_attention | Module | Module for attending to memory features. | required |
memory_encoder | Module | Encoder for generating memory representations. | required |
num_maskmem | int | Number of accessible memory frames. Default is 7 (1 input frame + 6 previous frames). | 7 |
image_size | int | Size of input images. | 512 |
backbone_stride | int | Stride of the image backbone output. | 16 |
sigmoid_scale_for_mem_enc | float | Scale factor for mask sigmoid probability. | 1.0 |
sigmoid_bias_for_mem_enc | float | Bias factor for mask sigmoid probability. | 0.0 |
binarize_mask_from_pts_for_mem_enc | bool | Whether to binarize sigmoid mask logits on interacted frames with clicks during evaluation. | False |
use_mask_input_as_output_without_sam | bool | Whether to directly output the input mask without using SAM prompt encoder and mask decoder on frames with mask input. | False |
max_cond_frames_in_attn | int | Maximum number of conditioning frames to participate in memory attention. -1 means no limit. | -1 |
directly_add_no_mem_embed | bool | Whether to directly add no-memory embedding to image feature on the first frame. | False |
use_high_res_features_in_sam | bool | Whether to use high-resolution feature maps in the SAM mask decoder. | False |
multimask_output_in_sam | bool | Whether to output multiple (3) masks for the first click on initial conditioning frames. | False |
multimask_min_pt_num | int | Minimum number of clicks to use multimask output in SAM. | 1 |
multimask_max_pt_num | int | Maximum number of clicks to use multimask output in SAM. | 1 |
multimask_output_for_tracking | bool | Whether to use multimask output for tracking. | False |
use_multimask_token_for_obj_ptr | bool | Whether to use multimask tokens for object pointers. | False |
iou_prediction_use_sigmoid | bool | Whether to use sigmoid to restrict IoU prediction to [0-1]. | False |
memory_temporal_stride_for_eval | int | Memory bank's temporal stride during evaluation. | 1 |
add_all_frames_to_correct_as_cond | bool | Whether to append frames with correction clicks to conditioning frame list. | False |
non_overlap_masks_for_mem_enc | bool | Whether to apply non-overlapping constraints on object masks in memory encoder during evaluation. | False |
use_obj_ptrs_in_encoder | bool | Whether to cross-attend to object pointers from other frames in the encoder. | False |
max_obj_ptrs_in_encoder | int | Maximum number of object pointers from other frames in encoder cross-attention. | 16 |
add_tpos_enc_to_obj_ptrs | bool | Whether to add temporal positional encoding to object pointers in the encoder. | True |
proj_tpos_enc_in_obj_ptrs | bool | Whether to add an extra linear projection layer for temporal positional encoding in object pointers. | False |
only_obj_ptrs_in_the_past_for_eval | bool | Whether to only attend to object pointers in the past during evaluation. | False |
pred_obj_scores | bool | Whether to predict if there is an object in the frame. | False |
pred_obj_scores_mlp | bool | Whether to use an MLP to predict object scores. | False |
fixed_no_obj_ptr | bool | Whether to have a fixed no-object pointer when there is no object present. | False |
soft_no_obj_ptr | bool | Whether to mix in no-object pointer softly for easier recovery and error mitigation. | False |
use_mlp_for_obj_ptr_proj | bool | Whether to use MLP for object pointer projection. | False |
sam_mask_decoder_extra_args | Dict or None | Extra arguments for constructing the SAM mask decoder. | None |
compile_image_encoder | bool | Whether to compile the image encoder for faster inference. | False |
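Several of the multimask parameters above interact. The sketch below shows one plausible gating rule based on their descriptions. It assumes that multiple masks are produced only when multimask_output_in_sam is enabled, the frame is an initial conditioning frame (or multimask_output_for_tracking is set), and the number of clicks lies between multimask_min_pt_num and multimask_max_pt_num inclusive. The function use_multimask is a hypothetical stand-in for illustration, not the module's API.

```python
def use_multimask(
    num_points: int,
    is_init_cond_frame: bool,
    multimask_output_in_sam: bool = False,
    multimask_output_for_tracking: bool = False,
    multimask_min_pt_num: int = 1,
    multimask_max_pt_num: int = 1,
) -> bool:
    """Hypothetical helper: decide whether to request multiple candidate masks."""
    # Tracking frames qualify only if multimask output is enabled for tracking.
    frame_allows = is_init_cond_frame or multimask_output_for_tracking
    # The click count must fall inside the configured inclusive range.
    count_allows = multimask_min_pt_num <= num_points <= multimask_max_pt_num
    return multimask_output_in_sam and frame_allows and count_allows


# With the signature defaults, multimask output is never requested.
default_case = use_multimask(num_points=1, is_init_cond_frame=True)

# Enabling the main flag allows it for a single click on an initial frame.
first_click = use_multimask(1, True, multimask_output_in_sam=True)
```

With the default range of exactly one click, a second correction click would fall outside the range, so under these assumptions only a single mask would be requested.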
Examples:
>>> image_encoder = ImageEncoderViT(...)
>>> memory_attention = SAM2TwoWayTransformer(...)
>>> memory_encoder = nn.Sequential(...)
>>> model = SAM2Model(image_encoder, memory_attention, memory_encoder)
>>> image_batch = torch.rand(1, 3, 512, 512)
>>> features = model.forward_image(image_batch)
>>> track_results = model.track_step(0, True, features, None, None, None, {})
forward
Processes image and prompt inputs to generate object masks and scores in video sequences.
forward_image
Processes image batch through encoder to extract multi-level features for SAM model.
set_imgsz
Sets the input image size so the model can run on images of different sizes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
imgsz | Tuple[int, int] | The size of the input image. | required |
track_step
track_step(
frame_idx,
is_init_cond_frame,
current_vision_feats,
current_vision_pos_embeds,
feat_sizes,
point_inputs,
mask_inputs,
output_dict,
num_frames,
track_in_reverse=False,
run_mem_encoder=True,
prev_sam_mask_logits=None,
)
Performs a single tracking step, updating object masks and memory features based on current frame inputs.
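To make the memory aspect of a tracking step more concrete, here is a simplified sketch of which neighbouring frames could serve as memory for a given frame. It assumes a temporal stride of 1, ignores conditioning frames and object pointers, and takes num_maskmem - 1 neighbouring slots, following the parameter description (1 input frame + 6 previous frames). The function recent_memory_frames is hypothetical and only illustrates the role of frame_idx, num_frames, and track_in_reverse.

```python
def recent_memory_frames(frame_idx, num_frames, num_maskmem=7, track_in_reverse=False):
    """Hypothetical helper: list neighbouring frame indices usable as memory."""
    # When tracking in reverse, already-processed frames have larger indices.
    step = 1 if track_in_reverse else -1
    candidates = (frame_idx + step * offset for offset in range(1, num_maskmem))
    # Drop indices that fall outside the video.
    return [i for i in candidates if 0 <= i < num_frames]


forward_memory = recent_memory_frames(frame_idx=10, num_frames=20)
reverse_memory = recent_memory_frames(frame_idx=17, num_frames=20, track_in_reverse=True)
```

Near the start of a video (or the end, when tracking in reverse) fewer neighbours exist, so the returned list is shorter than num_maskmem - 1.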