Reference for ultralytics/models/sam/modules/encoders.py
Note
This file is available at https://github.com/ultralytics/ultralytics/blob/main/ultralytics/models/sam/modules/encoders.py. If you spot a problem please help fix it by contributing a Pull Request 🛠️. Thank you 🙏!
ultralytics.models.sam.modules.encoders.ImageEncoderViT
ImageEncoderViT(img_size: int = 1024, patch_size: int = 16, in_chans: int = 3, embed_dim: int = 768, depth: int = 12, num_heads: int = 12, mlp_ratio: float = 4.0, out_chans: int = 256, qkv_bias: bool = True, norm_layer: Type[nn.Module] = nn.LayerNorm, act_layer: Type[nn.Module] = nn.GELU, use_abs_pos: bool = True, use_rel_pos: bool = False, rel_pos_zero_init: bool = True, window_size: int = 0, global_attn_indexes: Tuple[int, ...] = ())
Bases: Module
An image encoder using Vision Transformer (ViT) architecture for encoding an image into a compact latent space. The encoder takes an image, splits it into patches, and processes these patches through a series of transformer blocks. The encoded patches are then processed through a neck to generate the final encoded representation.
This class and its supporting functions below are lightly adapted from the ViTDet backbone available at https://github.com/facebookresearch/detectron2/blob/main/detectron2/modeling/backbone/vit.py.
Attributes:

Name | Type | Description |
---|---|---|
img_size | int | Dimension of input images, assumed to be square. |
patch_embed | PatchEmbed | Module for patch embedding. |
pos_embed | Parameter | Absolute positional embedding for patches. |
blocks | ModuleList | List of transformer blocks for processing patch embeddings. |
neck | Sequential | Neck module to further process the output. |
Parameters:

Name | Type | Description | Default |
---|---|---|---|
patch_size | int | Patch size. | 16 |
in_chans | int | Number of input image channels. | 3 |
embed_dim | int | Patch embedding dimension. | 768 |
depth | int | Depth of ViT. | 12 |
num_heads | int | Number of attention heads in each ViT block. | 12 |
mlp_ratio | float | Ratio of mlp hidden dim to embedding dim. | 4.0 |
qkv_bias | bool | If True, add a learnable bias to query, key, value. | True |
norm_layer | Type[nn.Module] | Normalization layer. | nn.LayerNorm |
act_layer | Type[nn.Module] | Activation layer. | nn.GELU |
use_abs_pos | bool | If True, use absolute positional embeddings. | True |
use_rel_pos | bool | If True, add relative positional embeddings to the attention map. | False |
rel_pos_zero_init | bool | If True, zero initialize relative positional parameters. | True |
window_size | int | Window size for window attention blocks. | 0 |
global_attn_indexes | Tuple[int, ...] | Indexes for blocks using global attention. | () |
Source code in ultralytics/models/sam/modules/encoders.py
forward
Processes input through patch embedding, applies positional embedding if present, and passes through blocks and neck.
Source code in ultralytics/models/sam/modules/encoders.py
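A minimal usage sketch (assuming a local PyTorch install and enough memory for global attention over a 64x64 token grid): with the default patch_size=16 and out_chans=256, a 1024x1024 input should yield a 1x256x64x64 feature map.

```python
import torch
from ultralytics.models.sam.modules.encoders import ImageEncoderViT

# Defaults correspond to a ViT-B/16-style encoder for 1024x1024 inputs.
encoder = ImageEncoderViT(img_size=1024, patch_size=16, embed_dim=768, depth=12, num_heads=12)
image = torch.randn(1, 3, 1024, 1024)  # (B, C, H, W) dummy RGB image
with torch.no_grad():
    features = encoder(image)
print(features.shape)  # expected: torch.Size([1, 256, 64, 64]) given out_chans=256
```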
ultralytics.models.sam.modules.encoders.PromptEncoder
PromptEncoder(embed_dim: int, image_embedding_size: Tuple[int, int], input_image_size: Tuple[int, int], mask_in_chans: int, activation: Type[nn.Module] = nn.GELU)
Bases: Module
Encodes different types of prompts, including points, boxes, and masks, for input to SAM's mask decoder. The encoder produces both sparse and dense embeddings for the input prompts.
Attributes:

Name | Type | Description |
---|---|---|
embed_dim | int | Dimension of the embeddings. |
input_image_size | Tuple[int, int] | Size of the input image as (H, W). |
image_embedding_size | Tuple[int, int] | Spatial size of the image embedding as (H, W). |
pe_layer | PositionEmbeddingRandom | Module for random position embedding. |
num_point_embeddings | int | Number of point embeddings for different types of points. |
point_embeddings | ModuleList | List of point embeddings. |
not_a_point_embed | Embedding | Embedding for points that are not a part of any label. |
mask_input_size | Tuple[int, int] | Size of the input mask. |
mask_downscaling | Sequential | Neural network for downscaling the mask. |
no_mask_embed | Embedding | Embedding for cases where no mask is provided. |
Parameters:

Name | Type | Description | Default |
---|---|---|---|
embed_dim | int | The prompts' embedding dimension. | required |
image_embedding_size | Tuple[int, int] | The spatial size of the image embedding, as (H, W). | required |
input_image_size | Tuple[int, int] | The padded size of the image as input to the image encoder, as (H, W). | required |
mask_in_chans | int | The number of hidden channels used for encoding input masks. | required |
activation | Type[nn.Module] | The activation to use when encoding input masks. | nn.GELU |
Source code in ultralytics/models/sam/modules/encoders.py
forward
forward(points: Optional[Tuple[torch.Tensor, torch.Tensor]], boxes: Optional[torch.Tensor], masks: Optional[torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]
Embeds different types of prompts, returning both sparse and dense embeddings.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
points | Optional[Tuple[Tensor, Tensor]] | Point coordinates and labels to embed. | required |
boxes | Optional[Tensor] | Boxes to embed. | required |
masks | Optional[Tensor] | Masks to embed. | required |
Returns:

Type | Description |
---|---|
Tensor | Sparse embeddings for the points and boxes, with shape BxNx(embed_dim), where N is determined by the number of input points and boxes. |
Tensor | Dense embeddings for the masks, with shape Bx(embed_dim)x(embed_H)x(embed_W). |
Source code in ultralytics/models/sam/modules/encoders.py
get_dense_pe
Returns the positional encoding used to encode point prompts, applied to a dense set of points matching the shape of the image encoding.
Returns:

Type | Description |
---|---|
Tensor | Positional encoding with shape 1x(embed_dim)x(embedding_h)x(embedding_w). |
Source code in ultralytics/models/sam/modules/encoders.py
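A sketch of encoding a single foreground point, using the sizes SAM typically pairs with its image encoder (an assumption here; any mutually consistent sizes work):

```python
import torch
from ultralytics.models.sam.modules.encoders import PromptEncoder

prompt_encoder = PromptEncoder(
    embed_dim=256,
    image_embedding_size=(64, 64),   # spatial size of the image encoder output
    input_image_size=(1024, 1024),   # padded image size fed to the image encoder
    mask_in_chans=16,
)
coords = torch.tensor([[[512.0, 512.0]]])  # (B, N, 2) point coordinates in pixels
labels = torch.tensor([[1]])               # (B, N) labels; 1 = foreground
sparse, dense = prompt_encoder(points=(coords, labels), boxes=None, masks=None)
print(sparse.shape)  # (1, N', 256); N' depends on the prompts supplied
print(dense.shape)   # (1, 256, 64, 64); the no-mask embedding when masks=None
dense_pe = prompt_encoder.get_dense_pe()   # (1, 256, 64, 64) positional encoding
```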
ultralytics.models.sam.modules.encoders.PositionEmbeddingRandom
Bases: Module
Positional encoding using random spatial frequencies.
Source code in ultralytics/models/sam/modules/encoders.py
forward
Generate positional encoding for a grid of the specified size.
Source code in ultralytics/models/sam/modules/encoders.py
forward_with_coords
Positionally encode points that are not normalized to [0,1].
Source code in ultralytics/models/sam/modules/encoders.py
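The constructor arguments are not documented above; the sketch below assumes SAM's usual num_pos_feats=64 default, which makes the channel dimension 2 * num_pos_feats = 128.

```python
import torch
from ultralytics.models.sam.modules.encoders import PositionEmbeddingRandom

pe_layer = PositionEmbeddingRandom(64)       # assumed num_pos_feats argument
grid_pe = pe_layer((64, 64))                 # (128, 64, 64) encoding for a 64x64 grid
points = torch.tensor([[[100.0, 200.0]]])    # (B, N, 2) raw pixel coordinates
point_pe = pe_layer.forward_with_coords(points, (1024, 1024))
print(grid_pe.shape, point_pe.shape)         # (128, 64, 64) and (1, 1, 128)
```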
ultralytics.models.sam.modules.encoders.Block
Block(dim: int, num_heads: int, mlp_ratio: float = 4.0, qkv_bias: bool = True, norm_layer: Type[nn.Module] = nn.LayerNorm, act_layer: Type[nn.Module] = nn.GELU, use_rel_pos: bool = False, rel_pos_zero_init: bool = True, window_size: int = 0, input_size: Optional[Tuple[int, int]] = None)
Bases: Module
Transformer block with support for window attention and residual propagation.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
dim | int | Number of input channels. | required |
num_heads | int | Number of attention heads in each ViT block. | required |
mlp_ratio | float | Ratio of mlp hidden dim to embedding dim. | 4.0 |
qkv_bias | bool | If True, add a learnable bias to query, key, value. | True |
norm_layer | Type[nn.Module] | Normalization layer. | nn.LayerNorm |
act_layer | Type[nn.Module] | Activation layer. | nn.GELU |
use_rel_pos | bool | If True, add relative positional embeddings to the attention map. | False |
rel_pos_zero_init | bool | If True, zero initialize relative positional parameters. | True |
window_size | int | Window size for window attention blocks. If 0, use global attention. | 0 |
input_size | Optional[Tuple[int, int]] | Input resolution for calculating the relative positional parameter size. | None |
Source code in ultralytics/models/sam/modules/encoders.py
forward
Executes a forward pass through the transformer block with window attention and non-overlapping windows.
Source code in ultralytics/models/sam/modules/encoders.py
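A sketch of running one block over a patch-token grid; window_size=14 and input_size=(64, 64) mirror SAM's ViT-B configuration (an assumption here):

```python
import torch
from ultralytics.models.sam.modules.encoders import Block

block = Block(dim=768, num_heads=12, window_size=14, input_size=(64, 64))
x = torch.randn(1, 64, 64, 768)  # (B, H, W, C) token grid from PatchEmbed
y = block(x)
print(y.shape)  # (1, 64, 64, 768): the residual block preserves the shape
```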
ultralytics.models.sam.modules.encoders.Attention
Attention(dim: int, num_heads: int = 8, qkv_bias: bool = True, use_rel_pos: bool = False, rel_pos_zero_init: bool = True, input_size: Optional[Tuple[int, int]] = None)
Bases: Module
Multi-head Attention block with relative position embeddings.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
dim | int | Number of input channels. | required |
num_heads | int | Number of attention heads. | 8 |
qkv_bias | bool | If True, add a learnable bias to query, key, value. | True |
use_rel_pos | bool | If True, add relative positional embeddings to the attention map. | False |
rel_pos_zero_init | bool | If True, zero initialize relative positional parameters. | True |
input_size | Optional[Tuple[int, int]] | Input resolution for calculating the relative positional parameter size. | None |
Source code in ultralytics/models/sam/modules/encoders.py
forward
Applies multi-head attention to the input, adding decomposed relative positional embeddings when enabled, followed by the output projection.
Source code in ultralytics/models/sam/modules/encoders.py
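A sketch over a single 14x14 window of tokens (the module expects a (B, H, W, C) layout); with use_rel_pos=True, input_size must be supplied so the relative positional tables can be sized:

```python
import torch
from ultralytics.models.sam.modules.encoders import Attention

attn = Attention(dim=768, num_heads=8, use_rel_pos=True, input_size=(14, 14))
x = torch.randn(1, 14, 14, 768)  # (B, H, W, C) tokens for one window
y = attn(x)
print(y.shape)  # (1, 14, 14, 768)
```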
ultralytics.models.sam.modules.encoders.PatchEmbed
PatchEmbed(kernel_size: Tuple[int, int] = (16, 16), stride: Tuple[int, int] = (16, 16), padding: Tuple[int, int] = (0, 0), in_chans: int = 3, embed_dim: int = 768)
Bases: Module
Image to Patch Embedding.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
kernel_size | Tuple[int, int] | Kernel size of the projection layer. | (16, 16) |
stride | Tuple[int, int] | Stride of the projection layer. | (16, 16) |
padding | Tuple[int, int] | Padding size of the projection layer. | (0, 0) |
in_chans | int | Number of input image channels. | 3 |
embed_dim | int | Patch embedding dimension. | 768 |
Source code in ultralytics/models/sam/modules/encoders.py
forward
Computes patch embedding by applying convolution and transposing the resulting tensor.
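A sketch with the default non-overlapping 16x16 patches; the conv projection followed by the transpose yields a channels-last token grid:

```python
import torch
from ultralytics.models.sam.modules.encoders import PatchEmbed

patch_embed = PatchEmbed(kernel_size=(16, 16), stride=(16, 16), in_chans=3, embed_dim=768)
image = torch.randn(1, 3, 1024, 1024)  # (B, C, H, W)
tokens = patch_embed(image)
print(tokens.shape)  # (1, 64, 64, 768): (B, H/16, W/16, embed_dim)
```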
ultralytics.models.sam.modules.encoders.window_partition
Partition into non-overlapping windows with padding if needed.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
x | Tensor | Input tokens with shape [B, H, W, C]. | required |
window_size | int | Window size. | required |

Returns:

Name | Type | Description |
---|---|---|
windows | Tensor | Windows after partition with shape [B * num_windows, window_size, window_size, C]. |
(Hp, Wp) | Tuple[int, int] | Padded height and width before partition. |
Source code in ultralytics/models/sam/modules/encoders.py
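A sketch: a 64x64 grid with window_size=14 is padded to 70x70, giving 5x5 = 25 windows per image.

```python
import torch
from ultralytics.models.sam.modules.encoders import window_partition

x = torch.randn(1, 64, 64, 768)  # (B, H, W, C) token grid
windows, (Hp, Wp) = window_partition(x, 14)
print(windows.shape)  # (25, 14, 14, 768): B * num_windows leading dimension
print(Hp, Wp)         # 70 70: padded height and width
```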
ultralytics.models.sam.modules.encoders.window_unpartition
window_unpartition(windows: torch.Tensor, window_size: int, pad_hw: Tuple[int, int], hw: Tuple[int, int]) -> torch.Tensor
Unpartition windows back into the original sequences, removing padding.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
windows | Tensor | Input tokens with shape [B * num_windows, window_size, window_size, C]. | required |
window_size | int | Window size. | required |
pad_hw | Tuple[int, int] | Padded height and width (Hp, Wp). | required |
hw | Tuple[int, int] | Original height and width (H, W) before padding. | required |
Returns:

Name | Type | Description |
---|---|---|
x | Tensor | Unpartitioned sequences with shape [B, H, W, C]. |
Source code in ultralytics/models/sam/modules/encoders.py
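Since partitioning only pads and reshapes, unpartitioning with the recorded pad_hw and the original hw recovers the input exactly; a round-trip sketch:

```python
import torch
from ultralytics.models.sam.modules.encoders import window_partition, window_unpartition

x = torch.randn(1, 64, 64, 768)
windows, pad_hw = window_partition(x, 14)              # pads 64 -> 70 internally
y = window_unpartition(windows, 14, pad_hw, (64, 64))
print(torch.equal(x, y))  # True: the round trip is lossless
```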
ultralytics.models.sam.modules.encoders.get_rel_pos
Get relative positional embeddings selected according to the relative positions between query and key, given their sizes.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
q_size | int | Size of query q. | required |
k_size | int | Size of key k. | required |
rel_pos | Tensor | Relative position embeddings (L, C). | required |
Returns:

Type | Description |
---|---|
Tensor | Extracted positional embeddings according to relative positions. |
Source code in ultralytics/models/sam/modules/encoders.py
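A sketch for equal query and key sizes, where the embedding table must cover 2 * 14 - 1 = 27 relative offsets; the result holds one embedding per (query, key) offset pair:

```python
import torch
from ultralytics.models.sam.modules.encoders import get_rel_pos

rel_pos = torch.randn(27, 64)   # (L, C) learnable relative embeddings
rel = get_rel_pos(14, 14, rel_pos)
print(rel.shape)                # (14, 14, 64)
```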
ultralytics.models.sam.modules.encoders.add_decomposed_rel_pos
add_decomposed_rel_pos(attn: torch.Tensor, q: torch.Tensor, rel_pos_h: torch.Tensor, rel_pos_w: torch.Tensor, q_size: Tuple[int, int], k_size: Tuple[int, int]) -> torch.Tensor
Calculate decomposed relative positional embeddings from the MViTv2 paper, following https://github.com/facebookresearch/mvit/blob/main/mvit/models/attention.py.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
attn | Tensor | Attention map. | required |
q | Tensor | Query q in the attention layer with shape (B, q_h * q_w, C). | required |
rel_pos_h | Tensor | Relative position embeddings (Lh, C) for height axis. | required |
rel_pos_w | Tensor | Relative position embeddings (Lw, C) for width axis. | required |
q_size | Tuple[int, int] | Spatial sequence size of query q with (q_h, q_w). | required |
k_size | Tuple[int, int] | Spatial sequence size of key k with (k_h, k_w). | required |
Returns:

Name | Type | Description |
---|---|---|
attn | Tensor | Attention map with added relative positional embeddings. |
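A shape-level sketch (the leading dimension typically folds batch and heads together, an assumption here); the per-axis tables need 2 * size - 1 entries each, and the attention map keeps its shape:

```python
import torch
from ultralytics.models.sam.modules.encoders import add_decomposed_rel_pos

B, C, h, w = 2, 64, 14, 14
attn = torch.randn(B, h * w, h * w)    # raw attention logits
q = torch.randn(B, h * w, C)           # queries flattened over the spatial grid
rel_pos_h = torch.randn(2 * h - 1, C)  # height-axis relative embeddings
rel_pos_w = torch.randn(2 * w - 1, C)  # width-axis relative embeddings
attn = add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, (h, w), (h, w))
print(attn.shape)                      # (2, 196, 196)
```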