Salta para o conteúdo

Referência para ultralytics/models/sam/modules/encoders.py

Nota

Este ficheiro está disponível em https://github.com/ultralytics/ ultralytics/blob/main/ ultralytics/models/ sam/modules/encoders .py. Se detectares um problema, por favor ajuda a corrigi-lo contribuindo com um Pull Request 🛠️. Obrigado 🙏!



ultralytics.models.sam.modules.encoders.ImageEncoderViT

Bases: Module

Um codificador de imagem que utiliza a arquitetura Vision Transformer (ViT) para codificar uma imagem num espaço latente compacto. O codificador O codificador pega numa imagem, divide-a em fragmentos e processa esses fragmentos através de uma série de blocos transformadores. Os patches codificados são então processados através de um pescoço para gerar a representação codificada final.

Esta classe e as suas funções de apoio abaixo indicadas foram ligeiramente adaptadas da estrutura do ViTDet disponível em https://github.com/facebookresearch/detectron2/blob/main/detectron2/modeling/backbone/vit.py.

Atributos:

Nome Tipo Descrição
img_size int

Dimensão das imagens de entrada, assumidas como quadradas.

patch_embed PatchEmbed

Módulo para a incorporação de correcções.

pos_embed Parameter

Incorporação posicional absoluta para patches.

blocks ModuleList

Lista de blocos de transformadores para o processamento de patch embeddings.

neck Sequential

Módulo de pescoço para continuar a processar a saída.

Código fonte em ultralytics/models/sam/modules/encoders.py
class ImageEncoderViT(nn.Module):
    """
    An image encoder using Vision Transformer (ViT) architecture for encoding an image into a compact latent space. The
    encoder takes an image, splits it into patches, and processes these patches through a series of transformer blocks.
    The encoded patches are then processed through a neck to generate the final encoded representation.

    This class and its supporting functions below lightly adapted from the ViTDet backbone available at
    https://github.com/facebookresearch/detectron2/blob/main/detectron2/modeling/backbone/vit.py.

    Attributes:
        img_size (int): Dimension of input images, assumed to be square.
        patch_embed (PatchEmbed): Module for patch embedding.
        pos_embed (nn.Parameter, optional): Absolute positional embedding for patches.
        blocks (nn.ModuleList): List of transformer blocks for processing patch embeddings.
        neck (nn.Sequential): Neck module to further process the output.
    """

    def __init__(
        self,
        img_size: int = 1024,
        patch_size: int = 16,
        in_chans: int = 3,
        embed_dim: int = 768,
        depth: int = 12,
        num_heads: int = 12,
        mlp_ratio: float = 4.0,
        out_chans: int = 256,
        qkv_bias: bool = True,
        norm_layer: Type[nn.Module] = nn.LayerNorm,
        act_layer: Type[nn.Module] = nn.GELU,
        use_abs_pos: bool = True,
        use_rel_pos: bool = False,
        rel_pos_zero_init: bool = True,
        window_size: int = 0,
        global_attn_indexes: Tuple[int, ...] = (),
    ) -> None:
        """
        Args:
            img_size (int): Input image size.
            patch_size (int): Patch size.
            in_chans (int): Number of input image channels.
            embed_dim (int): Patch embedding dimension.
            depth (int): Depth of ViT.
            num_heads (int): Number of attention heads in each ViT block.
            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
            qkv_bias (bool): If True, add a learnable bias to query, key, value.
            norm_layer (nn.Module): Normalization layer.
            act_layer (nn.Module): Activation layer.
            use_abs_pos (bool): If True, use absolute positional embeddings.
            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
            window_size (int): Window size for window attention blocks.
            global_attn_indexes (list): Indexes for blocks using global attention.
        """
        super().__init__()
        self.img_size = img_size

        self.patch_embed = PatchEmbed(
            kernel_size=(patch_size, patch_size),
            stride=(patch_size, patch_size),
            in_chans=in_chans,
            embed_dim=embed_dim,
        )

        self.pos_embed: Optional[nn.Parameter] = None
        if use_abs_pos:
            # Initialize absolute positional embedding with pretrain image size.
            self.pos_embed = nn.Parameter(torch.zeros(1, img_size // patch_size, img_size // patch_size, embed_dim))

        self.blocks = nn.ModuleList()
        for i in range(depth):
            block = Block(
                dim=embed_dim,
                num_heads=num_heads,
                mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias,
                norm_layer=norm_layer,
                act_layer=act_layer,
                use_rel_pos=use_rel_pos,
                rel_pos_zero_init=rel_pos_zero_init,
                window_size=window_size if i not in global_attn_indexes else 0,
                input_size=(img_size // patch_size, img_size // patch_size),
            )
            self.blocks.append(block)

        self.neck = nn.Sequential(
            nn.Conv2d(
                embed_dim,
                out_chans,
                kernel_size=1,
                bias=False,
            ),
            LayerNorm2d(out_chans),
            nn.Conv2d(
                out_chans,
                out_chans,
                kernel_size=3,
                padding=1,
                bias=False,
            ),
            LayerNorm2d(out_chans),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Processes input through patch embedding, applies positional embedding if present, and passes through blocks
        and neck.
        """
        x = self.patch_embed(x)
        if self.pos_embed is not None:
            x = x + self.pos_embed
        for blk in self.blocks:
            x = blk(x)
        return self.neck(x.permute(0, 3, 1, 2))

__init__(img_size=1024, patch_size=16, in_chans=3, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, out_chans=256, qkv_bias=True, norm_layer=nn.LayerNorm, act_layer=nn.GELU, use_abs_pos=True, use_rel_pos=False, rel_pos_zero_init=True, window_size=0, global_attn_indexes=())

Parâmetros:

Nome Tipo Descrição Predefinição
img_size int

Introduz o tamanho da imagem.

1024
patch_size int

Tamanho do remendo.

16
in_chans int

Número de canais de imagem de entrada.

3
embed_dim int

Dimensão de incorporação do remendo.

768
depth int

Profundidade de ViT.

12
num_heads int

Número de cabeças de atenção em cada bloco ViT.

12
mlp_ratio float

Rácio entre a dimensão oculta do mlp e a dimensão da incorporação.

4.0
qkv_bias bool

Se for Verdadeiro, adiciona uma tendência aprendível à consulta, chave, valor.

True
norm_layer Module

Camada de normalização.

LayerNorm
act_layer Module

Camada de ativação.

GELU
use_abs_pos bool

Se for Verdadeiro, utiliza incorporações posicionais absolutas.

True
use_rel_pos bool

Se for Verdadeiro, adiciona as incorporações de posição relativa ao mapa de atenção.

False
rel_pos_zero_init bool

Se for True, inicializa a zero os parâmetros posicionais relativos.

True
window_size int

Tamanho da janela para os blocos de atenção da janela.

0
global_attn_indexes list

Indica os blocos que utilizam a atenção global.

()
Código fonte em ultralytics/models/sam/modules/encoders.py
def __init__(
    self,
    img_size: int = 1024,
    patch_size: int = 16,
    in_chans: int = 3,
    embed_dim: int = 768,
    depth: int = 12,
    num_heads: int = 12,
    mlp_ratio: float = 4.0,
    out_chans: int = 256,
    qkv_bias: bool = True,
    norm_layer: Type[nn.Module] = nn.LayerNorm,
    act_layer: Type[nn.Module] = nn.GELU,
    use_abs_pos: bool = True,
    use_rel_pos: bool = False,
    rel_pos_zero_init: bool = True,
    window_size: int = 0,
    global_attn_indexes: Tuple[int, ...] = (),
) -> None:
    """
    Args:
        img_size (int): Input image size.
        patch_size (int): Patch size.
        in_chans (int): Number of input image channels.
        embed_dim (int): Patch embedding dimension.
        depth (int): Depth of ViT.
        num_heads (int): Number of attention heads in each ViT block.
        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
        qkv_bias (bool): If True, add a learnable bias to query, key, value.
        norm_layer (nn.Module): Normalization layer.
        act_layer (nn.Module): Activation layer.
        use_abs_pos (bool): If True, use absolute positional embeddings.
        use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
        rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
        window_size (int): Window size for window attention blocks.
        global_attn_indexes (list): Indexes for blocks using global attention.
    """
    super().__init__()
    self.img_size = img_size

    self.patch_embed = PatchEmbed(
        kernel_size=(patch_size, patch_size),
        stride=(patch_size, patch_size),
        in_chans=in_chans,
        embed_dim=embed_dim,
    )

    self.pos_embed: Optional[nn.Parameter] = None
    if use_abs_pos:
        # Initialize absolute positional embedding with pretrain image size.
        self.pos_embed = nn.Parameter(torch.zeros(1, img_size // patch_size, img_size // patch_size, embed_dim))

    self.blocks = nn.ModuleList()
    for i in range(depth):
        block = Block(
            dim=embed_dim,
            num_heads=num_heads,
            mlp_ratio=mlp_ratio,
            qkv_bias=qkv_bias,
            norm_layer=norm_layer,
            act_layer=act_layer,
            use_rel_pos=use_rel_pos,
            rel_pos_zero_init=rel_pos_zero_init,
            window_size=window_size if i not in global_attn_indexes else 0,
            input_size=(img_size // patch_size, img_size // patch_size),
        )
        self.blocks.append(block)

    self.neck = nn.Sequential(
        nn.Conv2d(
            embed_dim,
            out_chans,
            kernel_size=1,
            bias=False,
        ),
        LayerNorm2d(out_chans),
        nn.Conv2d(
            out_chans,
            out_chans,
            kernel_size=3,
            padding=1,
            bias=False,
        ),
        LayerNorm2d(out_chans),
    )

forward(x)

Processa a entrada através da incorporação de patches, aplica a incorporação posicional, se presente, e passa por blocos e pescoço.

Código fonte em ultralytics/models/sam/modules/encoders.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Processes input through patch embedding, applies positional embedding if present, and passes through blocks
    and neck.
    """
    x = self.patch_embed(x)
    if self.pos_embed is not None:
        x = x + self.pos_embed
    for blk in self.blocks:
        x = blk(x)
    return self.neck(x.permute(0, 3, 1, 2))



ultralytics.models.sam.modules.encoders.PromptEncoder

Bases: Module

Codifica diferentes tipos de prompts, incluindo pontos, caixas e máscaras, para entrada no descodificador de máscaras do SAM. O codificador produz tanto embeddings esparsos como densos para os prompts de entrada.

Atributos:

Nome Tipo Descrição
embed_dim int

Dimensão das incorporações.

input_image_size Tuple[int, int]

Dimensão da imagem de entrada como (H, W).

image_embedding_size Tuple[int, int]

Dimensão espacial da incorporação da imagem como (H, W).

pe_layer PositionEmbeddingRandom

Módulo para a incorporação de posições aleatórias.

num_point_embeddings int

Número de pontos incorporados para diferentes tipos de pontos.

point_embeddings ModuleList

Lista de incorporações de pontos.

not_a_point_embed Embedding

Incorporação de pontos que não fazem parte de nenhuma etiqueta.

mask_input_size Tuple[int, int]

Tamanho da máscara de entrada.

mask_downscaling Sequential

Rede neural para reduzir a escala da máscara.

no_mask_embed Embedding

Incorporação para os casos em que não é fornecida qualquer máscara.

Código fonte em ultralytics/models/sam/modules/encoders.py
class PromptEncoder(nn.Module):
    """
    Encodes different types of prompts, including points, boxes, and masks, for input to SAM's mask decoder. The encoder
    produces both sparse and dense embeddings for the input prompts.

    Attributes:
        embed_dim (int): Dimension of the embeddings.
        input_image_size (Tuple[int, int]): Size of the input image as (H, W).
        image_embedding_size (Tuple[int, int]): Spatial size of the image embedding as (H, W).
        pe_layer (PositionEmbeddingRandom): Module for random position embedding.
        num_point_embeddings (int): Number of point embeddings for different types of points.
        point_embeddings (nn.ModuleList): List of point embeddings.
        not_a_point_embed (nn.Embedding): Embedding for points that are not a part of any label.
        mask_input_size (Tuple[int, int]): Size of the input mask.
        mask_downscaling (nn.Sequential): Neural network for downscaling the mask.
        no_mask_embed (nn.Embedding): Embedding for cases where no mask is provided.
    """

    def __init__(
        self,
        embed_dim: int,
        image_embedding_size: Tuple[int, int],
        input_image_size: Tuple[int, int],
        mask_in_chans: int,
        activation: Type[nn.Module] = nn.GELU,
    ) -> None:
        """
        Encodes prompts for input to SAM's mask decoder.

        Args:
          embed_dim (int): The prompts' embedding dimension
          image_embedding_size (tuple(int, int)): The spatial size of the
            image embedding, as (H, W).
          input_image_size (int): The padded size of the image as input
            to the image encoder, as (H, W).
          mask_in_chans (int): The number of hidden channels used for
            encoding input masks.
          activation (nn.Module): The activation to use when encoding
            input masks.
        """
        super().__init__()
        self.embed_dim = embed_dim
        self.input_image_size = input_image_size
        self.image_embedding_size = image_embedding_size
        self.pe_layer = PositionEmbeddingRandom(embed_dim // 2)

        self.num_point_embeddings: int = 4  # pos/neg point + 2 box corners
        point_embeddings = [nn.Embedding(1, embed_dim) for _ in range(self.num_point_embeddings)]
        self.point_embeddings = nn.ModuleList(point_embeddings)
        self.not_a_point_embed = nn.Embedding(1, embed_dim)

        self.mask_input_size = (4 * image_embedding_size[0], 4 * image_embedding_size[1])
        self.mask_downscaling = nn.Sequential(
            nn.Conv2d(1, mask_in_chans // 4, kernel_size=2, stride=2),
            LayerNorm2d(mask_in_chans // 4),
            activation(),
            nn.Conv2d(mask_in_chans // 4, mask_in_chans, kernel_size=2, stride=2),
            LayerNorm2d(mask_in_chans),
            activation(),
            nn.Conv2d(mask_in_chans, embed_dim, kernel_size=1),
        )
        self.no_mask_embed = nn.Embedding(1, embed_dim)

    def get_dense_pe(self) -> torch.Tensor:
        """
        Returns the positional encoding used to encode point prompts, applied to a dense set of points the shape of the
        image encoding.

        Returns:
          torch.Tensor: Positional encoding with shape 1x(embed_dim)x(embedding_h)x(embedding_w)
        """
        return self.pe_layer(self.image_embedding_size).unsqueeze(0)

    def _embed_points(self, points: torch.Tensor, labels: torch.Tensor, pad: bool) -> torch.Tensor:
        """Embeds point prompts."""
        points = points + 0.5  # Shift to center of pixel
        if pad:
            padding_point = torch.zeros((points.shape[0], 1, 2), device=points.device)
            padding_label = -torch.ones((labels.shape[0], 1), device=labels.device)
            points = torch.cat([points, padding_point], dim=1)
            labels = torch.cat([labels, padding_label], dim=1)
        point_embedding = self.pe_layer.forward_with_coords(points, self.input_image_size)
        point_embedding[labels == -1] = 0.0
        point_embedding[labels == -1] += self.not_a_point_embed.weight
        point_embedding[labels == 0] += self.point_embeddings[0].weight
        point_embedding[labels == 1] += self.point_embeddings[1].weight
        return point_embedding

    def _embed_boxes(self, boxes: torch.Tensor) -> torch.Tensor:
        """Embeds box prompts."""
        boxes = boxes + 0.5  # Shift to center of pixel
        coords = boxes.reshape(-1, 2, 2)
        corner_embedding = self.pe_layer.forward_with_coords(coords, self.input_image_size)
        corner_embedding[:, 0, :] += self.point_embeddings[2].weight
        corner_embedding[:, 1, :] += self.point_embeddings[3].weight
        return corner_embedding

    def _embed_masks(self, masks: torch.Tensor) -> torch.Tensor:
        """Embeds mask inputs."""
        return self.mask_downscaling(masks)

    def _get_batch_size(
        self,
        points: Optional[Tuple[torch.Tensor, torch.Tensor]],
        boxes: Optional[torch.Tensor],
        masks: Optional[torch.Tensor],
    ) -> int:
        """Gets the batch size of the output given the batch size of the input prompts."""
        if points is not None:
            return points[0].shape[0]
        elif boxes is not None:
            return boxes.shape[0]
        elif masks is not None:
            return masks.shape[0]
        else:
            return 1

    def _get_device(self) -> torch.device:
        """Returns the device of the first point embedding's weight tensor."""
        return self.point_embeddings[0].weight.device

    def forward(
        self,
        points: Optional[Tuple[torch.Tensor, torch.Tensor]],
        boxes: Optional[torch.Tensor],
        masks: Optional[torch.Tensor],
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Embeds different types of prompts, returning both sparse and dense embeddings.

        Args:
          points (tuple(torch.Tensor, torch.Tensor), None): point coordinates and labels to embed.
          boxes (torch.Tensor, None): boxes to embed
          masks (torch.Tensor, None): masks to embed

        Returns:
          torch.Tensor: sparse embeddings for the points and boxes, with shape BxNx(embed_dim), where N is determined
            by the number of input points and boxes.
          torch.Tensor: dense embeddings for the masks, in the shape Bx(embed_dim)x(embed_H)x(embed_W)
        """
        bs = self._get_batch_size(points, boxes, masks)
        sparse_embeddings = torch.empty((bs, 0, self.embed_dim), device=self._get_device())
        if points is not None:
            coords, labels = points
            point_embeddings = self._embed_points(coords, labels, pad=(boxes is None))
            sparse_embeddings = torch.cat([sparse_embeddings, point_embeddings], dim=1)
        if boxes is not None:
            box_embeddings = self._embed_boxes(boxes)
            sparse_embeddings = torch.cat([sparse_embeddings, box_embeddings], dim=1)

        if masks is not None:
            dense_embeddings = self._embed_masks(masks)
        else:
            dense_embeddings = self.no_mask_embed.weight.reshape(1, -1, 1, 1).expand(
                bs, -1, self.image_embedding_size[0], self.image_embedding_size[1]
            )

        return sparse_embeddings, dense_embeddings

__init__(embed_dim, image_embedding_size, input_image_size, mask_in_chans, activation=nn.GELU)

Codifica os pedidos de entrada para o descodificador de máscaras do SAM.

Parâmetros:

Nome Tipo Descrição Predefinição
embed_dim int

A dimensão de incorporação dos prompts

necessário
image_embedding_size tuple(int, int

O tamanho espacial da imagem da imagem incorporada, como (H, W).

necessário
input_image_size int

O tamanho da imagem acolchoada que dá entrada para o codificador de imagem, como (H, W).

necessário
mask_in_chans int

O número de canais ocultos utilizados para codifica as máscaras de entrada.

necessário
activation Module

A ativação a utilizar na codificação de máscaras de entrada.

GELU
Código fonte em ultralytics/models/sam/modules/encoders.py
def __init__(
    self,
    embed_dim: int,
    image_embedding_size: Tuple[int, int],
    input_image_size: Tuple[int, int],
    mask_in_chans: int,
    activation: Type[nn.Module] = nn.GELU,
) -> None:
    """
    Encodes prompts for input to SAM's mask decoder.

    Args:
      embed_dim (int): The prompts' embedding dimension
      image_embedding_size (tuple(int, int)): The spatial size of the
        image embedding, as (H, W).
      input_image_size (int): The padded size of the image as input
        to the image encoder, as (H, W).
      mask_in_chans (int): The number of hidden channels used for
        encoding input masks.
      activation (nn.Module): The activation to use when encoding
        input masks.
    """
    super().__init__()
    self.embed_dim = embed_dim
    self.input_image_size = input_image_size
    self.image_embedding_size = image_embedding_size
    self.pe_layer = PositionEmbeddingRandom(embed_dim // 2)

    self.num_point_embeddings: int = 4  # pos/neg point + 2 box corners
    point_embeddings = [nn.Embedding(1, embed_dim) for _ in range(self.num_point_embeddings)]
    self.point_embeddings = nn.ModuleList(point_embeddings)
    self.not_a_point_embed = nn.Embedding(1, embed_dim)

    self.mask_input_size = (4 * image_embedding_size[0], 4 * image_embedding_size[1])
    self.mask_downscaling = nn.Sequential(
        nn.Conv2d(1, mask_in_chans // 4, kernel_size=2, stride=2),
        LayerNorm2d(mask_in_chans // 4),
        activation(),
        nn.Conv2d(mask_in_chans // 4, mask_in_chans, kernel_size=2, stride=2),
        LayerNorm2d(mask_in_chans),
        activation(),
        nn.Conv2d(mask_in_chans, embed_dim, kernel_size=1),
    )
    self.no_mask_embed = nn.Embedding(1, embed_dim)

forward(points, boxes, masks)

Incorpora diferentes tipos de prompts, devolvendo tanto as incorporações esparsas como as densas.

Parâmetros:

Nome Tipo Descrição Predefinição
points (tuple(Tensor, Tensor), None)

coordena os pontos e as etiquetas a incorporar.

necessário
boxes (Tensor, None)

caixas para incorporar

necessário
masks (Tensor, None)

máscaras para incorporar

necessário

Devolve:

Tipo Descrição
Tensor

torch.Tensor: incorporações esparsas para os pontos e caixas, com a forma BxNx(embed_dim), em que N é determinado pelo número de pontos e caixas de entrada.

Tensor

torch.Tensor: incorporações densas para as máscaras, na forma Bx(embed_dim)x(embed_H)x(embed_W)

Código fonte em ultralytics/models/sam/modules/encoders.py
def forward(
    self,
    points: Optional[Tuple[torch.Tensor, torch.Tensor]],
    boxes: Optional[torch.Tensor],
    masks: Optional[torch.Tensor],
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Embeds different types of prompts, returning both sparse and dense embeddings.

    Args:
      points (tuple(torch.Tensor, torch.Tensor), None): point coordinates and labels to embed.
      boxes (torch.Tensor, None): boxes to embed
      masks (torch.Tensor, None): masks to embed

    Returns:
      torch.Tensor: sparse embeddings for the points and boxes, with shape BxNx(embed_dim), where N is determined
        by the number of input points and boxes.
      torch.Tensor: dense embeddings for the masks, in the shape Bx(embed_dim)x(embed_H)x(embed_W)
    """
    bs = self._get_batch_size(points, boxes, masks)
    sparse_embeddings = torch.empty((bs, 0, self.embed_dim), device=self._get_device())
    if points is not None:
        coords, labels = points
        point_embeddings = self._embed_points(coords, labels, pad=(boxes is None))
        sparse_embeddings = torch.cat([sparse_embeddings, point_embeddings], dim=1)
    if boxes is not None:
        box_embeddings = self._embed_boxes(boxes)
        sparse_embeddings = torch.cat([sparse_embeddings, box_embeddings], dim=1)

    if masks is not None:
        dense_embeddings = self._embed_masks(masks)
    else:
        dense_embeddings = self.no_mask_embed.weight.reshape(1, -1, 1, 1).expand(
            bs, -1, self.image_embedding_size[0], self.image_embedding_size[1]
        )

    return sparse_embeddings, dense_embeddings

get_dense_pe()

Devolve a codificação posicional utilizada para codificar prompts de pontos, aplicada a um conjunto denso de pontos com a forma da codificação de imagem.

Devolve:

Tipo Descrição
Tensor

torch.Tensor: Codificação posicional com a forma 1x(embed_dim)x(embedding_h)x(embedding_w)

Código fonte em ultralytics/models/sam/modules/encoders.py
def get_dense_pe(self) -> torch.Tensor:
    """
    Returns the positional encoding used to encode point prompts, applied to a dense set of points the shape of the
    image encoding.

    Returns:
      torch.Tensor: Positional encoding with shape 1x(embed_dim)x(embedding_h)x(embedding_w)
    """
    return self.pe_layer(self.image_embedding_size).unsqueeze(0)



ultralytics.models.sam.modules.encoders.PositionEmbeddingRandom

Bases: Module

Codificação posicional utilizando frequências espaciais aleatórias.

Código fonte em ultralytics/models/sam/modules/encoders.py
class PositionEmbeddingRandom(nn.Module):
    """Positional encoding using random spatial frequencies."""

    def __init__(self, num_pos_feats: int = 64, scale: Optional[float] = None) -> None:
        """Initializes a position embedding using random spatial frequencies."""
        super().__init__()
        if scale is None or scale <= 0.0:
            scale = 1.0
        self.register_buffer("positional_encoding_gaussian_matrix", scale * torch.randn((2, num_pos_feats)))

        # Set non-deterministic for forward() error 'cumsum_cuda_kernel does not have a deterministic implementation'
        torch.use_deterministic_algorithms(False)
        torch.backends.cudnn.deterministic = False

    def _pe_encoding(self, coords: torch.Tensor) -> torch.Tensor:
        """Positionally encode points that are normalized to [0,1]."""
        # Assuming coords are in [0, 1]^2 square and have d_1 x ... x d_n x 2 shape
        coords = 2 * coords - 1
        coords = coords @ self.positional_encoding_gaussian_matrix
        coords = 2 * np.pi * coords
        # Outputs d_1 x ... x d_n x C shape
        return torch.cat([torch.sin(coords), torch.cos(coords)], dim=-1)

    def forward(self, size: Tuple[int, int]) -> torch.Tensor:
        """Generate positional encoding for a grid of the specified size."""
        h, w = size
        device: Any = self.positional_encoding_gaussian_matrix.device
        grid = torch.ones((h, w), device=device, dtype=torch.float32)
        y_embed = grid.cumsum(dim=0) - 0.5
        x_embed = grid.cumsum(dim=1) - 0.5
        y_embed = y_embed / h
        x_embed = x_embed / w

        pe = self._pe_encoding(torch.stack([x_embed, y_embed], dim=-1))
        return pe.permute(2, 0, 1)  # C x H x W

    def forward_with_coords(self, coords_input: torch.Tensor, image_size: Tuple[int, int]) -> torch.Tensor:
        """Positionally encode points that are not normalized to [0,1]."""
        coords = coords_input.clone()
        coords[:, :, 0] = coords[:, :, 0] / image_size[1]
        coords[:, :, 1] = coords[:, :, 1] / image_size[0]
        return self._pe_encoding(coords.to(torch.float))  # B x N x C

__init__(num_pos_feats=64, scale=None)

Inicializa uma incorporação de posição utilizando frequências espaciais aleatórias.

Código fonte em ultralytics/models/sam/modules/encoders.py
def __init__(self, num_pos_feats: int = 64, scale: Optional[float] = None) -> None:
    """Initializes a position embedding using random spatial frequencies."""
    super().__init__()
    if scale is None or scale <= 0.0:
        scale = 1.0
    self.register_buffer("positional_encoding_gaussian_matrix", scale * torch.randn((2, num_pos_feats)))

    # Set non-deterministic for forward() error 'cumsum_cuda_kernel does not have a deterministic implementation'
    torch.use_deterministic_algorithms(False)
    torch.backends.cudnn.deterministic = False

forward(size)

Gera codificação posicional para uma grelha com o tamanho especificado.

Código fonte em ultralytics/models/sam/modules/encoders.py
def forward(self, size: Tuple[int, int]) -> torch.Tensor:
    """Generate positional encoding for a grid of the specified size."""
    h, w = size
    device: Any = self.positional_encoding_gaussian_matrix.device
    grid = torch.ones((h, w), device=device, dtype=torch.float32)
    y_embed = grid.cumsum(dim=0) - 0.5
    x_embed = grid.cumsum(dim=1) - 0.5
    y_embed = y_embed / h
    x_embed = x_embed / w

    pe = self._pe_encoding(torch.stack([x_embed, y_embed], dim=-1))
    return pe.permute(2, 0, 1)  # C x H x W

forward_with_coords(coords_input, image_size)

Codifica posicionalmente os pontos que não estão normalizados para [0,1].

Código fonte em ultralytics/models/sam/modules/encoders.py
def forward_with_coords(self, coords_input: torch.Tensor, image_size: Tuple[int, int]) -> torch.Tensor:
    """Positionally encode points that are not normalized to [0,1]."""
    coords = coords_input.clone()
    coords[:, :, 0] = coords[:, :, 0] / image_size[1]
    coords[:, :, 1] = coords[:, :, 1] / image_size[0]
    return self._pe_encoding(coords.to(torch.float))  # B x N x C



ultralytics.models.sam.modules.encoders.Block

Bases: Module

Blocos de transformadores com suporte de atenção à janela e blocos de propagação residual.

Código fonte em ultralytics/models/sam/modules/encoders.py
class Block(nn.Module):
    """Transformer blocks with support of window attention and residual propagation blocks."""

    def __init__(
        self,
        dim: int,
        num_heads: int,
        mlp_ratio: float = 4.0,
        qkv_bias: bool = True,
        norm_layer: Type[nn.Module] = nn.LayerNorm,
        act_layer: Type[nn.Module] = nn.GELU,
        use_rel_pos: bool = False,
        rel_pos_zero_init: bool = True,
        window_size: int = 0,
        input_size: Optional[Tuple[int, int]] = None,
    ) -> None:
        """
        Args:
            dim (int): Number of input channels.
            num_heads (int): Number of attention heads in each ViT block.
            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
            qkv_bias (bool): If True, add a learnable bias to query, key, value.
            norm_layer (nn.Module): Normalization layer.
            act_layer (nn.Module): Activation layer.
            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
            window_size (int): Window size for window attention blocks. If it equals 0, then
                use global attention.
            input_size (tuple(int, int), None): Input resolution for calculating the relative
                positional parameter size.
        """
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(
            dim,
            num_heads=num_heads,
            qkv_bias=qkv_bias,
            use_rel_pos=use_rel_pos,
            rel_pos_zero_init=rel_pos_zero_init,
            input_size=input_size if window_size == 0 else (window_size, window_size),
        )

        self.norm2 = norm_layer(dim)
        self.mlp = MLPBlock(embedding_dim=dim, mlp_dim=int(dim * mlp_ratio), act=act_layer)

        self.window_size = window_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Executes a forward pass through the transformer block with window attention and non-overlapping windows."""
        shortcut = x
        x = self.norm1(x)
        # Window partition
        if self.window_size > 0:
            H, W = x.shape[1], x.shape[2]
            x, pad_hw = window_partition(x, self.window_size)

        x = self.attn(x)
        # Reverse window partition
        if self.window_size > 0:
            x = window_unpartition(x, self.window_size, pad_hw, (H, W))

        x = shortcut + x
        return x + self.mlp(self.norm2(x))

__init__(dim, num_heads, mlp_ratio=4.0, qkv_bias=True, norm_layer=nn.LayerNorm, act_layer=nn.GELU, use_rel_pos=False, rel_pos_zero_init=True, window_size=0, input_size=None)

Parâmetros:

Nome Tipo Descrição Predefinição
dim int

Número de canais de entrada.

necessário
num_heads int

Número de cabeças de atenção em cada bloco ViT.

necessário
mlp_ratio float

Rácio entre a dimensão oculta do mlp e a dimensão da incorporação.

4.0
qkv_bias bool

Se for Verdadeiro, adiciona uma tendência aprendível à consulta, chave, valor.

True
norm_layer Module

Camada de normalização.

LayerNorm
act_layer Module

Camada de ativação.

GELU
use_rel_pos bool

Se for Verdadeiro, adiciona as incorporações de posição relativa ao mapa de atenção.

False
rel_pos_zero_init bool

Se for True, inicializa a zero os parâmetros posicionais relativos.

True
window_size int

Tamanho da janela para os blocos de atenção da janela. Se for igual a 0, então utiliza a atenção global.

0
input_size (tuple(int, int), None)

Resolução de entrada para calcular o tamanho relativo do parâmetro tamanho do parâmetro posicional relativo.

None
Código fonte em ultralytics/models/sam/modules/encoders.py
def __init__(
    self,
    dim: int,
    num_heads: int,
    mlp_ratio: float = 4.0,
    qkv_bias: bool = True,
    norm_layer: Type[nn.Module] = nn.LayerNorm,
    act_layer: Type[nn.Module] = nn.GELU,
    use_rel_pos: bool = False,
    rel_pos_zero_init: bool = True,
    window_size: int = 0,
    input_size: Optional[Tuple[int, int]] = None,
) -> None:
    """
    Args:
        dim (int): Number of input channels.
        num_heads (int): Number of attention heads in each ViT block.
        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
        qkv_bias (bool): If True, add a learnable bias to query, key, value.
        norm_layer (nn.Module): Normalization layer.
        act_layer (nn.Module): Activation layer.
        use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
        rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
        window_size (int): Window size for window attention blocks. If it equals 0, then
            use global attention.
        input_size (tuple(int, int), None): Input resolution for calculating the relative
            positional parameter size.
    """
    super().__init__()
    self.norm1 = norm_layer(dim)
    self.attn = Attention(
        dim,
        num_heads=num_heads,
        qkv_bias=qkv_bias,
        use_rel_pos=use_rel_pos,
        rel_pos_zero_init=rel_pos_zero_init,
        input_size=input_size if window_size == 0 else (window_size, window_size),
    )

    self.norm2 = norm_layer(dim)
    self.mlp = MLPBlock(embedding_dim=dim, mlp_dim=int(dim * mlp_ratio), act=act_layer)

    self.window_size = window_size

forward(x)

Executa uma passagem para a frente através do bloco de transformação com atenção à janela e janelas não sobrepostas.

Código fonte em ultralytics/models/sam/modules/encoders.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Executes a forward pass through the transformer block with window attention and non-overlapping windows."""
    shortcut = x
    x = self.norm1(x)
    # Window partition
    if self.window_size > 0:
        H, W = x.shape[1], x.shape[2]
        x, pad_hw = window_partition(x, self.window_size)

    x = self.attn(x)
    # Reverse window partition
    if self.window_size > 0:
        x = window_unpartition(x, self.window_size, pad_hw, (H, W))

    x = shortcut + x
    return x + self.mlp(self.norm2(x))



ultralytics.models.sam.modules.encoders.Attention

Bases: Module

Bloco de atenção multi-cabeça com incorporação de posição relativa.

Código fonte em ultralytics/models/sam/modules/encoders.py
class Attention(nn.Module):
    """Multi-head Attention block with relative position embeddings."""

    def __init__(
        self,
        dim: int,
        num_heads: int = 8,
        qkv_bias: bool = True,
        use_rel_pos: bool = False,
        rel_pos_zero_init: bool = True,
        input_size: Optional[Tuple[int, int]] = None,
    ) -> None:
        """
        Initialize Attention module.

        Args:
            dim (int): Number of input channels.
            num_heads (int): Number of attention heads.
            qkv_bias (bool):  If True, add a learnable bias to query, key, value.
            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
            input_size (tuple(int, int), None): Input resolution for calculating the relative
                positional parameter size.
        """
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim**-0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.proj = nn.Linear(dim, dim)

        self.use_rel_pos = use_rel_pos
        if self.use_rel_pos:
            assert input_size is not None, "Input size must be provided if using relative positional encoding."
            # Initialize relative positional embeddings
            self.rel_pos_h = nn.Parameter(torch.zeros(2 * input_size[0] - 1, head_dim))
            self.rel_pos_w = nn.Parameter(torch.zeros(2 * input_size[1] - 1, head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Applies the forward operation including attention, normalization, MLP, and indexing within window limits."""
        B, H, W, _ = x.shape
        # qkv with shape (3, B, nHead, H * W, C)
        qkv = self.qkv(x).reshape(B, H * W, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        # q, k, v with shape (B * nHead, H * W, C)
        q, k, v = qkv.reshape(3, B * self.num_heads, H * W, -1).unbind(0)

        attn = (q * self.scale) @ k.transpose(-2, -1)

        if self.use_rel_pos:
            attn = add_decomposed_rel_pos(attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W))

        attn = attn.softmax(dim=-1)
        x = (attn @ v).view(B, self.num_heads, H, W, -1).permute(0, 2, 3, 1, 4).reshape(B, H, W, -1)
        return self.proj(x)

__init__(dim, num_heads=8, qkv_bias=True, use_rel_pos=False, rel_pos_zero_init=True, input_size=None)

Inicializa o módulo de atenção.

Parâmetros:

Nome Tipo Descrição Predefinição
dim int

Número de canais de entrada.

necessário
num_heads int

Número de cabeças de atenção.

8
qkv_bias bool

Se for Verdadeiro, adiciona uma tendência aprendível à consulta, chave, valor.

True
rel_pos_zero_init bool

Se for True, inicializa a zero os parâmetros posicionais relativos.

True
input_size (tuple(int, int), None)

Resolução de entrada para calcular o tamanho relativo do parâmetro tamanho do parâmetro posicional relativo.

None
Código fonte em ultralytics/models/sam/modules/encoders.py
def __init__(
    self,
    dim: int,
    num_heads: int = 8,
    qkv_bias: bool = True,
    use_rel_pos: bool = False,
    rel_pos_zero_init: bool = True,
    input_size: Optional[Tuple[int, int]] = None,
) -> None:
    """
    Initialize Attention module.

    Args:
        dim (int): Number of input channels.
        num_heads (int): Number of attention heads.
        qkv_bias (bool):  If True, add a learnable bias to query, key, value.
        rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
        input_size (tuple(int, int), None): Input resolution for calculating the relative
            positional parameter size.
    """
    super().__init__()
    self.num_heads = num_heads
    head_dim = dim // num_heads
    self.scale = head_dim**-0.5

    self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
    self.proj = nn.Linear(dim, dim)

    self.use_rel_pos = use_rel_pos
    if self.use_rel_pos:
        assert input_size is not None, "Input size must be provided if using relative positional encoding."
        # Initialize relative positional embeddings
        self.rel_pos_h = nn.Parameter(torch.zeros(2 * input_size[0] - 1, head_dim))
        self.rel_pos_w = nn.Parameter(torch.zeros(2 * input_size[1] - 1, head_dim))

forward(x)

Aplica a operação avançada incluindo atenção, normalização, MLP e indexação dentro dos limites da janela.

Código fonte em ultralytics/models/sam/modules/encoders.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Applies the forward operation including attention, normalization, MLP, and indexing within window limits."""
    B, H, W, _ = x.shape
    # qkv with shape (3, B, nHead, H * W, C)
    qkv = self.qkv(x).reshape(B, H * W, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
    # q, k, v with shape (B * nHead, H * W, C)
    q, k, v = qkv.reshape(3, B * self.num_heads, H * W, -1).unbind(0)

    attn = (q * self.scale) @ k.transpose(-2, -1)

    if self.use_rel_pos:
        attn = add_decomposed_rel_pos(attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W))

    attn = attn.softmax(dim=-1)
    x = (attn @ v).view(B, self.num_heads, H, W, -1).permute(0, 2, 3, 1, 4).reshape(B, H, W, -1)
    return self.proj(x)



ultralytics.models.sam.modules.encoders.PatchEmbed

Bases: Module

Imagem para incorporação de patches.

Código fonte em ultralytics/models/sam/modules/encoders.py
class PatchEmbed(nn.Module):
    """Image to Patch Embedding."""

    def __init__(
        self,
        kernel_size: Tuple[int, int] = (16, 16),
        stride: Tuple[int, int] = (16, 16),
        padding: Tuple[int, int] = (0, 0),
        in_chans: int = 3,
        embed_dim: int = 768,
    ) -> None:
        """
        Initialize PatchEmbed module.

        Args:
            kernel_size (Tuple): kernel size of the projection layer.
            stride (Tuple): stride of the projection layer.
            padding (Tuple): padding size of the projection layer.
            in_chans (int): Number of input image channels.
            embed_dim (int): Patch embedding dimension.
        """
        super().__init__()

        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Computes patch embedding by applying convolution and transposing resulting tensor."""
        return self.proj(x).permute(0, 2, 3, 1)  # B C H W -> B H W C

__init__(kernel_size=(16, 16), stride=(16, 16), padding=(0, 0), in_chans=3, embed_dim=768)

Inicializa o módulo PatchEmbed.

Parâmetros:

Nome Tipo Descrição Predefinição
kernel_size Tuple

tamanho do kernel da camada de projeção.

(16, 16)
stride Tuple

passo da camada de projeção.

(16, 16)
padding Tuple

Tamanho do preenchimento da camada de projeção.

(0, 0)
in_chans int

Número de canais de imagem de entrada.

3
embed_dim int

Dimensão de incorporação do remendo.

768
Código fonte em ultralytics/models/sam/modules/encoders.py
def __init__(
    self,
    kernel_size: Tuple[int, int] = (16, 16),
    stride: Tuple[int, int] = (16, 16),
    padding: Tuple[int, int] = (0, 0),
    in_chans: int = 3,
    embed_dim: int = 768,
) -> None:
    """
    Initialize PatchEmbed module.

    Args:
        kernel_size (Tuple): kernel size of the projection layer.
        stride (Tuple): stride of the projection layer.
        padding (Tuple): padding size of the projection layer.
        in_chans (int): Number of input image channels.
        embed_dim (int): Patch embedding dimension.
    """
    super().__init__()

    self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding)

forward(x)

Calcula a incorporação de amostras aplicando a convolução e transpondo o resultado tensor.

Código fonte em ultralytics/models/sam/modules/encoders.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Computes patch embedding by applying convolution and transposing resulting tensor."""
    return self.proj(x).permute(0, 2, 3, 1)  # B C H W -> B H W C



ultralytics.models.sam.modules.encoders.window_partition(x, window_size)

Divide em janelas não sobrepostas com preenchimento, se necessário. Args: x (tensor): tokens de entrada com [B, H, W, C]. tamanho_da_janela (int): tamanho da janela.

Devolve:

Nome Tipo Descrição
windows Tensor

janelas após a partição com [B * num_janelas, tamanho_janela, tamanho_janela, C].

(Hp, Wp)

altura e largura almofadadas antes da divisória

Código fonte em ultralytics/models/sam/modules/encoders.py
def window_partition(x: torch.Tensor, window_size: int) -> Tuple[torch.Tensor, Tuple[int, int]]:
    """
    Partition into non-overlapping windows with padding if needed.
    Args:
        x (tensor): input tokens with [B, H, W, C].
        window_size (int): window size.

    Returns:
        windows: windows after partition with [B * num_windows, window_size, window_size, C].
        (Hp, Wp): padded height and width before partition
    """
    B, H, W, C = x.shape

    pad_h = (window_size - H % window_size) % window_size
    pad_w = (window_size - W % window_size) % window_size
    if pad_h > 0 or pad_w > 0:
        x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))
    Hp, Wp = H + pad_h, W + pad_w

    x = x.view(B, Hp // window_size, window_size, Wp // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
    return windows, (Hp, Wp)



ultralytics.models.sam.modules.encoders.window_unpartition(windows, window_size, pad_hw, hw)

Desmembra a janela em sequências originais e remove o preenchimento.

Parâmetros:

Nome Tipo Descrição Predefinição
windows tensor

introduz tokens com [B * num_windows, window_size, window_size, C].

necessário
window_size int

tamanho da janela.

necessário
pad_hw Tuple

altura e largura almofadadas (Hp, Wp).

necessário
hw Tuple

altura e largura originais (H, W) antes do preenchimento.

necessário

Devolve:

Nome Tipo Descrição
x Tensor

sequências não particionadas com [B, H, W, C].

Código fonte em ultralytics/models/sam/modules/encoders.py
def window_unpartition(
    windows: torch.Tensor, window_size: int, pad_hw: Tuple[int, int], hw: Tuple[int, int]
) -> torch.Tensor:
    """
    Window unpartition into original sequences and removing padding.

    Args:
        windows (tensor): input tokens with [B * num_windows, window_size, window_size, C].
        window_size (int): window size.
        pad_hw (Tuple): padded height and width (Hp, Wp).
        hw (Tuple): original height and width (H, W) before padding.

    Returns:
        x: unpartitioned sequences with [B, H, W, C].
    """
    Hp, Wp = pad_hw
    H, W = hw
    B = windows.shape[0] // (Hp * Wp // window_size // window_size)
    x = windows.view(B, Hp // window_size, Wp // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, Hp, Wp, -1)

    if Hp > H or Wp > W:
        x = x[:, :H, :W, :].contiguous()
    return x



ultralytics.models.sam.modules.encoders.get_rel_pos(q_size, k_size, rel_pos)

Obtém os embeddings posicionais relativos de acordo com as posições relativas da consulta e dos tamanhos das chaves.

Parâmetros:

Nome Tipo Descrição Predefinição
q_size int

tamanho da consulta q.

necessário
k_size int

tamanho da chave k.

necessário
rel_pos Tensor

incorpora as posições relativas (L, C).

necessário

Devolve:

Tipo Descrição
Tensor

Extrai as incorporações posicionais de acordo com as posições relativas.

Código fonte em ultralytics/models/sam/modules/encoders.py
def get_rel_pos(q_size: int, k_size: int, rel_pos: torch.Tensor) -> torch.Tensor:
    """
    Get relative positional embeddings according to the relative positions of query and key sizes.

    Args:
        q_size (int): size of query q.
        k_size (int): size of key k.
        rel_pos (Tensor): relative position embeddings (L, C).

    Returns:
        Extracted positional embeddings according to relative positions.
    """
    max_rel_dist = int(2 * max(q_size, k_size) - 1)
    # Interpolate rel pos if needed.
    if rel_pos.shape[0] != max_rel_dist:
        # Interpolate rel pos.
        rel_pos_resized = F.interpolate(
            rel_pos.reshape(1, rel_pos.shape[0], -1).permute(0, 2, 1),
            size=max_rel_dist,
            mode="linear",
        )
        rel_pos_resized = rel_pos_resized.reshape(-1, max_rel_dist).permute(1, 0)
    else:
        rel_pos_resized = rel_pos

    # Scale the coords with short length if shapes for q and k are different.
    q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0)
    k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0)
    relative_coords = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0)

    return rel_pos_resized[relative_coords.long()]



ultralytics.models.sam.modules.encoders.add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, q_size, k_size)

Calcula os Relative Positional Embeddings decompostos a partir do documento mvitv2 em https://github.com/facebookresearch/mvit/blob/main/mvit/models/attention.py.

Parâmetros:

Nome Tipo Descrição Predefinição
attn Tensor

mapa de atenção.

necessário
q Tensor

Consulta q na camada de atenção com a forma (B, q_h * q_w, C).

necessário
rel_pos_h Tensor

incorporações de posição relativa (Lh, C) para o eixo da altura.

necessário
rel_pos_w Tensor

incorporações de posição relativa (Lw, C) para o eixo da largura.

necessário
q_size Tuple

dimensão da sequência espacial da consulta q com (q_h, q_w).

necessário
k_size Tuple

tamanho da sequência espacial da chave k com (k_h, k_w).

necessário

Devolve:

Nome Tipo Descrição
attn Tensor

mapa de atenção com incorporação de posição relativa adicionada.

Código fonte em ultralytics/models/sam/modules/encoders.py
def add_decomposed_rel_pos(
    attn: torch.Tensor,
    q: torch.Tensor,
    rel_pos_h: torch.Tensor,
    rel_pos_w: torch.Tensor,
    q_size: Tuple[int, int],
    k_size: Tuple[int, int],
) -> torch.Tensor:
    """
    Calculate decomposed Relative Positional Embeddings from mvitv2 paper at
    https://github.com/facebookresearch/mvit/blob/main/mvit/models/attention.py.

    Args:
        attn (Tensor): attention map.
        q (Tensor): query q in the attention layer with shape (B, q_h * q_w, C).
        rel_pos_h (Tensor): relative position embeddings (Lh, C) for height axis.
        rel_pos_w (Tensor): relative position embeddings (Lw, C) for width axis.
        q_size (Tuple): spatial sequence size of query q with (q_h, q_w).
        k_size (Tuple): spatial sequence size of key k with (k_h, k_w).

    Returns:
        attn (Tensor): attention map with added relative positional embeddings.
    """
    q_h, q_w = q_size
    k_h, k_w = k_size
    Rh = get_rel_pos(q_h, k_h, rel_pos_h)
    Rw = get_rel_pos(q_w, k_w, rel_pos_w)

    B, _, dim = q.shape
    r_q = q.reshape(B, q_h, q_w, dim)
    rel_h = torch.einsum("bhwc,hkc->bhwk", r_q, Rh)
    rel_w = torch.einsum("bhwc,wkc->bhwk", r_q, Rw)

    attn = (attn.view(B, q_h, q_w, k_h, k_w) + rel_h[:, :, :, :, None] + rel_w[:, :, :, None, :]).view(
        B, q_h * q_w, k_h * k_w
    )

    return attn





Criado em 2023-11-12, Atualizado em 2024-06-02
Autores: glenn-jocher (5), Burhan-Q (1), Laughing-q (1)