Transformers documentation

Sapiens2

You are viewing v5.10.0 version. A newer version v5.10.1 is available.
This model was published in HF papers on 2026-04-23 and contributed to Hugging Face Transformers on 2026-06-03.

SDPA FlashAttention

Overview

The Sapiens2 model was proposed in Sapiens2 by Rawal Khirodkar, He Wen, Julieta Martinez, Yuan Dong, Zhaoen Su, Shunsuke Saito. Sapiens2 is a family of high-resolution vision transformers pretrained on ~1 billion curated human images, designed for human-centric computer vision tasks including pose estimation, body-part segmentation, surface normal estimation, and pointmap estimation.

You can find all the original Sapiens2 checkpoints under the Sapiens2 collection.

The abstract from the paper is the following:

We present Sapiens2, a family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. We pretrain on ~1 billion curated high-quality human images with improved task annotations and combine masked image reconstruction with self-distilled contrastive objectives to learn both low-level and semantic features. Our models scale from 0.4B to 5B parameters and train at native 1K resolution, with hierarchical 4K variants for extended spatial reasoning. Sapiens2 achieves substantial improvements over its predecessor: +4 mAP in pose estimation, +24.3 mIoU in body-part segmentation, and 45.6% error reduction in normal estimation, while extending to new tasks like pointmap and albedo estimation. Code is publicly available.

Tips:

  • Sapiens2 uses Rotary Position Embeddings (RoPE) and supports arbitrary input resolutions. The default image processor resizes images to 1024×768 (height×width).
  • The model uses Grouped Query Attention (GQA) in the middle layers and full multi-head attention in the first 8 and last 8 layers.
  • Register tokens (8 by default) reduce high-norm artifacts in patch tokens, yielding cleaner attention maps and better performance on dense prediction tasks.
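
As a rough illustration of the tips above, the sketch below computes the patch grid and token count for the default 1024×768 input with a patch size of 16. The token layout (one CLS token, then register tokens, then patch tokens) and the requirement that both sides divide evenly by the patch size are assumptions made for this example, not something this page confirms. The helper names are hypothetical.

```python
def patch_grid(height: int, width: int, patch_size: int = 16) -> tuple[int, int]:
    """Return the (rows, cols) patch grid for an image.

    Simplifying assumption: both sides are exact multiples of patch_size.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    return height // patch_size, width // patch_size


def sequence_length(
    height: int,
    width: int,
    patch_size: int = 16,
    num_register_tokens: int = 8,
    num_cls_tokens: int = 1,
) -> int:
    """Token count under the ASSUMED layout: CLS + register tokens + patch tokens."""
    rows, cols = patch_grid(height, width, patch_size)
    return num_cls_tokens + num_register_tokens + rows * cols


rows, cols = patch_grid(1024, 768)
print(rows, cols)                  # 64 48
print(sequence_length(1024, 768))  # 3081
```

The 64×48 grid matches the spatial size of the backbone feature maps shown in the Sapiens2Backbone example on this page.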

This model was contributed by guarin. The original code can be found here.

Usage examples

The example below shows how to obtain the CLS token (whole-image embedding) with Sapiens2Model.

import torch
from transformers import AutoImageProcessor, AutoModel
from transformers.image_utils import load_image

image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")

image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-pretrain-0.4b")
model = AutoModel.from_pretrained("facebook/sapiens2-pretrain-0.4b", device_map="auto")

inputs = image_processor(images=image, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model(**inputs)

# outputs.pooler_output is the CLS token (whole-image embedding)
cls_token = outputs.pooler_output
print("CLS token shape:", cls_token.shape)  # [1, 1024]

Sapiens2Config

class transformers.Sapiens2Config

( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None patch_size: int | list[int] | tuple[int, int] = 16 hidden_size: int = 1024 intermediate_size: int = 4096 num_hidden_layers: int = 24 num_attention_heads: int = 16 hidden_act: str = 'silu' attention_dropout: float | int = 0.0 initializer_range: float = 0.02 rope_theta: float = 100.0 image_size: int | list[int] | tuple[int, int] = 224 num_channels: int = 3 query_bias: bool = True key_bias: bool = True value_bias: bool = True proj_bias: bool = True mlp_bias: bool = True layerscale_value: float = 1.0 drop_path_rate: float | int = 0.0 use_gated_mlp: bool = True num_register_tokens: int = 8 pos_embed_shift: float | None = None pos_embed_jitter: float | None = None pos_embed_rescale: float | None = 2.0 _out_features: list[str] | None = None _out_indices: list[int] | None = None reshape_hidden_states: bool = True use_mask_token: bool = False rms_norm_eps: float = 1e-06 normalize_backbone_outputs: bool = True use_qk_norm: bool = True num_key_value_heads_per_layer: list[int] | None = None num_key_value_attention_heads: int = 8 num_first_full_attention_layers: int = 8 num_last_full_attention_layers: int = 8 semantic_loss_ignore_index: int = 255 flip_pairs: list[list[int]] | None = None head_config: transformers.models.sapiens2.configuration_sapiens2.Sapiens2HeadConfig | dict | None = None )

Parameters

  • patch_size (Union[int, list[int], tuple[int, int]], optional, defaults to 16) — The size (resolution) of each patch.
  • hidden_size (int, optional, defaults to 1024) — Dimension of the hidden representations.
  • intermediate_size (int, optional, defaults to 4096) — Dimension of the MLP representations.
  • num_hidden_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer encoder.
  • num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
  • hidden_act (str, optional, defaults to silu) — The non-linear activation function (function or string) in the encoder. For example, "gelu", "relu", "silu", etc.
  • attention_dropout (Union[float, int], optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • rope_theta (float, optional, defaults to 100.0) — The base period of the RoPE embeddings.
  • image_size (Union[int, list[int], tuple[int, int]], optional, defaults to 224) — The size (resolution) of each image.
  • num_channels (int, optional, defaults to 3) — The number of input channels.
  • query_bias (bool, optional, defaults to True) — Whether to add a bias to the query projection.
  • key_bias (bool, optional, defaults to True) — Whether to add a bias to the key projection.
  • value_bias (bool, optional, defaults to True) — Whether to add a bias to the value projection.
  • proj_bias (bool, optional, defaults to True) — Whether to add a bias to the output projection.
  • mlp_bias (bool, optional, defaults to True) — Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
  • layerscale_value (float, optional, defaults to 1.0) — Initial value to use for layer scale.
  • drop_path_rate (Union[float, int], optional, defaults to 0.0) — Drop path rate for the patch fusion.
  • use_gated_mlp (bool, optional, defaults to True) — Whether to use the SwiGLU feedforward neural network.
  • num_register_tokens (int, optional, defaults to 8) — The number of register tokens.
  • pos_embed_shift (float, optional) — Amount to randomly shift position embedding coordinates in [-shift, shift], applied only in training mode if not None.
  • pos_embed_jitter (float, optional) — Amount to randomly jitter position embedding coordinates in log-uniform value in [1/jitter, jitter], applied only in training mode if not None.
  • pos_embed_rescale (float, optional, defaults to 2.0) — Amount to randomly rescale position embedding coordinates in log-uniform value in [1/rescale, rescale], applied only in training mode if not None.
  • reshape_hidden_states (bool, optional, defaults to True) — Whether to reshape the hidden states to spatial dimensions when used as backbone.
  • use_mask_token (bool, optional, defaults to False) — Whether to use a mask token in the embeddings (needed for masked image modeling pretraining).
  • rms_norm_eps (float, optional, defaults to 1e-6) — Epsilon for the RMS normalization layers.
  • normalize_backbone_outputs (bool, optional, defaults to True) — Whether to apply RMSNorm to the backbone feature_maps and cls_tokens outputs before returning them from the forward pass. Only applies when the model is used as a backbone.
  • use_qk_norm (bool, optional, defaults to True) — Whether to apply RMSNorm to queries and keys before RoPE in attention layers.
  • num_key_value_heads_per_layer (list[int], optional) — Number of key/value heads for each transformer layer. Setting a layer’s value equal to num_attention_heads gives full multi-head attention; a smaller value gives grouped-query attention. Defaults to num_attention_heads for the first num_first_full_attention_layers and last num_last_full_attention_layers layers and num_key_value_attention_heads for all other layers.
  • num_key_value_attention_heads (int, optional, defaults to 8) — Number of key/value heads for layers that use grouped-query attention when num_key_value_heads_per_layer is not set. Ignored when num_key_value_heads_per_layer is set.
  • num_first_full_attention_layers (int, optional, defaults to 8) — Number of leading transformer layers that use full multi-head attention. Only used when num_key_value_heads_per_layer is None.
  • num_last_full_attention_layers (int, optional, defaults to 8) — Number of trailing transformer layers that use full multi-head attention. Only used when num_key_value_heads_per_layer is None.
  • semantic_loss_ignore_index (int, optional, defaults to 255) — Label index ignored when computing the segmentation loss.
  • flip_pairs (list[list[int]], optional) — Pairs of keypoint indices that are mirrored horizontally (e.g., left ear ↔ right ear). Each pair is a two-element list [left_index, right_index]. Used for test-time horizontal flip augmentation in pose estimation: pass these pairs to the second forward call so the model flips heatmaps back before returning them.
  • head_config (Sapiens2HeadConfig, optional) — Configuration for the decode head. See Sapiens2HeadConfig for the available options.
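
The pos_embed_jitter and pos_embed_rescale descriptions above both refer to a log-uniform value in [1/factor, factor]. A minimal sketch of such sampling is shown here, assuming the usual construction (draw uniformly in log space, then exponentiate). The function name is hypothetical and the library's actual sampling code may differ.

```python
import math
import random


def sample_log_uniform(factor: float, rng: random.Random) -> float:
    """Draw a value in [1/factor, factor] that is uniform in log space.

    Hypothetical helper for illustration; requires factor >= 1.
    """
    if factor < 1.0:
        raise ValueError("factor must be >= 1")
    log_factor = math.log(factor)
    # Uniform in [-log(factor), log(factor)], so shrinking and growing by the
    # same ratio are equally likely.
    return math.exp(rng.uniform(-log_factor, log_factor))


rng = random.Random(0)
samples = [sample_log_uniform(2.0, rng) for _ in range(1000)]
print(min(samples) >= 0.5, max(samples) <= 2.0)
```

Sampling in log space makes a rescale of 0.5 as likely as a rescale of 2.0, which a plain uniform draw over [0.5, 2.0] would not.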

This is the configuration class to store the configuration of a Sapiens2Model. It is used to instantiate a Sapiens2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the facebook/sapiens2-pretrain-0.4b checkpoint.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
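
The default rule described for num_key_value_heads_per_layer can be sketched in plain Python. This is an illustration of the documented behavior, not the library's code; the function name is hypothetical and the default argument values mirror the Sapiens2Config defaults listed above.

```python
def default_kv_heads_per_layer(
    num_hidden_layers: int = 24,
    num_attention_heads: int = 16,
    num_key_value_attention_heads: int = 8,
    num_first_full_attention_layers: int = 8,
    num_last_full_attention_layers: int = 8,
) -> list[int]:
    """Return the number of key/value heads for every layer.

    Leading and trailing layers get full multi-head attention
    (kv heads == attention heads); the layers in between use
    grouped-query attention with fewer kv heads.
    """
    kv_heads = []
    for layer_idx in range(num_hidden_layers):
        is_leading = layer_idx < num_first_full_attention_layers
        is_trailing = layer_idx >= num_hidden_layers - num_last_full_attention_layers
        if is_leading or is_trailing:
            kv_heads.append(num_attention_heads)
        else:
            kv_heads.append(num_key_value_attention_heads)
    return kv_heads


kv_heads = default_kv_heads_per_layer()
print(kv_heads[:8])    # [16, 16, 16, 16, 16, 16, 16, 16]
print(kv_heads[8:16])  # [8, 8, 8, 8, 8, 8, 8, 8]
print(kv_heads[16:])   # [16, 16, 16, 16, 16, 16, 16, 16]
```

With the defaults, each key/value head in the 8 middle layers is shared by 16 / 8 = 2 query heads.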

Sapiens2HeadConfig

class transformers.Sapiens2HeadConfig

( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None upsample_out_channels: list[int] | None = None upsample_kernel_sizes: list[int] | None = None upsample_kernel_size: int = 4 use_pixel_shuffle: bool | None = None conv_out_channels: list[int] | None = None conv_kernel_sizes: list[int] | None = None conv_kernel_size: int = 1 scale_conv_out_channels: list[int] | None = None scale_conv_kernel_sizes: list[int] | None = None scale_conv_kernel_size: int = 1 scale_final_input_size: int | None = None scale_final_hidden_sizes: list[int] | None = None )

Parameters

  • upsample_out_channels (list[int], optional) — Output channel counts for each upsample block. The first block takes hidden_size channels as input; subsequent blocks use the previous output.
  • upsample_kernel_sizes (list[int], optional) — Kernel size for each upsample block. Auto-filled with [4, ...] when upsample_out_channels is set but this is None. Must have the same length as upsample_out_channels.
  • upsample_kernel_size (int, defaults to 4) — Default kernel size for upsample blocks when upsample_kernel_sizes is not set.
  • use_pixel_shuffle (bool, optional) — Whether the upsample head uses pixel-shuffle upsampling instead of transposed convolutions. When None (default), the head uses transposed convolutions.
  • conv_out_channels (list[int], optional) — Output channel counts for the refinement conv layers that follow the upsample blocks.
  • conv_kernel_sizes (list[int], optional) — Kernel size for each refinement conv layer. Auto-filled with [1, ...] when conv_out_channels is set but this is None. Must have the same length as conv_out_channels.
  • conv_kernel_size (int, defaults to 1) — Default kernel size for conv layers when conv_kernel_sizes is not set.
  • scale_conv_out_channels (list[int], optional) — Output channel counts for the stride-2 conv layers used to predict the focal-length scale. When None (default), no scale branch is built.
  • scale_conv_kernel_sizes (list[int], optional) — Kernel size for each scale conv layer. Auto-filled with [1, ...] when scale_conv_out_channels is set but this is None. Must have the same length as scale_conv_out_channels.
  • scale_conv_kernel_size (int, defaults to 1) — Default kernel size for scale conv layers when scale_conv_kernel_sizes is not set.
  • scale_final_input_size (int, optional) — Flattened feature size passed into the scale MLP. When None (default), it is automatically inferred from image_size and patch_size in the parent Sapiens2Config.
  • scale_final_hidden_sizes (list[int], optional) — Hidden-layer sizes for the MLP that maps flattened scale features to the scalar scale output. When None (default), no scale branch is built.

This is the configuration class to store the configuration of a Sapiens2 decode head. It is used to instantiate the head according to the specified arguments, defining the head architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the facebook/sapiens2-seg-0.4b checkpoint.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
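
The auto-fill rule described for upsample_kernel_sizes, conv_kernel_sizes and scale_conv_kernel_sizes can be sketched as follows. This is a plain-Python illustration of the documented behavior under the assumption that a length mismatch raises an error; the function name and the channel counts in the usage lines are hypothetical.

```python
from typing import Optional


def resolve_kernel_sizes(
    out_channels: Optional[list[int]],
    kernel_sizes: Optional[list[int]],
    default_kernel_size: int,
) -> Optional[list[int]]:
    """Fill in per-block kernel sizes the way the parameter docs describe.

    - No blocks configured: nothing to fill, return kernel_sizes unchanged.
    - Blocks configured, sizes missing: repeat the default once per block.
    - Both given: lengths must match.
    """
    if out_channels is None:
        return kernel_sizes
    if kernel_sizes is None:
        return [default_kernel_size] * len(out_channels)
    if len(kernel_sizes) != len(out_channels):
        raise ValueError("kernel sizes and out channels must have the same length")
    return list(kernel_sizes)


# Hypothetical channel counts, chosen only for illustration.
print(resolve_kernel_sizes([768, 512], None, default_kernel_size=4))   # [4, 4]
print(resolve_kernel_sizes([512, 256], [3, 1], default_kernel_size=1))  # [3, 1]
```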

Sapiens2ImageProcessor

class transformers.Sapiens2ImageProcessor

( **kwargs: typing_extensions.Unpack[transformers.models.sapiens2.image_processing_sapiens2.Sapiens2ImageProcessorKwargs] )

Parameters

  • do_reduce_labels (bool, kwargs, optional, defaults to self.do_reduce_labels) — Whether or not to reduce all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by 255.
  • **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.

Constructs a Sapiens2ImageProcessor image processor.
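
To make the do_reduce_labels description concrete, the sketch below applies the same mapping to a small nested-list segmentation map: background 0 becomes 255 and every other label is shifted down by 1. It assumes that pixels already labeled 255 stay at 255, as in other Transformers image processors; the function name is hypothetical.

```python
def reduce_labels(segmentation_map: list[list[int]], ignore_index: int = 255) -> list[list[int]]:
    """Shift class labels down by 1 and send background (0) to ignore_index.

    Assumption: pixels that already equal ignore_index are left unchanged.
    """
    return [
        [ignore_index if value in (0, ignore_index) else value - 1 for value in row]
        for row in segmentation_map
    ]


labels = [
    [0, 1, 2],
    [255, 3, 0],
]
print(reduce_labels(labels))  # [[255, 0, 1], [255, 2, 255]]
```

After this step the first real class has index 0, and 255 matches the default semantic_loss_ignore_index of Sapiens2Config.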

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None boxes: list[list[list[float]]] | None = None **kwargs: typing_extensions.Unpack[transformers.models.sapiens2.image_processing_sapiens2.Sapiens2ImageProcessorKwargs] ) ~image_processing_base.BatchFeature

Parameters

  • images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
  • segmentation_maps (ImageInput, optional) — The segmentation maps to preprocess.
  • boxes (list[list[list[float]]] or np.ndarray, optional) — List or array of bounding boxes for each image. Each box should be a list of 4 floats representing the bounding box coordinates in COCO format (top_left_x, top_left_y, width, height). When provided, each person crop is affine-warped to the model input size instead of resizing the full image.
  • do_reduce_labels (bool, kwargs, optional, defaults to self.do_reduce_labels) — Whether or not to reduce all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by 255.
  • return_tensors (str or TensorType, optional) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
  • **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.

Returns

~image_processing_base.BatchFeature

  • data (dict) — Dictionary of lists/arrays/tensors returned by the call method (‘pixel_values’, etc.).
  • tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers in PyTorch/Numpy Tensors at initialization.

post_process_image_matting

( outputs: Sapiens2ImageMattingOutput target_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None backgrounds: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None ) list[dict]

Parameters

  • outputs (Sapiens2ImageMattingOutput) — Raw outputs of the model.
  • target_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Requested final (height, width) for each prediction. Resized with bilinear interpolation. If unset, predictions are returned at the model output resolution.
  • backgrounds (ImageInput, optional) — Background image(s) to composite over. Can be a single image (applied to every item in the batch) or a list of images, one per batch item. Accepts PIL images, numpy arrays, or torch tensors of any dtype; integer types (e.g. uint8) are scaled to [0, 1] automatically. When provided, each result dict gains a "composite" key with the composited image as a uint8 tensor in [0, 255].

Returns

list[dict] of length batch_size. Each dict has:

  • "alpha" (torch.Tensor of shape (1, height, width)): alpha values in [0, 1].
  • "foreground" (torch.Tensor of shape (3, height, width)): pre-multiplied RGB in [0, 1].
  • "composite" (torch.Tensor of shape (3, height, width) or None): foreground composited over backgrounds as a uint8 tensor in [0, 255]; None when backgrounds is not provided.

Converts the output of Sapiens2ForImageMatting into alpha mattes and foreground maps.
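
Because "foreground" is pre-multiplied by alpha, compositing over a background only needs the background to be weighted by (1 - alpha). The per-pixel sketch below shows that standard formula; the clamping and rounding used to reach uint8 are assumptions, and the function name is hypothetical.

```python
def composite_pixel(
    foreground: tuple[float, float, float],
    alpha: float,
    background: tuple[float, float, float],
) -> tuple[int, int, int]:
    """Composite one pre-multiplied RGB pixel over a background pixel.

    foreground: pre-multiplied RGB in [0, 1] (already scaled by alpha)
    alpha:      opacity in [0, 1]
    background: RGB in [0, 1]
    Returns an RGB tuple of ints in [0, 255].
    """
    result = []
    for fg, bg in zip(foreground, background):
        value = fg + (1.0 - alpha) * bg      # pre-multiplied "over" operator
        value = min(max(value, 0.0), 1.0)    # assumed clamp before quantizing
        result.append(int(round(value * 255)))
    return tuple(result)


# Fully transparent pixel: the background shows through unchanged.
print(composite_pixel((0.0, 0.0, 0.0), 0.0, (0.0, 1.0, 0.2)))  # (0, 255, 51)
# Fully opaque pixel: the background is ignored.
print(composite_pixel((1.0, 0.0, 0.2), 1.0, (0.0, 1.0, 0.0)))  # (255, 0, 51)
```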

post_process_normal_estimation

( outputs: Sapiens2NormalEstimatorOutput source_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None target_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None do_remove_padding: bool | None = None )

Parameters

  • outputs (Sapiens2NormalEstimatorOutput) — Raw outputs of the model.
  • source_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Original (height, width) of each image before preprocessing. When provided, the padding added during preprocessing is removed and predictions are resized back to the original image size (unless target_sizes overrides the final size).
  • target_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Requested final (height, width) for each prediction. When provided, used as the resize target instead of source_sizes. Resized with bilinear interpolation after L2 normalization.
  • do_remove_padding (bool, optional) — Whether to crop away the zero-padding added during preprocessing before resizing. Defaults to True when source_sizes is provided, False otherwise.

Converts the output of Sapiens2ForNormalEstimation into L2-normalized surface normal maps.
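
L2 normalization rescales each predicted normal to unit length, so only its direction is kept. A minimal single-vector sketch is given below; the epsilon guard against zero-length vectors is an assumption, and the function name is hypothetical.

```python
import math


def l2_normalize(vector: list[float], eps: float = 1e-12) -> list[float]:
    """Scale a vector to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(component * component for component in vector))
    return [component / max(norm, eps) for component in vector]


normal = l2_normalize([3.0, 0.0, 4.0])
print(normal)                               # [0.6, 0.0, 0.8]
print(math.sqrt(sum(c * c for c in normal)))  # 1.0
```

The method applies this per pixel to the 3-channel prediction, and the target_sizes description above states that bilinear resizing happens after normalization.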

post_process_pointmap_estimation

( outputs: Sapiens2PointmapEstimatorOutput source_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None target_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None do_remove_padding: bool | None = None )

Parameters

  • outputs (Sapiens2PointmapEstimatorOutput) — Raw outputs of the model.
  • source_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Original (height, width) of each image before preprocessing. When provided, the padding added during preprocessing is removed and predictions are resized back to the original image size (unless target_sizes overrides the final size).
  • target_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Requested final (height, width) for each prediction. Overrides source_sizes as the resize target.
  • do_remove_padding (bool, optional) — Whether to crop away the zero-padding added during preprocessing before resizing. Defaults to True when source_sizes is provided, False otherwise.

Converts the output of Sapiens2ForPointmapEstimation into pointmap tensors in image space.

post_process_pose_estimation

( outputs: Sapiens2PoseEstimatorOutput boxes: list outputs_flipped: transformers.models.sapiens2.modeling_sapiens2.Sapiens2PoseEstimatorOutput | None = None kernel_size: int = 11 threshold: float | None = None source_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None target_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None ) list[list[dict]]

Parameters

  • outputs (Sapiens2PoseEstimatorOutput) — Raw outputs of the model. outputs.heatmaps must have shape (N_total, num_keypoints, heatmap_height, heatmap_width) where N_total = sum(len(b) for b in boxes).
  • boxes (list[list[list[float]]] or np.ndarray) — List or array of bounding boxes for each image in absolute pixel coordinates. Each box should be a list of 4 floats representing the bounding box coordinates in COCO format (top_left_x, top_left_y, width, height). Must match the boxes argument passed to preprocess.
  • outputs_flipped (Sapiens2PoseEstimatorOutput, optional) — Outputs from running the model on horizontally flipped inputs. When provided, heatmaps are averaged with outputs before keypoint extraction to improve accuracy: avg_heatmaps = (outputs.heatmaps + outputs_flipped.heatmaps) / 2.
  • kernel_size (int, optional, defaults to 11) — Kernel size for the Gaussian blur used in UDP Dark Pose refinement.
  • threshold (float, optional) — Score threshold. Keypoints with scores at or below this value are filtered out from the result dictionaries.
  • source_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Original (height, width) of each image in pixels. Required when target_sizes is provided, as the source coordinate space for scaling keypoints and bounding boxes.
  • target_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Desired output (height, width) coordinate space for each image. When provided alongside source_sizes, keypoint coordinates and bounding boxes are scaled from source to target space.

Returns

list[list[dict]]

Outer list is over images, inner list is over persons. Each dict contains:

  • keypoints (torch.FloatTensor of shape (num_keypoints, 2)): absolute x/y coordinates in the source image space, or in target space if target_sizes is provided.
  • scores (torch.FloatTensor of shape (num_keypoints,)): per-keypoint confidence.
  • labels (torch.LongTensor of shape (num_keypoints,)): keypoint indices.
  • bbox (torch.FloatTensor of shape (4,)): bounding box in absolute (x_min, y_min, x_max, y_max) format, in the same coordinate space as keypoints.

Converts the output of Sapiens2ForPoseEstimation into keypoint predictions in image space.
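
Three coordinate conventions appear in the description above: input boxes are COCO (top_left_x, top_left_y, width, height), returned boxes are (x_min, y_min, x_max, y_max), and keypoints can be rescaled from source_sizes to target_sizes. The sketch below illustrates each step in plain Python. It assumes rescaling is a simple per-axis ratio, and all function names are hypothetical.

```python
def coco_to_xyxy(box: list[float]) -> list[float]:
    """Convert (top_left_x, top_left_y, width, height) to (x_min, y_min, x_max, y_max)."""
    x, y, width, height = box
    return [x, y, x + width, y + height]


def scale_points(
    points: list[list[float]],
    source_size: tuple[int, int],
    target_size: tuple[int, int],
) -> list[list[float]]:
    """Rescale (x, y) points between two (height, width) coordinate spaces.

    Assumption: plain per-axis ratio, with no padding or offset handling.
    """
    source_height, source_width = source_size
    target_height, target_width = target_size
    scale_x = target_width / source_width
    scale_y = target_height / source_height
    return [[x * scale_x, y * scale_y] for x, y in points]


def keep_above_threshold(scores: list[float], threshold: float) -> list[int]:
    """Indices of keypoints whose score is strictly above the threshold."""
    return [index for index, score in enumerate(scores) if score > threshold]


print(coco_to_xyxy([10.0, 20.0, 30.0, 40.0]))                 # [10.0, 20.0, 40.0, 60.0]
print(scale_points([[100.0, 50.0]], (200, 400), (100, 200)))  # [[50.0, 25.0]]
print(keep_above_threshold([0.9, 0.3, 0.5], threshold=0.5))   # [0]
```

The strict comparison mirrors the threshold description, which filters out keypoints with scores at or below the threshold.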

post_process_semantic_segmentation

( outputs target_sizes: list[tuple] | None = None ) semantic_segmentation

Parameters

  • outputs (Sapiens2ForSemanticSegmentation) — Raw outputs of the model.
  • target_sizes (list[Tuple] of length batch_size, optional) — List of tuples corresponding to the requested final size (height, width) of each prediction. If unset, predictions will not be resized.

Returns

semantic_segmentation

list[torch.Tensor] of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.

Converts the output of Sapiens2ForSemanticSegmentation into semantic segmentation maps.
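
A semantic segmentation map of class ids is typically obtained by taking the argmax over the class dimension of the per-pixel logits. The sketch below shows that step on nested lists. It assumes this method follows the same pattern as other Transformers segmentation post-processors; the function name is hypothetical.

```python
def argmax_class_map(logits: list[list[list[float]]]) -> list[list[int]]:
    """Turn logits of shape [num_classes][height][width] into a [height][width] map of class ids."""
    num_classes = len(logits)
    height = len(logits[0])
    width = len(logits[0][0])
    return [
        [
            # Pick the class whose logit is largest at pixel (y, x).
            max(range(num_classes), key=lambda class_id: logits[class_id][y][x])
            for x in range(width)
        ]
        for y in range(height)
    ]


# Two classes, one row, two pixels.
logits = [
    [[0.1, 2.0]],  # class 0
    [[1.5, 0.3]],  # class 1
]
print(argmax_class_map(logits))  # [[1, 0]]
```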

Sapiens2Model

class transformers.Sapiens2Model

( config: Sapiens2Config )

Parameters

  • config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Sapiens2 Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( pixel_values: Tensor bool_masked_pos: torch.Tensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) BaseModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details (processor_class uses Sapiens2ImageProcessor for processing images).
  • bool_masked_pos (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Boolean masked positions. Indicates which patches are masked (1) and which aren’t (0). Only relevant for pre-training.

Returns

BaseModelOutputWithPooling or tuple(torch.FloatTensor)

A BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Sapiens2Config) and inputs.

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

  • pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Sapiens2Model forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoImageProcessor, AutoModel
>>> from transformers.image_utils import load_image
>>> import torch

>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-pretrain-0.4b")
>>> model = AutoModel.from_pretrained("facebook/sapiens2-pretrain-0.4b")

>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs)

>>> cls_token = outputs.pooler_output
>>> cls_token.shape
torch.Size([1, 1024])

Sapiens2Backbone

class transformers.Sapiens2Backbone

( config: Sapiens2Config )

Parameters

  • config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Sapiens2 backbone.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( pixel_values: Tensor **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) Sapiens2BackboneOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details (processor_class uses Sapiens2ImageProcessor for processing images).

Returns

Sapiens2BackboneOutput or tuple(torch.FloatTensor)

A Sapiens2BackboneOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Sapiens2Config) and inputs.

  • feature_maps (tuple(torch.FloatTensor) of shape (batch_size, num_channels, height, width)) — Feature maps of the stages.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size) or (batch_size, num_channels, height, width), depending on the backbone.

    Hidden-states of the model at the output of each stage plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Only applicable if the backbone uses attention.

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cls_tokens (tuple(torch.FloatTensor), optional) — CLS token from each selected feature stage, each of shape (batch_size, hidden_size). Only present when config.return_class_token=True.

The Sapiens2Backbone forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoBackbone, AutoImageProcessor
>>> from transformers.image_utils import load_image
>>> import torch

>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-pretrain-0.4b")
>>> model = AutoBackbone.from_pretrained("facebook/sapiens2-pretrain-0.4b")

>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs, return_class_token=True)

>>> outputs.feature_maps[0].shape
torch.Size([1, 1024, 64, 48])
>>> outputs.cls_tokens[0].shape
torch.Size([1, 1024])
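
The spatial size of the feature map in the example above is consistent with a patch size of 16: a 1024×768 input gives a 64×48 grid of patch tokens. The sketch below reproduces that arithmetic. The patch size of 16 is inferred from the example output rather than stated on this page, and `feature_map_size` is a hypothetical helper, not part of Transformers.

```python
def feature_map_size(height: int, width: int, patch_size: int = 16) -> tuple[int, int]:
    """Return the (rows, cols) of the patch grid for an input of the given size.

    patch_size=16 is an assumption inferred from the example output above.
    """
    # This sketch only handles inputs that divide evenly into patches.
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return height // patch_size, width // patch_size


rows, cols = feature_map_size(1024, 768)
print(rows, cols)  # 64 48
```

If a checkpoint uses a different patch size, pass it explicitly; the helper makes no claim about how the model itself treats inputs that are not exact multiples.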

Sapiens2ForImageMatting

class transformers.Sapiens2ForImageMatting

( config: Sapiens2Config )

Parameters

  • config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Sapiens2 model with a matting head on top (a PixelShuffle-based decoder that predicts a pre-multiplied RGB foreground and an alpha matte).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, and pruning heads).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( pixel_values: FloatTensor, labels: torch.FloatTensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → Sapiens2ImageMattingOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details.
  • labels (torch.FloatTensor of shape (batch_size, 4, height, width), optional) — Ground-truth matting targets for computing the loss.

Returns

Sapiens2ImageMattingOutput or tuple(torch.FloatTensor)

A Sapiens2ImageMattingOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Sapiens2Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Matting loss.

  • alphas (torch.FloatTensor of shape (batch_size, 1, height, width)) — Estimated alpha values.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • foregrounds (torch.FloatTensor of shape (batch_size, 3, height, width)) — Pre-multiplied RGB foreground predictions in [0, 1] (sigmoid-activated).

The Sapiens2ForImageMatting forward method overrides the __call__ special method.

Although the forward pass is defined in this function, call the Module instance rather than forward directly: calling the instance runs the pre- and post-processing steps, whereas calling forward silently skips them.

Example:

>>> from transformers import AutoImageProcessor, AutoModel
>>> from transformers.image_utils import load_image
>>> import torch

>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-matting-1b")
>>> model = AutoModel.from_pretrained("facebook/sapiens2-matting-1b")

>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs)

>>> outputs.alphas.shape
torch.Size([1, 1, 1024, 768])
>>> outputs.foregrounds.shape
torch.Size([1, 3, 1024, 768])
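
Because the foreground is pre-multiplied by alpha, placing the subject on a new background only needs the standard "over" operation: add the foreground as-is and weight the background by `1 - alpha`. The sketch below shows this with NumPy on toy arrays that stand in for `outputs.foregrounds[0]` and `outputs.alphas[0]` (converted with `.numpy()`). `composite_over` is a hypothetical helper, not part of Transformers.

```python
import numpy as np


def composite_over(foreground_premult, alpha, background):
    """Composite a pre-multiplied foreground over a background.

    foreground_premult: (3, H, W) array in [0, 1], already multiplied by alpha.
    alpha:              (1, H, W) array in [0, 1].
    background:         (3, H, W) array in [0, 1].
    """
    # The foreground is pre-multiplied, so it is added unchanged;
    # only the background is weighted by the remaining transparency.
    return np.clip(foreground_premult + (1.0 - alpha) * background, 0.0, 1.0)


# Toy 2x2 stand-ins for the model outputs (assumed values, for illustration only).
alpha = np.array([[[1.0, 0.5], [0.0, 0.25]]])   # shape (1, 2, 2)
foreground = alpha * np.full((3, 2, 2), 0.8)    # pre-multiplied grey subject
background = np.zeros((3, 2, 2))
background[1] = 1.0                             # pure green background

image = composite_over(foreground, alpha, background)
print(image.shape)  # (3, 2, 2)
```

Multiplying the foreground by alpha a second time would darken semi-transparent edges such as hair, which is the usual mistake when mixing pre-multiplied and straight-alpha conventions.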

Sapiens2ForNormalEstimation

class transformers.Sapiens2ForNormalEstimation

( config: Sapiens2Config )

Parameters

  • config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Sapiens2 model with a normal estimation head on top (a PixelShuffle-based decoder that predicts surface normal maps).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, and pruning heads).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( pixel_values: FloatTensor, labels: torch.FloatTensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → Sapiens2NormalEstimatorOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details.
  • labels (torch.FloatTensor of shape (batch_size, num_labels, height, width), optional) — Ground-truth surface normal maps for computing the loss.

Returns

Sapiens2NormalEstimatorOutput or tuple(torch.FloatTensor)

A Sapiens2NormalEstimatorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Sapiens2Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Normal estimation loss.
  • normals (torch.FloatTensor of shape (batch_size, num_labels, height, width)) — Raw normal map predictions as output by the model (unnormalized).
  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one per layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax.

The Sapiens2ForNormalEstimation forward method overrides the __call__ special method.

Although the forward pass is defined in this function, call the Module instance rather than forward directly: calling the instance runs the pre- and post-processing steps, whereas calling forward silently skips them.

Example:

>>> from transformers import AutoImageProcessor, AutoModel
>>> from transformers.image_utils import load_image
>>> import torch

>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-normal-0.4b")
>>> model = AutoModel.from_pretrained("facebook/sapiens2-normal-0.4b")

>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs)

>>> outputs.normals.shape
torch.Size([1, 3, 1024, 768])
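
Since `normals` holds raw, unnormalized predictions, a common first step is to rescale each per-pixel vector to unit length. The following is a minimal NumPy sketch of that step on a toy array standing in for `outputs.normals[0]`. `normalize_normals` and its `eps` guard are illustrative assumptions, not the library's post-processing.

```python
import numpy as np


def normalize_normals(normals, eps=1e-6):
    """Scale each per-pixel vector of a (3, H, W) normal map to unit length."""
    norm = np.linalg.norm(normals, axis=0, keepdims=True)
    # eps guards against division by zero where the prediction is the zero vector.
    return normals / np.maximum(norm, eps)


# Toy (3, 1, 2) map: pixel 0 holds (3, 0, 4), pixel 1 holds (0, 0, 2).
raw = np.array([[[3.0, 0.0]], [[0.0, 0.0]], [[4.0, 2.0]]])
unit = normalize_normals(raw)
print(unit.shape)  # (3, 1, 2)
```

After this step every non-zero vector has length 1, so the components can be mapped to colours or compared with ground-truth normals by a dot product.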

Sapiens2ForPointmapEstimation

class transformers.Sapiens2ForPointmapEstimation

( config: Sapiens2Config )

Parameters

  • config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Sapiens2 model with a pointmap head on top (a PixelShuffle-based decoder that predicts per-pixel 3D XYZ coordinates, plus an optional scale branch for focal-length normalization).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, and pruning heads).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( pixel_values: FloatTensor, labels: torch.FloatTensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → Sapiens2PointmapEstimatorOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details.
  • labels (torch.FloatTensor of shape (batch_size, 3, height, width), optional) — Ground-truth pointmap for computing the loss.

Returns

Sapiens2PointmapEstimatorOutput or tuple(torch.FloatTensor)

A Sapiens2PointmapEstimatorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Sapiens2Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Pointmap estimation loss.
  • pointmaps (torch.FloatTensor of shape (batch_size, 3, height, width)) — Per-pixel 3D XYZ coordinate predictions in canonical camera space.
  • scales (torch.FloatTensor of shape (batch_size, 1), optional) — Ratio of the canonical focal length to the actual focal length. None when no scale branch is configured.
  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one per layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax.

The Sapiens2ForPointmapEstimation forward method overrides the __call__ special method.

Although the forward pass is defined in this function, call the Module instance rather than forward directly: calling the instance runs the pre- and post-processing steps, whereas calling forward silently skips them.

Example:

>>> from transformers import AutoImageProcessor, AutoModel
>>> from transformers.image_utils import load_image
>>> import torch

>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-pointmap-0.4b")
>>> model = AutoModel.from_pretrained("facebook/sapiens2-pointmap-0.4b")

>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs)

>>> outputs.pointmaps.shape
torch.Size([1, 3, 1024, 768])
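
A pointmap stores one 3D point per pixel, so a depth map and a camera-distance map fall out of it directly. The sketch below assumes the channel order is X, Y, Z and that the camera sits at the origin of the canonical camera space. Both are assumptions drawn from the description above, and `depth_and_range` is a hypothetical helper. It does not use `scales`.

```python
import numpy as np


def depth_and_range(pointmap):
    """Split a (3, H, W) XYZ pointmap into a Z-depth map and a Euclidean range map."""
    depth = pointmap[2]                           # Z coordinate of every pixel
    distance = np.linalg.norm(pointmap, axis=0)   # distance from the camera origin
    return depth, distance


# Toy 1x1 stand-in for outputs.pointmaps[0]: a single point at (3, 4, 12).
pointmap = np.array([3.0, 4.0, 12.0]).reshape(3, 1, 1)
depth, distance = depth_and_range(pointmap)
print(depth.shape, distance.shape)  # (1, 1) (1, 1)
```

Depth and range differ for every pixel off the optical axis, so pick the one that matches the convention of whatever consumes the result.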

Sapiens2ForPoseEstimation

class transformers.Sapiens2ForPoseEstimation

( config: Sapiens2Config )

Parameters

  • config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Sapiens2 model with a pose estimation head on top (a set of heatmap predictors on top of the hidden states output).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, and pruning heads).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( pixel_values: FloatTensor, flip_pairs: torch.Tensor | None = None, labels: torch.FloatTensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → Sapiens2PoseEstimatorOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details.
  • flip_pairs (torch.Tensor of shape (num_pairs, 2), optional) — Pairs of keypoints which are mirrored (for example, left ear — right ear), used for test-time flip augmentation. When provided, the model assumes pixel_values contains horizontally-flipped images and calls flip_back on the output heatmaps to restore the original orientation.
  • labels (torch.FloatTensor of shape (batch_size, num_keypoints, height, width), optional) — Heatmap ground truth for computing the loss.
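
To make the role of flip_pairs concrete, the sketch below shows what undoing a horizontal flip on heatmaps usually involves: mirror each heatmap along the width axis, then swap the channels of every left/right pair. It is a NumPy illustration of the idea only. `flip_back_sketch` is a hypothetical name, and the library's own flip_back may differ in details.

```python
import numpy as np


def flip_back_sketch(heatmaps, flip_pairs):
    """Undo a horizontal flip on (num_keypoints, H, W) heatmaps."""
    restored = heatmaps[:, :, ::-1].copy()   # mirror along the width axis
    for left, right in flip_pairs:
        # In a flipped image the left keypoint appears in the right channel, so swap them.
        restored[[left, right]] = restored[[right, left]]
    return restored


# Two keypoints on a 1x3 heatmap, predicted from a flipped image.
flipped = np.zeros((2, 1, 3))
flipped[0, 0, 0] = 1.0   # channel 0 peaks at the left edge
flipped[1, 0, 2] = 2.0   # channel 1 peaks at the right edge

restored = flip_back_sketch(flipped, [(0, 1)])
print(restored.shape)  # (2, 1, 3)
```

Heatmaps restored this way can be averaged with the heatmaps from the unflipped image, which is the usual form of flip test-time augmentation.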

Returns

Sapiens2PoseEstimatorOutput or tuple(torch.FloatTensor)

A Sapiens2PoseEstimatorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Sapiens2Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Pose estimation loss.

  • heatmaps (torch.FloatTensor of shape (batch_size, num_keypoints, height, width)) — Heatmaps as predicted by the model.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Sapiens2ForPoseEstimation forward method overrides the __call__ special method.

Although the forward pass is defined in this function, call the Module instance rather than forward directly: calling the instance runs the pre- and post-processing steps, whereas calling forward silently skips them.

Example:

>>> from transformers import AutoImageProcessor, AutoModel
>>> from transformers.image_utils import load_image
>>> import torch

>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-pose-0.4b")
>>> model = AutoModel.from_pretrained("facebook/sapiens2-pose-0.4b")

>>> boxes = [[[270.8, 0.6, 294.1, 379.5]]]
>>> inputs = image_processor(image, boxes=boxes, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs)

>>> outputs.heatmaps.shape
torch.Size([1, 308, 256, 192])
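
The heatmaps still have to be turned into keypoint coordinates. A simple way, sketched below with NumPy, is to take the location of the maximum in each heatmap together with its value as a confidence score. This is a plain argmax decoder given for illustration, with `decode_heatmaps` a hypothetical helper. It is not the library's post-processing, and the coordinates it returns are in heatmap pixels of the cropped box, not in the original image.

```python
import numpy as np


def decode_heatmaps(heatmaps):
    """Return (x, y) peak locations and peak scores for (num_keypoints, H, W) heatmaps."""
    num_keypoints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_keypoints, -1)
    index = flat.argmax(axis=1)                        # flat index of each peak
    ys, xs = np.unravel_index(index, (height, width))  # back to row/column
    coords = np.stack([xs, ys], axis=1)                # (num_keypoints, 2), x first
    scores = flat.max(axis=1)
    return coords, scores


# Toy stand-in for outputs.heatmaps[0]: two keypoints on a 4x3 grid.
heatmaps = np.zeros((2, 4, 3))
heatmaps[0, 1, 2] = 0.9   # keypoint 0 at x=2, y=1
heatmaps[1, 3, 0] = 0.5   # keypoint 1 at x=0, y=3

coords, scores = decode_heatmaps(heatmaps)
print(coords.tolist())  # [[2, 1], [0, 3]]
```

Mapping the result back to the image requires scaling by the ratio between the box size and the heatmap size and adding the box offset.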

Sapiens2ForSemanticSegmentation

class transformers.Sapiens2ForSemanticSegmentation

( config: Sapiens2Config )

Parameters

  • config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Sapiens2 model with a semantic segmentation head on top, e.g. for ADE20K or Cityscapes.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, and pruning heads).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( pixel_values: FloatTensor, labels: torch.LongTensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → SemanticSegmenterOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details.
  • labels (torch.LongTensor of shape (batch_size, height, width), optional) — Ground truth semantic segmentation maps for computing the loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels > 1, a classification loss is computed (Cross-Entropy).

Returns

SemanticSegmenterOutput or tuple(torch.FloatTensor)

A SemanticSegmenterOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Sapiens2Config) and inputs.

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.

  • logits (torch.FloatTensor of shape (batch_size, config.num_labels, logits_height, logits_width)) — Classification scores for each pixel.

    The logits returned do not necessarily have the same size as the pixel_values passed as inputs. This avoids interpolating twice, and losing quality, when a user needs to resize the logits to the original image size as post-processing. Always check the shape of the logits and resize as needed.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The Sapiens2ForSemanticSegmentation forward method overrides the __call__ special method.

Although the forward pass is defined in this function, call the Module instance rather than forward directly: calling the instance runs the pre- and post-processing steps, whereas calling forward silently skips them.

Example:

>>> from transformers import AutoImageProcessor, AutoModel
>>> from transformers.image_utils import load_image
>>> import torch

>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-seg-0.4b")
>>> model = AutoModel.from_pretrained("facebook/sapiens2-seg-0.4b")

>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs)

>>> outputs.logits.shape
torch.Size([1, 29, 1024, 768])
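
To get a label map from the logits, take the argmax over the class dimension, and resize if the logits do not match the target image size. The NumPy sketch below does this on a toy array standing in for `outputs.logits[0]`. `logits_to_labels` is a hypothetical helper, and its nearest-neighbour resize of the label map is a simplification chosen to keep the example dependency-free. Interpolating the logits before the argmax generally gives smoother boundaries.

```python
import numpy as np


def logits_to_labels(logits, target_size=None):
    """Turn (num_labels, H, W) logits into an (H, W) label map, optionally resized."""
    labels = logits.argmax(axis=0)                 # most likely class per pixel
    if target_size is not None and tuple(target_size) != labels.shape:
        target_h, target_w = target_size
        # Nearest-neighbour lookup: map each target pixel to a source row/column.
        rows = np.arange(target_h) * labels.shape[0] // target_h
        cols = np.arange(target_w) * labels.shape[1] // target_w
        labels = labels[rows[:, None], cols[None, :]]
    return labels


# Toy logits with 2 classes on a 1x2 grid: left pixel is class 0, right is class 1.
logits = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
labels = logits_to_labels(logits, target_size=(2, 4))
print(labels.tolist())  # [[0, 0, 1, 1], [0, 0, 1, 1]]
```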