Sapiens2
This model was published in HF papers on 2026-04-23 and contributed to Hugging Face Transformers on 2026-06-03.
Overview
The Sapiens2 model was proposed in Sapiens2 by Rawal Khirodkar, He Wen, Julieta Martinez, Yuan Dong, Zhaoen Su, Shunsuke Saito. Sapiens2 is a family of high-resolution vision transformers pretrained on ~1 billion curated human images, designed for human-centric computer vision tasks including pose estimation, body-part segmentation, surface normal estimation, and pointmap estimation.
You can find all the original Sapiens2 checkpoints under the Sapiens2 collection.
The abstract from the paper is the following:
We present Sapiens2, a family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. We pretrain on ~1 billion curated high-quality human images with improved task annotations and combine masked image reconstruction with self-distilled contrastive objectives to learn both low-level and semantic features. Our models scale from 0.4B to 5B parameters and train at native 1K resolution, with hierarchical 4K variants for extended spatial reasoning. Sapiens2 achieves substantial improvements over its predecessor: +4 mAP in pose estimation, +24.3 mIoU in body-part segmentation, and 45.6% error reduction in normal estimation, while extending to new tasks like pointmap and albedo estimation. Code is publicly available.
Tips:
- Sapiens2 uses Rotary Position Embeddings (RoPE) and supports arbitrary input resolutions. The default image processor resizes images to 1024×768 (height×width).
- The model uses Grouped Query Attention (GQA) for middle layers and full multi-head attention for the first and last 8 layers.
- Register tokens (8 by default) reduce high-norm artifacts in patch tokens, yielding cleaner attention maps and better performance on dense prediction tasks.
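The resolution and register-token tips above can be turned into a quick size check. The sketch below is illustrative only: `patch_grid` and `sequence_length` are hypothetical helpers, not library functions, and the token layout (one CLS token, then register tokens, then patch tokens) is an assumption borrowed from similar ViT designs rather than something this page confirms.

```python
def patch_grid(image_size=(1024, 768), patch_size=16):
    """Return (rows, cols) of non-overlapping patches for a (height, width) image."""
    height, width = image_size
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return height // patch_size, width // patch_size


def sequence_length(image_size=(1024, 768), patch_size=16, num_register_tokens=8):
    """Token count under the ASSUMED layout: 1 CLS + register tokens + patch tokens."""
    rows, cols = patch_grid(image_size, patch_size)
    return 1 + num_register_tokens + rows * cols


rows, cols = patch_grid()
print(rows, cols)         # 64 48
print(sequence_length())  # 3081
```

The 64×48 grid matches the spatial size of the backbone feature maps for the default 1024×768 input.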
This model was contributed by guarin. The original code can be found here.
Usage examples
The example below shows how to obtain the CLS token (whole-image embedding) with Sapiens2Model.
import torch
from transformers import AutoImageProcessor, AutoModel
from transformers.image_utils import load_image
image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-pretrain-0.4b")
model = AutoModel.from_pretrained("facebook/sapiens2-pretrain-0.4b", device_map="auto")
inputs = image_processor(images=image, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model(**inputs)
# outputs.pooler_output is the CLS token (whole-image embedding)
cls_token = outputs.pooler_output
print("CLS token shape:", cls_token.shape)  # [1, 1024]
Sapiens2Config
class transformers.Sapiens2Config
< source >( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None patch_size: int | list[int] | tuple[int, int] = 16 hidden_size: int = 1024 intermediate_size: int = 4096 num_hidden_layers: int = 24 num_attention_heads: int = 16 hidden_act: str = 'silu' attention_dropout: float | int = 0.0 initializer_range: float = 0.02 rope_theta: float = 100.0 image_size: int | list[int] | tuple[int, int] = 224 num_channels: int = 3 query_bias: bool = True key_bias: bool = True value_bias: bool = True proj_bias: bool = True mlp_bias: bool = True layerscale_value: float = 1.0 drop_path_rate: float | int = 0.0 use_gated_mlp: bool = True num_register_tokens: int = 8 pos_embed_shift: float | None = None pos_embed_jitter: float | None = None pos_embed_rescale: float | None = 2.0 _out_features: list[str] | None = None _out_indices: list[int] | None = None reshape_hidden_states: bool = True use_mask_token: bool = False rms_norm_eps: float = 1e-06 normalize_backbone_outputs: bool = True use_qk_norm: bool = True num_key_value_heads_per_layer: list[int] | None = None num_key_value_attention_heads: int = 8 num_first_full_attention_layers: int = 8 num_last_full_attention_layers: int = 8 semantic_loss_ignore_index: int = 255 flip_pairs: list[list[int]] | None = None head_config: transformers.models.sapiens2.configuration_sapiens2.Sapiens2HeadConfig | dict | None = None )
Parameters
- patch_size (Union[int, list[int], tuple[int, int]], optional, defaults to 16) — The size (resolution) of each patch.
- hidden_size (int, optional, defaults to 1024) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 4096) — Dimension of the MLP representations.
- num_hidden_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer encoder.
- num_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
- hidden_act (str, optional, defaults to silu) — The non-linear activation function (function or string) in the encoder. For example, "gelu", "relu", "silu", etc.
- attention_dropout (Union[float, int], optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- rope_theta (float, optional, defaults to 100.0) — The base period of the RoPE embeddings.
- image_size (Union[int, list[int], tuple[int, int]], optional, defaults to 224) — The size (resolution) of each image.
- num_channels (int, optional, defaults to 3) — The number of input channels.
- query_bias (bool, optional, defaults to True) — Whether to add a bias to the query projection.
- key_bias (bool, optional, defaults to True) — Whether to add a bias to the key projection.
- value_bias (bool, optional, defaults to True) — Whether to add a bias to the value projection.
- proj_bias (bool, optional, defaults to True) — Whether to add a bias to the output projection.
- mlp_bias (bool, optional, defaults to True) — Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
- layerscale_value (float, optional, defaults to 1.0) — Initial value to use for layer scale.
- drop_path_rate (Union[float, int], optional, defaults to 0.0) — Drop path rate for the patch fusion.
- use_gated_mlp (bool, optional, defaults to True) — Whether to use the SwiGLU feedforward neural network.
- num_register_tokens (int, optional, defaults to 8) — The number of register tokens.
- pos_embed_shift (float, optional) — Amount to randomly shift position embedding coordinates in [-shift, shift], applied only in training mode if not None.
- pos_embed_jitter (float, optional) — Amount to randomly jitter position embedding coordinates in log-uniform value in [1/jitter, jitter], applied only in training mode if not None.
- pos_embed_rescale (float, optional, defaults to 2.0) — Amount to randomly rescale position embedding coordinates in log-uniform value in [1/rescale, rescale], applied only in training mode if not None.
- reshape_hidden_states (bool, optional, defaults to True) — Whether to reshape the hidden states to spatial dimensions when used as backbone.
- use_mask_token (bool, optional, defaults to False) — Whether to use a mask token in the embeddings (needed for masked image modeling pretraining).
- rms_norm_eps (float, optional, defaults to 1e-6) — Epsilon for the RMS normalization layers.
- normalize_backbone_outputs (bool, optional, defaults to True) — Whether to apply RMSNorm to the backbone feature_maps and cls_tokens outputs before returning them from the forward pass. Only applies when the model is used as a backbone.
- use_qk_norm (bool, optional, defaults to True) — Whether to apply RMSNorm to queries and keys before RoPE in attention layers.
- num_key_value_heads_per_layer (list[int], optional) — Number of key/value heads for each transformer layer. Setting a layer's value equal to num_attention_heads gives full multi-head attention; a smaller value gives grouped-query attention. Defaults to num_attention_heads for the first num_first_full_attention_layers and last num_last_full_attention_layers layers and num_key_value_attention_heads for all other layers.
- num_key_value_attention_heads (int, optional, defaults to 8) — Number of key/value heads for layers that use grouped-query attention when num_key_value_heads_per_layer is not set. Ignored when num_key_value_heads_per_layer is set.
- num_first_full_attention_layers (int, optional, defaults to 8) — Number of leading transformer layers that use full multi-head attention. Only used when num_key_value_heads_per_layer is None.
- num_last_full_attention_layers (int, optional, defaults to 8) — Number of trailing transformer layers that use full multi-head attention. Only used when num_key_value_heads_per_layer is None.
- semantic_loss_ignore_index (int, optional, defaults to 255) — Label index ignored when computing the segmentation loss.
- flip_pairs (list[list[int]], optional) — Pairs of keypoint indices that are mirrored horizontally (e.g., left ear ↔ right ear). Each pair is a two-element list [left_index, right_index]. Used for test-time horizontal flip augmentation in pose estimation: pass these pairs to the second forward call so the model flips heatmaps back before returning them.
- head_config (Sapiens2HeadConfig, optional) — Configuration for the decode head. See Sapiens2HeadConfig for the available options.
This is the configuration class to store the configuration of a Sapiens2Model. It is used to instantiate a Sapiens2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the facebook/sapiens2-pretrain-0.4b architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
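The default for `num_key_value_heads_per_layer` described in the parameter list can be written out as a short function. This is a sketch of the documented rule only; `default_kv_heads_per_layer` is a hypothetical helper, and the library's own handling of edge cases (for example, overlapping first and last ranges) may differ.

```python
def default_kv_heads_per_layer(
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_attention_heads=8,
    num_first_full_attention_layers=8,
    num_last_full_attention_layers=8,
):
    """Build the per-layer key/value head counts from the documented defaults."""
    heads = []
    for layer_idx in range(num_hidden_layers):
        in_first = layer_idx < num_first_full_attention_layers
        in_last = layer_idx >= num_hidden_layers - num_last_full_attention_layers
        # Full multi-head attention: as many KV heads as query heads.
        # Grouped-query attention: fewer KV heads shared across query heads.
        if in_first or in_last:
            heads.append(num_attention_heads)
        else:
            heads.append(num_key_value_attention_heads)
    return heads


schedule = default_kv_heads_per_layer()
print(schedule)
```

With the default values this gives 16 key/value heads for layers 0-7 and 16-23, and 8 for the middle layers 8-15.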
Sapiens2HeadConfig
class transformers.Sapiens2HeadConfig
< source >( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None upsample_out_channels: list[int] | None = None upsample_kernel_sizes: list[int] | None = None upsample_kernel_size: int = 4 use_pixel_shuffle: bool | None = None conv_out_channels: list[int] | None = None conv_kernel_sizes: list[int] | None = None conv_kernel_size: int = 1 scale_conv_out_channels: list[int] | None = None scale_conv_kernel_sizes: list[int] | None = None scale_conv_kernel_size: int = 1 scale_final_input_size: int | None = None scale_final_hidden_sizes: list[int] | None = None )
Parameters
- upsample_out_channels (list[int], optional) — Output channel counts for each upsample block. The first block takes hidden_size channels as input; subsequent blocks use the previous output.
- upsample_kernel_sizes (list[int], optional) — Kernel size for each upsample block. Auto-filled with [4, ...] when upsample_out_channels is set but this is None. Must have the same length as upsample_out_channels.
- upsample_kernel_size (int, defaults to 4) — Default kernel size for upsample blocks when upsample_kernel_sizes is not set.
- use_pixel_shuffle (bool, optional) — Whether the upsample head uses pixel-shuffle upsampling instead of transposed convolutions. When None (default), the head uses transposed convolutions.
- conv_out_channels (list[int], optional) — Output channel counts for the refinement conv layers that follow the upsample blocks.
- conv_kernel_sizes (list[int], optional) — Kernel size for each refinement conv layer. Auto-filled with [1, ...] when conv_out_channels is set but this is None. Must have the same length as conv_out_channels.
- conv_kernel_size (int, defaults to 1) — Default kernel size for conv layers when conv_kernel_sizes is not set.
- scale_conv_out_channels (list[int], optional) — Output channel counts for the stride-2 conv layers used to predict the focal-length scale. When None (default), no scale branch is built.
- scale_conv_kernel_sizes (list[int], optional) — Kernel size for each scale conv layer. Auto-filled with [1, ...] when scale_conv_out_channels is set but this is None. Must have the same length as scale_conv_out_channels.
- scale_conv_kernel_size (int, defaults to 1) — Default kernel size for scale conv layers when scale_conv_kernel_sizes is not set.
- scale_final_input_size (int, optional) — Flattened feature size passed into the scale MLP. When None (default), it is automatically inferred from image_size and patch_size in the parent Sapiens2Config.
- scale_final_hidden_sizes (list[int], optional) — Hidden-layer sizes for the MLP that maps flattened scale features to the scalar scale output. When None (default), no scale branch is built.
This is the configuration class to store the configuration of a Sapiens2Model. It is used to instantiate a Sapiens2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the facebook/sapiens2-seg-0.4b architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
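The auto-fill rule for the `*_kernel_sizes` lists can be sketched as below. `resolve_kernel_sizes` is a hypothetical helper written from the parameter descriptions, not the library's code, and the channel counts in the usage lines are made-up values.

```python
def resolve_kernel_sizes(out_channels, kernel_sizes=None, default_kernel_size=4):
    """Fill in per-block kernel sizes following the documented rule."""
    if out_channels is None:
        # No blocks configured, so there is nothing to fill in.
        return kernel_sizes
    if kernel_sizes is None:
        # One default kernel size per block, e.g. [4, 4] for two upsample blocks.
        return [default_kernel_size] * len(out_channels)
    if len(kernel_sizes) != len(out_channels):
        raise ValueError("kernel_sizes must have the same length as out_channels")
    return list(kernel_sizes)


# Hypothetical channel counts, chosen only for illustration.
print(resolve_kernel_sizes([512, 256]))                            # [4, 4]
print(resolve_kernel_sizes([256, 256], default_kernel_size=1))     # [1, 1]
```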
Sapiens2ImageProcessor
class transformers.Sapiens2ImageProcessor
< source >( **kwargs: typing_extensions.Unpack[transformers.models.sapiens2.image_processing_sapiens2.Sapiens2ImageProcessorKwargs] )
Parameters
- do_reduce_labels (bool, kwargs, optional, defaults to self.do_reduce_labels) — Whether or not to reduce all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by 255.
- **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.
Constructs a Sapiens2ImageProcessor image processor.
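To make the effect of `do_reduce_labels` concrete, here is a small pure-Python sketch of the label shift. `reduce_labels` is a hypothetical helper operating on nested lists, and keeping an existing 255 as 255 is an assumption based on how this option usually behaves in segmentation preprocessing.

```python
def reduce_labels(segmentation_map, ignore_index=255):
    """Shift labels down by 1 and map background (0) to the ignore index."""
    reduced = []
    for row in segmentation_map:
        new_row = []
        for label in row:
            if label == 0 or label == ignore_index:
                # Background, or an already-ignored pixel, is ignored in the loss.
                new_row.append(ignore_index)
            else:
                new_row.append(label - 1)
        reduced.append(new_row)
    return reduced


print(reduce_labels([[0, 1, 2], [3, 0, 255]]))  # [[255, 0, 1], [2, 255, 255]]
```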
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None boxes: list[list[list[float]]] | None = None **kwargs: typing_extensions.Unpack[transformers.models.sapiens2.image_processing_sapiens2.Sapiens2ImageProcessorKwargs] ) → ~image_processing_base.BatchFeature
Parameters
- images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
- segmentation_maps (ImageInput, optional) — The segmentation maps to preprocess.
- boxes (list[list[list[float]]] or np.ndarray, optional) — List or array of bounding boxes for each image. Each box should be a list of 4 floats representing the bounding box coordinates in COCO format (top_left_x, top_left_y, width, height). When provided, each person crop is affine-warped to the model input size instead of resizing the full image.
- do_reduce_labels (bool, kwargs, optional, defaults to self.do_reduce_labels) — Whether or not to reduce all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by 255.
- return_tensors (str or TensorType, optional) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
- **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.
Returns
~image_processing_base.BatchFeature
- data (dict) — Dictionary of lists/arrays/tensors returned by the call method ('pixel_values', etc.).
- tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers in PyTorch/Numpy Tensors at initialization.
post_process_image_matting
< source >( outputs: Sapiens2ImageMattingOutput target_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None backgrounds: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None ) → list[dict] of length batch_size. Each dict has
Parameters
- outputs (Sapiens2ImageMattingOutput) — Raw outputs of the model.
- target_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Requested final (height, width) for each prediction. Resized with bilinear interpolation. If unset, predictions are returned at the model output resolution.
- backgrounds (ImageInput, optional) — Background image(s) to composite over. Can be a single image (applied to every item in the batch) or a list of images, one per batch item. Accepts PIL images, numpy arrays, or torch tensors of any dtype; integer types (e.g. uint8) are scaled to [0, 1] automatically. When provided, each result dict gains a "composite" key with the composited image as a uint8 tensor in [0, 255].
Returns
list[dict] of length batch_size. Each dict has:
- "alpha" (torch.Tensor of shape (1, height, width)): alpha values in [0, 1].
- "foreground" (torch.Tensor of shape (3, height, width)): pre-multiplied RGB in [0, 1].
- "composite" (torch.Tensor of shape (3, height, width) or None): foreground composited over backgrounds as a uint8 tensor in [0, 255]; None when backgrounds is not provided.
Converts the output of Sapiens2ForImageMatting into alpha mattes and foreground maps.
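Because the foreground is pre-multiplied by alpha, compositing it over a background follows the standard "over" operator: `composite = foreground + (1 - alpha) * background`. The per-pixel sketch below assumes that formula; this page does not spell out the exact arithmetic the processor uses, and `composite_over` and `to_uint8` are hypothetical helpers.

```python
def composite_over(foreground, alpha, background):
    """Composite one pre-multiplied RGB pixel over a background pixel.

    All values are floats in [0, 1]. The foreground already includes the
    alpha factor, so only the background is weighted by (1 - alpha).
    """
    return tuple(f + (1.0 - alpha) * b for f, b in zip(foreground, background))


def to_uint8(pixel):
    """Convert a float pixel in [0, 1] to integer channel values in [0, 255]."""
    return tuple(max(0, min(255, round(c * 255))) for c in pixel)


pixel = composite_over(foreground=(0.75, 0.0, 0.25), alpha=0.75, background=(1.0, 1.0, 0.0))
print(pixel)            # (1.0, 0.25, 0.25)
print(to_uint8(pixel))
```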
post_process_normal_estimation
< source >( outputs: Sapiens2NormalEstimatorOutput source_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None target_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None do_remove_padding: bool | None = None )
Parameters
- outputs (Sapiens2NormalEstimatorOutput) — Raw outputs of the model.
- source_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Original (height, width) of each image before preprocessing. When provided, the padding added during preprocessing is removed and predictions are resized back to the original image size (unless target_sizes overrides the final size).
- target_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Requested final (height, width) for each prediction. When provided, used as the resize target instead of source_sizes. Resized with bilinear interpolation after L2 normalization.
- do_remove_padding (bool, optional) — Whether to crop away the zero-padding added during preprocessing before resizing. Defaults to True when source_sizes is provided, False otherwise.
Converts the output of Sapiens2ForNormalEstimation into L2-normalized surface normal maps.
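L2 normalization rescales each per-pixel normal vector to unit length. The sketch below shows the operation on a single vector; `l2_normalize` is a hypothetical helper, and the `eps` guard against division by zero is an assumption, not a documented detail of the processor.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm, guarding against a zero-length input."""
    norm = math.sqrt(sum(c * c for c in vector))
    return tuple(c / max(norm, eps) for c in vector)


print(l2_normalize((3.0, 4.0, 0.0)))  # (0.6, 0.8, 0.0)
```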
post_process_pointmap_estimation
< source >( outputs: Sapiens2PointmapEstimatorOutput source_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None target_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None do_remove_padding: bool | None = None )
Parameters
- outputs (Sapiens2PointmapEstimatorOutput) — Raw outputs of the model.
- source_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Original (height, width) of each image before preprocessing. When provided, the padding added during preprocessing is removed and predictions are resized back to the original image size (unless target_sizes overrides the final size).
- target_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Requested final (height, width) for each prediction. Overrides source_sizes as the resize target.
- do_remove_padding (bool, optional) — Whether to crop away the zero-padding added during preprocessing before resizing. Defaults to True when source_sizes is provided, False otherwise.
Converts the output of Sapiens2ForPointmapEstimation into pointmap tensors in image space.
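The way `source_sizes`, `target_sizes` and `do_remove_padding` interact can be summarized per image as follows. `resolve_postprocess_options` is a hypothetical helper that only restates the parameter descriptions; returning `None` as the final size to mean "keep the model output resolution" is an assumption.

```python
def resolve_postprocess_options(source_size=None, target_size=None, do_remove_padding=None):
    """Return (do_remove_padding, final_size) for one image.

    Sizes are (height, width) tuples or None.
    """
    if do_remove_padding is None:
        # Documented default: remove padding only when the original size is known.
        do_remove_padding = source_size is not None
    # target_size, when given, overrides source_size as the resize target.
    final_size = target_size if target_size is not None else source_size
    return do_remove_padding, final_size


print(resolve_postprocess_options(source_size=(480, 640)))
# (True, (480, 640))
print(resolve_postprocess_options(source_size=(480, 640), target_size=(240, 320)))
# (True, (240, 320))
print(resolve_postprocess_options())
# (False, None)
```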
post_process_pose_estimation
< source >( outputs: Sapiens2PoseEstimatorOutput boxes: list outputs_flipped: transformers.models.sapiens2.modeling_sapiens2.Sapiens2PoseEstimatorOutput | None = None kernel_size: int = 11 threshold: float | None = None source_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None target_sizes: transformers.utils.generic.TensorType | list[tuple[int, int]] | None = None ) → list[list[dict]]
Parameters
- outputs (Sapiens2PoseEstimatorOutput) — Raw outputs of the model. outputs.heatmaps must have shape (N_total, num_keypoints, heatmap_height, heatmap_width) where N_total = sum(len(b) for b in boxes).
- boxes (list[list[list[float]]] or np.ndarray) — List or array of bounding boxes for each image in absolute pixel coordinates. Each box should be a list of 4 floats representing the bounding box coordinates in COCO format (top_left_x, top_left_y, width, height). Must match the boxes argument passed to preprocess.
- outputs_flipped (Sapiens2PoseEstimatorOutput, optional) — Outputs from running the model on horizontally flipped inputs. When provided, heatmaps are averaged with outputs before keypoint extraction to improve accuracy: avg_heatmaps = (outputs.heatmaps + outputs_flipped.heatmaps) / 2.
- kernel_size (int, optional, defaults to 11) — Kernel size for the Gaussian blur used in UDP Dark Pose refinement.
- threshold (float, optional) — Score threshold. Keypoints with scores at or below this value are filtered out from the result dictionaries.
- source_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Original (height, width) of each image in pixels. Required when target_sizes is provided, as the source coordinate space for scaling keypoints and bounding boxes.
- target_sizes (torch.Tensor or list[tuple[int, int]] of length batch_size, optional) — Desired output (height, width) coordinate space for each image. When provided alongside source_sizes, keypoint coordinates and bounding boxes are scaled from source to target space.
Returns
list[list[dict]]
Outer list is over images, inner list is over persons. Each dict contains:
- keypoints (torch.FloatTensor of shape (num_keypoints, 2)): absolute x/y coordinates in the source image space, or in target space if target_sizes is provided.
- scores (torch.FloatTensor of shape (num_keypoints,)): per-keypoint confidence.
- labels (torch.LongTensor of shape (num_keypoints,)): keypoint indices.
- bbox (torch.FloatTensor of shape (4,)): bounding box in absolute (x_min, y_min, x_max, y_max) format, in the same coordinate space as keypoints.
Converts the output of Sapiens2ForPoseEstimation into keypoint predictions in image space.
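The box formats and coordinate handling described above can be illustrated with a few pure-Python helpers. These are hypothetical sketches of the documented conventions (COCO input boxes, corner-format output boxes, source-to-target scaling, strict score thresholding), not the processor's implementation, and the box values are made up.

```python
def coco_to_xyxy(box):
    """Convert (top_left_x, top_left_y, width, height) to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def scale_points(points, source_size, target_size):
    """Scale (x, y) points from a source (height, width) space to a target one."""
    src_h, src_w = source_size
    tgt_h, tgt_w = target_size
    scale_x, scale_y = tgt_w / src_w, tgt_h / src_h
    return [(x * scale_x, y * scale_y) for x, y in points]


def keep_indices(scores, threshold=None):
    """Indices of keypoints to keep; scores at or below the threshold are dropped."""
    if threshold is None:
        return list(range(len(scores)))
    return [i for i, score in enumerate(scores) if score > threshold]


# Two images: one person in the first, two in the second (made-up boxes).
boxes = [[[10.0, 20.0, 100.0, 200.0]], [[0.0, 0.0, 50.0, 80.0], [5.0, 5.0, 20.0, 40.0]]]
n_total = sum(len(b) for b in boxes)  # expected first dimension of outputs.heatmaps
print(n_total)                                                   # 3
print(coco_to_xyxy(boxes[0][0]))                                 # (10.0, 20.0, 110.0, 220.0)
print(scale_points([(100.0, 50.0)], (200, 400), (100, 200)))     # [(50.0, 25.0)]
print(keep_indices([0.9, 0.5, 0.2], threshold=0.5))              # [0]
```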
post_process_semantic_segmentation
< source >( outputs target_sizes: list[tuple] | None = None ) → semantic_segmentation
Parameters
- outputs (Sapiens2ForSemanticSegmentation) — Raw outputs of the model.
- target_sizes (list[Tuple] of length batch_size, optional) — List of tuples corresponding to the requested final size (height, width) of each prediction. If unset, predictions will not be resized.
Returns
semantic_segmentation
list[torch.Tensor] of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.
Converts the output of Sapiens2ForSemanticSegmentation into semantic segmentation maps.
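A semantic segmentation map of class ids is typically obtained by taking the argmax over the class dimension of the per-pixel logits. The sketch below assumes that convention, since this page does not state the exact steps, and uses nested lists in place of tensors; `argmax_over_classes` is a hypothetical helper.

```python
def argmax_over_classes(logits):
    """Turn logits[num_classes][height][width] into class ids of shape [height][width]."""
    num_classes = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    return [
        [max(range(num_classes), key=lambda c: logits[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Two classes on a 1x2 image: class 1 wins the left pixel, class 0 the right one.
logits = [
    [[0.1, 2.0]],  # class 0 scores
    [[1.5, 0.3]],  # class 1 scores
]
print(argmax_over_classes(logits))  # [[1, 0]]
```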
Sapiens2Model
class transformers.Sapiens2Model
< source >( config: Sapiens2Config )
Parameters
- config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Sapiens2 Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: Tensor bool_masked_pos: torch.Tensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → BaseModelOutputWithPooling or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details (processor_class uses Sapiens2ImageProcessor for processing images).
- bool_masked_pos (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Boolean masked positions. Indicates which patches are masked (1) and which aren't (0). Only relevant for pre-training.
Returns
BaseModelOutputWithPooling or tuple(torch.FloatTensor)
A BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Sapiens2Config) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The Sapiens2Model forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, AutoModel
>>> from transformers.image_utils import load_image
>>> import torch
>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-pretrain-0.4b")
>>> model = AutoModel.from_pretrained("facebook/sapiens2-pretrain-0.4b")
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs)
>>> cls_token = outputs.pooler_output
>>> cls_token.shape
torch.Size([1, 1024])
Sapiens2Backbone
class transformers.Sapiens2Backbone
< source >( config: Sapiens2Config )
Parameters
- config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Sapiens2 backbone.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: Tensor **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → Sapiens2BackboneOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details (processor_class uses Sapiens2ImageProcessor for processing images).
Returns
Sapiens2BackboneOutput or tuple(torch.FloatTensor)
A Sapiens2BackboneOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Sapiens2Config) and inputs.
- feature_maps (tuple(torch.FloatTensor) of shape (batch_size, num_channels, height, width)) — Feature maps of the stages.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size) or (batch_size, num_channels, height, width), depending on the backbone. Hidden-states of the model at the output of each stage plus the initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Only applicable if the backbone uses attention. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cls_tokens (tuple(torch.FloatTensor), optional) — CLS token from each selected feature stage, each of shape (batch_size, hidden_size). Only present when config.return_class_token=True.
The Sapiens2Backbone forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoBackbone, AutoImageProcessor
>>> from transformers.image_utils import load_image
>>> import torch
>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-pretrain-0.4b")
>>> model = AutoBackbone.from_pretrained("facebook/sapiens2-pretrain-0.4b")
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs, return_class_token=True)
>>> outputs.feature_maps[0].shape
torch.Size([1, 1024, 64, 48])
>>> outputs.cls_tokens[0].shape
torch.Size([1, 1024])
Sapiens2ForImageMatting
class transformers.Sapiens2ForImageMatting
< source >( config: Sapiens2Config )
Parameters
- config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Sapiens2 model with a matting head on top (a PixelShuffle-based decoder that predicts a pre-multiplied RGB foreground and an alpha matte).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: FloatTensor, labels: torch.FloatTensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → Sapiens2ImageMattingOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details (processor_class uses Sapiens2ImageProcessor for processing images).
- labels (torch.FloatTensor of shape (batch_size, 4, height, width), optional) — Ground-truth matting targets for computing the loss.
Returns
Sapiens2ImageMattingOutput or tuple(torch.FloatTensor)
A Sapiens2ImageMattingOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (Sapiens2Config) and inputs.
The Sapiens2ForImageMatting forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the
Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Loss.
- alphas (torch.FloatTensor of shape (batch_size, 1, height, width)) — Estimated alpha values.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
- attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- foregrounds (torch.FloatTensor of shape (batch_size, 3, height, width)) — Pre-multiplied RGB foreground predictions in [0, 1] (sigmoid-activated).
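Because the foreground is pre-multiplied by alpha, compositing it over a new background needs no further multiplication of the foreground term. The sketch below shows the standard pre-multiplied "over" operation for a single pixel using plain Python numbers; the helper name `composite_premultiplied` is hypothetical and is not part of the library.

```python
def composite_premultiplied(foreground, alpha, background):
    """Composite one pixel whose foreground is already multiplied by alpha.

    foreground, background: (r, g, b) tuples in [0, 1]; alpha: float in [0, 1].
    """
    return tuple(f + (1.0 - alpha) * b for f, b in zip(foreground, background))


# A half-transparent red pixel (pre-multiplied: 0.5 * (1, 0, 0)) over a blue background.
pixel = composite_premultiplied((0.5, 0.0, 0.0), 0.5, (0.0, 0.0, 1.0))
print(pixel)  # (0.5, 0.0, 0.5)
```

With tensors, the same expression, `foregrounds + (1 - alphas) * background`, should broadcast over the channel dimension, assuming `background` has shape (batch_size, 3, height, width).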
Example:
>>> from transformers import AutoImageProcessor, AutoModel
>>> from transformers.image_utils import load_image
>>> import torch
>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-matting-1b")
>>> model = AutoModel.from_pretrained("facebook/sapiens2-matting-1b")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs)
>>> outputs.alphas.shape
torch.Size([1, 1, 1024, 768])
>>> outputs.foregrounds.shape
torch.Size([1, 3, 1024, 768])
Sapiens2ForNormalEstimation
class transformers.Sapiens2ForNormalEstimation
< source >( config: Sapiens2Config )
Parameters
- config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Sapiens2 model with a normal estimation head on top (a PixelShuffle-based decoder that predicts surface normal maps).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: FloatTensor, labels: torch.FloatTensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → Sapiens2NormalEstimatorOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details (processor_class uses Sapiens2ImageProcessor for processing images).
- labels (torch.FloatTensor of shape (batch_size, num_labels, height, width), optional) — Ground-truth surface normal maps for computing the loss.
Returns
Sapiens2NormalEstimatorOutput or tuple(torch.FloatTensor)
A Sapiens2NormalEstimatorOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (Sapiens2Config) and inputs.
The Sapiens2ForNormalEstimation forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the
Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Normal estimation loss.
- normals (torch.FloatTensor of shape (batch_size, num_labels, height, width)) — Raw normal map predictions as output by the model (unnormalized).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one per layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax.
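Since the returned normals are unnormalized, a common post-processing step is to rescale each per-pixel vector to unit length. The sketch below does this for a single (x, y, z) prediction with plain Python; the helper name `normalize_normal` and the `eps` guard are assumptions for illustration, not part of the library.

```python
import math


def normalize_normal(vector, eps=1e-8):
    """Scale a raw (x, y, z) prediction to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(c * c for c in vector))
    return tuple(c / max(norm, eps) for c in vector)


unit = normalize_normal((3.0, 0.0, 4.0))
print(unit)  # (0.6, 0.0, 0.8)
```

On the full output tensor, `torch.nn.functional.normalize(outputs.normals, dim=1)` applies the same L2 normalization along the channel dimension, assuming the channel dimension holds the three normal components.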
Example:
>>> from transformers import AutoImageProcessor, AutoModel
>>> from transformers.image_utils import load_image
>>> import torch
>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-normal-0.4b")
>>> model = AutoModel.from_pretrained("facebook/sapiens2-normal-0.4b")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs)
>>> outputs.normals.shape
torch.Size([1, 3, 1024, 768])
Sapiens2ForPointmapEstimation
class transformers.Sapiens2ForPointmapEstimation
< source >( config: Sapiens2Config )
Parameters
- config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Sapiens2 model with a pointmap head on top (a PixelShuffle-based decoder that predicts per-pixel 3D XYZ coordinates, plus an optional scale branch for focal-length normalization).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: FloatTensor, labels: torch.FloatTensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → Sapiens2PointmapEstimatorOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details (processor_class uses Sapiens2ImageProcessor for processing images).
- labels (torch.FloatTensor of shape (batch_size, 3, height, width), optional) — Ground-truth pointmap for computing the loss.
Returns
Sapiens2PointmapEstimatorOutput or tuple(torch.FloatTensor)
A Sapiens2PointmapEstimatorOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (Sapiens2Config) and inputs.
The Sapiens2ForPointmapEstimation forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the
Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Pointmap estimation loss.
- pointmaps (torch.FloatTensor of shape (batch_size, 3, height, width)) — Per-pixel 3D XYZ coordinate predictions in canonical camera space.
- scales (torch.FloatTensor of shape (batch_size, 1), optional) — Ratio of the canonical focal length to the actual focal length. None when no scale branch is configured.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one per layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax.
Example:
>>> from transformers import AutoImageProcessor, AutoModel
>>> from transformers.image_utils import load_image
>>> import torch
>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-pointmap-0.4b")
>>> model = AutoModel.from_pretrained("facebook/sapiens2-pointmap-0.4b")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs)
>>> outputs.pointmaps.shape
torch.Size([1, 3, 1024, 768])
Sapiens2ForPoseEstimation
class transformers.Sapiens2ForPoseEstimation
< source >( config: Sapiens2Config )
Parameters
- config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Sapiens2 model with a pose estimation head on top (a set of heatmap predictors on top of the hidden states output).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: FloatTensor, flip_pairs: torch.Tensor | None = None, labels: torch.FloatTensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → Sapiens2PoseEstimatorOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details (processor_class uses Sapiens2ImageProcessor for processing images).
- flip_pairs (torch.Tensor of shape (num_pairs, 2), optional) — Pairs of keypoints which are mirrored (for example, left ear and right ear), used for test-time flip augmentation. When provided, the model assumes pixel_values contains horizontally-flipped images and calls flip_back on the output heatmaps to restore the original orientation.
- labels (torch.FloatTensor of shape (batch_size, num_keypoints, height, width), optional) — Heatmap ground truth for computing the loss.
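To illustrate what restoring the original orientation involves, the sketch below mirrors each heatmap along the width axis and then swaps the channels of each mirrored keypoint pair. It is a conceptual illustration on nested Python lists, not the library's flip_back implementation; the name `flip_back_sketch` and the data layout are assumptions.

```python
def flip_back_sketch(heatmaps, flip_pairs):
    """Undo a horizontal flip on per-keypoint heatmaps (conceptual sketch only).

    heatmaps: list of 2D lists, indexed [keypoint][row][col].
    flip_pairs: iterable of (left_index, right_index) keypoint pairs.
    """
    # Mirror every heatmap along the width axis.
    restored = [[row[::-1] for row in heatmap] for heatmap in heatmaps]
    # Swap the channels of mirrored keypoints (e.g. left ear <-> right ear).
    for left, right in flip_pairs:
        restored[left], restored[right] = restored[right], restored[left]
    return restored


flipped = [[[0, 1]], [[2, 3]]]  # two keypoints, each a 1x2 heatmap
print(flip_back_sketch(flipped, [(0, 1)]))  # [[[3, 2]], [[1, 0]]]
```

In test-time flip augmentation, heatmaps restored this way are typically averaged with the heatmaps predicted from the unflipped image.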
Returns
Sapiens2PoseEstimatorOutput or tuple(torch.FloatTensor)
A Sapiens2PoseEstimatorOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (Sapiens2Config) and inputs.
The Sapiens2ForPoseEstimation forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the
Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Pose estimation loss.
- heatmaps (torch.FloatTensor of shape (batch_size, num_keypoints, height, width)) — Heatmaps as predicted by the model.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
- attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
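A heatmap can be turned into a keypoint by taking the location of its peak. The sketch below does this for one 2D heatmap given as nested Python lists. The helper name `heatmap_to_keypoint` is hypothetical, and the scale of 4.0 is an assumption derived from the 256×192 heatmaps in the example for this method and the default 1024×768 processor resolution; mapping the result back to the original image through the bounding box is not shown.

```python
def heatmap_to_keypoint(heatmap, scale=4.0):
    """Return (x, y, score) for the peak of one 2D heatmap.

    scale maps heatmap coordinates to model-input coordinates; 4.0 is an
    assumption for illustration, not a value read from the model configuration.
    """
    score, row, col = max(
        (value, r, c)
        for r, line in enumerate(heatmap)
        for c, value in enumerate(line)
    )
    return col * scale, row * scale, score


heatmap = [
    [0.0, 0.1, 0.0],
    [0.0, 0.2, 0.9],
]
print(heatmap_to_keypoint(heatmap))  # (8.0, 4.0, 0.9)
```

The peak value can serve as a rough confidence score for the keypoint.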
Example:
>>> from transformers import AutoImageProcessor, AutoModel
>>> from transformers.image_utils import load_image
>>> import torch
>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-pose-0.4b")
>>> model = AutoModel.from_pretrained("facebook/sapiens2-pose-0.4b")
>>> boxes = [[[270.8, 0.6, 294.1, 379.5]]]
>>> inputs = image_processor(image, boxes=boxes, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs)
>>> outputs.heatmaps.shape
torch.Size([1, 308, 256, 192])
Sapiens2ForSemanticSegmentation
class transformers.Sapiens2ForSemanticSegmentation
< source >( config: Sapiens2Config )
Parameters
- config (Sapiens2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Sapiens2 Model with a semantic segmentation head on top e.g. for ADE20K, CityScapes.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: FloatTensor, labels: torch.LongTensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → SemanticSegmenterOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using Sapiens2ImageProcessor. See Sapiens2ImageProcessor.__call__() for details (processor_class uses Sapiens2ImageProcessor for processing images).
- labels (torch.LongTensor of shape (batch_size, height, width), optional) — Ground truth semantic segmentation maps for computing the loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels > 1, a classification loss is computed (Cross-Entropy).
Returns
SemanticSegmenterOutput or tuple(torch.FloatTensor)
A SemanticSegmenterOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (Sapiens2Config) and inputs.
The Sapiens2ForSemanticSegmentation forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the
Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- logits (torch.FloatTensor of shape (batch_size, config.num_labels, logits_height, logits_width)) — Classification scores for each pixel. The logits returned do not necessarily have the same size as the pixel_values passed as inputs. This avoids doing two interpolations and losing some quality when a user needs to resize the logits to the original image size as post-processing. You should always check your logits shape and resize as needed.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, patch_size, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
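Per-pixel class labels are obtained by taking the highest-scoring class at each location. The sketch below shows that step on a tiny logits grid stored as nested Python lists; the helper name `logits_to_labels` and the data layout are assumptions for illustration.

```python
def logits_to_labels(logits):
    """Pick the highest-scoring class for every pixel.

    logits: nested list indexed [class][row][col]; returns a [row][col] label map.
    """
    num_classes = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    return [
        [max(range(num_classes), key=lambda k: logits[k][r][c]) for c in range(width)]
        for r in range(height)
    ]


logits = [
    [[2.0, 0.1]],  # class 0 scores for a 1x2 image
    [[0.5, 1.5]],  # class 1 scores
]
print(logits_to_labels(logits))  # [[0, 1]]
```

With tensors, `outputs.logits.argmax(dim=1)` performs the same reduction over the class dimension. If the logits are smaller than the target image, resize them first (for example with `torch.nn.functional.interpolate`) so that only one interpolation is applied.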
Example:
>>> from transformers import AutoImageProcessor, AutoModel
>>> from transformers.image_utils import load_image
>>> import torch
>>> image = load_image("http://images.cocodataset.org/val2017/000000004016.jpg")
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/sapiens2-seg-0.4b")
>>> model = AutoModel.from_pretrained("facebook/sapiens2-seg-0.4b")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.inference_mode():
...     outputs = model(**inputs)
>>> outputs.logits.shape
torch.Size([1, 29, 1024, 768])