Transformers documentation

Sapiens2

You are viewing version v5.10.0. A newer version, v5.10.1, is available.

This model was published in HF papers on 2026-04-23 and contributed to Hugging Face Transformers on 2026-06-03.

Overview

The Sapiens2 model was proposed in Sapiens2 by Rawal Khirodkar, He Wen, Julieta Martinez, Yuan Dong, Zhaoen Su, and Shunsuke Saito. Sapiens2 is a family of high-resolution vision transformers pretrained on ~1 billion curated human images and designed for human-centric computer vision tasks, including pose estimation, body-part segmentation, surface normal estimation, and pointmap estimation.

You can find all the original Sapiens2 checkpoints under the Sapiens2 collection.

The abstract from the paper is the following:

We present Sapiens2, a family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. We pretrain on ~1 billion curated high-quality human images with improved task annotations and combine masked image reconstruction with self-distilled contrastive objectives to learn both low-level and semantic features. Our models scale from 0.4B to 5B parameters and train at native 1K resolution, with hierarchical 4K variants for extended spatial reasoning. Sapiens2 achieves substantial improvements over its predecessor: +4 mAP in pose estimation, +24.3 mIoU in body-part segmentation, and 45.6% error reduction in normal estimation, while extending to new tasks like pointmap and albedo estimation. Code is publicly available.

Tips:

Sapiens2 uses Rotary Position Embeddings (RoPE) and supports arbitrary input resolutions. The default image processor resizes images to 1024×768 (height×width).
The model uses Grouped Query Attention (GQA) in the middle layers and full multi-head attention in the first and last 8 layers.
Register tokens (8 by default) reduce high-norm artifacts in patch tokens, yielding cleaner attention maps and better performance on dense prediction tasks.
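
The layer layout and token count described in the tips can be made concrete with a small sketch in plain Python. It reads "first and last 8 layers" as 8 layers at each end, and it assumes a patch size of 16 and a 24-layer depth, neither of which is stated on this page. `attention_schedule` and `sequence_length` are hypothetical helper names, not part of Transformers.

```python
def attention_schedule(num_layers: int, num_full_edge_layers: int = 8) -> list[str]:
    """Return the attention type per layer: full MHA at both ends, GQA in between."""
    schedule = []
    for layer_idx in range(num_layers):
        near_start = layer_idx < num_full_edge_layers
        near_end = layer_idx >= num_layers - num_full_edge_layers
        schedule.append("mha" if near_start or near_end else "gqa")
    return schedule


def sequence_length(
    height: int, width: int, patch_size: int = 16, num_register_tokens: int = 8
) -> int:
    """Count patch tokens plus register tokens (any class token is ignored here)."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size) + num_register_tokens


schedule = attention_schedule(num_layers=24)  # 24 layers is an illustrative depth
print(schedule.count("mha"), schedule.count("gqa"))  # 16 8
print(sequence_length(1024, 768))  # 64 * 48 patches + 8 registers = 3080
```

With a different patch size or depth the numbers change, but the structure stays the same: register tokens add a fixed overhead, and only the middle of the stack uses GQA.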

This model was contributed by guarin. The original code can be found here.

Usage examples

AutoModel
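
A minimal sketch of feature extraction through the generic `AutoModel` and `AutoImageProcessor` entry points. The helper name `extract_features` is hypothetical, no checkpoint ID is hard-coded because this page does not list any, and the `last_hidden_state` field is an assumption based on how most Transformers vision models expose their outputs.

```python
def extract_features(checkpoint: str, image_path: str):
    """Run a Sapiens2 checkpoint through AutoModel and return its final hidden states."""
    # Imported lazily so that defining this helper does not require the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Assumption: the output object exposes `last_hidden_state`.
    return outputs.last_hidden_state
```

Pass a checkpoint ID taken from the Sapiens2 collection and a path to a local image.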

AutoBackbone
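
A minimal sketch of using Sapiens2 as a feature backbone through `AutoBackbone`, for example to feed a custom dense-prediction head. The helper name `backbone_feature_shapes` is hypothetical and the checkpoint ID is left to the caller, since this page does not list any.

```python
def backbone_feature_shapes(checkpoint: str, image_path: str) -> list[tuple[int, ...]]:
    """Load a Sapiens2 backbone and report the shape of each returned feature map."""
    # Imported lazily so that defining this helper does not require the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import AutoBackbone, AutoImageProcessor

    processor = AutoImageProcessor.from_pretrained(checkpoint)
    backbone = AutoBackbone.from_pretrained(checkpoint)
    backbone.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = backbone(**inputs)
    # `feature_maps` is the standard field on Transformers backbone outputs.
    return [tuple(feature_map.shape) for feature_map in outputs.feature_maps]
```

Which stages are returned depends on the backbone configuration stored with the checkpoint.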

Normal estimation
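
A common post-processing step for predicted surface normals is to rescale each per-pixel vector to unit length and, for visualization, map it to RGB. The sketch below shows that step in plain Python. It is an illustration of the general technique, not the implementation of `post_process_normal_estimation`, whose signature is not shown on this page. Both function names are hypothetical.

```python
import math


def normalize_normal(nx: float, ny: float, nz: float, eps: float = 1e-6) -> tuple[float, float, float]:
    """Rescale a predicted normal vector to unit length, guarding against zero vectors."""
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    scale = 1.0 / max(length, eps)
    return (nx * scale, ny * scale, nz * scale)


def normal_to_rgb(normal: tuple[float, float, float]) -> tuple[int, ...]:
    """Map components from [-1, 1] to 8-bit color values for visualization."""
    return tuple(round((component + 1.0) * 0.5 * 255) for component in normal)


unit_normal = normalize_normal(3.0, 4.0, 0.0)
print(unit_normal)  # approximately (0.6, 0.8, 0.0)
print(normal_to_rgb(unit_normal))
```

On real model outputs the same arithmetic would be applied to a whole tensor at once rather than one pixel at a time.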

Pointmap estimation
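
A pointmap assigns a 3D point to every pixel. The sketch below shows two typical ways to consume one, assuming camera-space `(x, y, z)` coordinates with `z` as depth, which this page does not confirm for Sapiens2. The function names are hypothetical and nested lists stand in for tensors.

```python
def pointmap_to_depth(pointmap):
    """Keep only the z coordinate of each pixel's 3D point, giving a depth map."""
    return [[point[2] for point in row] for row in pointmap]


def pointmap_to_point_cloud(pointmap, max_depth=float("inf")):
    """Flatten the pointmap into a list of points, dropping invalid or distant ones."""
    return [
        point
        for row in pointmap
        for point in row
        if 0.0 < point[2] <= max_depth
    ]


# A 1x3 pointmap: two valid points and one at zero depth (treated as invalid).
pointmap = [[(0.0, 0.0, 1.5), (0.1, 0.0, 4.0), (0.0, 0.0, 0.0)]]
print(pointmap_to_depth(pointmap))  # [[1.5, 4.0, 0.0]]
print(pointmap_to_point_cloud(pointmap, max_depth=2.0))  # [(0.0, 0.0, 1.5)]
```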

Pose estimation
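
Heatmap-based pose estimators predict one score map per keypoint and decode each map by locating its peak. The sketch below shows that decoding step in plain Python, assuming the pose head emits one heatmap per keypoint, which this page does not confirm. It is not the implementation of `post_process_pose_estimation`, and `heatmap_to_keypoint` is a hypothetical name.

```python
def heatmap_to_keypoint(heatmap, image_height, image_width):
    """Return (x, y, score) of the heatmap peak, rescaled to image pixel coordinates."""
    heatmap_height = len(heatmap)
    heatmap_width = len(heatmap[0])
    # Find the highest score together with its row and column.
    score, peak_y, peak_x = max(
        (value, y, x)
        for y, row in enumerate(heatmap)
        for x, value in enumerate(row)
    )
    # Map the center of the peak cell back to image coordinates.
    x = (peak_x + 0.5) * image_width / heatmap_width
    y = (peak_y + 0.5) * image_height / heatmap_height
    return x, y, score


heatmap = [
    [0.1, 0.2],
    [0.9, 0.3],
]
print(heatmap_to_keypoint(heatmap, image_height=4, image_width=4))  # (1.0, 3.0, 0.9)
```

Production decoders usually add sub-pixel refinement around the peak; that step is left out here.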

Pose estimation with flip augmentation
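
Flip augmentation runs the model on the image and on its mirror image, then averages the two predictions. The mirrored prediction has to be flipped back spatially, and left/right keypoint channels have to be swapped, before averaging. The sketch below shows that merge on nested lists indexed as `heatmaps[keypoint][y][x]`. The function names and the `flip_pairs` argument are hypothetical, and the real keypoint pairing depends on the checkpoint's keypoint definition.

```python
def flip_horizontal(heatmaps):
    """Mirror every heatmap left-to-right, returning new lists."""
    return [[row[::-1] for row in heatmap] for heatmap in heatmaps]


def merge_flipped_heatmaps(heatmaps, flipped_heatmaps, flip_pairs):
    """Average predictions from the original image and from its mirrored copy."""
    # Undo the mirroring so both predictions share one coordinate frame.
    restored = flip_horizontal(flipped_heatmaps)
    # In the mirrored image a left joint looks like a right joint, so swap channels.
    for left, right in flip_pairs:
        restored[left], restored[right] = restored[right], restored[left]
    return [
        [
            [(a + b) / 2 for a, b in zip(row, restored_row)]
            for row, restored_row in zip(heatmap, restored_heatmap)
        ]
        for heatmap, restored_heatmap in zip(heatmaps, restored)
    ]


# Two keypoints (0 = left eye, 1 = right eye) on a 1x2 heatmap.
original = [[[1.0, 0.0]], [[0.0, 1.0]]]
mirrored = [[[1.0, 0.0]], [[0.0, 1.0]]]  # what a consistent model predicts on the flipped image
print(merge_flipped_heatmaps(original, mirrored, flip_pairs=[(0, 1)]))
# [[[1.0, 0.0]], [[0.0, 1.0]]]
```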

Semantic segmentation
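
Segmentation heads typically emit one logit map per class, and the label map is obtained by taking the highest-scoring class at each pixel. The sketch below shows that argmax step on nested lists indexed as `logits[class][y][x]`. It illustrates the general technique rather than the behavior of `post_process_semantic_segmentation`, and `logits_to_label_map` is a hypothetical name.

```python
def logits_to_label_map(logits):
    """Pick the highest-scoring class index at every pixel."""
    num_classes = len(logits)
    height = len(logits[0])
    width = len(logits[0][0])
    return [
        [
            max(range(num_classes), key=lambda class_idx: logits[class_idx][y][x])
            for x in range(width)
        ]
        for y in range(height)
    ]


# Two classes on a 1x2 image: class 1 wins at the first pixel, class 0 at the second.
logits = [
    [[0.1, 2.0]],  # class 0
    [[1.5, 0.3]],  # class 1
]
print(logits_to_label_map(logits))  # [[1, 0]]
```

Resizing the logits to the original image size before the argmax is a common extra step that is omitted here.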

Matting
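
A matting model predicts a per-pixel alpha value in [0, 1] that can be used to place the subject on a new background. The sketch below shows the standard compositing formula, `alpha * foreground + (1 - alpha) * background`, on nested lists of RGB tuples. The function name `composite` is hypothetical, and the output format of `post_process_image_matting` is not shown on this page.

```python
def composite(foreground, background, alpha):
    """Blend two images per pixel using the predicted alpha matte."""
    return [
        [
            tuple(a * f + (1.0 - a) * b for f, b in zip(fg_pixel, bg_pixel))
            for fg_pixel, bg_pixel, a in zip(fg_row, bg_row, alpha_row)
        ]
        for fg_row, bg_row, alpha_row in zip(foreground, background, alpha)
    ]


foreground = [[(255, 0, 0), (255, 0, 0)]]  # red subject
background = [[(0, 0, 255), (0, 0, 255)]]  # blue backdrop
alpha = [[1.0, 0.0]]  # first pixel is subject, second is background
print(composite(foreground, background, alpha))
# [[(255.0, 0.0, 0.0), (0.0, 0.0, 255.0)]]
```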

Sapiens2Config

class transformers.Sapiens2Config

Sapiens2HeadConfig

class transformers.Sapiens2HeadConfig

Sapiens2ImageProcessor

class transformers.Sapiens2ImageProcessor

preprocess

post_process_image_matting

post_process_normal_estimation

post_process_pointmap_estimation

post_process_pose_estimation

post_process_semantic_segmentation

Sapiens2Model

class transformers.Sapiens2Model

forward

Sapiens2Backbone

class transformers.Sapiens2Backbone

forward

Sapiens2ForImageMatting

class transformers.Sapiens2ForImageMatting

forward

Sapiens2ForNormalEstimation

class transformers.Sapiens2ForNormalEstimation

forward

Sapiens2ForPointmapEstimation

class transformers.Sapiens2ForPointmapEstimation

forward

Sapiens2ForPoseEstimation

class transformers.Sapiens2ForPoseEstimation

forward

Sapiens2ForSemanticSegmentation

class transformers.Sapiens2ForSemanticSegmentation

forward