# AutoencoderRAE

The Representation Autoencoder (RAE) model was introduced in [Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2510.11690) by Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie from NYU VISIONx.

RAE combines a frozen pretrained vision encoder (DINOv2, SigLIP2, or MAE) with a trainable ViT-MAE-style decoder. In the two-stage RAE training recipe, the autoencoder is trained in stage 1 (reconstruction), and a diffusion model is then trained on the resulting latent space in stage 2 (generation).

The following RAE models are released and supported in Diffusers:
| Model | Encoder | Latent shape |
|:------|:--------|:-------------|
| [`nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08) | DINOv2-base | 768 x 16 x 16 |
| [`nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08-i512`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08-i512) | DINOv2-base (512px) | 768 x 32 x 32 |
| [`nyu-visionx/RAE-dinov2-wReg-small-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-small-ViTXL-n08) | DINOv2-small | 384 x 16 x 16 |
| [`nyu-visionx/RAE-dinov2-wReg-large-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-large-ViTXL-n08) | DINOv2-large | 1024 x 16 x 16 |
| [`nyu-visionx/RAE-siglip2-base-p16-i256-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-siglip2-base-p16-i256-ViTXL-n08) | SigLIP2-base | 768 x 16 x 16 |
| [`nyu-visionx/RAE-mae-base-p16-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-mae-base-p16-ViTXL-n08) | MAE-base | 768 x 16 x 16 |
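For checkpoints whose encoder input size divides evenly by the encoder patch size, the latent shape follows directly from the encoder geometry: the channel dimension is the encoder's hidden size, and the spatial side is the input size divided by the patch size (one token per patch). A minimal sketch, using the DINOv2-base values (hidden size 768, patch size 14) and the SigLIP2-base values (patch size 16, 256px input):

```python
def rae_latent_shape(hidden_size: int, input_size: int, patch_size: int) -> tuple:
    """Latent shape (C, H, W) produced by the frozen encoder: one token per patch."""
    side = input_size // patch_size
    return (hidden_size, side, side)

# DINOv2-base at 224px: 224 // 14 = 16 tokens per side.
print(rae_latent_shape(768, 224, 14))  # (768, 16, 16)
# SigLIP2-base (p16) at 256px: 256 // 16 = 16 tokens per side.
print(rae_latent_shape(768, 256, 16))  # (768, 16, 16)
```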
## Loading a pretrained model

```python
from diffusers import AutoencoderRAE

model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()
```
## Encoding and decoding a real image

```python
import torch
from diffusers import AutoencoderRAE
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor, to_pil_image

model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
image = image.convert("RGB").resize((224, 224))
x = to_tensor(image).unsqueeze(0).to("cuda")  # (1, 3, 224, 224), values in [0, 1]

with torch.no_grad():
    latents = model.encode(x).latent  # (1, 768, 16, 16)
    recon = model.decode(latents).sample  # (1, 3, 256, 256)

recon_image = to_pil_image(recon[0].clamp(0, 1).cpu())
recon_image.save("recon.png")
```
## Latent normalization

Some pretrained checkpoints include per-channel `latents_mean` and `latents_std` statistics for normalizing the latent space. When present, `encode` and `decode` automatically apply the normalization and denormalization, respectively.

```python
import torch
from diffusers import AutoencoderRAE

model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

# x is an image tensor of shape (1, 3, 224, 224), prepared as in the previous example.
# Latent normalization is handled automatically inside encode/decode
# when the checkpoint config includes latents_mean/latents_std.
with torch.no_grad():
    latents = model.encode(x).latent  # normalized latents
    recon = model.decode(latents).sample
```
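Conceptually, the normalization applied inside `encode` is a per-channel standardization, and `decode` applies its inverse before running the decoder. A minimal sketch of that transform with made-up statistics (not values from any released checkpoint):

```python
import torch

# Hypothetical per-channel statistics; real checkpoints store these in the
# model config as latents_mean / latents_std when available.
latents_mean = torch.randn(768)
latents_std = torch.rand(768) + 0.5

raw = torch.randn(1, 768, 16, 16)  # unnormalized encoder features (C, H, W per sample)

# encode: standardize each channel; decode applies the inverse mapping.
mean = latents_mean.view(1, -1, 1, 1)
std = latents_std.view(1, -1, 1, 1)
normalized = (raw - mean) / std
denormalized = normalized * std + mean

# The round trip recovers the original latents up to floating-point error.
assert torch.allclose(denormalized, raw, atol=1e-5)
```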
| ## AutoencoderRAE[[diffusers.AutoencoderRAE]] | |
| #### diffusers.AutoencoderRAE[[diffusers.AutoencoderRAE]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_12652/src/diffusers/models/autoencoders/autoencoder_rae.py#L393) | |
| Representation Autoencoder (RAE) model for encoding images to latents and decoding latents to images. | |
| This model uses a frozen pretrained encoder (DINOv2, SigLIP2, or MAE) with a trainable ViT decoder to reconstruct | |
| images from learned representations. | |
| This model inherits from [ModelMixin](/docs/diffusers/pr_12652/en/api/models/overview#diffusers.ModelMixin). Check the superclass documentation for its generic methods implemented for | |
| all models (such as downloading or saving). | |
#### encode[[diffusers.AutoencoderRAE.encode]]
[Source](https://github.com/huggingface/diffusers/blob/vr_12652/src/diffusers/utils/accelerate_utils.py#L43)
`( *args, **kwargs )`
| **Parameters:** | |
| encoder_type (`str`, *optional*, defaults to `"dinov2"`) : Type of frozen encoder to use. One of `"dinov2"`, `"siglip2"`, or `"mae"`. | |
| encoder_hidden_size (`int`, *optional*, defaults to `768`) : Hidden size of the encoder model. | |
| encoder_patch_size (`int`, *optional*, defaults to `14`) : Patch size of the encoder model. | |
| encoder_num_hidden_layers (`int`, *optional*, defaults to `12`) : Number of hidden layers in the encoder model. | |
| patch_size (`int`, *optional*, defaults to `16`) : Decoder patch size (used for unpatchify and decoder head). | |
| encoder_input_size (`int`, *optional*, defaults to `224`) : Input size expected by the encoder. | |
| image_size (`int`, *optional*) : Decoder output image size. If `None`, it is derived from encoder token count and `patch_size` like RAE-main: `image_size = patch_size * sqrt(num_patches)`, where `num_patches = (encoder_input_size // encoder_patch_size) ** 2`. | |
| num_channels (`int`, *optional*, defaults to `3`) : Number of input/output channels. | |
| encoder_norm_mean (`list`, *optional*, defaults to `[0.485, 0.456, 0.406]`) : Channel-wise mean for encoder input normalization (ImageNet defaults). | |
| encoder_norm_std (`list`, *optional*, defaults to `[0.229, 0.224, 0.225]`) : Channel-wise std for encoder input normalization (ImageNet defaults). | |
| latents_mean (`list` or `tuple`, *optional*) : Optional mean for latent normalization. Tensor inputs are accepted and converted to config-serializable lists. | |
| latents_std (`list` or `tuple`, *optional*) : Optional standard deviation for latent normalization. Tensor inputs are accepted and converted to config-serializable lists. | |
| noise_tau (`float`, *optional*, defaults to `0.0`) : Noise level for training (adds noise to latents during training). | |
| reshape_to_2d (`bool`, *optional*, defaults to `True`) : Whether to reshape latents to 2D (B, C, H, W) format. | |
| use_encoder_loss (`bool`, *optional*, defaults to `False`) : Whether to use encoder hidden states in the loss (for advanced training). | |
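The `image_size` derivation described above can be checked with a quick calculation. With the DINOv2-base defaults (`encoder_input_size=224`, `encoder_patch_size=14`, `patch_size=16`), it reproduces the 256px decoder output shown in the encoding example:

```python
import math

def derived_image_size(encoder_input_size: int, encoder_patch_size: int, patch_size: int) -> int:
    # image_size = patch_size * sqrt(num_patches), per the config documentation above.
    num_patches = (encoder_input_size // encoder_patch_size) ** 2
    return patch_size * math.isqrt(num_patches)

# 224 // 14 = 16 patches per side -> 256 patches; sqrt(256) = 16; 16 * 16 = 256.
print(derived_image_size(224, 14, 16))  # 256
```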
#### decode[[diffusers.AutoencoderRAE.decode]]
| [Source](https://github.com/huggingface/diffusers/blob/vr_12652/src/diffusers/utils/accelerate_utils.py#L43) | |
| ## DecoderOutput[[diffusers.models.autoencoders.vae.DecoderOutput]] | |
| #### diffusers.models.autoencoders.vae.DecoderOutput[[diffusers.models.autoencoders.vae.DecoderOutput]] | |
| [Source](https://github.com/huggingface/diffusers/blob/vr_12652/src/diffusers/models/autoencoders/vae.py#L46) | |
| Output of decoding method. | |
| **Parameters:** | |
| sample (`torch.Tensor` of shape `(batch_size, num_channels, height, width)`) : The decoded output sample from the last layer of the model. | |