SkyReelsV2Transformer3DModel

A Diffusion Transformer model for 3D video-like data, introduced in SkyReels-V2 by Skywork AI.

The model can be loaded with the following code snippet.

import torch
from diffusers import SkyReelsV2Transformer3DModel

# Load only the transformer sub-model from the pipeline repository, in bfloat16.
transformer = SkyReelsV2Transformer3DModel.from_pretrained(
    "Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16
)

SkyReelsV2Transformer3DModel

class diffusers.SkyReelsV2Transformer3DModel

( patch_size: tuple = (1, 2, 2), num_attention_heads: int = 16, attention_head_dim: int = 128, in_channels: int = 16, out_channels: int = 16, text_dim: int = 4096, freq_dim: int = 256, ffn_dim: int = 8192, num_layers: int = 32, cross_attn_norm: bool = True, qk_norm: str | None = 'rms_norm_across_heads', eps: float = 1e-06, image_dim: int | None = None, added_kv_proj_dim: int | None = None, rope_max_seq_len: int = 1024, pos_embed_seq_len: int | None = None, inject_sample_info: bool = False, num_frame_per_block: int = 1 )

Parameters

patch_size (tuple[int], defaults to (1, 2, 2)) — 3D patch dimensions for video embedding (t_patch, h_patch, w_patch).
num_attention_heads (int, defaults to 16) — The number of attention heads.
attention_head_dim (int, defaults to 128) — The number of channels in each head.
in_channels (int, defaults to 16) — The number of channels in the input.
out_channels (int, defaults to 16) — The number of channels in the output.
text_dim (int, defaults to 4096) — Input dimension for text embeddings.
freq_dim (int, defaults to 256) — Dimension for sinusoidal time embeddings.
ffn_dim (int, defaults to 8192) — Intermediate dimension in feed-forward network.
num_layers (int, defaults to 32) — The number of layers of transformer blocks to use.
window_size (tuple[int], defaults to (-1, -1)) — Window size for local attention (-1 indicates global attention).
cross_attn_norm (bool, defaults to True) — Enable cross-attention normalization.
qk_norm (str, optional, defaults to "rms_norm_across_heads") — The type of query/key normalization to apply; None disables it.
eps (float, defaults to 1e-6) — Epsilon value for normalization layers.
inject_sample_info (bool, defaults to False) — Whether to inject sample information into the model.
image_dim (int, optional) — The dimension of the image embeddings.
added_kv_proj_dim (int, optional) — The dimension of the added key/value projection.
rope_max_seq_len (int, defaults to 1024) — The maximum sequence length for the rotary embeddings.
pos_embed_seq_len (int, optional) — The sequence length for the positional embeddings.

A Transformer model for video-like data used in the Wan-based SkyReels-V2 model.
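As a rough illustration of how the defaults above relate to one another, the sketch below computes two derived quantities. It assumes the common Diffusers convention that the hidden width equals num_attention_heads * attention_head_dim, and that each non-overlapping patch of size patch_size becomes one token; neither is stated on this page. The helper names are hypothetical and the latent shape in the usage lines is a made-up example.

```python
from math import prod

# Defaults copied from the constructor signature above.
PATCH_SIZE = (1, 2, 2)
NUM_ATTENTION_HEADS = 16
ATTENTION_HEAD_DIM = 128


def inner_dim(num_heads=NUM_ATTENTION_HEADS, head_dim=ATTENTION_HEAD_DIM):
    # Assumption: hidden width = number of heads * channels per head.
    return num_heads * head_dim


def num_tokens(num_frames, height, width, patch_size=PATCH_SIZE):
    # Assumption: the latent video is cut into non-overlapping patches,
    # one token per patch, so every axis must divide evenly.
    dims = (num_frames, height, width)
    for size, patch in zip(dims, patch_size):
        if size % patch != 0:
            raise ValueError(f"{size} is not divisible by patch size {patch}")
    return prod(size // patch for size, patch in zip(dims, patch_size))


print(inner_dim())              # 2048
print(num_tokens(21, 68, 120))  # 42840
```

The token count grows with the product of all three latent axes, which is worth keeping in mind when reading rope_max_seq_len and the attention-related parameters.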

forward

( hidden_states: Tensor, timestep: LongTensor, encoder_hidden_states: Tensor, encoder_hidden_states_image: torch.Tensor | None = None, enable_diffusion_forcing: bool = False, fps: torch.Tensor | None = None, return_dict: bool = True, attention_kwargs: dict[str, typing.Any] | None = None )

Parameters

hidden_states (torch.Tensor of shape (batch_size, num_channels, num_frames, height, width)) — Input hidden_states.
timestep (torch.LongTensor) — Used to indicate denoising step.
encoder_hidden_states (torch.Tensor of shape (batch_size, sequence_len, embed_dims)) — Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
encoder_hidden_states_image (torch.Tensor, optional) — Conditional image embeddings for image-conditioned generation.
enable_diffusion_forcing (bool, optional, defaults to False) — Whether to enable diffusion forcing (per-block causal masking).
fps (torch.Tensor, optional) — FPS conditioning embedding.
return_dict (bool, optional, defaults to True) — Whether or not to return a Transformer2DModelOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.

The SkyReelsV2Transformer3DModel forward method.
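The enable_diffusion_forcing description mentions per-block causal masking. The sketch below illustrates the general idea at frame granularity only: frames are grouped into blocks, and a query frame may attend to key frames in its own block or any earlier block. This is not the library's implementation, which presumably works on patch tokens and may differ in detail. The helper block_causal_mask is hypothetical, and tying the block size to the constructor's num_frame_per_block argument is an assumption.

```python
def block_causal_mask(num_frames, num_frame_per_block=1):
    """Return a num_frames x num_frames matrix of booleans.

    mask[q][k] is True when query frame q may attend to key frame k,
    that is, when k's block is not later than q's block.
    """
    if num_frame_per_block < 1:
        raise ValueError("num_frame_per_block must be at least 1")
    # Block index of every frame (assumed grouping: consecutive frames).
    block_of = [frame // num_frame_per_block for frame in range(num_frames)]
    return [
        [block_of[k] <= block_of[q] for k in range(num_frames)]
        for q in range(num_frames)
    ]


mask = block_causal_mask(4, num_frame_per_block=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx..
# xx..
# xxxx
# xxxx
```

With num_frame_per_block=1 the pattern reduces to an ordinary lower-triangular causal mask over frames.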

Transformer2DModelOutput

class diffusers.models.modeling_outputs.Transformer2DModelOutput

( sample: torch.Tensor )

Parameters

sample (torch.Tensor of shape (batch_size, num_channels, height, width) or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.

The output of Transformer2DModel.
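Because forward can return either this output object or a plain tuple depending on return_dict, calling code sometimes needs to handle both. The following is a minimal sketch of that pattern. FakeOutput is a hypothetical stand-in for Transformer2DModelOutput so the example runs without Diffusers installed, and it assumes the tuple form carries the sample as its first element.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeOutput:
    # Hypothetical stand-in for Transformer2DModelOutput; only `sample` is modeled.
    sample: Any


def extract_sample(output):
    # Assumption: with return_dict=False the result is a tuple whose
    # first element is the sample; otherwise it exposes a `.sample` attribute.
    if isinstance(output, tuple):
        return output[0]
    return output.sample


print(extract_sample(FakeOutput(sample="latents")))  # latents
print(extract_sample(("latents",)))                  # latents
```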
