CogView3PlusTransformer2DModel

A Diffusion Transformer model for 2D data from CogView3Plus was introduced in CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import CogView3PlusTransformer2DModel

transformer = CogView3PlusTransformer2DModel.from_pretrained("THUDM/CogView3Plus-3b", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```

CogView3PlusTransformer2DModel[[diffusers.CogView3PlusTransformer2DModel]]

class diffusers.CogView3PlusTransformer2DModel( patch_size: int = 2, in_channels: int = 16, num_layers: int = 30, attention_head_dim: int = 40, num_attention_heads: int = 64, out_channels: int = 16, text_embed_dim: int = 4096, time_embed_dim: int = 512, condition_dim: int = 256, pos_embed_max_size: int = 128, sample_size: int = 128 )

Source: https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/transformers/transformer_cogview3plus.py#L128

Parameters:

  • patch_size (int, defaults to 2) -- The size of the patches to use in the patch embedding layer.

  • in_channels (int, defaults to 16) -- The number of channels in the input.
  • num_layers (int, defaults to 30) -- The number of layers of Transformer blocks to use.
  • attention_head_dim (int, defaults to 40) -- The number of channels in each head.
  • num_attention_heads (int, defaults to 64) -- The number of heads to use for multi-head attention.
  • out_channels (int, defaults to 16) -- The number of channels in the output.
  • text_embed_dim (int, defaults to 4096) -- Input dimension of text embeddings from the text encoder.
  • time_embed_dim (int, defaults to 512) -- Output dimension of timestep embeddings.
  • condition_dim (int, defaults to 256) -- The embedding dimension of the input SDXL-style resolution conditions (original_size, target_size, crop_coords).
  • pos_embed_max_size (int, defaults to 128) -- The maximum resolution of the positional embeddings, from which slices of shape H x W are taken and added to input patched latents, where H and W are the latent height and width respectively. A value of 128 means that the maximum supported height and width for image generation is 128 * vae_scale_factor * patch_size => 128 * 8 * 2 => 2048.
  • sample_size (int, defaults to 128) -- The base resolution of input latents. If height/width is not provided during generation, this value is used to determine the resolution as sample_size * vae_scale_factor => 128 * 8 => 1024.
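
The resolution arithmetic in the two parameters above can be verified directly (a quick sketch; `vae_scale_factor = 8` and `patch_size = 2` are taken from the defaults listed above):

```python
# Default config values from the parameter list above.
patch_size = 2
pos_embed_max_size = 128
sample_size = 128
vae_scale_factor = 8  # the CogView3 VAE downsamples spatially by 8x

# Maximum supported image resolution:
# pos_embed_max_size * vae_scale_factor * patch_size
max_resolution = pos_embed_max_size * vae_scale_factor * patch_size
print(max_resolution)  # 2048

# Default generation resolution when height/width are not provided:
# sample_size * vae_scale_factor
default_resolution = sample_size * vae_scale_factor
print(default_resolution)  # 1024
```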

The Transformer model introduced in CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion.

forward( hidden_states: Tensor, encoder_hidden_states: Tensor, timestep: LongTensor, original_size: Tensor, target_size: Tensor, crop_coords: Tensor, return_dict: bool = True )

Source: https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/transformers/transformer_cogview3plus.py#L287

Parameters:

  • hidden_states (torch.Tensor) -- Input hidden_states of shape (batch_size, channel, height, width).

  • encoder_hidden_states (torch.Tensor) -- Conditional embeddings (embeddings computed from the input conditions such as prompts) of shape (batch_size, sequence_len, text_embed_dim)
  • timestep (torch.LongTensor) -- Used to indicate denoising step.
  • original_size (torch.Tensor) -- CogView3 uses SDXL-like micro-conditioning for original image size as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
  • target_size (torch.Tensor) -- CogView3 uses SDXL-like micro-conditioning for target image size as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
  • crop_coords (torch.Tensor) -- CogView3 uses SDXL-like micro-conditioning for crop coordinates as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
  • return_dict (bool, optional, defaults to True) -- Whether or not to return a ~models.transformer_2d.Transformer2DModelOutput instead of a plain tuple.

Returns: torch.Tensor or ~models.transformer_2d.Transformer2DModelOutput -- The denoised latents using provided inputs as conditioning.

The CogView3PlusTransformer2DModel forward method.
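
As a rough sketch of the tensor shapes the forward method expects (the arithmetic below follows the parameter descriptions above; the batch size, target resolution, and text sequence length are illustrative values, and actually running the model requires the pretrained weights, which this sketch does not load):

```python
# Hypothetical generation settings, for illustration only.
batch_size = 2
height, width = 1024, 1024   # target image resolution
seq_len = 224                # example text sequence length

# Values taken from the model config defaults above.
vae_scale_factor = 8         # CogView3 VAE spatial downsampling factor
patch_size = 2
in_channels = 16
text_embed_dim = 4096

# hidden_states: (batch_size, channel, latent_height, latent_width)
latent_h = height // vae_scale_factor
latent_w = width // vae_scale_factor
hidden_states_shape = (batch_size, in_channels, latent_h, latent_w)

# encoder_hidden_states: (batch_size, sequence_len, text_embed_dim)
encoder_hidden_states_shape = (batch_size, seq_len, text_embed_dim)

# Inside the model, latents are patchified into
# (latent_h / patch_size) * (latent_w / patch_size) image tokens.
num_image_tokens = (latent_h // patch_size) * (latent_w // patch_size)

print(hidden_states_shape)  # (2, 16, 128, 128)
print(num_image_tokens)     # 4096
```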

set_attn_processor( processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]] )

Source: https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/transformers/transformer_cogview3plus.py#L253

Parameters:

  • processor (dict of AttentionProcessor or only AttentionProcessor) -- The instantiated processor class or a dictionary of processor classes that will be set as the processor for all Attention layers. Any of the attention processor classes in diffusers.models.attention_processor (e.g. AttnProcessor2_0, XFormersAttnProcessor, IPAdapterAttnProcessor) is accepted.

If processor is a dict, the key needs to define the path to the corresponding cross attention processor. This is strongly recommended when setting trainable attention processors.

Sets the attention processor to use to compute attention.

Transformer2DModelOutput[[diffusers.models.modeling_outputs.Transformer2DModelOutput]]

class diffusers.models.modeling_outputs.Transformer2DModelOutput( sample: torch.Tensor )

Source: https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/modeling_outputs.py#L21

Parameters:

  • sample (torch.Tensor of shape (batch_size, num_channels, height, width), or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) -- The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.

The output of Transformer2DModel.
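
Conceptually, this output wrapper behaves like a dataclass with a single `sample` field; when forward is called with `return_dict=False`, the same tensor is returned inside a plain one-element tuple instead. The stand-in below is for illustration only and is not the real class, which lives in `diffusers.models.modeling_outputs`:

```python
from dataclasses import dataclass
from typing import Any

# Minimal stand-in mirroring Transformer2DModelOutput's shape,
# shown only to illustrate how callers consume the result.
@dataclass
class Transformer2DModelOutput:
    sample: Any  # torch.Tensor in the real class

out = Transformer2DModelOutput(sample=[[0.0]])
print(out.sample)       # access via attribute when return_dict=True

as_tuple = (out.sample,)  # what return_dict=False would yield
print(as_tuple[0])
```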
