Diffusers documentation

AceStepTransformer1DModel


A 1D Diffusion Transformer for music generation from ACE-Step 1.5. The model denoises, via flow matching, the 25 Hz stereo latents produced by AutoencoderOobleck. Its backbone is derived from Qwen3 (grouped-query attention, rotary position embeddings, RMSNorm, AdaLN-Zero timestep conditioning), with cross-attention to the text, lyric, and timbre conditions built by AceStepConditionEncoder.
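For a sense of scale, the 25 Hz latent rate and the default patch_size of 2 determine how many tokens the transformer processes for a given clip length. Illustrative arithmetic only:

```python
# The latent sequence runs at 25 frames per second; with patch_size=2 the
# transformer sees half that many tokens.
latent_rate_hz = 25
patch_size = 2

duration_s = 10
seq_len = duration_s * latent_rate_hz   # 250 latent frames
num_tokens = seq_len // patch_size      # 125 transformer tokens
print(seq_len, num_tokens)  # 250 125
```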

AceStepTransformer1DModel

class diffusers.AceStepTransformer1DModel


( hidden_size: int = 2048
  intermediate_size: int = 6144
  num_hidden_layers: int = 24
  num_attention_heads: int = 16
  num_key_value_heads: int = 8
  head_dim: int = 128
  in_channels: int = 192
  audio_acoustic_hidden_dim: int = 64
  patch_size: int = 2
  rope_theta: float = 1000000.0
  attention_bias: bool = False
  attention_dropout: float = 0.0
  rms_norm_eps: float = 1e-06
  sliding_window: int = 128
  layer_types: typing.Optional[typing.List[str]] = None
  encoder_hidden_size: typing.Optional[int] = None
  is_turbo: bool = False
  model_version: typing.Optional[str] = None )

Diffusion Transformer for ACE-Step 1.5 music generation.

Generates audio latents conditioned on text, lyrics, and timbre. Uses 1D patch embedding (Conv1d with stride patch_size) followed by a stack of AceStepTransformerBlocks with alternating sliding-window / full attention on the self-attention branch. Cross-attention consumes the packed encoder_hidden_states produced by AceStepConditionEncoder.
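The patchify step described above can be sketched in plain PyTorch. Everything here is a toy stand-in with made-up sizes, not the library's implementation; only the mechanism (channel-concatenation followed by a strided Conv1d) follows the description:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only (the real model defaults to
# in_channels=192, hidden_size=2048, patch_size=2).
in_channels = 8    # channels of the noisy latents
context_dim = 4    # channels of the context latents (source latents + chunk masks)
hidden_size = 32
patch_size = 2

# Noisy latents and context latents are concatenated on the channel axis,
# then a Conv1d with stride=patch_size folds patch_size timesteps per token.
patchify = nn.Conv1d(
    in_channels + context_dim, hidden_size,
    kernel_size=patch_size, stride=patch_size,
)

hidden_states = torch.randn(1, 64, in_channels)   # (batch, seq_len, channels)
context_latents = torch.randn(1, 64, context_dim)

x = torch.cat([hidden_states, context_latents], dim=-1)  # (1, 64, 12)
tokens = patchify(x.transpose(1, 2)).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 32, 32])
```

Note how the sequence length is halved (64 latent frames become 32 tokens), which is why patch_size trades temporal resolution for compute.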

forward


( hidden_states: Tensor timestep: Tensor timestep_r: Tensor encoder_hidden_states: Tensor context_latents: Tensor return_dict: bool = True ) Transformer2DModelOutput or tuple

Parameters

  • hidden_states (torch.Tensor of shape (batch_size, seq_len, channels)) — Noisy latent input for the diffusion process.
  • timestep (torch.Tensor of shape (batch_size,)) — Current diffusion timestep t.
  • timestep_r (torch.Tensor of shape (batch_size,)) — Reference timestep r (set equal to t for standard inference).
  • encoder_hidden_states (torch.Tensor of shape (batch_size, encoder_seq_len, hidden_size)) — Conditioning embeddings from the condition encoder (text + lyrics + timbre).
  • context_latents (torch.Tensor of shape (batch_size, seq_len, context_dim)) — Context latents (source latents concatenated with chunk masks), fed to the patchify conv alongside hidden_states.
  • return_dict (bool, defaults to True) — Whether to return a Transformer2DModelOutput or a plain tuple.
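Putting the documented shapes together, the forward inputs can be mocked up as below. All sizes are illustrative placeholders, and timestep_r is simply set equal to timestep for standard inference, per the parameter description above:

```python
import torch

# Hypothetical sizes; a real model expects channels matching its config
# (e.g. in_channels=192, hidden_size=2048).
batch_size, seq_len, channels = 2, 16, 6
encoder_seq_len, hidden_size, context_dim = 10, 8, 4

hidden_states = torch.randn(batch_size, seq_len, channels)
timestep = torch.rand(batch_size)        # current diffusion timestep t
timestep_r = timestep.clone()            # reference timestep r = t
encoder_hidden_states = torch.randn(batch_size, encoder_seq_len, hidden_size)
context_latents = torch.randn(batch_size, seq_len, context_dim)

print(hidden_states.shape)  # torch.Size([2, 16, 6])
```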

Returns

Transformer2DModelOutput or tuple

If return_dict is True, a Transformer2DModelOutput carrying the predicted velocity field; otherwise a plain tuple with the same contents.
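Because the model predicts a velocity field, a flow-matching sampler integrates it over time. A minimal Euler step of the ODE dx/dt = v(x, t), with a toy stand-in for the transformer so the step itself is checkable, might look like:

```python
import torch

# Hypothetical stand-in for AceStepTransformer1DModel: a fixed velocity
# function, used only to demonstrate the integration step.
def model(hidden_states, t):
    return -hidden_states  # pretend the predicted velocity is -x

# One Euler step:  x_{t+dt} = x_t + dt * v(x_t, t)
x = torch.ones(1, 4, 2)
dt = 0.1
v = model(x, torch.tensor([0.0]))
x_next = x + dt * v
print(round(x_next[0, 0, 0].item(), 4))  # 0.9
```

A real sampling loop would repeat this step over a timestep schedule, calling the transformer with the conditioning inputs described above.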

The AceStepTransformer1DModel forward method.
