Diffusers documentation

AceStepTransformer1DModel


A 1D Diffusion Transformer for music generation from ACE-Step 1.5. The model denoises, via flow matching, the 25 Hz stereo latents produced by AutoencoderOobleck. Its backbone is derived from Qwen3 (grouped-query attention, rotary position embeddings, RMSNorm, AdaLN-Zero timestep conditioning), with cross-attention to the text, lyric, and timbre conditions built by AceStepConditionEncoder.
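For a sense of scale, the 25 Hz latent rate and the default patch_size of 2 determine how many tokens the transformer processes for a given clip length. Illustrative arithmetic only:

```python
# The latent sequence runs at 25 frames per second; with patch_size=2 the
# transformer sees half that many tokens.
latent_rate_hz = 25
patch_size = 2

duration_s = 10
seq_len = duration_s * latent_rate_hz   # 250 latent frames
num_tokens = seq_len // patch_size      # 125 transformer tokens
print(seq_len, num_tokens)  # 250 125
```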

AceStepTransformer1DModel

class diffusers.AceStepTransformer1DModel


( hidden_size: int = 2048
  intermediate_size: int = 6144
  num_hidden_layers: int = 24
  num_attention_heads: int = 16
  num_key_value_heads: int = 8
  head_dim: int = 128
  in_channels: int = 192
  audio_acoustic_hidden_dim: int = 64
  patch_size: int = 2
  rope_theta: float = 1000000.0
  attention_bias: bool = False
  attention_dropout: float = 0.0
  rms_norm_eps: float = 1e-06
  sliding_window: int = 128
  layer_types: typing.Optional[typing.List[str]] = None
  encoder_hidden_size: typing.Optional[int] = None
  is_turbo: bool = False
  model_version: typing.Optional[str] = None )

Diffusion Transformer for ACE-Step 1.5 music generation.

Generates audio latents conditioned on text, lyrics, and timbre. Uses 1D patch embedding (Conv1d with stride patch_size) followed by a stack of AceStepTransformerBlocks with alternating sliding-window / full attention on the self-attention branch. Cross-attention consumes the packed encoder_hidden_states produced by AceStepConditionEncoder.
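The patchify step described above can be sketched in plain PyTorch. Everything here is a toy stand-in with made-up sizes, not the library's implementation; only the mechanism (channel-concatenation followed by a strided Conv1d) follows the description:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only (the real model defaults to
# in_channels=192, hidden_size=2048, patch_size=2).
in_channels = 8    # channels of the noisy latents
context_dim = 4    # channels of the context latents (source latents + chunk masks)
hidden_size = 32
patch_size = 2

# Noisy latents and context latents are concatenated on the channel axis,
# then a Conv1d with stride=patch_size folds patch_size timesteps per token.
patchify = nn.Conv1d(
    in_channels + context_dim, hidden_size,
    kernel_size=patch_size, stride=patch_size,
)

hidden_states = torch.randn(1, 64, in_channels)   # (batch, seq_len, channels)
context_latents = torch.randn(1, 64, context_dim)

x = torch.cat([hidden_states, context_latents], dim=-1)  # (1, 64, 12)
tokens = patchify(x.transpose(1, 2)).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 32, 32])
```

Note how the sequence length is halved (64 latent frames become 32 tokens), which is why patch_size trades temporal resolution for compute.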

forward


( hidden_states: Tensor timestep: Tensor timestep_r: Tensor encoder_hidden_states: Tensor context_latents: Tensor return_dict: bool = True ) Transformer2DModelOutput or tuple

Parameters

  • hidden_states (torch.Tensor of shape (batch_size, seq_len, channels)) — Noisy latent input for the diffusion process.
  • timestep (torch.Tensor of shape (batch_size,)) — Current diffusion timestep t.
  • timestep_r (torch.Tensor of shape (batch_size,)) — Reference timestep r (set equal to t for standard inference).
  • encoder_hidden_states (torch.Tensor of shape (batch_size, encoder_seq_len, hidden_size)) — Conditioning embeddings from the condition encoder (text + lyrics + timbre).
  • context_latents (torch.Tensor of shape (batch_size, seq_len, context_dim)) — Context latents (source latents concatenated with chunk masks), fed to the patchify conv alongside hidden_states.
  • return_dict (bool, defaults to True) — Whether to return a Transformer2DModelOutput or a plain tuple.
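Putting the documented shapes together, the forward inputs can be mocked up as below. All sizes are illustrative placeholders, and timestep_r is simply set equal to timestep for standard inference, per the parameter description above:

```python
import torch

# Hypothetical sizes; a real model expects channels matching its config
# (e.g. in_channels=192, hidden_size=2048).
batch_size, seq_len, channels = 2, 16, 6
encoder_seq_len, hidden_size, context_dim = 10, 8, 4

hidden_states = torch.randn(batch_size, seq_len, channels)
timestep = torch.rand(batch_size)        # current diffusion timestep t
timestep_r = timestep.clone()            # reference timestep r = t
encoder_hidden_states = torch.randn(batch_size, encoder_seq_len, hidden_size)
context_latents = torch.randn(batch_size, seq_len, context_dim)

print(hidden_states.shape)  # torch.Size([2, 16, 6])
```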

Returns

Transformer2DModelOutput or tuple

If return_dict is True, a Transformer2DModelOutput carrying the predicted velocity field; otherwise a plain tuple with the same contents.
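Because the model predicts a velocity field, a flow-matching sampler integrates it over time. A minimal Euler step of the ODE dx/dt = v(x, t), with a toy stand-in for the transformer so the step itself is checkable, might look like:

```python
import torch

# Hypothetical stand-in for AceStepTransformer1DModel: a fixed velocity
# function, used only to demonstrate the integration step.
def model(hidden_states, t):
    return -hidden_states  # pretend the predicted velocity is -x

# One Euler step:  x_{t+dt} = x_t + dt * v(x_t, t)
x = torch.ones(1, 4, 2)
dt = 0.1
v = model(x, torch.tensor([0.0]))
x_next = x + dt * v
print(round(x_next[0, 0, 0].item(), 4))  # 0.9
```

A real sampling loop would repeat this step over a timestep schedule, calling the transformer with the conditioning inputs described above.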

The AceStepTransformer1DModel forward method.
