Diffusers documentation
AceStepTransformer1DModel
A 1D Diffusion Transformer for music generation from ACE-Step 1.5. The model operates on the 25 Hz stereo latents produced by AutoencoderOobleck using flow matching, and is trained with a Qwen3-derived backbone (grouped-query attention, rotary position embedding, RMSNorm, AdaLN-Zero timestep conditioning) plus cross-attention to the text / lyric / timbre conditions built by AceStepConditionEncoder.
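Since the model operates on 25 Hz latents, the relationship between audio duration and the transformer's sequence length is simple arithmetic. A minimal sketch (the helper name is illustrative, not part of the API):

```python
# The latent rate stated above: AutoencoderOobleck produces 25 latent
# frames per second of audio.
LATENT_RATE_HZ = 25

def latent_seq_len(duration_s: float) -> int:
    # Number of latent frames the transformer sees for a clip of this length.
    return int(duration_s * LATENT_RATE_HZ)

print(latent_seq_len(10.0))  # 250 frames for a 10-second clip
```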
AceStepTransformer1DModel
class diffusers.AceStepTransformer1DModel
< source >( hidden_size: int = 2048 intermediate_size: int = 6144 num_hidden_layers: int = 24 num_attention_heads: int = 16 num_key_value_heads: int = 8 head_dim: int = 128 in_channels: int = 192 audio_acoustic_hidden_dim: int = 64 patch_size: int = 2 rope_theta: float = 1000000.0 attention_bias: bool = False attention_dropout: float = 0.0 rms_norm_eps: float = 1e-06 sliding_window: int = 128 layer_types: typing.Optional[typing.List[str]] = None encoder_hidden_size: typing.Optional[int] = None is_turbo: bool = False model_version: typing.Optional[str] = None )
Diffusion Transformer for ACE-Step 1.5 music generation.
Generates audio latents conditioned on text, lyrics, and timbre. Uses 1D patch embedding (Conv1d with stride
patch_size) followed by a stack of AceStepTransformerBlocks with alternating sliding-window / full attention on
the self-attention branch. Cross-attention consumes the packed encoder_hidden_states produced by
AceStepConditionEncoder.
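The patchify step described above can be sketched in isolation. This is an illustrative stand-in, not the actual Diffusers implementation: it assumes the noisy latents and context latents are concatenated channel-wise (per the `forward` docs) before a stride-`patch_size` `Conv1d`, and the `context_dim` of 8 is a made-up size for demonstration.

```python
import torch
import torch.nn as nn

# Assumed sizes: in_channels=192 noisy latent channels (the config default),
# 8 illustrative context channels, hidden_size=2048, patch_size=2.
batch, seq_len = 2, 250
in_channels, context_dim, hidden_size, patch_size = 192, 8, 2048, 2

hidden_states = torch.randn(batch, seq_len, in_channels)
context_latents = torch.randn(batch, seq_len, context_dim)

# Concatenate along channels, then patchify with a strided Conv1d.
x = torch.cat([hidden_states, context_latents], dim=-1).transpose(1, 2)  # (B, C, L)
patchify = nn.Conv1d(in_channels + context_dim, hidden_size,
                     kernel_size=patch_size, stride=patch_size)
tokens = patchify(x).transpose(1, 2)  # (B, L // patch_size, hidden_size)
print(tokens.shape)  # torch.Size([2, 125, 2048])
```

The stride-2 convolution halves the sequence length, so the transformer blocks attend over `seq_len // patch_size` tokens.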
forward
< source >( hidden_states: Tensor timestep: Tensor timestep_r: Tensor encoder_hidden_states: Tensor context_latents: Tensor return_dict: bool = True ) → Transformer2DModelOutput or tuple
Parameters
- **hidden_states** (`torch.Tensor` of shape `(batch_size, seq_len, channels)`) — Noisy latent input for the diffusion process.
- **timestep** (`torch.Tensor` of shape `(batch_size,)`) — Current diffusion timestep `t`.
- **timestep_r** (`torch.Tensor` of shape `(batch_size,)`) — Reference timestep `r` (set equal to `t` for standard inference).
- **encoder_hidden_states** (`torch.Tensor` of shape `(batch_size, encoder_seq_len, hidden_size)`) — Conditioning embeddings from the condition encoder (text + lyrics + timbre).
- **context_latents** (`torch.Tensor` of shape `(batch_size, seq_len, context_dim)`) — Context latents (source latents concatenated with chunk masks), fed to the patchify conv alongside `hidden_states`.
- **return_dict** (`bool`, defaults to `True`) — Whether to return a `Transformer2DModelOutput` or a plain tuple.
Returns
Transformer2DModelOutput or tuple
The predicted velocity field for the flow-matching objective.
The AceStepTransformer1DModel forward method.
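As a shape check for the arguments above, a minimal sketch building dummy inputs that match the documented `forward()` signature (`encoder_seq_len` and `context_dim` values here are illustrative; the actual call is shown only as a comment):

```python
import torch

# Dummy tensors matching the documented forward() argument shapes.
batch, seq_len, channels = 2, 250, 192
encoder_seq_len, hidden_size, context_dim = 64, 2048, 200

hidden_states = torch.randn(batch, seq_len, channels)
timestep = torch.rand(batch)        # t
timestep_r = timestep.clone()       # r == t for standard inference
encoder_hidden_states = torch.randn(batch, encoder_seq_len, hidden_size)
context_latents = torch.randn(batch, seq_len, context_dim)

# hidden_states and context_latents must agree on batch and sequence length.
assert hidden_states.shape[:2] == context_latents.shape[:2]

# With a loaded model, the call would look like:
# out = model(hidden_states, timestep=timestep, timestep_r=timestep_r,
#             encoder_hidden_states=encoder_hidden_states,
#             context_latents=context_latents)
```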