ACE-Step 1.5
ACE-Step 1.5 was introduced in ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation by the ACE-Step Team (ACE Studio and StepFun). It is an open-source music foundation model that generates commercial-grade stereo music with lyrics from text prompts.
ACE-Step 1.5 generates variable-length stereo audio at 48 kHz (10 seconds to 10 minutes) from text prompts and optional lyrics. The full system pairs a Language Model planner with a Diffusion Transformer (DiT) synthesizer; this pipeline wraps the DiT half of that stack, and consists of three components: an AutoencoderOobleck VAE that compresses waveforms into 25 Hz stereo latents, a Qwen3-based text encoder for prompt and lyric conditioning, and an AceStepTransformer1DModel DiT that operates in the VAE latent space using flow matching.
The model supports 50+ languages for lyrics — including English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, and Russian — and runs on consumer GPUs (under 4 GB of VRAM when offloaded).
This pipeline was contributed by the ACE-Step Team. The original codebase can be found at ace-step/ACE-Step-1.5.
Variants
ACE-Step 1.5 ships three DiT checkpoints that share the same transformer architecture but differ in guidance behavior; the pipeline auto-detects turbo checkpoints from the loaded transformer config and ignores CFG guidance for those guidance-distilled weights.
| Variant | CFG | Default steps | Default guidance_scale | Default shift | HF repo |
|---|---|---|---|---|---|
| turbo (guidance-distilled) | off | 8 | ignored | 3.0 | ACE-Step/Ace-Step1.5 |
| base | on | 8 | 7.0 | 3.0 | ACE-Step/acestep-v15-base |
| sft | on | 8 | 7.0 | 3.0 | ACE-Step/acestep-v15-sft |
Base and SFT use the learned `null_condition_emb` for classifier-free guidance (APG, not vanilla CFG). Users commonly override `num_inference_steps` to 30–60 on base/sft for higher quality.
Tips
When constructing a prompt, keep in mind:
- Descriptive prompt inputs work best; use adjectives to describe the music style, instruments, mood, and tempo.
- The prompt should describe the overall musical characteristics (e.g., “upbeat pop song with electric guitar and drums”).
- Lyrics should be structured with tags like `[verse]`, `[chorus]`, `[bridge]`, etc.
During inference:
- `num_inference_steps`, `guidance_scale`, and `shift` default to the values shown above. For turbo checkpoints, `guidance_scale > 1.0` is ignored with a warning because guidance is distilled into the weights.
- The `audio_duration` parameter controls the length of the generated music in seconds.
- The `vocal_language` parameter should match the language of the lyrics.
- `pipe.sample_rate` and `pipe.latents_per_second` are sourced from the VAE config (48000 Hz and 25 fps for the released checkpoints).
- For audio-to-audio tasks, pass `src_audio` and `reference_audio` as preprocessed stereo tensors at `pipe.sample_rate`.
- `flash` and `flash_hub` use FlashAttention's native sliding-window support for ACE-Step's self-attention and expect unpadded text batches. If a batched prompt contains padding, use `flash_varlen` or `flash_varlen_hub` instead. Single-prompt inference with `padding="longest"` is normally unpadded.
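Shaping inputs for `src_audio` and `reference_audio` is the caller's job: both are expected as stereo `[channels, samples]` float tensors at `pipe.sample_rate`. A minimal preprocessing sketch, assuming a NumPy array as returned by `soundfile.read` (shape `[samples]` for mono or `[samples, channels]` for multichannel) that is already at the target sample rate; the helper name is illustrative, not part of the pipeline:

```python
import numpy as np
import torch

def to_stereo_tensor(wav: np.ndarray) -> torch.Tensor:
    """Convert a soundfile-style array into the [channels, samples] stereo
    float tensor expected for `src_audio` / `reference_audio`."""
    wav_t = torch.from_numpy(wav).float()
    if wav_t.ndim == 1:          # mono [samples] -> [1, samples]
        wav_t = wav_t.unsqueeze(0)
    else:                        # [samples, channels] -> [channels, samples]
        wav_t = wav_t.T
    if wav_t.shape[0] == 1:      # duplicate mono to stereo
        wav_t = wav_t.repeat(2, 1)
    return wav_t

# 2 seconds of mono noise at 48 kHz becomes a [2, 96000] stereo tensor
stereo = to_stereo_tensor(np.random.randn(96000).astype(np.float32))
print(stereo.shape)  # torch.Size([2, 96000])
```

Resampling to 48 kHz (e.g., with torchaudio) is not shown; this sketch assumes the input already matches `pipe.sample_rate`.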
```python
import torch
import soundfile as sf
from diffusers import AceStepPipeline

pipe = AceStepPipeline.from_pretrained("ACE-Step/Ace-Step1.5", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

audio = pipe(
    prompt="A beautiful piano piece with soft melodies and gentle rhythm",
    lyrics="[verse]\nSoft notes in the morning light\nDancing through the air so bright\n[chorus]\nMusic fills the air tonight\nEvery note feels just right",
    audio_duration=30.0,
).audios

sf.write("output.wav", audio[0].T.cpu().float().numpy(), pipe.sample_rate)
```

AceStepPipeline
class diffusers.AceStepPipeline
< source >( vae: AutoencoderOobleck text_encoder: PreTrainedModel tokenizer: PreTrainedTokenizerFast transformer: AceStepTransformer1DModel condition_encoder: AceStepConditionEncoder scheduler: FlowMatchEulerDiscreteScheduler audio_tokenizer: typing.Optional[diffusers.pipelines.ace_step.modeling_ace_step.AceStepAudioTokenizer] = None audio_token_detokenizer: typing.Optional[diffusers.pipelines.ace_step.modeling_ace_step.AceStepAudioTokenDetokenizer] = None )
Parameters
- vae (AutoencoderOobleck) — Variational Auto-Encoder (VAE) model to encode and decode audio waveforms to and from latent representations.
- text_encoder (AutoModel) — Text encoder model (e.g., Qwen3-Embedding-0.6B) for encoding text prompts and lyrics.
- tokenizer (AutoTokenizer) — Tokenizer for the text encoder.
- transformer (AceStepTransformer1DModel) — The Diffusion Transformer (DiT) model for denoising audio latents.
- condition_encoder (`AceStepConditionEncoder`) — Condition encoder that combines text, lyric, and timbre embeddings for cross-attention.
- scheduler (FlowMatchEulerDiscreteScheduler) — Flow-matching Euler scheduler. ACE-Step feeds the DiT timesteps in `[0, 1]`, so the scheduler is configured with `num_train_timesteps=1` and `shift=1.0`; the pipeline computes its shifted / turbo sigma schedule itself and passes it via `set_timesteps(sigmas=...)`.
Pipeline for text-to-music generation using ACE-Step 1.5.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
The pipeline uses flow matching with a custom timestep schedule for the diffusion process. The turbo model variant uses 8 inference steps by default.
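One common way to build such a shifted flow-matching schedule is the time-shift formula `sigma' = s * sigma / (1 + (s - 1) * sigma)` (the SD3-style form). The sketch below is an illustration of that formula, not the pipeline's exact implementation — the schedule ACE-Step constructs internally (especially for turbo) may differ:

```python
import torch

def shifted_sigmas(num_steps: int, shift: float = 3.0) -> torch.Tensor:
    """Build a decreasing sigma schedule in [0, 1] and apply the common
    flow-matching time shift sigma' = s * sigma / (1 + (s - 1) * sigma)."""
    sigmas = torch.linspace(1.0, 0.0, num_steps + 1)
    return shift * sigmas / (1 + (shift - 1) * sigmas)

sigmas = shifted_sigmas(8, shift=3.0)
# The shift leaves the endpoints fixed (1.0 and 0.0) but concentrates more
# of the 8 steps at high noise levels.
print(sigmas[0].item(), sigmas[-1].item())  # 1.0 0.0
```

A schedule like this is what would be handed to the scheduler via `set_timesteps(sigmas=...)`.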
Supported task types:
"text2music": Generate music from text prompts and lyrics."cover": Generate audio from source audio / semantic codes with timbre transfer from reference audio."repaint": Regenerate a section of existing audio while keeping the rest."extract": Extract a specific track (e.g., vocals, drums) from audio."lego": Generate a specific track based on audio context."complete": Complete an input audio with additional tracks.
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None lyrics: typing.Union[str, typing.List[str]] = '' audio_duration: float = 60.0 vocal_language: typing.Union[str, typing.List[str]] = 'en' num_inference_steps: int = 8 guidance_scale: float = 7.0 shift: float = 3.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pt' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: typing.Optional[int] = 1 callback_on_step_end: typing.Optional[typing.Callable[..., dict]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ('latents',) instruction: typing.Optional[str] = None max_text_length: int = 256 max_lyric_length: int = 2048 bpm: typing.Optional[int] = None keyscale: typing.Optional[str] = None timesignature: typing.Optional[str] = None task_type: str = 'text2music' track_name: typing.Optional[str] = None complete_track_classes: typing.Optional[typing.List[str]] = None src_audio: typing.Optional[torch.Tensor] = None reference_audio: typing.Optional[torch.Tensor] = None audio_codes: typing.Union[str, typing.List[str], NoneType] = None repainting_start: typing.Optional[float] = None repainting_end: typing.Optional[float] = None audio_cover_strength: float = 1.0 cfg_interval_start: float = 0.0 cfg_interval_end: float = 1.0 timesteps: typing.Optional[typing.List[float]] = None ) → AudioPipelineOutput or tuple
Parameters
- prompt (`str` or `List[str]`, optional) — The prompt or prompts to guide music generation. Describes the style, genre, instruments, etc.
- lyrics (`str` or `List[str]`, optional, defaults to `""`) — The lyrics text for the music. Supports structured lyrics with tags like `[verse]`, `[chorus]`, etc.
- audio_duration (`float`, optional, defaults to 60.0) — Duration of the generated audio in seconds.
- vocal_language (`str` or `List[str]`, optional, defaults to `"en"`) — Language code for the lyrics (e.g., `"en"`, `"zh"`, `"ja"`).
- num_inference_steps (`int`, optional, defaults to 8) — The number of denoising steps. The turbo model is designed for 8 steps.
- guidance_scale (`float`, optional, defaults to 7.0) — Guidance scale for classifier-free guidance. A value of 1.0 disables CFG.
- shift (`float`, optional, defaults to 3.0) — Shift parameter for the timestep schedule (1.0, 2.0, or 3.0).
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — A generator to make generation deterministic.
- latents (`torch.Tensor`, optional) — Pre-generated noise latents of shape `(batch_size, latent_length, acoustic_dim)`.
- output_type (`str`, optional, defaults to `"pt"`) — Output format. `"pt"` for PyTorch tensor, `"np"` for NumPy array, `"latent"` for raw latents.
- return_dict (`bool`, optional, defaults to `True`) — Whether to return an `AudioPipelineOutput` or a plain tuple.
- callback (`Callable`, optional) — A function called every `callback_steps` steps with `(step, timestep, latents)`.
- callback_steps (`int`, optional, defaults to 1) — Frequency of the callback function.
- instruction (`str`, optional) — Custom instruction text for the generation task. If not provided, it is auto-generated based on `task_type`.
- max_text_length (`int`, optional, defaults to 256) — Maximum token length for text prompt encoding.
- max_lyric_length (`int`, optional, defaults to 2048) — Maximum token length for lyrics encoding.
- bpm (`int`, optional) — BPM (beats per minute) for music metadata. If `None`, the model estimates it.
- keyscale (`str`, optional) — Musical key (e.g., `"C major"`, `"A minor"`). If `None`, the model estimates it.
- timesignature (`str`, optional) — Time signature (e.g., `"4"` for 4/4, `"3"` for 3/4). If `None`, the model estimates it.
- task_type (`str`, optional, defaults to `"text2music"`) — The generation task type. One of `"text2music"`, `"cover"`, `"repaint"`, `"extract"`, `"lego"`, `"complete"`.
- track_name (`str`, optional) — Track name for `"extract"` or `"lego"` tasks (e.g., `"vocals"`, `"drums"`).
- complete_track_classes (`List[str]`, optional) — Track classes for the `"complete"` task.
- src_audio (`torch.Tensor`, optional) — Source audio tensor of shape `[channels, samples]` at 48 kHz for audio-to-audio tasks (repaint, lego, cover, extract, complete). The audio is encoded through the VAE to produce source latents.
- reference_audio (`torch.Tensor`, optional) — Reference audio tensor of shape `[channels, samples]` at 48 kHz for timbre conditioning. Used to extract timbre features for style transfer.
- audio_codes (`str` or `List[str]`, optional) — Audio semantic code strings (e.g., `"<|audio_code_123|><|audio_code_456|>..."`). When provided, the task is automatically switched to `"cover"` mode and the registered ACE-Step audio tokenizer / detokenizer modules decode the 5 Hz codes into 25 Hz acoustic conditioning.
- repainting_start (`float`, optional) — Start time in seconds for the repaint region (for `"repaint"` and `"lego"` tasks).
- repainting_end (`float`, optional) — End time in seconds for the repaint region. Use `-1` or `None` to repaint until the end.
- audio_cover_strength (`float`, optional, defaults to 1.0) — Strength of audio cover blending (0.0 to 1.0). When < 1.0, blends cover-conditioned and text-only-conditioned outputs. Lower values produce a stronger style transfer effect.
- cfg_interval_start (`float`, optional, defaults to 0.0) — Start ratio (0.0–1.0) of the timestep range where CFG is applied.
- cfg_interval_end (`float`, optional, defaults to 1.0) — End ratio (0.0–1.0) of the timestep range where CFG is applied.
- timesteps (`List[float]`, optional) — Custom timestep schedule. If provided, overrides `num_inference_steps` and `shift`.
Returns
AudioPipelineOutput or tuple
If return_dict is True, an AudioPipelineOutput is returned, otherwise a tuple with the generated
audio.
The call function to the pipeline for music generation.
Examples:
```python
>>> import torch
>>> import soundfile as sf
>>> from diffusers import AceStepPipeline

>>> pipe = AceStepPipeline.from_pretrained("ACE-Step/Ace-Step1.5", torch_dtype=torch.bfloat16)
>>> pipe = pipe.to("cuda")

>>> # Text-to-music generation with metadata
>>> audio = pipe(
...     prompt="A beautiful piano piece with soft melodies",
...     lyrics="[verse]\nSoft notes in the morning light\n[chorus]\nMusic fills the air tonight",
...     audio_duration=30.0,
...     num_inference_steps=8,
...     bpm=120,
...     keyscale="C major",
...     timesignature="4",
... ).audios

>>> # Save the generated audio (cast to float32 before converting to NumPy)
>>> sf.write("output.wav", audio[0, 0].cpu().float().numpy(), 48000)

>>> # Repaint task: regenerate a section of existing stereo 48 kHz audio
>>> src_audio, sr = sf.read("input.wav")
>>> src_audio = torch.from_numpy(src_audio).float().T
>>> audio = pipe(
...     prompt="Epic rock guitar solo",
...     lyrics="",
...     task_type="repaint",
...     src_audio=src_audio,
...     repainting_start=10.0,
...     repainting_end=20.0,
... ).audios

>>> # Cover task with reference audio for timbre transfer
>>> ref_audio, sr = sf.read("reference.wav")
>>> ref_audio = torch.from_numpy(ref_audio).float().T
>>> audio = pipe(
...     prompt="Pop song with bright vocals",
...     lyrics="[verse]\nHello world",
...     task_type="cover",
...     reference_audio=ref_audio,
...     audio_cover_strength=0.8,
... ).audios
```

check_inputs
< source >( prompt: typing.Union[str, typing.List[str]] lyrics: typing.Union[str, typing.List[str]] task_type: str num_inference_steps: int guidance_scale: float shift: float audio_cover_strength: float cfg_interval_start: float cfg_interval_end: float repainting_start: typing.Optional[float] repainting_end: typing.Optional[float] )
Validate user-facing arguments before we start allocating noise tensors.
encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] lyrics: typing.Union[str, typing.List[str]] device: device vocal_language: typing.Union[str, typing.List[str]] = 'en' audio_duration: float = 60.0 instruction: typing.Optional[str] = None bpm: typing.Optional[int] = None keyscale: typing.Optional[str] = None timesignature: typing.Optional[str] = None max_text_length: int = 256 max_lyric_length: int = 2048 )
Parameters
- prompt (`str` or `List[str]`) — Text caption(s) describing the music.
- lyrics (`str` or `List[str]`) — Lyric text(s).
- device (`torch.device`) — Device for tensors.
- vocal_language (`str` or `List[str]`, optional, defaults to `"en"`) — Language code(s) for lyrics.
- audio_duration (`float`, optional, defaults to 60.0) — Duration of the audio in seconds.
- instruction (`str`, optional) — Instruction text for generation.
- bpm (`int`, optional) — BPM (beats per minute) for metadata.
- keyscale (`str`, optional) — Musical key (e.g., `"C major"`).
- timesignature (`str`, optional) — Time signature (e.g., `"4"` for 4/4).
- max_text_length (`int`, optional, defaults to 256) — Maximum token length for text prompts.
- max_lyric_length (`int`, optional, defaults to 2048) — Maximum token length for lyrics.
Encode text prompts and lyrics into embeddings.
Text prompts are encoded through the full text encoder model to produce contextual hidden states. Lyrics are only passed through the text encoder’s embedding layer (token lookup), since the lyric encoder in the condition encoder handles the contextual encoding.
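The asymmetry — contextual hidden states for the prompt, a plain token-embedding lookup for the lyrics — can be illustrated with a toy encoder. All module names and sizes below are illustrative stand-ins, not the pipeline's internals:

```python
import torch
import torch.nn as nn

vocab, dim = 1000, 64
embed = nn.Embedding(vocab, dim)  # token lookup table
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2
)

ids = torch.randint(0, vocab, (1, 16))

# Prompt path: embedding lookup *plus* contextual mixing across positions.
prompt_states = encoder(embed(ids))

# Lyric path: embedding lookup only; in ACE-Step the condition encoder's
# dedicated lyric encoder adds the contextual modeling later.
lyric_embeds = embed(ids)

print(prompt_states.shape, lyric_embeds.shape)  # both [1, 16, 64]
```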
prepare_latents
< source >( batch_size: int audio_duration: float dtype: dtype device: device generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None )
Parameters
- batch_size (`int`) — Number of samples to generate.
- audio_duration (`float`) — Duration of audio in seconds.
- dtype (`torch.dtype`) — Data type for the latents.
- device (`torch.device`) — Device for the latents.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — Random number generator(s).
- latents (`torch.Tensor`, optional) — Pre-generated latents.
Prepare initial noise latents for the flow matching process.
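A minimal sketch of that preparation, assuming the 25 Hz latent rate of the released checkpoints and an illustrative `acoustic_dim` of 64 (the real dimension comes from the transformer config):

```python
import torch

def prepare_noise_latents(batch_size, audio_duration, acoustic_dim=64,
                          generator=None, dtype=torch.float32, device="cpu"):
    """Sample initial noise of shape (batch_size, latent_length, acoustic_dim),
    with latent_length derived from the 25 Hz latent rate."""
    latent_length = int(audio_duration * 25)
    return torch.randn(
        (batch_size, latent_length, acoustic_dim),
        generator=generator, dtype=dtype, device=device,
    )

g = torch.Generator().manual_seed(0)
latents = prepare_noise_latents(2, 30.0, generator=g)
print(latents.shape)  # torch.Size([2, 750, 64])
```

Passing the same seeded generator reproduces the same noise, which is what makes `generator` useful for deterministic generation.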
prepare_reference_audio_latents
< source >( reference_audio: Tensor batch_size: int device: device dtype: dtype )
Process reference audio into acoustic latents for the timbre encoder.
The reference audio is repeated/cropped to 30 seconds (3 segments of 10 seconds each from front, middle, and back), encoded through the VAE, and then transposed for the timbre encoder.
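The front/middle/back segmentation can be sketched as below, assuming 48 kHz stereo input longer than 30 seconds; padding behavior for shorter clips, and the VAE encode that follows, are not shown:

```python
import torch

SR = 48000
SEG = 10 * SR  # 10-second segments

def crop_front_middle_back(audio: torch.Tensor) -> torch.Tensor:
    """Take 10 s from the front, middle, and back of a [channels, samples]
    waveform and concatenate them into a 30 s reference clip."""
    n = audio.shape[-1]
    front = audio[..., :SEG]
    mid_start = max((n - SEG) // 2, 0)
    middle = audio[..., mid_start:mid_start + SEG]
    back = audio[..., -SEG:]
    return torch.cat([front, middle, back], dim=-1)

clip = crop_front_middle_back(torch.randn(2, 60 * SR))  # 60 s input
print(clip.shape)  # torch.Size([2, 1440000]), i.e. 30 s at 48 kHz
```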
prepare_src_latents
< source >( device: device dtype: dtype batch_size: int = 1 src_audio: typing.Optional[torch.Tensor] = None audio_codes: typing.Union[str, typing.List[str], NoneType] = None latent_length: typing.Optional[int] = None task_type: str = 'text2music' )
Parameters
- src_audio (`torch.Tensor`, optional) — Source audio tensor of shape `[channels, samples]` at `self.sample_rate`.
- audio_codes (`str` or `List[str]`, optional) — Audio semantic code strings.
- latent_length (`int`, optional) — Target latent length when no source audio or audio codes are given.
- device (`torch.device`) — Target device.
- dtype (`torch.dtype`) — Target dtype.
- batch_size (`int`) — Batch size.
- task_type (`str`) — Current task type.
Prepare source latents for text-to-music and audio-to-audio tasks.