	---
	language:
	- en
	- zh
	license: apache-2.0
	library_name: diffusers
	pipeline_tag: text-to-audio
	base_model:
	- Qwen/Qwen3-1.7B
	tags:
	- text-to-audio
	- diffusion
	- flow-matching
	- sound-effects
	- audio-generation
	---

	# MOSS-SoundEffect-V2.0

	<p align="center">
	<img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/openmoss_x_mosi" height="50" align="middle" />
	</p>

	<div align="center">
	<a href="https://github.com/OpenMOSS/MOSS-TTS/tree/main/moss_soundeffect_v2"><img src="https://img.shields.io/badge/Project%20Page-GitHub-blue"></a>
	</div>

	MOSS-SoundEffect v2.0 is a text-to-audio model with a Diffusion Transformer (DiT) backbone trained with the Flow Matching objective, paired with a DAC VAE and a Qwen3 text encoder. It generates high-fidelity environmental, urban, creature, and human-action sound effects from natural-language prompts, with controllable duration up to 30 seconds at 48 kHz.

	## 1. Overview

	### 1.1 TTS Family Positioning

	Within the MOSS-TTS Family, MOSS-SoundEffect is the dedicated text-to-sound model: it turns natural-language captions into non-speech audio (ambience, urban scenes, creatures, human actions, short music-like clips). v2.0 replaces the v1 discrete-token autoregressive backbone (`MossTTSDelay`) with a continuous-latent Diffusion Transformer trained with Flow Matching.

	### 1.2 Key Capabilities

	- Broad SFX coverage: natural environments, urban environments, animals & creatures, human actions, and short musical/percussive clips.
	- Long-form generation: stable audio up to 30 seconds per call. The target duration is conveyed by a duration tag, which was prepended to the prompt during training.
	- Bilingual prompts: trained with both English and Chinese captions.

	### 1.3 Released Models

	| Model | Architecture | DiT Variant | Parameters |
	|---|---|---|---:|
	| MOSS-SoundEffect-V2.0 | DiT + Flow Matching | `1.3B` | 1.3B |

	Recommended inference hyperparameters:

	| Parameter | Default | Description |
	|---|---:|---|
	| `num_inference_steps` | 100 | Number of flow-match solver steps. |
	| `cfg_scale` | 4.0 | Classifier-free guidance weight. |
	| `sigma_shift` | 5.0 | Flow-match scheduler shift applied per call. |
	| `seconds` | 10.0 | Output duration. Up to 30. |
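
To make the three sampler parameters above concrete, here is a minimal, dependency-free sketch of how they typically interact in a flow-matching sampler. It assumes the conventions common to flow-matching pipelines (noise level `sigma` running from 1 to 0, the time shift `shift * s / (1 + (shift - 1) * s)`, and Euler integration); the actual MOSS-SoundEffect scheduler may differ. `shifted_sigmas`, `cfg_combine`, `euler_sample`, and `velocity_fn` are hypothetical names used only for illustration.

```python
def shifted_sigmas(num_inference_steps: int, sigma_shift: float) -> list[float]:
    """Return num_inference_steps + 1 noise levels from 1.0 (noise) to 0.0 (data)."""
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be >= 1")
    sigmas = []
    for i in range(num_inference_steps + 1):
        s = 1.0 - i / num_inference_steps  # uniform grid on [1, 0]
        # Assumed time shift: values > 1 keep the schedule at high noise longer.
        sigmas.append(sigma_shift * s / (1.0 + (sigma_shift - 1.0) * s))
    return sigmas


def cfg_combine(v_cond: float, v_uncond: float, cfg_scale: float) -> float:
    """Classifier-free guidance: extrapolate away from the unconditional prediction."""
    return v_uncond + cfg_scale * (v_cond - v_uncond)


def euler_sample(x: float, velocity_fn, sigmas: list[float], cfg_scale: float) -> float:
    """Euler integration of dx/dsigma = v along the sigma schedule.

    velocity_fn(x, sigma, conditioned) is a hypothetical stand-in for the DiT;
    a scalar x is used here instead of a latent tensor to keep the sketch small.
    """
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        v = cfg_combine(
            velocity_fn(x, s_cur, True),
            velocity_fn(x, s_cur, False),
            cfg_scale,
        )
        x = x + (s_next - s_cur) * v
    return x
```

Each solver step costs two model evaluations under this scheme (conditional and unconditional), so runtime grows roughly linearly with `num_inference_steps`.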

	## 2. Quick Start

	### Environment Setup

	We recommend a clean, isolated Python 3.12 environment to avoid dependency conflicts with the top-level MOSS-TTS environment.

	```bash
	conda create -n moss-soundeffect-v2 python=3.12 -y
	conda activate moss-soundeffect-v2

	git clone https://github.com/OpenMOSS/MOSS-TTS.git
	cd MOSS-TTS/moss_soundeffect_v2
	pip install --extra-index-url https://download.pytorch.org/whl/cu128 \
	    -e ".[torch-cu128,finetune]"
	```

	For a minimal inference-only install (still ships the Gradio demo; skips the fine-tuning extras):

	```bash
	pip install --extra-index-url https://download.pytorch.org/whl/cu128 \
	    -e ".[torch-cu128]"
	```

	### Basic Usage

	```python
	import torch
	from moss_soundeffect_v2 import MossSoundEffectPipeline

	# Load the pipeline weights in bfloat16 on the GPU.
	pipe = MossSoundEffectPipeline.from_pretrained(
	    "OpenMOSS-Team/MOSS-SoundEffect-v2.0",  # this repo, or a local dir
	    torch_dtype=torch.bfloat16,
	    device="cuda",
	)

	# Generate 10 seconds of audio with the recommended sampler settings.
	audio = pipe(
	    prompt="A dog barking loudly in a park.",
	    seconds=10,
	    num_inference_steps=100,
	    cfg_scale=4.0,
	)  # (B, C, T) waveform tensor
	pipe.save_audio(audio, "out.wav")
	```
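
When calling the pipeline from a script or service, it can help to validate arguments before spending GPU time. The sketch below assumes only the keyword arguments shown in the call above, plus the documented 30-second limit and 48 kHz output rate. `build_generation_kwargs` and `expected_num_samples` are hypothetical helpers, not part of the `moss_soundeffect_v2` package, and the exact length of the returned waveform is an assumption since the pipeline may pad or trim to whole latent frames.

```python
SAMPLE_RATE_HZ = 48_000  # output rate stated in the model description
MAX_SECONDS = 30.0       # documented upper bound on duration


def build_generation_kwargs(
    prompt: str,
    seconds: float = 10.0,
    num_inference_steps: int = 100,
    cfg_scale: float = 4.0,
) -> dict:
    """Validate inputs and return keyword arguments for the pipeline call."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError(f"seconds must be in (0, {MAX_SECONDS}], got {seconds}")
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be >= 1")
    return {
        "prompt": prompt.strip(),
        "seconds": float(seconds),
        "num_inference_steps": int(num_inference_steps),
        "cfg_scale": float(cfg_scale),
    }


def expected_num_samples(seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Approximate length T of the (B, C, T) waveform for a given duration."""
    return round(seconds * sample_rate_hz)
```

With these helpers, a call would look like `audio = pipe(**build_generation_kwargs("Rain on a tin roof.", seconds=15))`, and `expected_num_samples(15)` gives a rough size to check the result against.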

	> The underlying DiT is wrapped with `torch.compile` + Triton CUDA Graph. The first call may take a few minutes to compile. If you hit `TorchDynamo` / Triton errors, set `TORCHDYNAMO_DISABLE=1` before launching Python.
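
If exporting the variable in the shell is inconvenient (for example in a notebook), one option is to set it from Python at the very top of the entry script. This is a minimal sketch assuming the variable is read when `torch` is first imported, so the assignment must come before any import that pulls in `torch`.

```python
import os

# Set this before the first `import torch` (directly or via moss_soundeffect_v2);
# assumption: the flag is read at import time, so setting it later may have no effect.
os.environ["TORCHDYNAMO_DISABLE"] = "1"

# Heavy imports go below this line, e.g.:
# import torch
# from moss_soundeffect_v2 import MossSoundEffectPipeline
```

Disabling compilation removes the warm-up delay but will likely make each generation slower, so it is best treated as a fallback rather than a default.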

	For a Gradio demo and fine-tuning recipes, see the [GitHub README](https://github.com/OpenMOSS/MOSS-TTS/tree/main/moss_soundeffect_v2#readme).