---
license: mit
tags:
- audio
- audio-enhancement
- speech-enhancement
- bandwidth-extension
- codec-repair
- neural-codec
- waveform-processing
- pytorch
library_name: pytorch
pipeline_tag: audio-to-audio
frameworks: PyTorch
language:
- en
---
# Brontes: Synthesis-First Waveform Enhancement
|
|
**Brontes** is a time-domain audio enhancement model designed for neural codec repair and bandwidth extension. This is the general pretrained model trained on diverse audio data.


## Model Description


Brontes upsamples and repairs speech degraded by neural codec compression. Unlike conventional Wave U-Net approaches that rely on dense skip connections, Brontes uses a **synthesis-first architecture** with selective deep skips, forcing the model to actively reconstruct rather than copy degraded input details.
|
|
### Key Capabilities


- **Neural codec repair** — removes compression artifacts from neural codec outputs
- **Bandwidth extension** — upsamples from 24 kHz to 48 kHz (2× extension)
- **Waveform-domain processing** — operates directly on audio samples, no spectrogram conversion
- **Synthesis-first design** — only the two deepest skips retained, preventing artifact leakage
- **LSTM bottleneck** — captures long-range temporal dependencies at maximum compression


### Model Architecture


- **Type:** Encoder-decoder U-Net with selective skip connections
- **Stages:** 6 encoder stages + 6 decoder stages (4096× total compression)
- **Bottleneck:** Bidirectional LSTM for temporal modeling
- **Parameters:** ~29M
- **Input:** 24 kHz mono audio (codec-degraded)
- **Output:** 48 kHz mono audio (enhanced)
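The architecture described above can be sketched as a toy PyTorch module. Everything below is illustrative: the `SynthesisFirstUNet` name, channel widths, and kernel sizes are assumptions, not the actual ~29M-parameter Brontes implementation. Only the overall shape follows the description: six strided-conv encoder stages (stride 4 each, so 4⁶ = 4096× compression), a bidirectional LSTM bottleneck, six transposed-conv decoder stages with skips at only the two deepest levels, and a final 2× upsampler for the 24 kHz → 48 kHz bandwidth extension.

```python
import torch
import torch.nn as nn


class SynthesisFirstUNet(nn.Module):
    """Toy synthesis-first U-Net sketch (NOT the real Brontes model).

    Six encoder stages with stride 4 give 4**6 = 4096x temporal compression;
    a BiLSTM models the bottleneck sequence; only the two DEEPEST encoder
    outputs are added back as skips, so shallow detail must be synthesized
    rather than copied. A final 2x transposed conv doubles the sample rate.
    """

    def __init__(self, base=8, depth=6, hidden=32):
        super().__init__()
        chans = [1] + [base * 2**i for i in range(depth)]  # [1, 8, ..., 256]
        self.encoder = nn.ModuleList(
            nn.Sequential(nn.Conv1d(chans[i], chans[i + 1], 8, stride=4, padding=2), nn.GELU())
            for i in range(depth)
        )
        self.lstm = nn.LSTM(chans[-1], hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, chans[-1])
        dec = []
        for i in reversed(range(depth)):
            block = [nn.ConvTranspose1d(chans[i + 1], chans[i], 8, stride=4, padding=2)]
            if i > 0:  # no activation on the final waveform-producing stage
                block.append(nn.GELU())
            dec.append(nn.Sequential(*block))
        self.decoder = nn.ModuleList(dec)
        self.upsample2x = nn.ConvTranspose1d(1, 1, 4, stride=2, padding=1)

    def forward(self, x):  # x: (B, 1, T), T divisible by 4096
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        z, _ = self.lstm(x.transpose(1, 2))        # bottleneck over time
        x = self.proj(z).transpose(1, 2)
        for i, dec_stage in enumerate(self.decoder):
            if i < 2:                              # only the two deepest skips
                x = x + skips[-(i + 1)]
            x = dec_stage(x)
        return self.upsample2x(x)                  # (B, 1, 2*T): 24k in, 48k out
```

The 2× length increase comes entirely from the final transposed convolution; the encoder/decoder stack itself is length-preserving end to end.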
|
|
## Intended Use


This is a **general pretrained model** trained on diverse audio data. For optimal performance on your specific use case:


⚠️ **It is strongly recommended to fine-tune this model on your target dataset** using the `--pretrained` flag.


### Primary Use Cases


- Repairing audio degraded by neural codecs (e.g., EnCodec, SoundStream, Lyra)
- Bandwidth extension from narrowband/wideband to fullband
- Speech enhancement and quality improvement
- Post-processing for codec-compressed audio
|
|
## Quick Start


For detailed usage instructions, training, and fine-tuning, please see the [GitHub repository](https://github.com/ZDisket/Brontes).


### Basic Inference Example
|
|
```python
import torch
import torchaudio
import yaml
from brontes import Brontes

# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load config
with open('configs/config_brontes_48khz_demucs.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Create model
model = Brontes(unet_config=config['model'].get('unet_config', {})).to(device)

# Load checkpoint
checkpoint = torch.load('path/to/checkpoint.pt', map_location=device)
model.load_state_dict(checkpoint['model'] if 'model' in checkpoint else checkpoint)
model.eval()

# Load audio
audio, sr = torchaudio.load('input.wav')
target_sr = config['dataset']['sample_rate']

# Resample if necessary
if sr != target_sr:
    resampler = torchaudio.transforms.Resample(sr, target_sr)
    audio = resampler(audio)

# Convert to mono and normalize
if audio.shape[0] > 1:
    audio = audio.mean(dim=0, keepdim=True)
max_val = audio.abs().max()
if max_val > 0:
    audio = audio / max_val

# Add batch dimension and process
audio = audio.unsqueeze(0).to(device)
with torch.no_grad():
    output, _, _, _ = model(audio)

# Save output (Brontes outputs audio at twice the input rate; make sure the
# save rate matches the model's 48 kHz output, not the 24 kHz input)
output = output.squeeze(0).cpu()
if output.abs().max() > 1.0:
    output = output / output.abs().max()
torchaudio.save('output.wav', output, target_sr)
```
|
|
Or use the command-line interface:


```bash
python infer_brontes.py \
    --config configs/config_brontes_48khz_demucs.yaml \
    --checkpoint path/to/checkpoint.pt \
    --input input.wav \
    --output output.wav
```
|
|
## Training Details


### Training Data


The model was trained on diverse audio data including:
- Clean speech recordings
- Codec-degraded audio pairs
- Various acoustic conditions and speakers
|
|
### Training Procedure


- **Pretraining:** 10,000 steps generator-only training
- **Adversarial training:** Multi-Period Discriminator (MPD) + Multi-Band Spectral Discriminator (MBSD)
- **Loss functions:** Multi-scale mel loss, pitch loss, adversarial loss, feature matching
- **Precision:** BF16 mixed precision
- **Framework:** PyTorch with custom training loop
|
|
## Fine-tuning Recommendations


To achieve best results on your specific dataset:


1. **Prepare paired data:** Input (degraded) and target (clean) audio pairs
2. **Use the `--pretrained` flag** to load model weights without optimizer state
3. **Train for 10-50k steps** depending on dataset size
4. **Monitor validation loss** to prevent overfitting
|
|
See the [repository README](https://github.com/ZDisket/Brontes) for detailed fine-tuning instructions.
|
|
## Limitations


- **Domain-specific performance:** General model may not perform optimally on highly specialized audio (fine-tuning recommended)
- **Mono audio only:** Currently supports single-channel audio
- **Fixed sample rates:** Designed for 24 kHz input → 48 kHz output
- **Codec-specific artifacts:** Performance may vary across different codec types
- **Long-form audio:** Very long audio files may need to be processed in chunks to fit in GPU memory
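One common way to handle the long-form limitation is to process overlapping windows and cross-fade them in the output domain. The sketch below is not part of the Brontes codebase; it only assumes the model maps a `(1, 1, T)` tensor at 24 kHz to `(1, 1, 2*T)` at 48 kHz, as in the Quick Start example, and the chunk/overlap sizes are illustrative.

```python
import torch


def enhance_in_chunks(model, audio_24k, chunk=24_000 * 10, overlap=24_000):
    """Chunked inference sketch (CPU tensors, overlap < chunk assumed).

    Overlapping 24 kHz windows are enhanced independently; the 48 kHz
    outputs are blended with triangular cross-fade weights and normalized
    by the accumulated weight, so seams between chunks are smoothed.
    """
    hop = chunk - overlap
    T = audio_24k.shape[-1]
    out = torch.zeros(1, 1, 2 * T)
    weight = torch.zeros(1, 1, 2 * T)
    # strictly positive ramp so no output sample ends up with zero weight
    ramp = torch.linspace(0.0, 1.0, 2 * overlap + 2)[1:-1]
    fade = torch.ones(2 * chunk)
    fade[:2 * overlap] = ramp            # fade-in
    fade[-2 * overlap:] = ramp.flip(0)   # fade-out
    for start in range(0, T, hop):
        piece = audio_24k[..., start:start + chunk]
        n = piece.shape[-1]
        with torch.no_grad():
            y = model(piece)             # (1, 1, 2*n) at 48 kHz
        w = fade[:2 * n]
        out[..., 2 * start:2 * start + 2 * n] += y * w
        weight[..., 2 * start:2 * start + 2 * n] += w
    return out / weight.clamp(min=1e-8)
```

Because the output is divided by the accumulated weight, regions covered by a single chunk are passed through unchanged, while overlap regions become a smooth weighted blend of the two neighboring chunks.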
|
|
## Ethical Considerations


- This model is designed for audio enhancement and should not be used to create misleading or deceptive content
- Users should respect privacy and consent when processing speech recordings
- Enhanced audio should be clearly labeled as processed when used in sensitive contexts
|
|
|
|
## License


Both the model weights and code are released under the MIT License.


## Additional Resources


- **GitHub Repository:** [https://github.com/ZDisket/Brontes](https://github.com/ZDisket/Brontes)
- **Technical Report:** See the repository
- **Issues & Support:** [GitHub Issues](https://github.com/ZDisket/Brontes/issues)


## Acknowledgments


Compute resources provided by Hot Aisle and AI at AMD.
|
|