---
license: mit
tags:
- audio
- audio-enhancement
- speech-enhancement
- bandwidth-extension
- codec-repair
- neural-codec
- waveform-processing
- pytorch
library_name: pytorch
pipeline_tag: audio-to-audio
frameworks: PyTorch
language:
- en
---
# Brontes: Synthesis-First Waveform Enhancement
|
|
**Brontes** is a time-domain audio enhancement model designed for neural codec repair and bandwidth extension. This is the general pretrained model trained on diverse audio data.


## Model Description


Brontes upsamples and repairs speech degraded by neural codec compression. Unlike conventional Wave U-Net approaches that rely on dense skip connections, Brontes uses a **synthesis-first architecture** with selective deep skips, forcing the model to actively reconstruct rather than copy degraded input details.
|
|
### Key Capabilities


- **Neural codec repair** — removes compression artifacts from neural codec outputs
- **Bandwidth extension** — upsamples from 24 kHz to 48 kHz (2× extension)
- **Waveform-domain processing** — operates directly on audio samples, no spectrogram conversion
- **Synthesis-first design** — only the two deepest skips retained, preventing artifact leakage
- **LSTM bottleneck** — captures long-range temporal dependencies at maximum compression


### Model Architecture


- **Type:** Encoder-decoder U-Net with selective skip connections
- **Stages:** 6 encoder stages + 6 decoder stages (4096× total compression)
- **Bottleneck:** Bidirectional LSTM for temporal modeling
- **Parameters:** ~29M
- **Input:** 24 kHz mono audio (codec-degraded)
- **Output:** 48 kHz mono audio (enhanced)
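The architecture described above can be sketched as a toy PyTorch module. Everything below is illustrative: the `SynthesisFirstUNet` name, channel widths, and kernel sizes are assumptions, not the actual ~29M-parameter Brontes implementation. Only the overall shape follows the description: six strided-conv encoder stages (stride 4 each, so 4⁶ = 4096× compression), a bidirectional LSTM bottleneck, six transposed-conv decoder stages with skips at only the two deepest levels, and a final 2× upsampler for the 24 kHz → 48 kHz bandwidth extension.

```python
import torch
import torch.nn as nn


class SynthesisFirstUNet(nn.Module):
    """Toy synthesis-first U-Net sketch (NOT the real Brontes model).

    Six encoder stages with stride 4 give 4**6 = 4096x temporal compression;
    a BiLSTM models the bottleneck sequence; only the two DEEPEST encoder
    outputs are added back as skips, so shallow detail must be synthesized
    rather than copied. A final 2x transposed conv doubles the sample rate.
    """

    def __init__(self, base=8, depth=6, hidden=32):
        super().__init__()
        chans = [1] + [base * 2**i for i in range(depth)]  # [1, 8, ..., 256]
        self.encoder = nn.ModuleList(
            nn.Sequential(nn.Conv1d(chans[i], chans[i + 1], 8, stride=4, padding=2), nn.GELU())
            for i in range(depth)
        )
        self.lstm = nn.LSTM(chans[-1], hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, chans[-1])
        dec = []
        for i in reversed(range(depth)):
            block = [nn.ConvTranspose1d(chans[i + 1], chans[i], 8, stride=4, padding=2)]
            if i > 0:  # no activation on the final waveform-producing stage
                block.append(nn.GELU())
            dec.append(nn.Sequential(*block))
        self.decoder = nn.ModuleList(dec)
        self.upsample2x = nn.ConvTranspose1d(1, 1, 4, stride=2, padding=1)

    def forward(self, x):  # x: (B, 1, T), T divisible by 4096
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        z, _ = self.lstm(x.transpose(1, 2))        # bottleneck over time
        x = self.proj(z).transpose(1, 2)
        for i, dec_stage in enumerate(self.decoder):
            if i < 2:                              # only the two deepest skips
                x = x + skips[-(i + 1)]
            x = dec_stage(x)
        return self.upsample2x(x)                  # (B, 1, 2*T): 24k in, 48k out
```

The 2× length increase comes entirely from the final transposed convolution; the encoder/decoder stack itself is length-preserving end to end.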
|
|
## Intended Use


This is a **general pretrained model** trained on diverse audio data. For optimal performance on your specific use case:


⚠️ **It is strongly recommended to fine-tune this model on your target dataset** using the `--pretrained` flag.


### Primary Use Cases


- Repairing audio degraded by neural codecs (e.g., EnCodec, SoundStream, Lyra)
- Bandwidth extension from narrowband/wideband to fullband
- Speech enhancement and quality improvement
- Post-processing for codec-compressed audio
|
|
## Quick Start


For detailed usage instructions, training, and fine-tuning, please see the [GitHub repository](https://github.com/ZDisket/Brontes).


### Basic Inference Example
|
|
```python
import torch
import torchaudio
import yaml
from brontes import Brontes

# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load config
with open('configs/config_brontes_48khz_demucs.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Create model
model = Brontes(unet_config=config['model'].get('unet_config', {})).to(device)

# Load checkpoint
checkpoint = torch.load('path/to/checkpoint.pt', map_location=device)
model.load_state_dict(checkpoint['model'] if 'model' in checkpoint else checkpoint)
model.eval()

# Load audio
audio, sr = torchaudio.load('input.wav')
target_sr = config['dataset']['sample_rate']

# Resample if necessary
if sr != target_sr:
    resampler = torchaudio.transforms.Resample(sr, target_sr)
    audio = resampler(audio)

# Convert to mono and normalize
if audio.shape[0] > 1:
    audio = audio.mean(dim=0, keepdim=True)
max_val = audio.abs().max()
if max_val > 0:
    audio = audio / max_val

# Add batch dimension and process
audio = audio.unsqueeze(0).to(device)
with torch.no_grad():
    output, _, _, _ = model(audio)

# Save output (Brontes outputs audio at twice the input rate; make sure the
# save rate matches the model's 48 kHz output, not the 24 kHz input)
output = output.squeeze(0).cpu()
if output.abs().max() > 1.0:
    output = output / output.abs().max()
torchaudio.save('output.wav', output, target_sr)
```
|
|
Or use the command-line interface:


```bash
python infer_brontes.py \
    --config configs/config_brontes_48khz_demucs.yaml \
    --checkpoint path/to/checkpoint.pt \
    --input input.wav \
    --output output.wav
```
|
|
## Training Details


### Training Data


The model was trained on diverse audio data including:
- Clean speech recordings
- Codec-degraded audio pairs
- Various acoustic conditions and speakers
|
|
### Training Procedure


- **Pretraining:** 10,000 steps generator-only training
- **Adversarial training:** Multi-Period Discriminator (MPD) + Multi-Band Spectral Discriminator (MBSD)
- **Loss functions:** Multi-scale mel loss, pitch loss, adversarial loss, feature matching
- **Precision:** BF16 mixed precision
- **Framework:** PyTorch with custom training loop
|
|
## Fine-tuning Recommendations


To achieve best results on your specific dataset:


1. **Prepare paired data:** Input (degraded) and target (clean) audio pairs
2. **Use the `--pretrained` flag** to load model weights without optimizer state
3. **Train for 10-50k steps** depending on dataset size
4. **Monitor validation loss** to prevent overfitting
|
|
See the [repository README](https://github.com/ZDisket/Brontes) for detailed fine-tuning instructions.
|
|
## Limitations


- **Domain-specific performance:** General model may not perform optimally on highly specialized audio (fine-tuning recommended)
- **Mono audio only:** Currently supports single-channel audio
- **Fixed sample rates:** Designed for 24 kHz input → 48 kHz output
- **Codec-specific artifacts:** Performance may vary across different codec types
- **Long-form audio:** Very long audio files may need to be processed in chunks to fit in GPU memory
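One common way to handle the long-form limitation is to process overlapping windows and cross-fade them in the output domain. The sketch below is not part of the Brontes codebase; it only assumes the model maps a `(1, 1, T)` tensor at 24 kHz to `(1, 1, 2*T)` at 48 kHz, as in the Quick Start example, and the chunk/overlap sizes are illustrative.

```python
import torch


def enhance_in_chunks(model, audio_24k, chunk=24_000 * 10, overlap=24_000):
    """Chunked inference sketch (CPU tensors, overlap < chunk assumed).

    Overlapping 24 kHz windows are enhanced independently; the 48 kHz
    outputs are blended with triangular cross-fade weights and normalized
    by the accumulated weight, so seams between chunks are smoothed.
    """
    hop = chunk - overlap
    T = audio_24k.shape[-1]
    out = torch.zeros(1, 1, 2 * T)
    weight = torch.zeros(1, 1, 2 * T)
    # strictly positive ramp so no output sample ends up with zero weight
    ramp = torch.linspace(0.0, 1.0, 2 * overlap + 2)[1:-1]
    fade = torch.ones(2 * chunk)
    fade[:2 * overlap] = ramp            # fade-in
    fade[-2 * overlap:] = ramp.flip(0)   # fade-out
    for start in range(0, T, hop):
        piece = audio_24k[..., start:start + chunk]
        n = piece.shape[-1]
        with torch.no_grad():
            y = model(piece)             # (1, 1, 2*n) at 48 kHz
        w = fade[:2 * n]
        out[..., 2 * start:2 * start + 2 * n] += y * w
        weight[..., 2 * start:2 * start + 2 * n] += w
    return out / weight.clamp(min=1e-8)
```

Because the output is divided by the accumulated weight, regions covered by a single chunk are passed through unchanged, while overlap regions become a smooth weighted blend of the two neighboring chunks.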
|
|
## Ethical Considerations


- This model is designed for audio enhancement and should not be used to create misleading or deceptive content
- Users should respect privacy and consent when processing speech recordings
- Enhanced audio should be clearly labeled as processed when used in sensitive contexts
|
|
|
|
## License


Both the model weights and code are released under the MIT License.


## Additional Resources


- **GitHub Repository:** [https://github.com/ZDisket/Brontes](https://github.com/ZDisket/Brontes)
- **Technical Report:** See the repository
- **Issues & Support:** [GitHub Issues](https://github.com/ZDisket/Brontes/issues)


## Acknowledgments


Compute resources provided by Hot Aisle and AI at AMD.
|
|