Submission to the Interspeech 2026 Audio Encoder Capability Challenge

Audio Embeddings with Lightning & Hydra

This project is a clean, modular, and scalable implementation of audio embedding models using PyTorch Lightning and Hydra. It is designed to be easily extensible and to run on local or cluster environments. It currently implements the Audio-JEPA architecture, based on the reference Audio-JEPA implementation; other architectures will be added in the future.

🎯 Goal

The goal of this project is to provide a robust codebase for training and experimenting with audio embedding models. Key features include:

  • Modular Architecture: Components like Spectrogram, Masking, and ViT are decoupled.
  • Configurable Positional Embeddings: Support for RoPE (2D Rotary Embeddings), SinCos (2D Sinusoidal), and Learnable embeddings.
  • Hydra Configuration: Flexible experiment management via hierarchical config files.
  • Lightning Trainer: Simplified training loop, logging, and checkpointing.
  • Modern Tooling: Uses uv for fast and reliable dependency management.
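As an illustration of that hierarchy, a top-level config such as configs/train.yaml typically composes one file from each config group via a Hydra defaults list. The group entries below are a sketch inferred from the project structure, not the actual file:

```yaml
# configs/train.yaml — illustrative sketch of Hydra composition
defaults:
  - data: audioset        # picks configs/data/audioset.yaml (name assumed)
  - model: audio_jepa     # picks configs/model/audio_jepa.yaml (name assumed)
  - trainer: cpu
  - logger: null          # no logger unless overridden, e.g. logger=wandb
  - callbacks: default
  - _self_

seed: 42
```

Any entry in the defaults list can then be swapped from the command line (e.g. `trainer=gpu`), which is what the Usage examples below rely on.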

πŸš€ Installation

This project uses uv for dependency management.

  1. Install uv (if not already installed):

    curl -LsSf https://astral.sh/uv/install.sh | sh
    
  2. Clone the repository:

    git clone <repository_url>
    cd audio-embeddings
    
  3. Install dependencies:

    uv sync
    
  4. Enable shared git hooks (runs uv sync after merge/checkout/rewrite):

    git config core.hooksPath .githooks
    

πŸƒ Usage

Basic Training

To start training with the default configuration:

uv run src/train.py

Common Commands

Run on GPU with Weights & Biases logging:

uv run src/train.py trainer=gpu logger=wandb

Override hyperparameters on the command line:

uv run src/train.py data.batch_size=64 trainer.max_epochs=50
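Under the hood, each `key=value` argument overrides one nested entry of the composed config. As a rough illustration of the semantics only (not Hydra's actual implementation), dotted keys walk the config tree:

```python
def apply_override(cfg: dict, dotted_key: str, value):
    """Set cfg['a']['b'] = value for dotted_key 'a.b' (illustrative only)."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})  # descend, creating nodes as needed
    node[leaf] = value
    return cfg

cfg = {"data": {"batch_size": 32}, "trainer": {"max_epochs": 10}}
apply_override(cfg, "data.batch_size", 64)
apply_override(cfg, "trainer.max_epochs", 50)
print(cfg)  # {'data': {'batch_size': 64}, 'trainer': {'max_epochs': 50}}
```

In real Hydra, a bare `key=value` requires the key to already exist in the composed config; the `+` and `++` prefixes (seen in later examples) add or add-or-override a key instead.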

Configurable Positional Embeddings

You can switch between different positional embedding strategies easily:

RoPE:

uv run src/train.py model.net.encoder.pos_embed_type=rope

2D SinCos:

uv run src/train.py ++model.net.encoder.pos_embed_type=sincos ++model.net.predictor.pos_embed_type=sincos

Learnable:

uv run src/train.py ++model.net.encoder.pos_embed_type=learnable ++model.net.predictor.pos_embed_type=learnable

Offline WandB Logging with Model Checkpoints

To run training offline but still have model checkpoints staged for upload (which standard WandB restricts):

uv run src/train.py \
    logger=wandb \
    logger.wandb.offline=True \
    logger.wandb.log_model=False \
    +callbacks.wandb_offline_checkpoint._target_=src.callbacks.wandb_callbacks.WandbOfflineCheckpointCallback \
    trainer=gpu trainer.devices=1 \
    data.batch_size=128 trainer.max_epochs=100

These checkpoints will be uploaded when you run wandb sync.
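For reference, the 2D SinCos option concatenates 1D sin/cos embeddings of the time and frequency patch coordinates. A minimal NumPy sketch of the general technique (the half/half channel split and function names are assumptions, not necessarily what src/models uses):

```python
import numpy as np

def sincos_1d(positions: np.ndarray, dim: int) -> np.ndarray:
    """Standard 1D sin/cos embedding: (N,) positions -> (N, dim)."""
    assert dim % 2 == 0
    omega = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = np.outer(positions, omega)               # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(n_time: int, n_freq: int, dim: int) -> np.ndarray:
    """2D grid embedding: half the channels encode time, half frequency."""
    t, f = np.meshgrid(np.arange(n_time), np.arange(n_freq), indexing="ij")
    emb_t = sincos_1d(t.ravel(), dim // 2)
    emb_f = sincos_1d(f.ravel(), dim // 2)
    return np.concatenate([emb_t, emb_f], axis=1)     # (n_time * n_freq, dim)

pe = sincos_2d(n_time=8, n_freq=4, dim=64)
print(pe.shape)  # (32, 64)
```

Unlike the learnable variant, this table is fixed at construction time and adds no trainable parameters.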

πŸ“‚ Project Structure

β”œβ”€β”€ configs/                 # Hydra configuration files
β”‚   β”œβ”€β”€ callbacks/           # Callback configs (checkpoints, early stopping)
β”‚   β”œβ”€β”€ data/                # Data configs (AudioSet, etc.)
β”‚   β”œβ”€β”€ logger/              # Logger configs (WandB, Tensorboard)
β”‚   β”œβ”€β”€ model/               # Model configs (AudioJEPA parameters)
β”‚   β”œβ”€β”€ trainer/             # Trainer configs (CPU, GPU, strategies)
β”‚   └── train.yaml           # Main configuration entry point
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ data/                # Data loading logic
β”‚   β”‚   └── audioset_datamodule.py  # AudioSet DataModule & Dataset
β”‚   β”œβ”€β”€ models/              # Model architectures
β”‚   β”‚   β”œβ”€β”€ components/      # Reusable blocks
β”‚   β”‚   β”‚   β”œβ”€β”€ masking.py   # Masking generators
β”‚   β”‚   β”‚   β”œβ”€β”€ patch_embed.py # Patchification
β”‚   β”‚   β”‚   β”œβ”€β”€ rope.py      # 2D Rotary Embeddings
β”‚   β”‚   β”‚   β”œβ”€β”€ spectrogram.py # Audio preprocessing
β”‚   β”‚   β”‚   └── vit.py       # Vision Transformer (Student/Teacher/Predictor)
β”‚   β”‚   └── audio_jepa_module.py # Main LightningModule
β”‚   β”œβ”€β”€ utils/               # Utility functions
β”‚   └── train.py             # Training entry point
β”œβ”€β”€ scripts/                 # Helper scripts
β”œβ”€β”€ tests/                   # Verification tests
β”œβ”€β”€ pyproject.toml           # Project dependencies
└── README.md                # This file

πŸ› οΈ Extensibility

Adding a New Model

  1. Create your model components in src/models/components/.
  2. Create a new LightningModule in src/models/ (or update AudioJEPAModule).
  3. Create a new config file in configs/model/my_new_model.yaml.
  4. Run with uv run src/train.py model=my_new_model.
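For step 3, a model config typically points Hydra at the new class via `_target_`. A hypothetical sketch (the class path and parameters below are placeholders, not from this repo):

```yaml
# configs/model/my_new_model.yaml — hypothetical sketch
_target_: src.models.my_new_module.MyNewModule
net:
  encoder_dim: 768
  depth: 12
optimizer:
  lr: 1.0e-4
```

Hydra instantiates `_target_` with the remaining keys as constructor arguments, so the YAML structure should mirror your LightningModule's `__init__` signature.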

Adding a New Dataset

  1. Create a new DataModule in src/data/.
  2. Create a new config file in configs/data/my_dataset.yaml.
  3. Run with uv run src/train.py data=my_dataset.
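A DataModule ultimately wraps a standard PyTorch map-style dataset, which only needs `__len__` and `__getitem__`. An illustrative sketch (class and field names are hypothetical, not from this repo):

```python
class MyAudioDataset:
    """Minimal map-style dataset: __len__ + __getitem__ is all a DataLoader needs."""

    def __init__(self, file_paths):
        self.file_paths = list(file_paths)

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # A real dataset would load and preprocess the audio file here.
        return {"path": self.file_paths[idx], "label": idx}

ds = MyAudioDataset(["a.wav", "b.wav"])
print(len(ds))        # 2
print(ds[1]["path"])  # b.wav
```

The new DataModule in src/data/ would construct datasets like this in `setup()` and return DataLoaders from `train_dataloader()` / `val_dataloader()`.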

Adding Functionalities

  • Callbacks: Add custom callbacks in src/callbacks/ (if needed) or use existing Lightning callbacks, and configure them in configs/callbacks/.
  • Metrics: Add metrics logging in training_step or validation_step inside src/models/audio_jepa_module.py.

πŸ§ͺ Testing

Run verification scripts to ensure components are working:

uv run tests/verify_rope.py
uv run tests/verify_custom_rope.py