Audio Spectrogram Transformer (AST) for Music Genre Classification
This model was developed to identify musical genres in complex, noisy audio environments. It was built as part of a deep learning project focused on the Messy Mashup challenge, where the goal was to classify genres even when multiple instrument tracks and environmental noises overlap.
Model Overview
This model is based on the Audio Spectrogram Transformer (AST). Unlike traditional models that look at audio as a simple sequence of sounds, the AST treats an audio spectrogram like a high-resolution image. It uses an Attention Mechanism to look at the entire 10-second clip at once, allowing it to pick up on both subtle instrument textures and overall rhythmic patterns.
Technical Details
- Architecture: Audio Spectrogram Transformer (AST)
- Sampling Rate: 16,000 Hz
- Input Length: 10 Seconds
- Feature Type: Log-Mel Spectrogram (128 Mel bands)
- Performance: Achieved a 0.85 Macro F1-score on the evaluation set.
The Training Strategy: Cross-Song Recombination
The main reason this model performs well is a training technique called Cross-Song Recombination.
In most training setups, a model listens to a single song at a time. To make this model more resilient to noise and "messy" audio, we created synthetic mashups during training. We took individual "stems" (the separate recordings for bass, drums, vocals, and other instruments) from different songs within the same genre and mixed them together.
We also added environmental noise from the ESC-50 dataset. This forced the model to ignore the chaos and focus only on the core spectral patterns that define a genre—such as the specific frequency of a blues guitar or the tempo of a techno beat.
Supported Genres
The model is trained to classify audio into one of the following 10 categories:
- Blues
- Classical
- Country
- Disco
- Hiphop
- Jazz
- Metal
- Pop
- Reggae
- Rock
How to Use
Visit my huggingfaces spaces where you can check what genre your .wav file is!
- Downloads last month
- 109
Model tree for afloven/messymashupclassifier
Base model
MIT/ast-finetuned-audioset-10-10-0.4593