# LAION-CLAP (Music) → Core ML
On-device audio-embedding model for Apple Silicon Macs. Converted from
`laion/larger_clap_music` (HTSAT-base audio encoder + audio projection) to a
self-contained, int8-quantized Core ML `.mlpackage`.
Used by Gridshift for sample similarity search ("find samples that sound like this kick") and, in a later phase, text-to-sample retrieval.
## Input / output contract
- `audio`: fp32 tensor `[1, 480000]`. 10 s of mono audio at 48 kHz, peak-normalized to [-1, 1].
- `embedding`: fp32 tensor `[1, 512]`. L2-normalized, so cosine similarity reduces to a dot product.
Mel-spectrogram preprocessing is baked into the model graph (via a convmelspec STFT), so the client does zero DSP preprocessing: just supply raw audio samples.
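A minimal invocation sketch in Python, assuming the package is saved as `ClapMusic.mlpackage` (the file name is illustrative) and using the input/output names from the contract above:

```python
import numpy as np
import coremltools as ct  # Core ML prediction requires macOS

model = ct.models.MLModel("ClapMusic.mlpackage")

# 10 s of mono 48 kHz audio; replace with real samples
audio = np.random.uniform(-1.0, 1.0, (1, 480_000)).astype(np.float32)

# peak-normalize to [-1, 1], guarding against silence
peak = np.abs(audio).max()
if peak > 0:
    audio = audio / peak

out = model.predict({"audio": audio})
embedding = out["embedding"]      # fp32, shape (1, 512)
print(np.linalg.norm(embedding))  # ~1.0: output is L2-normalized
```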
## Accuracy vs PyTorch reference (5 synthetic signals)
| signal | cos(ref, coreml) |
|---|---|
| sine 440 Hz | 0.99851 |
| sine 220 Hz | 0.99746 |
| white noise | 0.99977 |
| silence | 0.99986 |
| clipped noise | 0.99977 |
Pairwise distance structure between the signals is preserved with a maximum drift of 0.004 (threshold ≤ 0.02), so relative similarity rankings between samples remain intact through the int8 quantization.
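The drift check amounts to comparing every pairwise cosine similarity before and after conversion. A self-contained sketch, with random stand-ins for the five embeddings (the real ones come from the two models):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_pairwise_drift(ref, test):
    """Max absolute change in pairwise cosine similarity between two embedding sets."""
    n = ref.shape[0]
    return max(abs(cos(ref[i], ref[j]) - cos(test[i], test[j]))
               for i in range(n) for j in range(i + 1, n))

# Toy demo: slightly perturbed copies stand in for the Core ML embeddings
rng = np.random.default_rng(0)
ref = rng.standard_normal((5, 512)).astype(np.float32)
test = ref + 0.01 * rng.standard_normal((5, 512)).astype(np.float32)
print(max_pairwise_drift(ref, test))  # well under the 0.02 threshold
```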
## Handling audio of different lengths
The Core ML graph is shape-rigid at 480,000 samples (10 s). The client is expected to preprocess as follows (sketched after the list):
- ≤ 10 s: zero-pad on the right.
- < 200 ms (short one-shots): repeat-pad to ~500 ms, then zero-pad. This prevents the padding from dominating the embedding.
- > 10 s (loops): take three 10 s windows at 50 % overlap, embed each, then mean-pool the three 512-d embeddings and re-normalize to unit length.
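A minimal sketch of this policy in Python/NumPy. The window placement for clips that are not exactly 20 s is an assumption (three evenly spaced windows), and the function names are illustrative:

```python
import numpy as np

SR = 48_000
TARGET = 10 * SR           # 480,000 samples
SHORT = int(0.2 * SR)      # 200 ms
REPEAT_TO = int(0.5 * SR)  # ~500 ms

def prepare_windows(audio: np.ndarray) -> np.ndarray:
    """Turn a mono fp32 signal at 48 kHz into one or more [480000] model inputs."""
    n = audio.shape[0]
    if n < SHORT:
        # repeat-pad short one-shots to ~500 ms so padding doesn't dominate
        reps = int(np.ceil(REPEAT_TO / n))
        audio = np.tile(audio, reps)[:REPEAT_TO]
        n = audio.shape[0]
    if n <= TARGET:
        # zero-pad on the right to exactly 10 s
        out = np.zeros(TARGET, dtype=np.float32)
        out[:n] = audio
        return out[None, :]
    # long loops: three 10 s windows (evenly spaced; 50 % overlap at 20 s)
    starts = np.linspace(0, n - TARGET, num=3).astype(int)
    return np.stack([audio[s:s + TARGET] for s in starts])

def pool_embeddings(embs: np.ndarray) -> np.ndarray:
    """Mean-pool per-window embeddings, then re-normalize to unit length."""
    v = embs.mean(axis=0)
    return v / np.linalg.norm(v)
```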
## License and attribution
Apache-2.0, inherited from upstream LAION-CLAP. Please cite:
Wu et al., "Large-scale Contrastive Language-Audio Pretraining with Feature
Fusion and Keyword-to-Caption Augmentation", 2022.
https://arxiv.org/abs/2211.06687
## Conversion details
Conversion was done with the script at `app/ml/clap/convert_to_coreml.py` in the Gridshift source tree, using:
- PyTorch 2.11 + torch.export
- coremltools 9.0 MLProgram backend
- int8 symmetric weight quantization
- bicubic → bilinear interpolation swap for Core ML compatibility (minimal accuracy impact)
- CLAP window-size patch for `torch.jit.is_tracing` branch divergence
- fixed input shape `[1, 480000]` baked into the graph
Target: macOS 14+.
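The shape of that recipe, as a hedged sketch rather than the actual script: a tiny `DummyClap` module stands in for the patched CLAP audio tower, and API details may vary across coremltools versions.

```python
import numpy as np
import torch
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

class DummyClap(torch.nn.Module):
    """Illustrative stand-in: [1, 480000] audio in, L2-normalized [1, 512] out."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(1000, 512)

    def forward(self, audio):
        frames = audio.reshape(1, 480, 1000).mean(dim=1)  # fake mel frontend
        return torch.nn.functional.normalize(self.proj(frames), dim=-1)

# torch.export with the fixed [1, 480000] input shape baked in
exported = torch.export.export(DummyClap().eval(), (torch.zeros(1, 480_000),))

mlmodel = ct.convert(
    exported,
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.macOS14,
)

# int8 symmetric weight-only quantization
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(mode="linear_symmetric", dtype=np.int8)
)
mlmodel = linear_quantize_weights(mlmodel, config)
mlmodel.save("ClapMusic.mlpackage")
```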