LAION-CLAP (Music) → Core ML

On-device audio-embedding model for Apple Silicon Macs. Converted from laion/larger_clap_music (HTSAT-base audio encoder + audio projection) to a self-contained Core ML .mlpackage, int8-quantized.

Used by Gridshift for sample similarity search ("find samples that sound like this kick") and, in a later phase, text-to-sample retrieval.

Input / output contract

audio:       fp32 tensor [1, 480000]   10 s mono @ 48 kHz, peak-normalized to [-1, 1]
embedding:   fp32 tensor [1, 512]      L2-normalized, cosine = dot product

Mel-spectrogram preprocessing is baked into the model graph (via convmelspec STFT), so the client performs no DSP preprocessing: just supply raw audio samples.
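Since normalization is the only client-side step, input preparation is tiny. A minimal numpy sketch of the contract above (the function name is illustrative, not part of the model):

```python
import numpy as np

def prepare_input(samples: np.ndarray) -> np.ndarray:
    """Peak-normalize 10 s of mono 48 kHz audio and shape it [1, 480000]."""
    x = samples.astype(np.float32)
    peak = np.max(np.abs(x))
    if peak > 0:
        x = x / peak          # peak-normalize to [-1, 1]; silence passes through
    return x.reshape(1, -1)   # model expects a [1, 480000] batch
```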

Accuracy vs PyTorch reference (5 synthetic signals)

signal          cos(ref, coreml)
sine 440 Hz     0.99851
sine 220 Hz     0.99746
white noise     0.99977
silence         0.99986
clipped noise   0.99977

Pairwise distance structure between signals is preserved with max drift 0.004 (threshold ≤ 0.02), so relative similarity rankings between samples remain intact through the int8 quantization.
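The drift figure amounts to an entry-wise comparison of the two 5×5 cosine-similarity matrices. A sketch of that check, assuming row-wise L2-normalized embedding matrices (names illustrative):

```python
import numpy as np

def max_pairwise_drift(ref: np.ndarray, test: np.ndarray) -> float:
    """Max absolute difference between pairwise cosine-similarity matrices.

    ref, test: [n, d] L2-normalized embeddings, so cosine = dot product.
    """
    return float(np.max(np.abs(ref @ ref.T - test @ test.T)))
```

A drift below the 0.02 threshold means any pair of signals that ranked as more similar than another pair under the PyTorch reference still does so under the Core ML model.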

Handling audio of different lengths

The Core ML graph is shape-rigid at 480,000 samples (10 s). The client is expected to preprocess:

  • ≤ 10 s: zero-pad on the right.
  • < 200 ms (short one-shots): repeat-pad to ~500 ms, then zero-pad. Prevents padding from dominating the embedding.
  • > 10 s (loops): slide three 10 s windows with 50% overlap, mean-pool the three 512-d embeddings, then re-normalize to unit length.
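The three rules above can be sketched in numpy. Window placement and the exact repeat-pad target are illustrative assumptions, not pinned down by the model (evenly spacing three windows reproduces 50% overlap for a 20 s loop; longer clips get proportionally sparser coverage):

```python
import numpy as np

SR = 48_000
TARGET = 10 * SR  # 480,000 samples, the model's fixed input length

def fit_to_window(x: np.ndarray) -> np.ndarray:
    """Pad or window-split audio to the fixed input length.

    Returns [k, 480000]: k == 1 for clips up to 10 s, k == 3 for longer loops.
    """
    x = x.astype(np.float32)
    if len(x) < SR // 5:                       # < 200 ms one-shot
        reps = int(np.ceil((SR // 2) / len(x)))
        x = np.tile(x, reps)[: SR // 2]        # repeat-pad to ~500 ms
    if len(x) <= TARGET:
        return np.pad(x, (0, TARGET - len(x)))[None, :]
    # > 10 s: three evenly spaced 10 s windows across the clip
    starts = np.linspace(0, len(x) - TARGET, 3).astype(int)
    return np.stack([x[s : s + TARGET] for s in starts])

def pool(embs: np.ndarray) -> np.ndarray:
    """Mean-pool per-window embeddings and re-normalize to unit length."""
    m = embs.mean(axis=0)
    return m / np.linalg.norm(m)
```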

License and attribution

Apache-2.0, inherited from upstream LAION-CLAP. Please cite:

Wu et al., "Large-scale Contrastive Language-Audio Pretraining with Feature
Fusion and Keyword-to-Caption Augmentation", 2022.
https://arxiv.org/abs/2211.06687

Conversion details

Conversion was done with the script at app/ml/clap/convert_to_coreml.py in the Gridshift source tree, using:

  • PyTorch 2.11 + torch.export
  • coremltools 9.0 MLProgram backend
  • int8 symmetric weight quantization
  • bicubic → bilinear interp swap for Core ML compat (minimal accuracy impact)
  • CLAP window-size patch for torch.jit.is_tracing branch divergence
  • Fixed input shape [1, 480000] baked into the graph

Target: macOS 14+.
