# VoiceCLAP-Large
Voice-text contrastive embedding model: the larger of the two anchors released with VoiceNet.

VoiceCLAP-Large is a single-tower model: a rank-16 LoRA finetune of LCO-Embedding-Omni-7B (a Qwen2.5-Omni-Thinker-7B backbone with a sentence-transformer last-token-pooling head) trained with the symmetric InfoNCE loss. The audio and text embeddings are produced by the same backbone; the modality is determined by what is fed in via the multimodal chat template.
| Property | Value |
| --- | --- |
| Architecture | single-tower Omni-Embedding (Qwen2.5-Omni-Thinker-7B + ST last-token-pool) |
| Adaptation | rank-16 LoRA (alpha 32, dropout 0.05), merged into the released weights |
| Joint embedding | 3584-d, L2-normalised |
| Loss | symmetric InfoNCE (all-gather negatives) |
| Total parameters | ~7 B (full merged model) |
| Epochs | 1 |
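For reference, the symmetric InfoNCE objective over in-batch audio-text pairs can be sketched as below. This is a minimal illustration, not the training code: the cross-device all-gather of negatives is omitted, and the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    # audio_emb, text_emb: (batch, dim), already L2-normalised,
    # with row i of each tensor forming a positive pair.
    logits = audio_emb @ text_emb.T / temperature          # cosine-similarity logits
    targets = torch.arange(len(logits), device=logits.device)
    loss_a2t = F.cross_entropy(logits, targets)            # audio -> text direction
    loss_t2a = F.cross_entropy(logits.T, targets)          # text -> audio direction
    return (loss_a2t + loss_t2a) / 2
```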
## Training data
Trained for 1 epoch on the open `voiceclap_10_safe` mixture (9 datasets) used in the VoiceNet paper:

- `emolia-balanced-5M-subset` (annotated subset of Emilia)
- `laions_got_talent_clean_with_captions`
- `majestrino-data`
- `synthetic_vocal_bursts`
- `improved_synthetic_vocal_bursts`
- `ears`
- `expresso`
- `voxceleb1`
- `voxceleb2`
All clips are captioned with MOSS-Audio-8B-Thinking-derived dense
vocal-style captions covering emotions, talking-style attributes, and
demographics.
## Standalone load example
The model uses the SentenceTransformer multimodal API; both `sentence-transformers` and `transformers` are on PyPI. The audio example below additionally uses `soundfile` to read the clip.
```python
import soundfile as sf
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("VoiceNet/voiceclap-large", trust_remote_code=True)

# Text embedding (3584-d, L2-normalised)
text_emb = model.encode(["a calm and steady voice"])

# Audio embedding: pass a dict with raw samples + sampling rate.
arr, sr = sf.read("clip.wav")
audio_emb = model.encode([{"array": arr, "sampling_rate": sr}])

# Cosine similarity (embeddings already L2-normalised)
print((audio_emb @ text_emb.T).item())
```
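Because `encode` returns L2-normalised vectors, retrieval reduces to a matrix product. Continuing from the block above, a small ranking example (the candidate captions are made up for illustration):

```python
import numpy as np

captions = [
    "a calm and steady voice",
    "an excited, fast-paced speaker",
    "a whispering, breathy delivery",
]
cap_embs = model.encode(captions)           # (3, 3584), L2-normalised
scores = (audio_emb @ cap_embs.T).ravel()   # cosine similarity per caption
best = int(np.argmax(scores))
print(f"best match: {captions[best]!r} (score {scores[best]:.3f})")
```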
For convenience, the LoRA adapter is also shipped under `adapter/` so it can be reapplied to other LCO-Embedding-Omni-7B forks; the released `model.safetensors` already has it merged in.
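A minimal sketch of reapplying the shipped adapter with PEFT, assuming the files under `adapter/` are in standard PEFT format and that the base repo loads via `AutoModel` (both are assumptions, not verified against the release):

```python
from transformers import AutoModel
from peft import PeftModel

# Load the base embedding model (trust_remote_code pulls in its custom code).
base = AutoModel.from_pretrained(
    "LCO-Embedding/LCO-Embedding-Omni-7B", trust_remote_code=True
)

# Attach the rank-16 LoRA adapter shipped in this repo's adapter/ subfolder.
model = PeftModel.from_pretrained(
    base, "VoiceNet/voiceclap-large", subfolder="adapter"
)

# Optionally bake the adapter into the weights, mirroring the released merged model.
model = model.merge_and_unload()
```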
## Citation
If you use this model, please cite the VoiceNet paper.