# NoesisLab/Collins-Embedding-3M
A 3M-parameter sentence embedding model built on 2-Universal Hash encoding + RoPE positional encoding, trained on AllNLI triplets with MultipleNegativesRankingLoss.
The core insight: replace the vocabulary embedding table (the single largest cost in any transformer) with a 2-Universal Hash function that maps token IDs into a fixed-size bucket space in O(1) time. No lookup table. No gradient-heavy embedding matrix.
Released 2026 by NoesisLab.
## Quick Start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("NoesisLab/Collins-Embedding-3M")
embeddings = model.encode(["Hello world", "Hi there"])
similarities = model.similarity(embeddings[0], embeddings[1])
```
## Architecture: Why Hashing Works

```
Token ID ──► h(x) = ((ax + b) mod p) mod B ──► bucket index
                        │
                        └─► Sign Hash: σ(x) = sign((cx + d) mod p)
                            resolves collision ambiguity during training
```
The sign hash acts as a per-token polarity signal. Under strong contrastive supervision, the model learns to disentangle hash collisions: tokens that share a bucket but carry different semantics get separated via their sign channel. A Chernoff-style concentration argument suggests the sign channel suppresses collision noise given a sufficiently strong supervision signal.
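The hash-encoding scheme above can be sketched in a few lines. Every constant below (the prime `p`, bucket count `B`, coefficients `a`, `b`, `c`, `d`, and width `D`) is an illustrative assumption, not the released model's actual value; the sign channel is realized here as the parity of a second 2-universal hash, one common concrete choice, since `sign((cx + d) mod p)` taken literally is always non-negative.

```python
import numpy as np

# Illustrative sketch of the 2-Universal Hash token encoding.
# All constants are assumptions, not the released model's values.
P = 2_147_483_647            # prime p, larger than the vocabulary
B = 512                      # bucket count, B << V
D = 128                      # embedding width (illustrative)

a, b = 1_103_515_245, 12_345     # bucket-hash coefficients
c, d = 69_069, 362_437           # sign-hash coefficients

rng = np.random.default_rng(0)
bucket_table = rng.standard_normal((B, D)).astype(np.float32)  # B x D, not V x D

def encode_token(token_id: int) -> np.ndarray:
    """O(1) arithmetic per token: no vocabulary-sized lookup table."""
    bucket = ((a * token_id + b) % P) % B                       # h(x): bucket index
    sign = 1.0 if ((c * token_id + d) % P) % 2 == 0 else -1.0   # parity sign hash
    return sign * bucket_table[bucket]

vec = encode_token(30_521)   # any token id works, even out-of-vocab
```

Collisions (distinct IDs landing in one bucket) are expected by design; the sign channel plus contrastive training is what lets the model separate them downstream.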
Time complexity vs. standard embedding:
| Operation | Standard Embedding | Collins Hash |
|---|---|---|
| Token → vector | O(1) table lookup | O(1) arithmetic |
| Memory (vocab) | O(V × d) | O(B × d), B ≪ V |
| Gradient flow | Dense, full vocab | Sparse, bucket-local |
| Cold-start | Requires pretraining | Random init viable |
With V = 30522 and B = 512, Collins uses ~60× fewer parameters for the token encoding stage alone.
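The ~60× figure follows directly from the two table sizes; a quick check (the embedding width `d = 128` is an assumed value, and it cancels out of the ratio anyway):

```python
# Back-of-envelope parameter count for the token-encoding stage.
# d = 128 is an illustrative width, not a confirmed model value.
V, B, d = 30_522, 512, 128
standard = V * d   # dense vocabulary embedding table: 3,906,816 params
collins = B * d    # fixed bucket table addressed by the hash: 65,536 params
ratio = standard / collins
print(round(ratio, 1))   # 59.6
```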
Cache efficiency: at 3M total parameters, the entire model fits in GPU L2 cache during inference. Standard MiniLM models (15-22M params) cannot, which gives Collins 1-2 orders of magnitude lower inference latency at comparable semantic accuracy.
## MTEB Benchmark Results
| Task | cosine_spearman |
|---|---|
| STS12 | 0.6038 |
| STS13 | 0.5952 |
| STS14 | 0.6186 |
| STSBenchmark | 0.7114 |
## Full Baseline Comparison
Left: STSBenchmark Spearman score. Right: score per million parameters (efficiency). White labels inside bars = parameter count.
| Model | Type | Params | STSB Spearman | Score / M params |
|---|---|---|---|---|
| GloVe (6B, 300d) | Static Embedding | ~120M | ~0.50 | 0.0042 |
| BERT-base (Mean Pool) | Contextual (no NLI FT) | 110M | ~0.50 | 0.0045 |
| Collins (Ours) | Hash + RoPE | 3M | 0.7114 | 0.237 |
| paraphrase-MiniLM-L3-v2 | Contextual | 15M | ~0.75 | 0.050 |
| BGE-micro-v2 | Contextual | 17M | ~0.76 | 0.044 |
| paraphrase-MiniLM-L6-v2 | Contextual | 22M | ~0.79 | 0.036 |
| all-mpnet-base-v2 | Contextual | 110M | ~0.83 | 0.0075 |
Collins achieves 0.237 score/M: ~5× more efficient than the next-best lightweight model (MiniLM-L3 at 0.050/M) and ~53× more efficient than BERT-base.
## Key Findings
- Cross-tier performance: at 3M params, Collins approaches 11-17M-parameter models on STSBenchmark, delivering comparable semantic fidelity with roughly 1/5 the parameters.
- Hash compression victory: MiniLM and ALBERT still carry a full vocabulary embedding table as their largest single component. Collins eliminates this entirely via 2-Universal Hashing.
- Sign hash robustness: STS12-14 scores hold at 0.60-0.62 across diverse domains (news, forums, image captions), consistent with the sign channel's resistance to interference at collision points.
- RoPE structural encoding: the gap between STSBenchmark (0.71) and STS12-14 (0.60-0.62) indicates stronger performance on well-formed, contextually balanced sentence pairs, exactly where RoPE's positional structure contributes most.
## Applications (2026)
This model is designed for deployment scenarios where memory and latency are hard constraints:
- Edge / embedded devices: Full model fits in 12MB. Suitable for on-device semantic search on mobile, IoT, and microcontrollers with ML accelerators.
- Ultra-high-throughput vector search: L2-cache residency enables millions of encode calls per second on a single GPU, making it viable as the encoder backbone for billion-scale ANN indexes (FAISS, ScaNN, Milvus).
- Real-time RAG pipelines: Sub-millisecond encoding latency unlocks synchronous retrieval in latency-sensitive LLM inference chains without a separate embedding service.
- Privacy-preserving on-device NLP: No network round-trip required. Encode and search entirely on-device for sensitive document workflows.
- Low-power inference: Power consumption scales with model size. At 3M params, Collins is viable on NPU/TPU edge chips where 100M+ models are cost-prohibitive.
## Training
- Dataset: sentence-transformers/all-nli, triplet split (557,850 samples)
- Loss: MultipleNegativesRankingLoss
- Epochs: 2, batch size: 256, lr: 2e-4 (cosine schedule), bf16
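MultipleNegativesRankingLoss treats each anchor's paired positive as the correct label among all positives in the batch. A minimal NumPy sketch of that in-batch-negatives objective (the `scale=20.0` default mirrors sentence-transformers; the real loss additionally scores the triplets' explicit hard negatives, omitted here for brevity):

```python
import numpy as np

def mnrl(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives cross-entropy: row i's target is positive i."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # (batch, batch) cosine sims
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())

# Matched pairs score near-zero loss; mismatched pairs are penalized heavily.
eye = np.eye(4)
matched, mismatched = mnrl(eye, eye), mnrl(eye, eye[[1, 2, 3, 0]])
```

Because every other example in the batch serves as a free negative, larger batch sizes (here, 256) directly sharpen the contrastive signal without extra labeling cost.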
## Citation

```bibtex
@misc{collins-embedding-3m-2026,
  title  = {Collins-Embedding-3M: O(1) Hash Encoding for Efficient Sentence Embeddings},
  author = {NoesisLab},
  year   = {2026},
  url    = {https://huggingface.co/NoesisLab/Collins-Embedding-3M}
}
```
