NoesisLab/Collins-Embedding-3M

A 3M-parameter sentence embedding model built on 2-Universal Hash encoding + RoPE positional encoding, trained on AllNLI triplets with MultipleNegativesRankingLoss.

The core insight: replace the vocabulary embedding table, often the single largest parameter block in a compact transformer, with a 2-Universal Hash function that maps token IDs into a fixed-size bucket space in O(1) time. No lookup table. No gradient-heavy embedding matrix.

Released 2026 by NoesisLab.


Quick Start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("NoesisLab/Collins-Embedding-3M")
embeddings = model.encode(["Hello world", "Hi there"])
similarities = model.similarity(embeddings[0], embeddings[1])
```

Architecture: Why Hashing Works

Token ID  ──►  h(x) = ((ax + b) mod p) mod B  ──►  bucket index
                         ↑
                  Sign Hash: φ(x) = sign((cx + d) mod p)
                  resolves collision ambiguity during training

The sign hash acts as a per-token polarity signal. Under strong contrastive supervision, the model learns to disentangle hash collisions: tokens that share a bucket but carry different semantics are separated via their sign channel. Chernoff-style concentration bounds suggest that, given sufficient supervision signal, the sign channel suppresses collision noise in expectation.
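A minimal sketch of the two hash channels in the diagram above. The constants a, b, c, d, the prime p, and the parity-based sign convention are illustrative assumptions, not the released model's actual values:

```python
p = 2_147_483_647                # a large prime (Mersenne prime 2^31 - 1)
B = 512                          # number of hash buckets, B much smaller than vocab size V
a, b, c, d = 73, 1009, 57, 2311  # fixed random constants chosen once at init

def bucket(token_id: int) -> int:
    """2-Universal Hash: token ID -> bucket index, O(1) arithmetic, no table."""
    return ((a * token_id + b) % p) % B

def sign(token_id: int) -> int:
    """Sign hash: +1/-1 polarity channel that lets the model separate
    tokens that collide in the same bucket (parity convention assumed here)."""
    return 1 if ((c * token_id + d) % p) % 2 == 0 else -1
```

Two token IDs can land in the same bucket yet carry opposite signs, which is the channel the contrastive objective exploits to pull colliding tokens apart.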

Time complexity vs. standard embedding:

| Operation | Standard Embedding | Collins Hash |
|---|---|---|
| Token → vector | O(1) table lookup | O(1) arithmetic |
| Memory (vocab) | O(V × d) | O(B × d), B ≪ V |
| Gradient flow | Dense, full vocab | Sparse, bucket-local |
| Cold-start | Requires pretraining | Random init viable |

With V = 30522 and B = 512, Collins uses ~60× fewer parameters for the token encoding stage alone.
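The arithmetic behind the ~60× figure, with an illustrative embedding width d (the ratio is independent of d):

```python
V = 30522  # vocabulary size (BERT WordPiece vocab)
B = 512    # hash bucket count
d = 384    # embedding width (illustrative; it cancels out of the ratio)

standard = V * d  # parameters in a full vocabulary embedding table
collins = B * d   # parameters in the bucket-sized table
print(standard / collins)  # 59.61328125, i.e. ~60x fewer parameters
```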

Cache efficiency: At 3M total parameters, the entire model fits in GPU L2 cache during inference. Standard MiniLM-class models (15–22M parameters) cannot, which is what gives Collins 1–2 orders of magnitude lower inference latency at equivalent semantic accuracy.
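A back-of-the-envelope footprint check for the cache claim (fp32 weights assumed; the 40 MiB L2 figure corresponds to an NVIDIA A100, and an H100 carries 50 MiB):

```python
MiB = 2 ** 20
L2_CACHE_MIB = 40  # NVIDIA A100 L2 size; H100 has 50 MiB

def footprint_mib(params: float, bytes_per_param: int = 4) -> float:
    """Weight footprint in MiB (fp32 by default)."""
    return params * bytes_per_param / MiB

print(footprint_mib(3e6))   # ~11.4 MiB: fits in L2
print(footprint_mib(22e6))  # ~83.9 MiB: spills out to HBM
```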


MTEB Benchmark Results

| Task | cosine_spearman |
|---|---|
| STS12 | 0.6038 |
| STS13 | 0.5952 |
| STS14 | 0.6186 |
| STSBenchmark | 0.7114 |

Full Baseline Comparison

[Figure: STSBenchmark score and parameter efficiency comparison. Left: STSBenchmark Spearman score. Right: score per million parameters (efficiency). White labels inside bars give parameter counts.]

| Model | Type | Params | STSB Spearman | Score / M params |
|---|---|---|---|---|
| GloVe (6B, 300d) | Static Embedding | ~120M | ~0.50 | 0.0042 |
| BERT-base (Mean Pool) | Contextual (no NLI FT) | 110M | ~0.50 | 0.0045 |
| Collins (Ours) | Hash + RoPE | 3M | 0.7114 | 0.237 |
| paraphrase-MiniLM-L3-v2 | Contextual | 15M | ~0.75 | 0.050 |
| BGE-micro-v2 | Contextual | 17M | ~0.76 | 0.044 |
| paraphrase-MiniLM-L6-v2 | Contextual | 22M | ~0.79 | 0.036 |
| all-mpnet-base-v2 | Contextual | 110M | ~0.83 | 0.0075 |

Collins achieves 0.237 score per million parameters: roughly 5× more efficient than the next best lightweight model (MiniLM-L3 at 0.050/M) and 53× more efficient than BERT-base.

Key Findings

  • Cross-tier performance: At 3M params, Collins approaches 15–17M parameter models on STSBenchmark, roughly 1/5 the parameters for comparable semantic fidelity.
  • Hash compression victory: MiniLM and ALBERT still carry a full vocabulary embedding table as their largest single component. Collins eliminates this entirely via 2-Universal Hashing.
  • Sign hash robustness: STS12–14 scores hold at 0.60–0.62 across diverse domains (news, forums, image captions), confirming differential interference resistance at collision points.
  • RoPE structural encoding: The gap between STSBenchmark (0.71) and STS12–14 (0.60–0.62) indicates stronger performance on well-formed, contextually balanced sentence pairs, exactly where RoPE's relative-position structure contributes most.

Applications (2026)

This model is designed for deployment scenarios where memory and latency are hard constraints:

  • Edge / embedded devices: Full model fits in 12MB. Suitable for on-device semantic search on mobile, IoT, and microcontrollers with ML accelerators.
  • Ultra-high-throughput vector search: L2-cache residency enables millions of encode calls per second on a single GPU, making it viable as the encoder backbone for billion-scale ANN indexes (FAISS, ScaNN, Milvus).
  • Real-time RAG pipelines: Sub-millisecond encoding latency unlocks synchronous retrieval in latency-sensitive LLM inference chains without a separate embedding service.
  • Privacy-preserving on-device NLP: No network round-trip required. Encode and search entirely on-device for sensitive document workflows.
  • Low-power inference: Power consumption scales with model size. At 3M params, Collins is viable on NPU/TPU edge chips where 100M+ models are cost-prohibitive.
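To make the on-device search scenario concrete, here is a dependency-free brute-force cosine search over precomputed embeddings. The toy 3-d vectors stand in for real `model.encode(...)` outputs; a production deployment would use an ANN index (FAISS, ScaNN, Milvus) instead of the linear scan shown:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def search(query_vec, corpus_vecs, top_k=2):
    """Return indices of the top_k corpus vectors most similar to the query."""
    ranked = sorted(range(len(corpus_vecs)),
                    key=lambda i: cosine(query_vec, corpus_vecs[i]),
                    reverse=True)
    return ranked[:top_k]

# Toy 3-d embeddings standing in for model.encode(...) output.
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(search([1.0, 0.05, 0.0], corpus))  # [0, 1]
```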

Training

  • Dataset: sentence-transformers/all-nli, triplet split (557,850 samples)
  • Loss: MultipleNegativesRankingLoss
  • Epochs: 2, batch size: 256, lr: 2e-4 (cosine schedule), bf16
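For intuition, a pure-Python sketch of what MultipleNegativesRankingLoss computes: an in-batch softmax cross-entropy where each anchor must rank its own positive above every other positive in the batch. The scale of 20 mirrors the sentence-transformers default; this is an illustration of the loss semantics, not the actual training code:

```python
import math

def mnrl(anchors, positives, scale=20.0):
    """Mean of -log softmax(scale * cos_sim) over in-batch negatives:
    positives[j] for j != i serve as negatives for anchors[i]."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos(anchor, p) for p in positives]
        log_z = math.log(sum(math.exp(l) for l in logits))
        total += log_z - logits[i]  # -log P(correct pair | batch)
    return total / len(anchors)
```

When every anchor is closest to its own positive the loss approaches zero; mismatched pairs drive it toward `scale` per example, which is the gradient signal that separates colliding hash buckets.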

Citation

```bibtex
@misc{collins-embedding-3m-2026,
  title  = {Collins-Embedding-3M: O(1) Hash Encoding for Efficient Sentence Embeddings},
  author = {NoesisLab},
  year   = {2026},
  url    = {https://huggingface.co/NoesisLab/Collins-Embedding-3M}
}
```