Embedding Model Datasets Collection • A curated subset of the datasets that work out of the box with Sentence Transformers: https://huggingface.co/datasets?other=sentence-transformers • 70 items • Updated Dec 10, 2025 • 173
SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia Paper • 2606.03027 • Published 3 days ago • 1
GrepSeek: Training Search Agents for Direct Corpus Interaction Paper • 2605.29307 • Published 9 days ago • 102
MiniCPM RAG Suite Collection • Embedding, re-ranking, generation -- the cornerstone of RAG. • 7 items • Updated 12 days ago • 18
MMTEB: Massive Multilingual Text Embedding Benchmark Paper • 2502.13595 • Published Feb 19, 2025 • 49
ToolOmni: Enabling Open-World Tool Use via Agentic learning with Proactive Retrieval and Grounded Execution Paper • 2604.13787 • Published Apr 15 • 2