---
title: FastEmbed EN Embeddings
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
---

# FastEmbed Code Embeddings Server

A CPU-optimized embedding server built on FastEmbed with quantized ONNX models.

## Models

- **Dense**: BAAI/bge-base-en-v1.5 (768 dim)
- **Sparse**: Qdrant/bm25 (BM25, 0.01 GB)
- **Reranker**: jinaai/jina-reranker-v1-turbo-en (0.13 GB)

Total: ~0.78 GB, which fits easily in CPU Basic (2 vCPU, 16 GB RAM).
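
For a local sanity check, the same three models can be loaded with the `fastembed` Python package (which this server builds on). This is a minimal sketch, not the Space's server code; it assumes a recent fastembed release where the reranker class lives under `fastembed.rerank.cross_encoder`, and each model is downloaded to the local cache on first use.

```python
from fastembed import SparseTextEmbedding, TextEmbedding
from fastembed.rerank.cross_encoder import TextCrossEncoder  # recent fastembed releases

# Each constructor downloads the quantized ONNX model on first use and caches it on disk.
dense = TextEmbedding(model_name="BAAI/bge-base-en-v1.5")                   # 768-dim vectors
sparse = SparseTextEmbedding(model_name="Qdrant/bm25")                      # BM25 sparse vectors
reranker = TextCrossEncoder(model_name="jinaai/jina-reranker-v1-turbo-en")  # cross-encoder scores

vectors = list(dense.embed(["def hello(): pass"]))
print(len(vectors[0]))  # 768
```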

## API Endpoints

### Dense Embeddings

```bash
curl -X POST https://YOUR_SPACE.hf.space/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["def hello(): pass", "class Foo: ..."], "model": "code-embed"}'
```

### Sparse BM25 Embeddings

```bash
curl -X POST https://YOUR_SPACE.hf.space/v1/sparse/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["search query", "document text"]}'
```

### Hybrid Search Embeddings

```bash
curl -X POST https://YOUR_SPACE.hf.space/v1/hybrid/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["code snippet"]}'
```

### Reranking

```bash
curl -X POST https://YOUR_SPACE.hf.space/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{"query": "python async function", "documents": ["doc1", "doc2", "doc3"]}'
```

## Features

- **ONNX Runtime**: optimized CPU inference, no PyTorch overhead
- **Model Caching**: models loaded once, reused across requests (see the sketch after this list)
- **Hybrid Search**: dense + sparse (BM25) for better retrieval
- **Code-Optimized**: jina-embeddings-v2-base-code specifically trained for code
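
A minimal illustration of the caching idea (hypothetical helper, not the Space's actual code): keeping one long-lived model instance per process means the ONNX weights are loaded from disk only once.

```python
from functools import lru_cache

from fastembed import TextEmbedding


@lru_cache(maxsize=None)
def get_dense_model(name: str = "BAAI/bge-base-en-v1.5") -> TextEmbedding:
    # Built on the first request for this model name, then reused for every later call.
    return TextEmbedding(model_name=name)
```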

## Performance

Compared to PyTorch-based SentenceTransformers:

- 5-10x faster on CPU
- 5x smaller model footprint
- Lower latency: ONNX quantization + caching
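
Actual numbers depend on your CPU and batch size; a quick way to measure throughput on your own hardware is the sketch below, which uses fastembed directly with a hypothetical batch of 64 snippets.

```python
import time

from fastembed import TextEmbedding

model = TextEmbedding(model_name="BAAI/bge-base-en-v1.5")
docs = ["def hello(): pass"] * 64  # hypothetical timing batch

start = time.perf_counter()
_ = list(model.embed(docs))
elapsed = time.perf_counter() - start
print(f"{elapsed / len(docs) * 1000:.1f} ms per document on this CPU")
```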