
aurekai/semantic-cache-bench

Semantic caching benchmarks and performance suite for Aurekai. Validates cache consistency, hit rates, and query latency across different model architectures and corpus sizes.

Overview

Semantic caching is a core optimization in Aurekai that reuses cached results for semantically similar queries, without requiring exact string matches. This repository hosts:

  • Benchmark Datasets: Query corpora with semantic similarity annotations
  • Evaluation Scripts: Performance measurement and validation tools
  • Results: Baseline metrics across different models and cache configurations
  • Methodology: Detailed documentation of benchmark setup and evaluation protocols

Quick Start

# Download benchmark suite
git clone https://huggingface.co/aurekai/semantic-cache-bench
cd semantic-cache-bench

# Run quick benchmark
akai semantic-cache:bench \
  --dataset queries-10k.jsonl \
  --model qwen3-8b \
  --cache-size 1GB \
  --output results.json

# Compare results
akai semantic-cache:compare \
  --baseline baseline-results.json \
  --current results.json

Benchmark Datasets

queries-1k (Minimal Validation)

  • Size: 1,024 queries
  • Purpose: Quick validation of cache functionality
  • Format: JSONL with semantic similarity pairs
  • Runtime: ~5 minutes on GPU

Schema:

{
  "id": "q_001_234",
  "query": "What are the benefits of renewable energy?",
  "semantic_variations": [
    "Advantages of wind and solar power",
    "Why should we invest in renewables?"
  ],
  "dissimilar_queries": [
    "How do fossil fuels work?"
  ],
  "expected_cache_hit": true,
  "similarity_threshold": 0.87
}
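
Records in this schema can be loaded and field-checked with a short script. The loader below is illustrative, not part of the benchmark suite; the field names come from the schema above:

```python
import json

REQUIRED_FIELDS = {"id", "query", "semantic_variations",
                   "dissimilar_queries", "expected_cache_hit",
                   "similarity_threshold"}

def load_queries(path):
    """Yield schema-checked records from a benchmark JSONL file."""
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            record = json.loads(line)
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            yield record
```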

queries-10k (Standard Benchmark)

  • Size: 10,240 queries
  • Purpose: Standard performance baseline
  • Corpus: Diverse knowledge domains and query patterns
  • Expected cache hit rate: 68-72%
  • Runtime: ~45 minutes on GPU

queries-100k (Comprehensive)

  • Size: 102,400 queries
  • Purpose: Large-scale cache behavior validation
  • Corpus: Realistic production query distribution
  • Expected cache hit rate: 72-76%
  • Runtime: ~8 hours on high-end GPU
  • Disk space: 15 GB decompressed

Metrics & Evaluation

Cache Performance

| Metric          | Qwen3-8B | LLaMA3-8B | Meaning                         |
|-----------------|----------|-----------|---------------------------------|
| Hit Rate        | 71.2%    | 69.8%     | % of queries found in cache     |
| False Positives | 0.3%     | 0.4%      | Incorrect cache matches         |
| False Negatives | 2.1%     | 2.4%      | Missed cache opportunities      |
| Recall @ 0.90   | 94.2%    | 92.8%     | True positives at high threshold|
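
Given per-query outcomes, the rates in this table can be computed as follows. The `(was_hit, should_hit)` labeling is an illustrative simplification of the evaluation output, not the suite's internal format:

```python
def cache_metrics(outcomes):
    """Compute hit/error rates from (was_hit, should_hit) pairs per query.

    was_hit:    the cache returned a result for this query
    should_hit: ground truth says a semantically equivalent query was cached
    """
    n = len(outcomes)
    hits = sum(1 for h, _ in outcomes if h)
    fp = sum(1 for h, s in outcomes if h and not s)   # incorrect cache matches
    fn = sum(1 for h, s in outcomes if not h and s)   # missed opportunities
    return {"hit_rate": hits / n,
            "false_positive_rate": fp / n,
            "false_negative_rate": fn / n}
```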

Latency Improvement

Cache Miss:    125 ms (full inference)
Cache Hit:     2 ms (embedding lookup + cache retrieval)
Speedup:       62.5x

Average (71% hit rate): 125*0.29 + 2*0.71 ≈ 38 ms
Effective speedup:      3.3x vs. no cache
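
The expected-latency arithmetic above generalizes to any hit rate; the 2 ms / 125 ms defaults below are the measured figures from this section:

```python
def effective_latency(hit_rate, hit_ms=2.0, miss_ms=125.0):
    """Expected per-query latency for a given cache hit rate."""
    return miss_ms * (1 - hit_rate) + hit_ms * hit_rate

avg = effective_latency(0.71)      # ~37.7 ms, reported above as ~38 ms
speedup = 125.0 / avg              # ~3.3x vs. no cache
```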

Memory Efficiency

  • Embedding cache size: ~800 MB for 100K queries
  • Memory per cached embedding: ~8 KB
  • Compression ratio: 1.4x with optional zstd compression
  • Peak memory during benchmark: 4 GB (with batch size 32)

Running Benchmarks

Standard Evaluation

# Benchmark specific model
akai semantic-cache:bench \
  --model qwen3-8b \
  --dataset queries-10k.jsonl \
  --batch-size 32 \
  --cache-size 2GB \
  --output results.json

# With logging
akai semantic-cache:bench \
  --model qwen3-8b \
  --dataset queries-10k.jsonl \
  --cache-size 2GB \
  --verbose \
  --log-interval 100 \
  --output results.json

Comparison Between Models

# Run on multiple models
for model in qwen3-8b llama3-8b; do
  akai semantic-cache:bench \
    --model $model \
    --dataset queries-10k.jsonl \
    --output results-$model.json
done

# Compare results
akai semantic-cache:compare \
  --results results-qwen3-8b.json results-llama3-8b.json \
  --report comparison-report.md

Validation with Custom Threshold

# Test different similarity thresholds
akai semantic-cache:threshold-sweep \
  --model qwen3-8b \
  --dataset queries-10k.jsonl \
  --thresholds "0.80,0.85,0.90,0.95" \
  --output threshold-sweep.json

Benchmark Results

Latest Results (Aurekai v0.8.0-alpha.1)

Hardware: NVIDIA H100 80GB, AMD EPYC 9654
Date: 2026-05-02

| Model     | Dataset | Hit Rate | Recall @ 0.90 | Latency (hit) | Latency (miss) |
|-----------|---------|----------|---------------|---------------|----------------|
| Qwen3-8B  | 1K      | 72.3%    | 94.1%         | 1.8 ms        | 124 ms         |
| Qwen3-8B  | 10K     | 71.2%    | 93.8%         | 1.9 ms        | 126 ms         |
| LLaMA3-8B | 1K      | 70.1%    | 92.4%         | 2.1 ms        | 127 ms         |
| LLaMA3-8B | 10K     | 69.8%    | 92.1%         | 2.2 ms        | 129 ms         |

See results/ for detailed breakdowns by domain and query type.

Implementation Notes

Cache Configuration

{
  "semantic_cache": {
    "enabled": true,
    "similarity_threshold": 0.88,
    "max_cache_size": "2GB",
    "eviction_policy": "lru",
    "embedding_model": "qwen3-8b",
    "batch_size": 32,
    "use_mmap": true
  }
}
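
A small helper for interpreting the human-readable max_cache_size value. This parser is illustrative only, not part of the Aurekai API:

```python
def parse_size(value):
    """Convert a human-readable size such as "2GB" to bytes (binary units)."""
    units = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
    v = value.strip().upper()
    for suffix, mult in units.items():
        if v.endswith(suffix):
            return int(float(v[:-2]) * mult)
    return int(v)  # bare byte count
```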

Threshold Selection

  • Aggressive (0.80): High hit rate (75%+), more false positives
  • Balanced (0.88): Recommended default, 71% hit rate, minimal false positives
  • Conservative (0.95): Very few false positives, lower hit rate
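
The lookup rule these thresholds govern can be sketched in a few lines. This uses plain-Python cosine similarity over a flat list; the real cache uses the configured embedding model and an indexed store:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup(cache, query_emb, threshold=0.88):
    """Return the best cached value clearing the threshold, or None (a miss).

    cache: list of (embedding, value) pairs.
    """
    best, best_sim = None, threshold
    for emb, value in cache:
        sim = cosine(query_emb, emb)
        if sim >= best_sim:
            best, best_sim = value, sim
    return best
```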

Methodology

Query Similarity Annotation

Each benchmark dataset includes human-validated semantic similarity annotations:

  1. Query pairs sampled from corpus
  2. Annotators rate similarity (0-1)
  3. Disagreements resolved with third annotator
  4. Inter-rater reliability: Krippendorff's α = 0.89

Cache Consistency Validation

All cache results validated against ground truth:

# For each cached result:
1. Verify embedding matches original query
2. Re-rank all cached results for current query  
3. Confirm top match was indeed in cache
4. Validate latency was significantly improved
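
The four steps above can be sketched as follows; `embed` and `rerank` are hypothetical callables standing in for the cache internals, and the entry layout is illustrative:

```python
def validate_entry(query, cached, embed, rerank, tol=1e-6):
    """Sketch of the four-step consistency check for one cached result."""
    # 1. Verify the stored embedding still matches a fresh embedding
    fresh = embed(query)
    if max(abs(a - b) for a, b in zip(fresh, cached["embedding"])) > tol:
        return False
    # 2-3. Re-rank the cached candidates and confirm the stored answer wins
    if rerank(query, cached["candidates"]) != cached["answer"]:
        return False
    # 4. Confirm the hit path was actually faster than a full miss
    return cached["hit_ms"] < cached["miss_ms"]
```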

Contributing Results

To contribute benchmark results:

  1. Run benchmark suite on your hardware
  2. Include system specs (GPU, CPU, memory, disk)
  3. Report all metrics from evaluation output
  4. Submit results via PR with hardware metadata

Result file format:

{
  "metadata": {
    "hardware": "NVIDIA H100, 512GB RAM",
    "date": "2026-05-02",
    "aurekai_version": "0.8.0-alpha.1"
  },
  "results": [
    {
      "model": "qwen3-8b",
      "dataset": "queries-10k",
      "hit_rate": 0.712,
      "recall_at_0_90": 0.938
    }
  ]
}

Tools & Scripts

  • akai semantic-cache:bench: Run full benchmark suite
  • akai semantic-cache:compare: Compare benchmark results
  • akai semantic-cache:threshold-sweep: Test different thresholds
  • benchmark_to_csv.py: Export results to CSV format
  • visualize_results.py: Generate performance plots

Citation

If you reference these benchmarks in research:

@dataset{aurekai_semantic_cache_bench_2026,
  title={Aurekai Semantic Cache Benchmarks},
  author={Aurekai Community},
  year={2026},
  url={https://huggingface.co/aurekai/semantic-cache-bench}
}

License

Licensed under the Aurekai Open Source License. See main repository for details.
