Upload README.md with huggingface_hub
README.md CHANGED
@@ -37,6 +37,30 @@ the gather step uses dense-style tensor expansion. Compute-reduction
 numbers below are *algorithmic scoring reductions, not measured wall-clock
 speedups.*
 
+## Relation to RetrievalAttention
+
+RetrievalAttention (Liu et al., 2024) shows that **vanilla ANN over the
+model's native Q, K vectors fails** because Q and K live in mismatched
+distributions: they were never trained to be each other's nearest
+neighbors, only to score via dot product. Their fix is at *index time*:
+an attention-aware graph construction (RoarGraph-style).
+
+This work attacks the same problem from the opposite direction. We
+**train a tiny shared projection** (`W_Qs, W_Ks → R^64`) so that
+`q_search` and `k_search` live in the same distribution by construction.
+Off-the-shelf FAISS HNSW with default parameters then suffices.
+
+| Approach | Search space | Index | Trainable |
+|---|---|---|---|
+| Raw Q/K + vanilla ANN | original Q/K | off-the-shelf | no (fails: Q/K out of distribution) |
+| RetrievalAttention | original Q/K | attention-aware graph | no |
+| **This work** | **learned Q\_s / K\_s** | **off-the-shelf** | **yes (~2-11M params)** |
+
+Contribution: *eliminate Q/K mismatch before indexing, via distillation,
+instead of patching it inside the index.* The clean validating experiment
+(vanilla FAISS over raw Q/K vs. learned Q\_s/K\_s vs. exact teacher top-K)
+is the next planned run.
+
 ## What's in this repo
 
 Per-layer linear search projections `(W_Qs, W_Ks)` of shape `[2560, 64]`,
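The added section describes projecting queries and keys into a shared 64-dimensional search space and then comparing retrieval against exact teacher top-K. A minimal NumPy sketch of that flow follows. It is an illustration under stated assumptions, not the repo's code: the weights and inputs are random stand-ins, the input width of 2560 is taken from the `[2560, 64]` shape above, inner-product scoring in the search space is assumed, and a brute-force search stands in for the FAISS HNSW index. The helper names `project`, `top_k`, and `recall_at_k` are hypothetical.

```python
import numpy as np


def project(x, w):
    """Map vectors of shape [n, d_in] into the search space [n, d_search]."""
    return x @ w


def top_k(scores, k):
    """Indices of the k largest scores per row, sorted by descending score."""
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    order = np.argsort(-np.take_along_axis(scores, part, axis=1), axis=1)
    return np.take_along_axis(part, order, axis=1)


def recall_at_k(candidate, reference):
    """Mean fraction of reference indices that also appear in candidate."""
    hits = [
        len(set(c) & set(r)) / len(r)
        for c, r in zip(candidate.tolist(), reference.tolist())
    ]
    return float(np.mean(hits))


rng = np.random.default_rng(0)
d_in, d_search, n_keys, n_queries, k = 2560, 64, 512, 8, 16

# Random stand-ins (assumption): real inputs and W_Qs / W_Ks would come
# from the model and the trained checkpoint.
x_q = rng.standard_normal((n_queries, d_in)).astype(np.float32)
x_k = rng.standard_normal((n_keys, d_in)).astype(np.float32)
w_qs = rng.standard_normal((d_in, d_search)).astype(np.float32)
w_ks = rng.standard_normal((d_in, d_search)).astype(np.float32)

q_search = project(x_q, w_qs)  # [n_queries, 64]
k_search = project(x_k, w_ks)  # [n_keys, 64]

# Brute-force inner-product search in the 64-d space. An ANN index such as
# FAISS HNSW would replace this step; inner product is an assumption here.
retrieved = top_k(q_search @ k_search.T, k)

# Hypothetical teacher scores; in practice these would be exact attention scores.
teacher_scores = rng.standard_normal((n_queries, n_keys)).astype(np.float32)
teacher_top = top_k(teacher_scores, k)

recall = recall_at_k(retrieved, teacher_top)
```

With random weights the recall is only a chance-level placeholder. The comparison becomes meaningful once `w_qs`, `w_ks`, and the teacher scores come from the trained projections and the real model.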