datasysdev/ann-sparseattention
@@ -37,6 +37,30 @@ the gather step uses dense-style tensor expansion. Compute-reduction
 numbers below are *algorithmic scoring reductions, not measured wall-clock
 speedups.*
+## Relation to RetrievalAttention
+RetrievalAttention (Liu et al., 2024) shows that **vanilla ANN over the
+model's native Q, K vectors fails** because Q and K live in mismatched
+distributions — they were never trained to be each other's nearest
+neighbors, only to score via dot product. Their fix is at *index time*:
+an attention-aware graph construction (RoarGraph-style).
+This work attacks the same problem from the opposite direction. We
+**train a small pair of linear projections** (`W_Qs`, `W_Ks`, both mapping
+into a shared `R^64` search space) so that `q_search` and `k_search` live
+in the same distribution by construction.
+Off-the-shelf FAISS HNSW with default parameters then suffices.
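The retrieval path described above can be sketched in dependency-free Python. This is a toy illustration under stated assumptions, not the repo's implementation: the sizes are shrunk (8 to 4 instead of `[2560, 64]`), the projection weights are random rather than trained, an exact scan stands in for the FAISS HNSW index, and squared L2 distance is assumed as the search metric. `project` and `top_k` are hypothetical helper names.

```python
import random

def project(x, W):
    """Multiply a row vector x (length d_in) by a matrix W (d_in x d_out)."""
    d_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(d_out)]

def top_k(query, keys, k):
    """Indices of the k keys nearest to query by squared L2 distance (exact scan)."""
    def dist(key):
        return sum((a - b) ** 2 for a, b in zip(query, key))
    order = sorted(range(len(keys)), key=lambda i: dist(keys[i]))
    return order[:k]

rng = random.Random(0)
D_MODEL, D_SEARCH, N_KEYS, K = 8, 4, 32, 5  # toy sizes; the repo uses [2560, 64]

# Random stand-ins for the trained projections W_Qs and W_Ks (assumption).
W_Qs = [[rng.gauss(0, 1) for _ in range(D_SEARCH)] for _ in range(D_MODEL)]
W_Ks = [[rng.gauss(0, 1) for _ in range(D_SEARCH)] for _ in range(D_MODEL)]

# Hypothetical per-token input features for the keys and for one query.
x_keys = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_KEYS)]
x_query = [rng.gauss(0, 1) for _ in range(D_MODEL)]

# Keys are projected once and indexed; each query is projected at lookup time.
k_search = [project(x, W_Ks) for x in x_keys]
q_search = project(x_query, W_Qs)

candidates = top_k(q_search, k_search, K)
print(candidates)
```

In a real run, the exact scan in `top_k` would be replaced by an ANN index built over `k_search`; only the low-dimensional search vectors ever reach the index.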
+| Approach | Search space | Index | Trainable |
+|---|---|---|---|
+| Raw Q/K + vanilla ANN | original Q/K | off-the-shelf | no; fails (Q/K OOD) |
+| RetrievalAttention | original Q/K | attention-aware graph | no |
+| **This work** | **learned Q\_s / K\_s** | **off-the-shelf** | **yes (~2-11M params)** |
+Contribution: *remove the Q/K mismatch upstream of the index, by distilling
+a shared search space, instead of compensating for it inside the index.*
+The clean validating experiment (vanilla FAISS over raw Q/K vs. learned
+Q\_s/K\_s vs. exact teacher top-K) is the next planned run.
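The planned comparison comes down to a recall measurement: for each query, what fraction of the exact teacher top-K keys does a given retriever return? A minimal sketch follows, assuming the teacher top-K is the K keys with the largest `q·k` dot product. `teacher_top_k` and `recall_at_k` are illustrative names, and the `retrieved` list is a made-up stand-in for an ANN index's output.

```python
def teacher_top_k(q, keys, k):
    """Exact top-k key indices by dot-product score (assumed teacher definition)."""
    scores = [sum(a * b for a, b in zip(q, key)) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

def recall_at_k(retrieved, reference):
    """Fraction of the reference set that also appears in the retrieved set."""
    reference = set(reference)
    return len(reference & set(retrieved)) / len(reference)

# Toy 2-D example: one query, four keys.
q = [1.0, 0.0]
keys = [[3.0, 0.0], [2.0, 5.0], [-1.0, 0.0], [0.5, 0.5]]

reference = teacher_top_k(q, keys, k=2)   # dot products: 3.0, 2.0, -1.0, 0.5
retrieved = [0, 3]                        # hypothetical output of an ANN index

print(reference)                          # [0, 1]
print(recall_at_k(retrieved, reference))  # 0.5
```

Averaging `recall_at_k` over queries, once with candidates retrieved from raw Q/K and once from the learned Q\_s/K\_s, would give the two numbers the experiment is meant to compare.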
 ## What's in this repo
 Per-layer linear search projections `(W_Qs, W_Ks)` of shape `[2560, 64]`,