Transformers
GGUF
sparse-attention
approximate-nearest-neighbors
faiss
qwen3
long-context
conversational
Instructions for using datasysdev/ann-sparseattention with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use datasysdev/ann-sparseattention with Transformers:
```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("datasysdev/ann-sparseattention", dtype="auto")
```
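`AutoModel` loads the base network without a generation head. For chat-style use, here is a minimal generation sketch, assuming the repo resolves to a Qwen3-style causal LM with a chat template (an assumption suggested by the GGUF filename, not something this card states):

```python
# Hedged sketch, not from the model card: assumes the repo loads as a
# Qwen3-style causal LM with a chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "datasysdev/ann-sparseattention"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

# Illustrative prompt; any chat-format message list works.
messages = [{"role": "user", "content": "What is approximate nearest-neighbor search?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```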
- llama-cpp-python
How to use datasysdev/ann-sparseattention with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="datasysdev/ann-sparseattention",
    filename="gguf/Qwen3-4B-Instruct-2507-F16-ann-6layer-k128-v2.gguf",
)
```
```python
# The card defines no input example for this task; `messages` must be a list
# of chat-format dicts, so any placeholder prompt works:
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Hello! What can you do?"}
    ]
)
```
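`create_chat_completion` returns an OpenAI-style completion dict; a short sketch of reading the reply from llama-cpp-python's standard return shape:

```python
# Print the assistant's reply from the OpenAI-style response dict.
print(response["choices"][0]["message"]["content"])
```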
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use datasysdev/ann-sparseattention with llama.cpp:
Install from brew
```sh
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf datasysdev/ann-sparseattention:F16

# Run inference directly in the terminal:
llama-cli -hf datasysdev/ann-sparseattention:F16
```
Install from WinGet (Windows)
```sh
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf datasysdev/ann-sparseattention:F16

# Run inference directly in the terminal:
llama-cli -hf datasysdev/ann-sparseattention:F16
```
Use pre-built binary
```sh
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf datasysdev/ann-sparseattention:F16

# Run inference directly in the terminal:
./llama-cli -hf datasysdev/ann-sparseattention:F16
```
Build from source code
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf datasysdev/ann-sparseattention:F16

# Run inference directly in the terminal:
./build/bin/llama-cli -hf datasysdev/ann-sparseattention:F16
```
Use Docker
```sh
docker model run hf.co/datasysdev/ann-sparseattention:F16
```
- LM Studio
- Jan
- Ollama
How to use datasysdev/ann-sparseattention with Ollama:
```sh
ollama run hf.co/datasysdev/ann-sparseattention:F16
```
- Unsloth Studio
How to use datasysdev/ann-sparseattention with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```sh
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for datasysdev/ann-sparseattention to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for datasysdev/ann-sparseattention to start chatting
```
Using HuggingFace Spaces for Unsloth
```sh
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for datasysdev/ann-sparseattention to start chatting
```
- Pi
How to use datasysdev/ann-sparseattention with Pi:
Start the llama.cpp server
```sh
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf datasysdev/ann-sparseattention:F16
```
Configure the model in Pi
```sh
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to `~/.pi/agent/models.json`:
```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "datasysdev/ann-sparseattention:F16" }
      ]
    }
  }
}
```
Run Pi
```sh
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use datasysdev/ann-sparseattention with Hermes Agent:
Start the llama.cpp server
```sh
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf datasysdev/ann-sparseattention:F16
```
Configure Hermes
```sh
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default datasysdev/ann-sparseattention:F16
```
Run Hermes
```sh
hermes
```
- Docker Model Runner
How to use datasysdev/ann-sparseattention with Docker Model Runner:
```sh
docker model run hf.co/datasysdev/ann-sparseattention:F16
```
- Lemonade
How to use datasysdev/ann-sparseattention with Lemonade:
Pull the model
```sh
# Download Lemonade from https://lemonade-server.ai/
lemonade pull datasysdev/ann-sparseattention:F16
```
Run and chat with the model
```sh
lemonade run user.ann-sparseattention-F16
```
List all available models
```sh
lemonade list
```
Update README.md

README.md CHANGED
```diff
@@ -46,8 +46,11 @@ Broad-layer experiments:
 - All-36 step 500: recall@K=0.816, PPL gap +3.23%.
 - All-36 step 750 regressed to +3.96% despite stable recall.
 - Per-layer mass@K identified L00/L01/L02 as the weak early layers.
+- The all32 reserved-edge run reserves full attention on `[0, 1, 2, 35]` and
+  trains layers `3..34`. Final step 1000: recall@K=0.825, +1.746% PPL gap in
+  training eval, and 20.97M trained search-projection parameters. The exact
+  K-sweep gives +0.590% PPL gap at K=128 and -0.062% at K=256 on a small clean
+  block-causal slice.
 
 ## Important results
 
@@ -98,7 +101,7 @@ This validates that the learned search vectors are compatible with
 off-the-shelf ANN. It is not a wall-clock result: the prototype uses CPU FAISS
 and per-forward index construction.
 
-### All-36
+### All-36 and all32 broad-layer results
 
 | Step | Recall@K eval | PPL gap |
 |---:|---:|---:|
@@ -129,15 +132,53 @@ Per-layer step-500 mass@K at K=128:
 | L35 | 0.980 | 0.967 | -0.013 |
 | avg | 0.966 | 0.960 | -0.006 |
 
+This diagnostic motivated reserving `[0, 1, 2, 35]` as full-attention layers and
+training only layers `3..34`.
 
+Final all32 reserved-edge training trajectory:
 
 | Step | Recall@K eval | PPL gap | Read |
 |---:|---:|---:|---|
+| 250 | 0.812 | +2.283% | already better than all36 best training eval |
+| 500 | 0.823 | +1.753% | converged to final quality band |
+| 750 | 0.825 | +1.943% | small eval fluctuation |
+| 1000 | 0.825 | +1.746% | final checkpoint; essentially tied with step 500 |
+
+The all32 checkpoint is the current broad-substitution result. It is not
+full-attention parity at K=128 in training eval, but it reduces the all36
+quality cost while still substituting 32 of 36 layers. Post-hoc
+`compare_retrieval` on step 1000 shows learned retrieval matches raw-QK mass on
+the substituted layers: at K=128, learned mass is 0.971 vs raw-QK 0.969; at
+K=256, learned mass is 0.993 vs raw-QK 0.994.
+
+Exact K-sweep on the final all32 checkpoint, 2-batch clean block-causal slice
+(`PPL_full = 20.5349`):
+
+| K | mass@K | Recall@K | sparse PPL | PPL gap |
+|---:|---:|---:|---:|---:|
+| 16 | 0.546 | 0.518 | 24.86 | +21.064% |
+| 32 | 0.627 | 0.572 | 21.85 | +6.422% |
+| 64 | 0.722 | 0.652 | 20.94 | +1.974% |
+| 128 | 0.807 | 0.746 | 20.66 | +0.590% |
+| 256 | 0.902 | 0.876 | 20.52 | -0.062% |
+
+K=512 is intentionally omitted from this table. The current script produced a
+valid sparse-attention PPL line for K=512 but zero mass/recall, which is an
+edge-case bug in the metric path when K exceeds the number of valid causal keys
+for most same-segment queries. It should be rerun after fixing the metric
+handling; the publishable sweep for now is K <= 256.
+
+Coverage now looks like a real deployment knob:
+
+| Configuration | Layers substituted | Coverage | PPL gap | Read |
+|---|---:|---:|---:|---|
+| Clean six-layer pilot | 6/36 | 17% | +0.07% at K=128 | quality-preserving pilot |
+| all32 reserved-edge | 32/36 | 89% | +1.746% train eval; +0.590% exact sweep | near-parity broad substitution |
+| all36 | 36/36 | 100% | +3.23% best observed | full substitution costs quality |
+
+This is not yet enough to claim an optimal coverage ratio, but it suggests the
+best deployment point is intermediate rather than "sparsify everything." A
+12/18/20-layer coverage sweep is the next clean experiment.
 
 ## Positioning against related methods
 
@@ -156,9 +197,10 @@ closest in practical baseline behavior to Quest.
 | This work | trained low-dim retrieval | yes | yes | O(N log N) | over retrieved set |
 
 This is a design-positioning table, not a claim of completed production
-superiority. The clean result proves the approach for the six-layer pilot
+superiority. The clean result proves the approach for the six-layer pilot, and
+the all32 reserved-edge run shows broad substitution can get close to parity
+when weak edge layers remain full attention. All36 full substitution is still
+not parity.
 
 This method targets a different deployment scenario than native
 sliding-window/state-space/hybrid architectures such as Mistral-style sliding
@@ -175,7 +217,9 @@ Important checkpoint paths in this HF repo:
 - `checkpoints_block_d128/search_step_1000.pt`: clean six-layer d128 parity checkpoint.
 - `checkpoints_all36_d128_block/protected/search_step_500_keep.pt`: best observed all-36 checkpoint so far.
 - `checkpoints_all36_d128_block/search_step_800.pt`: latest all-36 checkpoint before stopping for analysis.
-- `checkpoints_all32_d128_block_reserve_0_1_2_35/`:
+- `checkpoints_all32_d128_block_reserve_0_1_2_35/search_step_1000.pt`: final all32 reserved-edge checkpoint.
+- `checkpoints_all32_d128_block_reserve_0_1_2_35/search_step_1000.compare_retrieval.json`: all32 per-layer retrieval comparison.
+- `checkpoints_all32_d128_block_reserve_0_1_2_35/search_step_1000.k_sweep_exact.json`: all32 exact K-sweep.
 
 These checkpoints contain the trained search projection module and optimizer
 state. They do not contain or modify the base Qwen model weights.
@@ -187,6 +231,8 @@ state. They do not contain or modify the base Qwen model weights.
 - No autoregressive KV-cache integration yet.
 - Dynamic indexing is currently supported only by a retrieval-mass proxy.
 - Main clean results are single-model and mostly single-seed.
-- All-36 broad substitution is not full-attention parity
+- All-36 broad substitution is not full-attention parity.
+- The all32 result is near parity on a small slice but still needs larger eval
+  slices, task benchmarks, and a coverage Pareto sweep.
 
 Use the GitHub repository for runnable code, scripts, and the LaTeX paper draft.
```
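The mass@K and recall@K diagnostics referenced in the diff are straightforward to reproduce in miniature with off-the-shelf FAISS. Below is a minimal sketch on synthetic single-head tensors; the shapes, the flat inner-product index, and the random stand-ins for the trained search projections are illustrative assumptions, not the repo's actual evaluation script:

```python
# Minimal sketch of the mass@K / recall@K diagnostics on synthetic tensors.
# Assumptions (not the repo's code): one head, no causal mask, a flat
# inner-product FAISS index, and random d_search=128 vectors standing in
# for the trained low-dim search projections.
import numpy as np
import faiss

N, d_model, d_search, K = 1024, 256, 128, 128
rng = np.random.default_rng(0)

q = rng.standard_normal((N, d_model)).astype("float32")   # raw queries
k = rng.standard_normal((N, d_model)).astype("float32")   # raw keys
q_s = rng.standard_normal((N, d_search)).astype("float32")  # learned-search stand-in
k_s = rng.standard_normal((N, d_search)).astype("float32")  # learned-search stand-in

# Per-forward index construction over low-dim keys (CPU FAISS, as in the card):
index = faiss.IndexFlatIP(d_search)
index.add(k_s)
_, retrieved = index.search(q_s, K)  # learned retrieval: top-K keys per query

scores = q @ k.T                                          # raw QK scores
exact_topk = np.argpartition(-scores, K, axis=1)[:, :K]   # exact top-K by raw QK

# recall@K: fraction of the exact raw-QK top-K recovered by learned retrieval.
recall = np.mean(
    [len(set(retrieved[i]) & set(exact_topk[i])) / K for i in range(N)]
)

# mass@K: softmax attention mass captured by the retrieved key set.
probs = np.exp(scores - scores.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
mass = np.mean([probs[i, retrieved[i]].sum() for i in range(N)])

print(f"recall@{K}: {recall:.3f}  mass@{K}: {mass:.3f}")
```

With a trained projection in place of the random stand-ins, recall@K and mass@K should rise toward the values in the tables above; with random projections they stay near chance (roughly K/N), which is what makes these two numbers useful as a training diagnostic.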