Document 128K context and KV cache quantization defaults
README.md
CHANGED
@@ -29,9 +29,11 @@ Running the original Gemma 4 weights on CPU Basic is too slow for an API with us
 
 - `Gemma 4 E2B`
 - `GGUF Q4_0` quantization
+- `KV cache q4_0 / q4_0`
 - `llama.cpp` server
 - `reasoning` disabled by default
 - `parallel=1` to avoid contention on 2 CPU cores
+- `ctx_size=131072` by default
 
 This is the smallest practical Gemma 4 setup for CPU-only serving with acceptable generation speed.
 
@@ -67,16 +69,20 @@ You can change these in the Space settings as runtime variables:
 - `MODEL_SPEC`
 Default: `unsloth/gemma-4-E2B-it-GGUF:Q4_0`
 - `CTX_SIZE`
-Default: `
+Default: `131072`
 - `THREADS`
 Default: `2`
 - `PARALLEL`
 Default: `1`
+- `CACHE_TYPE_K`
+Default: `q4_0`
+- `CACHE_TYPE_V`
+Default: `q4_0`
 - `REASONING_MODE`
 Default: `off`
 
 ## Notes
 
 - If you want a bit more quality and can accept lower speed, switch `MODEL_SPEC` to `unsloth/gemma-4-E2B-it-GGUF:Q4_K_M`.
-- If latency is
+- If first-token latency is too high on very long prompts, reduce `CTX_SIZE` to `65536` or `32768`.
 - Space disk is ephemeral. The `preload_from_hub` setting reduces cold-start time by baking the default model file into the build cache.
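
The runtime variables documented in this diff can be pictured as a thin mapping onto `llama-server` command-line options. The sketch below is only an illustration of that idea, not the Space's actual startup script, which this diff does not show. The helper name `build_server_args` and the `DEFAULTS`/`FLAGS` tables are hypothetical. The flags are standard `llama.cpp` server options, but which ones the Space really passes is an assumption. `REASONING_MODE` is left out because the diff does not say how it is translated.

```python
import os
import shlex

# Defaults copied from the README text in this diff.
DEFAULTS = {
    "MODEL_SPEC": "unsloth/gemma-4-E2B-it-GGUF:Q4_0",
    "CTX_SIZE": "131072",
    "THREADS": "2",
    "PARALLEL": "1",
    "CACHE_TYPE_K": "q4_0",
    "CACHE_TYPE_V": "q4_0",
}

# Assumed mapping from each variable to a llama-server option.
FLAGS = {
    "MODEL_SPEC": "-hf",              # Hugging Face repo, optionally with :quant
    "CTX_SIZE": "--ctx-size",         # context window in tokens
    "THREADS": "--threads",           # CPU threads used for generation
    "PARALLEL": "--parallel",         # number of concurrent request slots
    "CACHE_TYPE_K": "--cache-type-k", # KV cache data type for keys
    "CACHE_TYPE_V": "--cache-type-v", # KV cache data type for values
}


def build_server_args(env=None):
    """Return a llama-server argument list built from environment variables.

    Unset or empty variables fall back to the documented defaults.
    """
    env = os.environ if env is None else env
    args = ["llama-server"]
    for name, flag in FLAGS.items():
        value = env.get(name) or DEFAULTS[name]
        args += [flag, value]
    return args


if __name__ == "__main__":
    # Print the command line that the current environment would produce.
    print(shlex.join(build_server_args()))
```

Calling `build_server_args({"CTX_SIZE": "32768"})` changes only the `--ctx-size` value, which corresponds to the note about reducing `CTX_SIZE` when first-token latency is too high on long prompts.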