AlexandreScriptsMT committed on
Commit 5717969 · verified · 1 Parent(s): 64daee6

Document 128K context and KV cache quantization defaults

Files changed (1)
  1. README.md +8 -2
README.md CHANGED
@@ -29,9 +29,11 @@ Running the original Gemma 4 weights on CPU Basic is too slow for an API with us
 
 - `Gemma 4 E2B`
 - `GGUF Q4_0` quantization
+- `KV cache q4_0 / q4_0`
 - `llama.cpp` server
 - `reasoning` disabled by default
 - `parallel=1` to avoid contention on 2 CPU cores
+- `ctx_size=131072` by default
 
 This is the smallest practical Gemma 4 setup for CPU-only serving with acceptable generation speed.
 
@@ -67,16 +69,20 @@ You can change these in the Space settings as runtime variables:
 - `MODEL_SPEC`
 Default: `unsloth/gemma-4-E2B-it-GGUF:Q4_0`
 - `CTX_SIZE`
-Default: `4096`
+Default: `131072`
 - `THREADS`
 Default: `2`
 - `PARALLEL`
 Default: `1`
+- `CACHE_TYPE_K`
+Default: `q4_0`
+- `CACHE_TYPE_V`
+Default: `q4_0`
 - `REASONING_MODE`
 Default: `off`
 
 ## Notes
 
 - If you want a bit more quality and can accept lower speed, switch `MODEL_SPEC` to `unsloth/gemma-4-E2B-it-GGUF:Q4_0`.
-- If latency is still high for your prompts, reduce `CTX_SIZE` to `2048`.
+- If first-token latency is too high on very long prompts, reduce `CTX_SIZE` to `65536` or `32768`.
 - Space disk is ephemeral. The `preload_from_hub` setting reduces cold-start time by baking the default model file into the build cache.
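
For reviewers who want to see how the documented runtime variables might reach the server, here is a minimal Python sketch. The Space's actual startup script is not part of this diff, so the mapping from each variable to a `llama-server` flag is an assumption, and `DEFAULTS` and `build_server_args` are hypothetical names introduced only for illustration. `REASONING_MODE` is left out because the diff does not show how it is translated.

```python
import os

# Defaults copied from the updated README in this commit.
DEFAULTS = {
    "MODEL_SPEC": "unsloth/gemma-4-E2B-it-GGUF:Q4_0",
    "CTX_SIZE": "131072",
    "THREADS": "2",
    "PARALLEL": "1",
    "CACHE_TYPE_K": "q4_0",
    "CACHE_TYPE_V": "q4_0",
}


def build_server_args(env=None):
    """Return a llama-server argument list from runtime variables.

    Hypothetical helper: the variable-to-flag mapping is assumed,
    not taken from the Space's real entrypoint.
    """
    env = os.environ if env is None else env
    # Fall back to the README default when a variable is unset.
    cfg = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return [
        "llama-server",
        "-hf", cfg["MODEL_SPEC"],                # Hugging Face repo[:quant]
        "--ctx-size", cfg["CTX_SIZE"],           # context window in tokens
        "--threads", cfg["THREADS"],             # CPU threads for generation
        "--parallel", cfg["PARALLEL"],           # number of server slots
        "--cache-type-k", cfg["CACHE_TYPE_K"],   # KV cache type for K
        "--cache-type-v", cfg["CACHE_TYPE_V"],   # KV cache type for V
    ]


if __name__ == "__main__":
    print(" ".join(build_server_args()))
```

Under these assumptions, following the note about long prompts amounts to calling `build_server_args({"CTX_SIZE": "32768"})`, which changes only the value after `--ctx-size` and leaves the other defaults in place.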