Space: AlexandreScriptsMT/gemma-4-cpu-basic-api

AlexandreScriptsMT committed on May 4

Commit 5717969 (verified) · 1 parent: 64daee6

Document 128K context and KV cache quantization defaults

Files changed (1): README.md

@@ -29,9 +29,11 @@ Running the original Gemma 4 weights on CPU Basic is too slow for an API with us
 - `Gemma 4 E2B`
 - `GGUF Q4_0` quantization
+- `KV cache q4_0 / q4_0`
 - `llama.cpp` server
 - `reasoning` disabled by default
 - `parallel=1` to avoid contention on 2 CPU cores
+- `ctx_size=131072` by default
 This is the smallest practical Gemma 4 setup for CPU-only serving with acceptable generation speed.
@@ -67,16 +69,20 @@ You can change these in the Space settings as runtime variables:
 - `MODEL_SPEC`
   Default: `unsloth/gemma-4-E2B-it-GGUF:Q4_0`
 - `CTX_SIZE`
-  Default: `4096`
+  Default: `131072`
 - `THREADS`
   Default: `2`
 - `PARALLEL`
   Default: `1`
+- `CACHE_TYPE_K`
+  Default: `q4_0`
+- `CACHE_TYPE_V`
+  Default: `q4_0`
 - `REASONING_MODE`
   Default: `off`
 ## Notes
 - If you want a bit more quality and can accept lower speed, switch `MODEL_SPEC` to `unsloth/gemma-4-E2B-it-GGUF:Q4_K_M`.
-- If latency is still high for your prompts, reduce `CTX_SIZE` to `2048`.
+- If first-token latency is too high on very long prompts, reduce `CTX_SIZE` to `65536` or `32768`.
 - Space disk is ephemeral. The `preload_from_hub` setting reduces cold-start time by baking the default model file into the build cache.
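
The diff lists the runtime variables and their new defaults but does not show how the Space turns them into a server command. As a rough illustration only, the sketch below maps each variable to what is likely the corresponding `llama-server` flag. The helper name `build_llama_server_args` and the mapping are assumptions, not the Space's actual entrypoint. `REASONING_MODE` is left out because its translation is not visible in this diff. Depending on the llama.cpp build, a quantized V cache may also need flash attention enabled.

```python
import os

# Defaults as documented in the README after this commit.
DEFAULTS = {
    "MODEL_SPEC": "unsloth/gemma-4-E2B-it-GGUF:Q4_0",
    "CTX_SIZE": "131072",
    "THREADS": "2",
    "PARALLEL": "1",
    "CACHE_TYPE_K": "q4_0",
    "CACHE_TYPE_V": "q4_0",
}

# Assumed variable-to-flag mapping (hypothetical; the Space's real
# start script is not part of this diff).
FLAGS = {
    "MODEL_SPEC": "-hf",
    "CTX_SIZE": "--ctx-size",
    "THREADS": "--threads",
    "PARALLEL": "--parallel",
    "CACHE_TYPE_K": "--cache-type-k",
    "CACHE_TYPE_V": "--cache-type-v",
}


def build_llama_server_args(env=None):
    """Return an argv list, using README defaults for unset variables."""
    env = os.environ if env is None else env
    args = ["llama-server"]
    for name, flag in FLAGS.items():
        args += [flag, env.get(name, DEFAULTS[name])]
    return args


if __name__ == "__main__":
    # Print the command that the current environment would produce.
    print(" ".join(build_llama_server_args()))
```

Setting `CTX_SIZE=32768` in the environment before running the script changes only the `--ctx-size` value, which corresponds to the adjustment the Notes section suggests when first-token latency is too high on very long prompts.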