---
title: Gemma 4 CPU Basic API
colorFrom: blue
colorTo: gray
sdk: docker
app_port: 7860
suggested_hardware: cpu-basic
startup_duration_timeout: 1h
short_description: Gemma 4 E2B API for HF CPU Basic
models:
- google/gemma-4-E2B-it
- unsloth/gemma-4-E2B-it-GGUF
tags:
- gemma4
- llama.cpp
- api
- openai-compatible
preload_from_hub:
- unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q4_0.gguf
---
# Gemma 4 on CPU Basic
This Space is tuned for Hugging Face `CPU Basic`, which currently provides `2 vCPU`, `16 GB RAM`, and `50 GB` of ephemeral disk by default.
## Why this setup
Running the original, unquantized Gemma 4 weights on CPU Basic is too slow to serve an API with usable latency. This Space instead uses:
- `Gemma 4 E2B`
- `GGUF Q4_0` quantization
- KV cache quantized to `q4_0` for both keys and values
- `llama.cpp` server
- `reasoning` disabled by default
- `parallel=1` to avoid contention on 2 CPU cores
- `ctx_size=131072` by default
This is the smallest practical Gemma 4 setup for CPU-only serving with acceptable generation speed.
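To get a feel for why the KV cache is quantized at a 128K context, the sketch below estimates an upper bound on cache size from the cache type. It relies only on the ggml block layouts (`q4_0` packs 32 values into 18 bytes, `q8_0` into 34 bytes, `f16` uses 2 bytes per value). The function name `kv_cache_bytes` and the shape values in the example call are hypothetical placeholders, not the real Gemma 4 E2B dimensions, and models that use sliding-window attention on some layers may allocate less than this bound.

```shell
#!/usr/bin/env bash
# Upper-bound KV cache size in bytes: K and V tensors, every layer, full context.
# Arguments: layers, kv_heads, head_dim, ctx, cache type (default q4_0).
kv_cache_bytes() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 type=${5:-q4_0}
  local bytes_per_32
  case "$type" in
    q4_0) bytes_per_32=18 ;;   # 16 bytes of 4-bit values + 2-byte fp16 scale
    q8_0) bytes_per_32=34 ;;   # 32 bytes of 8-bit values + 2-byte fp16 scale
    f16)  bytes_per_32=64 ;;   # 2 bytes per value
    *) echo "unsupported cache type: $type" >&2; return 1 ;;
  esac
  # The factor 2 covers the separate K and V tensors.
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes_per_32 / 32 ))
}

# Placeholder shape (NOT the real model dimensions), compared across cache types.
for t in f16 q8_0 q4_0; do
  echo "$t: $(kv_cache_bytes 8 2 128 131072 "$t") bytes"
done
```

Whatever the real shape is, the ratio between types is fixed: a `q4_0` cache takes 18/64 of the space of an `f16` cache for the same context length.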
## API
The server exposes an OpenAI-compatible API:
- `POST /v1/chat/completions`
- `POST /v1/completions`
- `GET /`
Example:
```bash
curl -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-q4",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "Explain in one sentence what quantization is."}
    ],
    "max_tokens": 128,
    "temperature": 0.2,
    "stream": false
  }'
```
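On a 2-vCPU machine a full reply can take a while, so streaming is often more pleasant than waiting for the whole body. The sketch below assumes the server streams OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`) when `"stream": true` is set. The helper names `sse_chunks` and `stream_chat` and the `BASE_URL` variable are hypothetical, and the prompt is interpolated without JSON escaping, so it is only safe for simple text.

```shell
#!/usr/bin/env bash
# Print the JSON payload of each SSE "data:" line; stop at the [DONE] sentinel.
sse_chunks() {
  local line
  while IFS= read -r line; do
    line=${line%$'\r'}                      # tolerate CRLF line endings
    case "$line" in
      "data: [DONE]") break ;;
      "data: "*) printf '%s\n' "${line#data: }" ;;
    esac
  done
}

# Hypothetical wrapper: stream one chat completion and emit one JSON chunk per line.
# The prompt must not contain quotes or backslashes (no JSON escaping is done).
stream_chat() {
  local prompt=$1
  local base=${BASE_URL:-https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space}
  curl -sN -X POST "$base/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"gemma-4-e2b-q4\", \"messages\": [{\"role\": \"user\", \"content\": \"$prompt\"}], \"max_tokens\": 128, \"stream\": true}" \
    | sse_chunks
}
```

If `jq` is installed, `stream_chat "Explain quantization in one sentence." | jq -rj '.choices[0].delta.content // empty'` should print the text as it arrives.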
## Runtime knobs
You can change these in the Space settings as runtime variables:
- `MODEL_SPEC` (default: `unsloth/gemma-4-E2B-it-GGUF:Q4_0`)
- `CTX_SIZE` (default: `131072`)
- `THREADS` (default: `2`)
- `PARALLEL` (default: `1`)
- `CACHE_TYPE_K` (default: `q4_0`)
- `CACHE_TYPE_V` (default: `q4_0`)
- `REASONING_MODE` (default: `off`)
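As a rough illustration of how these variables could map onto `llama-server` options, here is a minimal sketch. It is not the Space's actual entrypoint, which is not shown here; the function names `build_server_args` and `start_server` are hypothetical. `REASONING_MODE` is deliberately left out because the flag it maps to is not documented in this README, and the port is taken from `app_port` in the Space metadata.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: emit llama-server arguments, one per line,
# falling back to the documented defaults when a variable is unset.
build_server_args() {
  printf '%s\n' \
    -hf "${MODEL_SPEC:-unsloth/gemma-4-E2B-it-GGUF:Q4_0}" \
    --ctx-size "${CTX_SIZE:-131072}" \
    --threads "${THREADS:-2}" \
    --parallel "${PARALLEL:-1}" \
    --cache-type-k "${CACHE_TYPE_K:-q4_0}" \
    --cache-type-v "${CACHE_TYPE_V:-q4_0}" \
    --host 0.0.0.0 \
    --port 7860
}

# Defined but not called here: replace the shell with the server process.
# Depending on the llama.cpp version, a quantized V cache may also require
# flash attention to be enabled.
start_server() {
  local args
  mapfile -t args < <(build_server_args)
  exec llama-server "${args[@]}"
}

# Print the resulting command line for inspection.
echo "llama-server $(build_server_args | tr '\n' ' ')"
```

Emitting one argument per line and reading them back with `mapfile` keeps values intact even if a variable contains spaces.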
## Notes
- If you want a bit more quality and can accept lower speed, switch `MODEL_SPEC` to `unsloth/gemma-4-E2B-it-GGUF:Q4_K_M`.
- If first-token latency is too high on very long prompts, reduce `CTX_SIZE` to `65536` or `32768`.
- Space disk is ephemeral. The `preload_from_hub` setting reduces cold-start time by baking the default model file into the build cache.