| --- |
| title: Gemma 4 CPU Basic API |
| colorFrom: blue |
| colorTo: gray |
| sdk: docker |
| app_port: 7860 |
| suggested_hardware: cpu-basic |
| startup_duration_timeout: 1h |
| short_description: Gemma 4 E2B API for HF CPU Basic |
| models: |
| - google/gemma-4-E2B-it |
| - unsloth/gemma-4-E2B-it-GGUF |
| tags: |
| - gemma4 |
| - llama.cpp |
| - api |
| - openai-compatible |
| preload_from_hub: |
| - unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q4_0.gguf |
| --- |
| |
| # Gemma 4 on CPU Basic |
|
|
| This Space is tuned for Hugging Face `CPU Basic`, which currently provides `2 vCPU`, `16 GB RAM`, and `50 GB` of ephemeral disk by default. |
|
|
| ## Why this setup |
|
|
| Running the original, unquantized Gemma 4 weights on CPU Basic is too slow for an API with usable latency. This Space instead uses: |
|
|
| - `Gemma 4 E2B` |
| - `GGUF Q4_0` quantization |
| - KV cache quantized to `q4_0` for both keys and values |
| - `llama.cpp` server |
| - `reasoning` disabled by default |
| - `parallel=1` to avoid contention on 2 CPU cores |
| - `ctx_size=131072` by default |
|
|
| The goal is the smallest practical Gemma 4 setup that still gives acceptable generation speed for CPU-only serving. |
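To see why the KV cache type matters at a 131072-token context, the arithmetic can be sketched as below. This is a rough estimate only: the layer count, KV head count, and head dimension are placeholder values, not the real Gemma 4 E2B dimensions, and actual usage also depends on the model's attention layout. The byte sizes follow the usual llama.cpp conventions (`f16` is 2 bytes per element, `q4_0` packs 32 elements into 18 bytes).

```bash
#!/usr/bin/env bash
# Rough KV-cache size estimate. The architecture numbers used in the
# example calls are placeholders, NOT the real Gemma 4 E2B dimensions.

# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CTX TYPE
# TYPE is f16 (2 bytes per element) or q4_0 (18 bytes per 32 elements).
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 type=$5
  # K and V each store layers * kv_heads * head_dim values per token.
  local elements=$(( 2 * layers * kv_heads * head_dim * ctx ))
  local bytes
  case "$type" in
    f16)  bytes=$(( elements * 2 )) ;;
    q4_0) bytes=$(( elements * 18 / 32 )) ;;
    *)    echo "unknown cache type: $type" >&2; return 1 ;;
  esac
  echo $(( bytes / 1024 / 1024 ))
}

# Hypothetical model: 32 layers, 4 KV heads, head dimension 128.
kv_cache_mib 32 4 128 131072 f16    # prints 8192
kv_cache_mib 32 4 128 131072 q4_0   # prints 2304
```

With these placeholder dimensions, a full-length `f16` cache would take half of the 16 GB of RAM by itself, while `q4_0` needs a bit over a quarter of that.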
|
|
| ## API |
|
|
| The server exposes an OpenAI-compatible API: |
|
|
| - `POST /v1/chat/completions` |
| - `POST /v1/completions` |
| - `GET /` |
|
|
| Example: |
|
|
| ```bash |
| curl -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "model": "gemma-4-e2b-q4", |
| "messages": [ |
| {"role": "system", "content": "You are a concise assistant."}, |
| {"role": "user", "content": "Explain in one sentence what quantization is."} |
| ], |
| "max_tokens": 128, |
| "temperature": 0.2, |
| "stream": false |
| }' |
| ``` |
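For scripting, the same request can be wrapped in a few small functions. The following is a minimal sketch that assumes `curl` and `jq` are installed; the helper names (`chat_payload`, `extract_reply`, `chat`) and the `BASE_URL` variable are illustrative and not part of this Space. The response path `.choices[0].message.content` follows the OpenAI chat completion format.

```bash
#!/usr/bin/env bash
# Sketch of a small client wrapper. Requires curl and jq.
# BASE_URL is a placeholder; point it at your own Space.
BASE_URL="${BASE_URL:-http://localhost:7860}"

# Build the JSON body with jq so quotes and newlines in the prompt are escaped.
chat_payload() {
  jq -n --arg prompt "$1" '{
    model: "gemma-4-e2b-q4",
    messages: [{role: "user", content: $prompt}],
    max_tokens: 128,
    temperature: 0.2,
    stream: false
  }'
}

# Pull the assistant text out of a chat completion response read from stdin.
extract_reply() {
  jq -r '.choices[0].message.content'
}

# Send one prompt and print only the reply text.
chat() {
  chat_payload "$1" |
    curl -sS -X POST "$BASE_URL/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data-binary @- |
    extract_reply
}

# Usage:
#   chat 'Explain in one sentence what quantization is.'
```

Building the payload with `jq` rather than string interpolation avoids broken JSON when a prompt contains quotes or line breaks.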
|
|
| ## Runtime knobs |
|
|
| You can change these in the Space settings as runtime variables: |
|
|
| - `MODEL_SPEC` (default: `unsloth/gemma-4-E2B-it-GGUF:Q4_0`) |
| - `CTX_SIZE` (default: `131072`) |
| - `THREADS` (default: `2`) |
| - `PARALLEL` (default: `1`) |
| - `CACHE_TYPE_K` (default: `q4_0`) |
| - `CACHE_TYPE_V` (default: `q4_0`) |
| - `REASONING_MODE` (default: `off`) |
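As a rough illustration of how these variables could translate into a `llama-server` invocation, consider the sketch below. It is not the Space's actual entrypoint: the function name `build_server_cmd` is hypothetical, and mapping `REASONING_MODE=off` to `--reasoning-budget 0` is an assumption. The flags themselves are standard `llama-server` options, though availability can vary between llama.cpp versions, and some builds need flash attention enabled for a quantized V cache.

```bash
#!/usr/bin/env bash
# Hypothetical start-script fragment; the Space's real entrypoint may differ.
# It prints the command instead of running it, so it is safe to experiment with.
build_server_cmd() {
  local cmd=(
    llama-server
    -hf "${MODEL_SPEC:-unsloth/gemma-4-E2B-it-GGUF:Q4_0}"
    --ctx-size "${CTX_SIZE:-131072}"
    --threads "${THREADS:-2}"
    --parallel "${PARALLEL:-1}"
    --cache-type-k "${CACHE_TYPE_K:-q4_0}"
    --cache-type-v "${CACHE_TYPE_V:-q4_0}"
    --host 0.0.0.0
    --port 7860
  )
  # Assumption: REASONING_MODE=off maps to a zero reasoning budget.
  if [ "${REASONING_MODE:-off}" = "off" ]; then
    cmd+=(--reasoning-budget 0)
  fi
  printf '%s\n' "${cmd[*]}"
}

# Show the command that the defaults would produce.
build_server_cmd
```

Overriding a variable for one call, for example `CTX_SIZE=32768 build_server_cmd`, shows how a changed setting would alter the command line.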
|
|
| ## Notes |
|
|
| - If you want a bit more quality and can accept lower speed, switch `MODEL_SPEC` to `unsloth/gemma-4-E2B-it-GGUF:Q4_K_M`. |
| - If first-token latency is too high on very long prompts, reduce `CTX_SIZE` to `65536` or `32768`. This caps the longest prompt the server will accept and shrinks the KV cache allocation. |
| - Space disk is ephemeral, so files downloaded at runtime are lost on restart. The `preload_from_hub` setting reduces cold-start time by downloading the default model file at build time. |
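Because a cold start can still take a while, a client may want to wait until the server answers before sending requests. The sketch below polls the `GET /` endpoint listed above; the function name `wait_for_server` and the timeout and interval defaults are illustrative choices, and it assumes `curl` is available.

```bash
#!/usr/bin/env bash
# wait_for_server URL [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 once the URL answers successfully, 1 if the timeout expires first.
wait_for_server() {
  local url=$1 timeout=${2:-600} interval=${3:-10}
  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$(( SECONDS + timeout ))
  while (( SECONDS < deadline )); do
    # -f makes curl treat HTTP error statuses (>= 400) as failures.
    if curl -sf -o /dev/null --max-time 10 "$url"; then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Usage (replace the placeholder with your own Space URL):
#   wait_for_server "https://<your-space>.hf.space/" 3600 && echo "ready"
```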
|
|