---
title: Gemma 4 CPU Basic API
colorFrom: blue
colorTo: gray
sdk: docker
app_port: 7860
suggested_hardware: cpu-basic
startup_duration_timeout: 1h
short_description: Gemma 4 E2B API for HF CPU Basic
models:
  - google/gemma-4-E2B-it
  - unsloth/gemma-4-E2B-it-GGUF
tags:
  - gemma4
  - llama.cpp
  - api
  - openai-compatible
preload_from_hub:
  - unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q4_0.gguf
---

# Gemma 4 on CPU Basic

This Space is tuned for Hugging Face `CPU Basic`, which currently provides `2 vCPU`, `16 GB RAM`, and `50 GB` of ephemeral disk by default.

## Why this setup

Running the original Gemma 4 weights on CPU Basic is too slow for an API with usable latency. This Space uses:

- `Gemma 4 E2B`
- `GGUF Q4_0` quantization
- `KV cache q4_0 / q4_0`
- `llama.cpp` server
- `reasoning` disabled by default
- `parallel=1` to avoid contention on 2 CPU cores
- `ctx_size=131072` by default

This is the smallest practical Gemma 4 setup for CPU-only serving with acceptable generation speed.

## API

The server exposes an OpenAI-compatible API:

- `POST /v1/chat/completions`
- `POST /v1/completions`
- `GET /`

Example:

```bash
curl -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-q4",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "Explain in one sentence what quantization is."}
    ],
    "max_tokens": 128,
    "temperature": 0.2,
    "stream": false
  }'
```

## Runtime knobs

You can change these in the Space settings as runtime variables:

- `MODEL_SPEC`
  Default: `unsloth/gemma-4-E2B-it-GGUF:Q4_0`
- `CTX_SIZE`
  Default: `131072`
- `THREADS`
  Default: `2`
- `PARALLEL`
  Default: `1`
- `CACHE_TYPE_K`
  Default: `q4_0`
- `CACHE_TYPE_V`
  Default: `q4_0`
- `REASONING_MODE`
  Default: `off`

## Notes

- If you want a bit more quality and can accept lower speed, switch `MODEL_SPEC` to `unsloth/gemma-4-E2B-it-GGUF:Q4_K_M`.
- If first-token latency is too high on very long prompts, reduce `CTX_SIZE` to `65536` or `32768`.
- Space disk is ephemeral. The `preload_from_hub` setting reduces cold-start time by baking the default model file into the build cache.
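
To make the runtime variables more concrete, here is a minimal sketch of how they might translate into a `llama-server` command line. The function name `build_server_cmd` is hypothetical, and the mapping of `REASONING_MODE=off` to `--reasoning-budget 0` is an assumption; the Space's actual entrypoint script is not shown here and may differ. The function only prints the command and does not start the server.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: map the Space's runtime variables to llama-server flags.
# Defaults mirror the values documented in this README; the real entrypoint may differ.

build_server_cmd() {
  local model_spec="${MODEL_SPEC:-unsloth/gemma-4-E2B-it-GGUF:Q4_0}"
  local ctx_size="${CTX_SIZE:-131072}"
  local threads="${THREADS:-2}"
  local parallel="${PARALLEL:-1}"
  local cache_k="${CACHE_TYPE_K:-q4_0}"
  local cache_v="${CACHE_TYPE_V:-q4_0}"
  local reasoning="${REASONING_MODE:-off}"

  # Assumption: "off" maps to a zero reasoning budget, anything else to unrestricted (-1).
  local budget=-1
  if [ "$reasoning" = "off" ]; then
    budget=0
  fi

  # Print the command instead of running it, so it can be inspected first.
  echo llama-server \
    -hf "$model_spec" \
    --ctx-size "$ctx_size" \
    --threads "$threads" \
    --parallel "$parallel" \
    --cache-type-k "$cache_k" \
    --cache-type-v "$cache_v" \
    --reasoning-budget "$budget" \
    --host 0.0.0.0 --port 7860
}

build_server_cmd
```

For example, `CTX_SIZE=32768 build_server_cmd` prints the same command with `--ctx-size 32768`, which corresponds to the first-token latency note above.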