metadata
title: Gemma 4 CPU Basic API
colorFrom: blue
colorTo: gray
sdk: docker
app_port: 7860
suggested_hardware: cpu-basic
startup_duration_timeout: 1h
short_description: Gemma 4 E2B API for HF CPU Basic
models:
- google/gemma-4-E2B-it
- unsloth/gemma-4-E2B-it-GGUF
tags:
- gemma4
- llama.cpp
- api
- openai-compatible
preload_from_hub:
- unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q4_0.gguf
Gemma 4 on CPU Basic
This Space is tuned for Hugging Face CPU Basic, which currently provides 2 vCPU, 16 GB RAM, and 50 GB of ephemeral disk by default.
Why this setup
Running the original Gemma 4 weights on CPU Basic is too slow for an API with usable latency. This Space uses:
- Gemma 4 E2B
- GGUF Q4_0 quantization
- KV cache q4_0 / q4_0
- llama.cpp server
- reasoning disabled by default
- parallel=1 to avoid contention on 2 CPU cores
- ctx_size=131072 by default
This is the smallest practical Gemma 4 setup for CPU-only serving with acceptable generation speed.
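To see why the KV cache type and context size matter on a 16 GB machine, here is a rough sketch of the usual KV cache size estimate. The layer count, KV head count, and head dimension are hypothetical placeholders, not Gemma 4 E2B's real architecture. The formula assumes full attention on every layer, so treat the result as an upper bound for models that use sliding-window layers. The per-element sizes come from the ggml block layouts (q4_0 stores 32 values in 18 bytes).

```python
# Rough upper bound on KV cache size, assuming full attention on every layer.
# The architecture numbers used with this function are HYPOTHETICAL placeholders.

# Average bytes per stored element for a few ggml cache types.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values plus one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values plus one fp16 scale per block
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size,
                   type_k="q4_0", type_v="q4_0"):
    """Estimate bytes needed to hold K and V for ctx_size tokens."""
    elements_per_token = n_layers * n_kv_heads * head_dim
    k_bytes = elements_per_token * ctx_size * BYTES_PER_ELEMENT[type_k]
    v_bytes = elements_per_token * ctx_size * BYTES_PER_ELEMENT[type_v]
    return int(k_bytes + v_bytes)


def gib(n_bytes):
    """Convert bytes to GiB for display."""
    return n_bytes / 2**30


# Hypothetical model: 32 layers, 4 KV heads, head_dim 128, 131072-token context.
q4 = kv_cache_bytes(32, 4, 128, 131072, "q4_0", "q4_0")
f16 = kv_cache_bytes(32, 4, 128, 131072, "f16", "f16")
print(f"q4_0 cache: {gib(q4):.2f} GiB, f16 cache: {gib(f16):.2f} GiB")
```

Under these placeholder numbers, the q4_0 cache is a bit over a quarter of the f16 size, which is the kind of saving that makes a long default context plausible on CPU Basic.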
API
The server exposes an OpenAI-compatible API:
- POST /v1/chat/completions
- POST /v1/completions
- GET /
Example:
curl -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-4-e2b-q4",
"messages": [
{"role": "system", "content": "You are a concise assistant."},
{"role": "user", "content": "Explain in one sentence what quantization is."}
],
"max_tokens": 128,
"temperature": 0.2,
"stream": false
}'
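The same request can be sent from Python using only the standard library. This is a minimal sketch: the helper names `build_chat_request` and `chat` are made up for illustration, the 300-second timeout is an arbitrary guess for a slow CPU backend, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import urllib.request

# URL taken from the curl example above.
BASE_URL = "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space"


def build_chat_request(user_text, system_text="You are a concise assistant.",
                       max_tokens=128, temperature=0.2, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": "gemma-4-e2b-q4",
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, timeout=300, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(user_text, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]


# Usage (performs a real network call):
# print(chat("Explain in one sentence what quantization is."))
```

Keeping request construction separate from sending makes it easy to inspect the payload without hitting the Space.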
Runtime knobs
You can change these in the Space settings as runtime variables:
- MODEL_SPEC (default: unsloth/gemma-4-E2B-it-GGUF:Q4_0)
- CTX_SIZE (default: 131072)
- THREADS (default: 2)
- PARALLEL (default: 1)
- CACHE_TYPE_K (default: q4_0)
- CACHE_TYPE_V (default: q4_0)
- REASONING_MODE (default: off)
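As an illustration of how these variables could translate into a llama.cpp server command line, here is a hedged sketch. This Space's actual entrypoint is not shown here, so the mapping is an assumption. `build_server_args` is a hypothetical helper. The flags used (`--hf-repo`, `--ctx-size`, `--threads`, `--parallel`, `--cache-type-k`, `--cache-type-v`, `--host`, `--port`) are standard `llama-server` options. REASONING_MODE is deliberately left out because its mapping to a server flag is not documented here.

```python
import os

# Defaults copied from the list above.
DEFAULTS = {
    "MODEL_SPEC": "unsloth/gemma-4-E2B-it-GGUF:Q4_0",
    "CTX_SIZE": "131072",
    "THREADS": "2",
    "PARALLEL": "1",
    "CACHE_TYPE_K": "q4_0",
    "CACHE_TYPE_V": "q4_0",
}


def build_server_args(env=None):
    """Map runtime variables to a llama-server argument list (assumed mapping)."""
    env = os.environ if env is None else env

    def get(name):
        # Fall back to the documented default when the variable is unset.
        return env.get(name, DEFAULTS[name])

    return [
        "llama-server",
        "--hf-repo", get("MODEL_SPEC"),
        "--ctx-size", get("CTX_SIZE"),
        "--threads", get("THREADS"),
        "--parallel", get("PARALLEL"),
        "--cache-type-k", get("CACHE_TYPE_K"),
        "--cache-type-v", get("CACHE_TYPE_V"),
        "--host", "0.0.0.0",
        "--port", "7860",  # matches app_port in the Space metadata
    ]


print(" ".join(build_server_args()))
```

A real launcher would pass this list to `subprocess.run` or `os.execvp`. Depending on the llama.cpp version, a quantized V cache may also require flash attention to be enabled.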
Notes
- If you want a bit more quality and can accept lower speed, switch MODEL_SPEC to unsloth/gemma-4-E2B-it-GGUF:Q4_K_M.
- If first-token latency is too high on very long prompts, reduce CTX_SIZE to 65536 or 32768.
- Space disk is ephemeral. The preload_from_hub setting reduces cold-start time by baking the default model file into the build cache.