AlexandreScriptsMT/gemma-4-cpu-basic-api


metadata

title: Gemma 4 CPU Basic API
colorFrom: blue
colorTo: gray
sdk: docker
app_port: 7860
suggested_hardware: cpu-basic
startup_duration_timeout: 1h
short_description: Gemma 4 E2B API for HF CPU Basic
models:
  - google/gemma-4-E2B-it
  - unsloth/gemma-4-E2B-it-GGUF
tags:
  - gemma4
  - llama.cpp
  - api
  - openai-compatible
preload_from_hub:
  - unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q4_0.gguf

Gemma 4 on CPU Basic

This Space is tuned for the Hugging Face CPU Basic tier, which currently provides 2 vCPUs, 16 GB of RAM, and 50 GB of ephemeral disk by default.

Why this setup

Running the original, unquantized Gemma 4 weights on CPU Basic is too slow to serve an API with usable latency. This Space instead uses:

Gemma 4 E2B
GGUF Q4_0 quantization
KV cache quantized to q4_0 for both keys and values
llama.cpp server
reasoning disabled by default
parallel=1 to avoid contention on the 2 vCPUs
ctx_size=131072 by default

This is the smallest practical Gemma 4 setup for CPU-only serving with acceptable generation speed.
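The effect of the q4_0 KV cache at a 131072-token context can be estimated with some simple arithmetic. The sketch below relies on the ggml q4_0 layout (32 values per 18-byte block) and compares it with an f16 cache. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real Gemma 4 E2B configuration, and the formula ignores architecture details such as sliding-window layers, so treat the result as an order-of-magnitude illustration only.

```python
# q4_0 in ggml packs 32 values per block: a 2-byte fp16 scale plus 16 bytes of 4-bit values.
Q4_0_BLOCK_VALUES = 32
Q4_0_BLOCK_BYTES = 18
F16_BYTES_PER_VALUE = 2


def kv_cache_elements(n_layers: int, n_kv_heads: int, head_dim: int, ctx_size: int) -> int:
    """Count cached values: keys plus values, for every layer and context position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size


def f16_bytes(elements: int) -> int:
    """Size of the cache if every value is stored as a 16-bit float."""
    return elements * F16_BYTES_PER_VALUE


def q4_0_bytes(elements: int) -> int:
    """Size of the cache in q4_0, rounding up to whole 32-value blocks."""
    blocks = -(-elements // Q4_0_BLOCK_VALUES)  # ceiling division
    return blocks * Q4_0_BLOCK_BYTES


# HYPOTHETICAL model dimensions, chosen only to make the arithmetic concrete.
elements = kv_cache_elements(n_layers=30, n_kv_heads=1, head_dim=256, ctx_size=131072)

GIB = 1024 ** 3
print(f"f16  KV cache: {f16_bytes(elements) / GIB:.2f} GiB")
print(f"q4_0 KV cache: {q4_0_bytes(elements) / GIB:.2f} GiB")
```

Whatever the real dimensions are, the ratio is fixed by the formats: q4_0 uses 18/32 bytes per value against 2 bytes for f16, or about 28% of the memory.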

API

The server exposes an OpenAI-compatible API:

POST /v1/chat/completions
POST /v1/completions
GET /

Example:

curl -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-q4",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "Explain in one sentence what quantization is."}
    ],
    "max_tokens": 128,
    "temperature": 0.2,
    "stream": false
  }'
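The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above. The helper names (`build_chat_request`, `extract_reply`, `ask`) are illustrative, and the response parsing assumes the usual OpenAI chat-completions shape (`choices[0].message.content`). The timeout is an arbitrary, generous guess for CPU-only generation.

```python
import json
import urllib.request

BASE_URL = "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space"


def build_chat_request(prompt: str,
                       system: str = "You are a concise assistant.",
                       max_tokens: int = 128,
                       temperature: float = 0.2,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": "gemma-4-e2b-q4",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def ask(prompt: str, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's reply (performs network I/O)."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))


# Usage (requires the Space to be running):
#   print(ask("Explain in one sentence what quantization is."))
```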

Runtime knobs

You can change these in the Space settings as runtime variables:

MODEL_SPEC Default: unsloth/gemma-4-E2B-it-GGUF:Q4_0
CTX_SIZE Default: 131072
THREADS Default: 2
PARALLEL Default: 1
CACHE_TYPE_K Default: q4_0
CACHE_TYPE_V Default: q4_0
REASONING_MODE Default: off
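This README does not show the Space's entrypoint, so the following is only a sketch of how these variables might be translated into a `llama-server` command line. The function name `build_server_args` is hypothetical, and the flag mapping is an assumption based on llama.cpp's documented options (`--hf-repo`, `--ctx-size`, `--threads`, `--parallel`, `--cache-type-k`, `--cache-type-v`). REASONING_MODE is deliberately left out because its handling is specific to this Space. The port comes from `app_port` in the metadata.

```python
import shlex
from typing import Mapping

# Defaults as listed in this README.
DEFAULTS = {
    "MODEL_SPEC": "unsloth/gemma-4-E2B-it-GGUF:Q4_0",
    "CTX_SIZE": "131072",
    "THREADS": "2",
    "PARALLEL": "1",
    "CACHE_TYPE_K": "q4_0",
    "CACHE_TYPE_V": "q4_0",
}


def build_server_args(env: Mapping[str, str]) -> list[str]:
    """Map runtime variables to llama-server flags (ASSUMED mapping, not the Space's actual script)."""
    cfg = {name: env.get(name, default) for name, default in DEFAULTS.items()}

    # Reject non-numeric or non-positive values early instead of failing inside the server.
    for name in ("CTX_SIZE", "THREADS", "PARALLEL"):
        if not cfg[name].isdigit() or int(cfg[name]) < 1:
            raise ValueError(f"{name} must be a positive integer, got {cfg[name]!r}")

    return [
        "llama-server",
        "--hf-repo", cfg["MODEL_SPEC"],
        "--ctx-size", cfg["CTX_SIZE"],
        "--threads", cfg["THREADS"],
        "--parallel", cfg["PARALLEL"],
        "--cache-type-k", cfg["CACHE_TYPE_K"],
        "--cache-type-v", cfg["CACHE_TYPE_V"],
        "--host", "0.0.0.0",
        "--port", "7860",
    ]


# Print the command line that the defaults would produce.
print(shlex.join(build_server_args({})))
```

Passing `os.environ` instead of `{}` would pick up any overrides set in the Space settings.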

Notes

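If you want slightly better output quality and can accept lower speed, switch MODEL_SPEC to unsloth/gemma-4-E2B-it-GGUF:Q4_K_M.
If first-token latency is too high on very long prompts, reduce CTX_SIZE to 65536 or 32768. A smaller context caps the prompt length the server accepts and shrinks the KV cache allocation.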
Space disk is ephemeral. The preload_from_hub setting reduces cold-start time by baking the default model file into the build cache.
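Because a Space can take a while to become reachable after a cold start, a client may want to poll before sending real requests. The helper below is a hypothetical sketch: `wait_until_ready` and `probe_root` are illustrative names, the attempt count and delay are arbitrary, and it assumes that a 200 response from `GET /` means the server is up.

```python
import time
import urllib.request
from typing import Callable

BASE_URL = "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space"


def probe_root(base_url: str = BASE_URL, timeout: float = 10.0) -> bool:
    """Return True if GET / answers with HTTP 200 (performs network I/O)."""
    with urllib.request.urlopen(base_url + "/", timeout=timeout) as response:
        return response.status == 200


def wait_until_ready(probe: Callable[[], bool],
                     attempts: int = 30,
                     delay_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            # Covers connection errors, timeouts, and urllib's URLError/HTTPError.
            pass
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Usage (requires network access):
#   if wait_until_ready(probe_root):
#       print("Space is up")
```

Taking `probe` and `sleep` as parameters keeps the retry loop testable without network access or real delays.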