metadata
title: Gemma 4 CPU Basic API
colorFrom: blue
colorTo: gray
sdk: docker
app_port: 7860
suggested_hardware: cpu-basic
startup_duration_timeout: 1h
short_description: Gemma 4 E2B API for HF CPU Basic
models:
  - google/gemma-4-E2B-it
  - unsloth/gemma-4-E2B-it-GGUF
tags:
  - gemma4
  - llama.cpp
  - api
  - openai-compatible
preload_from_hub:
  - unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q4_0.gguf

Gemma 4 on CPU Basic

This Space is tuned for the Hugging Face CPU Basic tier, which currently provides 2 vCPUs, 16 GB of RAM, and 50 GB of ephemeral disk by default.

Why this setup

Running the original, unquantized Gemma 4 weights on CPU Basic is too slow for an API with usable latency. This Space instead uses:

  • Gemma 4 E2B
  • GGUF Q4_0 quantization
  • KV cache q4_0 / q4_0
  • llama.cpp server
  • reasoning disabled by default
  • parallel=1 to avoid contention on 2 CPU cores
  • ctx_size=131072 by default

This is the smallest practical Gemma 4 setup for CPU-only serving with acceptable generation speed.
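To make the KV cache choice above concrete, here is a minimal sketch of the storage arithmetic. In ggml, f16 takes 2 bytes per element, while q4_0 packs 32 elements into an 18-byte block (16 bytes of 4-bit values plus one 2-byte scale), so a q4_0 cache needs about 28% of the memory of an f16 cache. The `cache_bytes` helper and the vector width of 256 are illustrative assumptions, not Gemma 4 figures; plug in the real model dimensions for an actual estimate.

```shell
# Bytes needed to store N cache elements in a given ggml type.
# f16:  2 bytes per element.
# q4_0: 18 bytes per block of 32 elements
#       (16 bytes of 4-bit values plus one 2-byte scale).
cache_bytes() {
  case "$2" in
    f16)  echo $(( $1 * 2 )) ;;
    q4_0) echo $(( ($1 + 31) / 32 * 18 )) ;;
    *)    echo "unknown type: $2" >&2; return 1 ;;
  esac
}

# Placeholder example: one vector of width 256 stored for each of
# 131072 context positions. The width is hypothetical.
n=$(( 256 * 131072 ))
echo "f16:  $(cache_bytes "$n" f16) bytes"
echo "q4_0: $(cache_bytes "$n" q4_0) bytes"
```

The ratio between the two results is 18/64, independent of the placeholder width, which is why quantizing the cache matters most at the 131072-token default.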

API

The server exposes an OpenAI-compatible API:

  • POST /v1/chat/completions
  • POST /v1/completions
  • GET /
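Because the Space can take a while to start, a client may want to wait until the server responds before sending requests. The following is a minimal sketch that polls GET / until it returns HTTP 200. The `wait_for_space` name, its defaults, and the assumption that GET / answers 200 once the model is loaded are illustrative and not confirmed by this Space.

```shell
# Poll GET / until the server answers with HTTP 200, or give up.
# Arguments: base URL, max attempts (default 60), seconds between attempts (default 10).
# Assumption: a 200 on GET / means the server is ready to serve requests.
wait_for_space() {
  wfs_url="$1"
  wfs_max="${2:-60}"
  wfs_delay="${3:-10}"
  wfs_i=0
  while [ "$wfs_i" -lt "$wfs_max" ]; do
    # -w '%{http_code}' prints only the status code; the body is discarded.
    wfs_code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$wfs_url/" 2>/dev/null)"
    if [ "$wfs_code" = "200" ]; then
      return 0
    fi
    wfs_i=$(( wfs_i + 1 ))
    sleep "$wfs_delay"
  done
  return 1
}

# Usage (requires network access):
# wait_for_space "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space" && echo "ready"
```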

Example:

curl -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-q4",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "Explain in one sentence what quantization is."}
    ],
    "max_tokens": 128,
    "temperature": 0.2,
    "stream": false
  }'
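On a 2-vCPU host, streaming lets a client show tokens as they are generated instead of waiting for the full reply. The sketch below assumes the server follows the usual OpenAI-style streaming format, where each chunk arrives as a `data: {...}` line and the stream ends with `data: [DONE]`. The `sse_json_lines` helper is a hypothetical name introduced here; it only strips the framing and prints one JSON chunk per line.

```shell
# Print the JSON payload of each SSE "data:" line; stop at the [DONE] sentinel.
sse_json_lines() {
  sse_cr="$(printf '\r')"
  while IFS= read -r sse_line; do
    sse_line="${sse_line%"$sse_cr"}"      # tolerate CRLF line endings
    case "$sse_line" in
      "data: [DONE]") break ;;
      "data: "*) printf '%s\n' "${sse_line#data: }" ;;
    esac
  done
}

# Usage (requires network access; -N disables curl's output buffering):
# curl -sN -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \
#   -H "Content-Type: application/json" \
#   -d '{"model": "gemma-4-e2b-q4", "messages": [{"role": "user", "content": "Hello"}], "stream": true}' \
#   | sse_json_lines
```

Each printed line is a standalone JSON object, so it can be piped into any JSON tool to pull out the generated text.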

Runtime knobs

You can change these in the Space settings as runtime variables:

  • MODEL_SPEC (default: unsloth/gemma-4-E2B-it-GGUF:Q4_0)
  • CTX_SIZE (default: 131072)
  • THREADS (default: 2)
  • PARALLEL (default: 1)
  • CACHE_TYPE_K (default: q4_0)
  • CACHE_TYPE_V (default: q4_0)
  • REASONING_MODE (default: off)
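As a rough illustration of how these variables could map onto llama.cpp server flags, here is a hypothetical sketch. It is not this Space's actual entrypoint: the `build_server_cmd` function and the host binding are assumptions, and REASONING_MODE is left out because its mapping is not documented here. The function only prints the command so it can be inspected without starting a server.

```shell
# Hypothetical sketch: print the llama-server command implied by the
# runtime variables, falling back to the documented defaults.
build_server_cmd() {
  printf 'llama-server -hf %s --ctx-size %s --threads %s --parallel %s --cache-type-k %s --cache-type-v %s --host 0.0.0.0 --port 7860\n' \
    "${MODEL_SPEC:-unsloth/gemma-4-E2B-it-GGUF:Q4_0}" \
    "${CTX_SIZE:-131072}" \
    "${THREADS:-2}" \
    "${PARALLEL:-1}" \
    "${CACHE_TYPE_K:-q4_0}" \
    "${CACHE_TYPE_V:-q4_0}"
}

build_server_cmd
```

Setting a variable before the call, for example CTX_SIZE=32768, changes only the matching flag; everything else keeps its default.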

Notes

  • If you want slightly better output quality and can accept lower speed, switch MODEL_SPEC to unsloth/gemma-4-E2B-it-GGUF:Q4_K_M.
  • First-token latency grows with prompt length. If it is too high on very long prompts, reduce CTX_SIZE to 65536 or 32768, which caps how long a prompt the server will accept.
  • Space disk is ephemeral. The preload_from_hub setting reduces cold-start time by baking the default model file into the build cache.