gemma-4-cpu-basic-api / README.md

AlexandreScriptsMT

Document 128K context and KV cache quantization defaults

5717969 verified 15 days ago

	---
	title: Gemma 4 CPU Basic API
	colorFrom: blue
	colorTo: gray
	sdk: docker
	app_port: 7860
	suggested_hardware: cpu-basic
	startup_duration_timeout: 1h
	short_description: Gemma 4 E2B API for HF CPU Basic
	models:
	- google/gemma-4-E2B-it
	- unsloth/gemma-4-E2B-it-GGUF
	tags:
	- gemma4
	- llama.cpp
	- api
	- openai-compatible
	preload_from_hub:
	- unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q4_0.gguf
	---

	# Gemma 4 on CPU Basic

	This Space is tuned for Hugging Face `CPU Basic`, which currently provides `2 vCPU`, `16 GB RAM`, and `50 GB` of ephemeral disk by default.

	## Why this setup

	Running the original, unquantized Gemma 4 weights on CPU Basic is too slow to serve an API with usable latency. Instead, this Space uses:

	- `Gemma 4 E2B`
	- `GGUF Q4_0` quantization
	- KV cache quantized to `q4_0` for both keys and values
	- `llama.cpp` server
	- `reasoning` disabled by default
	- `parallel=1` to avoid contention on 2 CPU cores
	- `ctx_size=131072` (128K tokens) by default

	This is the smallest practical Gemma 4 setup for CPU-only serving with acceptable generation speed.
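To give a feel for why the KV cache type matters at a 131072-token context, here is a rough size estimate. It relies on the llama.cpp block layout, where 32 values take 64 bytes in `f16` and 18 bytes in `q4_0`. The layer count, KV head count, and head dimension below are placeholders chosen for illustration, not the real Gemma 4 E2B dimensions, and the formula ignores any model-specific layout such as sliding-window attention layers.

```bash
#!/usr/bin/env bash
# Rough KV cache size estimate (a sketch, not a measurement).
# Block sizes: 32 values = 64 bytes in f16, 18 bytes in q4_0
# (16 bytes of 4-bit values plus a 2-byte scale).

kv_cache_bytes() {
  # Args: layers, KV heads, head dim, context tokens, bytes per 32-value block
  local n_layers="$1" n_kv_heads="$2" head_dim="$3" ctx="$4" block_bytes="$5"
  # The factor 2 covers the separate K and V tensors.
  echo $(( 2 * n_layers * n_kv_heads * head_dim * ctx / 32 * block_bytes ))
}

# PLACEHOLDER architecture values, NOT the real Gemma 4 E2B dimensions.
LAYERS=30
KV_HEADS=1
HEAD_DIM=256
CTX=131072

f16_bytes=$(kv_cache_bytes "$LAYERS" "$KV_HEADS" "$HEAD_DIM" "$CTX" 64)
q4_bytes=$(kv_cache_bytes "$LAYERS" "$KV_HEADS" "$HEAD_DIM" "$CTX" 18)

echo "f16  KV cache: $(( f16_bytes / 1048576 )) MiB"
echo "q4_0 KV cache: $(( q4_bytes / 1048576 )) MiB"
```

Whatever the real dimensions are, the ratio holds: a `q4_0` cache takes 18/64, or about 28%, of the memory of an `f16` cache with the same shape.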

	## API

	The server exposes an OpenAI-compatible API:

	- `POST /v1/chat/completions`
	- `POST /v1/completions`
	- `GET /`

	Example:

	```bash
	curl -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \
	  -H "Content-Type: application/json" \
	  -d '{
	    "model": "gemma-4-e2b-q4",
	    "messages": [
	      {"role": "system", "content": "You are a concise assistant."},
	      {"role": "user", "content": "Explain in one sentence what quantization is."}
	    ],
	    "max_tokens": 128,
	    "temperature": 0.2,
	    "stream": false
	  }'
	```
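The `POST /v1/completions` endpoint listed above can be called the same way. The sketch below assumes it accepts the usual OpenAI-style `prompt`, `max_tokens`, and `temperature` fields and reuses the model name from the chat example; the function names are illustrative. Unlike the chat endpoint, a plain completion request sends the prompt text as-is, without a system or user role.

```bash
#!/usr/bin/env bash
# Base URL taken from the chat example; override it for your own Space.
BASE_URL="${BASE_URL:-https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space}"

# Build the request body. The prompt is fixed text here; arbitrary user
# input would need proper JSON escaping (for example with jq).
completion_payload() {
  cat <<'JSON'
{
  "model": "gemma-4-e2b-q4",
  "prompt": "Quantization is",
  "max_tokens": 64,
  "temperature": 0.2
}
JSON
}

# Send the body to /v1/completions, reading it from stdin (-d @-).
complete_text() {
  completion_payload | curl -sS -X POST "$BASE_URL/v1/completions" \
    -H "Content-Type: application/json" \
    -d @-
}

# Uncomment to send the request:
# complete_text
```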

	## Runtime knobs

	You can change these in the Space settings as runtime variables:

	- `MODEL_SPEC` (default: `unsloth/gemma-4-E2B-it-GGUF:Q4_0`)
	- `CTX_SIZE` (default: `131072`)
	- `THREADS` (default: `2`)
	- `PARALLEL` (default: `1`)
	- `CACHE_TYPE_K` (default: `q4_0`)
	- `CACHE_TYPE_V` (default: `q4_0`)
	- `REASONING_MODE` (default: `off`)
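This README does not show the Space's start script, so the following is only a sketch of how an entrypoint might map the variables above to `llama-server` flags. The function name `build_server_args` is hypothetical. `REASONING_MODE` is left out because the README does not say how it is applied.

```bash
#!/usr/bin/env bash
# Sketch: translate the runtime variables into llama-server arguments.
# This is an assumption about the entrypoint, not the Space's actual script.

build_server_args() {
  # Fall back to the documented defaults when a variable is unset.
  local model_spec="${MODEL_SPEC:-unsloth/gemma-4-E2B-it-GGUF:Q4_0}"
  local ctx_size="${CTX_SIZE:-131072}"
  local threads="${THREADS:-2}"
  local parallel="${PARALLEL:-1}"
  local cache_k="${CACHE_TYPE_K:-q4_0}"
  local cache_v="${CACHE_TYPE_V:-q4_0}"

  # One argument per line; port 7860 matches app_port in the front matter.
  printf '%s\n' \
    -hf "$model_spec" \
    --ctx-size "$ctx_size" \
    --threads "$threads" \
    --parallel "$parallel" \
    --cache-type-k "$cache_k" \
    --cache-type-v "$cache_v" \
    --host 0.0.0.0 \
    --port 7860
}

# Print the resulting command line for inspection.
echo "llama-server $(build_server_args | tr '\n' ' ')"

# To launch for real (values contain no spaces, so word splitting is safe):
# exec llama-server $(build_server_args)
```

Printing the command before launching makes it easy to confirm in the Space logs which overrides took effect.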

	## Notes

	- If you want a bit more quality and can accept lower speed, switch `MODEL_SPEC` to `unsloth/gemma-4-E2B-it-GGUF:Q4_K_M`.
	- If first-token latency is too high on very long prompts, reduce `CTX_SIZE` to `65536` or `32768`.
	- Space disk is ephemeral. The `preload_from_hub` setting reduces cold-start time by baking the default model file into the build cache.