Add README for Gemma 4 CPU Basic API Space
README.md
CHANGED
|
@@ -1,10 +1,82 @@
|
|
| 1 |
---
|
| 2 |
-
title: Gemma 4
|
| 3 |
-
|
| 4 |
-
colorFrom: red
|
| 5 |
colorTo: gray
|
| 6 |
sdk: docker
|
| 7 |
-
|
| 8 |
---
|
| 9 |
|
| 10 |
-
|
| 1 |
---
|
| 2 |
+
title: Gemma 4 CPU Basic API
|
| 3 |
+
colorFrom: blue
|
| 4 |
colorTo: gray
|
| 5 |
sdk: docker
|
| 6 |
+
app_port: 7860
|
| 7 |
+
suggested_hardware: cpu-basic
|
| 8 |
+
startup_duration_timeout: 1h
|
| 9 |
+
short_description: Gemma 4 E2B API for HF CPU Basic
|
| 10 |
+
models:
|
| 11 |
+
- google/gemma-4-E2B-it
|
| 12 |
+
- unsloth/gemma-4-E2B-it-GGUF
|
| 13 |
+
tags:
|
| 14 |
+
- gemma4
|
| 15 |
+
- llama.cpp
|
| 16 |
+
- api
|
| 17 |
+
- openai-compatible
|
| 18 |
+
preload_from_hub:
|
| 19 |
+
- unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q4_0.gguf
|
| 20 |
---
|
| 21 |
|
| 22 |
+
# Gemma 4 on CPU Basic
|
| 23 |
+
|
| 24 |
+
This Space is tuned for Hugging Face `CPU Basic`, which currently provides `2 vCPU`, `16 GB RAM`, and `50 GB` of ephemeral disk by default.
|
| 25 |
+
|
| 26 |
+
## Why this setup
|
| 27 |
+
|
| 28 |
+
Running the original, unquantized Gemma 4 weights on CPU Basic is too slow for an API with usable latency. Instead, this Space uses:
|
| 29 |
+
|
| 30 |
+
- `Gemma 4 E2B`
|
| 31 |
+
- `GGUF Q4_0` quantization
|
| 32 |
+
- `llama.cpp` server
|
| 33 |
+
- `reasoning` disabled by default
|
| 34 |
+
- `parallel=1` to avoid contention on 2 CPU cores
|
| 35 |
+
|
| 36 |
+
This is the smallest practical Gemma 4 setup for CPU-only serving with acceptable generation speed.
|
| 37 |
+
|
| 38 |
+
## API
|
| 39 |
+
|
| 40 |
+
The server exposes an OpenAI-compatible API:
|
| 41 |
+
|
| 42 |
+
- `POST /v1/chat/completions`
|
| 43 |
+
- `POST /v1/completions`
|
| 44 |
+
- `GET /`
|
| 45 |
+
|
| 46 |
+
Example:
|
| 47 |
+
|
| 48 |
+
```bash
curl -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-q4",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "Explain in one sentence what quantization is."}
    ],
    "max_tokens": 128,
    "temperature": 0.2,
    "stream": false
  }'
```
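
The same request can be sketched in Python using only the standard library. The URL, model name, and sampling parameters are copied from the curl example. The `build_chat_request` and `chat` helpers, the 300-second timeout, and the assumption that the response follows the usual OpenAI chat completion shape (`choices[0].message.content`) are illustrative and not taken from this Space's code.

```python
import json
import urllib.request

# Values copied from the curl example in this README.
BASE_URL = "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space"
MODEL = "gemma-4-e2b-q4"


def build_chat_request(user_prompt, system_prompt="You are a concise assistant.",
                       max_tokens=128, temperature=0.2):
    """Return the JSON-serializable body for POST /v1/chat/completions."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,
    }


def chat(user_prompt, timeout=300):
    """Send one non-streaming chat request and return the reply text.

    The generous timeout is an assumption: CPU-only generation can be slow.
    """
    body = json.dumps(build_chat_request(user_prompt)).encode("utf-8")
    request = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumes the standard OpenAI chat completion response layout.
    return payload["choices"][0]["message"]["content"]
```

Calling `chat("Explain in one sentence what quantization is.")` should return the assistant's reply as a string, provided the Space is running and reachable.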
|
| 62 |
+
|
| 63 |
+
## Runtime knobs
|
| 64 |
+
|
| 65 |
+
You can change these in the Space settings as runtime variables:
|
| 66 |
+
|
| 67 |
+
- `MODEL_SPEC`
|
| 68 |
+
Default: `unsloth/gemma-4-E2B-it-GGUF:Q4_0`
|
| 69 |
+
- `CTX_SIZE`
|
| 70 |
+
Default: `4096`
|
| 71 |
+
- `THREADS`
|
| 72 |
+
Default: `2`
|
| 73 |
+
- `PARALLEL`
|
| 74 |
+
Default: `1`
|
| 75 |
+
- `REASONING_MODE`
|
| 76 |
+
Default: `off`
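
As a rough illustration of how these variables could feed a `llama-server` command line, here is a minimal sketch. The Space's actual entrypoint is not shown in this README, so the helper names (`read_knobs`, `build_server_args`), the bind address, and the mapping of `REASONING_MODE=off` to `--reasoning-budget 0` are assumptions. The flags themselves (`-hf`, `--ctx-size`, `--threads`, `--parallel`, `--host`, `--port`, `--reasoning-budget`) are options of recent llama.cpp server builds.

```python
import os

# Defaults as documented in the list above.
DEFAULTS = {
    "MODEL_SPEC": "unsloth/gemma-4-E2B-it-GGUF:Q4_0",
    "CTX_SIZE": "4096",
    "THREADS": "2",
    "PARALLEL": "1",
    "REASONING_MODE": "off",
}


def read_knobs(env=None):
    """Read each knob from the environment, falling back to its default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


def build_server_args(knobs, port=7860):
    """Build a hypothetical llama-server argument list from the knobs."""
    args = [
        "llama-server",
        "-hf", knobs["MODEL_SPEC"],      # Hugging Face repo with optional :quant tag
        "--ctx-size", knobs["CTX_SIZE"],
        "--threads", knobs["THREADS"],
        "--parallel", knobs["PARALLEL"],
        "--host", "0.0.0.0",             # assumed: listen on all interfaces in the container
        "--port", str(port),             # matches app_port in the front matter
    ]
    if knobs["REASONING_MODE"] == "off":
        # Assumed mapping: a reasoning budget of 0 disables thinking.
        args += ["--reasoning-budget", "0"]
    return args
```

For example, `build_server_args(read_knobs({"CTX_SIZE": "2048"}))` yields the default command with only the context size changed.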
|
| 77 |
+
|
| 78 |
+
## Notes
|
| 79 |
+
|
| 80 |
+
- If you want slightly better output quality and can accept lower speed, set `MODEL_SPEC` to `unsloth/gemma-4-E2B-it-GGUF:Q4_K_M`.
|
| 81 |
+
- If latency is still high for your prompts, reduce `CTX_SIZE` to `2048`.
|
| 82 |
+
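- Space disk is ephemeral. The `preload_from_hub` setting reduces cold-start time by baking the default model file (`gemma-4-E2B-it-Q4_0.gguf`) into the build cache. Only that file is listed there, so other quantizations selected through `MODEL_SPEC` are not preloaded.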
|