Spaces: AlexandreScriptsMT/gemma-4-cpu-basic-api

AlexandreScriptsMT committed on May 4

Commit 12010f8 (verified) · 1 Parent(s): 43aaf2b

Add README for Gemma 4 CPU Basic API Space

Files changed (1)

README.md +77 -5

@@ -1,10 +1,82 @@
 ---
-title: Gemma 4 Cpu Basic Api
-emoji: 🏃
-colorFrom: red
 colorTo: gray
 sdk: docker
-pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Gemma 4 CPU Basic API
+colorFrom: blue
 colorTo: gray
 sdk: docker
+app_port: 7860
+suggested_hardware: cpu-basic
+startup_duration_timeout: 1h
+short_description: Gemma 4 E2B API for HF CPU Basic
+models:
+  - google/gemma-4-E2B-it
+  - unsloth/gemma-4-E2B-it-GGUF
+tags:
+  - gemma4
+  - llama.cpp
+  - api
+  - openai-compatible
+preload_from_hub:
+  - unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q4_0.gguf
 ---
+# Gemma 4 on CPU Basic
+This Space is tuned for Hugging Face `CPU Basic`, which currently provides `2 vCPU`, `16 GB RAM`, and `50 GB` of ephemeral disk by default.
+## Why this setup
+Running the original, unquantized Gemma 4 weights on CPU Basic is too slow to serve an API with usable latency. This Space instead uses:
+- `Gemma 4 E2B`
+- `GGUF Q4_0` quantization
+- `llama.cpp` server
+- `reasoning` disabled by default
+- `parallel=1` to avoid contention on 2 CPU cores
+This is intended as the smallest practical Gemma 4 setup for CPU-only serving that still generates at an acceptable speed.
+## API
+The server exposes an OpenAI-compatible API:
+- `POST /v1/chat/completions`
+- `POST /v1/completions`
+- `GET /`
+Example:
+```bash
+curl -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemma-4-e2b-q4",
+    "messages": [
+      {"role": "system", "content": "You are a concise assistant."},
+      {"role": "user", "content": "Explain in one sentence what quantization is."}
+    ],
+    "max_tokens": 128,
+    "temperature": 0.2,
+    "stream": false
+  }'
+```
+## Runtime knobs
+You can change these in the Space settings as runtime variables:
+- `MODEL_SPEC`
+  Default: `unsloth/gemma-4-E2B-it-GGUF:Q4_0`
+- `CTX_SIZE`
+  Default: `4096`
+- `THREADS`
+  Default: `2`
+- `PARALLEL`
+  Default: `1`
+- `REASONING_MODE`
+  Default: `off`
+## Notes
+- If you want a bit more quality and can accept lower speed, switch `MODEL_SPEC` to `unsloth/gemma-4-E2B-it-GGUF:Q4_K_M`.
+- If latency is still high for your prompts, reduce `CTX_SIZE` to `2048`.
+- Space disk is ephemeral. The `preload_from_hub` setting reduces cold-start time by baking the default model file into the build cache.
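
The runtime variables listed in the README could map onto `llama-server` flags roughly as sketched below. This is an illustration only: the Space's actual start script is not part of this commit, so the function name `build_server_args`, the `PORT` variable, and the choice of `--reasoning-budget 0` to implement `REASONING_MODE=off` are all assumptions. The sketch prints the resolved command rather than launching the server.

```bash
#!/usr/bin/env bash
# Hypothetical entrypoint sketch: maps the README's runtime variables onto
# llama-server flags. The Space's real start script is not shown in this commit.
set -euo pipefail

build_server_args() {
  # Defaults mirror the values listed under "Runtime knobs".
  local model_spec="${MODEL_SPEC:-unsloth/gemma-4-E2B-it-GGUF:Q4_0}"
  local ctx_size="${CTX_SIZE:-4096}"
  local threads="${THREADS:-2}"
  local parallel="${PARALLEL:-1}"
  local reasoning_mode="${REASONING_MODE:-off}"
  local port="${PORT:-7860}"  # assumed variable; 7860 matches app_port in the front matter

  local args=(
    -hf "$model_spec"        # <user>/<model>[:quant], same shape as MODEL_SPEC
    --ctx-size "$ctx_size"
    --threads "$threads"
    --parallel "$parallel"
    --host 0.0.0.0
    --port "$port"
  )

  # Assumption: "off" is implemented by giving the model a zero thinking budget.
  if [ "$reasoning_mode" = "off" ]; then
    args+=(--reasoning-budget 0)
  fi

  # Emit the arguments as one space-separated line.
  printf '%s\n' "${args[*]}"
}

# Print the resolved command instead of running it, so the sketch is safe anywhere.
echo "llama-server $(build_server_args)"
```

Running it with `CTX_SIZE=2048` set in the environment shows how the context-size note in the README would change only the `--ctx-size` argument while the other defaults stay in place.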