AlexandreScriptsMT committed on
Commit 12010f8 · verified · 1 Parent(s): 43aaf2b

Add README for Gemma 4 CPU Basic API Space

Files changed (1)
  1. README.md +77 -5
README.md CHANGED
@@ -1,10 +1,82 @@
1
  ---
2
- title: Gemma 4 Cpu Basic Api
3
- emoji: 🏃
4
- colorFrom: red
5
  colorTo: gray
6
  sdk: docker
7
- pinned: false
8
  ---
9
 
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
+ title: Gemma 4 CPU Basic API
3
+ colorFrom: blue
 
4
  colorTo: gray
5
  sdk: docker
6
+ app_port: 7860
7
+ suggested_hardware: cpu-basic
8
+ startup_duration_timeout: 1h
9
+ short_description: Gemma 4 E2B API for HF CPU Basic
10
+ models:
11
+ - google/gemma-4-E2B-it
12
+ - unsloth/gemma-4-E2B-it-GGUF
13
+ tags:
14
+ - gemma4
15
+ - llama.cpp
16
+ - api
17
+ - openai-compatible
18
+ preload_from_hub:
19
+ - unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q4_0.gguf
20
  ---
21
 
22
+ # Gemma 4 on CPU Basic
23
+
24
+ This Space is tuned for Hugging Face `CPU Basic`, which currently provides `2 vCPU`, `16 GB RAM`, and `50 GB` of ephemeral disk by default.
25
+
26
+ ## Why this setup
27
+
28
+ Running the original Gemma 4 weights on CPU Basic is too slow to serve an API with usable latency. Instead, this Space uses:
29
+
30
+ - `Gemma 4 E2B`
31
+ - `GGUF Q4_0` quantization
32
+ - `llama.cpp` server
33
+ - `reasoning` disabled by default
34
+ - `parallel=1` to avoid contention on 2 CPU cores
35
+
36
+ Together, these choices aim for the smallest practical Gemma 4 setup that still generates at an acceptable speed on CPU only.
37
+
38
+ ## API
39
+
40
+ The server exposes an OpenAI-compatible API:
41
+
42
+ - `POST /v1/chat/completions`
43
+ - `POST /v1/completions`
44
+ - `GET /`
45
+
46
+ Example:
47
+
48
+ ```bash
49
+ curl -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \
50
+   -H "Content-Type: application/json" \
51
+   -d '{
52
+     "model": "gemma-4-e2b-q4",
53
+     "messages": [
54
+       {"role": "system", "content": "You are a concise assistant."},
55
+       {"role": "user", "content": "Explain in one sentence what quantization is."}
56
+     ],
57
+     "max_tokens": 128,
58
+     "temperature": 0.2,
59
+     "stream": false
60
+   }'
61
+ ```
62
+
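The curl call above can be wrapped in a small reusable helper. The following is a minimal sketch, not part of this commit: the function names `json_escape`, `build_payload`, and `ask` are hypothetical, and the base URL and model id are simply copied from the README example. It assumes single-line prompts, since only backslashes and double quotes are escaped.

```bash
# Endpoint and model id as used in the README example; override via environment.
BASE_URL="${BASE_URL:-https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space}"
MODEL="${MODEL:-gemma-4-e2b-q4}"

# Escape backslashes and double quotes so a prompt can sit inside a JSON string.
# Assumption: the prompt is a single line without control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print a chat-completions request body for one user message.
# $1 = prompt, $2 = max_tokens (defaults to 128).
build_payload() {
  bp_prompt="$(json_escape "$1")"
  bp_max_tokens="${2:-128}"
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":%s,"temperature":0.2,"stream":false}' \
    "$MODEL" "$bp_prompt" "$bp_max_tokens"
}

# Send the request and print the raw JSON response from the server.
ask() {
  build_payload "$1" "${2:-128}" |
    curl -sS -X POST "$BASE_URL/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data-binary @-
}
```

After sourcing the snippet, `ask "Explain in one sentence what quantization is." 64` would send one request. Building the body separately from sending it makes it easy to inspect the JSON before it goes over the network.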
63
+ ## Runtime knobs
64
+
65
+ You can change these in the Space settings as runtime variables:
66
+
67
+ - `MODEL_SPEC`
68
+   Default: `unsloth/gemma-4-E2B-it-GGUF:Q4_0`
69
+ - `CTX_SIZE`
70
+   Default: `4096`
71
+ - `THREADS`
72
+   Default: `2`
73
+ - `PARALLEL`
74
+   Default: `1`
75
+ - `REASONING_MODE`
76
+   Default: `off`
77
+
78
+ ## Notes
79
+
80
+ - If you want a bit more quality and can accept lower speed, switch `MODEL_SPEC` to `unsloth/gemma-4-E2B-it-GGUF:Q4_K_M`.
81
+ - If latency is still high for your prompts, reduce `CTX_SIZE` to `2048`.
82
+ - Space disk is ephemeral. The `preload_from_hub` setting reduces cold-start time by baking the default model file into the build cache.
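As an illustration of how the runtime knobs listed in the README might translate into a `llama-server` invocation, here is a hedged sketch. The Space's actual entrypoint is not shown in this commit, so the mapping is an assumption: in particular, treating `REASONING_MODE=off` as `--reasoning-budget 0` and reading the port from a `PORT` variable are guesses, and `server_command` is a hypothetical helper name. The flags themselves (`-hf`, `--ctx-size`, `--threads`, `--parallel`, `--host`, `--port`, `--reasoning-budget`) are standard `llama-server` options.

```bash
# Defaults mirror the values documented in the README; override via environment.
MODEL_SPEC="${MODEL_SPEC:-unsloth/gemma-4-E2B-it-GGUF:Q4_0}"
CTX_SIZE="${CTX_SIZE:-4096}"
THREADS="${THREADS:-2}"
PARALLEL="${PARALLEL:-1}"
REASONING_MODE="${REASONING_MODE:-off}"
PORT="${PORT:-7860}"   # assumption: matches app_port in the Space metadata

# Print the llama-server command line implied by the current knob values.
server_command() {
  cmd="llama-server -hf $MODEL_SPEC --ctx-size $CTX_SIZE --threads $THREADS --parallel $PARALLEL --host 0.0.0.0 --port $PORT"
  # Assumed mapping: "off" disables thinking by setting a zero reasoning budget.
  if [ "$REASONING_MODE" = "off" ]; then
    cmd="$cmd --reasoning-budget 0"
  fi
  printf '%s\n' "$cmd"
}

# Show the command instead of running it, so the mapping can be checked first.
server_command
```

With this layout, the two suggestions from the notes become one-variable changes: `MODEL_SPEC=unsloth/gemma-4-E2B-it-GGUF:Q4_K_M` for a bit more quality, or `CTX_SIZE=2048` to reduce latency.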