---
title: Gemma 4 CPU Basic API
colorFrom: blue
colorTo: gray
sdk: docker
app_port: 7860
suggested_hardware: cpu-basic
startup_duration_timeout: 1h
short_description: Gemma 4 E2B API for HF CPU Basic
models:
  - google/gemma-4-E2B-it
  - unsloth/gemma-4-E2B-it-GGUF
tags:
  - gemma4
  - llama.cpp
  - api
  - openai-compatible
preload_from_hub:
  - unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-Q4_0.gguf
---

# Gemma 4 on CPU Basic

This Space is tuned for Hugging Face `CPU Basic`, which currently provides `2 vCPU`, `16 GB RAM`, and `50 GB` of ephemeral disk by default.

## Why this setup

The original, unquantized Gemma 4 weights run too slowly on CPU Basic to serve an API with usable latency. This Space instead uses:

- `Gemma 4 E2B`
- `GGUF Q4_0` quantization
- `KV cache q4_0 / q4_0`
- `llama.cpp` server
- `reasoning` disabled by default
- `parallel=1` to avoid contention on 2 CPU cores
- `ctx_size=131072` by default

This is the smallest practical Gemma 4 setup for CPU-only serving with acceptable generation speed.

## API

The server exposes an OpenAI-compatible API:

- `POST /v1/chat/completions`
- `POST /v1/completions`
- `GET /`

Example:

```bash
curl -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-q4",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "Explain in one sentence what quantization is."}
    ],
    "max_tokens": 128,
    "temperature": 0.2,
    "stream": false
  }'
```
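If you set `"stream": true` instead, OpenAI-compatible servers normally reply with server-sent events: one `data: {...}` line per chunk, terminated by `data: [DONE]`. The sketch below assumes this Space follows that convention. The helper name `sse_chunks` is hypothetical and not part of the Space. It only unwraps the event framing and prints the raw JSON chunks, leaving JSON parsing to whatever tool you prefer.

```bash
# Hypothetical helper: read an SSE stream on stdin and print each JSON chunk
# on its own line. Stops at the "[DONE]" sentinel.
sse_chunks() {
  cr=$(printf '\r')
  while IFS= read -r line; do
    line=${line%"$cr"}            # tolerate CRLF line endings
    case $line in
      data:*) ;;                  # keep event data lines
      *) continue ;;              # skip blank lines and ": comment" keep-alives
    esac
    line=${line#data:}            # drop the field name
    line=${line# }                # drop the single optional space after it
    [ "$line" = "[DONE]" ] && break
    printf '%s\n' "$line"
  done
}

# Usage (needs network access to the Space; -N disables curl's output buffering):
# curl -sN -X POST "https://alexandrescriptsmt-gemma-4-cpu-basic-api.hf.space/v1/chat/completions" \
#   -H "Content-Type: application/json" \
#   -d '{"model": "gemma-4-e2b-q4", "messages": [{"role": "user", "content": "Hi"}], "stream": true}' \
#   | sse_chunks
```

Each printed line is one streamed chunk, so tokens can be displayed as they arrive instead of waiting for the full completion.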

## Runtime knobs

You can change these in the Space settings as runtime variables:

- `MODEL_SPEC`
  Default: `unsloth/gemma-4-E2B-it-GGUF:Q4_0`
- `CTX_SIZE`
  Default: `131072`
- `THREADS`
  Default: `2`
- `PARALLEL`
  Default: `1`
- `CACHE_TYPE_K`
  Default: `q4_0`
- `CACHE_TYPE_V`
  Default: `q4_0`
- `REASONING_MODE`
  Default: `off`
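
As a rough illustration of how these variables could map onto `llama-server` flags, here is a minimal sketch. The Space's real startup script is not shown in this README, so the function name `server_cmd` and the exact mapping are assumptions. `REASONING_MODE` is left out because its translation to a server flag is not documented here.

```bash
# Hypothetical sketch: print the llama-server command line implied by the
# runtime variables, falling back to the documented defaults.
server_cmd() {
  echo "llama-server" \
    "-hf ${MODEL_SPEC:-unsloth/gemma-4-E2B-it-GGUF:Q4_0}" \
    "--ctx-size ${CTX_SIZE:-131072}" \
    "--threads ${THREADS:-2}" \
    "--parallel ${PARALLEL:-1}" \
    "--cache-type-k ${CACHE_TYPE_K:-q4_0}" \
    "--cache-type-v ${CACHE_TYPE_V:-q4_0}" \
    "--host 0.0.0.0 --port 7860"   # port matches app_port in the Space metadata
}

# Usage: inspect the command that a given override would produce.
# CTX_SIZE=32768 server_cmd
```

Printing the command rather than running it makes it easy to check the effect of an override before restarting the Space.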

## Notes

- If you want a bit more quality and can accept lower speed, switch `MODEL_SPEC` to `unsloth/gemma-4-E2B-it-GGUF:Q4_K_M`.
- If first-token latency is too high on very long prompts, reduce `CTX_SIZE` to `65536` or `32768`.
- Space disk is ephemeral. The `preload_from_hub` setting reduces cold-start time by baking the default model file into the build cache.