DuoNeural committed
Commit 02ade1c · verified · Parent: b37c878

Update README: correct GGUF sizes + vocab size note

Files changed (1):
  1. README.md +5 -3
README.md CHANGED
@@ -54,10 +54,12 @@ LoRA adapter merged into BF16 weights via `merge_and_unload()`. Exported as shar
  | File | Size | Description |
  |------|------|-------------|
  | `model-0000X-of-00004.safetensors` | ~15GB | Merged BF16 weights (full precision) |
- | `ghostshell-4b-Q4_K_M.gguf` | ~2.5GB | Q4_K_M quantization — recommended for most use |
- | `ghostshell-4b-Q8_0.gguf` | ~4.5GB | Q8_0 quantization — near-lossless, for power users |
+ | `ghostshell-4b-Q4_K_M.gguf` | ~5.0GB | Q4_K_M quantization — recommended for most use |
+ | `ghostshell-4b-Q8_0.gguf` | ~7.5GB | Q8_0 quantization — near-lossless, for power users |

- **Recommended**: `ghostshell-4b-Q4_K_M.gguf` for llama.cpp, Ollama, LM Studio, or any GGUF-compatible runtime. Runs on 6GB VRAM, handles well on CPU with 8GB RAM.
+ **Recommended**: `ghostshell-4b-Q4_K_M.gguf` for llama.cpp, Ollama, LM Studio, or any GGUF-compatible runtime.
+
+ > **Note on file sizes**: These GGUFs are larger than a typical 4B model because Gemma 3 uses a 262,144-token vocabulary. The embedding/output weight tensors (which stay in higher precision) account for ~2–3GB of the total. The transformer layers themselves are fully quantized. Expect ~6–8GB VRAM for Q4_K_M, ~10–12GB for Q8_0.
 
  ---
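
The hunk header above references the export pipeline: the LoRA adapter was merged into the BF16 base weights with PEFT's `merge_and_unload()` and written out as sharded safetensors. A minimal sketch of that step, assuming a Gemma 3 4B base checkpoint and a hypothetical adapter path (neither is pinned down by this diff):

```python
# Sketch of the merge-and-export step the diff header describes.
# The base model ID and adapter path are assumptions, not artifacts
# recorded in this commit.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-it",            # assumed base checkpoint
    torch_dtype=torch.bfloat16,        # keep weights in BF16 for the merge
)
model = PeftModel.from_pretrained(base, "path/to/ghostshell-lora")  # hypothetical adapter
merged = model.merge_and_unload()      # fold LoRA deltas into the base weights

# Writes model-0000X-of-0000N.safetensors shards like those listed in the table.
merged.save_pretrained(
    "ghostshell-4b-merged",
    safe_serialization=True,
    max_shard_size="5GB",
)
```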
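
For the recommended Q4_K_M file, any of the listed runtimes works; as a quick smoke test from Python, `llama-cpp-python` (one GGUF-compatible runtime) can load it directly. The local path and generation parameters below are illustrative:

```python
# Smoke-test the Q4_K_M GGUF with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="ghostshell-4b-Q4_K_M.gguf",  # assumed local download
    n_ctx=4096,                              # context window, adjust to taste
    n_gpu_layers=-1,                         # offload all layers if a GPU is present
)

out = llm.create_completion("Q: What does Q4_K_M mean?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```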
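
The new size note can be sanity-checked with back-of-envelope arithmetic. The vocabulary size comes from the note itself; the 2560 embedding width is an assumption about the model config, and counting two matrices assumes the input embedding and output head are stored separately, as the note implies:

```python
# Rough check of the "~2-3GB for embedding/output weights" figure.
vocab_size = 262_144     # from the note above
hidden_size = 2_560      # assumed embedding width for the 4B model

params_per_matrix = vocab_size * hidden_size   # ~671M weights
gib_at_f16 = params_per_matrix * 2 / 2**30     # 2 bytes per weight

print(f"one matrix at F16:       {gib_at_f16:.2f} GiB")       # ~1.25 GiB
print(f"embedding + output head: {2 * gib_at_f16:.2f} GiB")   # ~2.5 GiB
```

Roughly 2.5 GiB for the pair sits inside the ~2–3GB range the note quotes, which also explains why the Q4_K_M file is about double the size that naive 4-bit math over 4B parameters would predict.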