---
license: mit
---

# RTX 5000 Series–Ready `llama-cpp-python` Wheel (Python 3.12, Windows)

**Status:** ✅ CONFIRMED WORKING — no more “invalid resource handle” errors
**Wheel:** `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`
**License:** MIT (same as upstream `llama-cpp-python`)

**Platform:** Windows 10/11 x64
**Python:** 3.12
**CUDA:** 12.8 (optimized for Blackwell)
---

## 🚀 Performance (Verified on RTX 5090)

- ~64 tokens/sec on *Mistral Small 24B* (5-bit quant)
- Full GPU offload (`n_gpu_layers = -1`) working as expected
- ~1.83× faster than an RTX 3090 in the same setup (35 tok/s → 64 tok/s)
- 32 GB VRAM fully utilized (no kernel crashes)

> Note: numbers vary with quant, context, and params; these are representative.
---

## 🔧 Why This Works

The wheel forces **cuBLAS** instead of ggml’s custom CUDA kernels.
On the RTX 5090 (Blackwell, `sm_120`), ggml’s custom kernels can trigger
“CUDA error: invalid resource handle”.

cuBLAS is stable on the 5090 and avoids those kernel issues.

**Key CMake flags used:**

```
-DGGML_CUDA=ON
-DGGML_CUDA_FORCE_CUBLAS=1            # Use cuBLAS instead of custom kernels
-DGGML_CUDA_NO_PINNED=1               # Avoid pinned memory issues with GDDR7
-DGGML_CUDA_F16=0                     # Disable problematic FP16 code paths
-DCMAKE_CUDA_ARCHITECTURES=all-major  # Ensure sm_120 is included
```
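Once the wheel is installed (see **Installation** below), one quick way to confirm that the CUDA/cuBLAS build is the one actually being loaded is the `llama_supports_gpu_offload()` helper that recent `llama-cpp-python` releases expose. This is a minimal sketch, not an official diagnostic:

```
:: Should print "True" when the library was built with a working CUDA backend
python -c "import llama_cpp; print(llama_cpp.llama_supports_gpu_offload())"
```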
---

## 📋 Requirements

- NVIDIA RTX 5090 (or other Blackwell GPU)
- NVIDIA drivers 570.86.10+
- CUDA Toolkit 12.8
- Python 3.12
- Windows 10/11 x64
- Microsoft Visual C++ Redistributable 2015–2022
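One way to sanity-check these prerequisites from a command prompt (this assumes `nvidia-smi`, `nvcc`, and `python` are on your `PATH`; adjust paths if your CUDA install lives elsewhere):

```
:: GPU model and driver version (expect an RTX 50-series GPU, driver 570.86.10+)
nvidia-smi --query-gpu=name,driver_version --format=csv

:: CUDA toolkit version (expect release 12.8; only needed if you build the wheel yourself)
nvcc --version

:: Python version (expect 3.12.x)
python --version
```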
---

## 🛠️ Installation

1) Download the wheel:
   `llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl`

2) Install:

   ```
   pip install llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl
   ```
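After installation, a minimal import check (assuming `python` points at the Python 3.12 environment you installed into):

```
:: Should print 0.3.16 with no import errors
python -c "import llama_cpp; print(llama_cpp.__version__)"
```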
---

## ✅ Quick Verification

```python
from llama_cpp import Llama

# Full GPU offload on 5090
llm = Llama(
    model_path="your_model.gguf",
    n_gpu_layers=-1,  # full GPU
    n_ctx=2048,
    verbose=True,
)

out = llm("Hello, how are you?", max_tokens=20)
print(out["choices"][0]["text"])
```

**What to look for in stdout:**

- CUDA device assignment lines (e.g., `using CUDA:0`)
- VRAM allocations *without* any “invalid resource handle” errors
---

## 🏗️ Build It Yourself (Advanced)

**Prereqs:** CUDA 12.8, Visual Studio Build Tools 2022 (with C++), Python 3.12

```
mkdir C:\wheels
cd C:\wheels

set FORCE_CMAKE=1
set CMAKE_BUILD_PARALLEL_LEVEL=15
set CMAKE_ARGS=-DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=1 -DGGML_CUDA_NO_PINNED=1 -DGGML_CUDA_F16=0 -DCMAKE_CUDA_ARCHITECTURES=all-major

pip wheel llama-cpp-python --no-cache-dir --wheel-dir C:\wheels --verbose
```

**Build time:** ~10 minutes on a modern CPU
**Wheel size:** ~231 MB (larger due to cuBLAS inclusion)
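When the build finishes, install the wheel from the output directory. The file name below assumes pip resolved the same 0.3.16 release as the prebuilt wheel; check `C:\wheels` for the exact name it produced:

```
:: File name may differ depending on the llama-cpp-python version pip resolved
pip install C:\wheels\llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl
```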
---

## 🐛 Troubleshooting

**“Invalid resource handle” errors**
- This wheel specifically fixes these. If you still see them, verify that:
  - CUDA 12.8 is installed
  - the latest NVIDIA drivers are installed
  - no other CUDA apps are interfering

**CPU fallback**
- If the GPU isn’t detected, check `nvidia-smi` and ensure `CUDA_VISIBLE_DEVICES` isn’t set (see the snippet below).
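A minimal check, run from the same command prompt you launch Python from:

```
:: Confirm the driver can see the GPU
nvidia-smi

:: If this prints anything other than the literal %CUDA_VISIBLE_DEVICES%,
:: the variable is set; clear it for the current session
echo %CUDA_VISIBLE_DEVICES%
set CUDA_VISIBLE_DEVICES=
```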
---

## 🙏 Credits

Built using the open-source `llama-cpp-python` project by **abetlen** and the `llama.cpp` project by **ggml-org**.
This wheel provides RTX 5090 compatibility by configuring a cuBLAS fallback; it is not an official upstream release.

- For issues with this specific wheel: *open an issue here (this repo/thread).*
- For general `llama-cpp-python` issues: use the official repository.

---

Finally — RTX 5000 series owners can use their flagship GPUs for local LLM inference without crashes! 🎉