Lemrd — Gemma 4 31B Dense (GGUF)

Lemrd is the largest dense member of the Lemma model family by Lethean. It is an EUPL-1.2 fork of Gemma 4 31B with the Lethean Ethical Kernel (LEK) merged into the weights: consent-based reasoning is baked into the attention projections via a LoRA finetune and then merged, so inference uses a single standalone model with no PEFT runtime required.

This repo ships the GGUF multi-quant build for Ollama, llama.cpp, LM Studio, and other GGUF-compatible runners. The unmodified Gemma 4 31B fork lives at LetheanNetwork/lemrd for users who want the raw Google weights without the LEK shift.

Looking for MLX? The native Apple Silicon builds live in sibling repos: lthn/lemrd-mlx (4-bit default) | lthn/lemrd-mlx-8bit | lthn/lemrd-mlx-bf16 (full precision)

A lemma is "something assumed" — an intermediate theorem on the path to a larger proof, or a heading that signals the subject of what follows. The Lemma model family is named for that role: each variant is a stepping stone between raw capability and ethical application.

GGUF Variants

| File | Quant | Size | Use Case |
|---|---|---|---|
| `lemrd-q4_k_m.gguf` | Q4_K_M | 17 GB | Recommended: best size/quality balance |
| `lemrd-q5_k_m.gguf` | Q5_K_M | 20 GB | Higher quality, moderate size |
| `lemrd-q6_k.gguf` | Q6_K | 23 GB | Near-lossless |
| `lemrd-q8_0.gguf` | Q8_0 | 30 GB | Maximum quality quantised |
| `lemrd-bf16.gguf` | BF16 | 57 GB | Full precision reference |

All variants were verified locally on Apple Silicon via Ollama, llama-cpp-python, mlx-lm, and mlx-vlm.
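
The sizes in the table follow roughly from parameter count times bits per weight. The sketch below reproduces them under three assumptions: the 30.7B parameter count from Model Details, sizes reported in binary gigabytes (GiB), and approximate bits-per-weight figures for each llama.cpp quant type. Those figures are illustrative assumptions, not values published in this repo.

```python
# Rough GGUF size estimate: parameters x bits-per-weight / 8 bytes.
# The bits-per-weight values are approximate assumptions for llama.cpp
# quant types, not numbers taken from this repo.
TOTAL_PARAMS = 30.7e9  # from the Model Details table

APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q6_K": 6.5625,
    "Q8_0": 8.5,
    "BF16": 16.0,
}


def estimate_size_gib(quant: str, params: float = TOTAL_PARAMS) -> float:
    """Return the estimated file size in GiB for a quant type."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params * bits / 8 / 2**30


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_size_gib(quant):.1f} GiB")
```

Runtime memory is higher than the file size, because the KV cache and activations are allocated on top of the weights, so treat these figures as a lower bound when choosing a variant.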

Repo Files

| File | Format | Purpose |
|---|---|---|
| `lemrd-*.gguf` | GGUF | Ollama, llama.cpp, GPT4All, LM Studio |
| `model-*-of-00006.safetensors` | MLX safetensors (sharded) | Native Apple Silicon via `mlx-lm` and `mlx-vlm` (Q4 multimodal) |
| `model.safetensors.index.json` | JSON | Tensor index for the sharded safetensors weights |
| `config.json` | JSON | Multimodal model config (architecture, quantisation, vision tower) |
| `tokenizer.json` | JSON | Tokenizer vocabulary (262K tokens) |
| `tokenizer_config.json` | JSON | Tokenizer settings and special tokens |
| `chat_template.jinja` | Jinja2 | Chat template for transformers, mlx-lm, mlx-vlm |
| `processor_config.json` | JSON | Image processor config (mlx-vlm) |
| `generation_config.json` | JSON | Default generation parameters (temperature, top_p, top_k) |
| `LICENSE` | Text | EUPL-1.2 licence text |
| `README.md` | Markdown | This file (model card) |

Quick Start

Apps & CLI

Ollama

ollama run hf.co/lthn/lemrd:Q4_K_M

Docker

docker model run hf.co/lthn/lemrd

Or from Docker Hub:

docker model run lthn/lemrd

Unsloth Studio

# macOS / Linux / WSL
curl -fsSL https://unsloth.ai/install.sh | sh

# Windows
irm https://unsloth.ai/install.ps1 | iex

unsloth studio -H 0.0.0.0 -p 8888
# Open http://localhost:8888 — search for lthn/lemrd

Or use Unsloth Studio on Hugging Face Spaces: no install required, just search for lthn/lemrd.

llama.cpp

Install via brew (macOS/Linux), winget (Windows), or build from source:

brew install llama.cpp        # macOS/Linux
winget install llama.cpp      # Windows

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lthn/lemrd:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf lthn/lemrd:Q4_K_M

Or build from source:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

./build/bin/llama-server -hf lthn/lemrd:Q4_K_M
./build/bin/llama-cli -hf lthn/lemrd:Q4_K_M

MLX (Apple Silicon native)

uv tool install mlx-lm
mlx_lm.chat --model lthn/lemrd
mlx_lm.generate --model lthn/lemrd --prompt "Hello, how are you?"

Python Libraries

llama-cpp-python

uv pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lthn/lemrd",
    filename="lemrd-q4_k_m.gguf",
)

# Text
llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)

# Vision (multimodal)
llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                # Image first, then text, as recommended under "Variable Image Resolution".
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                },
                {"type": "text", "text": "Describe this image in one sentence."}
            ]
        }
    ]
)

mlx-vlm (vision)

# Install into the active environment so the Python import below works:
uv pip install mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("lthn/lemrd")
config = load_config("lthn/lemrd")

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

output = generate(model, processor, formatted_prompt, image)
print(output.text)

Servers (OpenAI-compatible API)

MLX Server

lemrd is multimodal (text + image), so use mlx_vlm.server — the vision-aware variant. The text-only mlx_lm.server does not correctly route multimodal tensors for Gemma 4.

mlx_vlm.server --model lthn/lemrd

curl -X POST "http://localhost:8080/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "lthn/lemrd",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 200
    }'

Works with any OpenAI-compatible client at http://localhost:8080/v1.
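
As a minimal sketch of such a client, the Python below builds the same request as the curl example using only the standard library. The helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the server returns the usual OpenAI chat-completion shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # server address from the curl example


def build_chat_request(prompt: str, model: str = "lthn/lemrd",
                       max_tokens: int = 200,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body with a `choices` list.
    """
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello, how are you?"))` prints the model's reply.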

vLLM

vLLM requires the original (non-quantised) safetensors weights from LetheanNetwork/lemrd; it does not load GGUF or MLX-quantised safetensors. It needs Linux and an NVIDIA GPU with enough VRAM for a 31B dense model.

uv pip install vllm
vllm serve "LetheanNetwork/lemrd"

Model Details

| Property | Value |
|---|---|
| Architecture | Gemma 4 31B Dense |
| Total Parameters | 30.7B |
| Layers | 42 |
| Context Length | 256K tokens |
| Vocabulary | 262K tokens |
| Modalities | Text, Image |
| Sliding Window | 1024 tokens |
| Vision Encoder | ~550M params |
| Base Model | LetheanNetwork/lemrd |
| Licence | EUPL-1.2 |

The Lemma Family

| Name | Source (BF16 weights) | Params | Context | Modalities | Consumer Repo |
|---|---|---|---|---|---|
| Lemer | LetheanNetwork/lemer | 2.3B eff | 128K | Text, Image, Audio | lthn/lemer |
| Lemma | LetheanNetwork/lemma | 4.5B eff | 128K | Text, Image, Audio | lthn/lemma |
| Lemmy | LetheanNetwork/lemmy | 3.8B active | 256K | Text, Image | lthn/lemmy |
| Lemrd | LetheanNetwork/lemrd | 30.7B | 256K | Text, Image | You are here |

Capabilities

- Configurable thinking mode (`<|think|>` token in system prompt enables it; off by default in our examples via `enable_thinking=False`)
- Native function calling and system prompt support
- Variable aspect ratio image understanding
- Multilingual support (140+ languages)
- Hybrid attention (sliding window + global)
- Long context (256K tokens) for document-scale reasoning
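
The thinking-mode switch described in the first capability can be sketched as a small message builder. The helper name `build_messages` is hypothetical, and placing the token at the very start of the system prompt (followed by a space) is an assumption; the source only states that the token goes in the system prompt.

```python
THINK_TOKEN = "<|think|>"  # token named in the capability list


def build_messages(user_text: str, system_prompt: str = "",
                   thinking: bool = False) -> list[dict]:
    """Return chat messages, optionally enabling thinking mode.

    Assumption: the token is prepended to the system prompt.
    """
    if thinking:
        system_prompt = f"{THINK_TOKEN} {system_prompt}".strip()
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return messages
```

The returned list can be passed as `messages` to any of the chat-completion calls in Quick Start.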

Roadmap

This release of lemrd is Gemma 4 31B Dense with the Lethean Ethical Kernel (LEK) merged in — axiom-based reasoning baked into the attention weights via LoRA finetune, then merged into the base so inference uses a single standalone model with no PEFT runtime required. The unmodified Gemma 4 31B fork lives at LetheanNetwork/lemrd for users who want the raw Google weights without the LEK shift.

| Phase | Status | What it adds |
|---|---|---|
| Base fork (LetheanNetwork/lemrd) | ✅ Released | EUPL-1.2 fork of Gemma 4 31B with unmodified Google weights |
| LEK merged (this repo) | ✅ Released | Lethean Ethical Kernel: axiom-based reasoning via LoRA merge |
| 8-PAC eval results | 🚧 In progress | Continuous benchmarking on the homelab, published to lthn/LEM-benchmarks |

The LEK axioms are public domain and published at Snider/ai-ethics. Track research progress at LetheanNetwork and the LEM-research dataset.

Why EUPL-1.2

Lemrd is licensed under the European Union Public Licence v1.2 — not Apache 2.0 or MIT. This is a deliberate choice:

- 23 official languages, one legal meaning. EUPL is the only OSS licence designed by lawmakers across multiple legal systems. "Derivative work" means the same thing in German, French, Estonian, and Maltese law.
- Copyleft with compatibility. Modifications must be shared back, but the licence plays cleanly with GPL, LGPL, MPL, and other major OSS licences. No accidental relicensing.
- No proprietary capture. Anyone can use lemrd commercially, but they cannot fork it, train a competitor model on it, and close-source the result. The ethical layer stays in the open.
- Built for institutions. Government, research, and enterprise users get a licence designed for cross-border compliance, not a US-centric one.

Recommended Sampling

Use Google's standardised settings across all use cases:

| Parameter | Value |
|---|---|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
| `stop` | `<turn |

Gemma 4 is calibrated for temperature: 1.0 — this is not the same as the typical 0.7 default for other models. Lower values reduce diversity without improving quality. These defaults are pre-configured in the params file (Ollama) and generation_config.json (transformers/MLX).
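
For runners that do not read these defaults automatically, the values can be passed per call. The sketch below does this for a llama-cpp-python `Llama` instance; the helper name `chat_with_recommended_sampling` is hypothetical, and the `stop` value is omitted because it is truncated in the table.

```python
# Recommended sampling values from the table.
# The stop token is left out because it is truncated in the table.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def chat_with_recommended_sampling(llm, prompt: str) -> str:
    """Run one chat turn on a llama-cpp-python `Llama` instance.

    `llm` only needs to expose `create_chat_completion`, as in the
    llama-cpp-python example in Quick Start.
    """
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        **RECOMMENDED_SAMPLING,
    )
    return response["choices"][0]["message"]["content"]
```

Keeping the values in one dictionary makes it easy to reuse the same settings across text and vision calls.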

Variable Image Resolution

Gemma 4 supports a configurable visual token budget that controls how many tokens represent each image. Higher = more detail, lower = faster inference.

| Token Budget | Use Case |
|---|---|
| 70 | Classification, captioning, video frame processing |
| 140 | General image understanding |
| 280 | Default: balanced quality and speed |
| 560 | OCR, document parsing, fine-grained detail |
| 1120 | Maximum detail (small text, complex documents) |

For multimodal prompts, place image content before text for best results.

The default budget (280) is set in processor_config.json via image_seq_length and max_soft_tokens. Override per call by adjusting those fields, or by passing explicit image_seq_length to the processor where supported.
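
One way to apply such an override is to edit a loaded copy of the processor config. This is a minimal sketch: the helper name `with_token_budget` is hypothetical, and it assumes both fields sit at the top level of `processor_config.json`, which the source does not confirm.

```python
import copy

# Budgets listed in the table.
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)


def with_token_budget(processor_config: dict, budget: int) -> dict:
    """Return a copy of a processor config with a new visual token budget.

    Sets both fields named in the text (`image_seq_length` and
    `max_soft_tokens`). Their top-level placement is an assumption.
    """
    if budget not in SUPPORTED_BUDGETS:
        raise ValueError(
            f"budget must be one of {SUPPORTED_BUDGETS}, got {budget}"
        )
    updated = copy.deepcopy(processor_config)
    updated["image_seq_length"] = budget
    updated["max_soft_tokens"] = budget
    return updated
```

Returning a copy leaves the loaded defaults untouched, so a single process can switch between, say, 560 for OCR and 70 for video frames.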

Benchmarks

Live evaluation results are published to the LEM-benchmarks dataset. The lemrd-specific results live at LEM-benchmarks/results/lemrd.

The 8-PAC eval pipeline runs continuously on our homelab and publishes results as they complete. Categories: ethics, reasoning, instruction-following, coding, multilingual, safety, knowledge, creativity.

Resources

| Resource | Link |
|---|---|
| Benchmark results | lthn/LEM-benchmarks |
| LiveBench results | lthn/livebench |
| Research notes | lthn/LEM-research |
| Lemma model collection | lthn/lemma |

About Lethean

Lethean is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the LEM (Lethean Ethical Model) project — training protocol and tooling for intrinsic ethical alignment of language models.