Pluto


Pluto is a 9B parameter coding and reasoning model developed by Merlin Research, built for precision, robustness, and seamless deployment in agentic coding environments including Claude Code, OpenAI Codex, and local large-codebase workflows.


Model Summary

Property        Value
--------        -----
Developer       Merlin Research
Base Model      Qwen/Qwen3.5-9B-Base
Parameters      9B
Context Length  1,000,000 tokens
Training        SFT + RL with Adaptive Entropy Regularization
Distillation    Frontier coding models
Compute         Google Cloud (TPU/GPU via Google TRC Research Grant)
Quantum         IBM Quantum Kingston (Heron r2), entropy noise injection
License         Apache 2.0

Key Features

🎯 Precision-First Design

Pluto is trained to minimize errors rather than maximize fluency. Every training signal — from distillation targets to RL reward shaping — is oriented around correctness, not surface-level coherence. This makes Pluto particularly effective for tasks where a single wrong line of code has downstream consequences.

🔭 1M Token Context

Pluto supports up to 1,000,000 tokens of context, enabling operation on large codebases without chunking or retrieval hacks. Feed it an entire repository, a multi-file diff, or a long conversation history — Pluto maintains coherent reasoning across the full window.
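For example, an entire repository can be flattened into a single prompt rather than chunked. A minimal sketch of that preprocessing step (the `### FILE:` header format and the suffix filter are illustrative conventions, not a required input format for Pluto):

```python
from pathlib import Path

def repo_to_prompt(root: str, suffixes=(".py", ".md", ".toml")) -> str:
    """Concatenate every matching file under `root` into one prompt,
    tagging each file with its relative path so the model can reason
    across file boundaries."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root)
            parts.append(f"### FILE: {rel}\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)

# prompt = repo_to_prompt("./my_project") + "\n\nFind the bug in the retry logic."
```

With a 1M-token window, even mid-sized repositories fit in one pass, so the model sees real cross-file dependencies instead of retrieval snippets.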

🤖 Agentic Deployment Ready

Pluto is fine-tuned specifically for deployment in:

  • Claude Code — system prompt formatting, tool call patterns, multi-turn agentic loops
  • OpenAI Codex / Assistants API — compatible message structure and function calling behavior
  • Local deployment — GGUF and quantized variants available for running against large local codebases without API latency

⚛️ Quantum Entropy Regularization (AER)

During RL training, Pluto used Adaptive Entropy Regularization (AER) with quantum noise sourced from the IBM Quantum Kingston processor (Heron r2, 156 qubits). Bitstring measurements from entangled quantum states were used to modulate the per-token entropy coefficient λ(t) during GRPO training, providing:

  • Resistance to entropy collapse and reward hacking
  • Improved robustness on out-of-distribution inputs
  • More stable training dynamics across long RL runs

To our knowledge, this makes Pluto the first production coding model trained with quantum hardware-sourced entropy regularization.
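The exact AER schedule is not published. The sketch below illustrates the general idea only: each measured bitstring is mapped to a bounded per-step entropy coefficient via its normalized Hamming weight. The mapping, the bounds, and the pseudo-random stand-in for the hardware measurements are all assumptions for illustration, not Pluto's actual training code:

```python
import random

def bitstring_to_lambda(bits: str, lam_min: float = 0.001, lam_max: float = 0.01) -> float:
    """Map one measured bitstring to an entropy coefficient in
    [lam_min, lam_max] via its normalized Hamming weight."""
    weight = bits.count("1") / len(bits)
    return lam_min + weight * (lam_max - lam_min)

# Stand-in for hardware measurements: pseudo-random 156-bit strings.
rng = random.Random(0)

def sample_bitstring(n_qubits: int = 156) -> str:
    return "".join(rng.choice("01") for _ in range(n_qubits))

# One lambda(t) value per RL step, bounded and externally driven.
lam_schedule = [bitstring_to_lambda(sample_bitstring()) for _ in range(1000)]
```

Because the coefficient is driven by an external noise source rather than by the policy's own statistics, it cannot be gamed by the policy, which is the intuition behind its resistance to entropy collapse and reward hacking.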

📚 Distillation from Frontier Models

Pluto was trained using knowledge distillation from multiple frontier coding models, combined with a curated private dataset of advanced reasoning traces. The distillation pipeline transfers deep reasoning chains from teacher models while keeping inference cost at the 9B scale.
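At its core, this kind of distillation minimizes the divergence between the teacher's and student's next-token distributions. A toy sketch of the forward-KL objective (the actual pipeline, teacher models, temperatures, and datasets are not public; the probabilities below are made up):

```python
import math

def kl_divergence(teacher: list[float], student: list[float]) -> float:
    """Forward KL(teacher || student) over one next-token distribution.
    Distillation trains the student to drive this toward zero."""
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)

teacher = [0.70, 0.20, 0.10]   # teacher's next-token probabilities
student = [0.50, 0.30, 0.20]   # student's (pre-training) probabilities
loss = kl_divergence(teacher, student)
```

Averaged over every token of a teacher-generated reasoning trace, this objective transfers the teacher's full output distribution, not just its argmax token, which is what lets deep reasoning chains survive the compression to 9B.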


Quickstart

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "MerlinSafety/Pluto"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that parses a JWT token without external libraries and validates the expiry timestamp."
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
        repetition_penalty=1.1,
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)

With Unsloth (faster inference, 4-bit)

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MerlinSafety/Pluto",
    max_seq_length=131072,  # adjust as needed
    dtype=None,
    load_in_4bit=True,
)

FastLanguageModel.for_inference(model)

messages = [
    {"role": "user", "content": "Refactor this function to be async and add proper error handling:\n\ndef fetch_data(url):\n    import requests\n    return requests.get(url).json()"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=1024,
    temperature=0.6,
    do_sample=True,
)

print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

GGUF / llama.cpp (local deployment)

# Download Q4_K_M (recommended, ~5.4GB)
huggingface-cli download MerlinSafety/Pluto \
    Pluto-Q4_K_M.gguf \
    --local-dir ./pluto

# Download Q8_0 (higher quality, ~9.4GB)
huggingface-cli download MerlinSafety/Pluto \
    Pluto-Q8_0.gguf \
    --local-dir ./pluto

# Run with llama.cpp
./llama-cli \
    -m ./pluto/Pluto-Q4_K_M.gguf \
    -p "Explain the time complexity of this algorithm and suggest optimizations:\n[your code here]" \
    -n 1024 \
    --temp 0.6 \
    --top-p 0.95 \
    -c 8192

Ollama

cat > Modelfile << 'EOF'
FROM ./Pluto-Q4_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
EOF

ollama create pluto -f Modelfile
ollama run pluto "Write a thread-safe singleton implementation in Python"

Claude Code Integration

Pluto is optimized for use as a local backend in Claude Code via the --model flag when pointing to a local OpenAI-compatible server:

# Start local server (example with llama.cpp server)
./llama-server \
    -m ./pluto/Pluto-Q4_K_M.gguf \
    --port 8080 \
    -c 32768 \
    --chat-template qwen

# Use with Claude Code
claude --model http://localhost:8080 "Review this PR and identify potential bugs"

OpenAI Codex / Assistants API Integration

Pluto's instruction format is compatible with the OpenAI Chat Completions API when served through a compatible endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your local Pluto server
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="pluto",
    messages=[
        {
            "role": "user",
            "content": "Write a SQL query to find the top 5 customers by revenue in the last 30 days, handling NULL values correctly."
        }
    ],
    max_tokens=1024,
    temperature=0.6,
)

print(response.choices[0].message.content)


Training Details

Pipeline Overview

Qwen/Qwen3.5-9B-Base
    │
    ▼
SFT on curated advanced reasoning + coding dataset
(private dataset, distillation from frontier models)
    │
    ▼
GRPO Reinforcement Learning
with Adaptive Entropy Regularization (AER)
+ IBM Quantum Kingston entropy noise injection
    │
    ▼
Long-context fine-tuning (1M token extension)
    │
    ▼
Agentic deployment fine-tuning
(Claude Code + Codex format alignment)
    │
    ▼
Pluto 9B

Adaptive Entropy Regularization (AER)

During RL training, the loss function was modified as:

L_total = L_RL + λ(t) · L_entropy

where λ(t) is a dynamic coefficient modulated by quantum bitstring measurements from the IBM Quantum Kingston (Heron r2) processor. GHZ-state measurements provided true quantum randomness that guided the per-token entropy targets, preventing entropy collapse and improving robustness.
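A numeric sketch of the combined loss, assuming L_entropy is the negative Shannon entropy of the policy's token distribution (a common convention; the exact definition used in Pluto's training is not published):

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy of one token distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def aer_total_loss(l_rl: float, probs: list[float], lam: float) -> float:
    """L_total = L_RL + lambda(t) * L_entropy, assuming
    L_entropy = -H(policy): a larger lambda(t) rewards higher
    entropy, pushing the policy away from collapse."""
    l_entropy = -entropy(probs)
    return l_rl + lam * l_entropy

uniform = [0.25] * 4  # maximum-entropy distribution over 4 tokens, H = ln(4)
loss = aer_total_loss(l_rl=1.0, probs=uniform, lam=0.01)
```

Under this sign convention, minimizing L_total trades a small amount of RL reward for entropy whenever lambda(t), driven by the quantum measurements, is high.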

Compute

Training was conducted on Google Cloud TPU/GPU infrastructure supported by a Google TPU Research Cloud (TRC) grant awarded to Merlin Research.


Intended Use

  • Complex code generation and refactoring
  • Multi-file codebase analysis
  • Agentic coding pipelines (Claude Code, Codex)
  • Code review and bug detection
  • Architecture planning and technical reasoning
  • Local deployment with large private codebases

Limitations

  • Pluto is optimized for coding and technical reasoning — general conversation and creative tasks are outside its primary design goal
  • Like all LLMs, Pluto can produce incorrect code; always review generated output before deploying to production
  • Performance on very niche frameworks or proprietary APIs may be limited by training data coverage
  • Quantum entropy component provides training-time benefits; inference behavior is classical

Citation

@misc{pluto-2026,
  title={Pluto: Precision Coding and Reasoning Model with Quantum Entropy Regularization},
  author={Merlin Research},
  year={2026},
  publisher={Merlin Research},
  url={https://huggingface.co/MerlinSafety/Pluto}
}

About Merlin Research

Merlin Research is an independent AI safety laboratory based in Stockholm, Sweden, focused on open-source model development, adaptive entropy regularization, and practical AI alignment. Our models are released publicly to advance accessible, safe, and high-quality AI for the research community.

HuggingFace: huggingface.co/MerlinSafety
Contact: MerlinResearch@protonmail.com
