How to use from the llama-cpp-python library
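# !pip install llama-cpp-python
# Note: the model card recommends a ROCmFPX-enabled llama.cpp build, so this
# assumes llama-cpp-python was built against one.

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="cafonez/Agent-Nemotron-ROCmFP6",
    # First shard; file names are taken from the Files section.
    filename="Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00001-of-00002.gguf",
    # The second shard of the split GGUF must be downloaded as well.
    additional_files=[
        "Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00002-of-00002.gguf",
    ],
)
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ]
)
# The result follows the OpenAI chat-completion layout.
print(response["choices"][0]["message"]["content"])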

Agent-Nemotron-ROCmFP6

Q6_0_ROCMFPX_AGENT (ROCmFP6 Agent) quantized GGUF of NVIDIA's Nemotron-3-Nano-30B-A3B.

  • Base model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
  • Quantization: Q6_0_ROCMFPX_AGENT — ROCm-optimized 6-bit format with agent/tool-call coherent routing
  • Size: ~27.4 GB total (21 GB + 6.4 GB shards)
  • Parameters: ~30B total / 3.5B active (hybrid Mamba-2 + MoE)
  • Optimized for: Agentic workflows, tool calling, reasoning on AMD ROCm and Vulkan backends

This quantization uses custom ROCmFPX kernels (part of the experimental ROCmFPX family in llama.cpp) that provide better performance and quality on ROCm hardware for agent-style workloads. The _AGENT preset spends extra bits on the tensors that drive tool-use routing (Hermes-style, OpenClaw, BFCL, etc.) so that tool calls stay coherent after quantization.

Files

| File | Size | Description |
| --- | --- | --- |
| Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00001-of-00002.gguf | 21 GB | Main weights shard |
| Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00002-of-00002.gguf | 6.4 GB | Second shard |
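Both shards need to be present in the same directory before loading. The following is a minimal sketch of one way to fetch them, assuming the files sit at the root of the repository and that the `huggingface_hub` package is installed. The helper names `shard_names` and `download_shards` are hypothetical and used only for illustration.

```python
import re


def shard_names(first_shard: str) -> list[str]:
    """Expand 'stem-00001-of-0000N.gguf' into the full list of shard names."""
    match = re.fullmatch(r"(.+)-(\d{5})-of-(\d{5})\.gguf", first_shard)
    if match is None:
        return [first_shard]  # not a split GGUF, nothing to expand
    stem, _, total = match.groups()
    return [f"{stem}-{i:05d}-of-{total}.gguf" for i in range(1, int(total) + 1)]


def download_shards(repo_id: str, first_shard: str, local_dir: str) -> list[str]:
    """Download every shard and return the local paths (hypothetical helper)."""
    # Imported lazily so shard_names() works without the package installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=repo_id, filename=name, local_dir=local_dir)
        for name in shard_names(first_shard)
    ]


FIRST_SHARD = "Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00001-of-00002.gguf"
print(shard_names(FIRST_SHARD))

# Assumption: files live at the repository root; uncomment to download (~27 GB).
# paths = download_shards("cafonez/Agent-Nemotron-ROCmFP6", FIRST_SHARD, "./models")
```

When loading, point the loader at the first shard; llama.cpp locates the remaining shards from the file name pattern.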

Recommended Usage (llama.cpp)

Use a ROCmFPX-enabled build of llama.cpp (see ROCmFPX projects / strix builds).

Quick server (recommended flags)

# Using the convenience wrapper (if installed)
HERMES_NEMOTRON_NANO_FP6_MODEL=/path/to/Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00001-of-00002.gguf \
  hermes-nemotron-nano-30b-rocmfp6-agent-server

Direct llama-server:

llama-server \
  -m /path/to/Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00001-of-00002.gguf \
  --alias nemotron-nano-30b-rocmfp6-agent \
  --host 0.0.0.0 --port 8101 \
  -dev ROCm0 \
  -ngl 999 \
  -fa on \
  --mmap \
  --jinja \
  -c 131072 \
  -b 512 -ub 512 \
  --reasoning off \
  --slots \
  --metrics

For best agent/tool performance use --jinja (the GGUF embeds a strong Nemotron tool calling template).
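Once the server is running, it can be queried through llama-server's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server from the command above is reachable on localhost at port 8101. The `max_tokens` value and the helper names are illustrative choices, not settings from the model card.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8101"          # assumption: server reached locally
MODEL = "nemotron-nano-30b-rocmfp6-agent"   # matches the --alias value above


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       model: str = MODEL) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,  # illustrative limit
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires the server to be running:
# print(ask("What is the capital of France?"))
```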

Key notes

  • Q6_0_ROCMFPX_AGENT spends a few extra bits on agent routing tensors compared to plain Q6_0_ROCMFPX.
  • Excellent balance of quality vs size for agentic use on high-end AMD GPUs (Strix Halo, etc.).
  • Supports very long context (tested at high values; the llama-server example above uses -c 131072).
  • Tool calling format is the Nemotron <tool_call> style (also compatible with many frameworks via parsers).
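When llama-server runs with --jinja, tool calls are typically returned in the OpenAI-compatible `tool_calls` field, so a client does not have to parse the raw <tool_call> text itself. The following sketch shows one way to declare a tool and dispatch the calls a response contains. The `get_weather` tool, the registry, and the sample message are hypothetical and hand-written for illustration; they are not real model output.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical local tool used only for illustration."""
    return f"Sunny in {city}"


# Tool schema to pass as the "tools" field of a chat completion request.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return a short weather summary for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message: dict) -> list[dict]:
    """Execute each tool call in an assistant message and build 'tool' replies."""
    replies = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        arguments = function["arguments"]
        if isinstance(arguments, str):  # arguments usually arrive as a JSON string
            arguments = json.loads(arguments)
        result = REGISTRY[function["name"]](**arguments)
        replies.append({
            "role": "tool",
            "tool_call_id": call.get("id"),
            "content": result,
        })
    return replies


# Hand-written message shaped like an OpenAI-compatible assistant response.
sample_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
    }],
}
print(run_tool_calls(sample_message))
```

The replies are appended to the conversation after the assistant message and sent back in a follow-up request, so the model can use the tool results in its final answer.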

Chat Template

The GGUF includes the official Nemotron-3 tool-aware chat template. Use --jinja (or equivalent) with your loader.

Benchmarks (example from development)

Typical tokens/s on ROCm0 (full offload) for this quant:

  • ~650+ t/s prompt eval (pp512)
  • ~53 t/s generation (tg128)

Results vary by hardware + context.
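As a rough worked example, the two figures above can be combined into a back-of-the-envelope latency estimate: prompt tokens divided by the prompt-eval rate plus generated tokens divided by the generation rate. This is only a sketch; it assumes the rates stay constant, whereas real throughput usually drops as context grows.

```python
PROMPT_TPS = 650.0  # prompt eval rate from the pp512 figure above
GEN_TPS = 53.0      # generation rate from the tg128 figure above


def estimate_seconds(prompt_tokens: int, new_tokens: int,
                     prompt_tps: float = PROMPT_TPS,
                     gen_tps: float = GEN_TPS) -> float:
    """Estimate wall-clock seconds for one request at constant throughput."""
    return prompt_tokens / prompt_tps + new_tokens / gen_tps


# A 512-token prompt followed by 128 generated tokens.
print(f"{estimate_seconds(512, 128):.2f} s")
```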

License


Model page: https://huggingface.co/cafonez/Agent-Nemotron-ROCmFP6

For questions or issues with the quantization, refer to the ROCmFPX documentation in the corresponding development repositories.
