Instructions to use coder543/North-Mini-Code-1.0-QAD-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Transformers:

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("coder543/North-Mini-Code-1.0-QAD-GGUF", dtype="auto")

llama-cpp-python

How to use coder543/North-Mini-Code-1.0-QAD-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="coder543/North-Mini-Code-1.0-QAD-GGUF",
	filename="north-mini-code-1.0-w4a16-nvfp4.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use coder543/North-Mini-Code-1.0-QAD-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
# Run inference directly in the terminal:
llama-cli -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
# Run inference directly in the terminal:
llama-cli -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
# Run inference directly in the terminal:
./llama-cli -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
# Run inference directly in the terminal:
./build/bin/llama-cli -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4

Use Docker

docker model run hf.co/coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4

LM Studio
Jan
Ollama
How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Ollama:
```
ollama run hf.co/coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
```

Unsloth Studio

How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for coder543/North-Mini-Code-1.0-QAD-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for coder543/North-Mini-Code-1.0-QAD-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for coder543/North-Mini-Code-1.0-QAD-GGUF to start chatting

How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Docker Model Runner:
```
docker model run hf.co/coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
```

Lemonade

How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4

Run and chat with the model

lemonade run user.North-Mini-Code-1.0-QAD-GGUF-NVFP4

List all available models

lemonade list

GGUF Conversion Note

This repository contains a GGUF conversion of CohereLabs/North-Mini-Code-1.0-w4a16, Cohere's QAD-trained NVFP4 W4A16 checkpoint for North Mini Code. The GGUF was produced with llama.cpp's convert_hf_to_gguf.py using --outtype bf16.

The expert weights were repacked from the source checkpoint's compressed-tensors nvfp4-pack-quantized format into GGUF NVFP4 tensors. They were not dequantized and requantized into a standard GGUF Q4_* or IQ4_* format, so this conversion is intended to preserve the QAD-trained 4-bit weights. Non-NVFP4 tensors are stored as BF16 or F32 as emitted by the converter.

During conversion, the local llama.cpp converter needed a small workaround for a lazy tensor zero-bias check in the Cohere2 MoE path; the checkpoint's bias tensors were zero and were skipped as intended.

Conversion command:

python convert_hf_to_gguf.py \
  /path/to/CohereLabs/North-Mini-Code-1.0-w4a16 \
  --outfile north-mini-code-1.0-w4a16-nvfp4.gguf \
  --outtype bf16

Model Card for North Mini Code

Model Summary

North Mini Code is an open weights research release of a 30B-A3B parameter model optimized for code generation, agentic software engineering, and terminal tasks.

Developed by: Cohere and Cohere Labs

Point of Contact: Cohere Labs
License: Apache 2.0
Model: North Mini Code
Model Size: 30B total; 3B active
Context length: 256K & 64K max output
Quantization: NVFP4 W4A16

For more details about this model, please check out our blog post.

Try North Mini Code

You can try out North Mini Code before downloading the weights in OpenCode and our hosted Hugging Face Space.

Evaluation

Benchmarking Methodology [CLICK TO EXPAND]

We used SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench v2, and Terminal-Bench Hard to benchmark North Mini Code's agentic coding capabilities. For evaluation harnesses, we used the Swe-Agent harness v1.1.0 for SWE-Bench, and a simple ReAct harness employing a single terminal-use tool based on Harbor's Tmux session implementation for Terminal-Bench v2. For Terminal Bench Hard, we directly used Terminus-2, following the same methodology as the Artificial Analysis Intelligence Index to compare North-Mini-Code-1.0 with the other models. Additionally, we used SciCode and LiveCodeBench v6 as complex code-generation benchmarks outside of tool use.
We run each benchmark with 3 different seeds and report the average benchmark performance, using temperature=1.0 and top_p=0.95. We used publicly reported scores for competitor models, either from original reports or the Artificial Analysis Intelligence Index, where available. Additionally, Gemma4’s scores for agentic coding tasks were reported by Qwen team. For benchmark results that any public report is missing, denoted by (*) in the figure, we run them internally using the recommended model configuration.

Usage

To use our model in transformers, please use our BF16 model weights. Our NVFP4_W4A16 checkpoint is designed to be used with vLLM and MLX-VLM and is not compatible with transformers due to lack of native 4-bit support.

vLLM

You can run the model in vLLM. Please use vLLM main for North Mini Code until a new release is available, and accurate response parsing also requires installing Cohere’s melody library.

uv pip install "git+https://github.com/vllm-project/vllm.git"
uv pip install cohere_melody>=0.9.0

Then the vLLM server can be started with the following command:

vllm serve CohereLabs/North-Mini-Code-1.0-w4a16 \
  -tp 1 \
  --max-model-len 320000 \
  --tool-call-parser cohere_command4 \
  --reasoning-parser cohere_command4 \
  --enable-auto-tool-choice

Use locally deployed North Mini Code in OpenCode:

Please use OpenCode > v1.17.0.

brew install anomalyco/tap/opencode

To use locally deployed North Mini Code in Opencode, please use this config which enables interleaved reasoning:

{
  "$schema": "https://opencode.ai/config.json",
  "model": "vllm/CohereLabs/North-Mini-Code-1.0-w4a16",
  "provider": {
    "vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local vLLM server",
      "options": {
        "baseURL": "http://127.0.0.1:8000/v1",
        "apiKey": "EMPTY"
      },
      "models": {
        "CohereLabs/North-Mini-Code-1.0-w4a16": {
          "name": "North-Mini-Code-1.0",
          "interleaved": {
            "field": "reasoning"
          },
          "limit": {
            "context": 256000,
            "output": 64000
          }
        }
      }
    }
  }
}

MLX-VLM

You can also run the model in MLX-VLM. Please use main for North Mini Code until a new release is available.

uv pip install "git+https://github.com/Blaizzy/mlx-vlm.git@main"

Then the mlx_vlm server can be started with the following command:

mlx_vlm.server \
  --model CohereLabs/North-Mini-Code-1.0-w4a16 \
  --enable-thinking \
  --thinking-start-token "<|START_THINKING|>" \
  --thinking-end-token "<|END_THINKING|>"

Opencode config:

Actual limit depends on your device

{
  "$schema": "https://opencode.ai/config.json",
  "model": "mlx-vlm/CohereLabs/North-Mini-Code-1.0-w4a16",
  "provider": {
    "mlx-vlm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "MLX VLM Local",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1",
        "apiKey": "EMPTY"
      },
      "models": {
        "CohereLabs/North-Mini-Code-1.0-w4a16": {
          "name": "North-Mini-Code-1.0",
          "interleaved": {
            "field": "reasoning"
          },
          "limit": {
            "context": 256000,
            "output": 64000
          }
        }
      }
    }
  }
}

Model Details

Input: Text only.

Output: Model generates text.

Model Architecture: North-Mini-Code-1.0 is a decoder-only Transformer-based sparse Mixture-of-Experts model. It uses an efficient attention implementation, interleaved between sliding-window attention with RoPE and global attention with no positional embeddings, in a 3:1 ratio. The feed-forward block is an MoE block with 128 experts, of which 8 are activated per token. Each expert block is an FFN block with SwiGLU activation. The router applies a sigmoid activation function to the logits before the top-k selection. We also use a single dense layer before the sparse layers. North-Mini-Code-1.0 was post-trained using a two-stage cascaded supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR), focusing on agentic coding. For more technical details, please check out our blog post.

Quantization Methodology: We use NVFP4 W4A16 quantization (4-bit weights, 16-bit activations) for this model, delivering a much smaller memory footprint (~ 18-20GB) and faster inference while preserving coding accuracy. We quantize the MoE experts only, keeping attention, the dense layer, and the router at higher precision. Since the experts hold most of the model's parameters, this captures the bulk of the savings with minimal quality loss. To preserve quality, we use Quantization-Aware Distillation (QAD), training the quantized model to match the unquantized model's outputs, achieving >99% overall accuracy recovery across our evaluations. Since only weights are quantized, this format does not require native FP4 hardware and runs on pre-Blackwell GPUs such as Hopper and Ada.

Context Length: North-Mini-Code-1.0 supports a context length of 256K & 64K output length.

Model Card Contact

For errors or additional questions about details in this model card, contact [labs@cohere.com].

Downloads last month: 1

GGUF

Model size

30B params

Architecture

cohere2moe

Hardware compatibility

4-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for coder543/North-Mini-Code-1.0-QAD-GGUF

Base model

CohereLabs/North-Mini-Code-1.0

Quantized

CohereLabs/North-Mini-Code-1.0-w4a16

Quantized

(1)

this model