Instructions to use coder543/North-Mini-Code-1.0-QAD-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("coder543/North-Mini-Code-1.0-QAD-GGUF", dtype="auto") - llama-cpp-python
How to use coder543/North-Mini-Code-1.0-QAD-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="coder543/North-Mini-Code-1.0-QAD-GGUF", filename="north-mini-code-1.0-w4a16-nvfp4.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use coder543/North-Mini-Code-1.0-QAD-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4 # Run inference directly in the terminal: llama-cli -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4 # Run inference directly in the terminal: llama-cli -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4 # Run inference directly in the terminal: ./llama-cli -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4 # Run inference directly in the terminal: ./build/bin/llama-cli -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
Use Docker
docker model run hf.co/coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
- LM Studio
- Jan
- Ollama
How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Ollama:
ollama run hf.co/coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
- Unsloth Studio
How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for coder543/North-Mini-Code-1.0-QAD-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for coder543/North-Mini-Code-1.0-QAD-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for coder543/North-Mini-Code-1.0-QAD-GGUF to start chatting
- Pi
How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Docker Model Runner:
docker model run hf.co/coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
- Lemonade
How to use coder543/North-Mini-Code-1.0-QAD-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull coder543/North-Mini-Code-1.0-QAD-GGUF:NVFP4
Run and chat with the model
lemonade run user.North-Mini-Code-1.0-QAD-GGUF-NVFP4
List all available models
lemonade list
GGUF Conversion Note
This repository contains a GGUF conversion of CohereLabs/North-Mini-Code-1.0-w4a16, Cohere's QAD-trained NVFP4 W4A16 checkpoint for North Mini Code. The GGUF was produced with llama.cpp's convert_hf_to_gguf.py using --outtype bf16.
The expert weights were repacked from the source checkpoint's compressed-tensors nvfp4-pack-quantized format into GGUF NVFP4 tensors. They were not dequantized and requantized into a standard GGUF Q4_* or IQ4_* format, so this conversion is intended to preserve the QAD-trained 4-bit weights. Non-NVFP4 tensors are stored as BF16 or F32 as emitted by the converter.
During conversion, the local llama.cpp converter needed a small workaround for a lazy tensor zero-bias check in the Cohere2 MoE path; the checkpoint's bias tensors were zero and were skipped as intended.
Conversion command:
python convert_hf_to_gguf.py \
/path/to/CohereLabs/North-Mini-Code-1.0-w4a16 \
--outfile north-mini-code-1.0-w4a16-nvfp4.gguf \
--outtype bf16
Model Card for North Mini Code
Model Summary
North Mini Code is an open weights research release of a 30B-A3B parameter model optimized for code generation, agentic software engineering, and terminal tasks.
Developed by: Cohere and Cohere Labs
- Point of Contact: Cohere Labs
- License: Apache 2.0
- Model: North Mini Code
- Model Size: 30B total; 3B active
- Context length: 256K & 64K max output
- Quantization: NVFP4 W4A16
For more details about this model, please check out our blog post.
Try North Mini Code
You can try out North Mini Code before downloading the weights in OpenCode and our hosted Hugging Face Space.
Evaluation
Benchmarking Methodology [CLICK TO EXPAND]
- We used SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench v2, and Terminal-Bench Hard to benchmark North Mini Code's agentic coding capabilities. For evaluation harnesses, we used the Swe-Agent harness v1.1.0 for SWE-Bench, and a simple ReAct harness employing a single terminal-use tool based on Harbor's Tmux session implementation for Terminal-Bench v2. For Terminal Bench Hard, we directly used Terminus-2, following the same methodology as the Artificial Analysis Intelligence Index to compare North-Mini-Code-1.0 with the other models. Additionally, we used SciCode and LiveCodeBench v6 as complex code-generation benchmarks outside of tool use.
- We run each benchmark with 3 different seeds and report the average benchmark performance, using temperature=1.0 and top_p=0.95. We used publicly reported scores for competitor models, either from original reports or the Artificial Analysis Intelligence Index, where available. Additionally, Gemma4’s scores for agentic coding tasks were reported by Qwen team. For benchmark results that any public report is missing, denoted by (*) in the figure, we run them internally using the recommended model configuration.
Usage
To use our model in transformers, please use our BF16 model weights. Our NVFP4_W4A16 checkpoint is designed to be used with vLLM and MLX-VLM and is not compatible with transformers due to lack of native 4-bit support.
vLLM
You can run the model in vLLM. Please use vLLM main for North Mini Code until a new release is available, and accurate response parsing also requires installing Cohere’s melody library.
uv pip install "git+https://github.com/vllm-project/vllm.git"
uv pip install cohere_melody>=0.9.0
Then the vLLM server can be started with the following command:
vllm serve CohereLabs/North-Mini-Code-1.0-w4a16 \
-tp 1 \
--max-model-len 320000 \
--tool-call-parser cohere_command4 \
--reasoning-parser cohere_command4 \
--enable-auto-tool-choice
Use locally deployed North Mini Code in OpenCode:
Please use OpenCode > v1.17.0.
brew install anomalyco/tap/opencode
To use locally deployed North Mini Code in Opencode, please use this config which enables interleaved reasoning:
{
"$schema": "https://opencode.ai/config.json",
"model": "vllm/CohereLabs/North-Mini-Code-1.0-w4a16",
"provider": {
"vllm": {
"npm": "@ai-sdk/openai-compatible",
"name": "Local vLLM server",
"options": {
"baseURL": "http://127.0.0.1:8000/v1",
"apiKey": "EMPTY"
},
"models": {
"CohereLabs/North-Mini-Code-1.0-w4a16": {
"name": "North-Mini-Code-1.0",
"interleaved": {
"field": "reasoning"
},
"limit": {
"context": 256000,
"output": 64000
}
}
}
}
}
}
MLX-VLM
You can also run the model in MLX-VLM. Please use main for North Mini Code until a new release is available.
uv pip install "git+https://github.com/Blaizzy/mlx-vlm.git@main"
Then the mlx_vlm server can be started with the following command:
mlx_vlm.server \
--model CohereLabs/North-Mini-Code-1.0-w4a16 \
--enable-thinking \
--thinking-start-token "<|START_THINKING|>" \
--thinking-end-token "<|END_THINKING|>"
Opencode config:
Actual limit depends on your device
{
"$schema": "https://opencode.ai/config.json",
"model": "mlx-vlm/CohereLabs/North-Mini-Code-1.0-w4a16",
"provider": {
"mlx-vlm": {
"npm": "@ai-sdk/openai-compatible",
"name": "MLX VLM Local",
"options": {
"baseURL": "http://127.0.0.1:8080/v1",
"apiKey": "EMPTY"
},
"models": {
"CohereLabs/North-Mini-Code-1.0-w4a16": {
"name": "North-Mini-Code-1.0",
"interleaved": {
"field": "reasoning"
},
"limit": {
"context": 256000,
"output": 64000
}
}
}
}
}
}
Model Details
Input: Text only.
Output: Model generates text.
Model Architecture: North-Mini-Code-1.0 is a decoder-only Transformer-based sparse Mixture-of-Experts model. It uses an efficient attention implementation, interleaved between sliding-window attention with RoPE and global attention with no positional embeddings, in a 3:1 ratio. The feed-forward block is an MoE block with 128 experts, of which 8 are activated per token. Each expert block is an FFN block with SwiGLU activation. The router applies a sigmoid activation function to the logits before the top-k selection. We also use a single dense layer before the sparse layers. North-Mini-Code-1.0 was post-trained using a two-stage cascaded supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR), focusing on agentic coding. For more technical details, please check out our blog post.
Quantization Methodology: We use NVFP4 W4A16 quantization (4-bit weights, 16-bit activations) for this model, delivering a much smaller memory footprint (~ 18-20GB) and faster inference while preserving coding accuracy. We quantize the MoE experts only, keeping attention, the dense layer, and the router at higher precision. Since the experts hold most of the model's parameters, this captures the bulk of the savings with minimal quality loss. To preserve quality, we use Quantization-Aware Distillation (QAD), training the quantized model to match the unquantized model's outputs, achieving >99% overall accuracy recovery across our evaluations. Since only weights are quantized, this format does not require native FP4 hardware and runs on pre-Blackwell GPUs such as Hopper and Ada.
Context Length: North-Mini-Code-1.0 supports a context length of 256K & 64K output length.
Model Card Contact
For errors or additional questions about details in this model card, contact [labs@cohere.com].
- Downloads last month
- 1
4-bit
Model tree for coder543/North-Mini-Code-1.0-QAD-GGUF
Base model
CohereLabs/North-Mini-Code-1.0