# Marco-Nano-Instruct

Marco-Nano-Instruct is the post-trained variant of Marco-Nano-Base, a highly sparse multilingual Mixture-of-Experts (MoE) language model from the Marco-MoE family, developed by Alibaba International Digital Commerce. It activates only 0.6B of its 8B total parameters (a 7.5% activation ratio) per token. Despite this sparsity, Marco-Nano-Instruct achieves the best average performance on English, multilingual general, and multilingual cultural benchmarks among comparable instruct models with up to 3.84B activated parameters.
## Model Description
Marco-Nano-Instruct shares the same architecture as Marco-Nano-Base: a decoder-only Transformer with sparse MoE layers replacing standard FFN layers, upcycled from Qwen3-0.6B-Base using fine-grained sub-matrix splitting combined with Drop-Upcycling.
| Configuration | Value |
|---|---|
| Total Parameters | 8B |
| Activated Parameters | 0.6B |
| Activation Ratio | 7.5% |
| Num Layers | 28 |
| Model Dimension | 1024 |
| FFN Intermediate Dimension | 3072 |
| Q-Heads | 16 |
| KV-Heads | 8 |
| Head Dimension | 128 |
| Expert Dimension | 384 |
| Total Experts | 232 |
| Activated Experts | 8 |
| Tie Embeddings | True |
| Training FLOPs | $1.40 \times 10^{23}$ |
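The upcycling recipe can be pictured with a small sketch. The following is an illustrative PyTorch fragment, not the released conversion code: it splits a dense FFN up-projection (intermediate dimension 3,072, as in Qwen3-0.6B-Base) into fine-grained sub-experts of width 384, then re-initializes a fraction of each copy in the spirit of Drop-Upcycling. The function name, the 0.5 re-init ratio, and the init scale are our assumptions; the full recipe also replicates and diversifies the splits to reach the 232 experts listed above.

```python
import torch

def upcycle_ffn(w_up: torch.Tensor, num_splits: int = 8, reinit_ratio: float = 0.5):
    """Split a dense FFN up-projection of shape (d_ffn, d_model), e.g. (3072, 1024),
    into num_splits sub-experts of shape (d_ffn // num_splits, d_model)."""
    experts = []
    for chunk in w_up.chunk(num_splits, dim=0):          # fine-grained sub-matrix split
        w = chunk.clone()
        mask = torch.rand_like(w) < reinit_ratio         # entries to re-initialize
        w[mask] = (torch.randn_like(w) * w.std())[mask]  # fresh init at matched scale
        experts.append(w)
    return experts

w_up = torch.randn(3072, 1024)   # stands in for the Qwen3-0.6B-Base FFN weight
experts = upcycle_ffn(w_up)
print(len(experts), experts[0].shape)  # 8 experts of expert dimension 384
```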
## Post-Training Details
Marco-Nano-Instruct is trained from Marco-Nano-Base using a two-stage post-training pipeline implemented with the SLIME framework:
### Stage 1: Supervised Fine-Tuning (SFT)
- Duration: ~24 hours on 64 GPUs
- Steps: ~4,000 (1 epoch)
- Learning rate: 1e-5 with cosine decay to 1e-6
- Batch size: 512, context length 8,192 tokens
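For reference, the stated schedule corresponds to a standard cosine decay from 1e-5 to 1e-6 over the ~4,000 steps. A minimal sketch, assuming no warmup (the card does not specify one):

```python
import math

def sft_lr(step: int, total_steps: int = 4000,
           lr_max: float = 1e-5, lr_min: float = 1e-6) -> float:
    # Cosine decay from lr_max to lr_min over the single SFT epoch.
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

print(sft_lr(0), sft_lr(2000), sft_lr(4000))  # 1e-05, 5.5e-06, 1e-06
```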
Data sources:
- General instructions — Dolci-Instruct dataset, augmented with Nemotron-Cascade-2 data
- Knowledge-intensive data — Scientific prompts from Nemotron-Cascade-2, responses distilled from Gemini3-Flash
- Translation data — Web-mined NLLB translation pairs, filtered and scored with Qwen3-Embedding-8B (top 10K per language); a filtering sketch follows this list
- Multilingual & cultural data — Wikidata-sourced content with Gemini3-Flash text synthesis for cultural concepts
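The translation-pair filter can be sketched as follows. This is an assumption-level illustration: the card only states that pairs are scored with Qwen3-Embedding-8B and the top 10K per language are kept, so the cosine-similarity scoring, the sentence-transformers loading path, and the tiny example pairs are ours.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Score web-mined (source, target) pairs by cross-lingual cosine similarity
# and keep the top-K per language (the card keeps top 10K; K=1 here for brevity).
model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")

pairs = [
    ("The weather is nice today.", "Il fait beau aujourd'hui."),
    ("The weather is nice today.", "Je voudrais un café."),  # misaligned candidate
]

src = model.encode([s for s, _ in pairs], normalize_embeddings=True)
tgt = model.encode([t for _, t in pairs], normalize_embeddings=True)
scores = (src * tgt).sum(axis=1)            # cosine similarity per pair

K = 1
keep = [pairs[i] for i in np.argsort(-scores)[:K]]
```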
### Stage 2: On-Policy Distillation (OPD)
- Duration: ~110 hours on 64 GPUs
- Steps: ~2,900 total (2 responses sampled per prompt)
- Learning rate: 1e-6 (constant)
Cascaded distillation:
- ~1,900 steps with Qwen3-30B-A3B-Instruct as teacher
- ~1,000 steps with Qwen3-Next-80B-A3B-Instruct as stronger teacher
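The card does not spell out the distillation objective, so the sketch below shows one common formulation of on-policy distillation: a per-token reverse KL toward the teacher, computed on responses the student samples itself (a GKD-style recipe). The `opd_step` function, the prompt masking, and the sampling settings are assumptions; the cascade amounts to swapping the `teacher` handle after ~1,900 steps, and the two-responses-per-prompt setting above corresponds to averaging this step over two draws per prompt.

```python
import torch
import torch.nn.functional as F

def opd_step(student, teacher, prompt_ids, max_new_tokens=128):
    # 1) The student samples its own response (this is what makes it on-policy).
    with torch.no_grad():
        sample = student.generate(prompt_ids, do_sample=True,
                                  max_new_tokens=max_new_tokens)

    # 2) Re-score the sampled sequence under both models.
    s_logits = student(sample).logits[:, :-1]          # grads flow to the student
    with torch.no_grad():
        t_logits = teacher(sample).logits[:, :-1]      # teacher is frozen

    # 3) Reverse KL(student || teacher), averaged over response positions only.
    resp = torch.zeros(sample.shape[1] - 1, dtype=torch.bool)
    resp[prompt_ids.shape[1] - 1:] = True              # mask out prompt positions
    s_logp = F.log_softmax(s_logits[:, resp], dim=-1)
    t_logp = F.log_softmax(t_logits[:, resp], dim=-1)
    return (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
```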
OPD data mixture:
| Category | Datasets | Ratio |
|---|---|---|
| Instruction Following | Nemotron-RL-instruction-following + structured outputs | 25% |
| Knowledge & Reasoning | Nemotron-RL-ReasoningGym-v1 + knowledge-mcqa | 25% |
| Alignment | Nemotron-Cascade-RL-RLHF | 10% |
| Math | DAPO-Math-17k + Skywork-OR1-RL-Data | 10% |
| Multilingual | Translation + Cultural + Nemotron-SFT-Multilingual-v1 | 30% |
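A minimal sketch of how such a mixture can be applied at prompt-sampling time; the category keys are ours, the ratios are from the table:

```python
import random

MIXTURE = {
    "instruction_following": 0.25,
    "knowledge_reasoning": 0.25,
    "alignment": 0.10,
    "math": 0.10,
    "multilingual": 0.30,
}

def sample_category(rng=random):
    # Draw one category per training prompt according to the mixture ratios.
    cats, weights = zip(*MIXTURE.items())
    return rng.choices(cats, weights=weights, k=1)[0]
```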
## Supported Languages
English, Chinese, Arabic, German, Spanish, French, Korean, Japanese, Portuguese, Turkish, Indonesian, Italian, Dutch, Polish, Russian, Vietnamese, Thai, Hebrew, Ukrainian, Malay, Bengali, Czech, Urdu, Kazakh, Greek, Romanian, Hungarian, Nepali, Azerbaijani
## Evaluation

We compare Marco-Nano-Instruct against instruct models of comparable size: Qwen3-1.7B-Instruct (1.7B activated), Qwen3-VL-2B-Instruct (2B activated), Ministral3-3B-Instruct (3.84B activated), LFM2-8B-A1B (1.5B activated), and Granite4-Tiny-Instruct (1.47B activated). Marco-Nano-Instruct uses only 0.6B activated parameters, the smallest among all baselines. Avg@8 scores are reported for all benchmarks, except GlobalMMLU and MMMLU, where Acc@1 is reported.
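As a quick illustration of the protocol, Avg@8 averages correctness over eight sampled responses per item (Acc@1 uses a single response); a minimal sketch:

```python
def avg_at_k(is_correct: list[bool], k: int = 8) -> float:
    # Avg@k: sample k responses per question and average their correctness.
    assert len(is_correct) == k
    return sum(is_correct) / k

print(avg_at_k([True, True, False, True, True, True, False, True]))  # 0.75
```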
### English
| Benchmark | Qwen3-1.7B | Qwen3-VL-2B | Ministral3-3B | LFM2-8B-A1B | Granite4-Tiny | Marco-Nano |
|---|---|---|---|---|---|---|
| MMLU (Acc) | 62.4 | 62.1 | 69.8 | 72.1 | 50.8 | 73.2 |
| MMLU-Redux (Acc) | 62.4 | 62.2 | 69.6 | 71.9 | 51.2 | 73.3 |
| MMLU-Pro (Acc) | 35.2 | 38.3 | 49.5 | 49.5 | 25.3 | 54.5 |
| AGIEval (Acc) | 39.6 | 33.0 | 44.7 | 45.2 | 30.7 | 49.8 |
| GPQA-Diamond (Acc) | 27.5 | 21.0 | 31.6 | 31.9 | 28.3 | 22.2 |
| GSM8K (EM) | 77.9 | 79.7 | 79.0 | 84.6 | 71.1 | 86.7 |
| MATH (EM) | 70.6 | 73.7 | 70.2 | 82.6 | 53.4 | 79.6 |
| Average | 53.7 | 52.9 | 59.2 | 62.5 | 44.4 | 62.8 |
### Multilingual — General
| Benchmark | Qwen3-1.7B | Qwen3-VL-2B | Ministral3-3B | LFM2-8B-A1B | Granite4-Tiny | Marco-Nano |
|---|---|---|---|---|---|---|
| GlobalMMLU (Acc) | 46.3 | 45.9 | 38.4 | 49.0 | 43.0 | 58.7 |
| MMMLU (Acc) | 49.0 | 49.0 | 39.4 | 56.5 | 44.1 | 59.9 |
| MMLU-ProX-Lite (Acc) | 28.6 | 30.3 | 26.7 | 33.8 | 22.1 | 43.2 |
| MGPQA (Acc) | 25.3 | 22.3 | 18.8 | 27.2 | 25.9 | 21.6 |
| FLORES-200 En→Xx (BLEU) | 12.7 | 15.3 | 8.3 | 14.9 | 22.5 | 22.3 |
| FLORES-200 Xx→En (BLEU) | 28.2 | 28.6 | 18.9 | 20.1 | 30.4 | 31.1 |
| WMT24++ En→Xx (BLEU) | 13.2 | 14.6 | 4.4 | 14.6 | 18.9 | 18.7 |
| WMT24++ Xx→En (BLEU) | 26.4 | 26.2 | 8.3 | 17.9 | 25.1 | 27.3 |
| MGSM (EM) | 63.6 | 67.6 | 47.0 | 56.5 | 55.3 | 76.5 |
| PolyMath (EM) | 23.4 | 25.5 | 16.3 | 26.5 | 18.7 | 29.6 |
| Average | 31.7 | 32.5 | 22.7 | 31.7 | 30.6 | 38.9 |
### Multilingual — Cultural & Regional
| Benchmark | Qwen3-1.7B | Qwen3-VL-2B | Ministral3-3B | LFM2-8B-A1B | Granite4-Tiny | Marco-Nano |
|---|---|---|---|---|---|---|
| INCLUDE (Acc) | 44.9 | 44.4 | 35.4 | 43.5 | 38.6 | 54.3 |
| Global-PIQA (Acc) | 62.0 | 65.8 | 50.6 | 60.8 | 63.3 | 70.7 |
| CMMLU (Acc) | 60.4 | 63.3 | 48.9 | 52.7 | 39.2 | 60.0 |
| C-Eval (Acc) | 58.7 | 63.2 | 50.6 | 50.8 | 39.4 | 60.8 |
| ArabicMMLU (Acc) | 48.8 | 46.9 | 22.7 | 56.5 | 43.4 | 56.5 |
| TurkishMMLU (Acc) | 42.7 | 39.6 | 38.6 | 26.3 | 31.6 | 59.9 |
| GreekMMLU (Acc) | 48.7 | 48.0 | 38.4 | 40.0 | 44.8 | 61.6 |
| KazakhMMLU (Acc) | 46.0 | 47.1 | 41.4 | 39.6 | 39.6 | 56.3 |
| IndoMMLU (Acc) | 48.8 | 49.3 | 35.2 | 41.1 | 37.2 | 56.3 |
| IndoCareer (Acc) | 46.1 | 45.7 | 36.0 | 41.7 | 34.7 | 54.9 |
| IndoCulture (Acc) | 45.8 | 47.7 | 37.2 | 45.9 | 42.8 | 59.1 |
| Average | 50.3 | 51.0 | 39.5 | 45.4 | 41.3 | 59.1 |
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AIDC-AI/Marco-Nano-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

# Apply the chat template and move the token ids to the model's device
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
## Citation

```bibtex
@article{marco-moe,
  title={Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling},
  author={Fan Jiang and Yu Zhao and Chenyang Lyu and Tianqi Shi and Yichao Du and Feihu Jiang and Longyue Wang and Weihua Luo},
  year={2026}
}
```
## License
This model is released under the Apache 2.0 License.