Marco-Nano-Instruct

Marco-Nano-Instruct is the post-trained variant of Marco-Nano-Base, a highly sparse Mixture-of-Experts (MoE) multilingual language model from the Marco-MoE family, developed by Alibaba International Digital Commerce. It activates only 0.6B of its 8B total parameters per token (a 7.5% activation ratio). Despite this extreme sparsity, Marco-Nano-Instruct achieves the best average performance on English, multilingual general, and multilingual cultural benchmarks among comparable instruct models with up to 3.84B activated parameters.

Model Description

Marco-Nano-Instruct shares the same architecture as Marco-Nano-Base: a decoder-only Transformer with sparse MoE layers replacing standard FFN layers, upcycled from Qwen3-0.6B-Base using fine-grained sub-matrix splitting combined with Drop-Upcycling.

| Configuration | Value |
|---|---|
| Total Parameters | 8B |
| Activated Parameters | 0.6B |
| Activation Ratio | 7.5% |
| Num Layers | 28 |
| Model Dimension | 1024 |
| FFN Intermediate Dimension | 3072 |
| Q-Heads | 16 |
| KV-Heads | 8 |
| Head Dimension | 128 |
| Expert Dimension | 384 |
| Total Experts | 232 |
| Activated Experts | 8 |
| Tie Embeddings | True |
| Training FLOPs | $1.40 \times 10^{23}$ |
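
As a rough sanity check, the total and activated parameter counts can be approximately reproduced from the dimensions listed above. The sketch below rests on assumptions that the table does not state: SwiGLU experts with three projection matrices each (gate, up, down), one linear router per layer, and a vocabulary of 151,936 tokens (the Qwen3 tokenizer size). Normalization weights are ignored.

```python
# Back-of-the-envelope parameter count from the configuration table.
# Assumptions (not stated in the table): SwiGLU experts with gate/up/down
# projections, one linear router per layer, vocabulary size 151,936.
NUM_LAYERS = 28
D_MODEL = 1024
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 8, 128
EXPERT_DIM = 384
TOTAL_EXPERTS, ACTIVE_EXPERTS = 232, 8
VOCAB_SIZE = 151_936  # assumed, not from the table


def count_params(experts_per_layer: int) -> int:
    """Parameters when `experts_per_layer` experts are counted in each layer."""
    # Q and O projections use Q_HEADS, K and V projections use KV_HEADS.
    attention = D_MODEL * HEAD_DIM * (2 * Q_HEADS + 2 * KV_HEADS)
    expert = 3 * D_MODEL * EXPERT_DIM  # gate, up, down
    router = D_MODEL * TOTAL_EXPERTS
    per_layer = attention + router + experts_per_layer * expert
    embeddings = VOCAB_SIZE * D_MODEL  # counted once because embeddings are tied
    return NUM_LAYERS * per_layer + embeddings


total = count_params(TOTAL_EXPERTS)
active = count_params(ACTIVE_EXPERTS)
print(f"total  ~ {total / 1e9:.2f}B")
print(f"active ~ {active / 1e9:.2f}B")
print(f"activation ratio ~ {active / total:.1%}")

# The 8 activated experts together match the dense FFN width: 8 * 384 = 3072.
print(ACTIVE_EXPERTS * EXPERT_DIM)
```

Under these assumptions the estimate lands close to the 8B total and 0.6B activated figures in the table, and the combined width of the activated experts equals the FFN intermediate dimension.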

Post-Training Details

Marco-Nano-Instruct is trained from Marco-Nano-Base using a two-stage post-training pipeline implemented with the SLIME framework:

Stage 1: Supervised Fine-Tuning (SFT)

  • Duration: ~24 hours on 64 GPUs
  • Steps: ~4,000 (1 epoch)
  • Learning rate: 1e-5 with cosine decay to 1e-6
  • Batch size: 512, context length 8,192 tokens
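
The learning-rate schedule above can be sketched as a half-cosine decay from 1e-5 to 1e-6. This is a minimal illustration, assuming no warmup phase (none is mentioned) and treating the approximate step count as exact; the function name `cosine_lr` is hypothetical and is not taken from the SLIME framework.

```python
import math

PEAK_LR = 1e-5
FINAL_LR = 1e-6
TOTAL_STEPS = 4_000  # approximate step count from the list above


def cosine_lr(step: int) -> float:
    """Learning rate at `step`, decaying from PEAK_LR to FINAL_LR along a half cosine."""
    progress = min(max(step, 0), TOTAL_STEPS) / TOTAL_STEPS
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 2_000, 3_000, 4_000):
    print(step, f"{cosine_lr(step):.2e}")

# Upper bound on tokens seen in SFT, assuming the batch size counts sequences
# and every sequence fills the full 8,192-token context.
max_tokens = 512 * 8_192 * TOTAL_STEPS
print(f"{max_tokens / 1e9:.1f}B tokens at most")
```

The token figure is only an upper bound: real batches usually contain padding or shorter sequences.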

Data sources:

  1. General instructions: Dolci-Instruct dataset, augmented with Nemotron-Cascade-2 data
  2. Knowledge-intensive data: scientific prompts from Nemotron-Cascade-2, with responses distilled from Gemini3-Flash
  3. Translation data: web-mined NLLB translation pairs, filtered and scored with Qwen3-Embedding-8B (top 10K per language)
  4. Multilingual & cultural data: Wikidata-sourced content, with Gemini3-Flash text synthesis for cultural concepts
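
The "top 10K per language" selection for translation data can be illustrated with a small filter. This is a sketch only: it assumes pairs are scored by cosine similarity between source and target embeddings, which the list above does not specify, and the dictionary keys and function names are hypothetical. Tiny hand-written vectors stand in for Qwen3-Embedding-8B outputs.

```python
import math
from collections import defaultdict
from heapq import nlargest


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_per_language(pairs, k=10_000):
    """Keep the k highest-scoring translation pairs for each language.

    `pairs` is an iterable of dicts with hypothetical keys
    "lang", "src", "tgt", "src_emb", and "tgt_emb".
    """
    by_lang = defaultdict(list)
    for pair in pairs:
        score = cosine_similarity(pair["src_emb"], pair["tgt_emb"])
        by_lang[pair["lang"]].append((score, pair))
    return {
        lang: [p for _, p in nlargest(k, scored, key=lambda item: item[0])]
        for lang, scored in by_lang.items()
    }


# Toy data: two German pairs (one well aligned, one not) and one Turkish pair.
toy_pairs = [
    {"lang": "de", "src": "a", "tgt": "b", "src_emb": [1.0, 0.0], "tgt_emb": [1.0, 0.0]},
    {"lang": "de", "src": "c", "tgt": "d", "src_emb": [1.0, 0.0], "tgt_emb": [0.0, 1.0]},
    {"lang": "tr", "src": "e", "tgt": "f", "src_emb": [1.0, 1.0], "tgt_emb": [1.0, 0.0]},
]
kept = top_k_per_language(toy_pairs, k=1)
print({lang: [p["src"] for p in items] for lang, items in kept.items()})
```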

Stage 2: On-Policy Distillation (OPD)

  • Duration: ~110 hours on 64 GPUs
  • Steps: ~2,900 total (2 responses sampled per prompt)
  • Learning rate: 1e-6 (constant)
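
The exact OPD objective is not stated here. In a common formulation of on-policy distillation, the student samples its own responses and is trained to match the teacher's token distribution on them, often by minimizing a per-token reverse KL divergence. The sketch below illustrates that formulation on plain lists of logits; it is an assumption about the general technique, not a description of this model's training code.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits: list[float], teacher_logits: list[float]) -> float:
    """KL(student || teacher) at a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def sequence_loss(student_steps, teacher_steps) -> float:
    """Mean per-token reverse KL over one student-sampled response."""
    losses = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(losses) / len(losses)


# Toy response of two tokens over a three-word vocabulary.
# The teacher agrees with the student at the first position only.
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[2.0, 0.5, -1.0], [1.5, -0.5, 0.0]]
loss = sequence_loss(student, teacher)
print(f"{loss:.4f}")
```

Because the responses come from the student itself, the loss is evaluated on the states the student actually visits, which is what distinguishes this setup from distillation on teacher-generated text.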

Cascaded distillation:

  1. ~1,900 steps with Qwen3-30B-A3B-Instruct as teacher
  2. ~1,000 steps with Qwen3-Next-80B-A3B-Instruct as stronger teacher
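
The cascade amounts to switching teachers at a fixed step boundary. A minimal sketch, treating the approximate step counts as exact; `teacher_for_step` is a hypothetical helper name.

```python
# (teacher name, number of steps), in training order, from the list above.
CASCADE = [
    ("Qwen3-30B-A3B-Instruct", 1_900),
    ("Qwen3-Next-80B-A3B-Instruct", 1_000),
]


def teacher_for_step(step: int) -> str:
    """Return the teacher used at a 0-indexed OPD training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    boundary = 0
    for name, num_steps in CASCADE:
        boundary += num_steps
        if step < boundary:
            return name
    raise ValueError(f"step {step} is beyond the {boundary}-step schedule")


for step in (0, 1_899, 1_900, 2_899):
    print(step, teacher_for_step(step))
```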

OPD data mixture:

| Category | Datasets | Ratio |
|---|---|---|
| Instruction Following | Nemotron-RL-instruction-following + structured outputs | 25% |
| Knowledge & Reasoning | Nemotron-RL-ReasoningGym-v1 + knowledge-mcqa | 25% |
| Alignment | Nemotron-Cascade-RL-RLHF | 10% |
| Math | DAPO-Math-17k + Skywork-OR1-RL-Data | 10% |
| Multilingual | Translation + Cultural + Nemotron-SFT-Multilingual-v1 | 30% |
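
One way to realize such a mixture is to draw the category of each prompt with probability proportional to its ratio. The sketch below assumes the ratios are per-prompt sampling weights, which the table does not specify; `sample_categories` is a hypothetical helper.

```python
import random
from collections import Counter

# Ratios (in percent) from the OPD data mixture table.
OPD_MIXTURE = {
    "Instruction Following": 25,
    "Knowledge & Reasoning": 25,
    "Alignment": 10,
    "Math": 10,
    "Multilingual": 30,
}


def sample_categories(num_prompts: int, seed: int = 0) -> list[str]:
    """Draw one category per prompt, weighted by the mixture ratios."""
    rng = random.Random(seed)
    names = list(OPD_MIXTURE)
    weights = list(OPD_MIXTURE.values())
    return rng.choices(names, weights=weights, k=num_prompts)


counts = Counter(sample_categories(10_000))
for name, ratio in OPD_MIXTURE.items():
    print(f"{name:<22} target {ratio:>2}%  sampled {counts[name] / 100:.1f}%")
```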

Supported Languages

English, Chinese, Arabic, German, Spanish, French, Korean, Japanese, Portuguese, Turkish, Indonesian, Italian, Dutch, Polish, Russian, Vietnamese, Thai, Hebrew, Ukrainian, Malay, Bengali, Czech, Urdu, Kazakh, Greek, Romanian, Hungarian, Nepali, Azerbaijani

Evaluation

We compare Marco-Nano-Instruct against instruct models of comparable size: Qwen3-1.7B-Instruct (1.7B activated), Qwen3-VL-2B-Instruct (2B activated), Ministral3-3B-Instruct (3.84B activated), LFM2-8B-A1B (1.5B activated), and Granite4-Tiny-Instruct (1.47B activated). Marco-Nano-Instruct uses only 0.6B activated parameters, fewer than any of the baselines. We report Avg@8 accuracy, except for GlobalMMLU and MMMLU, where we report Acc@1.
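
To make the two metrics concrete, the sketch below assumes that Avg@8 is the mean accuracy over eight sampled responses per problem and that Acc@1 scores a single response per problem. These definitions are the usual reading of the names and are not spelled out above; the function names are hypothetical.

```python
def avg_at_k(correct: list[list[bool]]) -> float:
    """Mean accuracy (in percent) over all sampled responses.

    `correct[i][j]` says whether the j-th sampled response to problem i is right.
    """
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


def acc_at_1(correct: list[list[bool]]) -> float:
    """Accuracy (in percent) using only the first response to each problem."""
    return 100.0 * sum(samples[0] for samples in correct) / len(correct)


# Three toy problems with eight samples each: always right, right half the
# time, and never right.
results = [
    [True] * 8,
    [True] * 4 + [False] * 4,
    [False] * 8,
]
print(f"Avg@8 = {avg_at_k(results):.1f}")
print(f"Acc@1 = {acc_at_1(results):.1f}")
```

Averaging over several samples reduces the variance that a single sampled response would introduce, which matters most on small benchmarks.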

English

| Benchmark | Qwen3-1.7B | Qwen3-VL-2B | Ministral3-3B | LFM2-8B-A1B | Granite4-Tiny | Marco-Nano |
|---|---|---|---|---|---|---|
| MMLU (Acc) | 62.4 | 62.1 | 69.8 | 72.1 | 50.8 | 73.2 |
| MMLU-Redux (Acc) | 62.4 | 62.2 | 69.6 | 71.9 | 51.2 | 73.3 |
| MMLU-Pro (Acc) | 35.2 | 38.3 | 49.5 | 49.5 | 25.3 | 54.5 |
| AGIEval (Acc) | 39.6 | 33.0 | 44.7 | 45.2 | 30.7 | 49.8 |
| GPQA-Diamond (Acc) | 27.5 | 21.0 | 31.6 | 31.9 | 28.3 | 22.2 |
| GSM8K (EM) | 77.9 | 79.7 | 79.0 | 84.6 | 71.1 | 86.7 |
| MATH (EM) | 70.6 | 73.7 | 70.2 | 82.6 | 53.4 | 79.6 |
| Average | 53.7 | 52.9 | 59.2 | 62.5 | 44.4 | 62.8 |
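
The Average row is consistent with an unweighted mean over the seven benchmarks, rounded to one decimal. A quick check on the two highest-scoring columns, with the scores copied from the table above:

```python
# Per-benchmark scores in table order: MMLU, MMLU-Redux, MMLU-Pro, AGIEval,
# GPQA-Diamond, GSM8K, MATH.
ENGLISH_SCORES = {
    "LFM2-8B-A1B": [72.1, 71.9, 49.5, 45.2, 31.9, 84.6, 82.6],
    "Marco-Nano": [73.2, 73.3, 54.5, 49.8, 22.2, 86.7, 79.6],
}


def macro_average(scores: list[float]) -> float:
    """Unweighted mean across benchmarks, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


averages = {model: macro_average(s) for model, s in ENGLISH_SCORES.items()}
print(averages)  # {'LFM2-8B-A1B': 62.5, 'Marco-Nano': 62.8}
```

Because every benchmark carries equal weight, a single low score such as GPQA-Diamond pulls the average down by the same amount as any other benchmark would.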

Multilingual — General

| Benchmark | Qwen3-1.7B | Qwen3-VL-2B | Ministral3-3B | LFM2-8B-A1B | Granite4-Tiny | Marco-Nano |
|---|---|---|---|---|---|---|
| GlobalMMLU (Acc) | 46.3 | 45.9 | 38.4 | 49.0 | 43.0 | 58.7 |
| MMMLU (Acc) | 49.0 | 49.0 | 39.4 | 56.5 | 44.1 | 59.9 |
| MMLU-ProX-Lite (Acc) | 28.6 | 30.3 | 26.7 | 33.8 | 22.1 | 43.2 |
| MGPQA (Acc) | 25.3 | 22.3 | 18.8 | 27.2 | 25.9 | 21.6 |
| FLORES-200 En→Xx (BLEU) | 12.7 | 15.3 | 8.3 | 14.9 | 22.5 | 22.3 |
| FLORES-200 Xx→En (BLEU) | 28.2 | 28.6 | 18.9 | 20.1 | 30.4 | 31.1 |
| WMT24++ En→Xx (BLEU) | 13.2 | 14.6 | 4.4 | 14.6 | 18.9 | 18.7 |
| WMT24++ Xx→En (BLEU) | 26.4 | 26.2 | 8.3 | 17.9 | 25.1 | 27.3 |
| MGSM (EM) | 63.6 | 67.6 | 47.0 | 56.5 | 55.3 | 76.5 |
| PolyMath (EM) | 23.4 | 25.5 | 16.3 | 26.5 | 18.7 | 29.6 |
| Average | 31.7 | 32.5 | 22.7 | 31.7 | 30.6 | 38.9 |

Multilingual — Cultural & Regional

| Benchmark | Qwen3-1.7B | Qwen3-VL-2B | Ministral3-3B | LFM2-8B-A1B | Granite4-Tiny | Marco-Nano |
|---|---|---|---|---|---|---|
| INCLUDE (Acc) | 44.9 | 44.4 | 35.4 | 43.5 | 38.6 | 54.3 |
| Global-PIQA (Acc) | 62.0 | 65.8 | 50.6 | 60.8 | 63.3 | 70.7 |
| CMMLU (Acc) | 60.4 | 63.3 | 48.9 | 52.7 | 39.2 | 60.0 |
| C-Eval (Acc) | 58.7 | 63.2 | 50.6 | 50.8 | 39.4 | 60.8 |
| ArabicMMLU (Acc) | 48.8 | 46.9 | 22.7 | 56.5 | 43.4 | 56.5 |
| TurkishMMLU (Acc) | 42.7 | 39.6 | 38.6 | 26.3 | 31.6 | 59.9 |
| GreekMMLU (Acc) | 48.7 | 48.0 | 38.4 | 40.0 | 44.8 | 61.6 |
| KazakhMMLU (Acc) | 46.0 | 47.1 | 41.4 | 39.6 | 39.6 | 56.3 |
| IndoMMLU (Acc) | 48.8 | 49.3 | 35.2 | 41.1 | 37.2 | 56.3 |
| IndoCareer (Acc) | 46.1 | 45.7 | 36.0 | 41.7 | 34.7 | 54.9 |
| IndoCulture (Acc) | 45.8 | 47.7 | 37.2 | 45.9 | 42.8 | 59.1 |
| Average | 50.3 | 51.0 | 39.5 | 45.4 | 41.3 | 59.1 |

Usage

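from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AIDC-AI/Marco-Nano-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" places the weights on the available devices (requires the accelerate package).
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user", "content": "What is the capital of France?"}
]
# Render the chat template and tokenize; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))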
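
For multilingual use, the same chat interface can carry a translation request. The helper below is a hypothetical sketch: the prompt wording is an assumption, not a format prescribed by the model, and the language set is copied from the Supported Languages section. The resulting `messages` list can be passed to `tokenizer.apply_chat_template` exactly as in the example above.

```python
# Languages listed under "Supported Languages".
SUPPORTED_LANGUAGES = {
    "English", "Chinese", "Arabic", "German", "Spanish", "French", "Korean",
    "Japanese", "Portuguese", "Turkish", "Indonesian", "Italian", "Dutch",
    "Polish", "Russian", "Vietnamese", "Thai", "Hebrew", "Ukrainian", "Malay",
    "Bengali", "Czech", "Urdu", "Kazakh", "Greek", "Romanian", "Hungarian",
    "Nepali", "Azerbaijani",
}


def build_translation_messages(text: str, source: str, target: str) -> list[dict]:
    """Build a single-turn chat request asking for a translation.

    The prompt wording is illustrative only (hypothetical, not an official format).
    """
    for language in (source, target):
        if language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language: {language}")
    prompt = f"Translate the following text from {source} to {target}:\n\n{text}"
    return [{"role": "user", "content": prompt}]


messages = build_translation_messages(
    "Where is the train station?", source="English", target="Turkish"
)
print(messages[0]["content"])
```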

Citation

@article{marco-moe,
  title={Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling},
  author={Fan Jiang and Yu Zhao and Chenyang Lyu and Tianqi Shi and Yichao Du and Feihu Jiang and Longyue Wang and Weihua Luo},
  year={2026}
}

License

This model is released under the Apache 2.0 License.
