Nandi-Mini-600M-Early-Checkpoint

Introduction

Nandi-Mini-600M-Early-Checkpoint is an early-stage checkpoint (after 250 billion tokens) from the upcoming Nandi-Mini-600M model family, a compact multilingual language model focused on strong efficiency, deployment flexibility, and Indic language support.

The model is being trained completely from scratch and is designed to deliver strong performance at low compute and memory budgets. This checkpoint is shared to provide an early look into the model's scaling behavior and training progress.

This release is an early checkpoint and not the final converged model. Performance is expected to improve further with continued training and scaling.

📢 We will share a technical blog soon. Stay tuned!


Architectural Highlights

Nandi-Mini-600M introduces several efficiency-focused architectural optimizations designed for compact yet capable language models.

Shared KV (Shared Key-Value Vectors)

Shared KV is one of the core architectural ideas explored in Nandi-Mini. Instead of computing separate Key and Value projections, both reuse a shared latent representation, while a lightweight Key normalization step is applied specifically for attention computation.

This design reduces KV-cache memory usage by ~50% during inference with only a small increase in compute overhead, since RoPE and Key normalization are applied dynamically during attention computation.
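
To make this concrete, below is a minimal, illustrative sketch of a shared-KV attention layer in PyTorch. It is not the actual Nandi implementation: the module name, dimensions, head counts, and the RMS-norm epsilon are assumptions, and RoPE is omitted; only the shared latent and the on-the-fly key normalization follow the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttention(nn.Module):
    """Toy grouped-query attention where Key and Value reuse one cached latent."""

    def __init__(self, hidden_size=1024, num_heads=16, num_kv_heads=4):
        super().__init__()
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, num_heads * self.head_dim, bias=False)
        # A single projection replaces the separate k_proj / v_proj pair.
        self.kv_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.k_scale = nn.Parameter(torch.ones(self.head_dim))  # key RMS-norm gain
        self.o_proj = nn.Linear(num_heads * self.head_dim, hidden_size, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        # `kv` is the shared latent; at inference time this single tensor is what
        # gets cached, which is where the ~50% KV-cache saving comes from.
        kv = self.kv_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Keys are derived on the fly by RMS-normalizing the shared latent
        # (the real model also applies RoPE at this point).
        k = kv * torch.rsqrt(kv.pow(2).mean(-1, keepdim=True) + 1e-6) * self.k_scale
        v = kv  # values reuse the shared latent directly
        rep = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(rep, dim=1)  # expand KV heads to match query heads (GQA)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

layer = SharedKVAttention()
print(layer(torch.randn(2, 16, 1024)).shape)  # torch.Size([2, 16, 1024])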

Nandi supports two KV cache modes:

"kv_cache_mode": "shared"

Uses Shared KV, reducing KV-cache memory by ~50% with slightly higher compute overhead.

"kv_cache_mode": "vanilla"

Uses standard separate Key-Value caching for maximum inference compatibility and lower compute overhead.

KV-Cache Memory Comparison

  • Vanilla KV → Standard KV-cache memory usage
  • Shared KV → ~50% lower KV-cache footprint
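
As a rough back-of-envelope check of this comparison, the cache footprint can be estimated directly from the model shape. The layer, head, and dimension numbers below are hypothetical, chosen only to illustrate the halving; they are not the published Nandi configuration.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2, shared=False):
    """Rough KV-cache size estimate (bf16 = 2 bytes per value)."""
    tensors_per_layer = 1 if shared else 2  # Shared KV caches one latent; vanilla caches K and V
    return (tensors_per_layer * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical shapes for a ~600M-class model, NOT the actual Nandi config:
cfg = dict(num_layers=24, num_kv_heads=4, head_dim=64, seq_len=2048)
print("vanilla:", kv_cache_bytes(**cfg) / 2**20, "MiB")               # 48.0 MiB
print("shared :", kv_cache_bytes(**cfg, shared=True) / 2**20, "MiB")  # 24.0 MiB, ~50% lower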

Shared KV is part of our broader focus on deployable foundation models optimized for:

  • On-premise AI systems
  • Memory-constrained deployments
  • Edge devices
  • Long-context inference workloads

This remains an active research area within the Nandi model family, and we plan to share deeper technical details in upcoming engineering blogs.


Model Details

  • Type: Causal Language Model
  • Training Stage: Early Pretraining Checkpoint (250 billion tokens)
  • Parameters: ~600M
  • Architecture: Transformer decoder
  • Positional Encoding: RoPE
  • Normalization: RMSNorm + QK Norm
  • Activation: SwiGLU
  • Attention: GQA + Shared KV
  • Embeddings: Tied embeddings with factorized design
  • Context length: 2,048 tokens (planned to be extended to 32,000 tokens)
  • Vocabulary Size: 131,072
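
To illustrate what the factorized, tied embedding design buys at this vocabulary size, here is a quick parameter count. The hidden size and factorization rank below are illustrative assumptions, not published values; only the vocabulary size comes from the details above.

vocab_size = 131_072   # from the model details above
hidden_size = 1024     # assumed, for illustration only
factor_dim = 256       # assumed factorization rank

full_embedding = vocab_size * hidden_size                         # standard embedding table
factorized = vocab_size * factor_dim + factor_dim * hidden_size   # two smaller matrices

print(f"full:       {full_embedding / 1e6:.1f}M params")  # ~134.2M
print(f"factorized: {factorized / 1e6:.1f}M params")      # ~33.8M

Because the embeddings are tied, the input embedding and the output projection share these matrices, so the table is stored only once.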

📊 Benchmark Results

General Benchmarks

| Model | Trained Tokens (T) | HellaSwag | WinoGrande | OBQA | PIQA | GPQA | ARC-e | ARC-c | MMLU | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| MobiLlama-0.5B-Base | 1.3 | 39.65 | 53.67 | 30.60 | 70.35 | 24.33 | 52.82 | 23.63 | 24.18 | 39.90 |
| Qwen-2-0.5B-Base | 12 | 49.01 | 57.69 | 33.20 | 68.98 | 27.23 | 54.79 | 25.42 | 44.06 | 45.05 |
| Qwen2.5-0.5B-Base | 18 | 52.16 | 56.82 | 35.40 | 70.29 | 24.10 | 64.64 | 29.86 | 47.41 | 47.59 |
| Qwen3-0.6B-Base | 36 | 53.77 | 59.19 | 34.40 | 70.29 | 30.80 | 65.44 | 33.78 | 50.34 | 49.75 |
| Qwen3.5-0.8B-Base | 36 | 54.87 | 60.54 | 35.80 | 70.02 | 31.25 | 70.50 | 38.23 | 52.73 | 51.74 |
| SmolLM-360M-Base | 0.6 | 53.33 | 57.22 | 37.60 | 70.56 | 21.20 | 70.24 | 33.27 | 24.92 | 46.04 |
| SmolLM2-360M-Base | 4 | 56.30 | 59.19 | 37.60 | 71.81 | 25.22 | 67.88 | 36.68 | 25.55 | 47.53 |
| Nandi-Mini-600M-Early-Checkpoint-Base | 0.2 | 44.86 | 54.77 | 34.80 | 68.60 | 26.33 | 64.73 | 29.70 | 29.01 | 44.10 |

Tokenization Fertility Score Across Languages

| Language | SmolLM3-3B | Qwen3-0.6B-Base | Sarvam-1 | Nandi-Mini-600M |
|---|---|---|---|---|
| English | 1.17 | 1.16 | 1.32 | 1.18 |
| Bengali | 8.66 | 7.51 | 1.55 | 1.44 |
| Gujarati | 10.47 | 9.37 | 1.55 | 1.53 |
| Hindi | 2.71 | 5.14 | 1.25 | 1.32 |
| Kannada | 16.43 | 12.96 | 2.10 | 1.90 |
| Malayalam | 17.77 | 14.56 | 2.49 | 2.05 |
| Marathi | 3.73 | 6.70 | 1.55 | 1.55 |
| Oriya | 19.07 | 15.75 | 2.18 | 2.68 |
| Punjabi | 9.23 | 8.66 | 1.47 | 1.42 |
| Tamil | 13.56 | 10.93 | 2.06 | 2.05 |
| Telugu | 15.40 | 13.38 | 2.09 | 1.77 |
| Assamese | 9.26 | 8.13 | 4.31 | 1.51 |
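
Fertility here is the usual tokenizer metric: the average number of subword tokens produced per word, so lower is better. The exact word segmentation and evaluation corpora behind the table above are not specified; the snippet below is only a simple whitespace-based estimate, and the example sentences are our own.

from transformers import AutoTokenizer

def fertility(tokenizer, text):
    """Average number of subword tokens per whitespace-separated word."""
    words = text.split()
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return len(token_ids) / len(words)

tok = AutoTokenizer.from_pretrained(
    "FrontiersMind/Nandi-Mini-600M-Early-Checkpoint",
    trust_remote_code=True
)
print(fertility(tok, "The night was quiet and the streets were empty"))  # English
print(fertility(tok, "रात शांत थी और सड़कें खाली थीं"))  # Hindi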

🌍 Supported Languages

The model is trained on English and a diverse set of Indic languages, including:

Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia

🚀 Usage

!pip install transformers==5.4.0

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "FrontiersMind/Nandi-Mini-600M-Early-Checkpoint"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    dtype=torch.bfloat16
).to(device).eval()


# Uncomment to use Shared KV: ~50% lower KV-cache memory, at slightly higher compute cost
# model.config.kv_cache_mode = "shared"
model.config.kv_cache_mode = "vanilla"  # standard separate K/V caching

prompt = """The night was quiet and the streets were empty"""

model_inputs = tokenizer(
    [prompt],
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **model_inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.3,
    top_k=20,
    top_p=0.95,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id,
    use_cache=True,  # enable the KV cache
)

response = tokenizer.decode(
    outputs[0],
    skip_special_tokens=True
)

print(response)