Instructions to use shibatch/tiny1m with libraries, inference providers, notebooks, and local apps.

Libraries

How to use shibatch/tiny1m with Transformers:

```python
# Load the model directly; the Hugging Face files live in the hf/ subfolder
from transformers import AutoModel
model = AutoModel.from_pretrained("shibatch/tiny1m", subfolder="hf", dtype="auto")
```

llama-cpp-python

How to use shibatch/tiny1m with llama-cpp-python:

```python
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="shibatch/tiny1m",
    filename="tiny1m.BF16.gguf",
)

output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)
```

Local Apps

llama.cpp

How to use shibatch/tiny1m with llama.cpp:

Install (macOS, Linux)

```
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf shibatch/tiny1m:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf shibatch/tiny1m:Q4_K_M
```

Install from WinGet (Windows)

```
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf shibatch/tiny1m:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf shibatch/tiny1m:Q4_K_M
```

Use pre-built binary

```
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf shibatch/tiny1m:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf shibatch/tiny1m:Q4_K_M
```

Build from source code

```
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf shibatch/tiny1m:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf shibatch/tiny1m:Q4_K_M
```

Use Docker

```
docker model run hf.co/shibatch/tiny1m:Q4_K_M
```

Ollama
How to use shibatch/tiny1m with Ollama:
```
ollama run hf.co/shibatch/tiny1m:Q4_K_M
```

Unsloth Studio

How to use shibatch/tiny1m with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

```
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for shibatch/tiny1m to start chatting
```

Install Unsloth Studio (Windows)

```
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for shibatch/tiny1m to start chatting
```

Using HuggingFace Spaces for Unsloth

```
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for shibatch/tiny1m to start chatting
```

Docker Model Runner
How to use shibatch/tiny1m with Docker Model Runner:
```
docker model run hf.co/shibatch/tiny1m:Q4_K_M
```

Lemonade

How to use shibatch/tiny1m with Lemonade:

Pull the model

```
# Download Lemonade from https://lemonade-server.ai/
lemonade pull shibatch/tiny1m:Q4_K_M
```

Run and chat with the model

```
lemonade run user.tiny1m-Q4_K_M
```

List all available models

```
lemonade list
```

TinyStories Llama2 1M (tiny1m) GGUF & HF Validation Suite

This repository provides a very small Llama 2 model, trained on the TinyStories dataset, in several formats (GGUF, Hugging Face Safetensors, and llama2.c binaries) that are compatible with Andrej Karpathy's llama2.c and with llama.cpp.

Why this repository exists

When developing a custom LLM inference engine, debugging with a full-sized model is slow. This suite offers a model at a true 1M-parameter scale (roughly 0.4 MB to 4 MB, depending on the quantization format), so developers can validate their loaders, serialization, quantization kernels, and inference logic step by step with load and run times that are negligible.

📂 Repository Structure & File Descriptions

1. GGUF Formats (Root Directory `./`)

A comprehensive validation suite converted for llama.cpp and compatible engines. Every compiled quantization variant available in the root directory is explicitly covered below:

| Filename(s) / Wildcard Pattern | Type | Size | Purpose / Validation Target |
| --- | --- | --- | --- |
| `tiny1m.F32.gguf` | `F32` | ~4.0 MB | Baseline Test. Validates GGUF parsing, tensor layout, matrix multiplication, RoPE, and Attention logic without dequantization overhead. |
| `tiny1m.F16.gguf`, `tiny1m.BF16.gguf` | `F16`, `BF16` | ~2.0 MB | Half-Precision Test. Validates 16-bit floating point loading, type casting, and inference stability. |
| `tiny1m.Q8_0.gguf` | `Q8_0` | ~1.1 MB | Quantization Level 1. Validates block-based uniform scaling with 32 elements. |
| `tiny1m.Q4_0.gguf`, `tiny1m.Q4_1.gguf` | `Q4_0`, `Q4_1` | ~0.7 MB | Quantization Level 2. Validates classic 4-bit linear quantization and bit-unpacking logic. |
| `tiny1m.Q2_K.gguf` | `Q2_K` | ~0.5 MB | Standard K-Quant (2-bit). Validates 2-bit super-block quantization parsing. |
| `tiny1m.Q3_K_*.gguf`: `tiny1m.Q3_K_S.gguf`, `tiny1m.Q3_K_M.gguf`, `tiny1m.Q3_K_L.gguf` | `Q3_K` | ~0.6 MB | Standard K-Quant (3-bit). Validates Small, Medium, and Large sub-variants of 3-bit multi-block structures. |
| `tiny1m.Q4_K_*.gguf`: `tiny1m.Q4_K_S.gguf`, `tiny1m.Q4_K_M.gguf` | `Q4_K` | ~0.7 MB | Standard K-Quant (4-bit). Validates Small and Medium sub-variants of modern 4-bit super-block structural parsing. |
| `tiny1m.Q5_K_*.gguf`: `tiny1m.Q5_K_S.gguf`, `tiny1m.Q5_K_M.gguf` | `Q5_K` | ~0.8 MB | Standard K-Quant (5-bit). Validates Small and Medium sub-variants of 5-bit mixed precision super-blocks. |
| `tiny1m.Q6_K.gguf` | `Q6_K` | ~0.9 MB | Standard K-Quant (6-bit). Validates 6-bit high-fidelity super-block quantization. |
| `tiny1m.IQ3_*.gguf`: `tiny1m.IQ3_XXS.gguf`, `tiny1m.IQ3_S.gguf` | `I-Quants` | ~0.5 MB | Importance Quants (3-bit). Non-linear 3-bit importance quantization targeting lookup table (codebook) decoding logic. |
| `tiny1m.IQ4_*.gguf`: `tiny1m.IQ4_NL.gguf`, `tiny1m.IQ4_XS.gguf` | `I-Quants` | ~0.6 MB | Importance Quants (4-bit). Non-linear 4-bit importance quantization variants (Non-Linear and Extra Small). |
| `tiny1m.TQ1_0.gguf`, `tiny1m.TQ2_0.gguf` | `Ternary` | ~0.4 MB | Experimental. Ternary (-1, 0, 1) state quantization for cutting-edge engine testing. |
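The `Q8_0` and `Q4_0` targets are the easiest kernels to check by hand. The sketch below decodes a single block of each type, assuming the ggml block layouts: a `Q8_0` block is a little-endian fp16 scale followed by 32 signed bytes, and a `Q4_0` block is an fp16 scale followed by 16 bytes whose low nibbles hold elements 0..15 and whose high nibbles hold elements 16..31, each stored with an offset of 8. The function names are illustrative and are not part of llama.cpp.

```python
import struct

QK = 32  # elements per block for both Q8_0 and Q4_0


def dequantize_q8_0_block(block: bytes) -> list:
    """Decode one 34-byte Q8_0 block: fp16 scale + 32 signed 8-bit values."""
    if len(block) != 2 + QK:
        raise ValueError("a Q8_0 block must be 34 bytes")
    (scale,) = struct.unpack_from("<e", block, 0)
    quants = struct.unpack_from("<32b", block, 2)
    return [scale * q for q in quants]


def dequantize_q4_0_block(block: bytes) -> list:
    """Decode one 18-byte Q4_0 block: fp16 scale + 16 bytes of packed nibbles."""
    if len(block) != 2 + QK // 2:
        raise ValueError("a Q4_0 block must be 18 bytes")
    (scale,) = struct.unpack_from("<e", block, 0)
    out = [0.0] * QK
    for j, byte in enumerate(block[2:]):
        # Low nibble -> first half of the block, high nibble -> second half.
        out[j] = scale * ((byte & 0x0F) - 8)
        out[j + QK // 2] = scale * ((byte >> 4) - 8)
    return out
```

Comparing the output of a custom kernel against functions like these on a few blocks from `tiny1m.Q8_0.gguf` and `tiny1m.Q4_0.gguf` can isolate unpacking bugs before any matrix multiplication is involved.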

2. Llama2.c & Base Tokenizer Assets (Root Directory `./`)

Files optimized for execution within the native llama2.c ecosystem:

model.bin: A single flat binary file containing all network weights, custom layout arrays, and pre-computed RoPE frequencies structured specifically for run.c.
tokenizer.bin: The binary export of the 512-vocab tokenizer, in the format that run.c parses directly.
tokenizer.model: The master SentencePiece tokenizer model file (512 vocabulary size, identical to the stories260k standard) kept at the root for upstream conversion tools and local reference.
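As a first loader check, the header of model.bin can be read on its own. The sketch below assumes the file follows the legacy llama2.c export layout read by run.c (seven little-endian 32-bit integers, with a negative vocabulary size signalling an unshared output classifier); the mention of pre-computed RoPE frequencies suggests that layout, but it is an assumption here. `Llama2cConfig` and `parse_llama2c_header` are hypothetical names.

```python
import struct
from typing import NamedTuple

HEADER_FORMAT = "<7i"  # seven little-endian int32 fields
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 28 bytes


class Llama2cConfig(NamedTuple):
    dim: int
    hidden_dim: int
    n_layers: int
    n_heads: int
    n_kv_heads: int
    vocab_size: int
    seq_len: int
    shared_classifier: bool


def parse_llama2c_header(data: bytes) -> Llama2cConfig:
    """Parse the 28-byte config header at the start of a legacy model.bin."""
    if len(data) < HEADER_SIZE:
        raise ValueError("header is truncated")
    dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, seq_len = (
        struct.unpack_from(HEADER_FORMAT, data, 0)
    )
    # A negative vocab_size marks an output layer that is not tied to the embedding.
    return Llama2cConfig(
        dim, hidden_dim, n_layers, n_heads, n_kv_heads,
        abs(vocab_size), seq_len, shared_classifier=vocab_size > 0,
    )
```

Calling it on the first `HEADER_SIZE` bytes of the file (for example `parse_llama2c_header(open("model.bin", "rb").read(HEADER_SIZE))`) gives values that can be compared with the Model Specifications section.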

3. Hugging Face Native Format (`./hf/`)

This directory contains the standard files required to load the model using the PyTorch transformers library:

hf/model.safetensors: The raw, unquantized model weights stored securely in Safetensors format.
hf/config.json: The architectural configuration file defining hyperparameters (layers, heads, dimensions).
hf/generation_config.json: Default parameters optimized for text generation (temperature, top_p, etc.).
hf/tokenizer.model: A redundant copy of the 512-vocab SentencePiece tokenizer model placed inside the directory for seamless Hugging Face API resolution.

🚀 Quick Start & Usage Examples

A. Running GGUF via llama.cpp

To verify your local setup or compare tokens using the official native utilities:

```
./llama-cli -m tiny1m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0
```
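Before comparing generated tokens, a custom loader can be checked against the fixed part of the GGUF header. This is a minimal sketch assuming GGUF version 2 or later in little-endian form, where the header is the 4-byte magic `GGUF`, a 32-bit version, and two 64-bit counts; `GGUFHeader` and `parse_gguf_header` are hypothetical names, not llama.cpp APIs.

```python
import struct
from typing import NamedTuple

GGUF_HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata KV count
GGUF_HEADER_SIZE = struct.calcsize(GGUF_HEADER_FORMAT)  # 24 bytes


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def parse_gguf_header(data: bytes) -> GGUFHeader:
    """Parse the fixed-size header at the start of a GGUF file."""
    if len(data) < GGUF_HEADER_SIZE:
        raise ValueError("header is truncated")
    magic, version, tensor_count, kv_count = struct.unpack_from(
        GGUF_HEADER_FORMAT, data, 0
    )
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    if version < 2:
        # Assumption: only the 64-bit count layout (version >= 2) is handled here.
        raise ValueError("unsupported GGUF version")
    return GGUFHeader(version, tensor_count, kv_count)
```

Reading the first `GGUF_HEADER_SIZE` bytes of `tiny1m.F32.gguf` and printing the result is a quick way to confirm that the file is being opened in binary mode and parsed with the expected byte order.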

B. Running via llama2.c (Native Binary)

The model.bin is fully compatible with the 512-vocab tokenizer.bin derived from the stories260k asset pipeline.

```
./run model.bin -z tokenizer.bin -i "Tom and Jerry are " -n 64
```
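To inspect the vocabulary that run.c will see, tokenizer.bin can be decoded directly. The sketch assumes the layout that run.c reads: a 32-bit maximum token length, then one record per token made of a float32 score, a 32-bit byte length, and the raw token bytes. The vocabulary size is not stored in the file, so it is passed in (512 for this model). `parse_llama2c_tokenizer` is a hypothetical helper name.

```python
import struct


def parse_llama2c_tokenizer(data: bytes, vocab_size: int):
    """Return (max_token_length, [(token_bytes, score), ...]) from tokenizer.bin bytes."""
    (max_token_length,) = struct.unpack_from("<i", data, 0)
    offset = 4
    vocab = []
    for _ in range(vocab_size):
        # Each record: float32 score, int32 length, then `length` raw bytes.
        score, length = struct.unpack_from("<fi", data, offset)
        offset += 8
        token = data[offset:offset + length]
        if len(token) != length:
            raise ValueError("tokenizer data is truncated")
        offset += length
        vocab.append((token, score))
    return max_token_length, vocab
```

For example, `parse_llama2c_tokenizer(open("tokenizer.bin", "rb").read(), 512)` should consume the whole file if the layout assumption holds.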

C. Loading Hugging Face Formats via Python

You can import the Hugging Face variant directly into Python using the transformers library.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "shibatch/tiny1m"

# subfolder="hf" tells the library to load the files from the hf/ directory
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

prompt = "Tom and Jerry are "
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps the output deterministic
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

📝 Model Specifications

The output layer (lm_head) is not shared with the token embedding, which keeps the weight layout consistent with standard Llama 2 definitions. Because the vocabulary has only 512 entries, the embedding and output matrices stay small (512 × 128 = 65,536 weights each).

Architecture: Llama 2 (Scaled-down variant)
Dataset: TinyStories
Total Parameters: ~1M (935,040 as derived from the hyperparameters below, counting the unshared lm_head)
Vocabulary Size: 512 (Uses the stories260k compatible 512-vocab tokenizer layout)
Hidden Size (hidden_size): 128
Number of Hidden Layers (num_hidden_layers): 4
Number of Attention Heads (num_attention_heads): 2
Number of Key-Value Heads (num_key_value_heads): 2
Intermediate Size (intermediate_size): 352
Max Position Embeddings (max_position_embeddings): 256
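
The parameter count can be sanity-checked from the hyperparameters listed above. This sketch assumes the standard Llama layout: no bias terms, four attention projections and three MLP projections per layer, two RMSNorm weight vectors per layer plus a final one, and an output layer that is not tied to the embedding. `count_llama_params` is a hypothetical helper, not a transformers function.

```python
def count_llama_params(
    vocab_size: int,
    hidden_size: int,
    num_layers: int,
    num_heads: int,
    num_kv_heads: int,
    intermediate_size: int,
    tie_embeddings: bool = False,
) -> int:
    """Count weights in a bias-free Llama-style decoder."""
    head_dim = hidden_size // num_heads
    kv_dim = head_dim * num_kv_heads
    embedding = vocab_size * hidden_size
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # gate_proj, up_proj, and down_proj.
    mlp = 3 * hidden_size * intermediate_size
    norms = 2 * hidden_size  # input and post-attention RMSNorm
    total = embedding + num_layers * (attention + mlp + norms) + hidden_size
    if not tie_embeddings:
        total += embedding  # separate lm_head
    return total


total = count_llama_params(
    vocab_size=512, hidden_size=128, num_layers=4,
    num_heads=2, num_kv_heads=2, intermediate_size=352,
)
print(total)  # 935040
```

Under these assumptions each layer contributes 200,960 weights, and the embedding and lm_head contribute 65,536 each.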

📜 Acknowledgments & License

Original Implementation: Inspired by Andrej Karpathy's llama2.c project.
Dataset: TinyStories dataset.
License: MIT License. You are free to use, modify, and distribute these assets for any purpose.

GGUF model size: 935k params
Architecture: llama
