TinyStories Llama2 1M (tiny1m) GGUF & HF Validation Suite

This repository provides ultra-lightweight Llama2 model files across various formats (both GGUF and Hugging Face / Safetensors), trained on the TinyStories dataset and optimized for compatibility with Andrej Karpathy's llama2.c and llama.cpp.

Why this repository exists

When developing a custom LLM inference engine from scratch (C/C++, Vulkan, WebAssembly, etc.) or testing custom hardware kernels, debugging with a full-sized model is slow. This suite offers a model at a true ~1M-parameter scale (~0.4 MB to ~4 MB depending on the format), so developers can validate their loaders, serialization, quantization kernels, and inference logic step by step with fast iteration.


📂 Repository Structure & File Descriptions

1. GGUF Formats (Root Directory ./)

A comprehensive validation suite converted for llama.cpp and compatible engines. Every quantization variant in the root directory is listed below:

| Filename(s) / Wildcard Pattern | Type | Size | Purpose / Validation Target |
| --- | --- | --- | --- |
| tiny1m.F32.gguf | F32 | ~4.0 MB | Baseline Test. Validates GGUF parsing, tensor layout, matrix multiplication, RoPE, and Attention logic without dequantization overhead. |
| tiny1m.F16.gguf<br>tiny1m.BF16.gguf | F16<br>BF16 | ~2.0 MB | Half-Precision Test. Validates 16-bit floating point loading, type casting, and inference stability. |
| tiny1m.Q8_0.gguf | Q8_0 | ~1.1 MB | Quantization Level 1. Validates block-based uniform scaling with 32 elements. |
| tiny1m.Q4_0.gguf<br>tiny1m.Q4_1.gguf | Q4_0<br>Q4_1 | ~0.7 MB | Quantization Level 2. Validates classic 4-bit linear quantization and bit-unpacking logic. |
| tiny1m.Q2_K.gguf | Q2_K | ~0.5 MB | Standard K-Quant (2-bit). Validates 2-bit super-block quantization parsing. |
| tiny1m.Q3_K_*.gguf<br>↳ tiny1m.Q3_K_S.gguf<br>↳ tiny1m.Q3_K_M.gguf<br>↳ tiny1m.Q3_K_L.gguf | Q3_K | ~0.6 MB | Standard K-Quant (3-bit). Validates Small, Medium, and Large sub-variants of 3-bit multi-block structures. |
| tiny1m.Q4_K_*.gguf<br>↳ tiny1m.Q4_K_S.gguf<br>↳ tiny1m.Q4_K_M.gguf | Q4_K | ~0.7 MB | Standard K-Quant (4-bit). Validates Small and Medium sub-variants of modern 4-bit super-block structural parsing. |
| tiny1m.Q5_K_*.gguf<br>↳ tiny1m.Q5_K_S.gguf<br>↳ tiny1m.Q5_K_M.gguf | Q5_K | ~0.8 MB | Standard K-Quant (5-bit). Validates Small and Medium sub-variants of 5-bit mixed precision super-blocks. |
| tiny1m.Q6_K.gguf | Q6_K | ~0.9 MB | Standard K-Quant (6-bit). Validates 6-bit high-fidelity super-block quantization. |
| tiny1m.IQ3_*.gguf<br>↳ tiny1m.IQ3_XXS.gguf<br>↳ tiny1m.IQ3_S.gguf | I-Quants | ~0.5 MB | Importance Quants (3-bit). Non-linear 3-bit importance quantization targeting lookup table (codebook) decoding logic. |
| tiny1m.IQ4_*.gguf<br>↳ tiny1m.IQ4_NL.gguf<br>↳ tiny1m.IQ4_XS.gguf | I-Quants | ~0.6 MB | Importance Quants (4-bit). Non-linear 4-bit importance quantization variants (Non-Linear and Extra Small). |
| tiny1m.TQ1_0.gguf<br>tiny1m.TQ2_0.gguf | Ternary | ~0.4 MB | Experimental. Ternary (-1, 0, 1) state quantization for cutting-edge engine testing. |
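
For the two "classic" block formats in the table, a reference decoder is short enough to keep next to a kernel under test. The sketch below assumes the ggml block layouts as commonly documented: a Q8_0 block is an fp16 scale followed by 32 signed bytes (34 bytes), and a Q4_0 block is an fp16 scale followed by 16 bytes of packed nibbles (18 bytes), with low nibbles filling the first half of the block and high nibbles the second half. The function names are hypothetical, and the input blocks are hand-built rather than read from the model files.

```python
import struct

QK = 32  # elements per block for both Q8_0 and Q4_0


def dequantize_q8_0(block: bytes) -> list[float]:
    """Decode one 34-byte Q8_0 block: fp16 scale, then 32 signed 8-bit values."""
    if len(block) != 2 + QK:
        raise ValueError("Q8_0 block must be 34 bytes")
    (d,) = struct.unpack_from("<e", block, 0)
    qs = struct.unpack_from(f"<{QK}b", block, 2)
    return [d * q for q in qs]


def dequantize_q4_0(block: bytes) -> list[float]:
    """Decode one 18-byte Q4_0 block: fp16 scale, then 16 bytes of packed nibbles."""
    if len(block) != 2 + QK // 2:
        raise ValueError("Q4_0 block must be 18 bytes")
    (d,) = struct.unpack_from("<e", block, 0)
    out = [0.0] * QK
    for j, byte in enumerate(block[2:]):
        out[j] = d * ((byte & 0x0F) - 8)          # low nibble -> first half
        out[j + QK // 2] = d * ((byte >> 4) - 8)  # high nibble -> second half
    return out


# Hand-built blocks (not taken from the model files) to exercise the decoders.
q8_block = struct.pack("<e", 0.5) + struct.pack(f"<{QK}b", *range(-16, 16))
q4_block = struct.pack("<e", 2.0) + bytes([0x80 | j for j in range(16)])

q8_values = dequantize_q8_0(q8_block)
q4_values = dequantize_q4_0(q4_block)
print(q8_values[:4])                      # [-8.0, -7.5, -7.0, -6.5]
print(q4_values[:4], q4_values[16:20])    # [-16.0, -14.0, -12.0, -10.0] [0.0, 0.0, 0.0, 0.0]
```

Comparing a kernel's output against a scalar decoder like this, block by block, tends to isolate bit-unpacking mistakes before they show up as garbled text.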

2. Llama2.c & Base Tokenizer Assets (Root Directory ./)

Files optimized for execution within the native llama2.c ecosystem:

  • model.bin: A single flat binary file containing all network weights, custom layout arrays, and pre-computed RoPE frequencies structured specifically for run.c.
  • tokenizer.bin: The structural binary version of the 512-vocab tokenizer compiled for rapid streaming and direct parsing by run.c.
  • tokenizer.model: The master SentencePiece tokenizer model file (512 vocabulary size, identical to the stories260k standard) kept at the root for upstream conversion tools and local reference.
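
As a first loader check, the header of model.bin can be decoded on its own. The sketch below assumes model.bin uses the legacy llama2.c export, whose header is seven little-endian int32 values (dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, seq_len), and that a negative vocab_size marks an unshared output classifier, as run.c interprets it. The class and function names are hypothetical, and the example header is synthetic, filled in from the Model Specifications section.

```python
import struct
from dataclasses import dataclass


@dataclass
class Llama2cConfig:
    dim: int
    hidden_dim: int
    n_layers: int
    n_heads: int
    n_kv_heads: int
    vocab_size: int
    seq_len: int
    shared_classifier: bool


def parse_llama2c_header(data: bytes) -> Llama2cConfig:
    """Parse the 28-byte header of a legacy llama2.c checkpoint."""
    if len(data) < 28:
        raise ValueError("need at least 28 bytes")
    dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab, seq_len = struct.unpack_from(
        "<7i", data, 0
    )
    # A negative vocab_size signals that the classifier weights are NOT shared.
    return Llama2cConfig(
        dim, hidden_dim, n_layers, n_heads, n_kv_heads, abs(vocab), seq_len,
        shared_classifier=vocab > 0,
    )


# Synthetic header built from the documented hyperparameters (assumption:
# the unshared lm_head is encoded as a negative vocabulary size).
example_header = struct.pack("<7i", 128, 352, 4, 2, 2, -512, 256)
config = parse_llama2c_header(example_header)
print(config)

# Against the real file (path is an assumption about your working directory):
# with open("model.bin", "rb") as f:
#     print(parse_llama2c_header(f.read(28)))
```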

3. Hugging Face Native Format (./hf/)

This directory contains the standard files required to load the model using the PyTorch transformers library:

  • hf/model.safetensors: The raw, unquantized model weights stored securely in Safetensors format.
  • hf/config.json: The architectural configuration file defining hyperparameters (layers, heads, dimensions).
  • hf/generation_config.json: Default parameters optimized for text generation (temperature, top_p, etc.).
  • hf/tokenizer.model: A redundant copy of the 512-vocab SentencePiece tokenizer model placed inside the directory for seamless Hugging Face API resolution.
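
For engines that read Safetensors without the transformers library, the tensor index can be inspected with the standard library alone. The sketch below relies on the published Safetensors layout: an 8-byte little-endian header length, followed by a JSON header that maps tensor names to dtype, shape, and data_offsets. The function names and the in-memory example blob are hypothetical; nothing here is read from hf/model.safetensors.

```python
import json
import struct


def read_safetensors_header(data: bytes) -> dict:
    """Return the JSON header of a safetensors blob, minus the optional metadata entry."""
    (header_len,) = struct.unpack_from("<Q", data, 0)  # little-endian uint64
    header = json.loads(data[8:8 + header_len].decode("utf-8"))
    header.pop("__metadata__", None)
    return header


def build_example_blob() -> bytes:
    """Build a tiny in-memory safetensors blob holding one 2x2 F32 tensor."""
    payload = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
    header = {
        "demo.weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, len(payload)]}
    }
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


blob = build_example_blob()
tensors = read_safetensors_header(blob)
for name, info in tensors.items():
    print(name, info["dtype"], info["shape"], info["data_offsets"])
```

Pointing the same reader at the first bytes of hf/model.safetensors should list every tensor name and shape, which can then be checked against hf/config.json.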

🚀 Quick Start & Usage Examples

A. Running GGUF via llama.cpp

To verify your local setup or compare tokens using the official native utilities:

./llama-cli -m tiny1m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0
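
Before comparing tokens, a custom loader can be checked against the fixed-size prefix of the same GGUF file. The sketch below assumes the GGUF layout for versions 2 and 3: the 4-byte magic "GGUF", a little-endian uint32 version, then uint64 tensor and metadata key-value counts. The function name is hypothetical and the counts in the synthetic header are made up for illustration.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def parse_gguf_header(data: bytes) -> dict:
    """Decode the fixed-size prefix of a GGUF file (versions 2 and 3 only)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes")
    if data[:4] != GGUF_MAGIC:
        raise ValueError(f"bad magic: {data[:4]!r}")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    if version not in (2, 3):
        raise ValueError(f"unsupported GGUF version in this sketch: {version}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header with made-up counts, so the sketch runs without a model file.
example = GGUF_MAGIC + struct.pack("<IQQ", 3, 7, 11)
header = parse_gguf_header(example)
print(header)

# Against a real file (path is an assumption about your working directory):
# with open("tiny1m.Q4_K_M.gguf", "rb") as f:
#     print(parse_gguf_header(f.read(HEADER_SIZE)))
```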

B. Running via llama2.c (Native Binary)

The model.bin is fully compatible with the 512-vocab tokenizer.bin derived from the stories260k asset pipeline.

⚠️ Important Note for llama2.c/run: Pass the prompt to the run binary with the -i option. Do not use -p: in llama2.c, -p sets the top-p sampling threshold, so text passed after it is consumed as that option's value and is never used as a prompt.

./run model.bin -z tokenizer.bin -i "Tom and Jerry are " -n 64

C. Loading Hugging Face Formats via Python

You can import the Hugging Face variant directly into Python using the transformers library.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "shibatch/tiny1m"

# subfolder="hf" tells transformers to load the files from the hf/ directory of the repo
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

prompt = "Tom and Jerry are "
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

πŸ“ Model Specifications

The network uses an unshared output layer (lm_head) to keep its memory layout consistent with standard Llama 2 definitions. Because the vocabulary has only 512 tokens, the token embedding and output layers remain extremely lightweight.

  • Architecture: Llama 2 (Scaled-down variant)
  • Dataset: TinyStories
  • Total Parameters: ~1M (935,040 as counted from the hyperparameters listed here)
  • Vocabulary Size: 512 (Uses the stories260k compatible 512-vocab tokenizer layout)
  • Hidden Size (hidden_size): 128
  • Number of Hidden Layers (num_hidden_layers): 4
  • Number of Attention Heads (num_attention_heads): 2
  • Number of Key-Value Heads (num_key_value_heads): 2
  • Intermediate Size (intermediate_size): 352
  • Max Position Embeddings (max_position_embeddings): 256
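
The parameter total can be rechecked from the hyperparameters in this list. The tally below is a sketch that assumes the standard Llama 2 layout: no bias terms, one RMSNorm weight vector before attention and one before the MLP in each layer, a final RMSNorm, a three-matrix gated MLP, and no learned parameters for RoPE. The variable names are illustrative.

```python
vocab_size = 512
hidden_size = 128
num_layers = 4
num_heads = 2
num_kv_heads = 2
intermediate_size = 352

head_dim = hidden_size // num_heads
kv_dim = num_kv_heads * head_dim

embedding = vocab_size * hidden_size          # token embedding table
lm_head = vocab_size * hidden_size            # unshared output projection

attention = (
    hidden_size * hidden_size                 # q projection
    + 2 * hidden_size * kv_dim                # k and v projections
    + hidden_size * hidden_size               # output projection
)
mlp = 3 * hidden_size * intermediate_size     # gate, up, and down projections
norms = 2 * hidden_size                       # two RMSNorm weight vectors per layer

per_layer = attention + mlp + norms
total = embedding + lm_head + num_layers * per_layer + hidden_size  # + final norm

print(f"per layer: {per_layer:,}")  # per layer: 200,960
print(f"total:     {total:,}")      # total:     935,040
```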

📜 Acknowledgments & License

  • Original Implementation: Inspired by Andrej Karpathy's llama2.c project.
  • Dataset: TinyStories dataset.
  • License: MIT License. You are free to use, modify, and distribute these assets for any purpose.