---
license: mit
base_model: karpathy/tinyllamas
tags:
- llama2
- gguf
- safetensors
- transformers
- tinyllamas
- validation
- test-suite
---

# TinyStories Llama2 1M (tiny1m) GGUF & HF Validation Suite

This repository provides ultra-lightweight Llama2 model files in several formats (both **GGUF** and **Hugging Face / Safetensors**), trained on the TinyStories dataset and built for compatibility with Andrej Karpathy's `llama2.c` and with `llama.cpp`.

### Why this repository exists

When developing a custom LLM inference engine, debugging with a full-sized model is slow. This suite offers a true **1M parameter scale model** (~1MB to ~4MB depending on the quantization format), so developers can validate their loaders, serialization, quantization kernels, and inference logic step by step with short load and run times.

---

## 📂 Repository Structure & File Descriptions

### 1. GGUF Formats (Root Directory `./`)

A validation suite converted for `llama.cpp` and compatible engines. Every quantization variant in the root directory is listed below:

| Filename(s) / Wildcard Pattern | Type | Size | Purpose / Validation Target |
| :--- | :--- | :--- | :--- |
| **`tiny1m.F32.gguf`** | `F32` | ~4.0 MB | **Baseline Test.** Validates GGUF parsing, tensor layout, matrix multiplication, RoPE, and attention logic without dequantization overhead. |
| **`tiny1m.F16.gguf`**<br>**`tiny1m.BF16.gguf`** | `F16`<br>`BF16` | ~2.0 MB | **Half-Precision Test.** Validates 16-bit floating-point loading, type casting, and inference stability. |
| **`tiny1m.Q8_0.gguf`** | `Q8_0` | ~1.1 MB | **Quantization Level 1.** Validates block-wise uniform scaling (one scale per block of 32 elements). |
| **`tiny1m.Q4_0.gguf`**<br>**`tiny1m.Q4_1.gguf`** | `Q4_0`<br>`Q4_1` | ~0.7 MB | **Quantization Level 2.** Validates classic 4-bit linear quantization and bit-unpacking logic. |
| **`tiny1m.Q2_K.gguf`** | `Q2_K` | ~0.5 MB | **Standard K-Quant (2-bit).** Validates 2-bit super-block quantization parsing. |
| **`tiny1m.Q3_K_*.gguf`**<br>↳ *`tiny1m.Q3_K_S.gguf`*<br>↳ *`tiny1m.Q3_K_M.gguf`*<br>↳ *`tiny1m.Q3_K_L.gguf`* | `Q3_K` | ~0.6 MB | **Standard K-Quant (3-bit).** Validates the Small, Medium, and Large sub-variants of 3-bit super-block structures. |
| **`tiny1m.Q4_K_*.gguf`**<br>↳ *`tiny1m.Q4_K_S.gguf`*<br>↳ *`tiny1m.Q4_K_M.gguf`* | `Q4_K` | ~0.7 MB | **Standard K-Quant (4-bit).** Validates parsing of the Small and Medium sub-variants of 4-bit super-block structures. |
| **`tiny1m.Q5_K_*.gguf`**<br>↳ *`tiny1m.Q5_K_S.gguf`*<br>↳ *`tiny1m.Q5_K_M.gguf`* | `Q5_K` | ~0.8 MB | **Standard K-Quant (5-bit).** Validates the Small and Medium sub-variants of 5-bit mixed-precision super-blocks. |
| **`tiny1m.Q6_K.gguf`** | `Q6_K` | ~0.9 MB | **Standard K-Quant (6-bit).** Validates 6-bit high-fidelity super-block quantization. |
| **`tiny1m.IQ3_*.gguf`**<br>↳ *`tiny1m.IQ3_XXS.gguf`*<br>↳ *`tiny1m.IQ3_S.gguf`* | `I-Quants` | ~0.5 MB | **Importance Quants (3-bit).** Non-linear 3-bit importance quantization, exercising lookup-table (codebook) decoding logic. |
| **`tiny1m.IQ4_*.gguf`**<br>↳ *`tiny1m.IQ4_NL.gguf`*<br>↳ *`tiny1m.IQ4_XS.gguf`* | `I-Quants` | ~0.6 MB | **Importance Quants (4-bit).** Non-linear 4-bit importance quantization variants (Non-Linear and Extra Small). |
| **`tiny1m.TQ1_0.gguf`**<br>**`tiny1m.TQ2_0.gguf`** | `Ternary` | ~0.4 MB | **Experimental.** Ternary (-1, 0, 1) quantization for testing engines that support these newer types. |

### 2. Llama2.c & Base Tokenizer Assets (Root Directory `./`)

Files intended for the native `llama2.c` ecosystem:

* **`model.bin`**: A single flat binary file containing all network weights, custom layout arrays, and pre-computed RoPE frequencies, structured specifically for `run.c`.
* **`tokenizer.bin`**: The binary export of the 512-vocab tokenizer, in the format that `run.c` parses directly.
* **`tokenizer.model`**: The master SentencePiece tokenizer model file (512 vocabulary size, identical to the `stories260k` standard), kept at the root for upstream conversion tools and local reference.

### 3. Hugging Face Native Format (`./hf/`)

This directory contains the standard files required to load the model with the Hugging Face `transformers` library (PyTorch backend):

* **`hf/model.safetensors`**: The raw, unquantized model weights in Safetensors format.
* **`hf/config.json`**: The architectural configuration file defining hyperparameters (layers, heads, dimensions).
* **`hf/generation_config.json`**: Default text-generation parameters (temperature, top_p, etc.).
* **`hf/tokenizer.model`**: A duplicate of the 512-vocab SentencePiece tokenizer model, placed inside the directory so the Hugging Face API can resolve it without extra arguments.

---

## 🚀 Quick Start & Usage Examples

### A. Running GGUF via llama.cpp

To verify your local setup or compare tokens using the official native utilities:

```bash
./llama-cli -m tiny1m.Q4_K_M.gguf -p "Tom and Jerry are " -n 64 --temp 0.0
```

### B. Running via llama2.c (Native Binary)

The `model.bin` file is fully compatible with the 512-vocab `tokenizer.bin` derived from the `stories260k` asset pipeline.

```bash
./run model.bin -z tokenizer.bin -i "Tom and Jerry are " -n 64
```

### C. Loading Hugging Face Formats via Python

You can load the Hugging Face variant directly in Python using the `transformers` library.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "shibatch/tiny1m"

# The subfolder argument tells the library to load from the hf/ directory
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf")

prompt = "Tom and Jerry are "
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps the output deterministic for comparisons
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 📝 Model Specifications

The network architecture uses an unshared output layer (`lm_head`) to keep memory structures consistent with standard Llama 2 definitions. Because the vocabulary has only 512 entries, the token embedding and output layers remain very small.

* **Architecture:** Llama 2 (scaled-down variant)
* **Dataset:** TinyStories
* **Total Parameters:** ~1M (Exactly 896,256 parameters)
* **Vocabulary Size:** 512 (Uses the `stories260k` compatible 512-vocab tokenizer layout)
* **Hidden Size (`hidden_size`):** 128
* **Number of Hidden Layers (`num_hidden_layers`):** 4
* **Number of Attention Heads (`num_attention_heads`):** 2
* **Number of Key-Value Heads (`num_key_value_heads`):** 2
* **Intermediate Size (`intermediate_size`):** 352
* **Max Position Embeddings (`max_position_embeddings`):** 256

## 📜 Acknowledgments & License

* **Original Implementation:** Inspired by Andrej Karpathy's `llama2.c` project.
* **Dataset:** TinyStories dataset.
* **License:** **MIT License**. You are free to use, modify, and distribute these assets for any purpose.
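
## 🧪 Reference Dequantization Sketch (Q8_0 / Q4_0)

The `Q8_0` and `Q4_0` files above are meant for checking block dequantization and bit-unpacking kernels. A small pure-Python reference can help when comparing an engine's output block by block. The sketch below assumes the usual ggml block layouts: for `Q8_0`, a little-endian fp16 scale followed by 32 signed bytes; for `Q4_0`, an fp16 scale followed by 16 bytes whose low nibbles hold elements 0..15 and whose high nibbles hold elements 16..31, each offset by 8. The function names are illustrative and are not part of this repository or of `llama.cpp`.

```python
import struct

QK = 32  # elements per block, shared by Q8_0 and Q4_0


def dequantize_q8_0(raw: bytes) -> list:
    """Decode Q8_0 blocks: fp16 scale `d`, then 32 int8 values; x = d * q."""
    block_size = 2 + QK  # 34 bytes per block
    if len(raw) % block_size != 0:
        raise ValueError("Q8_0 data must be a multiple of 34 bytes")
    out = []
    for off in range(0, len(raw), block_size):
        (d,) = struct.unpack_from("<e", raw, off)         # fp16 scale
        qs = struct.unpack_from(f"<{QK}b", raw, off + 2)  # 32 signed bytes
        out.extend(d * q for q in qs)
    return out


def dequantize_q4_0(raw: bytes) -> list:
    """Decode Q4_0 blocks: fp16 scale `d`, then 16 packed bytes; x = d * (nibble - 8)."""
    block_size = 2 + QK // 2  # 18 bytes per block
    if len(raw) % block_size != 0:
        raise ValueError("Q4_0 data must be a multiple of 18 bytes")
    out = []
    for off in range(0, len(raw), block_size):
        (d,) = struct.unpack_from("<e", raw, off)
        packed = raw[off + 2 : off + block_size]
        low = [(b & 0x0F) - 8 for b in packed]   # elements 0..15
        high = [(b >> 4) - 8 for b in packed]    # elements 16..31
        out.extend(d * q for q in low + high)
    return out


if __name__ == "__main__":
    # Synthetic block for illustration only: scale 0.5, values -16..15.
    block = struct.pack("<e", 0.5) + struct.pack(f"<{QK}b", *range(-16, 16))
    print(dequantize_q8_0(block)[:4])
```

To use this against a real file, an engine would first have to locate a tensor's raw bytes through the GGUF header and tensor offsets; that step is deliberately left out here. Comparing the decoded values with the matching tensor in `tiny1m.F32.gguf` should show only quantization error, which makes layout mistakes such as swapped nibbles or a wrong block stride easy to spot.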