metadata
license: mit
base_model: karpathy/tinyllamas
tags:
  - llama2
  - gguf
  - safetensors
  - transformers
  - tinyllamas
  - validation
  - test-suite

TinyStories Llama2 GGUF & HF Validation Suite

This repository provides a collection of ultra-lightweight Llama2 models in several formats (GGUF and Hugging Face Safetensors), converted from the weights of Andrej Karpathy's llama2.c project.

Why does this repository exist?

When developing a custom LLM inference engine from scratch (C/C++, Vulkan, WebAssembly, etc.) or testing custom hardware kernels, debugging against a full-sized 7B model is slow and inefficient. This suite offers models of roughly 1 MB to 60 MB, so developers can validate their loaders, serialization, quantization kernels, and inference logic step by step with fast iteration.


📦 Included Formats & Testing Roadmap

1. GGUF Formats (For Native Inference Engines)

Recommended validation order when developing a custom native GGUF engine:

| Filename | Type | Size | Purpose / Validation Target |
|---|---|---|---|
| stories15M.F32.gguf | F32 | ~60 MB | Baseline Test. Validates GGUF parsing, tensor layout, matrix multiplication, RoPE, and Attention logic without any dequantization overhead. |
| stories15M.F16.gguf, stories15M.BF16.gguf | F16, BF16 | ~30 MB | Half-Precision Test. Validates 16-bit floating point loading, type casting, and inference stability. |
| stories15M.Q8_0.gguf | Q8_0 | ~16 MB | Quantization Level 1. Validates the simplest linear quantization logic (block-based uniform scaling with 32 elements). |
| stories15M.Q4_0.gguf, stories15M.Q4_1.gguf | Q4_0, Q4_1 | ~10 MB | Quantization Level 2. Validates classic 4-bit linear quantization and bit-unpacking logic. |
| stories15M.Q2_K.gguf through stories15M.Q6_K.gguf | K-Quants | 9-15 MB | Standard Quants. Validates modern super-block structural parsing with mixed precision. |
| stories15M.IQ3_XXS.gguf through stories15M.IQ4_XS.gguf | I-Quants | 8-12 MB | Advanced Quants. Non-linear quantization targeting lookup table (codebook) decoding logic. |
| stories15M.TQ1_0.gguf, stories15M.TQ2_0.gguf | Ternary | 7-9 MB | Experimental. Ternary (-1, 0, 1) state quantization for cutting-edge engine testing. |
| stories260K.F32.gguf, stories260K.F16.gguf | F32, F16 | ~1 MB | Ultra-Mini Check. Extreme low-resource baseline utilizing a tiny 512-token vocabulary. |
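
For the Q8_0 and Q4_0 rows above, the per-block decoding that an engine has to reproduce can be sketched in a few lines. This is a minimal sketch, not the llama.cpp implementation: it assumes the ggml block layouts as I understand them (Q8_0: one fp16 scale followed by 32 signed bytes, 34 bytes total; Q4_0: one fp16 scale followed by 16 bytes of packed nibbles, 18 bytes total, with low nibbles holding elements 0 to 15 and high nibbles holding elements 16 to 31) and little-endian byte order. The function names are illustrative only.

```python
import struct

QK = 32  # elements per block for both Q8_0 and Q4_0 (assumed ggml layout)


def dequantize_q8_0(block):
    """Decode one 34-byte Q8_0 block: fp16 scale + 32 signed 8-bit values."""
    d, *qs = struct.unpack("<e32b", block)
    return [d * q for q in qs]


def dequantize_q4_0(block):
    """Decode one 18-byte Q4_0 block: fp16 scale + 16 bytes of packed nibbles."""
    d, *packed = struct.unpack("<e16B", block)
    # Each byte carries two 4-bit values stored with an offset of 8.
    low = [(b & 0x0F) - 8 for b in packed]   # elements 0..15
    high = [(b >> 4) - 8 for b in packed]    # elements 16..31
    return [d * q for q in low + high]


if __name__ == "__main__":
    # Synthetic blocks built by hand, not read from a real model file.
    q8 = struct.pack("<e32b", 0.5, *range(-16, 16))
    q4 = struct.pack("<e16B", 2.0, *[(0xF << 4) | j for j in range(16)])
    print(dequantize_q8_0(q8)[:4])
    print(dequantize_q4_0(q4)[:4])
```

Comparing the output of such a reference decoder against a custom kernel on a single block is usually easier than debugging a full forward pass.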

2. Hugging Face / Transformers Formats (For PyTorch Validation)

Standard Safetensors weights with standard config.json files, usable out of the box with the Hugging Face transformers library. Ideal for computing reference outputs or testing upstream conversion scripts (such as convert_hf_to_gguf.py).

  • hf_stories15M/: The 15M parameter model mapped to standard Hugging Face Llama architecture. Includes pre-bundled Llama-2 compatible tokenizer configurations.
  • hf_stories260K/: The ultra-mini 260K parameter model with its custom architecture parameters intact.

🚀 Quick Start & Usage Examples

A. Running GGUF via llama.cpp

To verify your local setup or compare tokens using the official native utilities:

./llama-cli -m stories15M.Q4_K_M.gguf -p "One day, Timmy went to" -n 30 --temp 0.0
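
When a custom loader and llama.cpp disagree about a file, the fixed-size header at the start of the GGUF file is a cheap first thing to check. The sketch below assumes GGUF version 2 or later in little-endian byte order: a 4-byte magic `GGUF`, a uint32 version, then uint64 tensor and metadata key-value counts. `read_gguf_header` is a hypothetical helper name, not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (kv count)


def read_gguf_header(data):
    """Parse the fixed GGUF header (assumes version >= 2, little-endian)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read the header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("bad magic: %r" % data[:4])
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

To use it on a real file, read the first 24 bytes in binary mode, for example `read_gguf_header(open("stories260K.F32.gguf", "rb").read(24))`, and compare the counts with what the custom loader reports.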

B. Loading Hugging Face Formats via Python

You can load the Hugging Face variants directly in Python with the transformers library by passing the subfolder argument.

Example for hf_stories15M

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "shibatch/stories-converted"

# Load directly from the subfolder in this repository
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf_stories15M")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf_stories15M")

prompt = "One day, Timmy went to"
inputs = tokenizer(prompt, return_tensors="pt")

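# Greedy decoding (do_sample=False) keeps the output deterministic,
# which makes it usable as a reference for other engines.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=30,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,  # reuse EOS as the padding token
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))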
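
Because the example above decodes greedily, its token IDs (available via `outputs[0].tolist()`) can serve as a reference sequence for a custom engine. A minimal sketch of such a comparison follows; `first_divergence` is a hypothetical helper, and the candidate IDs are assumed to come from your own engine.

```python
from itertools import zip_longest


def first_divergence(reference, candidate):
    """Return the index of the first differing token ID, or None if identical.

    A length mismatch counts as a divergence at the end of the shorter list.
    """
    for i, (ref_id, cand_id) in enumerate(zip_longest(reference, candidate)):
        if ref_id != cand_id:
            return i
    return None


if __name__ == "__main__":
    # Made-up token IDs for illustration only.
    print(first_divergence([1, 3118, 2462], [1, 3118, 2462]))
    print(first_divergence([1, 3118, 2462], [1, 3118, 9999]))
```

Small floating-point differences between engines can flip a greedy choice even when both implementations are correct, particularly with quantized weights, so an early divergence is a prompt to compare logits rather than proof of a bug.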

Model Specifications

  • Architecture: Llama 2 (scaled down variants)
  • Dataset: TinyStories (simple vocabulary suited to 3- to 4-year-olds)
  • Vocabulary Size: 32,000 for 15M models, 512 for 260K models.
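
The approximate file sizes listed for the GGUF variants follow from simple bits-per-weight arithmetic. The sketch below assumes the nominal 15M parameter count from the model name and the ggml block sizes as I understand them (34 bytes per 32 weights for Q8_0, 18 bytes for Q4_0, 20 bytes for Q4_1). It ignores file metadata, the embedded tokenizer, and any tensors a converter keeps at higher precision, so real files come out somewhat larger.

```python
# Bits stored per weight for several GGUF tensor types (assumed block layouts).
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "F16": 16.0,
    "BF16": 16.0,
    "Q8_0": 34 * 8 / 32,  # fp16 scale + 32 int8 values per block
    "Q4_0": 18 * 8 / 32,  # fp16 scale + 16 packed bytes per block
    "Q4_1": 20 * 8 / 32,  # fp16 scale + fp16 minimum + 16 packed bytes
}


def estimate_size_mb(n_params, tensor_type):
    """Estimate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return n_params * BITS_PER_WEIGHT[tensor_type] / 8 / 1e6


if __name__ == "__main__":
    for name in BITS_PER_WEIGHT:
        print(name, round(estimate_size_mb(15_000_000, name), 1), "MB")
```

The F32, F16, and Q8_0 estimates land close to the listed ~60 MB, ~30 MB, and ~16 MB, which is a quick sanity check that a converter produced the expected tensor types.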

📜 Acknowledgments & License

  • Original Weights: Trained by Andrej Karpathy (karpathy/tinyllamas).
  • License: MIT License (inherited from the original llama2.c repository). You are free to use, modify, and distribute these assets under its terms.