TinyStories Llama2 GGUF & HF Validation Suite

This repository provides a comprehensive collection of ultra-lightweight Llama2 models across various formats (both GGUF and Hugging Face/Safetensors), converted from Andrej Karpathy's llama2.c project.

Why does this repository exist?

When developing a custom LLM inference engine from scratch (C/C++, Vulkan, WebAssembly, etc.) or testing custom hardware kernels, debugging with a full-sized 7B model is slow and inefficient. This suite provides models ranging from roughly 1 MB to 60 MB, so developers can validate their loaders, serialization, quantization kernels, and inference logic step by step with very short iteration times.
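
As a first loader check of the kind described above, the sketch below reads the fixed-size GGUF file header. It assumes a little-endian GGUF file of version 2 or 3, where the header is the 4-byte magic `GGUF`, a 32-bit version, and two 64-bit counts. The function names `parse_gguf_header` and `read_gguf_header` are illustrative and not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor count, metadata key/value count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def parse_gguf_header(raw):
    """Return (version, tensor_count, metadata_kv_count) from the first 24 bytes."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("buffer too short to hold a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack(
        HEADER_FORMAT, raw[:HEADER_SIZE]
    )
    if magic != GGUF_MAGIC:
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    if version < 2:
        # Assumption: only versions with 64-bit counts are handled here.
        raise ValueError(f"unsupported GGUF version: {version}")
    return version, tensor_count, kv_count


def read_gguf_header(path):
    """Read and parse the header of the GGUF file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))
```

For example, `read_gguf_header("stories260K.F32.gguf")` would return the version and the two counts, which can then be compared against what a custom loader reports for the same file.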


📦 Included Formats & Testing Roadmap

1. GGUF Formats (For Native Inference Engines)

Recommended validation order when developing a custom native GGUF engine:

| Filename | Type | Size | Purpose / Validation Target |
|---|---|---|---|
| stories15M.F32.gguf | F32 | ~60 MB | Baseline Test. Validates GGUF parsing, tensor layout, matrix multiplication, RoPE, and Attention logic without any dequantization overhead. |
| stories15M.F16.gguf, stories15M.BF16.gguf | F16, BF16 | ~30 MB | Half-Precision Test. Validates 16-bit floating point loading, type casting, and inference stability. |
| stories15M.Q8_0.gguf | Q8_0 | ~16 MB | Quantization Level 1. Validates the simplest linear quantization logic (block-based uniform scaling with 32 elements). |
| stories15M.Q4_0.gguf, stories15M.Q4_1.gguf | Q4_0, Q4_1 | ~10 MB | Quantization Level 2. Validates classic 4-bit linear quantization and bit-unpacking logic. |
| stories15M.Q2_K to Q6_K.gguf | K-Quants | 9 to 15 MB | Standard Quants. Validates modern super-block structural parsing with mixed precision. |
| stories15M.IQ3_XXS to IQ4_XS.gguf | I-Quants | 8 to 12 MB | Advanced Quants. Non-linear quantization targeting lookup table (codebook) decoding logic. |
| stories15M.TQ1_0.gguf, stories15M.TQ2_0.gguf | Ternary | 7 to 9 MB | Experimental. Ternary (-1, 0, 1) state quantization for cutting-edge engine testing. |
| stories260K.F32.gguf, stories260K.F16.gguf | F32, F16 | ~1 MB | Ultra-Mini Check. Extreme low-resource baseline utilizing a tiny 512-token vocabulary. |
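
To make the two "Quantization Level" steps concrete, here is a minimal sketch of per-block dequantization for Q8_0 and Q4_0. It assumes the block layouts used by ggml: a float16 scale followed by 32 signed bytes for Q8_0 (34 bytes per block), and a float16 scale followed by 16 packed bytes for Q4_0 (18 bytes per block), with low nibbles holding the first 16 elements and high nibbles the last 16. The function names are illustrative only.

```python
import struct

QK = 32  # elements per block in both Q8_0 and Q4_0


def dequantize_q8_0_block(block):
    """Decode one 34-byte Q8_0 block into 32 floats."""
    scale, *quants = struct.unpack("<e32b", block)
    return [scale * q for q in quants]


def dequantize_q4_0_block(block):
    """Decode one 18-byte Q4_0 block into 32 floats."""
    scale, *packed = struct.unpack("<e16B", block)
    # Each stored 4-bit value is offset by 8 so it can represent -8..7.
    low = [scale * ((byte & 0x0F) - 8) for byte in packed]  # elements 0..15
    high = [scale * ((byte >> 4) - 8) for byte in packed]   # elements 16..31
    return low + high
```

Comparing the output of these functions against a custom kernel on the first few blocks of a tensor is usually enough to catch nibble-order or sign-offset mistakes before running full inference.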

2. Hugging Face / Transformers Formats (For PyTorch Validation)

Standard Safetensors weights with accompanying config.json files, usable out of the box with the Hugging Face transformers library. Ideal for computing reference outputs or for testing upstream conversion scripts (such as convert_hf_to_gguf.py).

  • hf_stories15M/: The 15M parameter model mapped to standard Hugging Face Llama architecture. Includes pre-bundled Llama-2 compatible tokenizer configurations.
  • hf_stories260K/: The ultra-mini 260K parameter model with its custom architecture parameters intact.
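
When these models serve as a numerical reference, one possible way to compare a custom engine's logits against the reference values is sketched below. The helper `compare_logits` and its default tolerance are assumptions made for illustration; a suitable tolerance depends on the precision and quantization type under test.

```python
def compare_logits(reference, candidate, atol=1e-3):
    """Compare two equal-length logit vectors for a single token position."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors differ in length")
    if not reference:
        raise ValueError("logit vectors are empty")
    max_abs_err = max(abs(r - c) for r, c in zip(reference, candidate))
    # Index of the largest logit, i.e. the token greedy decoding would pick.
    ref_top = max(range(len(reference)), key=reference.__getitem__)
    cand_top = max(range(len(candidate)), key=candidate.__getitem__)
    return {
        "max_abs_err": max_abs_err,
        "same_argmax": ref_top == cand_top,
        "within_tol": max_abs_err <= atol,
    }
```

A matching argmax with a small maximum error suggests the forward pass agrees with the reference, while a matching argmax with a large error can point to a scaling or precision issue that has not yet changed the predicted token.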

🚀 Quick Start & Usage Examples

A. Running GGUF via llama.cpp

To verify your local setup or compare tokens using the official native utilities:

./llama-cli -m stories15M.Q4_K_M.gguf -p "One day, Timmy went to" -n 30 --temp 0.0
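
With the temperature set to 0.0, decoding is greedy, so a custom engine given the same model file and prompt would be expected to produce the same token sequence. A minimal sketch for locating the first mismatch between two token ID lists follows. The helper name `first_divergence` is hypothetical, and how the IDs are obtained from each engine is left open.

```python
def first_divergence(reference, candidate):
    """Return the index of the first differing token, or None if both match."""
    for index, (ref_token, cand_token) in enumerate(zip(reference, candidate)):
        if ref_token != cand_token:
            return index
    if len(reference) != len(candidate):
        # One sequence is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return None
```

Knowing the position of the first mismatch narrows the search: a divergence at index 0 often points at tokenization or weight loading, while a later one is more likely related to the KV cache or positional encoding.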

B. Loading Hugging Face Formats via Python

You can import the Hugging Face variants directly into Python via the transformers library using the subfolder argument.

Example for hf_stories15M

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "shibatch/stories-converted"

# Load directly from the subfolder in this repository
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="hf_stories15M")
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="hf_stories15M")

prompt = "One day, Timmy went to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_new_tokens=30, 
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

πŸ“ Model Specifications

  • Architecture: Llama 2 (scaled down variants)
  • Dataset: TinyStories (simple vocabulary suited to 3- to 4-year-olds)
  • Vocabulary Size: 32,000 for 15M models, 512 for 260K models.

📜 Acknowledgments & License

  • Original Weights: Trained by Andrej Karpathy (karpathy/tinyllamas).
  • License: MIT License (inherited from the original llama2.c repository). You are free to use, modify, and distribute these assets for any purpose.