TinyLlama-1.1B-Chat - ONNX (FP16)

ONNX export of TinyLlama-1.1B-Chat-v1.0 (1.1B parameters, FP16 weights) with KV cache support for efficient autoregressive generation.
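To see why the KV cache matters for autoregressive generation, consider the token-processing cost with and without it. The sketch below is an illustration of the arithmetic, not library code: without a cache every decode step re-runs attention over the whole sequence so far, while with a cache each step only processes the newest token after a single prefill pass over the prompt.

```java
// Illustration (not inference4j code): token-processing cost of
// autoregressive decoding for a prompt of p tokens and g generated tokens.
public class KvCacheCost {

    // Without a KV cache, step k reprocesses the prompt plus the
    // k tokens generated so far.
    static long withoutCache(long p, long g) {
        long total = 0;
        for (long step = 0; step < g; step++) {
            total += p + step;
        }
        return total;
    }

    // With a KV cache, past keys/values are reused: one prefill pass
    // over the prompt, then one new token per remaining step.
    static long withCache(long p, long g) {
        return p + (g - 1);
    }

    public static void main(String[] args) {
        long p = 32, g = 128;
        System.out.println("without cache: " + withoutCache(p, g) + " tokens processed");
        System.out.println("with cache:    " + withCache(p, g) + " tokens processed");
    }
}
```

For a 32-token prompt and 128 generated tokens this is 12,224 token evaluations without the cache versus 159 with it; the gap widens quadratically as the sequence grows.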

Converted for use with inference4j, an inference-only AI library for Java.

Original Source

Usage with inference4j

```java
try (var gen = OnnxTextGenerator.tinyLlama().build()) {
    GenerationResult result = gen.generate("What is Java?");
    System.out.println(result.text());
}
```

Model Details

| Property | Value |
|---|---|
| Architecture | LlamaForCausalLM (1.1B parameters, 22 layers, 2048 hidden size, 32 attention heads, 4 KV heads) |
| Task | Text generation (instruction-tuned, Zephyr chat template) |
| Precision | FP16 |
| Context length | 2048 tokens |
| Vocabulary | 32,000 tokens (SentencePiece BPE) |
| Chat template | Zephyr (`<|system|>`, `<|user|>`, `<|assistant|>` markers) |
| Original framework | PyTorch (transformers) |
| Export method | Hugging Face Optimum (with KV cache, FP16) |
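For reference, the Zephyr chat template used by TinyLlama-1.1B-Chat can be built by hand as a plain string. The sketch below shows the marker layout; in practice inference4j or the tokenizer's own template handling may apply this formatting for you.

```java
// Sketch: assembling a Zephyr-format prompt manually. The <|system|>,
// <|user|>, and <|assistant|> markers and the </s> end-of-turn token
// follow the Zephyr chat template used by TinyLlama-1.1B-Chat-v1.0.
public class ZephyrPrompt {

    static String format(String system, String user) {
        return "<|system|>\n" + system + "</s>\n"
             + "<|user|>\n" + user + "</s>\n"
             + "<|assistant|>\n";  // generation continues from here
    }

    public static void main(String[] args) {
        System.out.print(format("You are a helpful assistant.", "What is Java?"));
    }
}
```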

License

This model is licensed under the Apache License 2.0. Original model by TinyLlama.
