Instructions for using HenryHHHH/DistilLlamaV1 with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.

Transformers

How to use HenryHHHH/DistilLlamaV1 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HenryHHHH/DistilLlamaV1")
```

```python
# Load the model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HenryHHHH/DistilLlamaV1")
model = AutoModelForCausalLM.from_pretrained("HenryHHHH/DistilLlamaV1")
```
vLLM

How to use HenryHHHH/DistilLlamaV1 with vLLM:

Install from pip and serve the model:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "HenryHHHH/DistilLlamaV1"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HenryHHHH/DistilLlamaV1",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
SGLang

How to use HenryHHHH/DistilLlamaV1 with SGLang:

Install from pip and serve the model:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "HenryHHHH/DistilLlamaV1" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HenryHHHH/DistilLlamaV1",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Or use the Docker images:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "HenryHHHH/DistilLlamaV1" \
    --host 0.0.0.0 \
    --port 30000
```

Docker Model Runner
How to use HenryHHHH/DistilLlamaV1 with Docker Model Runner:

```shell
docker model run hf.co/HenryHHHH/DistilLlamaV1
```
Overview
This model is a distilled version of LLaMA 2 with approximately 80 million parameters. It was trained on a mix of the OpenWebText and WikiText Raw V1 datasets. Knowledge distillation was used to transfer knowledge from a larger teacher model, Meta's 7B LLaMA 2, so that this smaller student model mimics the teacher's behavior. This is the latest version of DistilLlama, trained for 5 days on two NVIDIA A100 80GB GPUs.
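The exact distillation objective is not spelled out here. As an illustrative sketch only (not the project's actual training loss), standard knowledge distillation minimizes a temperature-scaled KL divergence between the teacher's and student's output distributions; the temperature value below is an assumption:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in classic knowledge distillation."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    log_p_teacher = np.log(p_teacher + 1e-12)
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return (temperature ** 2) * kl.mean()

# Identical logits give (near-)zero loss; diverging logits give a positive loss
same = distillation_loss(np.array([[1.0, 2.0, 3.0]]), np.array([[1.0, 2.0, 3.0]]))
diff = distillation_loss(np.array([[3.0, 2.0, 1.0]]), np.array([[1.0, 2.0, 3.0]]))
```

A higher temperature softens both distributions, exposing more of the teacher's relative preferences among non-top tokens to the student.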
Update
30 of the 300 training checkpoints were examined, and the one with the best semantic and factual accuracy is now the checkpoint published in this repository.
Model Architecture
The architecture is based on LLaMA 2, with the following parameters:
| Parameter | Value |
|---|---|
| Hidden Dimension | 512 |
| Intermediate Dimension | 1536 |
| Max Positional Embeddings | 128 |
| Attention Heads | 8 |
| Transformer Layers | 16 |
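A rough parameter count from the table above is consistent with the stated ~80M figure. The sketch below is a back-of-the-envelope estimate assuming LLaMA-style blocks (4 attention projections, a 3-matrix SwiGLU MLP) and LLaMA 2's 32k vocabulary, which is not stated in the card; norms and biases are ignored, and the exact total depends on vocabulary size and whether embeddings are tied:

```python
def estimate_params(d=512, d_ff=1536, n_layers=16, vocab=32000, tied_embeddings=True):
    """Back-of-the-envelope parameter count for a LLaMA-style decoder."""
    embed = vocab * d                      # token embedding matrix
    lm_head = 0 if tied_embeddings else vocab * d
    attn = 4 * d * d                       # per-layer Q, K, V, O projections
    mlp = 3 * d * d_ff                     # per-layer gate, up, down projections
    return embed + lm_head + n_layers * (attn + mlp)

total = estimate_params()
print(f"~{total / 1e6:.0f}M parameters")  # prints "~71M parameters"
```

With untied embeddings the same arithmetic gives roughly 87M, so the stated ~80M sits in this range.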
Evaluation Metrics
Cosine Similarity using Word Embeddings
- Description: Measures semantic similarity by mapping words/phrases to vectors.
- Equation: Cosine Similarity = ( A • B ) / ( ||A|| ||B|| )
- Example: "The dog chased the cat." vs. "A canine pursued a feline." (High similarity)
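The equation above can be computed directly. In the toy sketch below the vectors are placeholders; a real evaluation would obtain them from a word- or sentence-embedding model, which is not specified here:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine Similarity = (A . B) / (||A|| ||B||)"""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings of the two sentences
near = cosine_similarity([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])  # close in direction
far = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])   # orthogonal
```

Values near 1 indicate semantically similar texts; orthogonal embeddings score 0.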
Exact Match (EM)
- Description: Checks if critical keywords are present.
- Example:
- Expected: "Paris"
- Response: "The capital of France is Paris." (EM = 1)
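The card does not specify the exact normalization used for EM; a minimal case-insensitive keyword-containment check matching the example above might look like:

```python
def exact_match(expected: str, response: str) -> int:
    """EM = 1 if the expected keyword appears in the response, else 0."""
    return int(expected.lower() in response.lower())

em = exact_match("Paris", "The capital of France is Paris.")  # EM = 1
```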
ROUGE Score
- Description: Measures overlap via the longest common subsequence (LCS) between the reference text (R) and the candidate response (C).
- Equation:
- Precision = LCS(R, C) / Length of C
- Recall = LCS(R, C) / Length of R
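The two equations above can be implemented with a standard dynamic-programming LCS over word tokens. This is a minimal sketch assuming whitespace tokenization; the evaluation repository may tokenize differently:

```python
def lcs_length(r, c):
    """Length of the longest common subsequence of token lists r and c."""
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i, tok_r in enumerate(r, 1):
        for j, tok_c in enumerate(c, 1):
            if tok_r == tok_c:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(r)][len(c)]

def rouge_l(reference: str, candidate: str):
    """ROUGE-L precision and recall from the LCS equations above."""
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    return lcs / len(c), lcs / len(r)  # (precision, recall)

precision, recall = rouge_l("the cat sat on the mat", "the cat lay on the mat")
```

Here the LCS is "the cat on the mat" (5 tokens out of 6 on each side), so both precision and recall are 5/6.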
Model Evaluation Summary
| Model Name | Duration (s) | Emissions (kgCO₂e) | Avg. EM | Avg. Cosine Similarity | Avg. ROUGE Score |
|---|---|---|---|---|---|
| LLaMA-2-7B-HF | 18215.61 | 1.84e-01 | 0.715 | 0.7257 | 0.0821 |
| baby-llama-58m | 57.20 | 2.73e-06 | 0.025 | 0.6556 | 0.0097 |
| DistilLlama | 77.12 | 7.79e-04 | 0.02 | 0.6623 | 0.0115 |
| DistilLlamaV1 | 78.46 | 8.49e-04 | 0.065 | 0.6776 | 0.0135 |
Note: CodeCarbon was used to track carbon emissions. The evaluation was run with 80GB of memory and 32 cores on an Intel(R) Xeon(R) Gold 6448H CPU.
GitHub Repositories
- Training Repo: DistilLlama Training Repository
- Evaluation Repo: Knowledge Distillation Evaluation Repository
Reference
@misc{timiryasov2023babyllamaknowledgedistillation,
  title={Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty},
  author={Inar Timiryasov and Jean-Loup Tastet},
  year={2023},
  eprint={2308.02019},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2308.02019},
}
Note: The repository will be updated as training progresses. Last update 2024-11-06