Instructions for using HenryHHHH/DistilLlamaV1 with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.

Transformers

How to use HenryHHHH/DistilLlamaV1 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HenryHHHH/DistilLlamaV1")
```

```python
# Load the model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HenryHHHH/DistilLlamaV1")
model = AutoModelForCausalLM.from_pretrained("HenryHHHH/DistilLlamaV1")
```
vLLM

How to use HenryHHHH/DistilLlamaV1 with vLLM:

Install from pip and serve the model:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "HenryHHHH/DistilLlamaV1"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HenryHHHH/DistilLlamaV1",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
SGLang

How to use HenryHHHH/DistilLlamaV1 with SGLang:

Install from pip and serve the model:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "HenryHHHH/DistilLlamaV1" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HenryHHHH/DistilLlamaV1",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Or use the Docker images:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "HenryHHHH/DistilLlamaV1" \
    --host 0.0.0.0 \
    --port 30000
```

Docker Model Runner
How to use HenryHHHH/DistilLlamaV1 with Docker Model Runner:

```shell
docker model run hf.co/HenryHHHH/DistilLlamaV1
```
Overview
This model is a distilled version of LLaMA 2 with approximately 80 million parameters. It was trained on a mix of the OpenWebText and WikiText Raw V1 datasets. Knowledge distillation was used to transfer knowledge from a larger teacher model, Meta's 7B LLaMA 2, so that this smaller student model mimics the teacher's behavior. This is the latest version of DistilLlama, trained for 5 days on two NVIDIA A100 80GB GPUs.
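The exact distillation objective is not spelled out here. As an illustrative sketch only (not the project's actual training loss), standard knowledge distillation minimizes a temperature-scaled KL divergence between the teacher's and student's output distributions; the temperature value below is an assumption:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in classic knowledge distillation."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    log_p_teacher = np.log(p_teacher + 1e-12)
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return (temperature ** 2) * kl.mean()

# Identical logits give (near-)zero loss; diverging logits give a positive loss
same = distillation_loss(np.array([[1.0, 2.0, 3.0]]), np.array([[1.0, 2.0, 3.0]]))
diff = distillation_loss(np.array([[3.0, 2.0, 1.0]]), np.array([[1.0, 2.0, 3.0]]))
```

A higher temperature softens both distributions, exposing more of the teacher's relative preferences among non-top tokens to the student.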
Update
30 of the 300 training checkpoints were examined, and the one with the best semantic and factual accuracy is now the checkpoint published in this repository.
Model Architecture
The architecture is based on LLaMA 2, with the following parameters:
| Parameter | Value |
|---|---|
| Hidden Dimension | 512 |
| Intermediate Dimension | 1536 |
| Max Positional Embeddings | 128 |
| Attention Heads | 8 |
| Transformer Layers | 16 |
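A rough parameter count from the table above is consistent with the stated ~80M figure. The sketch below is a back-of-the-envelope estimate assuming LLaMA-style blocks (4 attention projections, a 3-matrix SwiGLU MLP) and LLaMA 2's 32k vocabulary, which is not stated in the card; norms and biases are ignored, and the exact total depends on vocabulary size and whether embeddings are tied:

```python
def estimate_params(d=512, d_ff=1536, n_layers=16, vocab=32000, tied_embeddings=True):
    """Back-of-the-envelope parameter count for a LLaMA-style decoder."""
    embed = vocab * d                      # token embedding matrix
    lm_head = 0 if tied_embeddings else vocab * d
    attn = 4 * d * d                       # per-layer Q, K, V, O projections
    mlp = 3 * d * d_ff                     # per-layer gate, up, down projections
    return embed + lm_head + n_layers * (attn + mlp)

total = estimate_params()
print(f"~{total / 1e6:.0f}M parameters")  # prints "~71M parameters"
```

With untied embeddings the same arithmetic gives roughly 87M, so the stated ~80M sits in this range.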
Evaluation Metrics
Cosine Similarity using Word Embeddings
- Description: Measures semantic similarity by mapping words/phrases to vectors.
- Equation: Cosine Similarity = ( A • B ) / ( ||A|| ||B|| )
- Example: "The dog chased the cat." vs. "A canine pursued a feline." (High similarity)
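The equation above can be computed directly. In the toy sketch below the vectors are placeholders; a real evaluation would obtain them from a word- or sentence-embedding model, which is not specified here:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine Similarity = (A . B) / (||A|| ||B||)"""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings of the two sentences
near = cosine_similarity([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])  # close in direction
far = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])   # orthogonal
```

Values near 1 indicate semantically similar texts; orthogonal embeddings score 0.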
Exact Match (EM)
- Description: Checks if critical keywords are present.
- Example:
- Expected: "Paris"
- Response: "The capital of France is Paris." (EM = 1)
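The card does not specify the exact normalization used for EM; a minimal case-insensitive keyword-containment check matching the example above might look like:

```python
def exact_match(expected: str, response: str) -> int:
    """EM = 1 if the expected keyword appears in the response, else 0."""
    return int(expected.lower() in response.lower())

em = exact_match("Paris", "The capital of France is Paris.")  # EM = 1
```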
ROUGE Score
- Description: Measures overlap via the longest common subsequence (LCS) between the reference text (R) and the candidate response (C).
- Equation:
- Precision = LCS(R, C) / Length of C
- Recall = LCS(R, C) / Length of R
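The two equations above can be implemented with a standard dynamic-programming LCS over word tokens. This is a minimal sketch assuming whitespace tokenization; the evaluation repository may tokenize differently:

```python
def lcs_length(r, c):
    """Length of the longest common subsequence of token lists r and c."""
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i, tok_r in enumerate(r, 1):
        for j, tok_c in enumerate(c, 1):
            if tok_r == tok_c:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(r)][len(c)]

def rouge_l(reference: str, candidate: str):
    """ROUGE-L precision and recall from the LCS equations above."""
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    return lcs / len(c), lcs / len(r)  # (precision, recall)

precision, recall = rouge_l("the cat sat on the mat", "the cat lay on the mat")
```

Here the LCS is "the cat on the mat" (5 tokens out of 6 on each side), so both precision and recall are 5/6.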
Model Evaluation Summary
| Model Name | Duration (s) | Emissions (kgCO₂e) | Avg. EM | Avg. Cosine Similarity | Avg. ROUGE Score |
|---|---|---|---|---|---|
| LLaMA-2-7B-HF | 18215.61 | 1.84e-01 | 0.715 | 0.7257 | 0.0821 |
| baby-llama-58m | 57.20 | 2.73e-06 | 0.025 | 0.6556 | 0.0097 |
| DistilLlama | 77.12 | 7.79e-04 | 0.02 | 0.6623 | 0.0115 |
| DistilLlamaV1 | 78.46 | 8.49e-04 | 0.065 | 0.6776 | 0.0135 |
Note: CodeCarbon was used to track carbon emissions. The evaluation was run with 80GB of memory and 32 cores on an Intel(R) Xeon(R) Gold 6448H CPU.
GitHub Repositories
- Training Repo: DistilLlama Training Repository
- Evaluation Repo: Knowledge Distillation Evaluation Repository
Reference
@misc{timiryasov2023babyllamaknowledgedistillation,
  title={Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty},
  author={Inar Timiryasov and Jean-Loup Tastet},
  year={2023},
  eprint={2308.02019},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2308.02019},
}
Note: The repository will be updated as training progresses. Last update 2024-11-06