GPT-2 Medium Instruct
A 355M-parameter GPT-2 Medium model fine-tuned on the yahma/alpaca-cleaned instruction dataset, using a custom training pipeline written from scratch in PyTorch Lightning.
Model Details
| Property | Value |
|---|---|
| Base model | openai-community/gpt2-medium |
| Parameters | ~355M |
| Architecture | GPT-2 (decoder-only transformer) |
| Fine-tuning dataset | yahma/alpaca-cleaned (10,000 training samples) |
| Context length | 1,024 tokens |
| Vocabulary size | 50,257 tokens |
| Embedding dim | 1,024 |
| Transformer layers | 24 |
| Attention heads | 16 |
| Tokenizer | GPT-2 BPE (via tiktoken / HF GPT2Tokenizer) |
Training Details
Dataset
The model was fine-tuned on the yahma/alpaca-cleaned dataset, a cleaned version of Stanford Alpaca's 52K instruction-following data generated from text-davinci-003.
| Split | Samples |
|---|---|
| Train | 10,000 |
| Validation | 1,000 |
| Test | 1,000 |
Prompt Format
The model uses the standard Alpaca prompt template:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}  (the Input section is omitted if empty)
### Response:
{output}
During training, the instruction + input portion is masked with -100 in the targets so the loss is only computed on the response tokens. This is the standard technique to make the model learn how to respond rather than memorize the prompt structure.
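As a rough illustration of the masking described above, the sketch below builds shifted input/target lists and replaces the prompt positions in the targets with -100. The function name `build_inputs_and_targets` and the toy token IDs are hypothetical, and the actual collate function used for this model may differ (for example, in how padding is handled).

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
EOS_ID = 50256       # GPT-2 <|endoftext|> token ID


def build_inputs_and_targets(prompt_ids, response_ids, eos_id=EOS_ID):
    """Return (inputs, targets) for next-token prediction with the prompt masked.

    Hypothetical helper: targets are the sequence shifted left by one, and
    every target that would predict a prompt token is set to IGNORE_INDEX.
    """
    sequence = list(prompt_ids) + list(response_ids) + [eos_id]
    inputs = sequence[:-1]
    targets = sequence[1:]
    # targets[i] predicts sequence[i + 1], so the first len(prompt) - 1
    # targets still point at prompt tokens and are excluded from the loss.
    n_masked = max(len(prompt_ids) - 1, 0)
    targets[:n_masked] = [IGNORE_INDEX] * n_masked
    return inputs, targets


# Toy IDs standing in for a tokenized prompt and response.
inputs, targets = build_inputs_and_targets([10, 11, 12], [20, 21])
print(inputs)   # [10, 11, 12, 20, 21]
print(targets)  # [-100, -100, 20, 21, 50256]
```

Because the loss ignores every position labeled -100, only the response tokens and the end-of-text token contribute gradients.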
Optimizer
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3e-5 |
| Weight decay | 0.1 |
| Beta1 / Beta2 | 0.9 / 0.95 |
| Gradient clip | 1.0 |
Training Config
| Setting | Value |
|---|---|
| Framework | PyTorch Lightning |
| Epochs | 2 (+ 1 continuation epoch) |
| Batch size (per device) | 2 |
| Gradient accumulation steps | 4 |
| Effective batch size | 8 |
| Precision | 16-mixed (FP16 + FP32) |
| Hardware | Single GPU (Colab) |
| Early stopping patience | 3 validation checks |
| Checkpoint metric | val_loss_eval (minimize) |
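The effective batch size follows from the per-device batch size and the accumulation steps. The short sketch below reproduces that arithmetic and estimates the number of optimizer steps per epoch. It assumes a single device and that no batches are dropped; the variable names are illustrative only.

```python
import math

per_device_batch = 2
accumulation_steps = 4
num_devices = 1          # assumption: single GPU, as listed in the table
train_samples = 10_000

# Samples that contribute to one optimizer update.
effective_batch = per_device_batch * accumulation_steps * num_devices

# Forward/backward passes per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(train_samples / (per_device_batch * num_devices))
optimizer_steps_per_epoch = math.ceil(batches_per_epoch / accumulation_steps)

print(effective_batch)            # 8
print(optimizer_steps_per_epoch)  # 1250
```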
Usage
Basic Inference
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

model_id = "snehangshu511/gpt2-medium-instruct"
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)
model.eval()

def build_prompt(instruction, input_text=""):
    base = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
    )
    if input_text.strip():
        return f"{base}### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    return f"{base}### Instruction:\n{instruction}\n\n### Response:\n"

prompt = build_prompt("Explain what machine learning is in simple terms.")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens (strip the prompt)
input_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(output_ids[0][input_len:], skip_special_tokens=True)
print(response)
With Optional Input Context
prompt = build_prompt(
    instruction="Summarize the following text.",
    input_text="The Industrial Revolution began in Britain in the 18th century..."
)
Recommended Generation Settings
| Setting | Recommended range | Effect |
|---|---|---|
| temperature | 0.6 to 0.9 | Higher = more creative, lower = more deterministic |
| top_p | 0.85 to 0.95 | Nucleus sampling: limits the token pool to the smallest set whose cumulative probability reaches P |
| top_k | 40 to 60 | Hard-limits candidate tokens to the top K at each step |
| repetition_penalty | 1.1 to 1.3 | Higher = less repetition in output |
| max_new_tokens | 100 to 300 | Keep under 800 so that prompt plus output fits within the 1,024-token context window |
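Since the prompt and the generated tokens share the same 1,024-token window, one way to pick a safe max_new_tokens is to clamp the requested value against the prompt length. The helper below is a minimal sketch of that idea; `clamp_max_new_tokens` is a hypothetical name, not part of transformers.

```python
CONTEXT_LENGTH = 1024  # GPT-2 Medium context window, from the Model Details table


def clamp_max_new_tokens(prompt_len, requested, context_length=CONTEXT_LENGTH):
    """Return the largest max_new_tokens <= requested that fits after the prompt.

    Hypothetical helper: raises if the prompt leaves no room for generation.
    """
    if prompt_len >= context_length:
        raise ValueError("prompt alone fills or exceeds the context window")
    return min(requested, context_length - prompt_len)


print(clamp_max_new_tokens(prompt_len=60, requested=200))   # 200
print(clamp_max_new_tokens(prompt_len=900, requested=200))  # 124
```

With the inference example above, prompt_len would be inputs["input_ids"].shape[1].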
Architecture Notes
This model was built from scratch using a custom GPTModel class (no AutoModel during training). The weights were converted from the custom format to HF-compatible GPT2LMHeadModel format for this Hub upload.
Key architectural decisions:
- Weight tying disabled (tie_word_embeddings=False): In standard GPT-2, the output head shares weights with the embedding layer. During conversion, lm_head.weight was explicitly cloned to avoid shared-memory issues with safetensors. The config reflects this.
- QKV separation: The custom training model stores Q, K, V as separate linear layers. During HF conversion, they are re-fused into the standard c_attn format that GPT2LMHeadModel expects.
- Drop rate = 0.0: Dropout is disabled during fine-tuning, which is standard practice when working with pretrained models on relatively small datasets.
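The QKV re-fusing step can be sketched roughly as follows. This is not the author's conversion script: the function name `fuse_qkv`, the layer names, and the toy dimensions are hypothetical, and the sketch assumes the separate projections are nn.Linear layers with biases. It relies on the fact that the c_attn layer in GPT2LMHeadModel stores its weight as (in_features, 3 * out_features), the transpose of nn.Linear's layout.

```python
import torch
import torch.nn as nn


def fuse_qkv(w_query, w_key, w_value):
    """Fuse three nn.Linear layers into a c_attn-style (weight, bias) pair."""
    # nn.Linear stores weight as (out_features, in_features); the Conv1D layer
    # used for c_attn stores it as (in_features, out_features), hence the .T.
    weight = torch.cat(
        [w_query.weight.T, w_key.weight.T, w_value.weight.T], dim=1
    )
    bias = torch.cat([w_query.bias, w_key.bias, w_value.bias], dim=0)
    return weight.detach().clone(), bias.detach().clone()


torch.manual_seed(0)
emb_dim = 8  # toy size for illustration; the real model uses 1,024
w_q, w_k, w_v = (nn.Linear(emb_dim, emb_dim, bias=True) for _ in range(3))
weight, bias = fuse_qkv(w_q, w_k, w_v)

# The fused projection computes x @ weight + bias, and its output is split
# into equal thirds along the last dimension to recover Q, K, V.
x = torch.randn(2, 5, emb_dim)
q, k, v = (x @ weight + bias).split(emb_dim, dim=-1)
print(tuple(weight.shape), tuple(bias.shape))  # (8, 24) (24,)
```

The concatenation order (query, key, value) matters, because the fused output is split positionally.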
Files in This Repository
| File | Description |
|---|---|
| model.safetensors | Model weights in safetensors format (recommended) |
| pytorch_model.bin | Model weights in legacy .bin format |
| config.json | GPT2Config: model architecture definition |
| generation_config.json | Default generation settings |
| tokenizer.json | Fast tokenizer file |
| tokenizer_config.json | Tokenizer configuration |
| checkpoints/model.ckpt | Original PyTorch Lightning training checkpoint |
Limitations
- Small training subset: Only 10,000 of the available ~52,000 Alpaca samples were used. A full dataset run would likely yield noticeably better results.
- GPT-2 base: GPT-2 Medium, while a solid model, is much smaller than modern instruction-tuned LLMs. Responses can be inconsistent or drift from the prompt on complex tasks.
- No RLHF: The model is instruction-tuned via supervised fine-tuning only, with no reinforcement learning from human feedback. It may produce responses that are grammatically correct but factually wrong.
- Context length: Hard-limited to 1,024 tokens. Long prompts can get truncated.
- No safety alignment: There is no safety filtering or RLHF alignment. Do not deploy in production without additional safety measures.
Training Pipeline Summary
yahma/alpaca-cleaned (52K rows)
        ↓  load 10K rows
Alpaca prompt formatting
        ↓
tiktoken BPE tokenization
        ↓  -100 masking on prompt tokens
Custom PyTorch Dataset + DataLoader (dynamic padding)
        ↓
GPT-2 Medium pretrained weights loaded from openai-community/gpt2-medium
        ↓
PyTorch Lightning fine-tuning
  - AdamW, lr=3e-5, 2 epochs
  - FP16 mixed precision
  - Gradient accumulation (eff. batch = 8)
  - Checkpoint on best val_loss
        ↓
Lightning prefix stripped → raw GPTModel state dict
        ↓
Custom → HF format conversion (QKV fusing, key renaming)
        ↓
Saved as model.safetensors + pytorch_model.bin
        ↓
Pushed to snehangshu511/gpt2-medium-instruct
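The prefix-stripping step in the pipeline above might look something like the sketch below. The prefix "model." and the example key names are assumptions for illustration; the real prefix depends on the attribute name under which the LightningModule holds the GPTModel.

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a new dict with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy stand-in for checkpoint["state_dict"]; values would be tensors in practice,
# and these key names are hypothetical.
lightning_state = {"model.tok_emb.weight": 1, "model.out_head.weight": 2}
raw_state = strip_prefix(lightning_state)
print(raw_state)  # {'tok_emb.weight': 1, 'out_head.weight': 2}
```

The resulting dictionary can then be loaded into the bare model and passed on to the HF key-renaming step.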
Citation
If you use this model, please also cite the resources it was built from:
@book{raschka2024llms,
  title = {Build a Large Language Model (From Scratch)},
  author = {Sebastian Raschka},
  year = {2024},
  publisher = {Manning Publications}
}
@misc{alpaca,
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  author = {Taori et al.},
  year = {2023},
  url = {https://github.com/tatsu-lab/stanford_alpaca}
}
Author
Snehangshu Bhuin, Data Scientist
GitHub: snehangshu2002
Built as part of ongoing LLM learning and portfolio development.