GPT-2 Medium Instruct

A 355M-parameter GPT-2 Medium model fine-tuned on the yahma/alpaca-cleaned instruction dataset, starting from the pretrained openai-community/gpt2-medium weights and using a custom model implementation and training pipeline built with PyTorch Lightning.

Model Details

Property              Value
Base model            openai-community/gpt2-medium
Parameters            ~355M
Architecture          GPT-2 (decoder-only transformer)
Fine-tuning dataset   yahma/alpaca-cleaned (10,000 training samples)
Context length        1,024 tokens
Vocabulary size       50,257 tokens
Embedding dim         1,024
Transformer layers    24
Attention heads       16
Tokenizer             GPT-2 BPE (tiktoken during training; HF GPT2Tokenizer for inference)

Training Details

Dataset

The model was fine-tuned on the yahma/alpaca-cleaned dataset, a cleaned version of Stanford Alpaca's 52K instruction-following examples generated with text-davinci-003.

Split        Samples
Train        10,000
Validation   1,000
Test         1,000

Prompt Format

The model uses the standard Alpaca prompt template:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}   ← omitted if empty

### Response:
{output}

During training, the instruction and input tokens are set to -100 in the targets, so the loss is computed only on the response tokens. This is a common technique for teaching the model how to respond rather than to reproduce the prompt structure.
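
The masking step can be sketched in plain Python as follows. This is a minimal illustration, not the author's training code: the helper name make_inputs_and_targets and the toy token IDs are hypothetical, and it assumes the pipeline shifts targets by one position when building each sample. The value -100 matches the default ignore_index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def make_inputs_and_targets(prompt_ids, response_ids, ignore_index=IGNORE_INDEX):
    """Build next-token inputs/targets and mask the targets that are prompt tokens.

    Hypothetical helper for illustration; token IDs are plain Python ints.
    """
    full = list(prompt_ids) + list(response_ids)
    inputs = full[:-1]   # the model sees every token except the last
    targets = full[1:]   # targets[i] is the token that follows inputs[i]

    # The first len(prompt_ids) - 1 targets are still prompt tokens: mask them.
    n_masked = max(len(prompt_ids) - 1, 0)
    targets = [ignore_index] * n_masked + targets[n_masked:]
    return inputs, targets


# Toy example: a 3-token prompt followed by a 3-token response.
inputs, targets = make_inputs_and_targets([10, 11, 12], [20, 21, 22])
print(inputs)   # [10, 11, 12, 20, 21]
print(targets)  # [-100, -100, 20, 21, 22]
```

Positions holding -100 contribute nothing to the loss, so only the response tokens drive the gradient.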

Optimizer

Hyperparameter   Value
Optimizer        AdamW
Learning rate    3e-5
Weight decay     0.1
Beta1 / Beta2    0.9 / 0.95
Gradient clip    1.0
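
As a rough sketch, these hyperparameters map onto PyTorch as shown below. The tiny nn.Linear stand-in model and the dummy loss are assumptions for illustration only. In a PyTorch Lightning run the optimizer would normally be returned from configure_optimizers, and clipping would typically be set through the Trainer's gradient_clip_val argument rather than called by hand.

```python
import torch
from torch import nn

model = nn.Linear(8, 8)  # stand-in for the real GPT model (assumption)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,
    weight_decay=0.1,
    betas=(0.9, 0.95),
)

# One manual training step with a dummy loss, to show where clipping happens.
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
# Rescale gradients so their global L2 norm is at most 1.0.
# The return value is the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```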

Training Config

Setting                       Value
Framework                     PyTorch Lightning
Epochs                        2 (+ 1 continuation epoch)
Batch size (per device)       2
Gradient accumulation steps   4
Effective batch size          8
Precision                     16-mixed (FP16 + FP32)
Hardware                      Single GPU (Colab)
Early stopping patience       3 validation checks
Checkpoint metric             val_loss_eval (minimize)
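
The effective batch size and the resulting number of optimizer steps follow from simple arithmetic. The sketch below assumes a single device, no dropped last batch, and the 10,000-sample training split; the variable names are illustrative.

```python
import math

train_samples = 10_000
per_device_batch_size = 2
grad_accum_steps = 4
num_devices = 1  # single GPU (Colab)

# Samples that contribute to each optimizer update.
effective_batch_size = per_device_batch_size * grad_accum_steps * num_devices

# Forward/backward passes per epoch, and optimizer updates per epoch.
micro_batches_per_epoch = math.ceil(train_samples / (per_device_batch_size * num_devices))
optimizer_steps_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum_steps)

print(effective_batch_size)       # 8
print(micro_batches_per_epoch)    # 5000
print(optimizer_steps_per_epoch)  # 1250
```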

Usage

Basic Inference

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

model_id = "snehangshu511/gpt2-medium-instruct"

tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model     = GPT2LMHeadModel.from_pretrained(model_id)
model.eval()

def build_prompt(instruction, input_text=""):
    base = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
    )
    if input_text.strip():
        return f"{base}### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    return f"{base}### Instruction:\n{instruction}\n\n### Response:\n"

prompt = build_prompt("Explain what machine learning is in simple terms.")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens (strip the prompt)
input_len = inputs["input_ids"].shape[1]
response  = tokenizer.decode(output_ids[0][input_len:], skip_special_tokens=True)
print(response)

With Optional Input Context

prompt = build_prompt(
    instruction="Summarize the following text.",
    input_text="The Industrial Revolution began in Britain in the 18th century..."
)

Recommended Generation Settings

Setting              Recommended range   Effect
temperature          0.6 – 0.9           Higher = more creative, lower = more deterministic
top_p                0.85 – 0.95         Nucleus sampling: keeps the smallest set of tokens whose cumulative probability reaches top_p
top_k                40 – 60             Restricts candidates to the K most likely tokens at each step
repetition_penalty   1.1 – 1.3           Higher = less repetition in output
max_new_tokens       100 – 300           Keep under 800 so that prompt plus output stays within the 1,024-token context window
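
To make the effect of temperature, top_k and top_p concrete, here is a simplified pure-Python sketch of how such filters narrow the next-token distribution. It is an illustration of the general technique, not the transformers implementation; the function name and the toy logits are hypothetical, and repetition_penalty is not covered.

```python
import math


def filter_next_token_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax (lower = sharper distribution).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens (renormalized below via k_mass).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # a 4-token toy vocabulary
narrow = filter_next_token_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.5)
wider = filter_next_token_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(sorted(narrow))  # token IDs that survive with top_p=0.5
print(sorted(wider))   # token IDs that survive with top_p=0.8
```

A token is then sampled from the returned distribution, so a smaller top_p or top_k leaves fewer candidates and makes the output more predictable.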

Architecture Notes

The architecture was implemented from scratch as a custom GPTModel class (no AutoModel during training) and initialized with the pretrained GPT-2 Medium weights. After fine-tuning, the weights were converted from the custom format to the HF-compatible GPT2LMHeadModel format for this Hub upload.

Key architectural decisions:

  • Weight tying disabled (tie_word_embeddings=False): In standard GPT-2, the output head shares weights with the embedding layer. During conversion, lm_head.weight was explicitly cloned to avoid shared-memory issues with safetensors. The config reflects this.

  • QKV separation: The custom training model stores Q, K, V as separate linear layers. During HF conversion, they are re-fused into the standard c_attn format that GPT2LMHeadModel expects.

  • Drop rate = 0.0: Dropout is disabled during fine-tuning, a common choice when working with pretrained models on relatively small datasets.
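
The QKV re-fusing step can be sketched as follows. This is an illustration under stated assumptions, not the author's conversion script: the layer names W_query, W_key and W_value are hypothetical, biases are assumed to be present, and a toy embedding size stands in for 1,024. It relies on the fact that Hugging Face's GPT-2 stores c_attn as a Conv1D layer whose weight has shape (embed_dim, 3 * embed_dim), which is transposed relative to nn.Linear.

```python
import torch
from torch import nn

torch.manual_seed(0)
emb_dim = 16  # toy size; GPT-2 Medium uses 1,024

# Separate projections, as in the custom training model (names are hypothetical).
W_query = nn.Linear(emb_dim, emb_dim)
W_key = nn.Linear(emb_dim, emb_dim)
W_value = nn.Linear(emb_dim, emb_dim)

# nn.Linear stores weight as (out, in); HF Conv1D expects (in, out), so transpose
# each matrix and concatenate along the output dimension in Q, K, V order.
c_attn_weight = torch.cat([W_query.weight.T, W_key.weight.T, W_value.weight.T], dim=1)
c_attn_bias = torch.cat([W_query.bias, W_key.bias, W_value.bias], dim=0)

# Check: the fused projection reproduces the three separate projections.
x = torch.randn(2, 5, emb_dim)
fused = x @ c_attn_weight + c_attn_bias
q, k, v = fused.split(emb_dim, dim=-1)
print(torch.allclose(q, W_query(x), atol=1e-5))
```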


Files in This Repository

File                     Description
model.safetensors        Model weights in safetensors format (recommended)
pytorch_model.bin        Model weights in legacy .bin format
config.json              GPT2Config: model architecture definition
generation_config.json   Default generation settings
tokenizer.json           Fast tokenizer file
tokenizer_config.json    Tokenizer configuration
checkpoints/model.ckpt   Original PyTorch Lightning training checkpoint

Limitations

  • Small training subset: Only 10,000 of the available ~52,000 Alpaca samples were used for training. Training on the full dataset would likely improve results.
  • GPT-2 base: GPT-2 Medium, while a solid model, is much smaller than modern instruction-tuned LLMs. Responses can be inconsistent or drift from the prompt on complex tasks.
  • No RLHF: The model is instruction-tuned via supervised fine-tuning only, with no reinforcement learning from human feedback. It may produce responses that are grammatically correct but factually wrong.
  • Context length: Hard-limited to 1,024 tokens, shared between the prompt and the generated output. Long prompts must be shortened or truncated to fit.
  • No safety alignment: There is no safety filtering or RLHF alignment. Do not deploy in production without additional safety measures.
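
Because the prompt and the generated tokens share the same 1,024-token window, the room left for generation depends on the prompt length. A minimal sketch of that budget check follows; the helper name is hypothetical, and prompt_len would come from the tokenized prompt (for example inputs["input_ids"].shape[1]).

```python
CONTEXT_LENGTH = 1024  # GPT-2 Medium context window


def clamp_max_new_tokens(prompt_len, requested, context_length=CONTEXT_LENGTH):
    """Return how many new tokens can be generated without exceeding the context window."""
    if prompt_len >= context_length:
        raise ValueError("prompt is too long; shorten or truncate it first")
    return min(requested, context_length - prompt_len)


print(clamp_max_new_tokens(prompt_len=120, requested=200))  # 200
print(clamp_max_new_tokens(prompt_len=900, requested=200))  # 124
```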

Training Pipeline Summary

yahma/alpaca-cleaned (52K rows)
        ↓ load 10K rows
Alpaca prompt formatting
        ↓
tiktoken BPE tokenization
        ↓ -100 masking on prompt tokens
Custom PyTorch Dataset + DataLoader (dynamic padding)
        ↓
GPT-2 Medium pretrained weights loaded from openai-community/gpt2-medium
        ↓
PyTorch Lightning fine-tuning
  - AdamW, lr=3e-5, 2 epochs
  - FP16 mixed precision
  - Gradient accumulation (eff. batch = 8)
  - Checkpoint on best val_loss_eval
        ↓
Lightning prefix stripped → raw GPTModel state dict
        ↓
Custom → HF format conversion (QKV fusing, key renaming)
        ↓
Saved as model.safetensors + pytorch_model.bin
        ↓
Pushed to snehangshu511/gpt2-medium-instruct
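
The prefix-stripping step can be illustrated with plain dictionaries. The sketch assumes the LightningModule held the network in an attribute called model, so checkpoint keys start with "model."; that prefix, the key names and the helper are hypothetical, and real values would be tensors taken from the checkpoint's "state_dict" entry.

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy stand-in for checkpoint["state_dict"]; strings replace tensors here.
lightning_state = {
    "model.tok_emb.weight": "tensor-a",
    "model.trf_blocks.0.att.W_query.weight": "tensor-b",
}
raw_state = strip_prefix(lightning_state)
print(list(raw_state))
```

The resulting keys match what a bare GPTModel instance would expect in load_state_dict, before any renaming to the HF layout.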

Citation

If you use this model, please also cite the resources it was built from:

@book{raschka2024llms,
  title     = {Build a Large Language Model (From Scratch)},
  author    = {Sebastian Raschka},
  year      = {2024},
  publisher = {Manning Publications}
}

@misc{alpaca,
  title  = {Stanford Alpaca: An Instruction-following LLaMA model},
  author = {Taori et al.},
  year   = {2023},
  url    = {https://github.com/tatsu-lab/stanford_alpaca}
}

Author

Snehangshu Bhuin, Data Scientist
GitHub: snehangshu2002
Built as part of ongoing LLM learning and portfolio development.
