GPT-2 Medium Instruct

A 355M-parameter GPT-2 Medium model fine-tuned on the `yahma/alpaca-cleaned` instruction dataset. The pretrained `openai-community/gpt2-medium` weights were loaded into a from-scratch model implementation and trained with a custom PyTorch Lightning pipeline.

Model Details

Property	Value
Base model	`openai-community/gpt2-medium`
Parameters	~355M
Architecture	GPT-2 (decoder-only transformer)
Fine-tuning dataset	`yahma/alpaca-cleaned` (10,000 training samples)
Context length	1,024 tokens
Vocabulary size	50,257 tokens
Embedding dim	1,024
Transformer layers	24
Attention heads	16
Tokenizer	GPT-2 BPE (via `tiktoken` / HF `GPT2Tokenizer`)
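
As a sanity check on the "~355M" figure, the parameter count can be re-derived from the dimensions in the table. This is a rough sketch that assumes the standard GPT-2 layout (feed-forward width of 4x the embedding dim, biases on every linear layer, two layer norms per block plus a final one); none of those details are stated in this card.

```python
# Parameter count for a GPT-2 style decoder, using the values from the table.
VOCAB, CTX, DIM, LAYERS = 50_257, 1_024, 1_024, 24
FFN = 4 * DIM  # assumption: standard GPT-2 feed-forward width


def count_parameters(tied_head: bool = True) -> int:
    token_emb = VOCAB * DIM
    pos_emb = CTX * DIM
    layer_norm = 2 * DIM                       # scale + shift
    attention = DIM * 3 * DIM + 3 * DIM        # Q, K, V projections + biases
    attention += DIM * DIM + DIM               # attention output projection + bias
    mlp = (DIM * FFN + FFN) + (FFN * DIM + DIM)
    block = 2 * layer_norm + attention + mlp
    total = token_emb + pos_emb + LAYERS * block + layer_norm  # last term: final norm
    if not tied_head:
        total += VOCAB * DIM                   # separate output head, no bias
    return total


tied_total = count_parameters(tied_head=True)
untied_total = count_parameters(tied_head=False)
print(f"tied head:   {tied_total:,}")
print(f"untied head: {untied_total:,}")
```

Under these assumptions the tied count is about 354.8M, which matches the figure above. With an untied output head, the stored weights would be about 51M larger.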

Training Details

Dataset

The model was fine-tuned on the yahma/alpaca-cleaned dataset, a cleaned version of Stanford Alpaca's 52K instruction-following examples, which were generated with text-davinci-003. Only a subset was used:

Split	Samples
Train	10,000
Validation	1,000
Test	1,000

Prompt Format

The model uses the standard Alpaca prompt template:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}   ← the whole "### Input:" block is omitted when the input is empty

### Response:
{output}

During training, the target positions that correspond to the instruction and input portion are set to -100, so the loss is computed only on the response tokens. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss. This is the standard technique for teaching the model how to respond rather than how to reproduce the prompt template.
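
The masking step can be sketched with plain Python lists. This is an illustration, not the training code from this repository: the helper name `build_example` is hypothetical, and it assumes that targets are the inputs shifted left by one token and that an end-of-text token (id 50256 in the GPT-2 vocabulary) is appended to each response.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def build_example(prompt_ids, response_ids, eos_id=50256):
    """Return (inputs, targets) with the prompt positions masked in targets."""
    ids = list(prompt_ids) + list(response_ids) + [eos_id]
    inputs = ids[:-1]    # the model sees every token except the last
    targets = ids[1:]    # and predicts the next token at each position
    # targets[i] is the token that follows inputs[i], so the first
    # len(prompt) - 1 targets are still prompt tokens and get masked.
    n_masked = max(len(prompt_ids) - 1, 0)
    targets[:n_masked] = [IGNORE_INDEX] * n_masked
    return inputs, targets


inputs, targets = build_example(prompt_ids=[10, 11, 12], response_ids=[20, 21])
print(inputs)   # [10, 11, 12, 20, 21]
print(targets)  # [-100, -100, 20, 21, 50256]
```

The first response token (20) stays in the targets, because predicting it from the last prompt token is exactly what the model is meant to learn.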

Optimizer

Hyperparameter	Value
Optimizer	AdamW
Learning rate	`3e-5`
Weight decay	`0.1`
Beta1 / Beta2	`0.9` / `0.95`
Gradient clip	`1.0`
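
For reference, the table maps onto PyTorch roughly as follows. This is a sketch under assumptions: the `nn.Linear` model is only a stand-in for the real network, and clipping by gradient norm is assumed (the card gives only the value `1.0`). In a Lightning setup, the optimizer would typically be returned from `configure_optimizers`, and clipping would be set with `Trainer(gradient_clip_val=1.0)`.

```python
import torch
from torch import nn

model = nn.Linear(8, 8)  # hypothetical stand-in for the fine-tuned GPT model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,
    weight_decay=0.1,
    betas=(0.9, 0.95),
)

# One manual training step with gradient clipping (assumed: clip by norm).
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```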

Training Config

Setting	Value
Framework	PyTorch Lightning
Epochs	2 (+ 1 continuation epoch)
Batch size (per device)	2
Gradient accumulation steps	4
Effective batch size	8
Precision	`16-mixed` (FP16 + FP32)
Hardware	Single GPU (Colab)
Early stopping patience	3 validation checks
Checkpoint metric	`val_loss_eval` (minimize)
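
The batch settings imply the following step counts per epoch. This is a small worked calculation, assuming a single device and that the last partial batch is not dropped.

```python
import math

train_samples = 10_000
per_device_batch = 2
accumulation_steps = 4

effective_batch = per_device_batch * accumulation_steps
micro_batches_per_epoch = math.ceil(train_samples / per_device_batch)
optimizer_steps_per_epoch = math.ceil(micro_batches_per_epoch / accumulation_steps)

print(effective_batch)            # 8
print(micro_batches_per_epoch)    # 5000
print(optimizer_steps_per_epoch)  # 1250
```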

Usage

Basic Inference

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

model_id = "snehangshu511/gpt2-medium-instruct"

tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model     = GPT2LMHeadModel.from_pretrained(model_id)
model.eval()

def build_prompt(instruction, input_text=""):
    base = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
    )
    if input_text.strip():
        return f"{base}### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    return f"{base}### Instruction:\n{instruction}\n\n### Response:\n"

prompt = build_prompt("Explain what machine learning is in simple terms.")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens (strip the prompt)
input_len = inputs["input_ids"].shape[1]
response  = tokenizer.decode(output_ids[0][input_len:], skip_special_tokens=True)
print(response)

With Optional Input Context

prompt = build_prompt(
    instruction="Summarize the following text.",
    input_text="The Industrial Revolution began in Britain in the 18th century..."
)

Recommended Generation Settings

Setting	Recommended range	Effect
`temperature`	0.6 – 0.9	Higher = more creative, lower = more deterministic
`top_p`	0.85 – 0.95	Nucleus sampling: keeps the smallest set of tokens whose cumulative probability reaches `top_p`
`top_k`	40 – 60	Hard-limits the candidate tokens to the K most likely at each step
`repetition_penalty`	1.1 – 1.3	Higher = less repetition in output
`max_new_tokens`	100 – 300	Prompt tokens plus `max_new_tokens` must not exceed the 1,024-token context window
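
To make the interaction of `temperature`, `top_k` and `top_p` concrete, here is a simplified pure-Python sketch of how a next-token distribution gets narrowed. It is not the `transformers` implementation: the function name is hypothetical, the order (temperature, then top-k, then top-p) is an assumption, and `repetition_penalty` is left out.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    scaled = [x / temperature for x in logits]

    # top-k: keep the k highest-scoring tokens, best first
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    order = order[:top_k]

    # softmax over the survivors (subtract the max for numerical stability)
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for index, p in zip(order, probs):
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # renormalize so the kept probabilities sum to 1
    norm = sum(p for _, p in kept)
    return {index: p / norm for index, p in kept}


candidates = filter_distribution([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9)
print(sorted(candidates))  # [0, 1]
```

In this toy case top-k first drops the least likely token, and top-p then drops the third one, because the two most likely tokens already cover more than 90% of the probability mass.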

Architecture Notes

The architecture was implemented from scratch as a custom GPTModel class (no AutoModel during training) and initialized from the pretrained GPT-2 Medium weights. After fine-tuning, the weights were converted from the custom format to the HF-compatible GPT2LMHeadModel format for this Hub upload.

Key architectural decisions:

Weight tying disabled (tie_word_embeddings=False): In standard GPT-2, the output head shares weights with the embedding layer. During conversion, lm_head.weight was explicitly cloned to avoid shared-memory issues with safetensors. The config reflects this.
QKV separation: The custom training model stores Q, K, V as separate linear layers. During HF conversion, they are re-fused into the standard c_attn format that GPT2LMHeadModel expects.
Drop rate = 0.0: Dropout is disabled during fine-tuning. This is a common choice for short fine-tuning runs that start from pretrained weights, though not a universal one.
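
The QKV re-fusing step can be illustrated on a toy size. This sketch assumes the custom model uses three `nn.Linear` layers with biases. The variable names are hypothetical, and the real conversion script in this repository may differ. The key detail is that `nn.Linear` stores its weight as (out, in), while the GPT-2 `c_attn` layer in `transformers` stores it as (in, 3 * out), so each matrix is transposed before concatenation.

```python
import torch
from torch import nn

torch.manual_seed(0)
dim = 16  # small stand-in for the real embedding dim of 1,024

# Hypothetical separate projections, as described for the custom training model
w_query, w_key, w_value = (nn.Linear(dim, dim) for _ in range(3))

# Transpose each (out, in) matrix to (in, out), then concatenate along the
# output axis in Q, K, V order to get the fused (in, 3 * out) layout.
c_attn_weight = torch.cat(
    [w_query.weight.T, w_key.weight.T, w_value.weight.T], dim=1
)
c_attn_bias = torch.cat([w_query.bias, w_key.bias, w_value.bias])

# Check: the fused projection reproduces the three separate ones.
x = torch.randn(2, 5, dim)
with torch.no_grad():
    fused = x @ c_attn_weight + c_attn_bias
    q, k, v = fused.split(dim, dim=-1)
    matches = all(
        torch.allclose(part, layer(x), atol=1e-5)
        for part, layer in ((q, w_query), (k, w_key), (v, w_value))
    )
print(c_attn_weight.shape, matches)
```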

Files in This Repository

File	Description
`model.safetensors`	Model weights in safetensors format (recommended)
`pytorch_model.bin`	Model weights in legacy `.bin` format
`config.json`	GPT2Config — model architecture definition
`generation_config.json`	Default generation settings
`tokenizer.json`	Fast tokenizer file
`tokenizer_config.json`	Tokenizer configuration
`checkpoints/model.ckpt`	Original PyTorch Lightning training checkpoint

Limitations

Small training subset: Only 10,000 of the available ~52,000 Alpaca samples were used. A full dataset run would likely yield noticeably better results.
GPT-2 base: GPT-2 Medium, while a solid model, is much smaller than modern instruction-tuned LLMs. Responses can be inconsistent or drift from the prompt on complex tasks.
No RLHF: The model is instruction-tuned via supervised fine-tuning only, without reinforcement learning from human feedback. It may produce responses that are fluent but factually wrong.
Context length: Hard-limited to 1,024 tokens, shared between the prompt and the generated response. Longer prompts must be truncated before generation.
No safety alignment: There is no safety filtering or RLHF alignment. Do not deploy in production without additional safety measures.
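
Because of the context-length limit, it can help to cap `max_new_tokens` based on the prompt length before calling `generate`. A minimal sketch, with a hypothetical helper name:

```python
CONTEXT_LENGTH = 1024  # GPT-2 Medium context window


def available_new_tokens(prompt_len, requested, context_length=CONTEXT_LENGTH):
    """Cap the requested generation length so prompt + output fits the window."""
    if prompt_len >= context_length:
        raise ValueError(
            f"Prompt has {prompt_len} tokens; it must be shorter than {context_length}."
        )
    return min(requested, context_length - prompt_len)


print(available_new_tokens(prompt_len=60, requested=200))   # 200
print(available_new_tokens(prompt_len=900, requested=200))  # 124
```

With the inference example from the Usage section, `prompt_len` would be `inputs["input_ids"].shape[1]`.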

Training Pipeline Summary

yahma/alpaca-cleaned (52K rows)
        ↓ load 10K rows
Alpaca prompt formatting
        ↓
tiktoken BPE tokenization
        ↓ -100 masking on prompt tokens
Custom PyTorch Dataset + DataLoader (dynamic padding)
        ↓
GPT-2 Medium pretrained weights loaded from openai-community/gpt2-medium
        ↓
PyTorch Lightning fine-tuning
  - AdamW, lr=3e-5, 2 epochs
  - FP16 mixed precision
  - Gradient accumulation (eff. batch = 8)
  - Checkpoint on best val_loss_eval
        ↓
Lightning prefix stripped → raw GPTModel state dict
        ↓
Custom → HF format conversion (QKV fusing, key renaming)
        ↓
Saved as model.safetensors + pytorch_model.bin
        ↓
Pushed to snehangshu511/gpt2-medium-instruct
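
The "Lightning prefix stripped" step could look like the sketch below. Lightning checkpoints keep the weights under the `"state_dict"` key, with each parameter name prefixed by the attribute name that holds the network inside the LightningModule. The prefix `"model."` and the key names used here are assumptions for illustration only.

```python
def strip_prefix(state_dict, prefix="model."):
    """Remove a leading prefix from every key that has it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy example with placeholder values instead of tensors.
lightning_state = {
    "model.tok_emb.weight": "tensor-0",
    "model.trf_blocks.0.att.W_query.weight": "tensor-1",
}
raw_state = strip_prefix(lightning_state)
print(list(raw_state))
```

With a real checkpoint, the input would come from something like `torch.load("checkpoints/model.ckpt", map_location="cpu")["state_dict"]`.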

Citation

If you use this model, please also cite the resources it was built from:

@book{raschka2024llms,
  title     = {Build a Large Language Model (From Scratch)},
  author    = {Sebastian Raschka},
  year      = {2024},
  publisher = {Manning Publications}
}

@misc{alpaca,
  title  = {Stanford Alpaca: An Instruction-following LLaMA model},
  author = {Taori et al.},
  year   = {2023},
  url    = {https://github.com/tatsu-lab/stanford_alpaca}
}

Author

Snehangshu Bhuin — Data Scientist
GitHub: snehangshu2002
Built as part of ongoing LLM learning and portfolio development.
