Misha0706/llm-alignment-ppo
This repository contains a PPO-aligned version of HuggingFaceTB/SmolLM-135M-Instruct trained as part of a coursework project on language model alignment. The goal was to compare PPO-based alignment against the original base model and a DPO-tuned variant on the same family of prompts.
Model Details
Model Description
This model is a causal language model initialized from HuggingFaceTB/SmolLM-135M-Instruct and further optimized with PPO using TRL. The PPO setup used a separately trained reward model based on the same preference dataset.
In the project evaluation, PPO produced relatively small behavioral changes compared to the base model. The model often remained close to the original checkpoint, with only minor wording differences on many prompts.
- Developed by: Mikhail Kalinkin
- Model type: Causal language model
- Language(s): English
- Finetuned from model: HuggingFaceTB/SmolLM-135M-Instruct
- Training method: PPO with TRL
- Reward model used during training: Misha0706/llm-alignment-reward-model
Model Sources
- Base model: HuggingFaceTB/SmolLM-135M-Instruct
- Training dataset: HumanLLMs/Human-Like-DPO-Dataset
- Reward model: Misha0706/llm-alignment-reward-model
Intended Use
Direct Use
This model is intended for:
- educational experiments with PPO-based alignment;
- comparison against the base model and the DPO checkpoint from the same project;
- studying RLHF-style fine-tuning pipelines on small language models;
- coursework demos and lightweight research experiments.
Out-of-Scope Use
This model is not intended for:
- production use;
- factual or safety-critical deployments;
- medical, legal, or financial advice;
- applications requiring robust alignment guarantees.
Bias, Risks, and Limitations
This model inherits limitations from the base model and from the small-scale PPO training setup used in the project.
Important limitations:
- small parameter count;
- limited training budget;
- no large-scale benchmark evaluation;
- alignment improvements are modest in this setup;
- responses may still be generic, repetitive, or inconsistent.
How to Get Started
Use the base tokenizer together with this model checkpoint:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Misha0706/llm-alignment-ppo"
tokenizer_id = "HuggingFaceTB/SmolLM-135M-Instruct"

# The checkpoint reuses the base model's tokenizer.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Build a chat-formatted prompt and generate deterministically.
messages = [{"role": "user", "content": "What's your morning routine like?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
    )

print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
Training Details
Training Data
The PPO pipeline used:
- HumanLLMs/Human-Like-DPO-Dataset for reward model training;
- a prompt-only version of the same dataset for PPO policy optimization.
The original dataset contains triples of:
- prompt,
- chosen response,
- rejected response.
For PPO training, prompts were used to generate responses, and the reward was provided by a separately trained reward model.
Preprocessing
For reward model training:
- the dataset was converted into implicit preference format;
- each example was represented as a short user-assistant conversation;
- the tokenizer chat template was applied before training.
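As an illustration, here is a minimal preprocessing sketch, assuming the dataset's `prompt`/`chosen`/`rejected` columns and the base-model `tokenizer` from the getting-started snippet (the project's exact script may differ):

```python
from datasets import load_dataset

ds = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

def to_preference(example):
    # Represent each side as a short user-assistant conversation and
    # render it with the tokenizer's chat template.
    def render(response):
        messages = [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": response},
        ]
        return tokenizer.apply_chat_template(messages, tokenize=False)

    return {"chosen": render(example["chosen"]), "rejected": render(example["rejected"])}

# Implicit preference format: plain-text "chosen" and "rejected" columns.
reward_ds = ds.map(to_preference, remove_columns=ds.column_names)
```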
For PPO policy training:
- only prompts were kept;
- prompts were converted to chat format with add_generation_prompt=True;
- tokenized prompts were passed to the PPO trainer, as sketched below.
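A sketch of the prompt-only conversion, under the same assumptions (recent TRL `PPOTrainer` versions consume pre-tokenized prompts under an `input_ids` column):

```python
def to_prompt(example):
    # Keep only the prompt, rendered as a chat turn awaiting a response.
    messages = [{"role": "user", "content": example["prompt"]}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return {"input_ids": tokenizer(text).input_ids}

prompt_ds = ds.map(to_prompt, remove_columns=ds.column_names)
```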
Training Procedure
The training pipeline consisted of:
- training a reward model with RewardTrainer;
- loading the reward model as both reward model and value model in the PPO pipeline;
- optimizing the policy with PPOTrainer against a frozen reference policy.
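A minimal sketch of how these components can be loaded (repo IDs from this card; variable names are illustrative):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

base_id = "HuggingFaceTB/SmolLM-135M-Instruct"
rm_id = "Misha0706/llm-alignment-reward-model"

tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLM.from_pretrained(base_id)
ref_policy = AutoModelForCausalLM.from_pretrained(base_id)  # frozen reference

# The same scalar-head classifier is loaded twice: once as the reward
# model and once as the initialization of the value model.
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_id, num_labels=1)
value_model = AutoModelForSequenceClassification.from_pretrained(rm_id, num_labels=1)
```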
Training Hyperparameters
Reward Model
- Base model: HuggingFaceTB/SmolLM-135M-Instruct
- Objective: sequence classification with one scalar reward
- Epochs: 1
- Batch size: 8
- Max length: 512
- Learning rate: 3e-4
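A sketch of how these settings map onto TRL's `RewardConfig`/`RewardTrainer` (argument names as in recent TRL releases; `reward_ds` and `tokenizer` come from the sketches above, and the output path is hypothetical):

```python
from transformers import AutoModelForSequenceClassification
from trl import RewardConfig, RewardTrainer

rm = AutoModelForSequenceClassification.from_pretrained(
    "HuggingFaceTB/SmolLM-135M-Instruct", num_labels=1  # one scalar reward
)
rm.config.pad_token_id = tokenizer.pad_token_id  # needed for padded batches

reward_config = RewardConfig(
    output_dir="reward-model",  # hypothetical path
    num_train_epochs=1,
    per_device_train_batch_size=8,
    max_length=512,
    learning_rate=3e-4,
)

reward_trainer = RewardTrainer(
    model=rm,
    args=reward_config,
    processing_class=tokenizer,
    train_dataset=reward_ds,
)
reward_trainer.train()
```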
PPO Policy
- Base policy: HuggingFaceTB/SmolLM-135M-Instruct
- Reference policy: frozen copy of the base model
- Reward model: Misha0706/llm-alignment-reward-model
- Epochs: 1
- Per-device batch size: 4
- Gradient accumulation steps: 4
- Learning rate: 1e-5
- Response length: 128
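And a corresponding sketch for the PPO stage (keyword names follow recent TRL versions and may differ in older releases; `prompt_ds` and the models come from the sketches above):

```python
from trl import PPOConfig, PPOTrainer

ppo_config = PPOConfig(
    output_dir="ppo-policy",  # hypothetical path
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    response_length=128,
)

ppo_trainer = PPOTrainer(
    args=ppo_config,
    processing_class=tokenizer,
    model=policy,
    ref_model=ref_policy,
    reward_model=reward_model,
    value_model=value_model,
    train_dataset=prompt_ds,
)
ppo_trainer.train()
```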
Evaluation
Evaluation Setup
The model was evaluated in three ways:
- qualitative comparison of generations from the base model and PPO model;
- shared comparison of base, DPO, and PPO on the same prompts;
- log-probability analysis on training vs. unseen data.
Metrics
The main quantitative metric reported in the project was average answer log-probability on:
- the training preference dataset;
- an unseen dataset: databricks/databricks-dolly-15k.
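The metric is straightforward to reproduce; below is a minimal sketch of per-answer mean token log-probability (`answer_logprob` is a hypothetical helper, not the project's exact evaluation script, and it assumes the prompt tokenization is a prefix of the full tokenization):

```python
import torch
import torch.nn.functional as F

def answer_logprob(model, tokenizer, prompt_text, answer_text):
    # Mean log-probability of the answer tokens given the prompt.
    prompt_ids = tokenizer(prompt_text, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt_text + answer_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t + 1, so shift logits and targets.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Average only over the answer tokens.
    n_prompt = prompt_ids.shape[1]
    return token_lp[0, n_prompt - 1:].mean().item()
```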
Results
Log-probability analysis
- BASE train mean: -1.8917
- BASE unseen mean: -2.6316
- PPO train mean: -1.8918
- PPO unseen mean: -2.6309
These results indicate that the PPO checkpoint remained very close to the base model in this evaluation setup.
Reward Model Check
The reward model used in the PPO pipeline preferred the chosen response over the rejected response in 5/5 inspected examples.
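A sketch of this check (the scoring convention assumes the reward model's scalar head and the same chat-template formatting used for reward-model training):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm = AutoModelForSequenceClassification.from_pretrained(
    "Misha0706/llm-alignment-reward-model", num_labels=1
)
rm_tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")

def reward(prompt, response):
    # Score a full user-assistant exchange with the reward model.
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    text = rm_tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = rm_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return rm(**inputs).logits[0, 0].item()

# For a preference triple, the check passes when
# reward(prompt, chosen) > reward(prompt, rejected).
```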
Qualitative Summary
Observed behavior in the project:
- PPO usually stayed close to the base model;
- some prompts showed mild wording changes;
- compared to DPO, PPO produced less noticeable stylistic shift in this setup.
Example Generations
Example 1
Prompt: What's your morning routine like?
Base model:
I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits, but I can provide you with a general idea of what a morning routine might look like...
PPO model:
I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits, but I can provide you with a general idea of what a morning routine might look like...
Example 2
Prompt: Can you tell me about a memorable birthday or celebration you've had?
Base model:
What a great question! As a digital AI assistant, I don't have personal experiences, but I can provide you with some insights and anecdotes about memorable birthday or celebration experiences...
PPO model:
What a great question! As a digital AI assistant, I don't have personal experiences, but I can provide you with some insights and anecdotes about memorable birthday or celebration experiences...
Example 3
Prompt: Do you have a favorite type of vacation or getaway?
Base model:
What a great question! As a digital AI assistant, I don't have personal preferences, but I can provide you with some popular types of vacations and getaways that are often enjoyed by many people...
PPO model:
What a great question! As a digital AI assistant, I don't have personal preferences, but I can provide you with some popular types of vacations and getaways that are often enjoyed by many people...
Technical Notes
Architecture
- Transformer-based causal language model
- Initialized from HuggingFaceTB/SmolLM-135M-Instruct
Objective
The model was optimized with PPO through the TRL training stack, using:
- a policy model;
- a frozen reference policy;
- a reward model;
- a value model.
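For reference, this stack optimizes the standard RLHF objective: maximize the reward model score while penalizing divergence from the frozen reference policy (the KL coefficient β is a TRL hyperparameter not reported in this card):

$$\max_{\theta}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\,\mathbb{E}_{x \sim \mathcal{D}}\big[\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\big]$$

The value model provides the baseline for PPO's advantage estimates during optimization.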
Limitations
This is a coursework model and should be treated as an experimental artifact.
Main limitations:
- very small model size;
- modest PPO effect in this setup;
- limited evaluation scope;
- no safety guarantees;
- not intended as a polished assistant model.
Citation
If you use this repository, please cite the original model and dataset:
- HuggingFaceTB/SmolLM-135M-Instruct
- HumanLLMs/Human-Like-DPO-Dataset
You may also reference this repository as a coursework PPO alignment checkpoint trained with TRL.