Misha0706/llm-alignment-ppo
This repository contains a PPO-aligned version of HuggingFaceTB/SmolLM-135M-Instruct trained as part of a coursework project on language model alignment. The goal was to compare PPO-based alignment against the original base model and a DPO-tuned variant on the same family of prompts.
Model Details
Model Description
This model is a causal language model initialized from HuggingFaceTB/SmolLM-135M-Instruct and further optimized with PPO using TRL. The PPO setup used a separately trained reward model based on the same preference dataset.
In the project evaluation, PPO produced relatively small behavioral changes compared to the base model. The model often remained close to the original checkpoint, with only minor wording differences on many prompts.
- Developed by: Mikhail Kalinkin
- Model type: Causal language model
- Language(s): English
- Finetuned from model: HuggingFaceTB/SmolLM-135M-Instruct
- Training method: PPO with TRL
- Reward model used during training: Misha0706/llm-alignment-reward-model
Model Sources
- Base model: HuggingFaceTB/SmolLM-135M-Instruct
- Training dataset: HumanLLMs/Human-Like-DPO-Dataset
- Reward model: Misha0706/llm-alignment-reward-model
Intended Use
Direct Use
This model is intended for:
- educational experiments with PPO-based alignment;
- comparison against the base model and the DPO checkpoint from the same project;
- studying RLHF-style fine-tuning pipelines on small language models;
- coursework demos and lightweight research experiments.
Out-of-Scope Use
This model is not intended for:
- production use;
- factual or safety-critical deployments;
- medical, legal, or financial advice;
- applications requiring robust alignment guarantees.
Bias, Risks, and Limitations
This model inherits limitations from the base model and from the small-scale PPO training setup used in the project.
Important limitations:
- small parameter count;
- limited training budget;
- no large-scale benchmark evaluation;
- alignment improvements are modest in this setup;
- responses may still be generic, repetitive, or inconsistent.
How to Get Started
Use the base tokenizer together with this model checkpoint:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Misha0706/llm-alignment-ppo"
tokenizer_id = "HuggingFaceTB/SmolLM-135M-Instruct"

# The checkpoint reuses the base model's tokenizer.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Build a chat-formatted prompt and generate deterministically.
messages = [{"role": "user", "content": "What's your morning routine like?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
    )

print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
Training Details
Training Data
The PPO pipeline used:
- HumanLLMs/Human-Like-DPO-Dataset for reward model training;
- a prompt-only version of the same dataset for PPO policy optimization.
The original dataset contains triples of:
- prompt,
- chosen response,
- rejected response.
For PPO training, prompts were used to generate responses, and the reward was provided by a separately trained reward model.
Preprocessing
For reward model training:
- the dataset was converted into implicit preference format;
- each example was represented as a short user-assistant conversation;
- the tokenizer chat template was applied before training.
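As an illustration, here is a minimal preprocessing sketch, assuming the dataset's `prompt`/`chosen`/`rejected` columns and the base-model `tokenizer` from the getting-started snippet (the project's exact script may differ):

```python
from datasets import load_dataset

ds = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

def to_preference(example):
    # Represent each side as a short user-assistant conversation and
    # render it with the tokenizer's chat template.
    def render(response):
        messages = [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": response},
        ]
        return tokenizer.apply_chat_template(messages, tokenize=False)

    return {"chosen": render(example["chosen"]), "rejected": render(example["rejected"])}

# Implicit preference format: plain-text "chosen" and "rejected" columns.
reward_ds = ds.map(to_preference, remove_columns=ds.column_names)
```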
For PPO policy training:
- only prompts were kept;
- prompts were converted to chat format with add_generation_prompt=True;
- tokenized prompts were passed to the PPO trainer, as sketched below.
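A sketch of the prompt-only conversion, under the same assumptions (recent TRL `PPOTrainer` versions consume pre-tokenized prompts under an `input_ids` column):

```python
def to_prompt(example):
    # Keep only the prompt, rendered as a chat turn awaiting a response.
    messages = [{"role": "user", "content": example["prompt"]}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return {"input_ids": tokenizer(text).input_ids}

prompt_ds = ds.map(to_prompt, remove_columns=ds.column_names)
```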
Training Procedure
The training pipeline consisted of:
- training a reward model with RewardTrainer;
- loading the reward model as both reward model and value model in the PPO pipeline;
- optimizing the policy with PPOTrainer against a frozen reference policy.
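A minimal sketch of how these components can be loaded (repo IDs from this card; variable names are illustrative):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

base_id = "HuggingFaceTB/SmolLM-135M-Instruct"
rm_id = "Misha0706/llm-alignment-reward-model"

tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLM.from_pretrained(base_id)
ref_policy = AutoModelForCausalLM.from_pretrained(base_id)  # frozen reference

# The same scalar-head classifier is loaded twice: once as the reward
# model and once as the initialization of the value model.
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_id, num_labels=1)
value_model = AutoModelForSequenceClassification.from_pretrained(rm_id, num_labels=1)
```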
Training Hyperparameters
Reward Model
- Base model: HuggingFaceTB/SmolLM-135M-Instruct
- Objective: sequence classification with one scalar reward
- Epochs: 1
- Batch size: 8
- Max length: 512
- Learning rate: 3e-4
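A sketch of how these settings map onto TRL's `RewardConfig`/`RewardTrainer` (argument names as in recent TRL releases; `reward_ds` and `tokenizer` come from the sketches above, and the output path is hypothetical):

```python
from transformers import AutoModelForSequenceClassification
from trl import RewardConfig, RewardTrainer

rm = AutoModelForSequenceClassification.from_pretrained(
    "HuggingFaceTB/SmolLM-135M-Instruct", num_labels=1  # one scalar reward
)
rm.config.pad_token_id = tokenizer.pad_token_id  # needed for padded batches

reward_config = RewardConfig(
    output_dir="reward-model",  # hypothetical path
    num_train_epochs=1,
    per_device_train_batch_size=8,
    max_length=512,
    learning_rate=3e-4,
)

reward_trainer = RewardTrainer(
    model=rm,
    args=reward_config,
    processing_class=tokenizer,
    train_dataset=reward_ds,
)
reward_trainer.train()
```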
PPO Policy
- Base policy: HuggingFaceTB/SmolLM-135M-Instruct
- Reference policy: frozen copy of the base model
- Reward model: Misha0706/llm-alignment-reward-model
- Epochs: 1
- Per-device batch size: 4
- Gradient accumulation steps: 4
- Learning rate: 1e-5
- Response length: 128
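And a corresponding sketch for the PPO stage (keyword names follow recent TRL versions and may differ in older releases; `prompt_ds` and the models come from the sketches above):

```python
from trl import PPOConfig, PPOTrainer

ppo_config = PPOConfig(
    output_dir="ppo-policy",  # hypothetical path
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    response_length=128,
)

ppo_trainer = PPOTrainer(
    args=ppo_config,
    processing_class=tokenizer,
    model=policy,
    ref_model=ref_policy,
    reward_model=reward_model,
    value_model=value_model,
    train_dataset=prompt_ds,
)
ppo_trainer.train()
```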
Evaluation
Evaluation Setup
The model was evaluated in three ways:
- qualitative comparison of generations from the base model and PPO model;
- shared comparison of base, DPO, and PPO on the same prompts;
- log-probability analysis on training vs. unseen data.
Metrics
The main quantitative metric reported in the project was average answer log-probability on:
- the training preference dataset;
- an unseen dataset: databricks/databricks-dolly-15k.
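The metric is straightforward to reproduce; below is a minimal sketch of per-answer mean token log-probability (`answer_logprob` is a hypothetical helper, not the project's exact evaluation script, and it assumes the prompt tokenization is a prefix of the full tokenization):

```python
import torch
import torch.nn.functional as F

def answer_logprob(model, tokenizer, prompt_text, answer_text):
    # Mean log-probability of the answer tokens given the prompt.
    prompt_ids = tokenizer(prompt_text, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt_text + answer_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t + 1, so shift logits and targets.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Average only over the answer tokens.
    n_prompt = prompt_ids.shape[1]
    return token_lp[0, n_prompt - 1:].mean().item()
```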
Results
Log-probability analysis
- BASE train mean: -1.8917
- BASE unseen mean: -2.6316
- PPO train mean: -1.8918
- PPO unseen mean: -2.6309
These results indicate that the PPO checkpoint remained very close to the base model in this evaluation setup.
Reward Model Check
The reward model used in the PPO pipeline preferred the chosen response over the rejected response in 5/5 inspected examples.
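A sketch of this check (the scoring convention assumes the reward model's scalar head and the same chat-template formatting used for reward-model training):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm = AutoModelForSequenceClassification.from_pretrained(
    "Misha0706/llm-alignment-reward-model", num_labels=1
)
rm_tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")

def reward(prompt, response):
    # Score a full user-assistant exchange with the reward model.
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    text = rm_tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = rm_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return rm(**inputs).logits[0, 0].item()

# For a preference triple, the check passes when
# reward(prompt, chosen) > reward(prompt, rejected).
```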
Qualitative Summary
Observed behavior in the project:
- PPO usually stayed close to the base model;
- some prompts showed mild wording changes;
- compared to DPO, PPO produced less noticeable stylistic shift in this setup.
Example Generations
Example 1
Prompt: What's your morning routine like?
Base model:
I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits, but I can provide you with a general idea of what a morning routine might look like...
PPO model:
I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits, but I can provide you with a general idea of what a morning routine might look like...
Example 2
Prompt: Can you tell me about a memorable birthday or celebration you've had?
Base model:
What a great question! As a digital AI assistant, I don't have personal experiences, but I can provide you with some insights and anecdotes about memorable birthday or celebration experiences...
PPO model:
What a great question! As a digital AI assistant, I don't have personal experiences, but I can provide you with some insights and anecdotes about memorable birthday or celebration experiences...
Example 3
Prompt: Do you have a favorite type of vacation or getaway?
Base model:
What a great question! As a digital AI assistant, I don't have personal preferences, but I can provide you with some popular types of vacations and getaways that are often enjoyed by many people...
PPO model:
What a great question! As a digital AI assistant, I don't have personal preferences, but I can provide you with some popular types of vacations and getaways that are often enjoyed by many people...
Technical Notes
Architecture
- Transformer-based causal language model
- Initialized from HuggingFaceTB/SmolLM-135M-Instruct
Objective
The model was optimized with PPO through the TRL training stack, using:
- a policy model;
- a frozen reference policy;
- a reward model;
- a value model.
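For reference, this stack optimizes the standard RLHF objective: maximize the reward model score while penalizing divergence from the frozen reference policy (the KL coefficient β is a TRL hyperparameter not reported in this card):

$$\max_{\theta}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\,\mathbb{E}_{x \sim \mathcal{D}}\big[\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\big]$$

The value model provides the baseline for PPO's advantage estimates during optimization.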
Limitations
This is a coursework model and should be treated as an experimental artifact.
Main limitations:
- very small model size;
- modest PPO effect in this setup;
- limited evaluation scope;
- no safety guarantees;
- not intended as a polished assistant model.
Citation
If you use this repository, please cite the original model and dataset:
- HuggingFaceTB/SmolLM-135M-Instruct
- HumanLLMs/Human-Like-DPO-Dataset
You may also reference this repository as a coursework PPO alignment checkpoint trained with TRL.