Misha0706/llm-alignment-ppo

This repository contains a PPO-aligned version of HuggingFaceTB/SmolLM-135M-Instruct trained as part of a coursework project on language model alignment. The goal was to compare PPO-based alignment against the original base model and a DPO-tuned variant on the same family of prompts.

Model Details

Model Description

This model is a causal language model initialized from HuggingFaceTB/SmolLM-135M-Instruct and further optimized with PPO using TRL. The PPO setup used a separately trained reward model based on the same preference dataset.

In the project evaluation, PPO produced relatively small behavioral changes compared to the base model. The model often remained close to the original checkpoint, with only minor wording differences on many prompts.

  • Developed by: Mikhail Kalinkin
  • Model type: Causal language model
  • Language(s): English
  • Finetuned from model: HuggingFaceTB/SmolLM-135M-Instruct
  • Training method: PPO with TRL
  • Reward model used during training: Misha0706/llm-alignment-reward-model

Model Sources

  • Base model: HuggingFaceTB/SmolLM-135M-Instruct
  • Training dataset: HumanLLMs/Human-Like-DPO-Dataset
  • Reward model: Misha0706/llm-alignment-reward-model

Intended Use

Direct Use

This model is intended for:

  • educational experiments with PPO-based alignment;
  • comparison against the base model and the DPO checkpoint from the same project;
  • studying RLHF-style fine-tuning pipelines on small language models;
  • coursework demos and lightweight research experiments.

Out-of-Scope Use

This model is not intended for:

  • production use;
  • factual or safety-critical deployments;
  • medical, legal, or financial advice;
  • applications requiring robust alignment guarantees.

Bias, Risks, and Limitations

This model inherits limitations from the base model and from the small-scale PPO training setup used in the project.

Important limitations:

  • small parameter count;
  • limited training budget;
  • no large-scale benchmark evaluation;
  • alignment improvements are modest in this setup;
  • responses may still be generic, repetitive, or inconsistent.

How to Get Started

Use the base tokenizer together with this model checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Misha0706/llm-alignment-ppo"
tokenizer_id = "HuggingFaceTB/SmolLM-135M-Instruct"

# The PPO checkpoint reuses the base model's tokenizer.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Build a chat-formatted prompt and generate deterministically (greedy decoding).
messages = [{"role": "user", "content": "What's your morning routine like?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
    )

print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

Training Details

Training Data

The PPO pipeline used:

  • HumanLLMs/Human-Like-DPO-Dataset for reward model training;
  • a prompt-only version of the same dataset for PPO policy optimization.

The original dataset contains triples of:

  • prompt,
  • chosen response,
  • rejected response.

For PPO training, prompts were used to generate responses, and the reward was provided by a separately trained reward model.

Preprocessing

For reward model training:

  • the dataset was converted into implicit preference format;
  • each example was represented as a short user-assistant conversation;
  • the tokenizer chat template was applied before training.
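The conversion described above can be sketched as a small helper. This is an illustrative sketch, not the project's actual preprocessing code; the field names `prompt`, `chosen`, and `rejected` are assumptions about the dataset schema.

```python
def to_preference_example(row):
    """Convert one preference triple into parallel 'chosen' and 'rejected'
    conversations sharing the same user prompt, i.e. the implicit-prompt
    preference format commonly fed to a reward trainer.

    Field names ('prompt', 'chosen', 'rejected') are assumed, not verified
    against the actual dataset schema.
    """
    return {
        "chosen": [
            {"role": "user", "content": row["prompt"]},
            {"role": "assistant", "content": row["chosen"]},
        ],
        "rejected": [
            {"role": "user", "content": row["prompt"]},
            {"role": "assistant", "content": row["rejected"]},
        ],
    }

example = to_preference_example(
    {"prompt": "Hi!", "chosen": "Hello there!", "rejected": "Not great."}
)
```

The tokenizer's chat template is then applied to both conversations before reward model training.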

For PPO policy training:

  • only prompts were kept;
  • prompts were converted to chat format with add_generation_prompt=True;
  • tokenized prompts were passed to the PPO trainer.
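The prompt-only conversion can be sketched as follows; again, this is an illustrative helper (the field name `prompt` is assumed), with the chat template applied afterwards via `tokenizer.apply_chat_template(..., add_generation_prompt=True)`.

```python
def to_ppo_prompt(row):
    """Keep only the prompt from a preference triple and wrap it as a
    single-turn chat. The tokenizer's chat template is applied later with
    add_generation_prompt=True so the model generates a fresh response.
    """
    return {"messages": [{"role": "user", "content": row["prompt"]}]}

row = {
    "prompt": "What's your morning routine like?",
    "chosen": "unused during PPO",
    "rejected": "unused during PPO",
}
ppo_row = to_ppo_prompt(row)
```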

Training Procedure

The training pipeline consisted of:

  1. training a reward model with RewardTrainer;
  2. loading the reward model as both reward model and value model in the PPO pipeline;
  3. optimizing the policy with PPOTrainer against a frozen reference policy.
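The three steps above can be wired together with TRL roughly as follows. TRL's trainer signatures have changed across versions, so treat this as a hedged sketch rather than a drop-in script; `preference_dataset` and `prompt_dataset` are placeholders for the preprocessed datasets described earlier.

```python
# Sketch of the two-stage pipeline; exact TRL class signatures vary by version.
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import PPOConfig, PPOTrainer, RewardConfig, RewardTrainer

base_id = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)

# 1. Train a scalar reward model on chosen/rejected pairs.
reward_model = AutoModelForSequenceClassification.from_pretrained(base_id, num_labels=1)
reward_trainer = RewardTrainer(
    model=reward_model,
    args=RewardConfig(output_dir="reward-model"),   # placeholder path
    train_dataset=preference_dataset,               # assumed prepared beforehand
    processing_class=tokenizer,
)
reward_trainer.train()

# 2.-3. Reuse the reward model as both reward and value model, and optimize
# the policy against a frozen copy of the base model.
policy = AutoModelForCausalLM.from_pretrained(base_id)
ref_policy = AutoModelForCausalLM.from_pretrained(base_id)
ppo_trainer = PPOTrainer(
    args=PPOConfig(output_dir="ppo-policy"),        # placeholder path
    model=policy,
    ref_model=ref_policy,
    reward_model=reward_model,
    value_model=reward_model,
    train_dataset=prompt_dataset,                   # assumed prepared beforehand
    processing_class=tokenizer,
)
ppo_trainer.train()
```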

Training Hyperparameters

Reward Model

  • Base model: HuggingFaceTB/SmolLM-135M-Instruct
  • Objective: sequence classification with one scalar reward
  • Epochs: 1
  • Batch size: 8
  • Max length: 512
  • Learning rate: 3e-4

PPO Policy

  • Base policy: HuggingFaceTB/SmolLM-135M-Instruct
  • Reference policy: frozen copy of the base model
  • Reward model: Misha0706/llm-alignment-reward-model
  • Epochs: 1
  • Per-device batch size: 4
  • Gradient accumulation steps: 4
  • Learning rate: 1e-5
  • Response length: 128
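The listed hyperparameters map onto TRL config objects roughly as below. This is a sketch assuming a recent TRL release; argument names may differ in other versions, and the output directories are placeholders.

```python
from trl import PPOConfig, RewardConfig

# Reward model hyperparameters as listed above.
reward_args = RewardConfig(
    output_dir="reward-model",        # placeholder path
    num_train_epochs=1,
    per_device_train_batch_size=8,
    max_length=512,
    learning_rate=3e-4,
)

# PPO policy hyperparameters as listed above.
ppo_args = PPOConfig(
    output_dir="ppo-policy",          # placeholder path
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    response_length=128,
)
```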

Evaluation

Evaluation Setup

The model was evaluated in three ways:

  1. qualitative comparison of generations from the base model and PPO model;
  2. shared comparison of base, DPO, and PPO on the same prompts;
  3. log-probability analysis on the training data versus unseen data.

Metrics

The main quantitative metric reported in the project was average answer log-probability on:

  • the training preference dataset;
  • an unseen dataset: databricks/databricks-dolly-15k.
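The average answer log-probability can be computed from model logits along these lines. This is a minimal sketch of the metric, not the project's evaluation code; the helper name and mask convention are assumptions.

```python
import torch
import torch.nn.functional as F

def mean_answer_logprob(logits, labels, answer_mask):
    """Average per-token log-probability of the answer tokens.

    logits:      (batch, seq_len, vocab) model outputs
    labels:      (batch, seq_len) token ids
    answer_mask: (batch, seq_len) 1 where the token belongs to the answer

    Uses the standard causal shift: the logits at position t predict
    the token at position t+1.
    """
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    token_logp = logp.gather(-1, labels[:, 1:].unsqueeze(-1)).squeeze(-1)
    mask = answer_mask[:, 1:].float()
    return (token_logp * mask).sum() / mask.sum()

# Toy usage with random logits (illustration only).
torch.manual_seed(0)
logits = torch.randn(1, 6, 10)
labels = torch.randint(0, 10, (1, 6))
mask = torch.tensor([[0, 0, 0, 1, 1, 1]])  # last three tokens form the "answer"
score = mean_answer_logprob(logits, labels, mask)
```

Averaging per token makes the metric comparable across answers of different lengths, which matters when comparing the training set against databricks/databricks-dolly-15k.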

Results

Log-probability analysis

  • BASE train mean: -1.8917
  • BASE unseen mean: -2.6316
  • PPO train mean: -1.8918
  • PPO unseen mean: -2.6309

These results indicate that the PPO checkpoint remained very close to the base model in this evaluation setup.

Reward Model Check

The reward model used in the PPO pipeline preferred the chosen response over the rejected response in 5/5 inspected examples.
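A check like this amounts to measuring pairwise preference accuracy, which can be sketched as below. The scores here are made up for illustration; they are not the project's actual reward values (5/5 corresponds to an accuracy of 1.0).

```python
import torch

def preference_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the reward model scores the chosen
    response strictly above the rejected one."""
    chosen = torch.as_tensor(chosen_scores, dtype=torch.float)
    rejected = torch.as_tensor(rejected_scores, dtype=torch.float)
    return (chosen > rejected).float().mean().item()

# Illustrative scores (hypothetical, not the project's actual reward outputs):
acc = preference_accuracy(
    [1.2, 0.8, 2.1, 0.3, 1.7],
    [0.4, -0.2, 1.5, 0.1, 0.9],
)
```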

Qualitative Summary

Observed behavior in the project:

  • PPO usually stayed close to the base model;
  • some prompts showed mild wording changes;
  • compared to DPO, PPO produced less noticeable stylistic shift in this setup.

Example Generations

Example 1

Prompt: What's your morning routine like?

Base model:

I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits, but I can provide you with a general idea of what a morning routine might look like...

PPO model:

I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits, but I can provide you with a general idea of what a morning routine might look like...

Example 2

Prompt: Can you tell me about a memorable birthday or celebration you've had?

Base model:

What a great question! As a digital AI assistant, I don't have personal experiences, but I can provide you with some insights and anecdotes about memorable birthday or celebration experiences...

PPO model:

What a great question! As a digital AI assistant, I don't have personal experiences, but I can provide you with some insights and anecdotes about memorable birthday or celebration experiences...

Example 3

Prompt: Do you have a favorite type of vacation or getaway?

Base model:

What a great question! As a digital AI assistant, I don't have personal preferences, but I can provide you with some popular types of vacations and getaways that are often enjoyed by many people...

PPO model:

What a great question! As a digital AI assistant, I don't have personal preferences, but I can provide you with some popular types of vacations and getaways that are often enjoyed by many people...

Technical Notes

Architecture

  • Transformer-based causal language model
  • Initialized from HuggingFaceTB/SmolLM-135M-Instruct

Objective

The model was optimized with PPO using:

  • a policy model,
  • a frozen reference policy,
  • a reward model,
  • a value model,

through the TRL training stack.

Limitations

This is a coursework model and should be treated as an experimental artifact.

Main limitations:

  • very small model size;
  • modest PPO effect in this setup;
  • limited evaluation scope;
  • no safety guarantees;
  • not intended as a polished assistant model.

Citation

If you use this repository, please cite the original model and dataset:

  • HuggingFaceTB/SmolLM-135M-Instruct
  • HumanLLMs/Human-Like-DPO-Dataset

You may also reference this repository as a coursework PPO alignment checkpoint trained with TRL.
