The Complete Guide to Post-Training of Large Language Models
From Pretraining to Alignment: Everything You Need to Know
Who is this for? You've learned how pretraining works – you understand GPT-2, transformer architectures, next-token prediction, and the cross-entropy loss. Now you want to understand what happens after pretraining: how raw language models become helpful assistants like ChatGPT, Claude, and Gemini. This guide takes you from zero knowledge of post-training to a deep understanding of every major method, with pointers to the key papers, tools, and code.
Table of Contents
- The Big Picture: Why Post-Training Exists
- Supervised Fine-Tuning (SFT): Teaching Models to Follow Instructions
- Reinforcement Learning from Human Feedback (RLHF): The Breakthrough
- Direct Preference Optimization (DPO): RLHF Without RL
- The Preference Optimization Zoo: KTO, ORPO, SimPO, CPO, and More
- GRPO and the Reasoning Revolution: DeepSeek-R1 and Beyond
- Parameter-Efficient Fine-Tuning: LoRA, QLoRA, and Adapters
- The Toolbox: Libraries, Frameworks, and Infrastructure
- Datasets: What to Train On
- Evaluation: How to Know If It Worked
- Putting It All Together: A Complete Post-Training Recipe
- The Reading List: Papers Every Practitioner Should Read
Chapter 1: The Big Picture – Why Post-Training Exists
1.1 The Gap Between Pretraining and Usefulness
You've pretrained a language model. It can predict the next token with impressive accuracy. It has absorbed vast knowledge from the internet. But try asking it a question:
User: What is the capital of France?
Model: What is the capital of Germany? What is the capital of Italy? What is the...
The model doesn't answer – it continues. That's because the pretraining objective (P(next_token | context)) optimizes for predicting what comes next in web text, not for being helpful. Web documents contain questions followed by more questions, not questions followed by helpful answers.
This is the alignment problem in its simplest form. As the InstructGPT paper (Ouyang et al., 2022) put it:
"Large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users."
1.2 The Three Stages of Post-Training
Post-training is everything that happens after pretraining to make a model useful, safe, and aligned with human intent. The modern post-training pipeline, established by OpenAI's InstructGPT (2022), has three stages:
+---------------+      +--------------------+      +---------------------------+
| STAGE 1: SFT  | ---> | STAGE 2: Reward    | ---> | STAGE 3: RL               |
|               |      | Model Training     |      | (PPO / DPO / GRPO)        |
| Teach format  |      | Learn preferences  |      | Optimize for preferences  |
| & behavior    |      | from comparisons   |      | while staying close to    |
|               |      |                    |      | the SFT model             |
+---------------+      +--------------------+      +---------------------------+
Input: Pretrained LM                               Output: Aligned Assistant
Stage 1 – Supervised Fine-Tuning (SFT): Teach the model the format of helpful responses using human-written demonstrations. Input: instructions. Output: high-quality responses.
Stage 2 – Reward Modeling: Train a separate model to predict which of two responses a human would prefer. This "reward model" captures human preferences as a scalar score.
Stage 3 – Reinforcement Learning: Use the reward model to further improve the SFT model. The model generates responses, gets scored by the reward model, and updates its parameters to produce higher-scoring responses.
Key insight from LIMA (Zhou et al., 2023): "A model's knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users." This is called the Superficial Alignment Hypothesis – post-training doesn't teach new knowledge, it teaches the model to surface existing knowledge in the right way.
1.3 The Evolution: From RLHF to Modern Methods
The field has evolved rapidly:
| Year | Method | Key Idea | Paper |
|---|---|---|---|
| 2017 | RLHF (original) | Use human preferences to train reward model, optimize with RL | Christiano et al. |
| 2020 | RLHF for LLMs | Apply RLHF to text summarization | Stiennon et al. |
| 2022 | InstructGPT | Full SFT → RM → PPO pipeline for general LLMs | Ouyang et al. |
| 2022 | Constitutional AI | Use AI feedback instead of human feedback (RLAIF) | Bai et al. (Anthropic) |
| 2023 | DPO | Eliminate reward model entirely – train directly on preferences | Rafailov et al. |
| 2024 | KTO | Train on binary feedback (good/bad) instead of pairwise | Ethayarajh et al. |
| 2024 | ORPO | Combine SFT and preference optimization in one step | Hong et al. |
| 2024 | GRPO | Group-based RL for mathematical reasoning (DeepSeek) | Shao et al. |
| 2025 | DeepSeek-R1 | GRPO to teach models to "think" (chain-of-thought via RL) | DeepSeek-AI |
Chapter 2: Supervised Fine-Tuning (SFT) – Teaching Models to Follow Instructions
2.1 What SFT Does
SFT is the bridge between a pretrained language model and a useful assistant. It takes a model that predicts web text and teaches it to respond helpfully to instructions.
Before SFT:
Input: "Explain quantum computing in simple terms."
Output: "Explain quantum computing to a 5-year-old. Explain quantum computing..."
After SFT:
Input: "Explain quantum computing in simple terms."
Output: "Quantum computing uses the principles of quantum mechanics to process
information. Unlike classical computers that use bits (0 or 1),
quantum computers use qubits that can be both 0 and 1 simultaneously..."
2.2 The SFT Loss Function
If you understand the pretraining loss, you already understand SFT – with one crucial difference.
Pretraining loss (next-token prediction on everything):
L_pretrain = -Σ log P(token_i | token_1, ..., token_{i-1})
for ALL tokens in the sequence
SFT loss (next-token prediction on the response only):
L_SFT = -Σ log P(c_i | prompt_tokens, c_1, ..., c_{i-1})
for ONLY the completion/response tokens
The prompt tokens are fed into the model but masked from the loss computation. This is important: we don't want the model to learn to generate instructions – we want it to learn to respond to them.
Sequence: [User: What is 2+2?] [Assistant: 4]
Loss mask: [ ----IGNORED---- ] [COMPUTED HERE ]
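In code, this masking is usually implemented by setting prompt positions in the label sequence to -100, the ignore-index that PyTorch and Hugging Face cross-entropy losses skip. A minimal sketch with made-up token ids:

```python
# Sketch of SFT label masking. Token ids are made up; -100 is the
# ignore-index used by PyTorch/Hugging Face cross-entropy losses.
def build_labels(input_ids, prompt_len):
    """Copy input_ids but mask the first prompt_len positions from the loss."""
    return [-100] * prompt_len + input_ids[prompt_len:]

# 4 prompt tokens followed by 2 response tokens:
labels = build_labels([101, 2054, 2003, 1029, 2475, 102], prompt_len=4)
# Only the last two positions contribute to L_SFT.
```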
2.3 Data Formats for SFT
Modern SFT uses chat-formatted data – structured conversations with roles:
# The standard format: a list of messages with roles
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."}
]
}
This gets converted to a chat template – a specific text format that the model learns to recognize:
# ChatML format (used by many models):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>
# Llama-3 format:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
The capital of France is Paris.<|eot_id|>
Each model family has its own template. The transformers library handles this automatically:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"}
]
# For training (complete conversation):
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
# For inference (prompt the model to start generating):
text = tokenizer.apply_chat_template(messages[:1], tokenize=False, add_generation_prompt=True)
2.4 The Key SFT Papers
FLAN (2021) – Instruction Tuning at Scale
Paper: "Finetuned Language Models Are Zero-Shot Learners" (Wei et al., 2021) – arXiv:2109.01652
FLAN proved that fine-tuning on instructions dramatically improves zero-shot performance. They took 62 NLP datasets, formatted them as instructions, and fine-tuned LaMDA-PT 137B.
Key result: FLAN surpassed zero-shot GPT-3 175B on 20 out of 25 tasks.
Key insight: The instruction format itself is critical – fine-tuning on the same tasks without instructions gave much weaker results.
Recipe: Adafactor optimizer, lr=3e-5, 30K steps, batch size 8192 tokens, input length 1024, target length 256.
Self-Instruct (2022) – Bootstrapping Training Data
Paper: "Self-Instruct: Aligning Language Models with Self-Generated Instructions" (Wang et al., 2022) – arXiv:2212.10560
A breakthrough idea: use the language model itself to generate training data. Starting from 175 seed tasks, GPT-3 generated 52,445 instructions with responses.
Key result: +33% improvement over vanilla GPT-3 on SuperNaturalInstructions.
Key insight: The era of synthetic data for SFT began here. This directly led to Stanford Alpaca (fine-tuning LLaMA on 52K GPT-generated instructions for <$600).
InstructGPT (2022) – SFT as Stage 1
Paper: "Training Language Models to Follow Instructions with Human Feedback" (Ouyang et al., 2022) – arXiv:2203.02155
InstructGPT established SFT as the foundation of the alignment pipeline. Their SFT model was trained on ~13K human-written demonstrations.
Key details: 16 epochs, cosine LR decay, residual dropout 0.2. They found that SFT models overfit on validation loss after 1 epoch, but training more epochs improved the reward model score – so they selected checkpoints using the RM, not validation loss.
Key result: Even 1.3B InstructGPT was preferred over 175B GPT-3 by human evaluators.
LIMA (2023) – Less Is More
Paper: "LIMA: Less Is More for Alignment" (Zhou et al., 2023) – arXiv:2305.11206
The most provocative SFT paper: fine-tuning LLaMA-65B on just 1,000 carefully curated examples produced a model competitive with GPT-3.5 (DaVinci003) in human evaluations.
Key result: 1,000 high-quality examples beat 52,000 mediocre ones.
Recipe: AdamW, lr 1e-5 → 1e-6 linear decay, 15 epochs, batch size 32, max length 2048. Residual dropout linearly scaled from 0.0 (bottom layer) to 0.3 (top layer).
The takeaway: For SFT, data quality >> data quantity. A small number of consistently styled, high-quality demonstrations is better than a large, noisy dataset.
2.5 SFT in Practice with TRL
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Load a chat dataset (must have "messages" column)
dataset = load_dataset("trl-lib/Capybara", split="train")
config = SFTConfig(
output_dir="./sft-output",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-5,
max_seq_length=2048,
gradient_checkpointing=True, # Save memory
bf16=True, # Use bfloat16 precision
logging_steps=10,
push_to_hub=True,
hub_model_id="your-username/your-sft-model",
)
trainer = SFTTrainer(
model="Qwen/Qwen3-0.6B", # Base model
args=config,
train_dataset=dataset,
)
trainer.train()
The SFTTrainer automatically:
- Detects the `messages` column and applies the model's chat template
- Masks prompt tokens from the loss (trains only on assistant responses)
- Handles tokenization and padding
Chapter 3: Reinforcement Learning from Human Feedback (RLHF) – The Breakthrough
3.1 Why SFT Isn't Enough
SFT teaches format and basic behavior, but it has limitations:
- It only learns from demonstrations – it can only be as good as the training examples
- It can't express preferences – it treats all tokens in a response equally
- It can learn bad habits – if training data contains subtle errors, the model learns those too
RLHF addresses this by training the model based on which outputs are better, not on what specific tokens to generate.
3.2 The RLHF Pipeline (Step by Step)
Step 1: Train a Reward Model
A reward model (RM) takes a prompt and a response, and outputs a scalar score indicating how good the response is.
How it's trained:
- Generate multiple responses to the same prompt using the SFT model
- Have humans rank these responses (e.g., Response A > Response B)
- Train the RM to predict these rankings
The RM uses the Bradley-Terry model of preferences:
P(response_A is preferred over response_B) = σ(r(A) - r(B))
where σ is the sigmoid function and r(·) is the reward model's score. The loss function is:
L_RM = -E[log σ(r(x, y_chosen) - r(x, y_rejected))]
Architecture: The reward model is typically the same architecture as the language model, but with the output head replaced by a linear layer that projects to a single scalar value.
InstructGPT details: They trained a 6B reward model (not 175B – larger RMs had unstable training). The RM dataset contained 33K prompts with human rankings.
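To make the pairwise loss concrete, here is a toy numeric sketch (plain Python, with scalar rewards standing in for r(x, y); not actual RM training code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rm_pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log sigma(r(x, y_chosen) - r(x, y_rejected))."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Small loss when the RM already ranks the chosen response higher...
low = rm_pairwise_loss(2.0, -1.0)
# ...large loss when the ranking is inverted.
high = rm_pairwise_loss(-1.0, 2.0)
```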
Step 2: Optimize the Policy with PPO
Now we use the reward model to improve our language model (the "policy" in RL terminology).
The objective:
maximize E[RM(prompt, response)] - β · KL(π_θ || π_ref)
In plain English: generate responses that score high on the reward model, but don't deviate too far from the original SFT model.
The KL divergence penalty (β · KL(π_θ || π_ref)) is crucial – without it, the model quickly learns to exploit the reward model (generating gibberish that tricks the RM into giving high scores, a phenomenon called reward hacking).
PPO (Proximal Policy Optimization) is the RL algorithm used to optimize this objective. Here's the intuition:
- Generate: The current model generates responses to a batch of prompts
- Score: The reward model scores each response
- Compute advantage: Calculate how much better each response is compared to the expected value
- Update: Adjust model weights to make high-advantage responses more likely
- Clip: Prevent too-large updates (the "proximal" part) for stability
L_PPO = -E[min(r_t(θ) · A_t, clip(r_t(θ), 1-ε, 1+ε) · A_t)]
where r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t) is the probability ratio and A_t is the advantage.
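The clipping behavior can be sketched with scalars (a toy illustration of the maximized quantity, not TRL's tensor implementation):

```python
def ppo_term(ratio, advantage, eps=0.2):
    """The quantity PPO maximizes per token: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# A positive-advantage token with ratio 1.5 is capped at 1 + eps = 1.2,
# so the incentive to push the ratio even further vanishes.
capped = ppo_term(1.5, advantage=1.0)
```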
InstructGPT training details:
- PPO with β = 0.02 for KL penalty
- Mixed in 10% pretraining data during PPO to prevent regression on general capabilities
- Learning rates scanned from 2.55e-6 to 2.55e-5 (rates > 8.05e-6 diverged)
- 256K PPO episodes total
3.3 The Alignment Tax
RLHF improves alignment but can hurt performance on standard NLP benchmarks – this is the "alignment tax." InstructGPT mitigated this by mixing pretraining data into PPO training (the PPO-ptx variant).
3.4 Why RLHF is Hard
RLHF works, but it has significant practical challenges:
- Complexity: Four separate models needed (policy, reference policy, reward model, value model) – all in memory simultaneously
- Instability: PPO training is notoriously sensitive to hyperparameters
- Reward hacking: The model can learn to exploit the RM rather than genuinely improve
- Cost: Human preference data is expensive to collect
- Reproducibility: Small changes in setup can lead to very different outcomes
These challenges directly motivated the development of DPO.
3.5 Constitutional AI (RLAIF)
Paper: "Constitutional AI: Harmlessness from AI Feedback" (Bai et al., 2022) – arXiv:2212.08073
Anthropic's key insight: you can replace human feedback with AI feedback (RLAIF: RL from AI Feedback). Instead of humans ranking responses, an AI system evaluates responses against a set of principles (the "constitution").
This dramatically reduces the cost and enables scaling the feedback process.
Chapter 4: Direct Preference Optimization (DPO) – RLHF Without RL
4.1 The Key Insight
Paper: "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (Rafailov et al., 2023) – arXiv:2305.18290
DPO's central insight is beautiful in its simplicity: you don't need a separate reward model or RL training loop. The language model itself implicitly represents a reward model.
The authors showed that the optimal solution to the RLHF objective (maximize reward while staying close to the reference model) can be expressed in closed form:
π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp((1/β) · r(x,y))
Rearranging this to express the reward in terms of the policy:
r(x,y) = β · log(π_θ(y|x) / π_ref(y|x)) + β · log Z(x)
Since the Bradley-Terry preference model only depends on the difference in rewards between two responses, the partition function Z(x) cancels out! This gives us the DPO loss:
L_DPO = -E[log σ(β · log(π_θ(y_w|x)/π_ref(y_w|x)) - β · log(π_θ(y_l|x)/π_ref(y_l|x)))]
where y_w is the preferred ("winning") response and y_l is the rejected ("losing") response.
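The whole loss fits in a few lines. A scalar sketch (the sequence log-probs are made-up numbers; real implementations sum per-token log-probs over each response):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Before training, policy == reference, so the loss starts at log(2):
start = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Once the policy favors the chosen over the rejected response, it drops:
better = dpo_loss(-9.0, -12.0, -10.0, -10.0)
```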
4.2 Why DPO is a Big Deal
| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Models in memory | 4 (policy, reference, reward, value) | 2 (policy, reference) |
| Training loop | Complex RL loop with generation | Simple supervised training |
| Hyperparameters | Many (PPO-specific: clip, value coef, etc.) | Few (mainly β) |
| Stability | Often unstable | Very stable |
| Sampling during training | Required | Not required |
| Performance | Strong | Comparable or better |
4.3 Understanding the DPO Gradient
The gradient of the DPO loss has a beautiful interpretation:
∇L_DPO ∝ -β · [σ(r̂(x,y_l) - r̂(x,y_w))] · [∇log π(y_w|x) - ∇log π(y_l|x)]
In English:
- Increase the likelihood of the preferred response `y_w`
- Decrease the likelihood of the rejected response `y_l`
- Weight these updates by how "wrong" the model currently is (if the model already prefers `y_w`, the gradient is small)
The weighting term σ(r̂(x,y_l) - r̂(x,y_w)) is crucial – without it, the model degenerates. This was verified experimentally: a naive "increase chosen, decrease rejected" approach without the weighting fails.
4.4 DPO in Practice
Data format: DPO needs preference pairs β for each prompt, a "chosen" (preferred) and "rejected" response:
{
"prompt": [{"role": "user", "content": "Explain gravity"}],
"chosen": [{"role": "assistant", "content": "Gravity is a fundamental force..."}],
"rejected": [{"role": "assistant", "content": "Gravity is when things fall down."}]
}
The DPO recipe:
- Start with an SFT model (this becomes Ο_ref)
- Prepare preference dataset (prompt + chosen + rejected)
- Train with the DPO loss
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
config = DPOConfig(
output_dir="./dpo-output",
num_train_epochs=1,
per_device_train_batch_size=4,
learning_rate=5e-7, # DPO uses very low learning rates
beta=0.1, # KL penalty strength
logging_steps=10,
bf16=True,
gradient_checkpointing=True,
push_to_hub=True,
hub_model_id="your-username/your-dpo-model",
)
trainer = DPOTrainer(
model="your-sft-model", # The SFT model to improve
args=config,
train_dataset=dataset,
)
trainer.train()
4.5 DPO Hyperparameters
- β (beta): Controls the strength of the KL constraint. Higher β = stay closer to the reference model. Typical range: 0.01 to 0.5. Default in TRL: 0.1.
- Learning rate: Much lower than SFT β typically 1e-7 to 5e-6. DPO is sensitive to learning rate.
- Epochs: Usually 1-3. Overfitting is common with more epochs.
Chapter 5: The Preference Optimization Zoo
After DPO, researchers developed many variants addressing different limitations. Here's a guide to the most important ones.
5.1 IPO β Identity Preference Optimization
Paper: "A General Theoretical Paradigm to Understand Learning from Human Feedback" (Azar et al., 2023)
Problem with DPO: DPO can overfit to the preference data, especially when the Bradley-Terry assumption doesn't hold perfectly.
Solution: IPO adds a regularization term that prevents overfitting without assuming the Bradley-Terry model:
L_IPO = E[(log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x)) - 1/(2β))²]
When to use: When you suspect your preference data is noisy or when DPO is overfitting.
5.2 KTO β Kahneman-Tversky Optimization
Paper: "KTO: Model Alignment as Prospect Theoretic Optimization" (Ethayarajh et al., 2024) – arXiv:2402.01306
Problem with DPO: DPO requires paired preferences (chosen AND rejected for the same prompt). This is expensive to collect. In reality, it's much easier to get binary feedback: "this response is good" or "this response is bad."
Solution: KTO works with unpaired preferences – you only need individual responses labeled as good or bad, not pairs. It's based on Kahneman and Tversky's prospect theory from behavioral economics: humans feel losses more strongly than equivalent gains.
Data format:
{"prompt": "...", "completion": "...", "label": True} # Good response
{"prompt": "...", "completion": "...", "label": False} # Bad response
When to use: When you have thumbs-up/thumbs-down feedback but not pairwise comparisons.
5.3 ORPO β Odds Ratio Preference Optimization
Paper: "ORPO: Monolithic Preference Optimization without Reference Model" (Hong et al., 2024)
Problem with DPO: DPO still requires a separate SFT stage and a reference model.
Solution: ORPO combines SFT and preference optimization into a single training step. It adds a preference signal directly to the SFT loss using the odds ratio:
L_ORPO = L_SFT + λ · L_OR
where L_OR penalizes the model when the odds of generating the rejected response exceed those of the chosen response.
When to use: When you want a simpler pipeline without separate SFT and preference stages.
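The odds-ratio penalty can be sketched with scalars, using hypothetical length-normalized sequence probabilities p_chosen and p_rejected:

```python
import math

def odds(p):
    """Odds of generating a sequence whose (length-normalized) probability is p."""
    return p / (1.0 - p)

def orpo_or_term(p_chosen, p_rejected):
    """L_OR = -log sigma(log odds(chosen) - log odds(rejected))."""
    log_or = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))

# The penalty shrinks when the chosen response is favored over the rejected one:
favored = orpo_or_term(0.6, 0.2)
inverted = orpo_or_term(0.2, 0.6)
```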
5.4 SimPO β Simple Preference Optimization
Paper: "SimPO: Simple Preference Optimization with a Reference-Free Reward" (Meng et al., 2024)
Problem with DPO: DPO needs a reference model in memory, doubling GPU requirements.
Solution: SimPO eliminates the reference model entirely by using the average log probability of a sequence as the implicit reward (instead of the total log probability). This length-normalized reward naturally prevents the model from favoring longer responses.
When to use: When GPU memory is a constraint and you want to skip the reference model.
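The length normalization is the whole trick. A scalar sketch (the per-token log-probs are made up; beta here is SimPO's reward scale, not DPO's KL strength):

```python
def simpo_reward(token_logps, beta=2.0):
    """SimPO's implicit reward: beta times the average per-token log prob."""
    return beta * sum(token_logps) / len(token_logps)

# A response five times longer but with the same per-token quality gets the
# same reward, so length alone is not rewarded:
short = simpo_reward([-1.0, -1.0])
longer = simpo_reward([-1.0] * 10)
```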
5.5 CPO β Contrastive Preference Optimization
Simplifies DPO by removing the reference model and using a contrastive loss. Similar motivation to SimPO but with a different formulation.
5.6 Online DPO
Problem with standard DPO: DPO trains on a fixed, static preference dataset (offline). But the model changes during training, so the preferences collected from the old model become stale.
Solution: Online DPO generates new completions from the current model during training and gets them scored by a reward model. This keeps the training data fresh and on-policy.
5.7 Summary Table
| Method | Needs Reference Model? | Needs Paired Data? | Needs RM? | Separate SFT? | Key Advantage |
|---|---|---|---|---|---|
| PPO (RLHF) | Yes | No (uses RM) | Yes | Yes | Gold standard, online |
| DPO | Yes | Yes | No | Yes | Simple, stable |
| IPO | Yes | Yes | No | Yes | Robust to noise |
| KTO | Yes | No (binary) | No | Yes | Works with unpaired data |
| ORPO | No | Yes | No | No (combined) | Simplest pipeline |
| SimPO | No | Yes | No | Yes | Memory efficient |
| CPO | No | Yes | No | Yes | Memory efficient |
| Online DPO | Yes | Generated online | Yes | Yes | On-policy, fresh data |
| GRPO | Yes (soft) | No (uses rewards) | Yes (or functions) | Yes | Best for reasoning |
Chapter 6: GRPO and the Reasoning Revolution
6.1 What is GRPO?
Paper: "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (Shao et al., 2024) – arXiv:2402.03300
Group Relative Policy Optimization (GRPO) is a variant of PPO designed to be more memory-efficient and particularly effective for reasoning tasks (math, code, logic).
The key idea: Instead of training a separate value model (critic) as in PPO, GRPO estimates the "baseline" by generating multiple completions per prompt and using the group average reward as the baseline.
6.2 How GRPO Works
For each prompt:
1. Generate G completions (e.g., G=16)
2. Score each completion with a reward function
3. Compute the advantage for each completion:
Â_i = (r_i - mean(r)) / std(r)
4. Update the model to increase probability of high-advantage completions
and decrease probability of low-advantage completions
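The group-relative advantage computation is simple enough to sketch in plain Python (real implementations add a small epsilon to guard against zero std):

```python
def group_advantages(rewards):
    """Normalize each completion's reward against its group mean and std."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / std for r in rewards]  # assumes std > 0

# Four completions for one prompt, two correct (reward 1.0) and two wrong (0.0):
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
# Correct completions get advantage +1.0, wrong ones -1.0.
```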
The GRPO loss:
L_GRPO = -E[min(ratio · Â, clip(ratio, 1-ε, 1+ε) · Â)] + β · KL(π_θ || π_ref)
where ratio = π_θ(o_{i,t}) / π_old(o_{i,t}) is the importance sampling ratio.
Why "Group Relative"? The advantage is computed relative to the group of completions for the same prompt: a completion is "good" if it scores above the group average, and "bad" if below.
6.3 Why GRPO Matters: The DeepSeek-R1 Story
GRPO became famous when DeepSeek used it to train DeepSeek-R1 – a model that learned to "think" through chain-of-thought reasoning purely through RL, without being taught specific reasoning patterns.
Paper: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (DeepSeek-AI, 2025) – arXiv:2501.12948
The key discovery: with the right reward function (accuracy on math/coding problems) and GRPO training, the model spontaneously develops chain-of-thought reasoning, self-verification, and error correction – without being explicitly trained to do so.
This opened the "reasoning era" of LLM training, where RL-based methods are used to incentivize complex reasoning behaviors.
6.4 GRPO in Practice
GRPO requires:
- A prompt dataset (just prompts, no responses needed)
- A reward function (can be a model or a simple Python function)
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
import re
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
# Custom reward function: checks if the answer is correct
def accuracy_reward(completions, ground_truth, **kwargs):
matches = [re.search(r"\\boxed\{(.*?)\}", c) for c in completions]
contents = [m.group(1) if m else "" for m in matches]
return [1.0 if c == gt else 0.0 for c, gt in zip(contents, ground_truth)]
config = GRPOConfig(
output_dir="./grpo-output",
learning_rate=1e-6,
per_device_train_batch_size=4,
num_generations=16, # G: number of completions per prompt
max_completion_length=512,
logging_steps=10,
bf16=True,
gradient_checkpointing=True,
)
trainer = GRPOTrainer(
model="Qwen/Qwen2.5-0.5B-Instruct",
reward_funcs=accuracy_reward,
args=config,
train_dataset=dataset,
)
trainer.train()
6.5 Reward Functions vs Reward Models
GRPO is flexible β the reward can come from:
- A Python function (rule-based): Check if math answer is correct, if code passes tests, if format is right
- A reward model (learned): A separate neural network that scores responses
- Multiple reward functions combined: e.g., accuracy_reward + format_reward
For math/coding, rule-based rewards are often better because they provide an exact signal – the answer is either right or wrong. For open-ended tasks (chat, creative writing), a learned reward model is needed.
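For example, a rule-based format reward in the same style as the accuracy reward above (the regex and scoring scheme are illustrative, not from TRL):

```python
import re

def format_reward(completions, **kwargs):
    """1.0 if the completion wraps an answer in \\boxed{...}, else 0.0."""
    return [1.0 if re.search(r"\\boxed\{.*?\}", c) else 0.0 for c in completions]

rewards = format_reward([r"The answer is \boxed{4}.", "The answer is 4."])
# Passing a list like [accuracy_reward, format_reward] as reward_funcs
# combines the signals per completion.
```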
Chapter 7: Parameter-Efficient Fine-Tuning (PEFT) – LoRA, QLoRA, and Adapters
7.1 The Memory Problem
Fine-tuning a 7B parameter model requires:
- Model weights: 7B × 2 bytes (bf16) = 14 GB
- Gradients: 14 GB
- Optimizer states (AdamW): 28 GB (2 states × 14 GB)
- Activations: Variable, often 10-30 GB
Total: ~60-80 GB for a single 7B model. That's one A100 GPU just for SFT. For RLHF with PPO (4 models), you'd need 4× this.
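The arithmetic above, in a few lines (a rough bf16 estimate matching this section's numbers; real footprints depend on the optimizer and implementation):

```python
def full_finetune_gb(params_billions, bytes_per_param=2):
    """Weights + gradients + 2 AdamW states, in GB (activations excluded)."""
    weights = params_billions * bytes_per_param
    grads = weights
    optimizer_states = 2 * weights
    return weights + grads + optimizer_states

base = full_finetune_gb(7)  # 56 GB before activations push it toward 60-80 GB
```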
7.2 LoRA: Low-Rank Adaptation
Paper: "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021) – arXiv:2106.09685
The insight: When fine-tuning, the weight updates have low rank – they can be approximated by small matrices without much loss.
Instead of updating the full weight matrix W (d × d), LoRA adds two small matrices:
W' = W + α · B × A
where:
W is the original frozen weight (d × d)
A is a small matrix (r × d) – the "down projection"
B is a small matrix (d × r) – the "up projection"
r << d (typically r = 8, 16, 32) – the "rank"
α is a scaling factor
Only A and B are trained – the original weights are frozen. This reduces trainable parameters by 100-1000×.
Example: For a 4096 × 4096 weight matrix:
- Full fine-tuning: 16.7M parameters
- LoRA with r=16: 2 × 4096 × 16 = 131K parameters (128× fewer!)
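You can reproduce that count directly (a sketch, assuming a single weight matrix adapted at rank r):

```python
def lora_param_counts(d_in, d_out, r):
    """(full fine-tuning params, LoRA params) for one adapted weight matrix."""
    full = d_in * d_out           # the whole matrix W
    lora = r * (d_in + d_out)     # A (r x d_in) plus B (d_out x r)
    return full, lora

full, lora = lora_param_counts(4096, 4096, r=16)
reduction = full // lora  # 128x fewer trainable parameters
```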
7.3 QLoRA: Quantized LoRA
Paper: "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023) – arXiv:2305.14314
QLoRA goes further: it quantizes the frozen base model to 4-bit precision, then adds LoRA adapters on top.
- Base model: 4-bit quantized (NF4 format) – a 7B model fits in ~4 GB
- LoRA adapters: Trained in bf16/fp16
- Gradient computation: Done in bf16/fp16
This allows fine-tuning a 7B model on a single consumer GPU (e.g., RTX 4090 with 24 GB).
7.4 Using LoRA with TRL
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
# Define LoRA configuration
lora_config = LoraConfig(
r=16, # Rank
lora_alpha=32, # Scaling factor (usually 2×r)
lora_dropout=0.05, # Dropout for regularization
target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
task_type="CAUSAL_LM",
)
config = SFTConfig(
output_dir="./sft-lora",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-4, # LoRA typically uses higher LR than full fine-tuning
bf16=True,
gradient_checkpointing=True,
)
trainer = SFTTrainer(
model="meta-llama/Llama-3.1-8B",
args=config,
train_dataset=dataset,
peft_config=lora_config, # Pass LoRA config here
)
trainer.train()
7.5 When to Use LoRA vs Full Fine-Tuning
| Scenario | Recommendation |
|---|---|
| Limited GPU memory | LoRA / QLoRA |
| Quick experiment / prototype | LoRA |
| Maximum quality, sufficient compute | Full fine-tuning |
| Multiple task-specific models from same base | LoRA (swap adapters) |
| Very small dataset | LoRA (acts as regularizer) |
Key trade-off: LoRA is ~95-99% as good as full fine-tuning for most tasks, at a fraction of the compute. For maximum quality (e.g., training a production model), full fine-tuning is still king.
Chapter 8: The Toolbox – Libraries, Frameworks, and Infrastructure
8.1 TRL (Transformers Reinforcement Learning)
Repository: github.com/huggingface/trl Documentation: huggingface.co/docs/trl
TRL is the central library for post-training. It provides trainers for every major method:
| Trainer | Method | Config Class | Dataset Type |
|---|---|---|---|
| `SFTTrainer` | Supervised Fine-Tuning | `SFTConfig` | Language modeling or prompt-completion |
| `DPOTrainer` | Direct Preference Optimization | `DPOConfig` | Preference (prompt + chosen + rejected) |
| `GRPOTrainer` | Group Relative Policy Optimization | `GRPOConfig` | Prompt-only |
| `RLOOTrainer` | REINFORCE Leave-One-Out | `RLOOConfig` | Prompt-only |
| `RewardTrainer` | Reward Model Training | `RewardConfig` | Preference |
| `KTOTrainer` | Kahneman-Tversky Optimization | `KTOConfig` | Unpaired preference |
| `ORPOTrainer` | Odds Ratio Preference Optimization | `ORPOConfig` | Preference |
| `CPOTrainer` | Contrastive Preference Optimization | `CPOConfig` | Preference |
| `OnlineDPOTrainer` | Online DPO | `OnlineDPOConfig` | Prompt-only |
| `PPOTrainer` | Proximal Policy Optimization | `PPOConfig` | Tokenized language modeling |
| `XPOTrainer` | Exploratory Preference Optimization | `XPOConfig` | Prompt-only |
| `NashMDTrainer` | Nash Mirror Descent | `NashMDConfig` | Prompt-only |
| `PRMTrainer` | Process Reward Model | `PRMConfig` | Stepwise supervision |
Key features:
- Integrates seamlessly with Hugging Face `transformers` and `datasets`
- Built-in PEFT/LoRA support via the `peft_config` argument
- vLLM integration for fast generation in online methods
- DeepSpeed ZeRO for distributed training
- Supports both standard and conversational dataset formats
8.2 Transformers
Repository: github.com/huggingface/transformers
The foundation library. You'll use it for:
- `AutoModelForCausalLM` – Loading language models
- `AutoTokenizer` – Tokenization and chat templates
- `TrainingArguments` – Base training configuration
- `Trainer` – Base trainer class (TRL trainers inherit from this)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
8.3 PEFT (Parameter-Efficient Fine-Tuning)
Repository: github.com/huggingface/peft
Provides LoRA, QLoRA, and other adapter methods. Key classes:
- LoraConfig – configure LoRA adapters
- get_peft_model() – wrap a model with adapters
- PeftModel.from_pretrained() – load saved adapters
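The parameter savings behind these classes follow from simple arithmetic: a rank-r LoRA adapter on a d×k weight matrix adds r·(d + k) trainable parameters instead of d·k. A minimal sketch, with illustrative dimensions (not taken from any specific model config):

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Parameters added by a rank-r LoRA adapter on a d x k weight:
    A is r x k and B is d x r, so r * (d + k) in total."""
    return r * (d + k)

# Illustrative numbers: a 4096 x 4096 attention projection, rank 16
full = 4096 * 4096                        # 16,777,216 frozen parameters
lora = lora_param_count(4096, 4096, 16)   # 131,072 trainable parameters
print(f"trainable fraction: {lora / full:.4f}")  # 0.0078, well under 1%
```

This is why the "With LoRA" column in the hardware table later in this guide is so much smaller: only the adapters need gradients and optimizer states.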
8.4 Accelerate
Repository: github.com/huggingface/accelerate
Handles distributed training across multiple GPUs/nodes. You rarely interact with it directly β it works behind the scenes when you use accelerate launch:
# Single GPU
python train.py
# Multi-GPU
accelerate launch --num_processes 4 train.py
# Multi-GPU with DeepSpeed
accelerate launch --config_file deepspeed_zero3.yaml train.py
8.5 Datasets
Repository: github.com/huggingface/datasets
Efficient dataset loading and processing:
from datasets import load_dataset
# Load from Hub
dataset = load_dataset("trl-lib/Capybara", split="train")
# Streaming (for huge datasets)
dataset = load_dataset("trl-lib/Capybara", split="train", streaming=True)
# Inspect
print(dataset.column_names) # ['messages']
print(dataset[0]) # First example
8.6 vLLM
Repository: github.com/vllm-project/vllm
High-throughput inference engine. Critical for:
- Online methods (GRPO, RLOO, Online DPO): speeds up generation during training by 5-10×
- Inference serving: Deploy models for production use
TRL integrates vLLM directly:
config = GRPOConfig(
use_vllm=True, # Enable vLLM for generation
vllm_mode="colocate", # Run on same GPUs as training
)
8.7 Other Important Tools
| Tool | Purpose | Link |
|---|---|---|
| Unsloth | 2-5× faster LoRA training, lower memory | github.com/unslothai/unsloth |
| bitsandbytes | 4/8-bit quantization for QLoRA | github.com/bitsandbytes-foundation/bitsandbytes |
| Flash Attention | Memory-efficient attention | github.com/Dao-AILab/flash-attention |
| DeepSpeed | Distributed training (ZeRO) | github.com/microsoft/DeepSpeed |
| Weights & Biases | Experiment tracking | wandb.ai |
| Trackio | HF-native experiment tracking | HF Docs |
| LM Eval Harness | Standardized LLM evaluation | github.com/EleutherAI/lm-evaluation-harness |
Chapter 9: Datasets β What to Train On
9.1 SFT Datasets
| Dataset | Size | Description | Link |
|---|---|---|---|
| trl-lib/Capybara | ~90K msgs | High-quality multi-turn conversations | Hub |
| HuggingFaceH4/ultrachat_200k | 200K | Diverse multi-turn conversations | Hub |
| allenai/tulu-3-sft-mixture | ~1.3M | Large-scale SFT mixture from AI2 | Hub |
| OpenAssistant/oasst1 | 161K msgs | Crowdsourced conversation trees | Hub |
| tatsu-lab/alpaca | 52K | GPT-generated instruction data | Hub |
| teknium/OpenHermes-2.5 | 1M | Large synthetic instruction dataset | Hub |
9.2 Preference Datasets (for DPO/KTO/ORPO)
| Dataset | Size | Description | Link |
|---|---|---|---|
| trl-lib/ultrafeedback_binarized | 60K | Binarized UltraFeedback preferences | Hub |
| Anthropic/hh-rlhf | 170K | Human preference data (helpful + harmless) | Hub |
| argilla/ultrafeedback-binarized-preferences | 60K | Cleaned UltraFeedback | Hub |
9.3 Prompt-Only Datasets (for GRPO/RLOO)
| Dataset | Size | Description | Link |
|---|---|---|---|
| trl-lib/DeepMath-103K | 103K | Math problems with verifiable answers | Hub |
| AI-MO/NuminaMath-TIR | ~70K | Math competition problems | Hub |
9.4 How to Choose a Dataset
For your first experiment: use trl-lib/Capybara (SFT) or trl-lib/ultrafeedback_binarized (DPO). They're well-formatted and TRL-compatible out of the box.
Quality over quantity: LIMA showed that ~1K carefully curated examples can beat 52K mediocre ones. Invest in data curation.
Match your use case: If training a math model, use math-specific data. If training a general assistant, use diverse conversational data.
Always inspect before training:
from datasets import load_dataset
ds = load_dataset("trl-lib/Capybara", split="train")
print(ds[0]) # Look at the data!
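Beyond eyeballing the first row, a quick structural check catches malformed conversational examples before you burn GPU hours. A sketch for the "messages" format shown above (the helper function is ours, not part of any library):

```python
def check_messages_example(example: dict) -> bool:
    """Validate one example in the conversational SFT format:
    a 'messages' list of {'role', 'content'} dicts with non-empty content."""
    msgs = example.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    for m in msgs:
        if m.get("role") not in {"system", "user", "assistant"}:
            return False
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            return False
    # A usable training example needs at least one assistant turn
    return any(m["role"] == "assistant" for m in msgs)

good = {"messages": [{"role": "user", "content": "Hi"},
                     {"role": "assistant", "content": "Hello!"}]}
bad = {"messages": [{"role": "user", "content": ""}]}
print(check_messages_example(good), check_messages_example(bad))  # True False
```

Running a check like this over `dataset.filter(...)` takes seconds and surfaces empty turns or role typos immediately.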
Chapter 10: Evaluation β How to Know If It Worked
10.1 The Evaluation Problem
Evaluating LLMs is fundamentally hard because:
- Open-ended outputs can be correct in many different ways
- Perplexity doesn't correlate well with usefulness (LIMA found this explicitly)
- Benchmark scores don't always reflect real-world performance
- Human evaluation is expensive and subjective
10.2 Automated Benchmarks
| Benchmark | What It Measures | How It Works |
|---|---|---|
| MMLU | Knowledge across 57 subjects | Multiple-choice questions |
| HellaSwag | Commonsense reasoning | Sentence completion |
| ARC | Science reasoning | Multiple-choice science questions |
| TruthfulQA | Truthfulness | Questions designed to elicit false claims |
| GSM8K | Math reasoning | Grade-school math word problems |
| MATH | Advanced math | Competition-level math problems |
| HumanEval | Code generation | Python programming problems |
| MBPP | Code generation | Basic Python problems |
| IFEval | Instruction following | Verifiable instruction constraints |
Tool: lm-evaluation-harness runs all of these:
lm_eval --model hf \
--model_args pretrained=your-model \
--tasks mmlu,gsm8k,hellaswag \
--batch_size 8
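Benchmarks with verifiable answers (GSM8K, MATH) are scored by extracting the final number from the model's output and comparing it exactly against the reference. A minimal sketch of that scoring logic (illustrative, not the harness's actual implementation):

```python
import re

def extract_final_number(text: str):
    """Return the last number in the text, GSM8K-style (handles '#### 42')."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def exact_match(prediction: str, reference: str) -> bool:
    return extract_final_number(prediction) == extract_final_number(reference)

pred = "She bakes 3 trays of 8 cookies, so 3 * 8 = 24 cookies. #### 24"
ref = "#### 24"
print(exact_match(pred, ref))  # True
```

The same extract-and-compare pattern reappears as the "verifiable reward" in GRPO training, which is why math datasets double as RL training data.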
10.3 LLM-as-Judge Evaluations
| Evaluation | Description | Link |
|---|---|---|
| AlpacaEval | GPT-4 compares model outputs to a reference | github.com/tatsu-lab/alpaca_eval |
| MT-Bench | Multi-turn dialogue evaluation by GPT-4 | Part of LMSYS FastChat |
| Arena Hard | Challenging prompts, GPT-4 judged | LMSYS arena-hard-auto |
10.4 Human Evaluation
The gold standard. Key approaches:
- Side-by-side comparison: Show humans two responses, ask which is better
- Likert scale: Rate each response on helpfulness, accuracy, harmlessness (1-7)
- Chatbot Arena: Users chat with two anonymous models and vote for the better one
The LMSYS Chatbot Arena provides the most widely cited human evaluation through crowdsourced blind comparisons.
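Arena-style leaderboards aggregate those pairwise votes into per-model ratings. The classic Elo update (the Arena has since moved to Bradley-Terry-style fitting, but Elo conveys the idea) is only a few lines; this is a sketch, not the Arena's actual code:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """One Elo update after a head-to-head vote between models A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two models start at 1000; A wins one vote
print(elo_update(1000.0, 1000.0, a_wins=True))  # (1016.0, 984.0)
```

Upsets move ratings more than expected wins: beating a higher-rated model earns a larger update than beating a lower-rated one.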
10.5 The Open LLM Leaderboard
Hugging Face hosts the Open LLM Leaderboard which evaluates open-source models across standardized benchmarks. It's the primary way the community tracks progress.
Chapter 11: Putting It All Together β A Complete Post-Training Recipe
11.1 The Standard Recipe (2024-2025)
Here's a typical post-training pipeline for building a chat model:
Step 1: Choose Base Model
├── Qwen3 (0.6B to 235B) – currently a top-performing family
├── Llama 3.1/3.2 (1B to 405B) – Meta's open models
├── Gemma 3/4 (1B to 27B) – Google's open models
└── Mistral/Mixtral – strong efficiency
Step 2: SFT
├── Dataset: trl-lib/Capybara or HuggingFaceH4/ultrachat_200k
├── Method: SFTTrainer with LoRA (for efficiency) or full fine-tuning
├── Epochs: 2-3
├── LR: 2e-5 (full) or 2e-4 (LoRA)
└── Output: SFT model (becomes the reference model for Step 3)
Step 3: Preference Optimization (choose one)
├── Option A: DPO (simplest, most popular)
│   ├── Dataset: trl-lib/ultrafeedback_binarized
│   ├── β: 0.1
│   ├── LR: 5e-7
│   └── Epochs: 1-2
├── Option B: GRPO (best for reasoning tasks)
│   ├── Dataset: trl-lib/DeepMath-103K (math)
│   ├── Reward: accuracy_reward + format_reward
│   ├── num_generations: 16
│   └── LR: 1e-6
└── Option C: KTO (if you only have binary feedback)
    ├── Dataset: unpaired preference data
    └── Similar hyperparameters to DPO
Step 4: Evaluation
├── Automated: lm-eval-harness (MMLU, GSM8K, etc.)
├── LLM-Judge: MT-Bench, AlpacaEval
└── Manual: test with real prompts
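Option B's rewards are plain Python callables that TRL's GRPOTrainer accepts via its reward_funcs argument, each returning one score per completion. A sketch of the two rewards named above (our illustrative implementations and answer format, not TRL's built-ins):

```python
import re

def format_reward(completions, **kwargs):
    """1.0 if the completion wraps its answer in \\boxed{...} (illustrative format)."""
    return [1.0 if re.search(r"\\boxed\{.+?\}", c) else 0.0 for c in completions]

def accuracy_reward(completions, answers, **kwargs):
    """1.0 if the boxed answer exactly matches the reference answer."""
    rewards = []
    for completion, answer in zip(completions, answers):
        m = re.search(r"\\boxed\{(.+?)\}", completion)
        rewards.append(1.0 if m and m.group(1).strip() == str(answer) else 0.0)
    return rewards

comps = [r"2 + 2 = 4, so \boxed{4}", "the answer is 4"]
print(format_reward(comps))                    # [1.0, 0.0]
print(accuracy_reward(comps, answers=[4, 4]))  # [1.0, 0.0]
```

Summing the two gives partial credit for well-formatted but wrong answers, which helps the model lock in the output format early in training.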
11.2 Minimal Working Example: SFT + DPO
# === Stage 1: SFT ===
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
sft_dataset = load_dataset("trl-lib/Capybara", split="train")
sft_config = SFTConfig(
output_dir="./sft-model",
num_train_epochs=2,
per_device_train_batch_size=4,
learning_rate=2e-5,
max_seq_length=2048,
bf16=True,
gradient_checkpointing=True,
push_to_hub=True,
hub_model_id="your-username/my-sft-model",
)
sft_trainer = SFTTrainer(
model="Qwen/Qwen3-0.6B",
args=sft_config,
train_dataset=sft_dataset,
)
sft_trainer.train()
# === Stage 2: DPO ===
from trl import DPOTrainer, DPOConfig
dpo_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
dpo_config = DPOConfig(
output_dir="./dpo-model",
num_train_epochs=1,
per_device_train_batch_size=4,
learning_rate=5e-7,
beta=0.1,
bf16=True,
gradient_checkpointing=True,
push_to_hub=True,
hub_model_id="your-username/my-dpo-model",
)
dpo_trainer = DPOTrainer(
model="your-username/my-sft-model", # SFT model from stage 1
args=dpo_config,
train_dataset=dpo_dataset,
)
dpo_trainer.train()
11.3 Hardware Guidelines
| Model Size | Minimum GPU | Recommended | With LoRA |
|---|---|---|---|
| 0.5-3B | 1× A10G (24 GB) | 1× A100 (80 GB) | 1× T4 (16 GB) |
| 7-8B | 1× A100 (80 GB) | 2× A100 | 1× A10G (24 GB) |
| 13B | 2× A100 | 4× A100 | 1× A100 (80 GB) |
| 70B | 4× A100 | 8× A100 | 2× A100 |
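The table's jumps follow from back-of-the-envelope arithmetic: full fine-tuning with Adam in mixed precision costs roughly 16 bytes per parameter (bf16 weights and gradients plus fp32 optimizer states), while QLoRA keeps the base model in 4-bit (~0.5 bytes/param) and trains only a small adapter. A sketch using rule-of-thumb constants, ignoring activations, gradient checkpointing, and ZeRO sharding/offloading (which is how the table's minimums squeeze under these estimates):

```python
def full_finetune_gb(params_b: float, bytes_per_param: int = 16) -> float:
    """Rough memory for full fine-tuning: bf16 weights + grads + fp32 Adam states."""
    return params_b * 1e9 * bytes_per_param / 1e9

def qlora_gb(params_b: float, adapter_params_m: float = 100.0) -> float:
    """Rough memory for QLoRA: 4-bit frozen base (~0.5 B/param) + trained adapter."""
    base = params_b * 1e9 * 0.5 / 1e9
    adapter = adapter_params_m * 1e6 * 16 / 1e9  # adapter still needs full training state
    return base + adapter

print(f"8B full fine-tune: ~{full_finetune_gb(8):.0f} GB")  # ~128 GB
print(f"8B QLoRA:          ~{qlora_gb(8):.1f} GB")          # ~5.6 GB
```

The ~20× gap is why an 8B model that needs multi-GPU full fine-tuning trains comfortably with QLoRA on a single 24 GB card.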
Chapter 12: The Reading List β Papers Every Practitioner Should Read
Tier 1: Must-Read (The Foundations)
InstructGPT – "Training Language Models to Follow Instructions with Human Feedback"
- Ouyang et al., 2022 – arXiv:2203.02155
- Why: Established the SFT → RM → PPO pipeline. Everything starts here.
DPO – "Direct Preference Optimization: Your Language Model is Secretly a Reward Model"
- Rafailov et al., 2023 – arXiv:2305.18290
- Why: Eliminated the separate reward model and RL loop. The most widely used preference optimization method.
LoRA – "LoRA: Low-Rank Adaptation of Large Language Models"
- Hu et al., 2021 – arXiv:2106.09685
- Why: Made fine-tuning accessible. Practically every fine-tuning workflow uses LoRA.
DeepSeek-R1 – "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning"
- DeepSeek-AI, 2025 – arXiv:2501.12948
- Why: Showed RL can teach reasoning from scratch. Opened the "reasoning era."
Tier 2: Important (Deepening Understanding)
LIMA – "LIMA: Less Is More for Alignment"
- Zhou et al., 2023 – arXiv:2305.11206
- Why: The Superficial Alignment Hypothesis. Data quality matters far more than quantity.
Constitutional AI – "Constitutional AI: Harmlessness from AI Feedback"
- Bai et al., 2022 – arXiv:2212.08073
- Why: AI feedback replacing human feedback. The foundation for RLAIF.
FLAN – "Finetuned Language Models Are Zero-Shot Learners"
- Wei et al., 2021 – arXiv:2109.01652
- Why: Proved instruction tuning works. A foundation for SFT.
Self-Instruct – "Self-Instruct: Aligning Language Models with Self-Generated Instructions"
- Wang et al., 2022 – arXiv:2212.10560
- Why: Synthetic data generation. Led to Alpaca and the open-source SFT revolution.
DeepSeekMath – "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models"
- Shao et al., 2024 – arXiv:2402.03300
- Why: Introduced GRPO. The paper that started the GRPO wave.
QLoRA – "QLoRA: Efficient Finetuning of Quantized LLMs"
- Dettmers et al., 2023 – arXiv:2305.14314
- Why: Made 7B fine-tuning possible on consumer GPUs.
Tier 3: Advanced (Cutting Edge)
KTO – "KTO: Model Alignment as Prospect Theoretic Optimization"
- Ethayarajh et al., 2024 – arXiv:2402.01306
ORPO – "ORPO: Monolithic Preference Optimization without Reference Model"
- Hong et al., 2024 – arXiv:2403.07691
SimPO – "SimPO: Simple Preference Optimization with a Reference-Free Reward"
- Meng et al., 2024 – arXiv:2405.14734
Tulu 3 – "Tulu 3: Pushing Frontiers in Open Language Model Post-Training"
- AI2 (Lambert et al.), 2024 – A comprehensive open-source post-training recipe
Zephyr – "Zephyr: Direct Distillation of LM Alignment"
- Tunstall et al., 2023 – arXiv:2310.16944
- Why: An open-source DPO recipe that matched much larger models.
Tier 4: Background (RL Foundations, if you want to go deeper)
PPO – "Proximal Policy Optimization Algorithms"
- Schulman et al., 2017 – arXiv:1707.06347
Learning to Summarize from Human Feedback
- Stiennon et al., 2020 – arXiv:2009.01325
- Why: Scaled RLHF for summarization; the direct precursor to InstructGPT.
Fine-Tuning Language Models from Human Preferences
- Ziegler et al., 2019 – arXiv:1909.08593
- Why: The original RLHF-for-language-models paper.
Glossary
| Term | Definition |
|---|---|
| Alignment | Making a model behave according to human intentions and values |
| RLHF | Reinforcement Learning from Human Feedback – using human preference data to train a reward model, then optimizing the LM with RL |
| RLAIF | RL from AI Feedback – using an AI system instead of humans to provide feedback |
| SFT | Supervised Fine-Tuning – training on instruction-response pairs with standard cross-entropy loss |
| DPO | Direct Preference Optimization – training directly on preference pairs without a separate reward model or RL |
| GRPO | Group Relative Policy Optimization – RL method that normalizes rewards within a group of completions |
| PPO | Proximal Policy Optimization – the RL algorithm used in classical RLHF |
| Reward Model (RM) | A model trained to score responses based on human preferences |
| Policy | In RL terms, the language model being trained (maps states/prompts to actions/tokens) |
| Reference Model (π_ref) | The SFT model used as a baseline to prevent the policy from deviating too far |
| KL Divergence | A measure of how different two probability distributions are – used to keep the policy close to the reference |
| Bradley-Terry Model | A probabilistic model for pairwise comparisons: P(A > B) = σ(score(A) - score(B)) |
| Reward Hacking | When the model learns to exploit the reward model rather than genuinely improve |
| LoRA | Low-Rank Adaptation – parameter-efficient fine-tuning using small rank-decomposed matrices |
| QLoRA | Quantized LoRA – combines 4-bit quantization of the base model with LoRA adapters |
| Chat Template | The specific text format (special tokens, roles) a model uses for conversations |
| On-policy | Training on data generated by the current model (e.g., GRPO, Online DPO) |
| Off-policy | Training on data generated by a different model (e.g., standard DPO on static datasets) |
| Preference Data | Pairs of responses where one is marked as preferred over the other |
| Advantage | How much better a specific action is compared to the expected value |
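The Bradley-Terry entry is worth making concrete, since both reward-model training and the DPO loss are built on it. A minimal sketch with hypothetical scalar scores:

```python
import math

def bt_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry: P(A preferred over B) = sigmoid(score_A - score_B)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def rm_pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Reward-model training objective: -log P(chosen > rejected)."""
    return -math.log(bt_probability(score_chosen, score_rejected))

# A score gap of 2 means the chosen response wins ~88% of the time
print(round(bt_probability(2.0, 0.0), 3))    # 0.881
print(round(rm_pairwise_loss(2.0, 0.0), 3))  # 0.127
```

Minimizing the pairwise loss pushes the chosen score above the rejected score; DPO applies the same sigmoid to implicit rewards derived from policy and reference log-probabilities instead of a learned scalar score.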
Quick Reference: TRL Commands
# Install TRL
pip install trl
# Run SFT from command line
trl sft --model_name_or_path Qwen/Qwen3-0.6B \
--dataset_name trl-lib/Capybara \
--output_dir ./sft-output
# Run DPO from command line
trl dpo --model_name_or_path your-sft-model \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir ./dpo-output
# Run GRPO from command line
trl grpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--dataset_name trl-lib/DeepMath-103K \
--output_dir ./grpo-output
# Start vLLM server for fast inference
trl vllm-serve --model Qwen/Qwen3-0.6B
# Multi-GPU training
accelerate launch --num_processes 4 train.py
# With DeepSpeed ZeRO-3
accelerate launch --config_file deepspeed_zero3.yaml train.py
Quick Reference: Dataset Formats by Trainer
# SFT (Language modeling format)
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
# SFT (Prompt-completion format)
{"prompt": "...", "completion": "..."}
# DPO / ORPO / CPO (Preference format)
{"prompt": "...", "chosen": "...", "rejected": "..."}
# Or conversational:
{"prompt": [{"role": "user", "content": "..."}],
"chosen": [{"role": "assistant", "content": "..."}],
"rejected": [{"role": "assistant", "content": "..."}]}
# GRPO / RLOO / Online DPO (Prompt-only format)
{"prompt": "..."}
# Or conversational:
{"prompt": [{"role": "user", "content": "..."}]}
# KTO (Unpaired preference format)
{"prompt": "...", "completion": "...", "label": True}
# Reward Model (Preference format β same as DPO)
{"prompt": "...", "chosen": "...", "rejected": "..."}
# PRM (Stepwise supervision format)
{"prompt": "...", "completions": ["step1", "step2"], "labels": [True, False]}
Where to Go Next
- Hands-on: Try the TRL notebooks on Google Colab β they run for free
- Course: The Hugging Face smol course covers post-training step by step
- Community: Join the Hugging Face Discord and the #trl channel
- Papers: Start with InstructGPT and DPO from the reading list, then follow your interests
- Experiment: Fine-tune a small model (Qwen3-0.6B) on your own data β the best way to learn is by doing
This guide was compiled from primary research papers, official Hugging Face documentation, and the TRL library source code. All paper citations link to their arXiv pages. All code examples use current API patterns from TRL v1.2+.
Last updated: April 2026