Hugging Face H4

Team

company

https://github.com/huggingface/alignment-handbook

Activity Feed

AI & ML interests

Aligning LLMs to be helpful, honest, harmless, and huggy (H4)

Recent Activity

sayakpaul authored a paper 9 days ago

DynEval: Holistic Evaluations of T2I Generative Models in the Wild

sayakpaul authored a paper 19 days ago

Posterior Augmented Flow Matching

sayakpaul authored a paper 19 days ago

4KLSDB: A Large-Scale Dataset for 4K Image Restoration and Generation

sergiopaniego

posted an update 2 days ago

quick reminder! 🚨

tomorrow (Tuesday, July 28), we're back with Class 3 of the Training Agents live series

🧠 what: reinforcement learning for training agents (GRPO): how it works, how to implement it in TRL, and end-to-end examples
🗓️ when: Tuesday, July 28 - 🕔 5:00 PM CEST / 8:30 PM IST
📍 where: Live on @huggingface's X, YouTube, and LinkedIn

live: https://www.youtube.com/watch?v=ztdTed5egrM

class 1: https://x.com/SergioPaniego/status/2069382207618379813
class 2: https://x.com/SergioPaniego/status/2075180665184686187

sergiopaniego

posted an update 5 days ago

you can now train your own coding agents with trl + openenv, starting with opencode

we just added end-to-end support for training agent harnesses:

> TRL: a loop-owning training path (AsyncGRPOTrainer + HarnessRolloutWorker) that launches the agent in an OpenEnv session, reads back its trace, reconstructs the training samples, and trains with AsyncGRPO
> OpenEnv: the OpenCode harness environment plus a transparent proxy that forwards the agent's model calls and records each turn's token ids and logprobs

you train the actual opencode agent as is: it runs its own loop and tools, and the policy learns from the exact tokens it produced

we're shipping a self-contained example: local subprocess sandbox, DeepCoder problems, validated on Qwen3-8B.

> example: https://github.com/huggingface/trl/blob/main/examples/scripts/openenv/opencode.py
> docs: https://huggingface.co/docs/trl/main/openenv

and we're working actively on both sides so expect more 🤓

sergiopaniego

posted an update 6 days ago

you can train DiffusionGemma (a block-diffusion LLM) in TRL! and we're sharing an example for it

TRL trainers are made to be easily extended and adapted to different real-world use cases.

in this one, with a single method overridden in SFTTrainer (compute_loss), you can train this model

> example: https://github.com/huggingface/trl/blob/main/examples/scripts/sft_diffusion_gemma.py
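
To make the "single overridden method" idea concrete, here is a minimal sketch of the extension pattern. It deliberately does not use TRL: `ToyTrainer` is a hypothetical stand-in for a trainer base class, and the masked-position loss is only a toy illustration, not the loss the linked script uses for DiffusionGemma.

```python
class ToyTrainer:
    """Hypothetical stand-in for a trainer base class (NOT TRL's SFTTrainer)."""

    def compute_loss(self, predictions, targets, mask):
        # Default behaviour: mean squared error over every position.
        errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
        return sum(errors) / len(errors)

    def training_step(self, batch):
        # The training loop only ever calls compute_loss, so a subclass can
        # change the objective without touching anything else.
        return self.compute_loss(batch["predictions"], batch["targets"], batch["mask"])


class MaskedOnlyTrainer(ToyTrainer):
    """Overrides one method: the loss is averaged over masked positions only."""

    def compute_loss(self, predictions, targets, mask):
        errors = [(p - t) ** 2 for p, t, m in zip(predictions, targets, mask) if m]
        # Guard against a batch with no masked positions.
        return sum(errors) / max(len(errors), 1)


batch = {
    "predictions": [1.0, 2.0, 3.0],
    "targets": [1.0, 0.0, 0.0],
    "mask": [0, 1, 0],
}
base_loss = ToyTrainer().training_step(batch)
masked_loss = MaskedOnlyTrainer().training_step(batch)
```

The rest of the training loop is inherited unchanged, which is the property the post highlights.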

sergiopaniego

posted an update 7 days ago

join us next Tuesday, July 28, for Class 3 of the Training Agents live series!

we'll dive into reinforcement learning for agent training, covering the intuition behind GRPO, how it works, and how to implement it in TRL with practical, e2e examples

see you there 🤠

live: https://www.youtube.com/live/ztdTed5egrM

> in case you missed class 1: https://x.com/SergioPaniego/status/2069382207618379813
> and in case you missed class 2: https://x.com/SergioPaniego/status/2075180665184686187
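
For readers who want the intuition behind GRPO before the class, its core idea can be sketched in a few lines: several completions are sampled for the same prompt, and each one's reward is normalized against the rest of its group. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for this sketch; TRL's implementation may differ in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the other completions of the same prompt.

    `rewards` holds one scalar reward per sampled completion in the group.
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two were rewarded, two were not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above the group mean get a positive advantage and are reinforced, and those below it get a negative one. No separate value model is needed for this step.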

sayakpaul

authored a paper 9 days ago

DynEval: Holistic Evaluations of T2I Generative Models in the Wild

Paper • 2607.11199 • Published 16 days ago

sayakpaul

authored 3 papers 19 days ago

submitted a paper to Daily Papers 19 days ago

Flash-BoN: Instant Drafts for Inference-Time Scaling in Diffusion Models

Paper • 2607.04461 • Published 24 days ago • 11

sergiopaniego

posted an update 21 days ago

Frontier models use distillation as a step of their post-training pipelines.

In 2026 it has three jobs: compress a big model into a small one, merge RL experts into a single model, and let a model teach itself.

I wrote up which frontier models use each one and how: https://huggingface.co/blog/sergiopaniego/distillation-2026

It pairs with Class 2 of the Training Agents series Ben and I are doing, where we teach these techniques hands-on with TRL!

albertvillanova

posted an update 23 days ago

🎉 KTO is now part of the stable TRL API

As of the PR "Promote KTO to stable API" (https://github.com/huggingface/trl/pull/6175), KTOTrainer and KTOConfig have graduated from trl.experimental to the stable trl API.

This one closes out a long road. Over the past 6+ months, the "Align KTO with DPO" effort landed ~90 PRs methodically bringing KTO up to the standard we hold for stable trainers, one carefully-scoped change at a time:
- Feature parity with DPO: full VLM support (incl. multi-image), sync_ref_model, PEFT + Liger, ZeRO-3 + PEFT dtype fix, pad_to_multiple_of, activation offloading, IterableDataset and dict eval_dataset, remove_unused_columns, and reference-logprob precomputation at init.
- Consistency with DPO: aligned method order and signatures, tokenization, _prepare_dataset, PEFT handling, ref-model preparation for distributed training, and config layout — plus a new DataCollatorForKTO and output format. Metrics moved into _compute_loss and simplified to direct averages via the shared _metrics attribute.
- Removing legacy baggage: dropped encoder-decoder support, BOS/EOS handling, null_ref_context, generate_during_eval, model_init, preprocess_logits_for_metrics, model/ref adapter names, and several dead config knobs.
- Coverage: a full test suite mirroring DPO, text collator tests, VLM tests, and slow tests.
- The promotion itself: the experimental → stable move (#6175) and shim cleanup (#6287), handled so downstream users get a clean deprecation path.
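
As background on how KTO differs from DPO at the data level: DPO trains on paired chosen/rejected completions, while KTO trains on individual completions that each carry a binary desirable/undesirable label. The sketch below converts one format into the other in plain Python. The helper name `unpair_preferences` is hypothetical, and the `prompt`/`completion`/`label` field names follow the unpaired-preference format as I understand it from the TRL docs, so treat them as an assumption and check the dataset-format documentation.

```python
def unpair_preferences(pairs):
    """Turn DPO-style pairs into KTO-style single-completion examples.

    Each input dict is assumed to have "prompt", "chosen", and "rejected" keys.
    """
    examples = []
    for pair in pairs:
        # The chosen completion becomes a desirable example...
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True}
        )
        # ...and the rejected completion becomes an undesirable one.
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False}
        )
    return examples


paired = [{"prompt": "2 + 2 =", "chosen": "4", "rejected": "5"}]
unpaired = unpair_preferences(paired)
```

Because the labels are per example, KTO data does not have to come in pairs at all. The conversion above is only needed when starting from a paired dataset.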

Honestly, this has been one of the more complex tasks I've taken on since joining the team, not because any single change was hard, but because it demanded sustained consistency across a ~2,000-line trainer, with every branch, comment, and edge case kept in lockstep with DPO.

Huge thanks to everyone who reviewed along the way (especially @qgallouedec). The incremental review cadence is exactly what kept this maintainable.

KTO now sits on equal footing with our other flagship trainers. 🚀

abidlabs

posted an update about 1 month ago

Uhh did Opus 4.8 cheat on PostTrainBench??

it found an API key in the PostTrainBench environment that allowed it to generate synthetic training data without using GPU hours, boosting the base model by 0.4913

Source: https://posttrainbench.com/traces/run.html?id=claude_non_api_max_claude-opus-4-8_10h_run1__healthbench_Qwen_Qwen3-4B-Base_17315102#tab=trace

sergiopaniego

posted an update about 1 month ago

TRL v1.7.0 is out‼️

+ continuous batching makes GRPO and RLOO 1.25x faster with 16 GB less VRAM
+ proper MoE post-training across GRPO/RLOO/AsyncGRPO
+ new GMPO trainer
+ AsyncGRPO weight sync + padding-free
+ more

https://github.com/huggingface/trl/releases/tag/v1.7.0

wrote a small article about the continuous batching for GRPO feature

https://huggingface.co/blog/sergiopaniego/cb-trl-grpo

sergiopaniego

posted an update about 1 month ago

Continuous batching just landed in TRL for GRPO!

At 64 generations it runs faster and uses less VRAM than plain generate, no vLLM needed

How it works and when to reach for it, below

https://huggingface.co/blog/sergiopaniego/cb-trl-grpo
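
The general idea of continuous batching can be illustrated with a toy scheduler that counts decode steps. With static batching, each batch waits for its longest sequence. With continuous batching, a finished sequence's slot is refilled immediately from the queue. This is a simplified simulation under assumed names and positive sequence lengths, not TRL's implementation, and it ignores prefill and memory effects.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when every batch runs until its longest sequence ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when free slots are refilled before every decode step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight sequence
    steps = 0
    while queue or active:
        # Admit waiting sequences into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step emits one token for every active sequence.
        steps += 1
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


# One long generation mixed with several short ones, two slots available.
lengths = [8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

In this toy run the short sequences no longer wait behind the long one, so the continuous schedule needs fewer decode steps. GRPO benefits from this because completions sampled for the same prompt often vary widely in length.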

sergiopaniego

posted an update about 1 month ago

GLM-5.2 is open and performs competitively against Opus 4.8

day-0 in transformers + vllm + sglang, mit license 🤗

on the post-training side: critic-based ppo for variable-length agentic rollouts (ppo is back!) + an online anti-reward-hacking module that feeds the agent dummy info when it tries to cheat