Oh, thank you!
Stefano Fiorucci PRO
anakin87
AI & ML interests
Language Models: orchestration, post-training, GRPO, synthetic data...
Contributing to Haystack LLM framework
Recent Activity
replied to their post about 12 hours ago
How does LLM training with RL Environments work?
It all starts with Reinforcement Learning with Verifiable Rewards (RLVR):
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training
In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s)
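The RLVR loop above can be sketched in a few lines. This is a minimal illustration, not any specific library's API; the answer-extraction convention (`#### answer`) and function names are assumptions:

```python
def extract_answer(completion: str) -> str:
    """Pull the final answer out of a 'reasoning ... #### answer' completion."""
    return completion.split("####")[-1].strip()

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the ground truth, else 0.0.

    This scalar is what drives the RL update: no learned judge,
    just a deterministic check against a fixed answer.
    """
    return 1.0 if extract_answer(completion) == ground_truth else 0.0

# Example: a math question with a known ground-truth answer
completion = "2 + 2 means adding two twos. #### 4"
print(verifiable_reward(completion, "4"))  # 1.0
```

Because the check is deterministic, the reward is cheap to compute and hard to game compared to a learned reward model.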
Consider a more complex tic-tac-toe env
It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions
(envs can also include tools)
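A toy sketch of what such an environment's interface might look like. The class and method names are illustrative assumptions (loosely following the common `reset`/`step` convention), and the "opponent" here is just a random player; a real env would implement actual opponent skill levels:

```python
import random

class TicTacToeEnv:
    """Toy multi-turn environment sketch: dynamic games, tunable opponent.

    Unlike a fixed Q&A dataset, the env generates fresh games on reset,
    exposes an opponent-skill knob, and interacts over multiple turns.
    """

    def __init__(self, opponent_skill=0.5, seed=None):
        self.opponent_skill = opponent_skill  # 0.0 = random, 1.0 = strong (illustrative)
        self.rng = random.Random(seed)
        self.board = [" "] * 9

    def reset(self):
        """Start a new dynamically generated game; return the observation."""
        self.board = [" "] * 9
        return self.board

    def step(self, move):
        """Apply the model's move, then the opponent's.

        Returns (observation, reward, done).
        """
        if self.board[move] != " ":
            return self.board, -1.0, True  # illegal move ends the episode
        self.board[move] = "X"
        empty = [i for i, c in enumerate(self.board) if c == " "]
        if empty:
            # a skilled opponent would search; this sketch just plays randomly
            self.board[self.rng.choice(empty)] = "O"
        done = " " not in self.board
        return self.board, 0.0, done
```

Tools (function calls the model can make mid-episode) would slot into the same `step` loop as additional actions.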
---
What happens at training?
We use Group Relative Policy Optimization (GRPO) with a tic-tac-toe env.
No critic model needed: the group itself is the baseline.
Simpler than PPO.
1. Rollout generation: from the same board, the model plays N games via sampling
2. Each game scored with deterministic rewards (win, format, ...)
3. Mean score computed across the group
4. Each rollout's advantage = its score minus the group mean
5. Model updated to favor trajectories above the baseline
Repeat
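The advantage step (steps 3 and 4 above) is simple enough to show directly. A minimal sketch, assuming plain per-rollout scalar rewards; some GRPO variants also normalize by the group's standard deviation:

```python
def group_relative_advantages(rewards):
    """Advantage of each rollout = its reward minus the group mean.

    The group mean plays the role of the baseline that PPO would
    estimate with a separate critic model.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# N = 4 games played from the same starting board, scored deterministically
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # [0.5, -0.5, -0.5, 0.5]
```

Rollouts above the group mean get positive advantages and are reinforced; those below get pushed down.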
For a deep dive, check out
https://github.com/anakin87/llm-rl-environments-lil-course
a free hands-on course on RL environments for LLMs