1️⃣ Build a solid RL env with Verifiers (Prime Intellect)
2️⃣ Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3️⃣ SFT warm-up to teach format
4️⃣ Group-based RL (CISPO) against opponents making 20-70% random moves
5️⃣ RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies
Local Gemma 3n agent 💎🕵️🗺️: drop in a mysterious map and get the location, live weather, and top spots to visit
I've been exploring what google/gemma-3n-E4B-it can do in a local agentic setup and put together a 📓 𝙣𝙤𝙩𝙚𝙗𝙤𝙤𝙠 with Gemma + the Haystack AI framework covering 4 demos.
I initially tried to load all tools from the GitHub MCP server, which quickly filled the limited context available on Colab -> an unusable, forgetful agent ❌
Then I used the 𝗦𝗲𝗮𝗿𝗰𝗵𝗮𝗯𝗹𝗲 𝗧𝗼𝗼𝗹𝘀𝗲𝘁 🔎 🧰
It dynamically discovers the right tools from the GitHub MCP server on the fly, loading only what it actually needs for the task at hand and keeping the context lean.
Now it actually works.
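Under the hood the idea is simple: index the tool descriptions and retrieve only the relevant ones per request. Here is a minimal, library-agnostic sketch of that pattern (the embedding model and the tool names are illustrative, and this is not the exact Haystack/MCP API):

```python
# Sketch of "searchable tools": embed every tool description once, then per request
# retrieve only the tools whose descriptions match the task and hand those to the agent.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# In practice these come from the MCP server's tool listing (name + description)
tool_specs = [
    {"name": "search_issues", "description": "Search GitHub issues in a repository"},
    {"name": "create_pull_request", "description": "Open a pull request on GitHub"},
    # ... dozens more tools exposed by the GitHub MCP server
]
tool_embeddings = embedder.encode([t["description"] for t in tool_specs], convert_to_tensor=True)

def select_tools(task: str, top_k: int = 3):
    """Return only the tools most relevant to the current task."""
    query_emb = embedder.encode(task, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, tool_embeddings)[0]
    best = scores.topk(min(top_k, len(tool_specs))).indices.tolist()
    return [tool_specs[i] for i in best]

print(select_tools("find open issues mentioning 'memory leak'"))
```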
The notebook also contains:
💎 Multimodal weather agent: the mystery map demo above
💎 Visual Question Answering from a paper
💎 RAG on Rock music
It all starts with 𝗥𝗲𝗶𝗻𝗳𝗼𝗿𝗰𝗲𝗺𝗲𝗻𝘁 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗩𝗲𝗿𝗶𝗳𝗶𝗮𝗯𝗹𝗲 𝗥𝗲𝘄𝗮𝗿𝗱𝘀:
- a question is asked
- the model generates reasoning + an answer
- the answer is checked against ground truth
- the reward drives RL training
In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s)
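To make "verifiable" concrete, here is a minimal sketch of the reward side in plain Python (the tags and weights are illustrative, not a specific library's API):

```python
import re

def format_reward(completion: str) -> float:
    """Small reward for sticking to the expected <think>...</think><answer>...</answer> format."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.2 if re.search(pattern, completion, flags=re.DOTALL) else 0.0

def correctness_reward(completion: str, ground_truth: str) -> float:
    """Main reward: 1.0 if the extracted answer matches the ground truth exactly."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # The final scalar that drives the RL update
    return format_reward(completion) + correctness_reward(completion, ground_truth)
```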
Consider a more complex tic-tac-toe env ❌⭕ It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions
(envs can also include tools)
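As an illustration, tunable opponent skill can be as simple as mixing random moves with strong play; this is a hypothetical sketch (best_minimax_move is an assumed helper, not real library code):

```python
import random

def opponent_move(board: list[str], random_move_prob: float) -> int:
    """Opponent with tunable skill: with probability `random_move_prob` it plays a random
    legal move, otherwise a strong move (e.g. minimax)."""
    legal_moves = [i for i, cell in enumerate(board) if cell == " "]
    if random.random() < random_move_prob:
        return random.choice(legal_moves)
    return best_minimax_move(board)  # hypothetical strong-play helper

# A simple curriculum: weak-ish opponents first, near-perfect opponents later
easy_opponent = lambda board: opponent_move(board, random_move_prob=random.uniform(0.2, 0.7))
hard_opponent = lambda board: opponent_move(board, random_move_prob=random.uniform(0.0, 0.25))
```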
---
What happens at training?
We use 𝗚𝗿𝗼𝘂𝗽 𝗥𝗲𝗹𝗮𝘁𝗶𝘃𝗲 𝗣𝗼𝗹𝗶𝗰𝘆 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻 with a tic-tac-toe env
No critic model needed: the group is the baseline. Simpler than PPO.
1️⃣ Rollout generation: from the same board, the model plays N games via sampling
2️⃣ Each game scored with deterministic rewards (win, format, ...)
3️⃣ Mean score computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ Model updated to favor trajectories above baseline
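In code, the group-relative advantage at the heart of GRPO fits in a few lines (a NumPy sketch; many trainers also normalize by the group's standard deviation):

```python
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """Advantage of each rollout relative to its group of N rollouts from the same board."""
    r = np.asarray(rewards, dtype=np.float32)
    baseline = r.mean()                  # the group mean is the baseline (no critic model)
    advantages = r - baseline
    return advantages / (r.std() + eps)  # optional normalization, common in practice

# Example: 4 games from the same board scored win=1.0, draw=0.5, loss=0.0
print(group_relative_advantages([1.0, 0.0, 0.5, 0.0]))
```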
Over the past year, we've seen a shift in LLM Post-Training. Previously, Supervised Fine-Tuning was the most important part: making models imitate curated Question-Answer pairs.
Now we also have Reinforcement Learning with Verifiable Rewards. With techniques like GRPO, models can learn through trial and error in dynamic environments. They can climb to new heights without relying on expensively prepared data.
But what actually are these environments in practice❓ And how do you build them effectively❓
Fascinated by these concepts, I spent time exploring this space through experiments, post-training Small Language Models. I've packaged everything I learned into this short course.
What you'll learn
🔹 Agents, Environments, and LLMs: how to map Reinforcement Learning concepts to the LLM domain
🔹 How to use Verifiers (open-source library by Prime Intellect) to build RL environments as software artifacts
🔹 Common patterns: how to build single-turn, multi-turn, and tool-use environments
🔹 Hands-on: turn a small language model (LFM2-2.6B by LiquidAI) into a Tic Tac Toe master
🔸 Build the game Environment
🔸 Use it to generate synthetic data for SFT warm-up (see the sketch after this list)
🔸 Group-based Reinforcement Learning
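For the synthetic-data step, the idea is to sample games from a stronger model playing in the environment, keep the good trajectories, and save them as chat-formatted SFT examples. A hypothetical sketch (env.play and the game fields are placeholders, not the course's actual code):

```python
import json

def collect_sft_games(env, teacher_client, teacher_model: str, n_games: int = 200):
    """Sample games from a teacher model in the env and keep decent trajectories."""
    dataset = []
    for _ in range(n_games):
        game = env.play(client=teacher_client, model=teacher_model)  # hypothetical helper
        if game.outcome in ("win", "draw"):
            dataset.append({"messages": game.messages})
    return dataset

def save_jsonl(rows, path: str = "ttt_sft_warmup.jsonl"):
    """Write chat-formatted examples in the JSONL layout most SFT trainers accept."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```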
If you're interested in building "little worlds" where LLMs can learn, this course is for you.
💭 Do thinking traces make Language Models learn better? Curious what others think
𝗦𝗰𝗲𝗻𝗮𝗿𝗶𝗼
You take an instruction-following LM. You want to train it with a GRPO-style RL algorithm on a task like Tic Tac Toe. Rewards are outcome-based, applied only at the end of each episode: win/loss/draw, format adherence...
During training, the model could just output answers, but a common choice is to make it also output thinking traces.
𝗧𝗵𝗲 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻
Does forcing the model to produce thinking traces during training actually improve learning❓
💬 I'd like to hear your thoughts. Share ideas and links to relevant papers and resources.
From what I've understood so far, the answer seems to be 𝘆𝗲𝘀.
1️⃣ If you force the model to think during training, it becomes a model that thinks at inference time. It naturally allocates more budget (tokens) to a problem, which tends to improve performance.
2️⃣ While the model's "reasoning" already exists in its activation space, using explicit thinking traces as a scratchpad allows training to steer and shape that reasoning.
3️⃣ As the model produces more traces during training, the RL algorithm can progressively give higher rewards to the reasoning patterns that lead to better outcomes.
It's known that Language Models memorize data that can be extracted via prompting.
In this paper, the authors investigate this aspect:
- using open models, where prompting can be fully customized by the user, including special tokens
- focusing on open-source models like Olmo, where the full training data is available
📤 How do they extract data?
During post-training (like SFT), new tokens such as <|user|> are introduced.
The authors hypothesize that prompting the model with these tokens can make it output its alignment data (remember Magpie?).
For example, for SFT, their extraction prompt is <|endoftext|><|user|>.
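A rough sketch of what such an extraction loop could look like with 🤗 transformers (the model id and sampling settings are placeholders, not the paper's exact setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B-SFT"  # placeholder: any open post-trained model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A prompt made only of the special tokens introduced during post-training
prompt = "<|endoftext|><|user|>"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

# Sample many continuations; each one is a candidate piece of regurgitated training data
outputs = model.generate(**inputs, do_sample=True, temperature=1.0,
                         max_new_tokens=256, num_return_sequences=8)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=False)
```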
📏 Evaluating memorization
The authors compare each sampled example with the original data using vector search with embedding similarity.
They find that many outputs are semantically very similar to the original data, even if the exact words differ.
Traditional string-matching algorithms underestimate memorization by 10x.
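A simplified version of that comparison with sentence-transformers (the embedding model and threshold are illustrative, not the paper's exact choices):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

extracted = ["...model outputs sampled with the extraction prompt..."]
training_data = ["...examples from the original post-training dataset..."]

extracted_emb = embedder.encode(extracted, convert_to_tensor=True)
training_emb = embedder.encode(training_data, convert_to_tensor=True)

# For each extracted sample, find its most similar training example
scores = util.cos_sim(extracted_emb, training_emb)
best_scores, best_idx = scores.max(dim=1)

# Count a sample as (approximately) memorized above some similarity threshold
memorized = (best_scores > 0.9).sum().item()
```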
🔁 What about RL?
Surprisingly, the same technique works to extract data from Reinforcement Learning (PPO/GRPO) phases.
This is counter-intuitive because the RL objective is not designed to increase sequence likelihoods (unlike SFT).
Practical limitation: in this case, extraction relies on using the initial part of the training prompt, which is not generally public.
📈 Are the extracted data effective for post-training?
Both in SFT and RL, the extracted data can be used to fine-tune models to performance similar to the originals.
The authors suggest that model distillation, where a stronger model is used to drive the training of a weaker one, may be a form of indirect training on the original dataset.
RL environments help LLMs practice, reason, and improve. I explored the Environments Hub and wrote a walkthrough showing how to train and evaluate models using these open environments.
1️⃣ 𝗪𝗵𝘆 𝗥𝗟 𝗺𝗮𝘁𝘁𝗲𝗿𝘀 𝗳𝗼𝗿 𝗟𝗟𝗠𝘀
DeepSeek-R1 made clear that Reinforcement Learning can be used to incentivize reasoning in LLMs. In GRPO, the model generates multiple answers and learns to prefer the better ones from rewards.
2️⃣ 𝗪𝗵𝗮𝘁 𝗲𝗻𝘃𝗶𝗿𝗼𝗻𝗺𝗲𝗻𝘁𝘀 𝗮𝗿𝗲
In classic RL, the environment is the world where the Agent lives, interacts, and gets rewards to learn.
We can also think of them as software packages containing data, a harness, and scoring rules - everything the model needs to learn from and be evaluated against.
Nowadays, the Agent is not just the LLM. It can use tools, from a weather API to a terminal.
This makes environments for training and evaluation more complex and critical.
3️⃣ 𝐓𝐡𝐞 𝐨𝐩𝐞𝐧 𝐜𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐞
Big labs are advancing, but open models and the community still face a fragmented ecosystem. We risk becoming users of systems built with tools we can't access or fully understand.
4️⃣ 𝐄𝐧𝐯𝐢𝐫𝐨𝐧𝐦𝐞𝐧𝐭𝐬 𝐇𝐮𝐛
That's why I was excited when Prime Intellect released the Environments Hub.
It's a place where people share RL environments: tasks you can use to train LLMs with RL (GRPO-style) or evaluate Agents. Plus, the Verifiers library (@willcb) standardizes the creation of RL environments and evaluations. They can help to keep science and experimentation open. 🔬
I explored the Hub and wrote a hands-on walkthrough 📝
- RL + LLMs basics
- Environments Hub navigation
- Evaluating models/Agents
- GRPO training of a tiny model on an alphabetical sort task
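To give a feel of the workflow, evaluating a model on a Hub environment looks roughly like this (the environment id is illustrative and the exact Verifiers signatures are an assumption; check the library docs):

```python
# pip install verifiers  (plus the environment package from the Environments Hub)
import verifiers as vf
from openai import OpenAI

env = vf.load_environment("alphabet-sort")  # illustrative environment id

# Any OpenAI-compatible endpoint works, e.g. a local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

results = env.evaluate(client=client, model="HuggingFaceTB/SmolLM2-360M-Instruct", num_examples=20)
print(results)
```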
🎥 In the video, the Agent:
- Goes to Hugging Face Spaces
- Finds black-forest-labs/FLUX.1-schnell
- Expands a short prompt ("my holiday on Lake Como") into a detailed image generation prompt
- Waits for the image
- Returns the image URL
## What else can it do?
Great for information gathering and summarization:
🗞️🗞️ Compare news websites and create a table of shared stories with links
▶️ Find content creator social profiles from YouTube videos
🛍️ Find a product's price range on Amazon
🚂 🚌 Gather public transportation travel options
## How is it built?
🏗️ Haystack → Agent execution logic
🧠 Google Gemini 2.5 Flash → Good and fast LLM with a generous free tier
🛠️ Playwright MCP server → Browser automation tools: navigate, click, type, wait...
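A condensed sketch of the wiring (import paths follow the Haystack integrations I used, but they may differ across versions, and the MCP launch command is an assumption):

```python
from haystack.components.agents import Agent
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.google_genai import GoogleGenAIChatGenerator
from haystack_integrations.tools.mcp import MCPToolset, StdioServerInfo

# The Playwright MCP server exposes browser tools (navigate, click, type, wait, ...)
browser_tools = MCPToolset(server_info=StdioServerInfo(command="npx", args=["@playwright/mcp@latest"]))

agent = Agent(
    chat_generator=GoogleGenAIChatGenerator(model="gemini-2.5-flash"),
    tools=browser_tools,
    system_prompt="You are a web browsing assistant. Use the browser tools to complete the task.",
)

result = agent.run(messages=[ChatMessage.from_user("Compare today's top stories on two news sites")])
print(result["messages"][-1].text)
```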
Even without vision capabilities, this setup can get quite far.
## Next steps
- Try a local open model
- Move from notebook to real deployment
- Incorporate vision
And you? Have you built something similar? What's in your stack?
The latest release of the Haystack OSS LLM framework adds a long-requested feature: image support!
📓 Notebooks below
This isn't just about passing images to an LLM. We built several features to enable practical multimodal use cases.
What's new?
🧠 Support for multiple LLM providers: OpenAI, Amazon Bedrock, Google Gemini, Mistral, NVIDIA, OpenRouter, Ollama and more (support for Hugging Face API coming 🔜)
🎛️ Prompt template language to handle structured inputs, including images
📄 PDF and image converters
🔍 Image embedders using CLIP-like models
🧾 LLM-based extractor to pull text from images
🧩 Components to build multimodal RAG pipelines and Agents
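For example, building an image-plus-text message looks roughly like this (sketched from the release; exact class names and methods may vary by version):

```python
from haystack.dataclasses import ChatMessage, ImageContent
from haystack.components.generators.chat import OpenAIChatGenerator

# One user message mixing text and an image
image = ImageContent.from_file_path("mystery_map.png")
message = ChatMessage.from_user(content_parts=["Where was this picture taken?", image])

llm = OpenAIChatGenerator(model="gpt-4o-mini")
reply = llm.run(messages=[message])
print(reply["replies"][0].text)
```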
I had the chance to lead this effort with @sjrhuschlee (great collab).
How do you ensure your AI application is safe from harmful or inappropriate user inputs?
This is a core requirement for real-world AI deployments. Luckily, several open Language Models are built specifically for safety moderation.
I've been exploring them and put together a hands-on tutorial using the Haystack framework to build your own AI guardrails.
In the notebook, you'll learn how to use and customize:
🔹 Meta Llama Guard (via Hugging Face API)
🔹 IBM Granite Guardian (via Ollama), which can also evaluate RAG-specific risk dimensions
🔹 Google ShieldGemma (via Ollama)
🔹 Nvidia NemoGuard model family, including a model for topic control
You'll also see how to integrate content moderation into a 🔎 RAG pipeline.
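As a taste of the pattern, a minimal input check with Llama Guard via the Hugging Face API could look like this (the model id and routing logic are illustrative; the notebook goes much further):

```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

guard = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "meta-llama/Llama-Guard-3-8B"},
    token=Secret.from_env_var("HF_TOKEN"),
)

def is_safe(user_input: str) -> bool:
    """Llama Guard replies 'safe' or 'unsafe' (plus the violated categories)."""
    verdict = guard.run(messages=[ChatMessage.from_user(user_input)])["replies"][0].text
    return verdict.strip().lower().startswith("safe")

if not is_safe("How do I pick a lock to break into a house?"):
    print("Blocked by guardrail ⛔")
```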