Lekr0 commited on
Commit
e9585fc
·
verified ·
1 Parent(s): 741f7c3

Add files using upload-large-folder tool

Browse files
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. ICL/RL/trl_source/.github/PULL_REQUEST_TEMPLATE.md +31 -0
  2. ICL/RL/trl_source/assets/logo-dark.png +0 -0
  3. ICL/RL/trl_source/examples/README.md +3 -0
  4. ICL/RL/trl_source/examples/accelerate_configs/alst_ulysses_4gpu.yaml +45 -0
  5. ICL/RL/trl_source/examples/accelerate_configs/context_parallel_2gpu.yaml +30 -0
  6. ICL/RL/trl_source/examples/accelerate_configs/deepspeed_zero1.yaml +20 -0
  7. ICL/RL/trl_source/examples/accelerate_configs/deepspeed_zero2.yaml +21 -0
  8. ICL/RL/trl_source/examples/accelerate_configs/deepspeed_zero3.yaml +22 -0
  9. ICL/RL/trl_source/examples/accelerate_configs/fsdp1.yaml +28 -0
  10. ICL/RL/trl_source/examples/accelerate_configs/fsdp2.yaml +25 -0
  11. ICL/RL/trl_source/examples/accelerate_configs/multi_gpu.yaml +16 -0
  12. ICL/RL/trl_source/examples/accelerate_configs/single_gpu.yaml +16 -0
  13. ICL/RL/trl_source/examples/cli_configs/example_config.yaml +18 -0
  14. ICL/RL/trl_source/examples/datasets/deepmath_103k.py +98 -0
  15. ICL/RL/trl_source/examples/datasets/hh-rlhf-helpful-base.py +132 -0
  16. ICL/RL/trl_source/examples/datasets/llava_instruct_mix.py +118 -0
  17. ICL/RL/trl_source/examples/datasets/lm-human-preferences-descriptiveness.py +119 -0
  18. ICL/RL/trl_source/examples/datasets/lm-human-preferences-sentiment.py +112 -0
  19. ICL/RL/trl_source/examples/datasets/math_shepherd.py +169 -0
  20. ICL/RL/trl_source/examples/datasets/prm800k.py +156 -0
  21. ICL/RL/trl_source/examples/datasets/rlaif-v.py +112 -0
  22. ICL/RL/trl_source/examples/datasets/tldr.py +104 -0
  23. ICL/RL/trl_source/examples/datasets/tldr_preference.py +110 -0
  24. ICL/RL/trl_source/examples/datasets/ultrafeedback-prompt.py +102 -0
  25. ICL/RL/trl_source/examples/datasets/ultrafeedback.py +144 -0
  26. ICL/RL/trl_source/examples/notebooks/README.md +17 -0
  27. ICL/RL/trl_source/examples/notebooks/grpo_agent.ipynb +706 -0
  28. ICL/RL/trl_source/examples/notebooks/grpo_functiongemma_browsergym_openenv.ipynb +1914 -0
  29. ICL/RL/trl_source/examples/notebooks/grpo_ministral3_vl.ipynb +740 -0
  30. ICL/RL/trl_source/examples/notebooks/grpo_qwen3_vl.ipynb +693 -0
  31. ICL/RL/trl_source/examples/notebooks/grpo_rnj_1_instruct.ipynb +622 -0
  32. ICL/RL/trl_source/examples/notebooks/grpo_trl_lora_qlora.ipynb +1638 -0
  33. ICL/RL/trl_source/examples/notebooks/openenv_sudoku_grpo.ipynb +0 -0
  34. ICL/RL/trl_source/examples/notebooks/openenv_wordle_grpo.ipynb +0 -0
  35. ICL/RL/trl_source/examples/notebooks/sft_ministral3_vl.ipynb +0 -0
  36. ICL/RL/trl_source/examples/notebooks/sft_qwen_vl.ipynb +0 -0
  37. ICL/RL/trl_source/examples/notebooks/sft_trl_lora_qlora.ipynb +1140 -0
  38. ICL/RL/trl_source/examples/scripts/bco.py +173 -0
  39. ICL/RL/trl_source/examples/scripts/cpo.py +112 -0
  40. ICL/RL/trl_source/examples/scripts/dpo.py +17 -0
  41. ICL/RL/trl_source/examples/scripts/dpo_vlm.py +151 -0
  42. ICL/RL/trl_source/examples/scripts/gkd.py +149 -0
  43. ICL/RL/trl_source/examples/scripts/grpo_agent.py +326 -0
  44. ICL/RL/trl_source/examples/scripts/grpo_vlm.py +164 -0
  45. ICL/RL/trl_source/examples/scripts/gspo.py +137 -0
  46. ICL/RL/trl_source/examples/scripts/gspo_vlm.py +153 -0
  47. ICL/RL/trl_source/examples/scripts/kto.py +112 -0
  48. ICL/RL/trl_source/examples/scripts/mpo_vlm.py +142 -0
  49. ICL/RL/trl_source/examples/scripts/nash_md.py +153 -0
  50. ICL/RL/trl_source/examples/scripts/nemo_gym/README.md +5 -0
ICL/RL/trl_source/.github/PULL_REQUEST_TEMPLATE.md ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # What does this PR do?
2
+
3
+ <!--
4
+ Congratulations! You've made it this far! You're not quite done yet though.
5
+
6
+ Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution.
7
+
8
+ Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change.
9
+
10
+ Once you're done, someone will review your PR shortly. They may suggest changes to make the code even better.
11
+ -->
12
+
13
+ <!-- Remove if not applicable -->
14
+
15
+ Fixes # (issue)
16
+
17
+
18
+ ## Before submitting
19
+ - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
20
+ - [ ] Did you read the [contributor guideline](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md#create-a-pull-request),
21
+ Pull Request section?
22
+ - [ ] Was this discussed/approved via a GitHub issue? Please add a link
23
+ to it if that's the case.
24
+ - [ ] Did you make sure to update the documentation with your changes?
25
+ - [ ] Did you write any new necessary tests?
26
+
27
+
28
+ ## Who can review?
29
+
30
+ Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
31
+ members/contributors who may be interested in your PR.
ICL/RL/trl_source/assets/logo-dark.png ADDED
ICL/RL/trl_source/examples/README.md ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ # Examples
2
+
3
+ Please check out https://huggingface.co/docs/trl/example_overview for documentation on our examples.
ICL/RL/trl_source/examples/accelerate_configs/alst_ulysses_4gpu.yaml ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ALST/Ulysses Sequence Parallelism with 2D Parallelism (DP + SP) for 4 GPUs
2
+ #
3
+ # This configuration enables 2D parallelism:
4
+ # - Sequence Parallelism (sp_size=2): Sequences split across 2 GPUs using ALST/Ulysses
5
+ # - Data Parallelism (dp_shard_size=2): Model/optimizer sharded across 2 GPUs
6
+ # - Total: 4 GPUs (2 × 2)
7
+ #
8
+ # Set parallelism_config in your training script:
9
+ # parallelism_config = ParallelismConfig(
10
+ # sp_backend="deepspeed",
11
+ # sp_size=2,
12
+ # dp_shard_size=2, # Calculated as: num_gpus // sp_size
13
+ # sp_handler=DeepSpeedSequenceParallelConfig(...)
14
+ # )
15
+
16
+ compute_environment: LOCAL_MACHINE
17
+ debug: false
18
+ deepspeed_config:
19
+ zero_stage: 3
20
+ seq_parallel_communication_data_type: bf16
21
+ offload_optimizer_device: none
22
+ offload_param_device: none
23
+ zero3_init_flag: true
24
+ zero3_save_16bit_model: true
25
+ distributed_type: DEEPSPEED
26
+ downcast_bf16: 'no'
27
+ machine_rank: 0
28
+ main_training_function: main
29
+ mixed_precision: bf16
30
+ num_machines: 1
31
+ num_processes: 4 # Total number of GPUs
32
+ rdzv_backend: static
33
+ same_network: true
34
+ tpu_env: []
35
+ tpu_use_cluster: false
36
+ tpu_use_sudo: false
37
+ use_cpu: false
38
+ parallelism_config:
39
+ parallelism_config_dp_replicate_size: 1
40
+ parallelism_config_dp_shard_size: 2 # Enables 2D parallelism with SP
41
+ parallelism_config_tp_size: 1
42
+ parallelism_config_sp_size: 2 # Sequence parallel size
43
+ parallelism_config_sp_backend: deepspeed
44
+ parallelism_config_sp_seq_length_is_variable: true
45
+ parallelism_config_sp_attn_implementation: flash_attention_2
ICL/RL/trl_source/examples/accelerate_configs/context_parallel_2gpu.yaml ADDED
@@ -0,0 +1,30 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Context Parallelism with FSDP for 2 GPUs
2
+ compute_environment: LOCAL_MACHINE
3
+ debug: false
4
+ distributed_type: FSDP
5
+ downcast_bf16: 'no'
6
+ enable_cpu_affinity: false
7
+ fsdp_config:
8
+ fsdp_activation_checkpointing: true # Enable activation checkpointing for memory efficiency
9
+ fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
10
+ fsdp_cpu_ram_efficient_loading: true
11
+ fsdp_offload_params: false
12
+ fsdp_reshard_after_forward: true
13
+ fsdp_state_dict_type: FULL_STATE_DICT
14
+ fsdp_version: 2
15
+ machine_rank: 0
16
+ main_training_function: main
17
+ mixed_precision: bf16
18
+ num_machines: 1
19
+ num_processes: 2 # Number of GPUs
20
+ rdzv_backend: static
21
+ same_network: true
22
+ tpu_env: []
23
+ tpu_use_cluster: false
24
+ tpu_use_sudo: false
25
+ use_cpu: false
26
+ parallelism_config:
27
+ parallelism_config_dp_replicate_size: 1
28
+ parallelism_config_dp_shard_size: 1
29
+ parallelism_config_tp_size: 1
30
+ parallelism_config_cp_size: 2 # Context parallel size
ICL/RL/trl_source/examples/accelerate_configs/deepspeed_zero1.yaml ADDED
@@ -0,0 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ deepspeed_config:
4
+ deepspeed_multinode_launcher: standard
5
+ gradient_accumulation_steps: 1
6
+ zero3_init_flag: false
7
+ zero_stage: 1
8
+ distributed_type: DEEPSPEED
9
+ downcast_bf16: 'no'
10
+ machine_rank: 0
11
+ main_training_function: main
12
+ mixed_precision: 'bf16'
13
+ num_machines: 1
14
+ num_processes: 8
15
+ rdzv_backend: static
16
+ same_network: true
17
+ tpu_env: []
18
+ tpu_use_cluster: false
19
+ tpu_use_sudo: false
20
+ use_cpu: false
ICL/RL/trl_source/examples/accelerate_configs/deepspeed_zero2.yaml ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ deepspeed_config:
4
+ deepspeed_multinode_launcher: standard
5
+ offload_optimizer_device: none
6
+ offload_param_device: none
7
+ zero3_init_flag: false
8
+ zero_stage: 2
9
+ distributed_type: DEEPSPEED
10
+ downcast_bf16: 'no'
11
+ machine_rank: 0
12
+ main_training_function: main
13
+ mixed_precision: 'bf16'
14
+ num_machines: 1
15
+ num_processes: 8
16
+ rdzv_backend: static
17
+ same_network: true
18
+ tpu_env: []
19
+ tpu_use_cluster: false
20
+ tpu_use_sudo: false
21
+ use_cpu: false
ICL/RL/trl_source/examples/accelerate_configs/deepspeed_zero3.yaml ADDED
@@ -0,0 +1,22 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ deepspeed_config:
4
+ deepspeed_multinode_launcher: standard
5
+ offload_optimizer_device: none
6
+ offload_param_device: none
7
+ zero3_init_flag: true
8
+ zero3_save_16bit_model: true
9
+ zero_stage: 3
10
+ distributed_type: DEEPSPEED
11
+ downcast_bf16: 'no'
12
+ machine_rank: 0
13
+ main_training_function: main
14
+ mixed_precision: bf16
15
+ num_machines: 1
16
+ num_processes: 8
17
+ rdzv_backend: static
18
+ same_network: true
19
+ tpu_env: []
20
+ tpu_use_cluster: false
21
+ tpu_use_sudo: false
22
+ use_cpu: false
ICL/RL/trl_source/examples/accelerate_configs/fsdp1.yaml ADDED
@@ -0,0 +1,28 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ distributed_type: FSDP
4
+ downcast_bf16: 'no'
5
+ enable_cpu_affinity: false
6
+ fsdp_config:
7
+ fsdp_activation_checkpointing: false
8
+ fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
9
+ fsdp_backward_prefetch: BACKWARD_PRE
10
+ fsdp_cpu_ram_efficient_loading: true
11
+ fsdp_forward_prefetch: true
12
+ fsdp_offload_params: false
13
+ fsdp_reshard_after_forward: FULL_SHARD
14
+ fsdp_state_dict_type: FULL_STATE_DICT
15
+ fsdp_sync_module_states: true
16
+ fsdp_use_orig_params: true
17
+ fsdp_version: 1
18
+ machine_rank: 0
19
+ main_training_function: main
20
+ mixed_precision: bf16
21
+ num_machines: 1
22
+ num_processes: 8
23
+ rdzv_backend: static
24
+ same_network: true
25
+ tpu_env: []
26
+ tpu_use_cluster: false
27
+ tpu_use_sudo: false
28
+ use_cpu: false
ICL/RL/trl_source/examples/accelerate_configs/fsdp2.yaml ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Requires accelerate 1.7.0 or higher
2
+ compute_environment: LOCAL_MACHINE
3
+ debug: false
4
+ distributed_type: FSDP
5
+ downcast_bf16: 'no'
6
+ enable_cpu_affinity: false
7
+ fsdp_config:
8
+ fsdp_activation_checkpointing: false
9
+ fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
10
+ fsdp_cpu_ram_efficient_loading: true
11
+ fsdp_offload_params: false
12
+ fsdp_reshard_after_forward: true
13
+ fsdp_state_dict_type: FULL_STATE_DICT
14
+ fsdp_version: 2
15
+ machine_rank: 0
16
+ main_training_function: main
17
+ mixed_precision: bf16
18
+ num_machines: 1
19
+ num_processes: 8
20
+ rdzv_backend: static
21
+ same_network: true
22
+ tpu_env: []
23
+ tpu_use_cluster: false
24
+ tpu_use_sudo: false
25
+ use_cpu: false
ICL/RL/trl_source/examples/accelerate_configs/multi_gpu.yaml ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ distributed_type: MULTI_GPU
4
+ downcast_bf16: 'no'
5
+ gpu_ids: all
6
+ machine_rank: 0
7
+ main_training_function: main
8
+ mixed_precision: 'bf16'
9
+ num_machines: 1
10
+ num_processes: 8
11
+ rdzv_backend: static
12
+ same_network: true
13
+ tpu_env: []
14
+ tpu_use_cluster: false
15
+ tpu_use_sudo: false
16
+ use_cpu: false
ICL/RL/trl_source/examples/accelerate_configs/single_gpu.yaml ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ compute_environment: LOCAL_MACHINE
2
+ debug: false
3
+ distributed_type: "NO"
4
+ downcast_bf16: 'no'
5
+ gpu_ids: all
6
+ machine_rank: 0
7
+ main_training_function: main
8
+ mixed_precision: 'bf16'
9
+ num_machines: 1
10
+ num_processes: 1
11
+ rdzv_backend: static
12
+ same_network: true
13
+ tpu_env: []
14
+ tpu_use_cluster: false
15
+ tpu_use_sudo: false
16
+ use_cpu: false
ICL/RL/trl_source/examples/cli_configs/example_config.yaml ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # This is an example configuration file of TRL CLI, you can use it for
2
+ # SFT like that: `trl sft --config config.yaml --output_dir test-sft`
3
+ # The YAML file supports environment variables by adding an `env` field
4
+ # as below
5
+
6
+ # env:
7
+ # CUDA_VISIBLE_DEVICES: 0
8
+
9
+ model_name_or_path:
10
+ Qwen/Qwen2.5-0.5B
11
+ dataset_name:
12
+ stanfordnlp/imdb
13
+ report_to:
14
+ none
15
+ learning_rate:
16
+ 0.0001
17
+ lr_scheduler_type:
18
+ cosine
ICL/RL/trl_source/examples/datasets/deepmath_103k.py ADDED
@@ -0,0 +1,98 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from dataclasses import dataclass, field
16
+
17
+ from datasets import load_dataset
18
+ from huggingface_hub import ModelCard
19
+ from transformers import HfArgumentParser
20
+
21
+
22
+ @dataclass
23
+ class ScriptArguments:
24
+ r"""
25
+ Arguments for the script.
26
+
27
+ Args:
28
+ push_to_hub (`bool`, *optional*, defaults to `False`):
29
+ Whether to push the dataset to the Hugging Face Hub.
30
+ repo_id (`str`, *optional*, defaults to `"trl-lib/DeepMath-103K"`):
31
+ Hugging Face repository ID to push the dataset to.
32
+ dataset_num_proc (`int`, *optional*):
33
+ Number of workers to use for dataset processing.
34
+ """
35
+
36
+ push_to_hub: bool = field(
37
+ default=False,
38
+ metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
39
+ )
40
+ repo_id: str = field(
41
+ default="trl-lib/DeepMath-103K",
42
+ metadata={"help": "Hugging Face repository ID to push the dataset to."},
43
+ )
44
+ dataset_num_proc: int | None = field(
45
+ default=None,
46
+ metadata={"help": "Number of workers to use for dataset processing."},
47
+ )
48
+
49
+
50
+ def process_example(example):
51
+ solution = example["final_answer"]
52
+ if solution not in ["True", "False", "Yes", "No"]:
53
+ solution = f"${solution}$"
54
+ prompt = [{"role": "user", "content": example["question"]}]
55
+ return {"prompt": prompt, "solution": solution}
56
+
57
+
58
+ model_card = ModelCard("""
59
+ ---
60
+ tags: [trl]
61
+ ---
62
+
63
+ # DeepMath-103K Dataset
64
+
65
+ ## Summary
66
+
67
+ [DeepMath-103K](https://huggingface.co/datasets/zwhe99/DeepMath-103K) is meticulously curated to push the boundaries of mathematical reasoning in language models.
68
+
69
+ ## Data Structure
70
+
71
+ - **Format**: [Conversational](https://huggingface.co/docs/trl/main/dataset_formats#conversational)
72
+ - **Type**: [Prompt-only](https://huggingface.co/docs/trl/main/dataset_formats#prompt-only)
73
+
74
+ Column:
75
+ - `"prompt"`: The input question.
76
+ - `"solution"`: The solution to the math problem.
77
+
78
+ ## Generation script
79
+
80
+ The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/deepmath_103k.py).
81
+ """)
82
+
83
+ if __name__ == "__main__":
84
+ parser = HfArgumentParser(ScriptArguments)
85
+ script_args = parser.parse_args_into_dataclasses()[0]
86
+
87
+ dataset = load_dataset("zwhe99/DeepMath-103K", split="train")
88
+
89
+ dataset = dataset.map(
90
+ process_example,
91
+ remove_columns=dataset.column_names,
92
+ num_proc=script_args.dataset_num_proc,
93
+ )
94
+ dataset = dataset.train_test_split(test_size=0.05, seed=42)
95
+
96
+ if script_args.push_to_hub:
97
+ dataset.push_to_hub(script_args.repo_id)
98
+ model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
ICL/RL/trl_source/examples/datasets/hh-rlhf-helpful-base.py ADDED
@@ -0,0 +1,132 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ import re
16
+ from dataclasses import dataclass, field
17
+
18
+ from datasets import load_dataset
19
+ from huggingface_hub import ModelCard
20
+ from transformers import HfArgumentParser
21
+
22
+
23
+ @dataclass
24
+ class ScriptArguments:
25
+ r"""
26
+ Arguments for the script.
27
+
28
+ Args:
29
+ push_to_hub (`bool`, *optional*, defaults to `False`):
30
+ Whether to push the dataset to the Hugging Face Hub.
31
+ repo_id (`str`, *optional*, defaults to `"trl-lib/hh-rlhf-helpful-base"`):
32
+ Hugging Face repository ID to push the dataset to.
33
+ dataset_num_proc (`int`, *optional*):
34
+ Number of workers to use for dataset processing.
35
+ """
36
+
37
+ push_to_hub: bool = field(
38
+ default=False,
39
+ metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
40
+ )
41
+ repo_id: str = field(
42
+ default="trl-lib/hh-rlhf-helpful-base", metadata={"help": "Hugging Face repository ID to push the dataset to."}
43
+ )
44
+ dataset_num_proc: int | None = field(
45
+ default=None, metadata={"help": "Number of workers to use for dataset processing."}
46
+ )
47
+
48
+
49
+ def common_start(str1: str, str2: str) -> str:
50
+ # Zip the two strings and iterate over them together
51
+ common_chars = []
52
+ for c1, c2 in zip(str1, str2, strict=True):
53
+ if c1 == c2:
54
+ common_chars.append(c1)
55
+ else:
56
+ break
57
+ # Join the common characters and return as a string
58
+ return "".join(common_chars)
59
+
60
+
61
+ def extract_dialogue(example: str) -> list[dict[str, str]]:
62
+ # Extract the prompt, which corresponds to the common start of the chosen and rejected dialogues
63
+ prompt_text = common_start(example["chosen"], example["rejected"])
64
+
65
+ # The chosen and rejected may share a common start, so we need to remove the common part
66
+ if not prompt_text.endswith("\n\nAssistant: "):
67
+ prompt_text = prompt_text[: prompt_text.rfind("\n\nAssistant: ")] + "\n\nAssistant: "
68
+
69
+ # Extract the chosen and rejected lines
70
+ chosen_line = example["chosen"][len(prompt_text) :]
71
+ rejected_line = example["rejected"][len(prompt_text) :]
72
+
73
+ # Remove the generation prompt ("\n\nAssistant: ") from the prompt
74
+ prompt_text = prompt_text[: -len("\n\nAssistant: ")]
75
+
76
+ # Split the string at every occurrence of "Human: " or "Assistant: "
77
+ prompt_lines = re.split(r"(\n\nAssistant: |\n\nHuman: )", prompt_text)
78
+
79
+ # Remove the first element as it's empty
80
+ prompt_lines = prompt_lines[1:]
81
+
82
+ prompt = []
83
+ for idx in range(0, len(prompt_lines), 2):
84
+ role = "user" if prompt_lines[idx] == "\n\nHuman: " else "assistant"
85
+ content = prompt_lines[idx + 1]
86
+ prompt.append({"role": role, "content": content})
87
+
88
+ # Remove the prompt from the chosen and rejected dialogues
89
+ chosen = [{"role": "assistant", "content": chosen_line}]
90
+ rejected = [{"role": "assistant", "content": rejected_line}]
91
+
92
+ return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
93
+
94
+
95
+ model_card = ModelCard("""
96
+ ---
97
+ tags: [trl]
98
+ ---
99
+
100
+ # HH-RLHF-Helpful-Base Dataset
101
+
102
+ ## Summary
103
+
104
+ The HH-RLHF-Helpful-Base dataset is a processed version of [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset, specifically curated to train models using the [TRL library](https://github.com/huggingface/trl) for preference learning and alignment tasks. It contains pairs of text samples, each labeled as either "chosen" or "rejected," based on human preferences regarding the helpfulness of the responses. This dataset enables models to learn human preferences in generating helpful responses, enhancing their ability to assist users effectively.
105
+
106
+ ## Data Structure
107
+
108
+ - **Format**: [Conversational](https://huggingface.co/docs/trl/main/dataset_formats#conversational)
109
+ - **Type**: [Preference](https://huggingface.co/docs/trl/main/dataset_formats#preference)
110
+
111
+ Columns:
112
+ - `"prompt"`: The user query.
113
+ - `"chosen"`: A response deemed helpful by human evaluators.
114
+ - `"rejected"`: A response considered less helpful or unhelpful.
115
+
116
+ This structure allows models to learn to prefer the _chosen_ response over the _rejected_ one, thereby aligning with human preferences in helpfulness.
117
+
118
+ ## Generation script
119
+
120
+ The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/hh-rlhf-helpful-base.py).
121
+ """)
122
+
123
+ if __name__ == "__main__":
124
+ parser = HfArgumentParser(ScriptArguments)
125
+ script_args = parser.parse_args_into_dataclasses()[0]
126
+
127
+ dataset = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base")
128
+ dataset = dataset.map(extract_dialogue, num_proc=script_args.dataset_num_proc)
129
+
130
+ if script_args.push_to_hub:
131
+ dataset.push_to_hub(script_args.repo_id)
132
+ model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
ICL/RL/trl_source/examples/datasets/llava_instruct_mix.py ADDED
@@ -0,0 +1,118 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ import ast
16
+ from dataclasses import dataclass, field
17
+
18
+ from datasets import load_dataset
19
+ from huggingface_hub import ModelCard
20
+ from transformers import HfArgumentParser
21
+
22
+
23
+ @dataclass
24
+ class ScriptArguments:
25
+ r"""
26
+ Arguments for the script.
27
+
28
+ Args:
29
+ push_to_hub (`bool`, *optional*, defaults to `False`):
30
+ Whether to push the dataset to the Hugging Face Hub.
31
+ repo_id (`str`, *optional*, defaults to `"trl-lib/llava-instruct-mix"`):
32
+ Hugging Face repository ID to push the dataset to.
33
+ dataset_num_proc (`int`, *optional*):
34
+ Number of workers to use for dataset processing.
35
+ """
36
+
37
+ push_to_hub: bool = field(
38
+ default=False,
39
+ metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
40
+ )
41
+ repo_id: str = field(
42
+ default="trl-lib/llava-instruct-mix",
43
+ metadata={"help": "Hugging Face repository ID to push the dataset to."},
44
+ )
45
+ dataset_num_proc: int | None = field(
46
+ default=None,
47
+ metadata={"help": "Number of workers to use for dataset processing."},
48
+ )
49
+
50
+
51
+ def process_example(example):
52
+ messages = []
53
+ for message in ast.literal_eval(example["conversations"]):
54
+ content = message["value"]
55
+ content = content.replace("<image>", "").strip()
56
+ role = "user" if message["from"] == "human" else "assistant"
57
+ messages.append({"role": role, "content": content})
58
+ return {"messages": messages, "images": [example["image"]]}
59
+
60
+
61
+ def filter_long_examples(example):
62
+ total_length = sum(len(msg["content"]) for msg in example["messages"])
63
+ return total_length <= 1000
64
+
65
+
66
+ def split_prompt_completion(example):
67
+ """
68
+ Splits the messages into a prompt and a completion. The last message is considered the completion.
69
+ """
70
+ assert len(example["messages"]) > 1
71
+ example["prompt"] = example["messages"][:-1]
72
+ example["completion"] = example["messages"][-1:]
73
+ return example
74
+
75
+
76
+ model_card = ModelCard("""
77
+ ---
78
+ tags: [trl]
79
+ ---
80
+
81
+ # LLaVA Instruct Mix
82
+
83
+ ## Summary
84
+
85
+ The LLaVA Instruct Mix dataset is a processed version of [LLaVA Instruct Mix](https://huggingface.co/datasets/theblackcat102/llava-instruct-mix).
86
+
87
+ ## Data Structure
88
+
89
+ - **Format**: [Conversational](https://huggingface.co/docs/trl/main/dataset_formats#conversational)
90
+ - **Type**: [Language-modeling](https://huggingface.co/docs/trl/main/dataset_formats#language-modeling)
91
+
92
+ Columns:
93
+ - `"images"`: The image associated with the text.
94
+ - `"prompt"`: A list of messages that form the context for the conversation.
95
+ - `"completion"`: The last message in the conversation, which is the model's response.
96
+
97
+ This structure allows models to learn from the context of the conversation, enhancing their understanding of how to generate descriptive text based on visual inputs.
98
+
99
+ ## Generation script
100
+
101
+ The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/llava_instruct_mix.py).
102
+ """)
103
+
104
+ if __name__ == "__main__":
105
+ parser = HfArgumentParser(ScriptArguments)
106
+ script_args = parser.parse_args_into_dataclasses()[0]
107
+
108
+ dataset = load_dataset("theblackcat102/llava-instruct-mix", split="train", num_proc=script_args.dataset_num_proc)
109
+
110
+ dataset = dataset.map(
111
+ process_example, remove_columns=["conversations", "image"], num_proc=script_args.dataset_num_proc
112
+ )
113
+ dataset = dataset.filter(filter_long_examples, num_proc=script_args.dataset_num_proc)
114
+ dataset = dataset.map(split_prompt_completion, remove_columns=["messages"], num_proc=script_args.dataset_num_proc)
115
+
116
+ if script_args.push_to_hub:
117
+ dataset.push_to_hub(script_args.repo_id, num_proc=script_args.dataset_num_proc)
118
+ model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
ICL/RL/trl_source/examples/datasets/lm-human-preferences-descriptiveness.py ADDED
@@ -0,0 +1,119 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from dataclasses import dataclass, field
16
+
17
+ from datasets import load_dataset
18
+ from huggingface_hub import ModelCard
19
+ from transformers import AutoTokenizer, HfArgumentParser
20
+
21
+
22
+ @dataclass
23
+ class ScriptArguments:
24
+ r"""
25
+ Arguments for the script.
26
+
27
+ Args:
28
+ push_to_hub (`bool`, *optional*, defaults to `False`):
29
+ Whether to push the dataset to the Hugging Face Hub.
30
+ repo_id (`str`, *optional*, defaults to `"trl-lib/lm-human-preferences-descriptiveness"`):
31
+ Hugging Face repository ID to push the dataset to.
32
+ dataset_num_proc (`int`, *optional*):
33
+ Number of workers to use for dataset processing.
34
+ """
35
+
36
+ push_to_hub: bool = field(
37
+ default=False,
38
+ metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
39
+ )
40
+ repo_id: str = field(
41
+ default="trl-lib/lm-human-preferences-descriptiveness",
42
+ metadata={"help": "Hugging Face repository ID to push the dataset to."},
43
+ )
44
+ dataset_num_proc: int | None = field(
45
+ default=None,
46
+ metadata={"help": "Number of workers to use for dataset processing."},
47
+ )
48
+
49
+
50
+ # Edge cases handling: remove the cases where all samples are the same
51
+ def samples_not_all_same(example):
52
+ return not all(example["sample0"] == example[f"sample{j}"] for j in range(1, 4))
53
+
54
+
55
+ def to_prompt_completion(example, tokenizer):
56
+ prompt = tokenizer.decode(example["query"]).strip()
57
+ best_idx = example["best"]
58
+ chosen = tokenizer.decode(example[f"sample{best_idx}"])
59
+ for rejected_idx in range(4): # take the first rejected sample that is different from the chosen one
60
+ rejected = tokenizer.decode(example[f"sample{rejected_idx}"])
61
+ if chosen != rejected:
62
+ break
63
+ assert chosen != rejected
64
+ return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
65
+
66
+
67
+ model_card = ModelCard("""
68
+ ---
69
+ tags: [trl]
70
+ ---
71
+
72
+ # LM-Human-Preferences-Descriptiveness Dataset
73
+
74
+ ## Summary
75
+
76
+ The LM-Human-Preferences-Descriptiveness dataset is a processed subset of [OpenAI's LM-Human-Preferences](https://github.com/openai/lm-human-preferences), focusing specifically on enhancing the descriptiveness of generated text. It contains pairs of text samples, each labeled as either "chosen" or "rejected," based on human preferences regarding the level of detail and vividness in the descriptions. This dataset enables models to learn human preferences in descriptive language, improving their ability to generate rich and engaging narratives.
77
+
78
+ ## Data Structure
79
+
80
+ - **Format**: [Standard](https://huggingface.co/docs/trl/main/dataset_formats#standard)
81
+ - **Type**: [Preference](https://huggingface.co/docs/trl/main/dataset_formats#preference)
82
+
83
+ Columns:
84
+ - `"prompt"`: The text sample.
85
+ - `"chosen"`: A version of the text with enhanced descriptiveness.
86
+ - `"rejected"`: A version of the text with less descriptiveness.
87
+
88
+ This structure allows models to learn to prefer the _chosen_ response over the _rejected_ one, thereby aligning with human preferences in descriptive language.
89
+
90
+ ## Generation script
91
+
92
+ The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/lm-human-preferences-descriptiveness.py).
93
+ """)
94
+
95
+ if __name__ == "__main__":
96
+ parser = HfArgumentParser(ScriptArguments)
97
+ script_args = parser.parse_args_into_dataclasses()[0]
98
+
99
+ dataset = load_dataset(
100
+ "json",
101
+ data_files="https://openaipublic.blob.core.windows.net/lm-human-preferences/labels/descriptiveness/offline_5k.json",
102
+ split="train",
103
+ )
104
+
105
+ dataset = dataset.filter(samples_not_all_same, num_proc=script_args.dataset_num_proc)
106
+
107
+ dataset = dataset.map(
108
+ to_prompt_completion,
109
+ num_proc=script_args.dataset_num_proc,
110
+ remove_columns=["query", "sample0", "sample1", "sample2", "sample3", "best"],
111
+ fn_kwargs={"tokenizer": AutoTokenizer.from_pretrained("gpt2")},
112
+ )
113
+
114
+ # train_size taken from https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/launch.py#L79)
115
+ dataset = dataset.train_test_split(train_size=4992)
116
+
117
+ if script_args.push_to_hub:
118
+ dataset.push_to_hub(script_args.repo_id)
119
+ model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
ICL/RL/trl_source/examples/datasets/lm-human-preferences-sentiment.py ADDED
@@ -0,0 +1,112 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from dataclasses import dataclass, field
16
+
17
+ from datasets import load_dataset
18
+ from huggingface_hub import ModelCard
19
+ from transformers import AutoTokenizer, HfArgumentParser
20
+
21
+
22
# Command-line options, parsed by `HfArgumentParser` in the `__main__` block below;
# each field's `metadata["help"]` becomes the `--help` text for the matching flag.
# NOTE: the `int | None` annotation requires Python 3.10+.
@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/lm-human-preferences-sentiment"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`int`, *optional*):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = field(
        default=False,
        metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
    )
    repo_id: str = field(
        default="trl-lib/lm-human-preferences-sentiment",
        metadata={"help": "Hugging Face repository ID to push the dataset to."},
    )
    dataset_num_proc: int | None = field(
        default=None,
        metadata={"help": "Number of workers to use for dataset processing."},
    )
48
+
49
+
50
def to_prompt_completion(example, tokenizer):
    """
    Convert one raw LM-Human-Preferences record into a preference triplet.

    Args:
        example (`dict`): Record with token-id fields `"query"` and `"sample0"`..`"sample3"`,
            plus `"best"`, the index of the human-preferred sample.
        tokenizer: Tokenizer whose `decode` maps a list of token ids back to text.

    Returns:
        `dict`: `{"prompt", "chosen", "rejected"}` with decoded text, where `"rejected"` is
        the first sample whose text differs from the chosen one.

    Raises:
        ValueError: If all four samples decode to the same text, so no rejected
            sample can be selected.
    """
    prompt = tokenizer.decode(example["query"]).strip()
    chosen = tokenizer.decode(example[f"sample{example['best']}"])

    # Take the first sample that differs from the chosen one as the rejected side.
    rejected = None
    for candidate_idx in range(4):
        candidate = tokenizer.decode(example[f"sample{candidate_idx}"])
        if candidate != chosen:
            rejected = candidate
            break
    if rejected is None:
        # Previously a bare `assert`, which is stripped under `python -O`; unlike the
        # descriptiveness script, this file has no all-samples-identical filter, so fail loudly.
        raise ValueError("All four samples decode to the same text; cannot build a preference pair.")

    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
60
+
61
+
62
# Dataset card (README) uploaded alongside the dataset when --push_to_hub is set.
model_card = ModelCard("""
---
tags: [trl]
---

# LM-Human-Preferences-Sentiment Dataset

## Summary

The LM-Human-Preferences-Sentiment dataset is a processed subset of [OpenAI's LM-Human-Preferences](https://github.com/openai/lm-human-preferences), focusing specifically on sentiment analysis tasks. It contains pairs of text samples, each labeled as either "chosen" or "rejected," based on human preferences regarding the sentiment conveyed in the text. This dataset enables models to learn human preferences in sentiment expression, enhancing their ability to generate and evaluate text with desired emotional tones.

## Data Structure

- **Format**: [Standard](https://huggingface.co/docs/trl/main/dataset_formats#standard)
- **Type**: [Preference](https://huggingface.co/docs/trl/main/dataset_formats#preference)

Columns:
- `"prompt"`: The text sample.
- `"chosen"`: A version of the text that conveys the desired sentiment.
- `"rejected"`: A version of the text that does not convey the desired sentiment.

This structure allows models to learn to prefer the _chosen_ response over the _rejected_ one, thereby aligning with human preferences in sentiment expression.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/lm-human-preferences-sentiment.py).
""")

if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    script_args = parser.parse_args_into_dataclasses()[0]

    # Raw labels file: GPT-2 token ids ("query", "sample0".."sample3") plus "best",
    # the index of the human-preferred sample.
    dataset = load_dataset(
        "json",
        data_files="https://openaipublic.blob.core.windows.net/lm-human-preferences/labels/sentiment/offline_5k.json",
        split="train",
    )

    # Decode token ids to text and reshape into {"prompt", "chosen", "rejected"} columns.
    dataset = dataset.map(
        to_prompt_completion,
        num_proc=script_args.dataset_num_proc,
        remove_columns=["query", "sample0", "sample1", "sample2", "sample3", "best"],
        fn_kwargs={"tokenizer": AutoTokenizer.from_pretrained("gpt2")},
    )

    # train_size taken from https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/launch.py#L70)
    dataset = dataset.train_test_split(train_size=4992)

    # Upload the dataset and its card only when explicitly requested via --push_to_hub.
    if script_args.push_to_hub:
        dataset.push_to_hub(script_args.repo_id)
        model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
ICL/RL/trl_source/examples/datasets/math_shepherd.py ADDED
@@ -0,0 +1,169 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ import re
16
+ from dataclasses import dataclass, field
17
+ from itertools import chain
18
+
19
+ from datasets import load_dataset
20
+ from huggingface_hub import ModelCard
21
+ from transformers import HfArgumentParser
22
+
23
+
24
# Command-line options, parsed by `HfArgumentParser` in the `__main__` block below;
# each field's `metadata["help"]` becomes the `--help` text for the matching flag.
@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/math_shepherd"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`int`, *optional*):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = field(
        default=False,
        metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
    )
    repo_id: str = field(
        default="trl-lib/math_shepherd",
        metadata={"help": "Hugging Face repository ID to push the dataset to."},
    )
    dataset_num_proc: int | None = field(
        default=None,
        metadata={"help": "Number of workers to use for dataset processing."},
    )
50
+
51
+
52
def process_example(example):
    """
    Convert one raw Math-Shepherd record into a stepwise-supervision example.

    The raw `example["input"]` marks the end of each reasoning step with the marker
    "ки", and `example["label"]` is the same text with a "+" or "-" verdict character
    at each marker position.

    Returns:
        `dict` with:
        - `"prompt"`: the problem statement,
        - `"completions"`: the list of reasoning steps,
        - `"labels"`: one boolean per step (`True` where the verdict is "+").
    """
    # Replace "ки" with "ⶻ" so that the size of the "input" matches the size of the "label"
    inputs = example["input"].replace("ки", "ⶻ")

    # Find the indices of the "ⶻ" characters (that should match with the indexes of the "+" or "-" in the label)
    indexes = [m.start() for m in re.finditer("ⶻ", inputs)]

    # Sanity that all indexes are either "+" or "-"
    assert all(example["label"][idx] in ["+", "-"] for idx in indexes)

    # Get the labels
    labels = [example["label"][idx] == "+" for idx in indexes]

    # Split the inputs into steps (caution, the first step is missing here, it is the prompt)
    # NOTE: `zip(..., strict=True)` requires Python 3.10+.
    steps = [inputs[i:j] for i, j in zip(chain([0], indexes), chain(indexes, [None]), strict=True)]

    # Remove the last step (single ⶻ)
    steps = steps[:-1]

    # Get the prompt (first part) and completions (rest)
    prompt = steps[0]
    completions = steps[1:]

    # Remove the heading "ⶻ" and the final whitespace from the completions
    assert all(completion.startswith("ⶻ") for completion in completions)
    completions = [completion[1:].strip() for completion in completions]

    # At this point, we need to retrieve the first step from the prompt.
    # First, we handle particular cases (annotation error) where we have a first label before the end of the prompt.
    if prompt.startswith(
        (
            "Mr. Rocky",
            "Parker",
            "What is the smallest positive",
            " The Myth",
            "Let $\\mathbf{a}$",
            "Find the arithmetic",
            "Determine an ordered pair",
            "Determine the ordered pair",
            "At the Quill and Scroll stationery",
            "Round to the nearest",
            r"Calculate $\sqrt{10p}",
            r"Simplify $\sqrt{28x}",
        )
    ):
        # Some spotted datasets errors where there is an annotation in the prompt: we remove it
        labels = labels[1:]

    # Then we handle the general case: we get the first step from the prompt by looking for "Step 1:" or "step 1:" or
    # (less common) "?".
    elif "Step 1:" in prompt:
        prompt, first_step = prompt.split("Step 1:")
        first_step = "Step 1:" + first_step
        completions = [first_step.strip()] + completions
    elif "step 1:" in prompt:
        prompt, first_step = prompt.split("step 1:")
        first_step = "step 1:" + first_step
        completions = [first_step.strip()] + completions
    elif "?" in prompt:
        prompt, first_step = prompt.split("?")
        prompt = prompt + "?"
        completions = [first_step.strip()] + completions
    else:
        raise ValueError(f"Prompt can't be processed: {prompt}")

    # Strip the prompt
    prompt = prompt.strip()

    # Sanity check that the length of the completions is the same as the length of the labels
    assert len(completions) == len(labels)

    return {"prompt": prompt, "completions": completions, "labels": labels}
124
+
125
+
126
# Dataset card (README) uploaded alongside the dataset when --push_to_hub is set.
# NOTE(review): the "[Math-Shepherd dataset](peiyi9979/Math-Shepherd)" link below is a bare
# repo id, not a full URL — consider https://huggingface.co/datasets/peiyi9979/Math-Shepherd.
model_card = ModelCard("""
---
tags: [trl]
---

# Math-Shepherd Dataset

## Summary

The Math-Shepherd dataset is a processed version of [Math-Shepherd dataset](peiyi9979/Math-Shepherd), designed to train models using the [TRL library](https://github.com/huggingface/trl) for stepwise supervision tasks. It provides step-by-step solutions to mathematical problems, enabling models to learn and verify each step of a solution, thereby enhancing their reasoning capabilities.

## Data Structure

- **Format**: [Standard](https://huggingface.co/docs/trl/main/dataset_formats#standard)
- **Type**: [Stepwise supervision](https://huggingface.co/docs/trl/main/dataset_formats#stepwise-supervision)

Columns:
- `"prompt"`: The problem statement.
- `"completions"`: A list of reasoning steps generated to solve the problem.
- `"labels"`: A list of booleans or floats indicating the correctness of each corresponding reasoning step.

This structure allows models to learn the correctness of each step in a solution, facilitating improved reasoning and problem-solving abilities.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/math_shepherd.py).
""")

if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    script_args = parser.parse_args_into_dataclasses()[0]

    dataset = load_dataset("peiyi9979/Math-Shepherd", split="train")

    # Reshape each raw record into {"prompt", "completions", "labels"} columns.
    dataset = dataset.map(
        process_example,
        remove_columns=["input", "label", "task"],
        num_proc=script_args.dataset_num_proc,
    )
    # Fixed seed keeps the train/test split reproducible across runs.
    dataset = dataset.train_test_split(test_size=0.05, seed=42)

    # Upload the dataset and its card only when explicitly requested via --push_to_hub.
    if script_args.push_to_hub:
        dataset.push_to_hub(script_args.repo_id)
        model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
ICL/RL/trl_source/examples/datasets/prm800k.py ADDED
@@ -0,0 +1,156 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from dataclasses import dataclass, field
16
+
17
+ from datasets import load_dataset
18
+ from huggingface_hub import ModelCard
19
+ from transformers import HfArgumentParser
20
+
21
+
22
# Command-line options, parsed by `HfArgumentParser` in the `__main__` block below;
# each field's `metadata["help"]` becomes the `--help` text for the matching flag.
@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/prm800k"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`int`, *optional*):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = field(
        default=False,
        metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
    )
    repo_id: str = field(
        default="trl-lib/prm800k",
        metadata={"help": "Hugging Face repository ID to push the dataset to."},
    )
    dataset_num_proc: int | None = field(
        default=None,
        metadata={"help": "Number of workers to use for dataset processing."},
    )
48
+
49
+
50
def process_example(example):
    """
    Expand one raw PRM800K annotation record into multiple stepwise-supervision rows.

    Walks the annotated solution tree step by step, keeping a running prefix of
    chosen completions. At each step, every *non-chosen* completion terminates one
    output row (prefix + that completion); the chosen (or human-written) completion
    extends the prefix. The final prefix itself yields one last row.

    Returns:
        `list[dict]`: rows of the form `{"prompt", "completions", "labels"}`, where
        `labels[i]` is `True` iff the corresponding completion's rating is 1 (human
        completions are always labeled `True`).
    """
    outputs = []
    prompt = example["question"]["problem"]

    # Iterate through each step
    previous_completions = []
    previous_labels = []
    for step in example["label"]["steps"]:
        if step["completions"] is None and step["human_completion"] is None and step["chosen_completion"] is None:
            # happens sometimes
            break
        # Loop through completions
        # NOTE(review): assumes `completions` is a list whenever `human_completion` is set;
        # `enumerate(None)` would raise TypeError otherwise — confirm against the raw data.
        for completion_idx, completion in enumerate(step["completions"]):
            # For every completion that are not chosen, we are in a terminal state, so we can add it to the list of outputs.
            if completion_idx != step["chosen_completion"]:
                content = completion["text"]
                completions = previous_completions[:] + [content]
                label = completion["rating"] == 1
                labels = previous_labels[:] + [label]
                outputs.append({"prompt": prompt, "completions": completions, "labels": labels})

        # Now, expand the previous completions and labels
        if step["chosen_completion"] is not None:
            chosen_completion = step["completions"][step["chosen_completion"]]
            label = chosen_completion["rating"] == 1
        elif step["human_completion"] is not None:
            chosen_completion = step["human_completion"]
            label = True
        else:
            break
        content = chosen_completion["text"]
        previous_completions.append(content)
        previous_labels.append(label)

    # Last step: we are in a terminal state, so we can add it to the list of outputs
    outputs.append({"prompt": prompt, "completions": previous_completions, "labels": previous_labels})
    return outputs
87
+
88
+
89
def process_batch(examples):
    """
    Batched wrapper around `process_example` for `datasets.Dataset.map(batched=True)`.

    `examples` arrives in columnar form (a dict of equal-length lists). Each row is
    re-assembled into a plain dict, expanded into one or more output rows by
    `process_example`, and the collected rows are transposed back to columnar form.
    """
    rows = []
    num_rows = len(examples["label"])
    for row_idx in range(num_rows):
        row = {column: values[row_idx] for column, values in examples.items()}
        rows.extend(process_example(row))
    # Transpose: list of row dicts -> dict of column lists (keys taken from the first row).
    return {column: [row[column] for row in rows] for column in rows[0]}
98
+
99
+
100
# Dataset card (README) uploaded alongside the dataset when --push_to_hub is set.
model_card = ModelCard("""
---
tags: [trl]
---

# PRM800K Dataset

## Summary

The PRM800K dataset is a processed version of [OpenAI's PRM800K](https://github.com/openai/prm800k), designed to train models using the [TRL library](https://github.com/huggingface/trl) for stepwise supervision tasks. It contains 800,000 step-level correctness labels for model-generated solutions to problems from the MATH dataset. This dataset enables models to learn and verify each step of a solution, enhancing their reasoning capabilities.

## Data Structure

- **Format**: [Standard](https://huggingface.co/docs/trl/main/dataset_formats#standard)
- **Type**: [Stepwise supervision](https://huggingface.co/docs/trl/main/dataset_formats#stepwise-supervision)

Columns:
- `"prompt"`: The problem statement.
- `"completions"`: A list of reasoning steps generated to solve the problem.
- `"labels"`: A list of booleans or floats indicating the correctness of each corresponding reasoning step.

This structure allows models to learn the correctness of each step in a solution, facilitating improved reasoning and problem-solving abilities.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/prm800k.py).
""")

if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    script_args = parser.parse_args_into_dataclasses()[0]

    # Phase-1 annotation files fetched directly from the upstream GitHub repository.
    data_files = {
        "train": "https://github.com/openai/prm800k/raw/refs/heads/main/prm800k/data/phase1_train.jsonl",
        "test": "https://github.com/openai/prm800k/raw/refs/heads/main/prm800k/data/phase1_test.jsonl",
    }
    dataset = load_dataset("json", data_files=data_files)

    # `batched=True` is required because `process_batch` may emit more rows than it receives
    # (each record expands into several stepwise-supervision rows).
    dataset = dataset.map(
        process_batch,
        batched=True,
        batch_size=10,
        remove_columns=[
            "labeler",
            "timestamp",
            "generation",
            "is_quality_control_question",
            "is_initial_screening_question",
            "question",
            "label",
        ],
        num_proc=script_args.dataset_num_proc,
    )

    # Upload the dataset and its card only when explicitly requested via --push_to_hub.
    if script_args.push_to_hub:
        dataset.push_to_hub(script_args.repo_id)
        model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
ICL/RL/trl_source/examples/datasets/rlaif-v.py ADDED
@@ -0,0 +1,112 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from dataclasses import dataclass, field
16
+
17
+ from datasets import features, load_dataset
18
+ from huggingface_hub import ModelCard
19
+ from transformers import HfArgumentParser
20
+
21
+
22
# Command-line options, parsed by `HfArgumentParser` in the `__main__` block below;
# each field's `metadata["help"]` becomes the `--help` text for the matching flag.
@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/rlaif-v"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`int`, *optional*):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = field(
        default=False,
        metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
    )
    repo_id: str = field(
        default="trl-lib/rlaif-v",
        metadata={"help": "Hugging Face repository ID to push the dataset to."},
    )
    dataset_num_proc: int | None = field(
        default=None,
        metadata={"help": "Number of workers to use for dataset processing."},
    )
48
+
49
+
50
def to_conversational(example):
    """
    Convert prompt from "xxx" to [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "xxx"}]}]
    and chosen and rejected from "xxx" to [{"role": "assistant", "content": [{"type": "text", "text": "xxx"}]}].
    Images are wrapped into a list.
    """

    def assistant_turn(text):
        # Single-message assistant turn holding one text content part.
        return [{"role": "assistant", "content": [{"type": "text", "text": text}]}]

    user_turn = [
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": example["question"]}],
        }
    ]
    return {
        "prompt": user_turn,
        "images": [example["image"]],
        "chosen": assistant_turn(example["chosen"]),
        "rejected": assistant_turn(example["rejected"]),
    }
60
+
61
+
62
# Dataset card (README) uploaded alongside the dataset when --push_to_hub is set.
model_card = ModelCard("""
---
tags: [trl]
---

# RLAIF-V Dataset

## Summary

The RLAIF-V dataset is a processed version of the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset#dataset-card-for-rlaif-v-dataset), specifically curated to train vision-language models using the [TRL library](https://github.com/huggingface/trl) for preference learning tasks. It contains 83,132 high-quality comparison pairs, each comprising an image and two textual descriptions: one preferred and one rejected. This dataset enables models to learn human preferences in visual contexts, enhancing their ability to generate and evaluate image captions.

## Data Structure

- **Format**: [Conversational](https://huggingface.co/docs/trl/main/dataset_formats#conversational)
- **Type**: [Preference](https://huggingface.co/docs/trl/main/dataset_formats#preference)

Columns:
- `"prompt"`: The task related to the image.
- `"images"`: The image.
- `"chosen"`: The preferred answer.
- `"rejected"`: An alternative answer that was not preferred.

This structure allows models to learn to prefer the _chosen_ response over the _rejected_ one, thereby aligning with human preferences in visual tasks.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/rlaif-v.py).
""")

if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    script_args = parser.parse_args_into_dataclasses()[0]

    dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train")
    # Rewrite every row into conversational form; the smaller writer batch keeps
    # memory bounded while serializing image-bearing rows.
    dataset = dataset.map(
        to_conversational,
        num_proc=script_args.dataset_num_proc,
        remove_columns=dataset.column_names,
        writer_batch_size=128,
    )

    # Cast the images to Sequence[Image] to avoid bytes format
    f = dataset.features
    f["images"] = features.Sequence(features.Image(decode=True))
    dataset = dataset.cast(f)

    dataset = dataset.train_test_split(test_size=0.01, writer_batch_size=128)

    # Upload the dataset and its card only when explicitly requested via --push_to_hub.
    if script_args.push_to_hub:
        dataset.push_to_hub(script_args.repo_id)
        model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
ICL/RL/trl_source/examples/datasets/tldr.py ADDED
@@ -0,0 +1,104 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from dataclasses import dataclass, field
16
+
17
+ from datasets import load_dataset
18
+ from huggingface_hub import ModelCard
19
+ from transformers import HfArgumentParser
20
+
21
+
22
# Command-line options, parsed by `HfArgumentParser` in the `__main__` block below;
# each field's `metadata["help"]` becomes the `--help` text for the matching flag.
@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/tldr"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`int`, *optional*):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = field(
        default=False,
        metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
    )
    repo_id: str = field(
        default="trl-lib/tldr",
        metadata={"help": "Hugging Face repository ID to push the dataset to."},
    )
    dataset_num_proc: int | None = field(
        default=None,
        metadata={"help": "Number of workers to use for dataset processing."},
    )
48
+
49
+
50
def to_prompt_completion(example):
    """
    Build a prompt/completion pair from one raw TL;DR record.

    The prompt embeds the subreddit, title, and post body in the fixed
    "SUBREDDIT/TITLE/POST/TL;DR:" template; the completion is the author's summary.
    """
    prompt = (
        f"SUBREDDIT: r/{example['subreddit']}\n\n"
        f"TITLE: {example['title']}\n\n"
        f"POST: {example['post']}\n\nTL;DR:"
    )
    # Leading space separates the completion from the "TL;DR:" prompt suffix.
    return {"prompt": prompt, "completion": " " + example["summary"]}
55
+
56
+
57
# Dataset card (README) uploaded alongside the dataset when --push_to_hub is set.
model_card = ModelCard("""
---
tags: [trl]
---

# TL;DR Dataset

## Summary

The TL;DR dataset is a processed version of Reddit posts, specifically curated to train models using the [TRL library](https://github.com/huggingface/trl) for summarization tasks. It leverages the common practice on Reddit where users append "TL;DR" (Too Long; Didn't Read) summaries to lengthy posts, providing a rich source of paired text data for training summarization models.

## Data Structure

- **Format**: [Standard](https://huggingface.co/docs/trl/main/dataset_formats#standard)
- **Type**: [Prompt-completion](https://huggingface.co/docs/trl/main/dataset_formats#prompt-completion)

Columns:
- `"prompt"`: The unabridged Reddit post.
- `"completion"`: The concise "TL;DR" summary appended by the author.

This structure enables models to learn the relationship between detailed content and its abbreviated form, enhancing their summarization capabilities.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/tldr.py).
""")

if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    script_args = parser.parse_args_into_dataclasses()[0]

    # Filtered reddit TL;DR dataset from https://github.com/openai/summarize-from-feedback?tab=readme-ov-file#reddit-tldr-dataset
    data_files = {
        "train": "https://openaipublic.blob.core.windows.net/summarize-from-feedback/datasets/tldr_3_filtered/train.jsonl",
        "validation": "https://openaipublic.blob.core.windows.net/summarize-from-feedback/datasets/tldr_3_filtered/valid.jsonl",
        "test": "https://openaipublic.blob.core.windows.net/summarize-from-feedback/datasets/tldr_3_filtered/test.jsonl",
    }
    dataset = load_dataset("json", data_files=data_files)

    # Format every split into {"prompt", "completion"} columns.
    dataset = dataset.map(
        to_prompt_completion,
        num_proc=script_args.dataset_num_proc,
        remove_columns=["id", "subreddit", "title", "post", "summary"],
    )

    # Upload the dataset and its card only when explicitly requested via --push_to_hub.
    if script_args.push_to_hub:
        dataset.push_to_hub(script_args.repo_id)
        model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
ICL/RL/trl_source/examples/datasets/tldr_preference.py ADDED
@@ -0,0 +1,110 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from dataclasses import dataclass, field
16
+
17
+ from datasets import load_dataset
18
+ from huggingface_hub import ModelCard
19
+ from transformers import HfArgumentParser
20
+
21
+
22
# Command-line options, parsed by `HfArgumentParser` in the `__main__` block below;
# each field's `metadata["help"]` becomes the `--help` text for the matching flag.
@dataclass
class ScriptArguments:
    r"""
    Arguments for the script.

    Args:
        push_to_hub (`bool`, *optional*, defaults to `False`):
            Whether to push the dataset to the Hugging Face Hub.
        repo_id (`str`, *optional*, defaults to `"trl-lib/tldr-preference"`):
            Hugging Face repository ID to push the dataset to.
        dataset_num_proc (`int`, *optional*):
            Number of workers to use for dataset processing.
    """

    push_to_hub: bool = field(
        default=False,
        metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
    )
    repo_id: str = field(
        default="trl-lib/tldr-preference",
        metadata={"help": "Hugging Face repository ID to push the dataset to."},
    )
    dataset_num_proc: int | None = field(
        default=None,
        metadata={"help": "Number of workers to use for dataset processing."},
    )
48
+
49
+
50
def to_preference(example):
    """
    Map one raw summarize-from-feedback comparison onto {"prompt", "chosen", "rejected"}.

    The prompt template depends on the record's batch: CNN/Daily Mail batches use a
    TITLE/article template, Reddit batches the SUBREDDIT/TITLE/POST template. The two
    candidate summaries are split into chosen/rejected using the annotator's choice.

    Raises:
        ValueError: If the record's batch name is not a known CNN/DM or Reddit batch.
    """
    info = example["info"]
    batch = example["batch"]
    cnndm_batches = {"batch0_cnndm", "cnndm0", "cnndm2"}
    reddit_batches = {f"batch{i}" for i in range(3, 23)} | {"edit_b2_eval_test"}

    if batch in cnndm_batches:
        # Collapse paragraph breaks to single newlines before templating.
        article = info["article"].replace("\n\n", "\n")
        prompt = f"TITLE: {info['title']}\n\n{article}\n\nTL;DR:"
    elif batch in reddit_batches:
        post = info["post"].replace("\n\n", "\n")
        prompt = f"SUBREDDIT: r/{info['subreddit']}\n\nTITLE: {info['title']}\n\nPOST: {post}\n\nTL;DR:"
    else:
        raise ValueError(f"Unknown batch: {batch}")

    chosen_idx = example["choice"]
    summaries = example["summaries"]
    return {
        "prompt": prompt,
        "chosen": summaries[chosen_idx]["text"],
        "rejected": summaries[1 - chosen_idx]["text"],
    }
66
+
67
+
68
# Dataset card (README) uploaded alongside the dataset when --push_to_hub is set.
model_card = ModelCard("""
---
tags: [trl]
---

# TL;DR Dataset for Preference Learning

## Summary

The TL;DR dataset is a processed version of Reddit posts, specifically curated to train models using the [TRL library](https://github.com/huggingface/trl) for preference learning and Reinforcement Learning from Human Feedback (RLHF) tasks. It leverages the common practice on Reddit where users append "TL;DR" (Too Long; Didn't Read) summaries to lengthy posts, providing a rich source of paired text data for training models to understand and generate concise summaries.

## Data Structure

- **Format**: [Standard](https://huggingface.co/docs/trl/main/dataset_formats#standard)
- **Type**: [Preference](https://huggingface.co/docs/trl/main/dataset_formats#preference)

Columns:
- `"prompt"`: The unabridged Reddit post.
- `"chosen"`: The concise "TL;DR" summary appended by the author.
- `"rejected"`: An alternative summary or response that was not selected.

This structure enables models to learn the relationship between detailed content and its abbreviated form, enhancing their summarization capabilities.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/tldr_preference.py).
""")

if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    script_args = parser.parse_args_into_dataclasses()[0]

    # Human comparison annotations (two candidate summaries + annotator choice) per record.
    dataset = load_dataset("openai/summarize_from_feedback", "comparisons")

    # Reshape each comparison into {"prompt", "chosen", "rejected"} columns.
    dataset = dataset.map(
        to_preference,
        num_proc=script_args.dataset_num_proc,
        remove_columns=["info", "summaries", "choice", "worker", "batch", "split", "extra"],
    )

    # Upload the dataset and its card only when explicitly requested via --push_to_hub.
    if script_args.push_to_hub:
        dataset.push_to_hub(script_args.repo_id)
        model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
ICL/RL/trl_source/examples/datasets/ultrafeedback-prompt.py ADDED
@@ -0,0 +1,102 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from dataclasses import dataclass, field
16
+
17
+ from datasets import load_dataset
18
+ from huggingface_hub import ModelCard
19
+ from transformers import HfArgumentParser
20
+
21
+
22
+ @dataclass
23
+ class ScriptArguments:
24
+ r"""
25
+ Arguments for the script.
26
+
27
+ Args:
28
+ push_to_hub (`bool`, *optional*, defaults to `False`):
29
+ Whether to push the dataset to the Hugging Face Hub.
30
+ repo_id (`str`, *optional*, defaults to `"trl-lib/ultrafeedback-prompt"`):
31
+ Hugging Face repository ID to push the dataset to.
32
+ dataset_num_proc (`int`, *optional*):
33
+ Number of workers to use for dataset processing.
34
+ """
35
+
36
+ push_to_hub: bool = field(
37
+ default=False,
38
+ metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
39
+ )
40
+ repo_id: str = field(
41
+ default="trl-lib/ultrafeedback-prompt",
42
+ metadata={"help": "Hugging Face repository ID to push the dataset to."},
43
+ )
44
+ dataset_num_proc: int | None = field(
45
+ default=None,
46
+ metadata={"help": "Number of workers to use for dataset processing."},
47
+ )
48
+
49
+
50
def to_unpaired_preference(example):
    """Wrap the raw instruction string into a single-turn conversational prompt."""
    user_turn = {"role": "user", "content": example["instruction"]}
    return {"prompt": [user_turn]}
53
+
54
+
55
def drop_long_prompt(example):
    """Filter predicate: keep examples whose prompt content fits in 512 characters.

    Args:
        example: A dataset row whose `"prompt"` is a list of chat messages; only
            the first message's `"content"` length is checked.

    Returns:
        `True` if the first prompt message is 512 characters or fewer, else `False`.
    """
    # A single boolean expression replaces the original if/else returning
    # True/False branches — identical behavior, idiomatic form.
    return len(example["prompt"][0]["content"]) <= 512
60
+
61
+
62
# Dataset card pushed alongside the data when --push_to_hub is set.
model_card = ModelCard("""
---
tags: [trl]
---

# UltraFeedback - Prompts Dataset

## Summary

The UltraFeedback - Prompts dataset is a processed version of the [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset for model evaluation on specific aspects like helpfulness, honesty, and instruction-following.

## Data Structure

- **Format**: [Conversational](https://huggingface.co/docs/trl/main/dataset_formats#conversational)
- **Type**: [Prompt-only](https://huggingface.co/docs/trl/main/dataset_formats#prompt-only)

Column:
- `"prompt"`: The input question or instruction provided to the model.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/ultrafeedback-prompt.py).
""")

if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    script_args = parser.parse_args_into_dataclasses()[0]

    # Build prompt-only records, drop over-long prompts, then carve out a
    # small held-out test split (5%, fixed seed for reproducibility).
    dataset = (
        load_dataset("openbmb/UltraFeedback", split="train")
        .map(
            to_unpaired_preference,
            remove_columns=["source", "instruction", "models", "completions", "correct_answers", "incorrect_answers"],
            num_proc=script_args.dataset_num_proc,
        )
        .filter(drop_long_prompt)
        .train_test_split(test_size=0.05, seed=42)
    )

    if script_args.push_to_hub:
        dataset.push_to_hub(script_args.repo_id)
        model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
ICL/RL/trl_source/examples/datasets/ultrafeedback.py ADDED
@@ -0,0 +1,144 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from dataclasses import dataclass, field
16
+
17
+ from datasets import load_dataset
18
+ from huggingface_hub import ModelCard
19
+ from transformers import HfArgumentParser
20
+
21
+
22
+ @dataclass
23
+ class ScriptArguments:
24
+ r"""
25
+ Arguments for the script.
26
+
27
+ Args:
28
+ model_name (`str`, *optional*, defaults to `"gpt-3.5-turbo"`):
29
+ Language model to target. Possible values are:
30
+ aspect (`str`, *optional*, defaults to `"helpfulness"`):
31
+ Aspect to target.
32
+ push_to_hub (`bool`, *optional*, defaults to `False`):
33
+ Whether to push the dataset to the Hugging Face Hub.
34
+ repo_id (`str`, *optional*, defaults to `"trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness"`):
35
+ Hugging Face repository ID to push the dataset to.
36
+ dataset_num_proc (`int`, *optional*):
37
+ Number of workers to use for dataset processing.
38
+ """
39
+
40
+ model_name: str = field(
41
+ default="gpt-3.5-turbo",
42
+ metadata={
43
+ "help": "Language model to target.",
44
+ "choices": [
45
+ "alpaca-7b",
46
+ "bard",
47
+ "falcon-40b-instruct",
48
+ "gpt-3.5-turbo",
49
+ "gpt-4",
50
+ "llama-2-13b-chat",
51
+ "llama-2-70b-chat",
52
+ "llama-2-7b-chat",
53
+ "mpt-30b-chat",
54
+ "pythia-12b",
55
+ "starchat",
56
+ "ultralm-13b",
57
+ "ultralm-65b",
58
+ "vicuna-33b",
59
+ "wizardlm-13b",
60
+ "wizardlm-70b",
61
+ "wizardlm-7b",
62
+ ],
63
+ },
64
+ )
65
+ aspect: str = field(
66
+ default="helpfulness",
67
+ metadata={
68
+ "help": "Aspect to target. Possible values are: 'helpfulness' (default), 'honesty', "
69
+ "'instruction-following', 'truthfulness'.",
70
+ "choices": ["helpfulness", "honesty", "instruction-following", "truthfulness"],
71
+ },
72
+ )
73
+ push_to_hub: bool = field(
74
+ default=False,
75
+ metadata={"help": "Whether to push the dataset to the Hugging Face Hub."},
76
+ )
77
+ repo_id: str = field(
78
+ default="trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness",
79
+ metadata={"help": "Hugging Face repository ID to push the dataset to."},
80
+ )
81
+ dataset_num_proc: int | None = field(
82
+ default=None,
83
+ metadata={"help": "Number of workers to use for dataset processing."},
84
+ )
85
+
86
+
87
def to_unpaired_preference(example, model_name, aspect):
    """Convert a raw UltraFeedback row into an unpaired-preference record.

    Selects the completion produced by `model_name` and labels it `True` when
    its rating for `aspect` is at least 5, `False` otherwise.
    """
    idx = example["models"].index(model_name)
    chosen = example["completions"][idx]
    rating = int(chosen["annotations"][aspect]["Rating"])
    return {
        "prompt": [{"role": "user", "content": example["instruction"]}],
        "completion": [{"role": "assistant", "content": chosen["response"]}],
        "label": rating >= 5,
    }
95
+
96
+
97
# Dataset card pushed alongside the data when --push_to_hub is set.
model_card = ModelCard("""
---
tags: [trl]
---

# UltraFeedback GPT-3.5-Turbo Helpfulness Dataset

## Summary

The UltraFeedback GPT-3.5-Turbo Helpfulness dataset contains processed user-assistant interactions filtered for helpfulness, derived from the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset. It is designed for fine-tuning and evaluating models in alignment tasks.

## Data Structure

- **Format**: [Conversational](https://huggingface.co/docs/trl/main/dataset_formats#conversational)
- **Type**: [Unpaired preference](https://huggingface.co/docs/trl/main/dataset_formats#unpaired-preference)

Column:
- `"prompt"`: The input question or instruction provided to the model.
- `"completion"`: The model's response to the prompt.
- `"label"`: A binary value indicating whether the response is sufficiently helpful.

## Generation script

The script used to generate this dataset can be found [here](https://github.com/huggingface/trl/blob/main/examples/datasets/ultrafeedback.py).
""")

if __name__ == "__main__":
    parser = HfArgumentParser(ScriptArguments)
    script_args = parser.parse_args_into_dataclasses()[0]

    # Keep only rows where the target model actually produced a completion,
    # convert each row to an unpaired-preference record, then split train/test
    # (5% held out, fixed seed for reproducibility).
    dataset = (
        load_dataset("openbmb/UltraFeedback", split="train")
        .filter(
            lambda example: script_args.model_name in example["models"],
            batched=False,
            num_proc=script_args.dataset_num_proc,
        )
        .map(
            to_unpaired_preference,
            remove_columns=["source", "instruction", "models", "completions", "correct_answers", "incorrect_answers"],
            fn_kwargs={"model_name": script_args.model_name, "aspect": script_args.aspect},
            num_proc=script_args.dataset_num_proc,
        )
        .train_test_split(test_size=0.05, seed=42)
    )

    if script_args.push_to_hub:
        dataset.push_to_hub(script_args.repo_id)
        model_card.push_to_hub(script_args.repo_id, repo_type="dataset")
ICL/RL/trl_source/examples/notebooks/README.md ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Notebooks
2
+
3
+ This directory contains a collection of Jupyter notebooks that demonstrate how to use the TRL library in different applications.
4
+
5
+ | Notebook | Description | Open in Colab |
6
+ | --- | --- | --- |
7
+ | [`grpo_trl_lora_qlora.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/grpo_trl_lora_qlora.ipynb) | GRPO using QLoRA on free Colab | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_trl_lora_qlora.ipynb) |
8
+ | [`grpo_functiongemma_browsergym_openenv.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/grpo_functiongemma_browsergym_openenv.ipynb) | GRPO on FunctionGemma in the BrowserGym environment | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_functiongemma_browsergym_openenv.ipynb) |
9
+ | [`grpo_agent.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/grpo_agent.ipynb) | GRPO for agent training | Not available due to OOM with Colab GPUs |
10
+ | [`grpo_rnj_1_instruct.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/grpo_rnj_1_instruct.ipynb) | GRPO rnj-1-instruct with QLoRA using TRL on Colab to add reasoning capabilities | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_rnj_1_instruct.ipynb) |
11
+ | [`sft_ministral3_vl.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/sft_ministral3_vl.ipynb) | Supervised Fine-Tuning (SFT) Ministral 3 with QLoRA using TRL on free Colab | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_ministral3_vl.ipynb) |
12
+ | [`grpo_ministral3_vl.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/grpo_ministral3_vl.ipynb) | GRPO Ministral 3 with QLoRA using TRL on free Colab | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_ministral3_vl.ipynb) |
13
+ | [`openenv_sudoku_grpo.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/openenv_sudoku_grpo.ipynb) | GRPO to play Sudoku on an OpenEnv environment | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/openenv_sudoku_grpo.ipynb) |
14
+ | [`openenv_wordle_grpo.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/openenv_wordle_grpo.ipynb) | GRPO to play Wordle on an OpenEnv environment | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/openenv_wordle_grpo.ipynb) |
15
+ | [`sft_trl_lora_qlora.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/sft_trl_lora_qlora.ipynb) | Supervised Fine-Tuning (SFT) using QLoRA on free Colab | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_trl_lora_qlora.ipynb) |
16
+ | [`sft_qwen_vl.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/sft_qwen_vl.ipynb) | Supervised Fine-Tuning (SFT) Qwen3-VL with QLoRA using TRL on free Colab | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_qwen_vl.ipynb) |
17
+ | [`grpo_qwen3_vl.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/grpo_qwen3_vl.ipynb) | GRPO Qwen3-VL with QLoRA using TRL on free Colab | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_qwen3_vl.ipynb) |
ICL/RL/trl_source/examples/notebooks/grpo_agent.ipynb ADDED
@@ -0,0 +1,706 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "63ceecbc-87ad-4ad3-a317-f49267ffc93b",
6
+ "metadata": {},
7
+ "source": [
8
+ "# Agent Training with GRPO using TRL\n",
9
+ "\n",
10
+ "![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)\n",
11
+ "\n",
12
+ "\n",
13
+ "With [**Transformers Reinforcement Learning (TRL)**](https://github.com/huggingface/trl), you can train a language model to act as an **agent**. One that learns to reason, interact with external tools, and improve through reinforcement.\n",
14
+ "\n",
15
+ "- [TRL GitHub Repository](https://github.com/huggingface/trl) — star us to support the project! \n",
16
+ "- [Official TRL Examples](https://huggingface.co/docs/trl/example_overview) \n",
17
+ "- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)\n",
18
+ "- [OpenEnv](https://github.com/meta-pytorch/OpenEnv)\n",
19
+ "\n",
20
+ "\n",
21
+ "TRL supports training agents that can use external tools as part of their decision process. \n",
22
+ "In this notebook, the agent has access to the **BioGRID database**, which it can query using **read-only SQL commands** to retrieve biological interaction data. The model learns when and how to use tools based on rewards.\n",
23
+ "\n",
24
+ "We'll fine-tune a model using GRPO (Group Relative Policy Optimization) via TRL. The agent will:\n",
25
+ "\n",
26
+ "1. Generate tool call to query the database if needed.\n",
27
+ "2. Receive the tool response and add it it to the context.\n",
28
+ "3. Learn to improve its tool usage and general capabilities over time through reward signals.\n",
29
+ "\n",
30
+ "## Install dependencies\n",
31
+ "\n",
32
+ "We'll start by installing **TRL**, which automatically includes the main dependencies like **Transformers**. \n",
33
+ "We'll also install **trackio** (for logging and monitoring training runs), **vLLM** (for efficient generation), and **jmespath** (needed for the tools capabilities)."
34
+ ]
35
+ },
36
+ {
37
+ "cell_type": "code",
38
+ "execution_count": null,
39
+ "id": "b4812fbf-3f61-481e-9a64-95277eada9c9",
40
+ "metadata": {},
41
+ "outputs": [],
42
+ "source": [
43
+ "!pip install -Uq \"trl[vllm]\" git+https://github.com/huggingface/transformers.git trackio jmespath "
44
+ ]
45
+ },
46
+ {
47
+ "cell_type": "markdown",
48
+ "id": "ede8e566-a1b5-460f-9fe8-a6010bc56148",
49
+ "metadata": {},
50
+ "source": [
51
+ "### Log in to Hugging Face\n",
52
+ "\n",
53
+ "Log in to your **Hugging Face** account to save your fine-tuned model, track your experiment results directly on the Hub or access gated models. You can find your **access token** on your [account settings page](https://huggingface.co/settings/tokens)."
54
+ ]
55
+ },
56
+ {
57
+ "cell_type": "code",
58
+ "execution_count": null,
59
+ "id": "21756ac0-78b2-495d-8137-28dfa9faae6a",
60
+ "metadata": {},
61
+ "outputs": [],
62
+ "source": [
63
+ "from huggingface_hub import notebook_login\n",
64
+ "\n",
65
+ "notebook_login()"
66
+ ]
67
+ },
68
+ {
69
+ "cell_type": "markdown",
70
+ "id": "KVGklspLYlmz",
71
+ "metadata": {},
72
+ "source": [
73
+ "## Create the database for the tool\n",
74
+ "\n",
75
+ "For this example, we will use the [BioGRID database](https://thebiogrid.org/), a curated resource containing **protein, genetic, and chemical interaction data**. We've already compiled and uploaded it to the Hub at [qgallouedec/biogrid](https://huggingface.co/datasets/qgallouedec/biogrid). The dataset is loaded and converted into an sqlite database.\n",
76
+ "\n",
77
+ "> 💡 We remove spaces in the column names to easen the model work. In real-world deployments, you may keep your original column names and rely on the agent to reason about them. Here, we simplify the schema to make training smoother."
78
+ ]
79
+ },
80
+ {
81
+ "cell_type": "code",
82
+ "execution_count": null,
83
+ "id": "rRzPMhfXBLkF",
84
+ "metadata": {},
85
+ "outputs": [],
86
+ "source": [
87
+ "import sqlite3\n",
88
+ "from datasets import load_dataset\n",
89
+ "\n",
90
+ "# Load dataset\n",
91
+ "biogrid_dataset = load_dataset(\"qgallouedec/biogrid\", split=\"train\")\n",
92
+ "df = biogrid_dataset.to_pandas()\n",
93
+ "\n",
94
+ "# Normalize column names: remove spaces, replace with underscores\n",
95
+ "df.columns = [c.replace(\" \", \"_\") for c in df.columns]\n",
96
+ "\n",
97
+ "# Save to SQLite\n",
98
+ "conn = sqlite3.connect(\"biogrid.db\")\n",
99
+ "try:\n",
100
+ " df.to_sql(\"interactions\", conn, if_exists=\"replace\", index=False)\n",
101
+ " print(f\"biogrid.db created. Rows stored: {len(df)}\")\n",
102
+ "finally:\n",
103
+ " conn.close()"
104
+ ]
105
+ },
106
+ {
107
+ "cell_type": "markdown",
108
+ "id": "pSSGvLbmZyC2",
109
+ "metadata": {},
110
+ "source": [
111
+ "## Load the QA dataset\n",
112
+ "\n",
113
+ "The training objective is to fine-tune a model to answer gene-related questions. The model should learn to use the database query tool to retrieve factual information when needed.\n",
114
+ "\n",
115
+ "We'll define a formatting function for each sample, adding instructions about the database and how to call it. The model must answer with **yes** or **no**. Let's implement the `format_example` function.\n",
116
+ "\n"
117
+ ]
118
+ },
119
+ {
120
+ "cell_type": "code",
121
+ "execution_count": null,
122
+ "id": "asrv7LbaD71C",
123
+ "metadata": {},
124
+ "outputs": [],
125
+ "source": [
126
+ "import textwrap\n",
127
+ "\n",
128
+ "def format_example(example):\n",
129
+ " question = example[\"question\"]\n",
130
+ " preamble = textwrap.dedent(\"\"\"\\\n",
131
+ " You have access to the BioGRID SQLite database.\n",
132
+ " Use SQL queries to retrieve only the information needed to answer the question.\n",
133
+ "\n",
134
+ " Genes may appear in the database in columns `Alt_IDs_Interactor_A` `Alt_IDs_Interactor_B`, `Aliases_Interactor_A` and `Aliases_Interactor_B`,\n",
135
+ " and each entry can contain multiple gene names or synonyms separated by '|', for example:\n",
136
+ " 'entrez gene/locuslink:JNKK(gene name synonym)|entrez gene/locuslink:MAPKK4(gene name synonym)|...'\n",
137
+ " So a gene like 'JNKK' or 'MAPKK4' may appear inside one of these strings.\n",
138
+ "\n",
139
+ " If the database schema is unclear or you are unsure about column names:\n",
140
+ " - First inspect the schema with `PRAGMA table_info(interactions);`\n",
141
+ " - Or preview a few rows with `SELECT * FROM interactions LIMIT 1;`\n",
142
+ "\n",
143
+ " Otherwise, directly query the required data.\n",
144
+ "\n",
145
+ " Final answer must be enclosed in stars, e.g. *Yes* or *No*.\n",
146
+ " Facts:\n",
147
+ " - The NCBI Taxonomy identifier for humans is taxid:9606.\n",
148
+ " \"\"\")\n",
149
+ " content = f\"{preamble}\\nQuestion: {question}\"\n",
150
+ " prompt = [{\"role\": \"user\", \"content\": content}]\n",
151
+ " return {\"prompt\": prompt}"
152
+ ]
153
+ },
154
+ {
155
+ "cell_type": "markdown",
156
+ "id": "UMnHXYZla_EO",
157
+ "metadata": {},
158
+ "source": [
159
+ "Now, let's load the database and call the previous function. \n",
160
+ "For simplicity, we will only use questions that start with **“Does the gene…”**. \n",
161
+ "In a real use case, the full dataset can be used.\n",
162
+ "\n",
163
+ "The QA dataset is available on the [Hub](https://huggingface.co/datasets/qgallouedec/biogrid_qa)."
164
+ ]
165
+ },
166
+ {
167
+ "cell_type": "code",
168
+ "execution_count": null,
169
+ "id": "jEs12KqwDnVl",
170
+ "metadata": {},
171
+ "outputs": [],
172
+ "source": [
173
+ "dataset = load_dataset(\"qgallouedec/biogrid_qa\", split=\"train\")\n",
174
+ "dataset = dataset.filter(\n",
175
+ " lambda example: example[\"question\"].startswith(\"Does the gene \")\n",
176
+ ") # keep only simple questions for example\n",
177
+ "dataset = dataset.map(format_example, remove_columns=[\"question\"])\n",
178
+ "\n",
179
+ "train_dataset = dataset\n",
180
+ "eval_dataset = None # No eval by default, can be added if needed"
181
+ ]
182
+ },
183
+ {
184
+ "cell_type": "markdown",
185
+ "id": "m4GRjbHycM5L",
186
+ "metadata": {},
187
+ "source": [
188
+ "## Create tool for the agent\n",
189
+ "\n",
190
+ "The `query_biogrid` function is the tool the model will use to query the database and retrieve factual information. \n",
191
+ "Each tool must be a standard Python function with **type-hinted arguments and return types**, and a **Google-style docstring** describing its purpose, parameters, and return value."
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": null,
197
+ "id": "nLMH7hahGTyO",
198
+ "metadata": {},
199
+ "outputs": [],
200
+ "source": [
201
+ "from contextlib import contextmanager\n",
202
+ "import signal\n",
203
+ "\n",
204
+ "@contextmanager\n",
205
+ "def timeout(seconds):\n",
206
+ " \"\"\"Context manager that raises TimeoutError if execution exceeds time limit.\"\"\"\n",
207
+ "\n",
208
+ " def timeout_handler(signum, frame):\n",
209
+ " raise TimeoutError(f\"Operation timed out after {seconds} seconds\")\n",
210
+ "\n",
211
+ " signal.signal(signal.SIGALRM, timeout_handler)\n",
212
+ " signal.alarm(seconds)\n",
213
+ " try:\n",
214
+ " yield\n",
215
+ " finally:\n",
216
+ " signal.alarm(0)\n",
217
+ "\n",
218
+ "def query_biogrid(sql_command: str) -> list[tuple]:\n",
219
+ " \"\"\"\n",
220
+ " Execute a read-only SQL command on the BioGRID database.\n",
221
+ "\n",
222
+ " BioGRID is a curated biological database that compiles protein, genetic, and chemical interactions from multiple organisms. It provides researchers with experimentally verified interaction data to support studies in systems biology and functional genomics.\n",
223
+ "\n",
224
+ " Args:\n",
225
+ " sql_command: The SQL command to execute.\n",
226
+ "\n",
227
+ " Returns:\n",
228
+ " A list of tuples containing the query results.\n",
229
+ " \"\"\"\n",
230
+ " with timeout(5):\n",
231
+ " conn = sqlite3.connect(\"file:biogrid.db?mode=ro\", uri=True)\n",
232
+ " cursor = conn.cursor()\n",
233
+ " try:\n",
234
+ " cursor.execute(sql_command)\n",
235
+ " results = cursor.fetchall()\n",
236
+ " finally:\n",
237
+ " conn.close()\n",
238
+ " return results"
239
+ ]
240
+ },
241
+ {
242
+ "cell_type": "markdown",
243
+ "id": "GiHtooTwci3B",
244
+ "metadata": {},
245
+ "source": [
246
+ "## Define reward functions\n",
247
+ "\n",
248
+ "To guide the agent during training, we define a few simple reward functions:\n",
249
+ "\n",
250
+ "- **`query_reward`**: evaluates the model’s query strategy — penalizes more than two queries, penalizes generic database scans, and rewards use of `WHERE` and evidence supporting the final answer.\n",
251
+ "- **`correctness_reward`**: rewards Yes/No predictions that match the expected answer.\n",
252
+ "- **`structure_reward`**: rewards a proper assistant structure (tool call → response → optional explanation).\n",
253
+ "\n",
254
+ "Each function returns a list of floats used by the **GRPOTrainer** during optimization. \n",
255
+ "Combined, they encourage effective tool use and factual answers."
256
+ ]
257
+ },
258
+ {
259
+ "cell_type": "code",
260
+ "execution_count": null,
261
+ "id": "sXyqC6cJGe3L",
262
+ "metadata": {},
263
+ "outputs": [],
264
+ "source": [
265
+ "import re\n",
266
+ "\n",
267
+ "def query_reward(completions, answer, **kwargs):\n",
268
+ " \"\"\"\n",
269
+ " Reward query strategy:\n",
270
+ " - Penalize more than 2 queries\n",
271
+ " - Penalize generic queries (LIMIT 1 / PRAGMA)\n",
272
+ " - Reward usage of WHERE\n",
273
+ " - Reward evidence supporting the final answer\n",
274
+ " \"\"\"\n",
275
+ " rewards = []\n",
276
+ "\n",
277
+ " for completion, ans in zip(completions, answer, strict=False):\n",
278
+ " reward = 0.0\n",
279
+ " sql_queries = []\n",
280
+ " tool_results = []\n",
281
+ "\n",
282
+ " # collect all SQL queries and tool results\n",
283
+ " for turn in completion:\n",
284
+ " if turn.get(\"tool_calls\"):\n",
285
+ " for call in turn[\"tool_calls\"]:\n",
286
+ " sql = call[\"function\"][\"arguments\"].get(\"sql_command\", \"\").lower()\n",
287
+ " sql_queries.append(sql)\n",
288
+ " if turn.get(\"role\") == \"tool\" and turn.get(\"content\"):\n",
289
+ " tool_results.append(turn[\"content\"])\n",
290
+ "\n",
291
+ " # --- penalize too many queries ---\n",
292
+ " if len(sql_queries) > 3:\n",
293
+ " reward -= 1.5\n",
294
+ "\n",
295
+ " # --- check query quality ---\n",
296
+ " where_count = 0\n",
297
+ " for q in sql_queries:\n",
298
+ " if \"limit 1\" in q:\n",
299
+ " reward -= 1.0\n",
300
+ " if \" where \" not in q:\n",
301
+ " reward -= 0.5\n",
302
+ " else:\n",
303
+ " where_count += 1\n",
304
+ " reward += min(where_count, 3) * 0.4 # small bonus for WHERE usage\n",
305
+ "\n",
306
+ " # --- evidence check: do queries support the answer? ---\n",
307
+ " combined_results = []\n",
308
+ " error_detected = False\n",
309
+ "\n",
310
+ " for res in tool_results:\n",
311
+ " if isinstance(res, dict) and \"error\" in res:\n",
312
+ " error_detected = True\n",
313
+ " elif isinstance(res, list):\n",
314
+ " combined_results.extend(res)\n",
315
+ "\n",
316
+ " # if error detected, penalize heavily\n",
317
+ " if error_detected:\n",
318
+ " reward -= 2.0\n",
319
+ " elif len(sql_queries) == 0:\n",
320
+ " reward -= 1.5\n",
321
+ " else:\n",
322
+ " has_hits = len(combined_results) > 0\n",
323
+ " correct_answer = ans.lower()\n",
324
+ " if (has_hits and correct_answer == \"yes\") or (not has_hits and correct_answer == \"no\"):\n",
325
+ " reward += 2.0\n",
326
+ " else:\n",
327
+ " reward -= 1.5\n",
328
+ "\n",
329
+ " rewards.append(reward)\n",
330
+ "\n",
331
+ " return rewards\n",
332
+ "\n",
333
+ "\n",
334
+ "def correctness_reward(completions, answer, **kwargs):\n",
335
+ " \"\"\"\n",
336
+ " Reward Yes/No correctness.\n",
337
+ " Model must provide final answer enclosed in stars — *yes* or *no*.\n",
338
+ " Does not reward informal yes/no buried in text.\n",
339
+ " \"\"\"\n",
340
+ " rewards = []\n",
341
+ " for completion, ans in zip(completions, answer, strict=False):\n",
342
+ " raw = completion[-1][\"content\"].lower()\n",
343
+ "\n",
344
+ " # detect form *yes* or *no*\n",
345
+ " match = re.search(r\"\\*(yes|no)\\*\", raw)\n",
346
+ " guess = match.group(1) if match else None\n",
347
+ "\n",
348
+ " reward = 0.0\n",
349
+ "\n",
350
+ " if guess is None:\n",
351
+ " reward -= 0.5 # invalid format\n",
352
+ " elif guess == ans.lower():\n",
353
+ " reward += 0.6 # correct under required format\n",
354
+ " else:\n",
355
+ " reward -= 1.0 # wrong answer\n",
356
+ "\n",
357
+ " rewards.append(reward)\n",
358
+ "\n",
359
+ " return rewards\n",
360
+ "\n",
361
+ "\n",
362
+ "def structure_reward(completions, **kwargs):\n",
363
+ " \"\"\"\n",
364
+ " Reward proper assistant structure.\n",
365
+ " Encourages a logical sequence: tool call + response + optional extra content.\n",
366
+ " \"\"\"\n",
367
+ " rewards = []\n",
368
+ "\n",
369
+ " for completion in completions:\n",
370
+ " has_call = False\n",
371
+ " has_response = False\n",
372
+ " has_other = False\n",
373
+ "\n",
374
+ " for turn in completion:\n",
375
+ " role = turn.get(\"role\")\n",
376
+ " if role == \"assistant\" and turn.get(\"tool_calls\"):\n",
377
+ " has_call = True\n",
378
+ " elif role == \"tool\":\n",
379
+ " has_response = True\n",
380
+ " else:\n",
381
+ " content = turn.get(\"content\")\n",
382
+ " if content and content.strip() not in [\"\", \"<think>\"]:\n",
383
+ " has_other = True\n",
384
+ "\n",
385
+ " # Reward sequences\n",
386
+ " if has_call and has_response:\n",
387
+ " if has_other:\n",
388
+ " reward = 0.1\n",
389
+ " else:\n",
390
+ " reward = 0.05 # still positive even without extra text\n",
391
+ " elif has_call and not has_response:\n",
392
+ " reward = -0.15\n",
393
+ " else:\n",
394
+ " reward = 0.0 # neutral if no call\n",
395
+ "\n",
396
+ " rewards.append(reward)\n",
397
+ "\n",
398
+ " return rewards\n"
399
+ ]
400
+ },
401
+ {
402
+ "cell_type": "markdown",
403
+ "id": "zcgkrKtTb4T9",
404
+ "metadata": {},
405
+ "source": [
406
+ "## Set GRPO Config\n",
407
+ "\n",
408
+ "Next, we define the **GRPOConfig**, which controls the main training parameters. \n",
409
+ "This configuration specifies how the model interacts with **vLLM**, manages memory, and logs results."
410
+ ]
411
+ },
412
+ {
413
+ "cell_type": "code",
414
+ "execution_count": null,
415
+ "id": "t4ifJsNLElIN",
416
+ "metadata": {},
417
+ "outputs": [],
418
+ "source": [
419
+ "from trl import GRPOConfig\n",
420
+ "\n",
421
+ "output_dir = \"grpo_biogrid_qwen_3g-1.7b\"\n",
422
+ "\n",
423
+ "grpo_config = GRPOConfig(\n",
424
+ " # Training schedule / optimization\n",
425
+ " max_steps=400, # Max number of training steps\n",
426
+ " chat_template_kwargs = {\"enable_thinking\": False}, # Disable thinking to reduce token generation\n",
427
+ "\n",
428
+ " # GRPO configuration\n",
429
+ " max_completion_length = 1024, # Maximum tokens generated per model response\n",
430
+ "\n",
431
+ " # vLLM configuration\n",
432
+ " use_vllm = True, # Enable vLLM for faster inference during rollouts\n",
433
+ " vllm_mode = \"colocate\", # Run vLLM in colocate mode (same process as training)\n",
434
+ " vllm_enable_sleep_mode=False,\n",
435
+ "\n",
436
+ " # Logging / reporting\n",
437
+ " output_dir = output_dir, # Directory for checkpoints and logs\n",
438
+ " report_to=\"trackio\", # Experiment tracking tool (integrates with HF Spaces)\n",
439
+ " trackio_space_id = output_dir, # HF Space where experiment tracking will be saved\n",
440
+ " save_steps = 10, # Interval for saving checkpoints\n",
441
+ " log_completions = True,\n",
442
+ "\n",
443
+ " # Memory optimization\n",
444
+ " gradient_checkpointing = True, # Enable activation recomputation to save memory\n",
445
+ "\n",
446
+ " # Hub integration\n",
447
+ " push_to_hub = True, # Set True to automatically push model to Hugging Face Hub\n",
448
+ ")"
449
+ ]
450
+ },
451
+ {
452
+ "cell_type": "markdown",
453
+ "id": "34I-Q2MJuf42",
454
+ "metadata": {},
455
+ "source": [
456
+ "## Create `GRPOTrainer` and Start Training\n",
457
+ "\n",
458
+ "Next, we initialize the **`GRPOTrainer`**, which handles the full reinforcement learning loop.\n",
459
+ "\n",
460
+ "It receives the model name, reward functions, tool(s), and dataset defined earlier. \n",
461
+ "\n",
462
+ "Finally, we call `trainer.train()` to begin fine-tuning, allowing the model to learn how to query the database effectively through iterative feedback."
463
+ ]
464
+ },
465
+ {
466
+ "cell_type": "code",
467
+ "execution_count": null,
468
+ "id": "IysntAUOFvRn",
469
+ "metadata": {},
470
+ "outputs": [],
471
+ "source": [
472
+ "from trl import GRPOTrainer\n",
473
+ "\n",
474
+ "model_name=\"Qwen/Qwen3-1.7B\"\n",
475
+ "\n",
476
+ "trainer = GRPOTrainer(\n",
477
+ " model=model_name,\n",
478
+ " train_dataset=train_dataset,\n",
479
+ " eval_dataset=eval_dataset,\n",
480
+ " tools=[query_biogrid],\n",
481
+ " reward_funcs=[correctness_reward, structure_reward, query_reward],\n",
482
+ " args=grpo_config,\n",
483
+ ")"
484
+ ]
485
+ },
486
+ {
487
+ "cell_type": "markdown",
488
+ "id": "r_qJ5UwLuzCG",
489
+ "metadata": {},
490
+ "source": [
491
+ "Show memory stats before training"
492
+ ]
493
+ },
494
+ {
495
+ "cell_type": "code",
496
+ "execution_count": null,
497
+ "id": "DusT8JUaGmA6",
498
+ "metadata": {},
499
+ "outputs": [],
500
+ "source": [
501
+ "import torch\n",
502
+ "gpu_stats = torch.cuda.get_device_properties(0)\n",
503
+ "start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
504
+ "max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)\n",
505
+ "\n",
506
+ "print(f\"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.\")\n",
507
+ "print(f\"{start_gpu_memory} GB of memory reserved.\")"
508
+ ]
509
+ },
510
+ {
511
+ "cell_type": "markdown",
512
+ "id": "OTPkiz3fu0lp",
513
+ "metadata": {},
514
+ "source": [
515
+ "And train!"
516
+ ]
517
+ },
518
+ {
519
+ "cell_type": "code",
520
+ "execution_count": null,
521
+ "id": "NwI3buPOFMFk",
522
+ "metadata": {},
523
+ "outputs": [],
524
+ "source": [
525
+ "trainer_stats = trainer.train()"
526
+ ]
527
+ },
528
+ {
529
+ "cell_type": "markdown",
530
+ "id": "ITnLBLcTu2-p",
531
+ "metadata": {},
532
+ "source": [
533
+ "Show memory stats after training"
534
+ ]
535
+ },
536
+ {
537
+ "cell_type": "code",
538
+ "execution_count": null,
539
+ "id": "ftek6m4-GncK",
540
+ "metadata": {},
541
+ "outputs": [],
542
+ "source": [
543
+ "used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
544
+ "used_memory_for_lora = round(used_memory - start_gpu_memory, 3)\n",
545
+ "used_percentage = round(used_memory / max_memory * 100, 3)\n",
546
+ "lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)\n",
547
+ "\n",
548
+ "print(f\"{trainer_stats.metrics['train_runtime']} seconds used for training.\")\n",
549
+ "print(f\"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\")\n",
550
+ "print(f\"Peak reserved memory = {used_memory} GB.\")\n",
551
+ "print(f\"Peak reserved memory for training = {used_memory_for_lora} GB.\")\n",
552
+ "print(f\"Peak reserved memory % of max memory = {used_percentage} %.\")\n",
553
+ "print(f\"Peak reserved memory for training % of max memory = {lora_percentage} %.\")"
554
+ ]
555
+ },
556
+ {
557
+ "cell_type": "markdown",
558
+ "id": "O6LAwznKu7mc",
559
+ "metadata": {},
560
+ "source": [
561
+ "Let's save the trained model."
562
+ ]
563
+ },
564
+ {
565
+ "cell_type": "code",
566
+ "execution_count": null,
567
+ "id": "idVgnNS1MWPr",
568
+ "metadata": {},
569
+ "outputs": [],
570
+ "source": [
571
+ "trainer.save_model(output_dir)\n",
572
+ "trainer.push_to_hub()"
573
+ ]
574
+ },
575
+ {
576
+ "cell_type": "markdown",
577
+ "id": "707318cb",
578
+ "metadata": {},
579
+ "source": [
580
+ "## Load the fine-tuned model and run inference using `smolagents`\n",
581
+ "\n",
582
+ "After fine-tuning the model with **GRPO (TRL)** for tool calling, we can test it at inference time using **`smolagents`**, a lightweight library for running multi-step agents.\n",
583
+ "\n",
584
+ "`smolagents` handles the agent loop for us:\n",
585
+ "- Detecting tool calls generated by the model\n",
586
+ "- Executing the corresponding tools (e.g. database queries)\n",
587
+ "- Feeding the results back to the model until a final answer is produced\n",
588
+ "\n",
589
+ "> **Note** \n",
590
+ "> Using an agent framework is optional. The fine-tuned model can also be used directly with `transformers` by manually controlling the inference loop and executing the tools outside the model.\n",
591
+ "> Agent frameworks are especially useful when the number of steps or tool calls is not fixed.\n",
592
+ "\n",
593
+ "We start by installing the required package:\n"
594
+ ]
595
+ },
596
+ {
597
+ "cell_type": "code",
598
+ "execution_count": null,
599
+ "id": "aab7fd5c",
600
+ "metadata": {},
601
+ "outputs": [],
602
+ "source": [
603
+ "!pip install git+https://github.com/huggingface/smolagents.git"
604
+ ]
605
+ },
606
+ {
607
+ "cell_type": "markdown",
608
+ "id": "24453572",
609
+ "metadata": {},
610
+ "source": [
611
+ "We will use the `CodeAgent` class from `smolagents` to instantiate our agent. \n",
612
+ "First, we need to define the tool the agent can use. This is done using the `@tool` decorator.\n",
613
+ "\n",
614
+ "As shown below, the tool definition is **exactly the same** as the one used during GRPO training with TRL. This consistency is important: the model was trained to emit calls following this schema, and at inference time the agent simply executes the corresponding Python function."
615
+ ]
616
+ },
617
+ {
618
+ "cell_type": "code",
619
+ "execution_count": null,
620
+ "id": "adcbbafa",
621
+ "metadata": {},
622
+ "outputs": [],
623
+ "source": [
624
+ "from smolagents import tool\n",
625
+ "\n",
626
+ "@tool\n",
627
+ "def query_biogrid(sql_command: str) -> list[tuple]:\n",
628
+ " \"\"\"\n",
629
+ " Execute a read-only SQL query on the BioGRID database.\n",
630
+ "\n",
631
+ " BioGRID is a curated biological database that compiles protein, genetic,\n",
632
+ " and chemical interactions from multiple organisms.\n",
633
+ "\n",
634
+ " Args:\n",
635
+ " sql_command: A read-only SQL query to execute.\n",
636
+ "\n",
637
+ " Returns:\n",
638
+ " A list of tuples containing the query results.\n",
639
+ " \"\"\"\n",
640
+ " with timeout(5):\n",
641
+ " conn = sqlite3.connect(\n",
642
+ " \"file:biogrid.db?mode=ro\",\n",
643
+ " uri=True,\n",
644
+ " )\n",
645
+ " cursor = conn.cursor()\n",
646
+ " try:\n",
647
+ " cursor.execute(sql_command)\n",
648
+ " results = cursor.fetchall()\n",
649
+ " finally:\n",
650
+ " conn.close()\n",
651
+ "\n",
652
+ " return results"
653
+ ]
654
+ },
655
+ {
656
+ "cell_type": "markdown",
657
+ "id": "59721ad2",
658
+ "metadata": {},
659
+ "source": [
660
+ "Now we can instantiate the agent using our fine-tuned model and the database tool defined above.\n",
661
+ "We wrap the model with `TransformersModel` and pass both the model and the tool when creating the `CodeAgent`."
662
+ ]
663
+ },
664
+ {
665
+ "cell_type": "code",
666
+ "execution_count": null,
667
+ "id": "e9ed8d00",
668
+ "metadata": {},
669
+ "outputs": [],
670
+ "source": [
671
+ "from smolagents import TransformersModel, CodeAgent\n",
672
+ "\n",
673
+ "model = TransformersModel(model_id=\"sergiopaniego/grpo_biogrid_qwen_3g-1.7b\", apply_chat_template_kwargs={\"enable_thinking\": False})\n",
674
+ "\n",
675
+ "# Create an agent with query_biogrid as tool\n",
676
+ "agent = CodeAgent(tools=[query_biogrid], model=model)"
677
+ ]
678
+ },
679
+ {
680
+ "cell_type": "markdown",
681
+ "id": "57ba9462",
682
+ "metadata": {},
683
+ "source": [
684
+ "Finally, we run the agent by passing the full prompt (including the instruction preamble and the question), exactly as it was used during training. This ensures the agent operates under the same context and assumptions learned with GRPO, allowing it to correctly decide when to query the database and how to format the final answer."
685
+ ]
686
+ },
687
+ {
688
+ "cell_type": "code",
689
+ "execution_count": null,
690
+ "id": "23a3cdf4",
691
+ "metadata": {},
692
+ "outputs": [],
693
+ "source": [
694
+ "result = agent.run(train_dataset[0]['prompt'][0]['content'])\n",
695
+ "print(result)"
696
+ ]
697
+ }
698
+ ],
699
+ "metadata": {
700
+ "language_info": {
701
+ "name": "python"
702
+ }
703
+ },
704
+ "nbformat": 4,
705
+ "nbformat_minor": 5
706
+ }
ICL/RL/trl_source/examples/notebooks/grpo_functiongemma_browsergym_openenv.ipynb ADDED
@@ -0,0 +1,1914 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "lSR2nwdJg962"
7
+ },
8
+ "source": [
9
+ "# Fine-Tune FunctionGemma using Hugging Face TRL and OpenEnv\n",
10
+ "\n",
11
+ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_functiongemma_browsergym_openenv.ipynb)\n",
12
+ "\n",
13
+ "![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)\n",
14
+ "\n",
15
+ "This guide describes the process of fine-tuning [FunctionGemma](https://huggingface.co/google/functiongemma-270m-it) by Google DeepMind in the [BrowserGym](https://meta-pytorch.org/OpenEnv/environments/browsergym/) environment provided by OpenEnv, using Hugging Face TRL. The steps covered include:\n",
16
+ "\n",
17
+ "* What is GRPO and OpenEnv\n",
18
+ "* Setup dependencies for training\n",
19
+ "* Initialize the OpenEnv's BrowserGym environment\n",
20
+ "* Create rollout function with helpers\n",
21
+ "* Define the reward functions\n",
22
+ "* Load the custom dataset\n",
23
+ "* Fine tune using TRL and the GRPOTrainer\n",
24
+ "* Load the fine-tuned model and run inference\n",
25
+ "\n",
26
+ "> Note: The guide is designed to run on Google Colaboratory with access to an NVIDIA A100 GPU (40GB) using FunctionGemma. The workflow can be adapted to other GPU configurations, models, or environments."
27
+ ]
28
+ },
29
+ {
30
+ "cell_type": "markdown",
31
+ "metadata": {
32
+ "id": "duXYuR6Cu_na"
33
+ },
34
+ "source": [
35
+ "## What is GRPO and OpenEnv\n",
36
+ "\n",
37
+ "Group Relative Policy Optimization ([GRPO](https://huggingface.co/papers/2402.03300)) is a post-training method widely used for efficiently fine-tuning large language models. GRPO leverages reward functions to guide learning, enabling models to optimize task-specific behaviors without retraining the entire network.\n",
38
+ "\n",
39
+ "[OpenEnv](https://meta-pytorch.org/OpenEnv) provides a standard interface for interacting with agentic execution environments using simple Gymnasium-style APIs, such as `step()`, `reset()`, and `state()`. These APIs facilitate reinforcement learning training loops by allowing models to interact with environments in a structured manner. OpenEnv also offers tools for environment creators to build isolated, secure, and deployable environments that can be shared via common protocols like HTTP or packaged in Docker.\n",
40
+ "\n",
41
+ "The combination of GRPO and OpenEnv enables efficient fine-tuning of models in controlled, interactive tasks while minimizing resource requirements."
42
+ ]
43
+ },
44
+ {
45
+ "cell_type": "markdown",
46
+ "metadata": {
47
+ "id": "cpSAQkzKmv50"
48
+ },
49
+ "source": [
50
+ "## Setup dependencies for training\n",
51
+ "\n",
52
+ "Install the required libraries, including Hugging Face TRL for fine-tuning and OpenEnv for reinforcement learning environments."
53
+ ]
54
+ },
55
+ {
56
+ "cell_type": "code",
57
+ "execution_count": null,
58
+ "metadata": {
59
+ "id": "c-2drnj5BP56"
60
+ },
61
+ "outputs": [],
62
+ "source": [
63
+ "!pip install -Uq trl[vllm] git+https://huggingface.co/spaces/openenv/browsergym_env liger-kernel trackio"
64
+ ]
65
+ },
66
+ {
67
+ "cell_type": "markdown",
68
+ "metadata": {
69
+ "id": "Inxeq6ZGpRno"
70
+ },
71
+ "source": [
72
+ "A valid Hugging Face token is required to save the fine-tuned model. In Google Colab, the token can be securely accessed through Colab secrets. Otherwise, it can be provided directly in the login method. Ensure the token has write permissions to allow uploading the model to the Hugging Face Hub during training."
73
+ ]
74
+ },
75
+ {
76
+ "cell_type": "code",
77
+ "execution_count": null,
78
+ "metadata": {
79
+ "id": "C4q5UVu3BP57"
80
+ },
81
+ "outputs": [],
82
+ "source": [
83
+ "from google.colab import userdata\n",
84
+ "from huggingface_hub import login\n",
85
+ "\n",
86
+ "# Login into Hugging Face Hub\n",
87
+ "hf_token = userdata.get('HF_TOKEN') # If you are running inside a Google Colab\n",
88
+ "login(hf_token)"
89
+ ]
90
+ },
91
+ {
92
+ "cell_type": "markdown",
93
+ "metadata": {
94
+ "id": "O3kr38TGm_hb"
95
+ },
96
+ "source": [
97
+ "## Initialize the OpenEnv's BrowserGym environment\n",
98
+ "\n",
99
+ "External environments can guide the fine-tuning of LLMs for function calling by providing interactive feedback that enhances performance on task-specific behaviors.\n",
100
+ "\n",
101
+ "[BrowserGym](https://meta-pytorch.org/OpenEnv/environments/browsergym/) is a unified framework for web-based agent tasks, offering multiple benchmarks through a Gymnasium-compatible API. It enables training on simple synthetic tasks with [MiniWoB++](https://github.com/Farama-Foundation/miniwob-plusplus) and evaluation on more complex, realistic tasks with [WebArena](https://github.com/web-arena-x/webarena), [VisualWebArena](https://github.com/web-arena-x/visualwebarena), or [WorkArena](https://github.com/ServiceNow/WorkArena). This setup supports iterative training and assessment of web agents without requiring extensive infrastructure.\n",
102
+ "\n",
103
+ "BrowserGym supports both LLM and VLM training by providing visual information, including screenshots and DOM data, which can be utilized depending on the model type. This guide focuses on a simple web-based task called *\"click-test\"*, which is part of the MiniWoB++ benchmark of synthetic web tasks. Environments can be run locally, in Docker containers, or accessed remotely via the Hugging Face Hub. For this example, the remote environment [openenv/browsergym_env](https://huggingface.co/spaces/openenv/browsergym_env) will be used.\n",
104
+ "\n",
105
+ "> Note: Hosted environments on the Hub currently have limited concurrency. For higher reliability or parallel runs, duplicating the Space to your own account is strongly recommended."
106
+ ]
107
+ },
108
+ {
109
+ "cell_type": "code",
110
+ "execution_count": null,
111
+ "metadata": {
112
+ "id": "clDs-WQlBP57"
113
+ },
114
+ "outputs": [],
115
+ "source": [
116
+ "from browsergym_env import BrowserGymEnv\n",
117
+ "space_url = \"https://openenv-browsergym-env.hf.space\"\n",
118
+ "\n",
119
+ "client = BrowserGymEnv(base_url=space_url)"
120
+ ]
121
+ },
122
+ {
123
+ "cell_type": "markdown",
124
+ "metadata": {
125
+ "id": "EqfDavDQnD_5"
126
+ },
127
+ "source": [
128
+ "## Create rollout function with helpers\n",
129
+ "\n",
130
+ "The rollout function defines how the agent interacts with the environment during GRPO training. It generates model outputs, collects feedback in the form of rewards, and returns the information required for optimization.\n",
131
+ "\n",
132
+ "In this setup:\n",
133
+ "- The function is invoked automatically by the GRPOTrainer (introduced later), which orchestrates the training loop and handles policy updates.\n",
134
+ "- It uses the trainer's `generate_rollout_completions()` method for efficient output generation. This leverages vLLM, a high-performance inference engine for large language models, and is integrated within TRL to streamline rollout generation and reward collection during fine-tuning.\n",
135
+ "- Each rollout represents a complete interaction loop, where the model acts, receives feedback from the environment, and updates based on reward signals.\n",
136
+ "\n",
137
+ "Rewards capture various aspects of the agent's performance. Helper functions, such as `rollout_once`, manage individual episodes, keeping the main `rollout_func` clean, modular, and reusable.\n",
138
+ "\n",
139
+ "This modular structure allows GRPO to efficiently sample, evaluate, and refine the model's behavior through reinforcement learning.\n",
140
+ "\n",
141
+ "Before executing rollouts, a `system prompt` is defined to instruct the model on how to interact with the environment. This prompt specifies the available BrowserGym actions (such as `click`, `fill`, `send_keys`, and `scroll`), describes the page structure, and enforces that the model responds with exactly one action per step. It ensures consistent and structured interactions, guiding the model to complete tasks effectively without providing extra explanations or multiple actions."
142
+ ]
143
+ },
144
+ {
145
+ "cell_type": "code",
146
+ "execution_count": null,
147
+ "metadata": {
148
+ "id": "ItCXS6H0BP58"
149
+ },
150
+ "outputs": [],
151
+ "source": [
152
+ "# @title System prompt (click to expand)\n",
153
+ "SYSTEM_PROMPT = \"\"\"You control a web browser through BrowserGym actions.\n",
154
+ "You must complete the given web task by interacting with the page.\n",
155
+ "\n",
156
+ "Available actions:\n",
157
+ "- noop() - Do nothing\n",
158
+ "- click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
159
+ "- fill(bid, text) - Fill input field with text\n",
160
+ "- send_keys(text) - Send keyboard input\n",
161
+ "- scroll(direction) - Scroll up/down\n",
162
+ "\n",
163
+ "The page structure shows elements as: [bid] element_type 'element_text'\n",
164
+ "For example: [13] button 'Click Me!' means bid='13'\n",
165
+ "\n",
166
+ "Reply with exactly ONE action on a single line, e.g.:\n",
167
+ "click('13')\n",
168
+ "fill('42', 'hello world')\n",
169
+ "noop()\n",
170
+ "\n",
171
+ "Do not include explanations or multiple actions.\"\"\""
172
+ ]
173
+ },
174
+ {
175
+ "cell_type": "markdown",
176
+ "metadata": {
177
+ "id": "Vi1rFey39GUl"
178
+ },
179
+ "source": [
180
+ "The `rollout_func` orchestrates the interaction between the model and the remote BrowserGym environment. For each prompt in the batch, it executes a complete episode using the `rollout_once` function, collecting model outputs and rewards for GRPO optimization.\n",
181
+ "\n",
182
+ "The parameter `max_steps` defines the maximum number of steps the model can take within a single episode. This limits the length of the interaction loop, ensuring that episodes terminate even if the task is not completed, and helps maintain efficient training.\n",
183
+ "\n",
184
+ "During each episode, the function tracks prompt and completion IDs, log probabilities, and both step-wise and final rewards, returning them in a structured format for the trainer to perform policy updates."
185
+ ]
186
+ },
187
+ {
188
+ "cell_type": "code",
189
+ "execution_count": null,
190
+ "metadata": {
191
+ "id": "CgHd5CFBBP58"
192
+ },
193
+ "outputs": [],
194
+ "source": [
195
+ "from trl import GRPOTrainer\n",
196
+ "\n",
197
+ "max_steps=10\n",
198
+ "\n",
199
+ "def rollout_func(prompts: list[str], trainer: GRPOTrainer) -> dict[str, list]:\n",
200
+ " episode_prompt_ids: list[list[int]] = []\n",
201
+ " episode_completion_ids: list[list[int]] = []\n",
202
+ " episode_logprobs: list[list[float]] = []\n",
203
+ " completion_rewards: list[float] = []\n",
204
+ "\n",
205
+ " print(f\"\\n[DEBUG] rollout_func called with {len(prompts)} prompts (LLM mode, text-only)\")\n",
206
+ "\n",
207
+ " for i, prompt_text in enumerate(prompts):\n",
208
+ " print(f\"[DEBUG] Processing prompt {i + 1}/{len(prompts)}\")\n",
209
+ " episode = rollout_once(\n",
210
+ " trainer=trainer,\n",
211
+ " env=client,\n",
212
+ " tokenizer=trainer.processing_class,\n",
213
+ " dataset_prompt=prompt_text,\n",
214
+ " max_steps=max_steps,\n",
215
+ " )\n",
216
+ " episode_prompt_ids.append(episode[\"prompt_ids\"])\n",
217
+ " episode_completion_ids.append(episode[\"completion_ids\"])\n",
218
+ " episode_logprobs.append(episode[\"logprobs\"])\n",
219
+ " completion_rewards.append(episode[\"completion_reward\"])\n",
220
+ "\n",
221
+ " return {\n",
222
+ " \"prompt_ids\": episode_prompt_ids,\n",
223
+ " \"completion_ids\": episode_completion_ids,\n",
224
+ " \"logprobs\": episode_logprobs,\n",
225
+ " \"completion_reward\": completion_rewards,\n",
226
+ " }"
227
+ ]
228
+ },
229
+ {
230
+ "cell_type": "markdown",
231
+ "metadata": {
232
+ "id": "ioUHdIxr9ZQO"
233
+ },
234
+ "source": [
235
+ "### Define `rollout_once`\n",
236
+ "\n",
237
+ "The `rollout_once` function runs one complete interaction loop between the model and the BrowserGym environment using the trainer's generation method. \n",
238
+ "It executes a single episode, from generating an action to receiving feedback and computing rewards.\n",
239
+ "\n",
240
+ "Here's the step-by-step breakdown:\n",
241
+ "\n",
242
+ "1. Environment reset: Start a new BrowserGym session and initialize the observation.\n",
243
+ "2. Prompt construction: Combine the system prompt, environment observation (text-only via the accessibility tree), and any relevant errors or state information to form the model input.\n",
244
+ "3. Generation: Use `trl.experimental.openenv.generate_rollout_completions()` to produce the model's action efficiently with vLLM.\n",
245
+ "4. Action parsing and execution: Interpret the model's output and execute the corresponding BrowserGym action (e.g., `click`, `fill`, `scroll`).\n",
246
+ "5. Reward calculation: Track step-wise rewards provided by the environment and compute completion rewards based on task success or failure.\n",
247
+ "6. Return structured rollout data: Includes prompt/completion IDs, log probabilities, step rewards, and the final reward for the episode.\n",
248
+ "\n",
249
+ "This modular design allows each episode to be processed independently while providing rich feedback for the GRPO training loop, supporting both task completion and intermediate reward shaping."
250
+ ]
251
+ },
252
+ {
253
+ "cell_type": "code",
254
+ "execution_count": null,
255
+ "metadata": {
256
+ "id": "y8Ml47SYBP58"
257
+ },
258
+ "outputs": [],
259
+ "source": [
260
+ "from trl.experimental.openenv import generate_rollout_completions\n",
261
+ "from browsergym_env import BrowserGymAction\n",
262
+ "from transformers import AutoTokenizer\n",
263
+ "\n",
264
+ "def rollout_once(\n",
265
+ " trainer: GRPOTrainer,\n",
266
+ " env: BrowserGymEnv,\n",
267
+ " tokenizer: AutoTokenizer,\n",
268
+ " dataset_prompt: str,\n",
269
+ " max_steps: int,\n",
270
+ ") -> dict[str, list]:\n",
271
+ " \"\"\"Run one episode and collect training data (text-only, no screenshots).\"\"\"\n",
272
+ " result = env.reset()\n",
273
+ " observation = result.observation\n",
274
+ "\n",
275
+ " prompt_ids: list[int] = []\n",
276
+ " completion_ids: list[int] = []\n",
277
+ " logprobs: list[float] = []\n",
278
+ " step_rewards: list[float] = []\n",
279
+ " completion_rewards: list[float] = []\n",
280
+ "\n",
281
+ " for step_num in range(max_steps):\n",
282
+ " if result.done:\n",
283
+ " break\n",
284
+ "\n",
285
+ " # Create prompt from observation (text-only using accessibility tree)\n",
286
+ " goal = observation.goal or dataset_prompt\n",
287
+ " axtree = observation.axtree_txt or \"\"\n",
288
+ " error = observation.error if observation.last_action_error else \"\"\n",
289
+ "\n",
290
+ " user_prompt = make_user_prompt(goal, step_num, axtree, error)\n",
291
+ " messages = [\n",
292
+ " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
293
+ " {\"role\": \"user\", \"content\": user_prompt},\n",
294
+ " ]\n",
295
+ " prompt_text = tokenizer.apply_chat_template(\n",
296
+ " messages,\n",
297
+ " add_generation_prompt=True,\n",
298
+ " tokenize=False,\n",
299
+ " )\n",
300
+ "\n",
301
+ " # Generate action with vLLM\n",
302
+ " rollout_outputs = generate_rollout_completions(trainer, [prompt_text])[0]\n",
303
+ " prompt_ids.extend(rollout_outputs[\"prompt_ids\"])\n",
304
+ " completion_ids.extend(rollout_outputs[\"completion_ids\"])\n",
305
+ " logprobs.extend(rollout_outputs[\"logprobs\"])\n",
306
+ "\n",
307
+ " completion_text = rollout_outputs.get(\"text\") or tokenizer.decode(\n",
308
+ " rollout_outputs[\"completion_ids\"], skip_special_tokens=True\n",
309
+ " )\n",
310
+ "\n",
311
+ " # Parse and execute action\n",
312
+ " action_str = parse_action(completion_text)\n",
313
+ "\n",
314
+ " print(f\"Step {step_num + 1}: {action_str}\")\n",
315
+ "\n",
316
+ " # Take action in environment\n",
317
+ " result = env.step(BrowserGymAction(action_str=action_str))\n",
318
+ " observation = result.observation\n",
319
+ "\n",
320
+ " # Track rewards\n",
321
+ " step_reward = float(result.reward or 0.0)\n",
322
+ " step_rewards.append(step_reward)\n",
323
+ "\n",
324
+ " # Reward shaping: success is most important\n",
325
+ " if result.done and step_reward > 0:\n",
326
+ " completion_rewards.append(1.0) # Task completed successfully\n",
327
+ " elif result.done and step_reward == 0:\n",
328
+ " completion_rewards.append(0.0) # Task failed\n",
329
+ " else:\n",
330
+ " completion_rewards.append(step_reward) # Intermediate reward\n",
331
+ "\n",
332
+ " # Final reward is based on task completion\n",
333
+ " final_reward = completion_rewards[-1] if completion_rewards else 0.0\n",
334
+ "\n",
335
+ " return {\n",
336
+ " \"prompt_ids\": prompt_ids,\n",
337
+ " \"completion_ids\": completion_ids,\n",
338
+ " \"logprobs\": logprobs,\n",
339
+ " \"step_rewards\": step_rewards,\n",
340
+ " \"completion_reward\": final_reward,\n",
341
+ " }"
342
+ ]
343
+ },
344
+ {
345
+ "cell_type": "markdown",
346
+ "metadata": {
347
+ "id": "MDJKMQ__8qzj"
348
+ },
349
+ "source": [
350
+ "### Helper functions\n",
351
+ "\n",
352
+ "Supporting utilities used in `rollout_once`:\n",
353
+ "\n",
354
+ "- `make_user_prompt`: builds the user prompt combining the base text and previous game messages.\n",
355
+ "- `parse_action`: parses BrowserGym action from model response"
356
+ ]
357
+ },
358
+ {
359
+ "cell_type": "code",
360
+ "execution_count": null,
361
+ "metadata": {
362
+ "id": "GG4ba41PBP58"
363
+ },
364
+ "outputs": [],
365
+ "source": [
366
+ "# @title Helpers (click to expand)\n",
367
+ "def make_user_prompt(goal: str, step_num: int, axtree: str, error: str = \"\") -> str:\n",
368
+ " \"\"\"Create user prompt from observation.\"\"\"\n",
369
+ " prompt_parts = [f\"Step {step_num + 1}\"]\n",
370
+ "\n",
371
+ " if goal:\n",
372
+ " prompt_parts.append(f\"Goal: {goal}\")\n",
373
+ "\n",
374
+ " if error:\n",
375
+ " prompt_parts.append(f\"Previous action error: {error}\")\n",
376
+ "\n",
377
+ " # Include accessibility tree (truncated for context)\n",
378
+ " if axtree:\n",
379
+ " max_len = 2000\n",
380
+ " axtree_truncated = axtree[:max_len] + \"...\" if len(axtree) > max_len else axtree\n",
381
+ " prompt_parts.append(f\"Page structure:\\n{axtree_truncated}\")\n",
382
+ "\n",
383
+ " prompt_parts.append(\"What action do you take?\")\n",
384
+ "\n",
385
+ " return \"\\n\\n\".join(prompt_parts)\n",
386
+ "\n",
387
+ "\n",
388
+ "def parse_action(response_text: str) -> str:\n",
389
+ " \"\"\"Parse BrowserGym action from model response.\"\"\"\n",
390
+ " # Extract first line that looks like an action\n",
391
+ " for line in response_text.strip().split(\"\\n\"):\n",
392
+ " line = line.strip()\n",
393
+ " if \"(\" in line and \")\" in line:\n",
394
+ " return line\n",
395
+ "\n",
396
+ " # Fallback to noop if no valid action found\n",
397
+ " return \"noop()\""
398
+ ]
399
+ },
400
+ {
401
+ "cell_type": "markdown",
402
+ "metadata": {
403
+ "id": "Oek3JhcWnKhw"
404
+ },
405
+ "source": [
406
+ "## Define the reward functions\n",
407
+ "\n",
408
+ "Reward functions quantify the model's performance in the environment and guide the GRPO optimization process.\n",
409
+ "\n",
410
+ "In this setup, the `reward_completion` function assigns rewards based on task completion. It extracts the final reward for each episode, which indicates whether the agent successfully completed the task. If no reward information is available, it defaults to zero.\n",
411
+ "\n",
412
+ "This modular approach allows additional reward functions to be added easily, enabling more granular feedback such as intermediate progress, efficiency, or correctness of actions, depending on the task requirements."
413
+ ]
414
+ },
415
+ {
416
+ "cell_type": "code",
417
+ "execution_count": null,
418
+ "metadata": {
419
+ "id": "WxkXaz5aBP59"
420
+ },
421
+ "outputs": [],
422
+ "source": [
423
+ "def reward_completion(completions: list[str], **kwargs) -> list[float]:\n",
424
+ " \"\"\"Reward for task completion.\"\"\"\n",
425
+ " rewards = kwargs.get(\"completion_reward\") if kwargs else None\n",
426
+ " if rewards is None:\n",
427
+ " return [0.0 for _ in completions]\n",
428
+ " return [float(r) for r in rewards]"
429
+ ]
430
+ },
431
+ {
432
+ "cell_type": "markdown",
433
+ "metadata": {
434
+ "id": "66ZsrLplm07U"
435
+ },
436
+ "source": [
437
+ "## Load the custom dataset\n",
438
+ "\n",
439
+ "The dataset is constructed with repeated prompts to control the total number of training episodes.\n",
440
+ "\n",
441
+ "Each entry in the dataset triggers a single rollout episode during training. The `dataset_prompt` provides the initial instruction to the model at the start of each episode, ensuring consistent guidance for task execution."
442
+ ]
443
+ },
444
+ {
445
+ "cell_type": "code",
446
+ "execution_count": null,
447
+ "metadata": {
448
+ "id": "UX6jUjxaBP59"
449
+ },
450
+ "outputs": [],
451
+ "source": [
452
+ "from datasets import Dataset\n",
453
+ "\n",
454
+ "dataset_prompt = \"Complete the web task successfully.\"\n",
455
+ "dataset_size = 1000\n",
456
+ "\n",
457
+ "dataset = Dataset.from_dict({\"prompt\": [dataset_prompt] * dataset_size})"
458
+ ]
459
+ },
460
+ {
461
+ "cell_type": "markdown",
462
+ "metadata": {
463
+ "id": "-mvka-96m3I7"
464
+ },
465
+ "source": [
466
+ "## Fine-tune using TRL and the GRPOTrainer\n",
467
+ "\n",
468
+ "The next step is to define the GRPOConfig, which sets all key training parameters.\n",
469
+ "\n",
470
+ "This configuration determines how the model interacts with vLLM, handles memory and computation, and records training metrics and logs for monitoring the fine-tuning process."
471
+ ]
472
+ },
473
+ {
474
+ "cell_type": "code",
475
+ "execution_count": null,
476
+ "metadata": {
477
+ "id": "TZ34a1h-BP59"
478
+ },
479
+ "outputs": [],
480
+ "source": [
481
+ "from trl import GRPOConfig\n",
482
+ "output_dir = \"browsergym-grpo-functiongemma-270m-it\"\n",
483
+ "\n",
484
+ "grpo_config = GRPOConfig(\n",
485
+ " # num_train_epochs=1, # Number of times to iterate over the full dataset (use for full training runs)\n",
486
+ " max_steps=100, # Number of dataset passes (for shorter runs/testing). For full trainings, use `num_train_epochs` instead\n",
487
+ " learning_rate=5e-6, # Learning rate for the optimizer\n",
488
+ " warmup_steps=10, # Number of steps to linearly increase learning rate at the start of training\n",
489
+ "\n",
490
+ " per_device_train_batch_size=1, # Number of samples per device per step\n",
491
+ " num_generations=4, # Number of completions to generate per prompt\n",
492
+ " generation_batch_size=4, # Batch size used during generation (must be divisible by num_generations)\n",
493
+ " max_completion_length=32, # Maximum length of generated completions\n",
494
+ "\n",
495
+ " use_vllm=True, # Use vLLM engine for fast inference\n",
496
+ " vllm_mode=\"colocate\", # vLLM mode: \"colocate\" runs generation on the same GPU as training\n",
497
+ " vllm_gpu_memory_utilization=0.1, # Fraction of GPU memory allocated to vLLM\n",
498
+ "\n",
499
+ " output_dir=str(output_dir), # Directory where checkpoints, logs, and outputs will be saved\n",
500
+ " logging_steps=1, # Log metrics every N steps\n",
501
+ " report_to=\"trackio\", # Logging/reporting platform (e.g., \"trackio\")\n",
502
+ " trackio_space_id=output_dir, # HF Space where the experiment tracking will be saved\n",
503
+ " push_to_hub=True, # Optionally push trained model to Hugging Face Hub\n",
504
+ "\n",
505
+ " use_liger_kernel=True, # Enable Liger kernel optimizations for faster training\n",
506
+ ")\n"
507
+ ]
508
+ },
509
+ {
510
+ "cell_type": "markdown",
511
+ "metadata": {
512
+ "id": "a1taGmD--0Y4"
513
+ },
514
+ "source": [
515
+ "The next step is to initialize the GRPOTrainer, which manages the complete reinforcement learning loop.\n",
516
+ "\n",
517
+ "It receives the model name, reward functions, rollout function, and dataset defined earlier. From the model name, the trainer automatically initializes the model and tokenizer. It then coordinates interactions between the model and the environment, applies the defined reward signals, and updates the policy during training.\n",
518
+ "\n",
519
+ "Finally, calling `trainer.train()` starts the fine-tuning process, enabling the model to progressively improve its performance through iterative interaction and reinforcement learning.\n",
520
+ "\n",
521
+ "> Note: The training pipeline uses approximately 10.6 GB of GPU VRAM and can be adapted to different hardware configurations."
522
+ ]
523
+ },
524
+ {
525
+ "cell_type": "code",
526
+ "execution_count": null,
527
+ "metadata": {
528
+ "id": "En43o4NZBP59"
529
+ },
530
+ "outputs": [],
531
+ "source": [
532
+ "model_name = \"google/functiongemma-270m-it\""
533
+ ]
534
+ },
535
+ {
536
+ "cell_type": "code",
537
+ "execution_count": null,
538
+ "metadata": {
539
+ "colab": {
540
+ "referenced_widgets": [
541
+ "047d386e54704add95edd4beace781d7"
542
+ ]
543
+ },
544
+ "id": "k8-SvqJcBP59",
545
+ "outputId": "6a4d9276-fc91-4217-d3a2-51a18d222338"
546
+ },
547
+ "outputs": [
548
+ {
549
+ "name": "stderr",
550
+ "output_type": "stream",
551
+ "text": [
552
+ "/tmp/ipython-input-3830121904.py:1: UserWarning: You are importing from 'rollout_func', which is an experimental feature. This API may change or be removed at any time without prior notice. Silence this warning by setting environment variable TRL_EXPERIMENTAL_SILENCE=1.\n",
553
+ " trainer = GRPOTrainer(\n",
554
+ "The model is already on multiple devices. Skipping the move to device specified in `args`.\n",
555
+ "`torch_dtype` is deprecated! Use `dtype` instead!\n"
556
+ ]
557
+ },
558
+ {
559
+ "data": {
560
+ "application/vnd.jupyter.widget-view+json": {
561
+ "model_id": "047d386e54704add95edd4beace781d7",
562
+ "version_major": 2,
563
+ "version_minor": 0
564
+ },
565
+ "text/plain": [
566
+ "Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]\n"
567
+ ]
568
+ },
569
+ "metadata": {},
570
+ "output_type": "display_data"
571
+ },
572
+ {
573
+ "name": "stderr",
574
+ "output_type": "stream",
575
+ "text": [
576
+ "Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 4/4 [00:00<00:00, 19.64it/s]\n"
577
+ ]
578
+ }
579
+ ],
580
+ "source": [
581
+ "trainer = GRPOTrainer(\n",
582
+ " model=model_name,\n",
583
+ " reward_funcs=[reward_completion],\n",
584
+ " train_dataset=dataset,\n",
585
+ " args=grpo_config,\n",
586
+ " rollout_func=rollout_func,\n",
587
+ ")"
588
+ ]
589
+ },
590
+ {
591
+ "cell_type": "code",
592
+ "execution_count": null,
593
+ "metadata": {
594
+ "id": "e1PrBB7gBP59",
595
+ "outputId": "61740a89-228c-4b3c-8e59-b4a3eb972c03"
596
+ },
597
+ "outputs": [
598
+ {
599
+ "name": "stderr",
600
+ "output_type": "stream",
601
+ "text": [
602
+ "The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 2, 'pad_token_id': 0}.\n"
603
+ ]
604
+ },
605
+ {
606
+ "name": "stdout",
607
+ "output_type": "stream",
608
+ "text": [
609
+ "* Trackio project initialized: huggingface\n",
610
+ "* Trackio metrics will be synced to Hugging Face Dataset: sergiopaniego/browsergym-grpo-functiongemma-270m-it-dataset\n",
611
+ "* Creating new space: https://huggingface.co/spaces/sergiopaniego/browsergym-grpo-functiongemma-270m-it\n",
612
+ "* View dashboard by going to: https://sergiopaniego-browsergym-grpo-functiongemma-270m-it.hf.space/\n"
613
+ ]
614
+ },
615
+ {
616
+ "data": {
617
+ "text/html": [
618
+ "<div><iframe src=\"https://sergiopaniego-browsergym-grpo-functiongemma-270m-it.hf.space/\" width=\"100%\" height=\"1000px\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
619
+ ],
620
+ "text/plain": [
621
+ "<IPython.core.display.HTML object>"
622
+ ]
623
+ },
624
+ "metadata": {},
625
+ "output_type": "display_data"
626
+ },
627
+ {
628
+ "name": "stdout",
629
+ "output_type": "stream",
630
+ "text": [
631
+ "* Created new run: sergiopaniego-1765969078\n",
632
+ "\n",
633
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
634
+ "[DEBUG] Processing prompt 1/4\n",
635
+ "Step 1: noop()\n",
636
+ "Step 2: noop()\n",
637
+ "Step 3: noop()\n",
638
+ "Step 4: noop()\n",
639
+ "Step 5: noop()\n",
640
+ "Step 6: noop()\n",
641
+ "Step 7: Click 'click(bid) - Click element with BrowserGym ID (the number in brackets\n",
642
+ "Step 8: I will use the action `click()` to click the button.\n",
643
+ "Step 9: noop()\n",
644
+ "Step 10: Click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
645
+ "[DEBUG] Processing prompt 2/4\n",
646
+ "Step 1: noop()\n",
647
+ "Step 2: noop()\n",
648
+ "Step 3: Clicks ('13')\n",
649
+ "Step 4: I will click 'Click Me!' using action 'click(bid)' on page 'Click Test Task' using a bid of '13'.\n",
650
+ "Step 5: noop()\n",
651
+ "Step 6: noop()\n",
652
+ "Step 7: noop()\n",
653
+ "Step 8: noop()\n",
654
+ "Step 9: noop()\n",
655
+ "Step 10: noop()\n",
656
+ "[DEBUG] Processing prompt 3/4\n",
657
+ "Step 1: I will use the 'click(bid)' action.\n",
658
+ "Step 2: mouse_click(bid)\n",
659
+ "Step 3: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
660
+ "Step 4: Add action 'click(bid)' to Step 4.\n",
661
+ "Step 5: Click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
662
+ "Step 6: noop()\n",
663
+ "Step 7: noop()\n",
664
+ "Step 8: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
665
+ "Step 9: noop()\n",
666
+ "Step 10: Click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
667
+ "[DEBUG] Processing prompt 4/4\n",
668
+ "Step 1: noop()\n",
669
+ "Step 2: noop()\n",
670
+ "Step 3: noop()\n",
671
+ "Step 4: noop()\n",
672
+ "Step 5: Click('13')\n",
673
+ "Step 6: noop()\n",
674
+ "Step 7: noop()\n",
675
+ "Step 8: noop()\n",
676
+ "Step 9: noop()\n",
677
+ "Step 10: noop()\n"
678
+ ]
679
+ },
680
+ {
681
+ "name": "stderr",
682
+ "output_type": "stream",
683
+ "text": [
684
+ "WARNING:liger_kernel.transformers.model.gemma3:It is strongly recommended to train Gemma3 models with the `eager` attention implementation instead of `sdpa`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.\n",
685
+ "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py:282: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.\n",
686
+ " warnings.warn(\n",
687
+ "/usr/local/lib/python3.12/dist-packages/torch/_inductor/lowering.py:7095: UserWarning: \n",
688
+ "Online softmax is disabled on the fly since Inductor decides to\n",
689
+ "split the reduction. Cut an issue to PyTorch if this is an\n",
690
+ "important use case and you want to speed it up with online\n",
691
+ "softmax.\n",
692
+ "\n",
693
+ " warnings.warn(\n"
694
+ ]
695
+ },
696
+ {
697
+ "data": {
698
+ "text/html": [
699
+ "\n",
700
+ " <div>\n",
701
+ " \n",
702
+ " <progress value='100' max='100' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
703
+ " [100/100 35:02, Epoch 0/1]\n",
704
+ " </div>\n",
705
+ " <table border=\"1\" class=\"dataframe\">\n",
706
+ " <thead>\n",
707
+ " <tr style=\"text-align: left;\">\n",
708
+ " <th>Step</th>\n",
709
+ " <th>Training Loss</th>\n",
710
+ " </tr>\n",
711
+ " </thead>\n",
712
+ " <tbody>\n",
713
+ " <tr>\n",
714
+ " <td>1</td>\n",
715
+ " <td>0.000000</td>\n",
716
+ " </tr>\n",
717
+ " <tr>\n",
718
+ " <td>2</td>\n",
719
+ " <td>0.000000</td>\n",
720
+ " </tr>\n",
721
+ " <tr>\n",
722
+ " <td>3</td>\n",
723
+ " <td>0.000000</td>\n",
724
+ " </tr>\n",
725
+ " <tr>\n",
726
+ " <td>4</td>\n",
727
+ " <td>0.000000</td>\n",
728
+ " </tr>\n",
729
+ " <tr>\n",
730
+ " <td>5</td>\n",
731
+ " <td>0.000000</td>\n",
732
+ " </tr>\n",
733
+ " <tr>\n",
734
+ " <td>6</td>\n",
735
+ " <td>0.000000</td>\n",
736
+ " </tr>\n",
737
+ " <tr>\n",
738
+ " <td>7</td>\n",
739
+ " <td>0.000000</td>\n",
740
+ " </tr>\n",
741
+ " <tr>\n",
742
+ " <td>8</td>\n",
743
+ " <td>0.000000</td>\n",
744
+ " </tr>\n",
745
+ " <tr>\n",
746
+ " <td>9</td>\n",
747
+ " <td>-0.877900</td>\n",
748
+ " </tr>\n",
749
+ " <tr>\n",
750
+ " <td>10</td>\n",
751
+ " <td>1965.894400</td>\n",
752
+ " </tr>\n",
753
+ " <tr>\n",
754
+ " <td>11</td>\n",
755
+ " <td>-0.830900</td>\n",
756
+ " </tr>\n",
757
+ " <tr>\n",
758
+ " <td>12</td>\n",
759
+ " <td>10.616100</td>\n",
760
+ " </tr>\n",
761
+ " <tr>\n",
762
+ " <td>13</td>\n",
763
+ " <td>0.000000</td>\n",
764
+ " </tr>\n",
765
+ " <tr>\n",
766
+ " <td>14</td>\n",
767
+ " <td>0.000000</td>\n",
768
+ " </tr>\n",
769
+ " <tr>\n",
770
+ " <td>15</td>\n",
771
+ " <td>0.000000</td>\n",
772
+ " </tr>\n",
773
+ " <tr>\n",
774
+ " <td>16</td>\n",
775
+ " <td>0.000000</td>\n",
776
+ " </tr>\n",
777
+ " <tr>\n",
778
+ " <td>17</td>\n",
779
+ " <td>2.320100</td>\n",
780
+ " </tr>\n",
781
+ " <tr>\n",
782
+ " <td>18</td>\n",
783
+ " <td>1.887500</td>\n",
784
+ " </tr>\n",
785
+ " <tr>\n",
786
+ " <td>19</td>\n",
787
+ " <td>-0.691600</td>\n",
788
+ " </tr>\n",
789
+ " <tr>\n",
790
+ " <td>20</td>\n",
791
+ " <td>-0.764400</td>\n",
792
+ " </tr>\n",
793
+ " <tr>\n",
794
+ " <td>21</td>\n",
795
+ " <td>0.000000</td>\n",
796
+ " </tr>\n",
797
+ " <tr>\n",
798
+ " <td>22</td>\n",
799
+ " <td>0.000000</td>\n",
800
+ " </tr>\n",
801
+ " <tr>\n",
802
+ " <td>23</td>\n",
803
+ " <td>0.000000</td>\n",
804
+ " </tr>\n",
805
+ " <tr>\n",
806
+ " <td>24</td>\n",
807
+ " <td>0.000000</td>\n",
808
+ " </tr>\n",
809
+ " <tr>\n",
810
+ " <td>25</td>\n",
811
+ " <td>0.000000</td>\n",
812
+ " </tr>\n",
813
+ " <tr>\n",
814
+ " <td>26</td>\n",
815
+ " <td>0.000000</td>\n",
816
+ " </tr>\n",
817
+ " <tr>\n",
818
+ " <td>27</td>\n",
819
+ " <td>0.000000</td>\n",
820
+ " </tr>\n",
821
+ " <tr>\n",
822
+ " <td>28</td>\n",
823
+ " <td>0.000000</td>\n",
824
+ " </tr>\n",
825
+ " <tr>\n",
826
+ " <td>29</td>\n",
827
+ " <td>0.000000</td>\n",
828
+ " </tr>\n",
829
+ " <tr>\n",
830
+ " <td>30</td>\n",
831
+ " <td>0.000000</td>\n",
832
+ " </tr>\n",
833
+ " <tr>\n",
834
+ " <td>31</td>\n",
835
+ " <td>0.000000</td>\n",
836
+ " </tr>\n",
837
+ " <tr>\n",
838
+ " <td>32</td>\n",
839
+ " <td>0.000000</td>\n",
840
+ " </tr>\n",
841
+ " <tr>\n",
842
+ " <td>33</td>\n",
843
+ " <td>0.000000</td>\n",
844
+ " </tr>\n",
845
+ " <tr>\n",
846
+ " <td>34</td>\n",
847
+ " <td>0.000000</td>\n",
848
+ " </tr>\n",
849
+ " <tr>\n",
850
+ " <td>35</td>\n",
851
+ " <td>0.000000</td>\n",
852
+ " </tr>\n",
853
+ " <tr>\n",
854
+ " <td>36</td>\n",
855
+ " <td>0.000000</td>\n",
856
+ " </tr>\n",
857
+ " <tr>\n",
858
+ " <td>37</td>\n",
859
+ " <td>0.000000</td>\n",
860
+ " </tr>\n",
861
+ " <tr>\n",
862
+ " <td>38</td>\n",
863
+ " <td>0.000000</td>\n",
864
+ " </tr>\n",
865
+ " <tr>\n",
866
+ " <td>39</td>\n",
867
+ " <td>0.000000</td>\n",
868
+ " </tr>\n",
869
+ " <tr>\n",
870
+ " <td>40</td>\n",
871
+ " <td>0.000000</td>\n",
872
+ " </tr>\n",
873
+ " <tr>\n",
874
+ " <td>41</td>\n",
875
+ " <td>0.000000</td>\n",
876
+ " </tr>\n",
877
+ " <tr>\n",
878
+ " <td>42</td>\n",
879
+ " <td>0.000000</td>\n",
880
+ " </tr>\n",
881
+ " <tr>\n",
882
+ " <td>43</td>\n",
883
+ " <td>0.000000</td>\n",
884
+ " </tr>\n",
885
+ " <tr>\n",
886
+ " <td>44</td>\n",
887
+ " <td>0.000000</td>\n",
888
+ " </tr>\n",
889
+ " <tr>\n",
890
+ " <td>45</td>\n",
891
+ " <td>0.000000</td>\n",
892
+ " </tr>\n",
893
+ " <tr>\n",
894
+ " <td>46</td>\n",
895
+ " <td>0.000000</td>\n",
896
+ " </tr>\n",
897
+ " <tr>\n",
898
+ " <td>47</td>\n",
899
+ " <td>0.000000</td>\n",
900
+ " </tr>\n",
901
+ " <tr>\n",
902
+ " <td>48</td>\n",
903
+ " <td>0.000000</td>\n",
904
+ " </tr>\n",
905
+ " <tr>\n",
906
+ " <td>49</td>\n",
907
+ " <td>0.000000</td>\n",
908
+ " </tr>\n",
909
+ " <tr>\n",
910
+ " <td>50</td>\n",
911
+ " <td>0.000000</td>\n",
912
+ " </tr>\n",
913
+ " <tr>\n",
914
+ " <td>51</td>\n",
915
+ " <td>0.000000</td>\n",
916
+ " </tr>\n",
917
+ " <tr>\n",
918
+ " <td>52</td>\n",
919
+ " <td>0.000000</td>\n",
920
+ " </tr>\n",
921
+ " <tr>\n",
922
+ " <td>53</td>\n",
923
+ " <td>0.000000</td>\n",
924
+ " </tr>\n",
925
+ " <tr>\n",
926
+ " <td>54</td>\n",
927
+ " <td>0.000000</td>\n",
928
+ " </tr>\n",
929
+ " <tr>\n",
930
+ " <td>55</td>\n",
931
+ " <td>0.000000</td>\n",
932
+ " </tr>\n",
933
+ " <tr>\n",
934
+ " <td>56</td>\n",
935
+ " <td>0.000000</td>\n",
936
+ " </tr>\n",
937
+ " <tr>\n",
938
+ " <td>57</td>\n",
939
+ " <td>0.000000</td>\n",
940
+ " </tr>\n",
941
+ " <tr>\n",
942
+ " <td>58</td>\n",
943
+ " <td>0.000000</td>\n",
944
+ " </tr>\n",
945
+ " <tr>\n",
946
+ " <td>59</td>\n",
947
+ " <td>0.000000</td>\n",
948
+ " </tr>\n",
949
+ " <tr>\n",
950
+ " <td>60</td>\n",
951
+ " <td>0.000000</td>\n",
952
+ " </tr>\n",
953
+ " <tr>\n",
954
+ " <td>61</td>\n",
955
+ " <td>0.000000</td>\n",
956
+ " </tr>\n",
957
+ " <tr>\n",
958
+ " <td>62</td>\n",
959
+ " <td>0.000000</td>\n",
960
+ " </tr>\n",
961
+ " <tr>\n",
962
+ " <td>63</td>\n",
963
+ " <td>0.000000</td>\n",
964
+ " </tr>\n",
965
+ " <tr>\n",
966
+ " <td>64</td>\n",
967
+ " <td>0.000000</td>\n",
968
+ " </tr>\n",
969
+ " <tr>\n",
970
+ " <td>65</td>\n",
971
+ " <td>0.000000</td>\n",
972
+ " </tr>\n",
973
+ " <tr>\n",
974
+ " <td>66</td>\n",
975
+ " <td>0.000000</td>\n",
976
+ " </tr>\n",
977
+ " <tr>\n",
978
+ " <td>67</td>\n",
979
+ " <td>0.000000</td>\n",
980
+ " </tr>\n",
981
+ " <tr>\n",
982
+ " <td>68</td>\n",
983
+ " <td>0.000000</td>\n",
984
+ " </tr>\n",
985
+ " <tr>\n",
986
+ " <td>69</td>\n",
987
+ " <td>0.000000</td>\n",
988
+ " </tr>\n",
989
+ " <tr>\n",
990
+ " <td>70</td>\n",
991
+ " <td>0.000000</td>\n",
992
+ " </tr>\n",
993
+ " <tr>\n",
994
+ " <td>71</td>\n",
995
+ " <td>0.000000</td>\n",
996
+ " </tr>\n",
997
+ " <tr>\n",
998
+ " <td>72</td>\n",
999
+ " <td>0.000000</td>\n",
1000
+ " </tr>\n",
1001
+ " <tr>\n",
1002
+ " <td>73</td>\n",
1003
+ " <td>0.000000</td>\n",
1004
+ " </tr>\n",
1005
+ " <tr>\n",
1006
+ " <td>74</td>\n",
1007
+ " <td>0.000000</td>\n",
1008
+ " </tr>\n",
1009
+ " <tr>\n",
1010
+ " <td>75</td>\n",
1011
+ " <td>0.000000</td>\n",
1012
+ " </tr>\n",
1013
+ " <tr>\n",
1014
+ " <td>76</td>\n",
1015
+ " <td>0.000000</td>\n",
1016
+ " </tr>\n",
1017
+ " <tr>\n",
1018
+ " <td>77</td>\n",
1019
+ " <td>0.000000</td>\n",
1020
+ " </tr>\n",
1021
+ " <tr>\n",
1022
+ " <td>78</td>\n",
1023
+ " <td>0.000000</td>\n",
1024
+ " </tr>\n",
1025
+ " <tr>\n",
1026
+ " <td>79</td>\n",
1027
+ " <td>0.000000</td>\n",
1028
+ " </tr>\n",
1029
+ " <tr>\n",
1030
+ " <td>80</td>\n",
1031
+ " <td>0.000000</td>\n",
1032
+ " </tr>\n",
1033
+ " <tr>\n",
1034
+ " <td>81</td>\n",
1035
+ " <td>0.000000</td>\n",
1036
+ " </tr>\n",
1037
+ " <tr>\n",
1038
+ " <td>82</td>\n",
1039
+ " <td>0.000000</td>\n",
1040
+ " </tr>\n",
1041
+ " <tr>\n",
1042
+ " <td>83</td>\n",
1043
+ " <td>0.000000</td>\n",
1044
+ " </tr>\n",
1045
+ " <tr>\n",
1046
+ " <td>84</td>\n",
1047
+ " <td>0.000000</td>\n",
1048
+ " </tr>\n",
1049
+ " <tr>\n",
1050
+ " <td>85</td>\n",
1051
+ " <td>0.000000</td>\n",
1052
+ " </tr>\n",
1053
+ " <tr>\n",
1054
+ " <td>86</td>\n",
1055
+ " <td>0.000000</td>\n",
1056
+ " </tr>\n",
1057
+ " <tr>\n",
1058
+ " <td>87</td>\n",
1059
+ " <td>0.000000</td>\n",
1060
+ " </tr>\n",
1061
+ " <tr>\n",
1062
+ " <td>88</td>\n",
1063
+ " <td>0.000000</td>\n",
1064
+ " </tr>\n",
1065
+ " <tr>\n",
1066
+ " <td>89</td>\n",
1067
+ " <td>0.000000</td>\n",
1068
+ " </tr>\n",
1069
+ " <tr>\n",
1070
+ " <td>90</td>\n",
1071
+ " <td>0.000000</td>\n",
1072
+ " </tr>\n",
1073
+ " <tr>\n",
1074
+ " <td>91</td>\n",
1075
+ " <td>0.000000</td>\n",
1076
+ " </tr>\n",
1077
+ " <tr>\n",
1078
+ " <td>92</td>\n",
1079
+ " <td>0.000000</td>\n",
1080
+ " </tr>\n",
1081
+ " <tr>\n",
1082
+ " <td>93</td>\n",
1083
+ " <td>0.000000</td>\n",
1084
+ " </tr>\n",
1085
+ " <tr>\n",
1086
+ " <td>94</td>\n",
1087
+ " <td>0.000000</td>\n",
1088
+ " </tr>\n",
1089
+ " <tr>\n",
1090
+ " <td>95</td>\n",
1091
+ " <td>0.000000</td>\n",
1092
+ " </tr>\n",
1093
+ " <tr>\n",
1094
+ " <td>96</td>\n",
1095
+ " <td>0.000000</td>\n",
1096
+ " </tr>\n",
1097
+ " <tr>\n",
1098
+ " <td>97</td>\n",
1099
+ " <td>0.000000</td>\n",
1100
+ " </tr>\n",
1101
+ " <tr>\n",
1102
+ " <td>98</td>\n",
1103
+ " <td>0.000000</td>\n",
1104
+ " </tr>\n",
1105
+ " <tr>\n",
1106
+ " <td>99</td>\n",
1107
+ " <td>0.000000</td>\n",
1108
+ " </tr>\n",
1109
+ " <tr>\n",
1110
+ " <td>100</td>\n",
1111
+ " <td>0.000000</td>\n",
1112
+ " </tr>\n",
1113
+ " </tbody>\n",
1114
+ "</table><p>"
1115
+ ],
1116
+ "text/plain": [
1117
+ "<IPython.core.display.HTML object>"
1118
+ ]
1119
+ },
1120
+ "metadata": {},
1121
+ "output_type": "display_data"
1122
+ },
1123
+ {
1124
+ "name": "stdout",
1125
+ "output_type": "stream",
1126
+ "text": [
1127
+ "\n",
1128
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1129
+ "[DEBUG] Processing prompt 1/4\n",
1130
+ "Step 1: Clicks ('13')\n",
1131
+ "Step 2: noop()\n",
1132
+ "Step 3: noop()\n",
1133
+ "Step 4: noop()\n",
1134
+ "Step 5: noop()\n",
1135
+ "Step 6: Click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1136
+ "Step 7: noop()\n",
1137
+ "Step 8: noop()\n",
1138
+ "Step 9: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1139
+ "Step 10: noop()\n",
1140
+ "[DEBUG] Processing prompt 2/4\n",
1141
+ "Step 1: noop()\n",
1142
+ "Step 2: I will use action: click(bid) to click the button.\n",
1143
+ "Step 3: Yes, I can handle this. I will use the `click()` action to click the button.\n",
1144
+ "Step 4: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1145
+ "Step 5: noop()\n",
1146
+ "Step 6: noop()\n",
1147
+ "Step 7: noop()\n",
1148
+ "Step 8: Click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1149
+ "Step 9: noop()\n",
1150
+ "Step 10: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1151
+ "[DEBUG] Processing prompt 3/4\n",
1152
+ "Step 1: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1153
+ "Step 2: noop()\n",
1154
+ "Step 3: noop()\n",
1155
+ "Step 4: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1156
+ "Step 5: noop()\n",
1157
+ "Step 6: noop()\n",
1158
+ "Step 7: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1159
+ "Step 8: noop()\n",
1160
+ "Step 9: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1161
+ "Step 10: Pass the button ID ('Click Me!') to the action \"click('bid')\".\n",
1162
+ "[DEBUG] Processing prompt 4/4\n",
1163
+ "Step 1: noop()\n",
1164
+ "Step 2: noop()\n",
1165
+ "Step 3: noop()\n",
1166
+ "Step 4: noop()\n",
1167
+ "Step 5: I will click the button by emitting `click(bid)` and `fill(bid, text)` simultaneously.\n",
1168
+ "Step 6: noop()\n",
1169
+ "Step 7: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1170
+ "Step 8: noop()\n",
1171
+ "Step 9: noop()\n",
1172
+ "Step 10: noop()\n",
1173
+ "\n",
1174
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1175
+ "[DEBUG] Processing prompt 1/4\n",
1176
+ "Step 1: - Noop()\n",
1177
+ "Step 2: noop()\n",
1178
+ "Step 3: -noop()\n",
1179
+ "Step 4: noop()\n",
1180
+ "Step 5: Click('13')\n",
1181
+ "Step 6: noop()\n",
1182
+ "Step 7: noop()\n",
1183
+ "Step 8: noop()\n",
1184
+ "Step 9: noop()\n",
1185
+ "Step 10: noop()\n",
1186
+ "[DEBUG] Processing prompt 2/4\n",
1187
+ "Step 1: noop()\n",
1188
+ "Step 2: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1189
+ "Step 3: noop()\n",
1190
+ "Step 4: noop()\n",
1191
+ "Step 5: noop()\n",
1192
+ "Step 6: Complete action: click('13')\n",
1193
+ "[DEBUG] Processing prompt 3/4\n",
1194
+ "Step 1: I will use the action 'click('bid') to click the button.\n",
1195
+ "Step 2: noop()\n",
1196
+ "Step 3: noop()\n",
1197
+ "Step 4: noop()\n",
1198
+ "Step 5: noop()\n",
1199
+ "Step 6: I call action Click (bid) on the page.\n",
1200
+ "Step 7: noop()\n",
1201
+ "Step 8: noop()\n",
1202
+ "Step 9: noop()\n",
1203
+ "Step 10: noop()\n",
1204
+ "[DEBUG] Processing prompt 4/4\n",
1205
+ "Step 1: Oops()\n",
1206
+ "Step 2: noop()\n",
1207
+ "Step 3: fill(bid, text)\n",
1208
+ "Step 4: noop()\n",
1209
+ "Step 5: click('13')\n",
1210
+ "\n",
1211
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1212
+ "[DEBUG] Processing prompt 1/4\n",
1213
+ "Step 1: def click_button_on_page():\n",
1214
+ "Step 2: noop()\n",
1215
+ "Step 3: click(bid)\n",
1216
+ "Step 4: Click('13')\n",
1217
+ "Step 5: noop()\n",
1218
+ "Step 6: noop()\n",
1219
+ "Step 7: noop()\n",
1220
+ "Step 8: noop()\n",
1221
+ "Step 9: noop()\n",
1222
+ "Step 10: noop()\n",
1223
+ "[DEBUG] Processing prompt 2/4\n",
1224
+ "Step 1: noop()\n",
1225
+ "Step 2: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1226
+ "Step 3: noop()\n",
1227
+ "Step 4: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1228
+ "Step 5: Click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1229
+ "Step 6: I will click the button 'Click Me!' by using the action `click(bid)` and emitting a bid of 13.\n",
1230
+ "Step 7: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1231
+ "Step 8: noop()\n",
1232
+ "Step 9: noop()\n",
1233
+ "Step 10: noop()\n",
1234
+ "[DEBUG] Processing prompt 3/4\n",
1235
+ "Step 1: `click(bid)` - No action\n",
1236
+ "Step 2: - Noop()\n",
1237
+ "Step 3: noop()\n",
1238
+ "Step 4: noop()\n",
1239
+ "Step 5: noop()\n",
1240
+ "Step 6: noop()\n",
1241
+ "Step 7: noop()\n",
1242
+ "Step 8: noop()\n",
1243
+ "Step 9: noop()\n",
1244
+ "Step 10: I will click the button 'Click Me!' using the action 'click(bid)'.\n",
1245
+ "[DEBUG] Processing prompt 4/4\n",
1246
+ "Step 1: noop()\n",
1247
+ "Step 2: noop()\n",
1248
+ "Step 3: noop()\n",
1249
+ "Step 4: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1250
+ "Step 5: noop()\n",
1251
+ "Step 6: noop()\n",
1252
+ "Step 7: noop()\n",
1253
+ "Step 8: noop()\n",
1254
+ "Step 9: Complete action: click(bid)\n",
1255
+ "Step 10: noop()\n",
1256
+ "\n",
1257
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1258
+ "[DEBUG] Processing prompt 1/4\n",
1259
+ "Step 1: click('13')\n",
1260
+ "[DEBUG] Processing prompt 2/4\n",
1261
+ "Step 1: noop()\n",
1262
+ "Step 2: I will perform action 1: click('13') to complete the action.\n",
1263
+ "[DEBUG] Processing prompt 3/4\n",
1264
+ "Step 1: noop()\n",
1265
+ "Step 2: noop()\n",
1266
+ "Step 3: noop()\n",
1267
+ "Step 4: noop()\n",
1268
+ "Step 5: noop()\n",
1269
+ "Step 6: noop()\n",
1270
+ "Step 7: Click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1271
+ "Step 8: noop()\n",
1272
+ "Step 9: Click ('13')\n",
1273
+ "Step 10: Add action 'fill(bid, text) - Send keyboard input' to perform the click.\n",
1274
+ "[DEBUG] Processing prompt 4/4\n",
1275
+ "Step 1: noop()\n",
1276
+ "Step 2: Click('click(bid) - Bid')\n",
1277
+ "Step 3: noop()\n",
1278
+ "Step 4: noop()\n",
1279
+ "Step 5: noop()\n",
1280
+ "Step 6: noop()\n",
1281
+ "Step 7: noop()\n",
1282
+ "Step 8: noop()\n",
1283
+ "Step 9: click(bid) - Click element with BrowserGym ID (the number in brackets)\n",
1284
+ "Step 10: noop()\n",
1285
+ "\n",
1286
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1287
+ "[DEBUG] Processing prompt 1/4\n",
1288
+ "Step 1: click('13')\n",
1289
+ "[DEBUG] Processing prompt 2/4\n",
1290
+ "Step 1: click('13')\n",
1291
+ "[DEBUG] Processing prompt 3/4\n",
1292
+ "Step 1: click('13')\n",
1293
+ "[DEBUG] Processing prompt 4/4\n",
1294
+ "Step 1: click('13')\n",
1295
+ "\n",
1296
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1297
+ "[DEBUG] Processing prompt 1/4\n",
1298
+ "Step 1: click('13')\n",
1299
+ "[DEBUG] Processing prompt 2/4\n",
1300
+ "Step 1: click('13')\n",
1301
+ "[DEBUG] Processing prompt 3/4\n",
1302
+ "Step 1: click('13')\n",
1303
+ "[DEBUG] Processing prompt 4/4\n",
1304
+ "Step 1: click('13')\n",
1305
+ "\n",
1306
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1307
+ "[DEBUG] Processing prompt 1/4\n",
1308
+ "Step 1: click('13')\n",
1309
+ "[DEBUG] Processing prompt 2/4\n",
1310
+ "Step 1: click('13')\n",
1311
+ "[DEBUG] Processing prompt 3/4\n",
1312
+ "Step 1: click('13')\n",
1313
+ "[DEBUG] Processing prompt 4/4\n",
1314
+ "Step 1: click('13')\n",
1315
+ "\n",
1316
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1317
+ "[DEBUG] Processing prompt 1/4\n",
1318
+ "Step 1: click('13')\n",
1319
+ "[DEBUG] Processing prompt 2/4\n",
1320
+ "Step 1: click('13')\n",
1321
+ "[DEBUG] Processing prompt 3/4\n",
1322
+ "Step 1: click('13')\n",
1323
+ "[DEBUG] Processing prompt 4/4\n",
1324
+ "Step 1: click('13')\n",
1325
+ "\n",
1326
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1327
+ "[DEBUG] Processing prompt 1/4\n",
1328
+ "Step 1: click('13')\n",
1329
+ "[DEBUG] Processing prompt 2/4\n",
1330
+ "Step 1: click('13')\n",
1331
+ "[DEBUG] Processing prompt 3/4\n",
1332
+ "Step 1: click('13')\n",
1333
+ "[DEBUG] Processing prompt 4/4\n",
1334
+ "Step 1: click('13')\n",
1335
+ "\n",
1336
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1337
+ "[DEBUG] Processing prompt 1/4\n",
1338
+ "Step 1: click('13')\n",
1339
+ "[DEBUG] Processing prompt 2/4\n",
1340
+ "Step 1: click('13')\n",
1341
+ "[DEBUG] Processing prompt 3/4\n",
1342
+ "Step 1: click('13')\n",
1343
+ "[DEBUG] Processing prompt 4/4\n",
1344
+ "Step 1: click('13')\n",
1345
+ "\n",
1346
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1347
+ "[DEBUG] Processing prompt 1/4\n",
1348
+ "Step 1: click('13')\n",
1349
+ "[DEBUG] Processing prompt 2/4\n",
1350
+ "Step 1: click('13')\n",
1351
+ "[DEBUG] Processing prompt 3/4\n",
1352
+ "Step 1: click('13')\n",
1353
+ "[DEBUG] Processing prompt 4/4\n",
1354
+ "Step 1: click('13')\n",
1355
+ "\n",
1356
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1357
+ "[DEBUG] Processing prompt 1/4\n",
1358
+ "Step 1: click('13')\n",
1359
+ "[DEBUG] Processing prompt 2/4\n",
1360
+ "Step 1: click('13')\n",
1361
+ "[DEBUG] Processing prompt 3/4\n",
1362
+ "Step 1: click('13')\n",
1363
+ "[DEBUG] Processing prompt 4/4\n",
1364
+ "Step 1: click('13')\n",
1365
+ "\n",
1366
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1367
+ "[DEBUG] Processing prompt 1/4\n",
1368
+ "Step 1: click('13')\n",
1369
+ "[DEBUG] Processing prompt 2/4\n",
1370
+ "Step 1: click('13')\n",
1371
+ "[DEBUG] Processing prompt 3/4\n",
1372
+ "Step 1: click('13')\n",
1373
+ "[DEBUG] Processing prompt 4/4\n",
1374
+ "Step 1: click('13')\n",
1375
+ "\n",
1376
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1377
+ "[DEBUG] Processing prompt 1/4\n",
1378
+ "Step 1: click('13')\n",
1379
+ "[DEBUG] Processing prompt 2/4\n",
1380
+ "Step 1: click('13')\n",
1381
+ "[DEBUG] Processing prompt 3/4\n",
1382
+ "Step 1: click('13')\n",
1383
+ "[DEBUG] Processing prompt 4/4\n",
1384
+ "Step 1: click('13')\n",
1385
+ "\n",
1386
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1387
+ "[DEBUG] Processing prompt 1/4\n",
1388
+ "Step 1: click('13')\n",
1389
+ "[DEBUG] Processing prompt 2/4\n",
1390
+ "Step 1: click('13')\n",
1391
+ "[DEBUG] Processing prompt 3/4\n",
1392
+ "Step 1: click('13')\n",
1393
+ "[DEBUG] Processing prompt 4/4\n",
1394
+ "Step 1: click('13')\n",
1395
+ "\n",
1396
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1397
+ "[DEBUG] Processing prompt 1/4\n",
1398
+ "Step 1: click('13')\n",
1399
+ "[DEBUG] Processing prompt 2/4\n",
1400
+ "Step 1: click('13')\n",
1401
+ "[DEBUG] Processing prompt 3/4\n",
1402
+ "Step 1: click('13')\n",
1403
+ "[DEBUG] Processing prompt 4/4\n",
1404
+ "Step 1: click('13')\n",
1405
+ "\n",
1406
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1407
+ "[DEBUG] Processing prompt 1/4\n",
1408
+ "Step 1: click('13')\n",
1409
+ "[DEBUG] Processing prompt 2/4\n",
1410
+ "Step 1: click('13')\n",
1411
+ "[DEBUG] Processing prompt 3/4\n",
1412
+ "Step 1: click('13')\n",
1413
+ "[DEBUG] Processing prompt 4/4\n",
1414
+ "Step 1: click('13')\n",
1415
+ "\n",
1416
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1417
+ "[DEBUG] Processing prompt 1/4\n",
1418
+ "Step 1: click('13')\n",
1419
+ "[DEBUG] Processing prompt 2/4\n",
1420
+ "Step 1: click('13')\n",
1421
+ "[DEBUG] Processing prompt 3/4\n",
1422
+ "Step 1: click('13')\n",
1423
+ "[DEBUG] Processing prompt 4/4\n",
1424
+ "Step 1: click('13')\n",
1425
+ "\n",
1426
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1427
+ "[DEBUG] Processing prompt 1/4\n",
1428
+ "Step 1: click('13')\n",
1429
+ "[DEBUG] Processing prompt 2/4\n",
1430
+ "Step 1: click('13')\n",
1431
+ "[DEBUG] Processing prompt 3/4\n",
1432
+ "Step 1: click('13')\n",
1433
+ "[DEBUG] Processing prompt 4/4\n",
1434
+ "Step 1: click('13')\n",
1435
+ "\n",
1436
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1437
+ "[DEBUG] Processing prompt 1/4\n",
1438
+ "Step 1: click('13')\n",
1439
+ "[DEBUG] Processing prompt 2/4\n",
1440
+ "Step 1: click('13')\n",
1441
+ "[DEBUG] Processing prompt 3/4\n",
1442
+ "Step 1: click('13')\n",
1443
+ "[DEBUG] Processing prompt 4/4\n",
1444
+ "Step 1: click('13')\n",
1445
+ "\n",
1446
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1447
+ "[DEBUG] Processing prompt 1/4\n",
1448
+ "Step 1: click('13')\n",
1449
+ "[DEBUG] Processing prompt 2/4\n",
1450
+ "Step 1: click('13')\n",
1451
+ "[DEBUG] Processing prompt 3/4\n",
1452
+ "Step 1: click('13')\n",
1453
+ "[DEBUG] Processing prompt 4/4\n",
1454
+ "Step 1: click('13')\n",
1455
+ "\n",
1456
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1457
+ "[DEBUG] Processing prompt 1/4\n",
1458
+ "Step 1: click('13')\n",
1459
+ "[DEBUG] Processing prompt 2/4\n",
1460
+ "Step 1: click('13')\n",
1461
+ "[DEBUG] Processing prompt 3/4\n",
1462
+ "Step 1: click('13')\n",
1463
+ "[DEBUG] Processing prompt 4/4\n",
1464
+ "Step 1: click('13')\n",
1465
+ "\n",
1466
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1467
+ "[DEBUG] Processing prompt 1/4\n",
1468
+ "Step 1: click('13')\n",
1469
+ "[DEBUG] Processing prompt 2/4\n",
1470
+ "Step 1: click('13')\n",
1471
+ "[DEBUG] Processing prompt 3/4\n",
1472
+ "Step 1: click('13')\n",
1473
+ "[DEBUG] Processing prompt 4/4\n",
1474
+ "Step 1: click('13')\n",
1475
+ "\n",
1476
+ "[DEBUG] rollout_func called with 4 prompts (LLM mode, text-only)\n",
1477
+ "[DEBUG] Processing prompt 1/4\n",
1478
+ "Step 1: click('13')\n",
1479
+ "[DEBUG] Processing prompt 2/4\n",
1480
+ "Step 1: click('13')\n",
1481
+ "[DEBUG] Processing prompt 3/4\n",
1482
+ "Step 1: click('13')\n",
1483
+ "[DEBUG] Processing prompt 4/4\n",
1484
+ "Step 1: click('13')\n",
1485
+ "* Run finished. Uploading logs to Trackio (please wait...)\n"
1486
+ ]
1487
+ }
1488
+ ],
1489
+ "source": [
1490
+ "trainer_stats = trainer.train()"
1491
+ ]
1492
+ },
1493
+ {
1494
+ "cell_type": "markdown",
1495
+ "metadata": {
1496
+ "id": "BZj4IG9ZBAix"
1497
+ },
1498
+ "source": [
1499
+ "In this step, the fine-tuned model is saved locally and uploaded to the Hugging Face Hub using the configured account credentials."
1500
+ ]
1501
+ },
1502
+ {
1503
+ "cell_type": "code",
1504
+ "execution_count": null,
1505
+ "metadata": {
1506
+ "colab": {
1507
+ "referenced_widgets": [
1508
+ "244ced1920694dbaae9bf98065b4f01d",
1509
+ "e3769ae107554c9ba38c1e491b15bf4e",
1510
+ "6d5b8bff73474faeb1d1b438fb4e8cec",
1511
+ "9f952f8eb63b42e4b38711737da5461e",
1512
+ "bd12780895064467b5be14e2ec3df114",
1513
+ "d1261c1083a74dca877e6eece6395d73",
1514
+ "999744cacd6a4fb08a1d4977ce2f06fd",
1515
+ "faa5e0fb4ee244689c0f9eef9902acf7",
1516
+ "6403bed2cd984ba18f74f416748c64e4",
1517
+ "38be017369524e2eb22050e7a0a18ec5",
1518
+ "b0720a4a2df948308011d4d87a288426",
1519
+ "889ca2520f4d446daf2e6ed16ce11d2e"
1520
+ ]
1521
+ },
1522
+ "id": "9oOBgEWeBP59",
1523
+ "outputId": "76bef375-fc6b-4fdd-a296-549a9b109b11"
1524
+ },
1525
+ "outputs": [
1526
+ {
1527
+ "data": {
1528
+ "application/vnd.jupyter.widget-view+json": {
1529
+ "model_id": "244ced1920694dbaae9bf98065b4f01d",
1530
+ "version_major": 2,
1531
+ "version_minor": 0
1532
+ },
1533
+ "text/plain": [
1534
+ "Processing Files (0 / 0) : | | 0.00B / 0.00B "
1535
+ ]
1536
+ },
1537
+ "metadata": {},
1538
+ "output_type": "display_data"
1539
+ },
1540
+ {
1541
+ "data": {
1542
+ "application/vnd.jupyter.widget-view+json": {
1543
+ "model_id": "e3769ae107554c9ba38c1e491b15bf4e",
1544
+ "version_major": 2,
1545
+ "version_minor": 0
1546
+ },
1547
+ "text/plain": [
1548
+ "New Data Upload : | | 0.00B / 0.00B "
1549
+ ]
1550
+ },
1551
+ "metadata": {},
1552
+ "output_type": "display_data"
1553
+ },
1554
+ {
1555
+ "data": {
1556
+ "application/vnd.jupyter.widget-view+json": {
1557
+ "model_id": "6d5b8bff73474faeb1d1b438fb4e8cec",
1558
+ "version_major": 2,
1559
+ "version_minor": 0
1560
+ },
1561
+ "text/plain": [
1562
+ " ...270m-it/training_args.bin: 100%|##########| 7.57kB / 7.57kB "
1563
+ ]
1564
+ },
1565
+ "metadata": {},
1566
+ "output_type": "display_data"
1567
+ },
1568
+ {
1569
+ "data": {
1570
+ "application/vnd.jupyter.widget-view+json": {
1571
+ "model_id": "9f952f8eb63b42e4b38711737da5461e",
1572
+ "version_major": 2,
1573
+ "version_minor": 0
1574
+ },
1575
+ "text/plain": [
1576
+ " ...a-270m-it/tokenizer.model: 100%|##########| 4.69MB / 4.69MB "
1577
+ ]
1578
+ },
1579
+ "metadata": {},
1580
+ "output_type": "display_data"
1581
+ },
1582
+ {
1583
+ "data": {
1584
+ "application/vnd.jupyter.widget-view+json": {
1585
+ "model_id": "bd12780895064467b5be14e2ec3df114",
1586
+ "version_major": 2,
1587
+ "version_minor": 0
1588
+ },
1589
+ "text/plain": [
1590
+ " ...ma-270m-it/tokenizer.json: 100%|##########| 33.4MB / 33.4MB "
1591
+ ]
1592
+ },
1593
+ "metadata": {},
1594
+ "output_type": "display_data"
1595
+ },
1596
+ {
1597
+ "data": {
1598
+ "application/vnd.jupyter.widget-view+json": {
1599
+ "model_id": "d1261c1083a74dca877e6eece6395d73",
1600
+ "version_major": 2,
1601
+ "version_minor": 0
1602
+ },
1603
+ "text/plain": [
1604
+ " ...270m-it/model.safetensors: 4%|3 | 41.9MB / 1.07GB "
1605
+ ]
1606
+ },
1607
+ "metadata": {},
1608
+ "output_type": "display_data"
1609
+ },
1610
+ {
1611
+ "name": "stderr",
1612
+ "output_type": "stream",
1613
+ "text": [
1614
+ "No files have been modified since last commit. Skipping to prevent empty commit.\n",
1615
+ "WARNING:huggingface_hub.hf_api:No files have been modified since last commit. Skipping to prevent empty commit.\n"
1616
+ ]
1617
+ },
1618
+ {
1619
+ "data": {
1620
+ "application/vnd.jupyter.widget-view+json": {
1621
+ "model_id": "999744cacd6a4fb08a1d4977ce2f06fd",
1622
+ "version_major": 2,
1623
+ "version_minor": 0
1624
+ },
1625
+ "text/plain": [
1626
+ "Processing Files (0 / 0) : | | 0.00B / 0.00B "
1627
+ ]
1628
+ },
1629
+ "metadata": {},
1630
+ "output_type": "display_data"
1631
+ },
1632
+ {
1633
+ "data": {
1634
+ "application/vnd.jupyter.widget-view+json": {
1635
+ "model_id": "faa5e0fb4ee244689c0f9eef9902acf7",
1636
+ "version_major": 2,
1637
+ "version_minor": 0
1638
+ },
1639
+ "text/plain": [
1640
+ "New Data Upload : | | 0.00B / 0.00B "
1641
+ ]
1642
+ },
1643
+ "metadata": {},
1644
+ "output_type": "display_data"
1645
+ },
1646
+ {
1647
+ "data": {
1648
+ "application/vnd.jupyter.widget-view+json": {
1649
+ "model_id": "6403bed2cd984ba18f74f416748c64e4",
1650
+ "version_major": 2,
1651
+ "version_minor": 0
1652
+ },
1653
+ "text/plain": [
1654
+ " ...270m-it/training_args.bin: 100%|##########| 7.57kB / 7.57kB "
1655
+ ]
1656
+ },
1657
+ "metadata": {},
1658
+ "output_type": "display_data"
1659
+ },
1660
+ {
1661
+ "data": {
1662
+ "application/vnd.jupyter.widget-view+json": {
1663
+ "model_id": "38be017369524e2eb22050e7a0a18ec5",
1664
+ "version_major": 2,
1665
+ "version_minor": 0
1666
+ },
1667
+ "text/plain": [
1668
+ " ...a-270m-it/tokenizer.model: 100%|##########| 4.69MB / 4.69MB "
1669
+ ]
1670
+ },
1671
+ "metadata": {},
1672
+ "output_type": "display_data"
1673
+ },
1674
+ {
1675
+ "data": {
1676
+ "application/vnd.jupyter.widget-view+json": {
1677
+ "model_id": "b0720a4a2df948308011d4d87a288426",
1678
+ "version_major": 2,
1679
+ "version_minor": 0
1680
+ },
1681
+ "text/plain": [
1682
+ " ...270m-it/model.safetensors: 3%|3 | 33.5MB / 1.07GB "
1683
+ ]
1684
+ },
1685
+ "metadata": {},
1686
+ "output_type": "display_data"
1687
+ },
1688
+ {
1689
+ "data": {
1690
+ "application/vnd.jupyter.widget-view+json": {
1691
+ "model_id": "889ca2520f4d446daf2e6ed16ce11d2e",
1692
+ "version_major": 2,
1693
+ "version_minor": 0
1694
+ },
1695
+ "text/plain": [
1696
+ " ...ma-270m-it/tokenizer.json: 100%|##########| 33.4MB / 33.4MB "
1697
+ ]
1698
+ },
1699
+ "metadata": {},
1700
+ "output_type": "display_data"
1701
+ },
1702
+ {
1703
+ "name": "stderr",
1704
+ "output_type": "stream",
1705
+ "text": [
1706
+ "No files have been modified since last commit. Skipping to prevent empty commit.\n",
1707
+ "WARNING:huggingface_hub.hf_api:No files have been modified since last commit. Skipping to prevent empty commit.\n"
1708
+ ]
1709
+ },
1710
+ {
1711
+ "data": {
1712
+ "application/vnd.google.colaboratory.intrinsic+json": {
1713
+ "type": "string"
1714
+ },
1715
+ "text/plain": [
1716
+ "CommitInfo(commit_url='https://huggingface.co/sergiopaniego/browsergym-grpo-functiongemma-270m-it/commit/a17de133c28ca7fddfcb2694c32f2791de5ddbe6', commit_message='End of training', commit_description='', oid='a17de133c28ca7fddfcb2694c32f2791de5ddbe6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/sergiopaniego/browsergym-grpo-functiongemma-270m-it', endpoint='https://huggingface.co', repo_type='model', repo_id='sergiopaniego/browsergym-grpo-functiongemma-270m-it'), pr_revision=None, pr_num=None)"
1717
+ ]
1718
+ },
1719
+ "execution_count": 12,
1720
+ "metadata": {},
1721
+ "output_type": "execute_result"
1722
+ }
1723
+ ],
1724
+ "source": [
1725
+ "trainer.save_model(output_dir)\n",
1726
+ "trainer.push_to_hub()"
1727
+ ]
1728
+ },
1729
+ {
1730
+ "cell_type": "markdown",
1731
+ "metadata": {
1732
+ "id": "talmc8b7nPXJ"
1733
+ },
1734
+ "source": [
1735
+ "## Load the Fine-Tuned Model and Run Inference\n",
1736
+ "\n",
1737
+ "The fine-tuned model is loaded to perform inference and evaluate its behavior on the target task. \n",
1738
+ "In this case, the model is tested within the BrowserGym environment using OpenEnv, focusing on the *click* task from the MiniWoB++ benchmark, which is included among the available BrowserGym tasks."
1739
+ ]
1740
+ },
1741
+ {
1742
+ "cell_type": "code",
1743
+ "execution_count": null,
1744
+ "metadata": {
1745
+ "colab": {
1746
+ "referenced_widgets": [
1747
+ "c3879b716f37442a87d51b8414fe8c48"
1748
+ ]
1749
+ },
1750
+ "id": "iIDiaGVlBP5-",
1751
+ "outputId": "4dc0e365-e89f-40ba-b391-74c7efdc932d"
1752
+ },
1753
+ "outputs": [
1754
+ {
1755
+ "data": {
1756
+ "application/vnd.jupyter.widget-view+json": {
1757
+ "model_id": "c3879b716f37442a87d51b8414fe8c48",
1758
+ "version_major": 2,
1759
+ "version_minor": 0
1760
+ },
1761
+ "text/plain": [
1762
+ "model.safetensors: 0%| | 0.00/1.07G [00:00<?, ?B/s]"
1763
+ ]
1764
+ },
1765
+ "metadata": {},
1766
+ "output_type": "display_data"
1767
+ }
1768
+ ],
1769
+ "source": [
1770
+ "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
1771
+ "\n",
1772
+ "model_name = \"sergiopaniego/browsergym-grpo-functiongemma-270m-it\" # Replace with your HF username or organization\n",
1773
+ "\n",
1774
+ "fine_tuned_model = AutoModelForCausalLM.from_pretrained(model_name, dtype=\"float32\", device_map=\"auto\")\n",
1775
+ "tokenizer = AutoTokenizer.from_pretrained(model_name)"
1776
+ ]
1777
+ },
1778
+ {
1779
+ "cell_type": "markdown",
1780
+ "metadata": {
1781
+ "id": "lyT-vudO5ekj"
1782
+ },
1783
+ "source": [
1784
+ "With the fine-tuned model loaded, testing can be conducted on the BrowserGym environment.\n",
1785
+ "To streamline evaluation, a reusable function is defined that executes multiple rounds of the task.\n",
1786
+ "This function follows the same interaction logic as used during training, generating model actions from observations, executing them in the environment, and printing the results step by step."
1787
+ ]
1788
+ },
1789
+ {
1790
+ "cell_type": "code",
1791
+ "execution_count": null,
1792
+ "metadata": {
1793
+ "id": "doAEIf5IBP5-"
1794
+ },
1795
+ "outputs": [],
1796
+ "source": [
1797
+ "def test_click_in_browsergym(env, model, tokenizer):\n",
1798
+ " result = env.reset()\n",
1799
+ " observation = result.observation\n",
1800
+ "\n",
1801
+ " for step_num in range(max_steps):\n",
1802
+ " if result.done:\n",
1803
+ " break\n",
1804
+ "\n",
1805
+ " # Create prompt from observation (text-only using accessibility tree)\n",
1806
+ " goal = observation.goal or dataset_prompt\n",
1807
+ " axtree = observation.axtree_txt or \"\"\n",
1808
+ " error = observation.error if observation.last_action_error else \"\"\n",
1809
+ "\n",
1810
+ " user_prompt = make_user_prompt(goal, step_num, axtree, error)\n",
1811
+ " messages = [\n",
1812
+ " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
1813
+ " {\"role\": \"user\", \"content\": user_prompt},\n",
1814
+ " ]\n",
1815
+ " prompt_text = tokenizer.apply_chat_template(\n",
1816
+ " messages,\n",
1817
+ " add_generation_prompt=True,\n",
1818
+ " tokenize=False,\n",
1819
+ " )\n",
1820
+ "\n",
1821
+ " # Generate action\n",
1822
+ " prompt_text = tokenizer.apply_chat_template(\n",
1823
+ " messages,\n",
1824
+ " add_generation_prompt=True,\n",
1825
+ " tokenize=False,\n",
1826
+ " enable_thinking=False,\n",
1827
+ " )\n",
1828
+ "\n",
1829
+ " model_inputs = tokenizer([prompt_text], return_tensors=\"pt\").to(model.device)\n",
1830
+ "\n",
1831
+ " generated_ids = model.generate(\n",
1832
+ " **model_inputs,\n",
1833
+ " max_new_tokens=512\n",
1834
+ " )\n",
1835
+ " output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]\n",
1836
+ "\n",
1837
+ " # Decode and extract model response\n",
1838
+ " generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)\n",
1839
+ "\n",
1840
+ " action_str = parse_action(generated_text)\n",
1841
+ " print(f\"Step {step_num + 1}: {action_str}\")\n",
1842
+ "\n",
1843
+ " # Take action in environment\n",
1844
+ " result = env.step(BrowserGymAction(action_str=action_str))\n",
1845
+ " observation = result.observation"
1846
+ ]
1847
+ },
1848
+ {
1849
+ "cell_type": "markdown",
1850
+ "metadata": {
1851
+ "id": "9QvGD8f8CQx1"
1852
+ },
1853
+ "source": [
1854
+ "The `test_click_in_browsergym` function is called to run a full evaluation of the fine-tuned model on the BrowserGym *click* task. \n",
1855
+ "\n",
1856
+ "The environment client is safely closed after testing using a `try/finally` block, ensuring that all resources are released even if an error occurs during execution."
1857
+ ]
1858
+ },
1859
+ {
1860
+ "cell_type": "code",
1861
+ "execution_count": null,
1862
+ "metadata": {
1863
+ "id": "Z77wlVb6BP5-",
1864
+ "outputId": "ed4ad094-1529-4cc7-8274-2782784efe2d"
1865
+ },
1866
+ "outputs": [
1867
+ {
1868
+ "name": "stdout",
1869
+ "output_type": "stream",
1870
+ "text": [
1871
+ "Step 1: click('13')\n"
1872
+ ]
1873
+ }
1874
+ ],
1875
+ "source": [
1876
+ "try:\n",
1877
+ " test_click_in_browsergym(client, fine_tuned_model, tokenizer)\n",
1878
+ "finally:\n",
1879
+ " client.close()"
1880
+ ]
1881
+ },
1882
+ {
1883
+ "cell_type": "markdown",
1884
+ "metadata": {
1885
+ "id": "wHydP-ZVCcYK"
1886
+ },
1887
+ "source": [
1888
+ "## Summary and Next Steps\n",
1889
+ "\n",
1890
+ "This tutorial demonstrated how to fine-tune a FunctionGemma model using TRL, GRPO, and the BrowserGym environment from OpenEnv. Check out the following docs next:\n",
1891
+ "\n",
1892
+ "- Learn how to [generate text with a Gemma model](https://ai.google.dev/gemma/docs/get_started).\n",
1893
+ "- Learn how to [fine-tune Gemma for vision tasks using Hugging Face Transformers](https://ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora).\n",
1894
+ "- Learn how to [full model fine-tune using Hugging Face Transformers](https://ai.google.dev/gemma/docs/core/huggingface_text_full_finetune).\n",
1895
+ "- Learn how to [fine-tune Gemma using Hugging Face Transformers with QLoRA](https://ai.google.dev/gemma/docs/core/huggingface_text_finetune_qlora). \n",
1896
+ "- Learn how to perform [distributed fine-tuning and inference on a Gemma model](https://ai.google.dev/gemma/docs/core/distributed_tuning).\n",
1897
+ "- Learn how to [use Gemma open models with Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/open-models/use-gemma).\n",
1898
+ "- Learn how to [fine-tune Gemma using KerasNLP and deploy to Vertex AI](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_gemma_kerasnlp_to_vertexai.ipynb)."
1899
+ ]
1900
+ }
1901
+ ],
1902
+ "metadata": {
1903
+ "accelerator": "GPU",
1904
+ "colab": {
1905
+ "gpuType": "A100",
1906
+ "provenance": []
1907
+ },
1908
+ "language_info": {
1909
+ "name": "python"
1910
+ }
1911
+ },
1912
+ "nbformat": 4,
1913
+ "nbformat_minor": 0
1914
+ }
ICL/RL/trl_source/examples/notebooks/grpo_ministral3_vl.ipynb ADDED
@@ -0,0 +1,740 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "-J8iGzLf4rUJ"
7
+ },
8
+ "source": [
9
+ "# GRPO Ministral-3 with QLoRA using TRL\n",
10
+ "\n",
11
+ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_ministral3_vl.ipynb)\n",
12
+ "\n",
13
+ "![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)\n",
14
+ "\n",
15
+ "\n",
16
+ "With [**Transformers Reinforcement Learning (TRL)**](https://github.com/huggingface/trl), you can fine-tune cutting edge vision language models. It comes with support for quantized parameter efficient fine-tuning technique **QLoRA**, so we can use free Colab (T4 GPU) to fine-tune models like [Ministral-3](https://huggingface.co/collections/mistralai/ministral-3).\n",
17
+ "\n",
18
+ "\n",
19
+ "- [TRL GitHub Repository](https://github.com/huggingface/trl) — star us to support the project! \n",
20
+ "- [Official TRL Examples (notebooks and scripts)](https://huggingface.co/docs/trl/example_overview) \n",
21
+ "- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)"
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "markdown",
26
+ "metadata": {
27
+ "id": "NvrzGRnu48Vz"
28
+ },
29
+ "source": [
30
+ "## Install dependencies\n",
31
+ "\n",
32
+ "We'll install **TRL** with the **PEFT** extra, which ensures all main dependencies such as **Transformers** and **PEFT** (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install **trackio** to log and monitor our experiments, and **bitsandbytes** to enable quantization of LLMs, reducing memory consumption for both inference and training."
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "code",
37
+ "execution_count": null,
38
+ "metadata": {
39
+ "id": "Dbvb3UmQ99p9",
40
+ "outputId": "3ad47e9a-017e-4066-8fe8-77a59586fff3"
41
+ },
42
+ "outputs": [],
43
+ "source": [
44
+ "!pip install -Uq \"trl[peft]\" bitsandbytes trackio math_verify git+https://github.com/huggingface/transformers mistral-common"
45
+ ]
46
+ },
47
+ {
48
+ "cell_type": "markdown",
49
+ "metadata": {
50
+ "id": "gpzI6omi7728"
51
+ },
52
+ "source": [
53
+ "### Log in to Hugging Face\n",
54
+ "\n",
55
+ "Log in to your **Hugging Face** account to save your fine-tuned model, track your experiment results directly on the Hub or access gated models. You can find your **access token** on your [account settings page](https://huggingface.co/settings/tokens)."
56
+ ]
57
+ },
58
+ {
59
+ "cell_type": "code",
60
+ "execution_count": null,
61
+ "metadata": {
62
+ "colab": {
63
+ "referenced_widgets": [
64
+ "2ac44d3c070845af86d9b2e3ce8b949f"
65
+ ]
66
+ },
67
+ "id": "h5Ubc70Z99p-",
68
+ "outputId": "633485d3-c79b-4702-ac01-f5a7be5cadfb"
69
+ },
70
+ "outputs": [],
71
+ "source": [
72
+ "from huggingface_hub import notebook_login\n",
73
+ "\n",
74
+ "notebook_login()"
75
+ ]
76
+ },
77
+ {
78
+ "cell_type": "markdown",
79
+ "metadata": {
80
+ "id": "V_Zylc4t79-n"
81
+ },
82
+ "source": [
83
+ "## Load dataset\n",
84
+ "\n",
85
+ "\n",
86
+ "We'll load the [**lmms-lab/multimodal-open-r1-8k-verified**](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) dataset from the Hugging Face Hub using the `datasets` library.\n",
87
+ "\n",
88
+ "This dataset contains maths problems with the image representing the problem, along with the solution in thinking format specially tailored for VLMs. By training our model with this dataset, it'll improve its maths and thinking reasoning.\n"
89
+ ]
90
+ },
91
+ {
92
+ "cell_type": "code",
93
+ "execution_count": null,
94
+ "metadata": {
95
+ "colab": {
96
+ "referenced_widgets": [
97
+ "3538a24e7f63433d91144b0ef765d8f0",
98
+ "23c73818302c4c879d7eca629b4d734d",
99
+ "663a6d37e74c4663a0d5c31aa14b47d6"
100
+ ]
101
+ },
102
+ "id": "OsyilesY99p-",
103
+ "outputId": "4cca7fa0-5f49-4c40-e36a-3a87d2496177"
104
+ },
105
+ "outputs": [],
106
+ "source": [
107
+ "from datasets import load_dataset\n",
108
+ "\n",
109
+ "dataset_id = 'lmms-lab/multimodal-open-r1-8k-verified'\n",
110
+ "train_dataset = load_dataset(dataset_id, split='train[:5%]')"
111
+ ]
112
+ },
113
+ {
114
+ "cell_type": "markdown",
115
+ "metadata": {
116
+ "id": "gVV7RoRN8zk5"
117
+ },
118
+ "source": [
119
+ "In addition to the `problem` and `image` columns, we also include a custom system prompt to tell the model how we'd like the generation.\n",
120
+ "\n",
121
+ "The system prompt is extracted from DeepSeek R1. Refer to [this previous recipe](https://huggingface.co/learn/cookbook/fine_tuning_llm_grpo_trl) for more details.\n",
122
+ "\n",
123
+ "We convert the dataset samples into conversation samples, including the system prompt and one image and problem description per sample, since this is how the GRPO trainer expects them.\n",
124
+ "\n",
125
+ "We also set `padding_side=\"left\"` to ensure that generated completions during training are concatenated directly after the prompt, which is essential for GRPO to correctly compare token-level probabilities between preferred and rejected responses.\n",
126
+ "\n",
127
+ "> **Note:**\n",
128
+ "> In older GPUs (including those available on Colab), **FP8 support** is limited, so we use the BF16 version of the model.\n",
129
+ "> In that case, you can select the official checkpoint or the one from Unsloth.\n",
130
+ "> If you have access to GPUs with **FP8 support**, you can switch to that version instead."
131
+ ]
132
+ },
133
+ {
134
+ "cell_type": "code",
135
+ "execution_count": null,
136
+ "metadata": {
137
+ "colab": {
138
+ "referenced_widgets": [
139
+ "83dfeaab2bd04b06899d09b6b35bacd1",
140
+ "8588996c1d2d444193e9cf53c1a73b8e",
141
+ "138a997da09f40ada32171e51b51b708",
142
+ "06ef4d5f41de4436ad4731cbf2f8471f"
143
+ ]
144
+ },
145
+ "id": "WlK7KYKT99p-",
146
+ "outputId": "db72808f-21cf-4022-ed1a-b78ebb3ee47e"
147
+ },
148
+ "outputs": [],
149
+ "source": [
150
+ "from transformers import AutoProcessor\n",
151
+ "\n",
152
+ "#model_name = \"mistralai/Ministral-3-3B-Instruct-2512\"\n",
153
+ "model_name = \"mistralai/Ministral-3-3B-Instruct-2512-BF16\" # \"unsloth/Ministral-3-3B-Instruct-2512\"\n",
154
+ "\n",
155
+ "processor = AutoProcessor.from_pretrained(model_name, padding_side=\"left\")\n",
156
+ "\n",
157
+ "SYSTEM_PROMPT = (\n",
158
+ " \"You are a helpful AI Assistant that provides well-reasoned and detailed responses. \"\n",
159
+ " \"You first think about the reasoning process as an internal monologue and then provide the user with the answer. \"\n",
160
+ " \"Respond in the following format: <think>\\n...\\n</think>\\n<answer>\\n...\\n</answer>\"\n",
161
+ ")\n",
162
+ "\n",
163
+ "\n",
164
+ "def make_conversation(example):\n",
165
+ " conversation = [\n",
166
+ " {\n",
167
+ " \"role\": \"system\",\n",
168
+ " \"content\": [{\"type\": \"text\", \"text\": SYSTEM_PROMPT}],\n",
169
+ " },\n",
170
+ " {\n",
171
+ " \"role\": \"user\",\n",
172
+ " \"content\": [\n",
173
+ " {\"type\": \"image\", \"image\": example[\"image\"]},\n",
174
+ " {\"type\": \"text\", \"text\": example[\"problem\"]},\n",
175
+ " ],\n",
176
+ " },\n",
177
+ " ]\n",
178
+ " return {\n",
179
+ " \"prompt\": conversation,\n",
180
+ " \"image\": example[\"image\"],\n",
181
+ " }\n",
182
+ "\n",
183
+ "train_dataset = train_dataset.map(make_conversation)"
184
+ ]
185
+ },
186
+ {
187
+ "cell_type": "markdown",
188
+ "metadata": {
189
+ "id": "5txAuMAa8ock"
190
+ },
191
+ "source": [
192
+ "Let's review one example to understand the internal structure:"
193
+ ]
194
+ },
195
+ {
196
+ "cell_type": "code",
197
+ "execution_count": null,
198
+ "metadata": {
199
+ "id": "sjxG7duU99p_"
200
+ },
201
+ "outputs": [],
202
+ "source": [
203
+ "train_dataset[0]"
204
+ ]
205
+ },
206
+ {
207
+ "cell_type": "code",
208
+ "execution_count": null,
209
+ "metadata": {
210
+ "id": "ZooycTF099p_"
211
+ },
212
+ "outputs": [],
213
+ "source": [
214
+ "train_dataset = train_dataset.remove_columns(['problem', 'original_question', 'original_answer'])"
215
+ ]
216
+ },
217
+ {
218
+ "cell_type": "code",
219
+ "execution_count": null,
220
+ "metadata": {
221
+ "id": "2LcjFKgD99p_"
222
+ },
223
+ "outputs": [],
224
+ "source": [
225
+ "train_dataset[0]"
226
+ ]
227
+ },
228
+ {
229
+ "cell_type": "markdown",
230
+ "metadata": {
231
+ "id": "YY3uMp909Eqy"
232
+ },
233
+ "source": [
234
+ "## Load model and configure LoRA/QLoRA\n",
235
+ "\n",
236
+ "This notebook can be used with two fine-tuning methods. By default, it is set up for **QLoRA**, which includes quantization using `BitsAndBytesConfig`. If you prefer to use standard **LoRA** without quantization, simply comment out the `BitsAndBytesConfig` configuration."
237
+ ]
238
+ },
239
+ {
240
+ "cell_type": "code",
241
+ "execution_count": null,
242
+ "metadata": {
243
+ "id": "RcQn7mGs99p_"
244
+ },
245
+ "outputs": [],
246
+ "source": [
247
+ "from transformers import Mistral3ForConditionalGeneration, FineGrainedFP8Config, BitsAndBytesConfig\n",
248
+ "import torch\n",
249
+ "\n",
250
+ "FP8 = False\n",
251
+ "\n",
252
+ "if FP8:\n",
253
+ " model_name = \"mistralai/Ministral-3-3B-Instruct-2512\"\n",
254
+ " quantization_config = FineGrainedFP8Config(dequantize=False)\n",
255
+ "else:\n",
256
+ " model_name = \"mistralai/Ministral-3-3B-Instruct-2512-BF16\" # \"unsloth/Ministral-3-3B-Instruct-2512\"\n",
257
+ " quantization_config = BitsAndBytesConfig(\n",
258
+ " load_in_4bit=True, # Load the model in 4-bit precision to save memory\n",
259
+ " bnb_4bit_compute_dtype=torch.float16, # Data type used for internal computations in quantization\n",
260
+ " bnb_4bit_use_double_quant=True, # Use double quantization to improve accuracy\n",
261
+ " bnb_4bit_quant_type=\"nf4\", # Type of quantization. \"nf4\" is recommended for recent LLMs\n",
262
+ " )\n",
263
+ "\n",
264
+ "model = Mistral3ForConditionalGeneration.from_pretrained(\n",
265
+ " model_name,\n",
266
+ " dtype=\"float32\",\n",
267
+ " device_map=\"auto\",\n",
268
+ " quantization_config=quantization_config,\n",
269
+ ")"
270
+ ]
271
+ },
272
+ {
273
+ "cell_type": "markdown",
274
+ "metadata": {
275
+ "id": "WZGf-GF09Gsc"
276
+ },
277
+ "source": [
278
+ "The following cell defines LoRA (or QLoRA if needed). When training with LoRA/QLoRA, we use a **base model** (the one selected above) and, instead of modifying its original weights, we fine-tune a **LoRA adapter** — a lightweight layer that enables efficient and memory-friendly training. The **`target_modules`** specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning."
279
+ ]
280
+ },
281
+ {
282
+ "cell_type": "code",
283
+ "execution_count": null,
284
+ "metadata": {
285
+ "id": "LqCEI4hf99p_"
286
+ },
287
+ "outputs": [],
288
+ "source": [
289
+ "from peft import LoraConfig\n",
290
+ "\n",
291
+ "# You may need to update `target_modules` depending on the architecture of your chosen model.\n",
292
+ "# For example, different VLMs might have different attention/projection layer names.\n",
293
+ "peft_config = LoraConfig(\n",
294
+ " r=8,\n",
295
+ " lora_alpha=32,\n",
296
+ " lora_dropout=0.1,\n",
297
+ " target_modules=[\"q_proj\", \"v_proj\"],\n",
298
+ ")"
299
+ ]
300
+ },
301
+ {
302
+ "cell_type": "markdown",
303
+ "metadata": {
304
+ "id": "mDq4V6dN9MGk"
305
+ },
306
+ "source": [
307
+ "## Train model\n",
308
+ "\n",
309
+ "We'll configure **GRPO** using `GRPOConfig`, keeping the parameters minimal so the training fits on a free Colab instance. You can adjust these settings if more resources are available. For full details on all available parameters, check the [TRL GRPOConfig documentation](https://huggingface.co/docs/trl/sft_trainer#trl.GRPOConfig).\n",
310
+ "\n",
311
+ "First, we need to define the rewards functions that the training algorithm will use to improve the model. In this case, we'll include two reward functions.\n",
312
+ "We'll use a format reward that will reward the model when the output includes `<think>` and `<answer>` tags and additionally a length-based reward to discourage overthinking. Both functions have been extracted from [here](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py)."
313
+ ]
314
+ },
315
+ {
316
+ "cell_type": "code",
317
+ "execution_count": null,
318
+ "metadata": {
319
+ "id": "jhgqx8kO99p_"
320
+ },
321
+ "outputs": [],
322
+ "source": [
323
+ "import re\n",
324
+ "\n",
325
+ "def format_reward(completions, **kwargs):\n",
326
+ " \"\"\"Reward function that checks if the reasoning process is enclosed within <think> and </think> tags, while the final answer is enclosed within <answer> and </answer> tags.\"\"\"\n",
327
+ " pattern = r\"<think>.*?</think>.*?<answer>.*?</answer>\"\n",
328
+ "\n",
329
+ " matches = []\n",
330
+ " for item in completions:\n",
331
+ " if isinstance(item, list):\n",
332
+ " text = item[0]['content']\n",
333
+ " else:\n",
334
+ " text = item\n",
335
+ " match = re.match(pattern, text, re.DOTALL | re.MULTILINE)\n",
336
+ " matches.append(match)\n",
337
+ "\n",
338
+ " return [1.0 if match else 0.0 for match in matches]"
339
+ ]
340
+ },
341
+ {
342
+ "cell_type": "code",
343
+ "execution_count": null,
344
+ "metadata": {
345
+ "id": "sVmzQ_wL99p_"
346
+ },
347
+ "outputs": [],
348
+ "source": [
349
+ "from math_verify import LatexExtractionConfig, parse, verify\n",
350
+ "from latex2sympy2_extended import NormalizationConfig\n",
351
+ "\n",
352
+ "\n",
353
+ "def len_reward(completions, solution, **kwargs) -> float:\n",
354
+ " \"\"\"Compute length-based rewards to discourage overthinking and promote token efficiency.\n",
355
+ "\n",
356
+ " Taken from the Kimi 1.5 tech report: https://huggingface.co/papers/2501.12599\n",
357
+ "\n",
358
+ " Args:\n",
359
+ " completions: List of model completions\n",
360
+ " solution: List of ground truth solutions\n",
361
+ "\n",
362
+ " Returns:\n",
363
+ " List of rewards where:\n",
364
+ " - For correct answers: reward = 0.5 - (len - min_len)/(max_len - min_len)\n",
365
+ " - For incorrect answers: reward = min(0, 0.5 - (len - min_len)/(max_len - min_len))\n",
366
+ " \"\"\"\n",
367
+ " contents = []\n",
368
+ " for item in completions:\n",
369
+ " if isinstance(item, list):\n",
370
+ " text = item[0]['content']\n",
371
+ " else:\n",
372
+ " text = item\n",
373
+ " contents.append(text)\n",
374
+ "\n",
375
+ " # First check correctness of answers\n",
376
+ " correctness = []\n",
377
+ " for content, sol in zip(contents, solution):\n",
378
+ " gold_parsed = parse(\n",
379
+ " sol,\n",
380
+ " extraction_mode=\"first_match\",\n",
381
+ " extraction_config=[LatexExtractionConfig()],\n",
382
+ " )\n",
383
+ " if len(gold_parsed) == 0:\n",
384
+ " # Skip unparsable examples\n",
385
+ " correctness.append(True) # Treat as correct to avoid penalizing\n",
386
+ " print(\"Failed to parse gold solution: \", sol)\n",
387
+ " continue\n",
388
+ "\n",
389
+ " answer_parsed = parse(\n",
390
+ " content,\n",
391
+ " extraction_config=[\n",
392
+ " LatexExtractionConfig(\n",
393
+ " normalization_config=NormalizationConfig(\n",
394
+ " nits=False,\n",
395
+ " malformed_operators=False,\n",
396
+ " basic_latex=True,\n",
397
+ " equations=True,\n",
398
+ " boxed=True,\n",
399
+ " units=True,\n",
400
+ " ),\n",
401
+ " boxed_match_priority=0,\n",
402
+ " try_extract_without_anchor=False,\n",
403
+ " )\n",
404
+ " ],\n",
405
+ " extraction_mode=\"first_match\",\n",
406
+ " )\n",
407
+ " correctness.append(verify(answer_parsed, gold_parsed))\n",
408
+ "\n",
409
+ " # Calculate lengths\n",
410
+ " lengths = [len(content) for content in contents]\n",
411
+ " min_len = min(lengths)\n",
412
+ " max_len = max(lengths)\n",
413
+ "\n",
414
+ " # If all responses have the same length, return zero rewards\n",
415
+ " if max_len == min_len:\n",
416
+ " return [0.0] * len(completions)\n",
417
+ "\n",
418
+ " rewards = []\n",
419
+ " for length, is_correct in zip(lengths, correctness):\n",
420
+ " lambda_val = 0.5 - (length - min_len) / (max_len - min_len)\n",
421
+ "\n",
422
+ " if is_correct:\n",
423
+ " reward = lambda_val\n",
424
+ " else:\n",
425
+ " reward = min(0, lambda_val)\n",
426
+ "\n",
427
+ " rewards.append(float(reward))\n",
428
+ "\n",
429
+ " return rewards"
430
+ ]
431
+ },
432
+ {
433
+ "cell_type": "markdown",
434
+ "metadata": {
435
+ "id": "9xBL7Rni9LZb"
436
+ },
437
+ "source": [
438
+ "After defining the reward function(s), we can define the `GRPOConfig`."
439
+ ]
440
+ },
441
+ {
442
+ "cell_type": "code",
443
+ "execution_count": null,
444
+ "metadata": {
445
+ "id": "pcv6KXUD99qA"
446
+ },
447
+ "outputs": [],
448
+ "source": [
449
+ "from trl import GRPOConfig\n",
450
+ "\n",
451
+ "output_dir = \"Ministral-3-3B-Instruct-trl-grpo\"\n",
452
+ "\n",
453
+ "# Configure training arguments using GRPOConfig\n",
454
+ "training_args = GRPOConfig(\n",
455
+ " learning_rate=2e-5,\n",
456
+ " #num_train_epochs=1,\n",
457
+ " max_steps=100, # Number of dataset passes. For full trainings, use `num_train_epochs` instead\n",
458
+ "\n",
459
+ " # Parameters that control the data preprocessing\n",
460
+ " per_device_train_batch_size=2,\n",
461
+ " max_completion_length=1024, # default: 256 # Max completion length produced during training\n",
462
+ " num_generations=2, # 2, # default: 8 # Number of generations produced during training for comparison\n",
463
+ "\n",
464
+ " fp16=False,\n",
465
+ " bf16=False,\n",
466
+ "\n",
467
+ " # Parameters related to reporting and saving\n",
468
+ " output_dir=output_dir, # Where to save model checkpoints and logs\n",
469
+ " logging_steps=1, # Log training metrics every N steps\n",
470
+ " report_to=\"trackio\", # Experiment tracking tool\n",
471
+ " trackio_space_id = output_dir,\n",
472
+ "\n",
473
+ " # Hub integration\n",
474
+ " push_to_hub=True,\n",
475
+ " log_completions=True,\n",
476
+ ")"
477
+ ]
478
+ },
479
+ {
480
+ "cell_type": "markdown",
481
+ "metadata": {
482
+ "id": "O0q3myQg927v"
483
+ },
484
+ "source": [
485
+ "Configure the GRPO Trainer. We pass the previously configured `training_args`. We don't use eval dataset to maintain memory usage low but you can configure it."
486
+ ]
487
+ },
488
+ {
489
+ "cell_type": "code",
490
+ "execution_count": null,
491
+ "metadata": {
492
+ "id": "-zd7s5Cs99qA"
493
+ },
494
+ "outputs": [],
495
+ "source": [
496
+ "from trl import GRPOTrainer\n",
497
+ "\n",
498
+ "trainer = GRPOTrainer(\n",
499
+ " model=model,\n",
500
+ " reward_funcs=[format_reward, len_reward],\n",
501
+ " args=training_args,\n",
502
+ " train_dataset=train_dataset,\n",
503
+ " peft_config=peft_config,\n",
504
+ ")"
505
+ ]
506
+ },
507
+ {
508
+ "cell_type": "markdown",
509
+ "metadata": {
510
+ "id": "kQC7Q5kg95xq"
511
+ },
512
+ "source": [
513
+ "Show memory stats before training"
514
+ ]
515
+ },
516
+ {
517
+ "cell_type": "code",
518
+ "execution_count": null,
519
+ "metadata": {
520
+ "id": "iF7cnD0T99qA"
521
+ },
522
+ "outputs": [],
523
+ "source": [
524
+ "gpu_stats = torch.cuda.get_device_properties(0)\n",
525
+ "start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
526
+ "max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)\n",
527
+ "\n",
528
+ "print(f\"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.\")\n",
529
+ "print(f\"{start_gpu_memory} GB of memory reserved.\")"
530
+ ]
531
+ },
532
+ {
533
+ "cell_type": "markdown",
534
+ "metadata": {
535
+ "id": "YazYtLAe97Dc"
536
+ },
537
+ "source": [
538
+ "And train!"
539
+ ]
540
+ },
541
+ {
542
+ "cell_type": "code",
543
+ "execution_count": null,
544
+ "metadata": {
545
+ "id": "Ynhxdv3a99qA"
546
+ },
547
+ "outputs": [],
548
+ "source": [
549
+ "trainer_stats = trainer.train()"
550
+ ]
551
+ },
552
+ {
553
+ "cell_type": "markdown",
554
+ "metadata": {
555
+ "id": "SmcYN5yW99IP"
556
+ },
557
+ "source": [
558
+ "Show memory stats after training"
559
+ ]
560
+ },
561
+ {
562
+ "cell_type": "code",
563
+ "execution_count": null,
564
+ "metadata": {
565
+ "id": "mi-exH7699qA"
566
+ },
567
+ "outputs": [],
568
+ "source": [
569
+ "used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
570
+ "used_memory_for_lora = round(used_memory - start_gpu_memory, 3)\n",
571
+ "used_percentage = round(used_memory / max_memory * 100, 3)\n",
572
+ "lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)\n",
573
+ "\n",
574
+ "print(f\"{trainer_stats.metrics['train_runtime']} seconds used for training.\")\n",
575
+ "print(f\"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\")\n",
576
+ "print(f\"Peak reserved memory = {used_memory} GB.\")\n",
577
+ "print(f\"Peak reserved memory for training = {used_memory_for_lora} GB.\")\n",
578
+ "print(f\"Peak reserved memory % of max memory = {used_percentage} %.\")\n",
579
+ "print(f\"Peak reserved memory for training % of max memory = {lora_percentage} %.\")"
580
+ ]
581
+ },
582
+ {
583
+ "cell_type": "markdown",
584
+ "metadata": {
585
+ "id": "saarW87Y9_-R"
586
+ },
587
+ "source": [
588
+ "## Saving fine tuned model\n",
589
+ "\n",
590
+ "In this step, we save the fine-tuned model both **locally** and to the **Hugging Face Hub** using the credentials from your account."
591
+ ]
592
+ },
593
+ {
594
+ "cell_type": "code",
595
+ "execution_count": null,
596
+ "metadata": {
597
+ "id": "m3mlwQl699qA"
598
+ },
599
+ "outputs": [],
600
+ "source": [
601
+ "trainer.save_model(output_dir)\n",
602
+ "trainer.push_to_hub(dataset_name=dataset_id)"
603
+ ]
604
+ },
605
+ {
606
+ "cell_type": "markdown",
607
+ "metadata": {
608
+ "id": "nfqvO0qw-OvS"
609
+ },
610
+ "source": [
611
+ "## Load the fine-tuned model and run inference\n",
612
+ "\n",
613
+ "Now, let's test our fine-tuned model by loading the **LoRA/QLoRA adapter** and performing **inference**. We'll start by loading the **base model**, then attach the adapter to it, creating the final fine-tuned model ready for evaluation."
614
+ ]
615
+ },
616
+ {
617
+ "cell_type": "code",
618
+ "execution_count": null,
619
+ "metadata": {
620
+ "id": "B7usNBq699qA"
621
+ },
622
+ "outputs": [],
623
+ "source": [
624
+ "from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend\n",
625
+ "from peft import PeftModel\n",
626
+ "\n",
627
+ "base_model = model_name\n",
628
+ "adapter_model = f\"{output_dir}\" # Replace with your HF username or organization\n",
629
+ "\n",
630
+ "model = Mistral3ForConditionalGeneration.from_pretrained(base_model, dtype=\"float32\", device_map=\"auto\")\n",
631
+ "model = PeftModel.from_pretrained(model, adapter_model)\n",
632
+ "\n",
633
+ "tokenizer = MistralCommonBackend.from_pretrained(base_model)"
634
+ ]
635
+ },
636
+ {
637
+ "cell_type": "code",
638
+ "execution_count": null,
639
+ "metadata": {
640
+ "id": "XnIOkXfy99qA"
641
+ },
642
+ "outputs": [],
643
+ "source": [
644
+ "train_dataset[0]"
645
+ ]
646
+ },
647
+ {
648
+ "cell_type": "code",
649
+ "execution_count": null,
650
+ "metadata": {
651
+ "id": "0le5gBl_99qA"
652
+ },
653
+ "outputs": [],
654
+ "source": [
655
+ "from datasets import load_dataset\n",
656
+ "import base64\n",
657
+ "from io import BytesIO\n",
658
+ "\n",
659
+ "dataset_id = 'lmms-lab/multimodal-open-r1-8k-verified'\n",
660
+ "train_dataset = load_dataset(dataset_id, split='train[:5%]')\n",
661
+ "\n",
662
+ "problem = train_dataset[0]['problem']\n",
663
+ "image = train_dataset[0]['image']\n",
664
+ "\n",
665
+ "buffer = BytesIO()\n",
666
+ "image.save(buffer, format=\"JPEG\")\n",
667
+ "image_bytes = buffer.getvalue()\n",
668
+ "image_b64 = base64.b64encode(image_bytes).decode(\"utf-8\")\n",
669
+ "\n",
670
+ "messages = [\n",
671
+ " {\n",
672
+ " \"role\": \"system\", \"content\": [\n",
673
+ " {\"type\": \"text\", \"text\": SYSTEM_PROMPT}\n",
674
+ " ]\n",
675
+ " },\n",
676
+ " {\n",
677
+ " \"role\": \"user\",\n",
678
+ " \"content\": [\n",
679
+ " {\n",
680
+ " \"type\": \"image_url\",\n",
681
+ " \"image_url\": {\n",
682
+ " \"url\": f\"data:image/jpeg;base64,{image_b64}\"\n",
683
+ " },\n",
684
+ " },\n",
685
+ " {\"type\": \"text\", \"text\": problem},\n",
686
+ " ],\n",
687
+ " },\n",
688
+ "]"
689
+ ]
690
+ },
691
+ {
692
+ "cell_type": "code",
693
+ "execution_count": null,
694
+ "metadata": {
695
+ "id": "f9PgBCD499qA"
696
+ },
697
+ "outputs": [],
698
+ "source": [
699
+ "messages"
700
+ ]
701
+ },
702
+ {
703
+ "cell_type": "code",
704
+ "execution_count": null,
705
+ "metadata": {
706
+ "id": "ENOGILKk99qA"
707
+ },
708
+ "outputs": [],
709
+ "source": [
710
+ "import torch\n",
711
+ "\n",
712
+ "tokenized = tokenizer.apply_chat_template(messages, return_tensors=\"pt\", return_dict=True)\n",
713
+ "tokenized[\"input_ids\"] = tokenized[\"input_ids\"].to(device=\"cuda\")\n",
714
+ "tokenized[\"pixel_values\"] = tokenized[\"pixel_values\"].to(dtype=torch.bfloat16, device=\"cuda\")\n",
715
+ "image_sizes = [tokenized[\"pixel_values\"].shape[-2:]]\n",
716
+ "\n",
717
+ "output = model.generate(\n",
718
+ " **tokenized,\n",
719
+ " image_sizes=image_sizes,\n",
720
+ " max_new_tokens=512,\n",
721
+ ")[0]\n",
722
+ "\n",
723
+ "decoded_output = tokenizer.decode(output[len(tokenized[\"input_ids\"][0]):])\n",
724
+ "print(decoded_output)"
725
+ ]
726
+ }
727
+ ],
728
+ "metadata": {
729
+ "accelerator": "GPU",
730
+ "colab": {
731
+ "gpuType": "T4",
732
+ "provenance": []
733
+ },
734
+ "language_info": {
735
+ "name": "python"
736
+ }
737
+ },
738
+ "nbformat": 4,
739
+ "nbformat_minor": 0
740
+ }
ICL/RL/trl_source/examples/notebooks/grpo_qwen3_vl.ipynb ADDED
@@ -0,0 +1,693 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "-J8iGzLf4rUJ"
7
+ },
8
+ "source": [
9
+ "# GRPO Qwen3-VL with QLoRA using TRL\n",
10
+ "\n",
11
+ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_qwen3_vl.ipynb)\n",
12
+ "\n",
13
+ "![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)\n",
14
+ "\n",
15
+ "\n",
16
+ "With [**Transformers Reinforcement Learning (TRL)**](https://github.com/huggingface/trl), you can fine-tune cutting edge vision language models. It comes with support for quantized parameter efficient fine-tuning technique **QLoRA**, so we can use free Colab (T4 GPU) to fine-tune models like [Qwen3-VL](https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe).\n",
17
+ "\n",
18
+ "\n",
19
+ "- [TRL GitHub Repository](https://github.com/huggingface/trl) — star us to support the project! \n",
20
+ "- [Official TRL Examples](https://huggingface.co/docs/trl/example_overview) \n",
21
+ "- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)\n",
22
+ "- [More Qwen3-VL Fine-tuning Examples (including TRL scripts)](https://github.com/QwenLM/Qwen3-VL/tree/main/qwen-vl-finetune/)"
23
+ ]
24
+ },
25
+ {
26
+ "cell_type": "markdown",
27
+ "metadata": {
28
+ "id": "NvrzGRnu48Vz"
29
+ },
30
+ "source": [
31
+ "## Install dependencies\n",
32
+ "\n",
33
+ "We'll install **TRL** with the **PEFT** extra, which ensures all main dependencies such as **Transformers** and **PEFT** (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install **trackio** to log and monitor our experiments, and **bitsandbytes** to enable quantization of LLMs, reducing memory consumption for both inference and training."
34
+ ]
35
+ },
36
+ {
37
+ "cell_type": "code",
38
+ "execution_count": null,
39
+ "metadata": {
40
+ "id": "8CfZlUevmkg7"
41
+ },
42
+ "outputs": [],
43
+ "source": [
44
+ "!pip install -Uq \"trl[peft]\" bitsandbytes trackio math_verify"
45
+ ]
46
+ },
47
+ {
48
+ "cell_type": "markdown",
49
+ "metadata": {
50
+ "id": "gpzI6omi7728"
51
+ },
52
+ "source": [
53
+ "### Log in to Hugging Face\n",
54
+ "\n",
55
+ "Log in to your **Hugging Face** account to save your fine-tuned model, track your experiment results directly on the Hub or access gated models. You can find your **access token** on your [account settings page](https://huggingface.co/settings/tokens)."
56
+ ]
57
+ },
58
+ {
59
+ "cell_type": "code",
60
+ "execution_count": null,
61
+ "metadata": {
62
+ "id": "4Ncx0wYtnYCW"
63
+ },
64
+ "outputs": [],
65
+ "source": [
66
+ "from huggingface_hub import notebook_login\n",
67
+ "\n",
68
+ "notebook_login()"
69
+ ]
70
+ },
71
+ {
72
+ "cell_type": "markdown",
73
+ "metadata": {
74
+ "id": "V_Zylc4t79-n"
75
+ },
76
+ "source": [
77
+ "## Load dataset\n",
78
+ "\n",
79
+ "\n",
80
+ "We'll load the [**lmms-lab/multimodal-open-r1-8k-verified**](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) dataset from the Hugging Face Hub using the `datasets` library.\n",
81
+ "\n",
82
+ "This dataset contains maths problems with the image representing the problem, along with the solution in thinking format specially tailored for VLMs. By training our model with this dataset, it'll improve its maths and thinking reasoning.\n"
83
+ ]
84
+ },
85
+ {
86
+ "cell_type": "code",
87
+ "execution_count": null,
88
+ "metadata": {
89
+ "id": "TzXogU24F_QR"
90
+ },
91
+ "outputs": [],
92
+ "source": [
93
+ "from datasets import load_dataset\n",
94
+ "\n",
95
+ "dataset_id = 'lmms-lab/multimodal-open-r1-8k-verified'\n",
96
+ "train_dataset = load_dataset(dataset_id, split='train[:5%]')"
97
+ ]
98
+ },
99
+ {
100
+ "cell_type": "markdown",
101
+ "metadata": {
102
+ "id": "gVV7RoRN8zk5"
103
+ },
104
+ "source": [
105
+ "In addition to the `problem` and `image` columns, we also include a custom system prompt to tell the model how we'd like the generation.\n",
106
+ "\n",
107
+ "The system prompt is extracted from DeepSeek R1. Refer to [this previous recipe](https://huggingface.co/learn/cookbook/fine_tuning_llm_grpo_trl) for more details.\n",
108
+ "\n",
109
+ "We convert the dataset samples into conversation samples, including the system prompt and one image and problem description per sample, since this is how the GRPO trainer expects them.\n",
110
+ "\n",
111
+ "We also set `padding_side=\"left\"` to ensure that generated completions during training are concatenated directly after the prompt, which is essential for GRPO to correctly compare token-level probabilities between preferred and rejected responses."
112
+ ]
113
+ },
114
+ {
115
+ "cell_type": "code",
116
+ "execution_count": null,
117
+ "metadata": {
118
+ "id": "ZT1JfiiTGExB"
119
+ },
120
+ "outputs": [],
121
+ "source": [
122
+ "from transformers import AutoProcessor\n",
123
+ "\n",
124
+ "model_name = \"Qwen/Qwen3-VL-4B-Instruct\" # \"Qwen/Qwen3-VL-8B-Instruct\"\n",
125
+ "processor = AutoProcessor.from_pretrained(model_name, padding_side=\"left\")\n",
126
+ "\n",
127
+ "SYSTEM_PROMPT = (\n",
128
+ " \"You are a helpful AI Assistant that provides well-reasoned and detailed responses. \"\n",
129
+ " \"You first think about the reasoning process as an internal monologue and then provide the user with the answer. \"\n",
130
+ " \"Respond in the following format: <think>\\n...\\n</think>\\n<answer>\\n...\\n</answer>\"\n",
131
+ ")\n",
132
+ "\n",
133
+ "\n",
134
+ "def make_conversation(example):\n",
135
+ " conversation = [\n",
136
+ " {\n",
137
+ " \"role\": \"system\",\n",
138
+ " \"content\": [{\"type\": \"text\", \"text\": SYSTEM_PROMPT}],\n",
139
+ " },\n",
140
+ " {\n",
141
+ " \"role\": \"user\",\n",
142
+ " \"content\": [\n",
143
+ " {\"type\": \"image\", \"image\": example[\"image\"]},\n",
144
+ " {\"type\": \"text\", \"text\": example[\"problem\"]},\n",
145
+ " ],\n",
146
+ " },\n",
147
+ " ]\n",
148
+ " prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)\n",
149
+ " return {\n",
150
+ " \"prompt\": prompt,\n",
151
+ " \"image\": example[\"image\"],\n",
152
+ " }\n",
153
+ "\n",
154
+ "train_dataset = train_dataset.map(make_conversation)"
155
+ ]
156
+ },
157
+ {
158
+ "cell_type": "markdown",
159
+ "metadata": {
160
+ "id": "5txAuMAa8ock"
161
+ },
162
+ "source": [
163
+ "Let's review one example to understand the internal structure:"
164
+ ]
165
+ },
166
+ {
167
+ "cell_type": "code",
168
+ "execution_count": null,
169
+ "metadata": {
170
+ "id": "PDXQd5Jk2Bqe"
171
+ },
172
+ "outputs": [],
173
+ "source": [
174
+ "train_dataset[0]"
175
+ ]
176
+ },
177
+ {
178
+ "cell_type": "code",
179
+ "execution_count": null,
180
+ "metadata": {
181
+ "id": "hzSR_56wxKDA"
182
+ },
183
+ "outputs": [],
184
+ "source": [
185
+ "train_dataset = train_dataset.remove_columns(['problem', 'original_question', 'original_answer'])"
186
+ ]
187
+ },
188
+ {
189
+ "cell_type": "code",
190
+ "execution_count": null,
191
+ "metadata": {
192
+ "id": "T9rCkeqDODba"
193
+ },
194
+ "outputs": [],
195
+ "source": [
196
+ "train_dataset[0]"
197
+ ]
198
+ },
199
+ {
200
+ "cell_type": "markdown",
201
+ "metadata": {
202
+ "id": "YY3uMp909Eqy"
203
+ },
204
+ "source": [
205
+ "## Load model and configure LoRA/QLoRA\n",
206
+ "\n",
207
+ "This notebook can be used with two fine-tuning methods. By default, it is set up for **QLoRA**, which includes quantization using `BitsAndBytesConfig`. If you prefer to use standard **LoRA** without quantization, simply comment out the `BitsAndBytesConfig` configuration."
208
+ ]
209
+ },
210
+ {
211
+ "cell_type": "code",
212
+ "execution_count": null,
213
+ "metadata": {
214
+ "id": "gt05dgXgm9QR"
215
+ },
216
+ "outputs": [],
217
+ "source": [
218
+ "from transformers import Qwen3VLForConditionalGeneration, BitsAndBytesConfig\n",
219
+ "import torch\n",
220
+ "\n",
221
+ "model = Qwen3VLForConditionalGeneration.from_pretrained(\n",
222
+ " model_name, dtype=\"float32\",\n",
223
+ " device_map=\"auto\",\n",
224
+ " quantization_config=BitsAndBytesConfig(\n",
225
+ " load_in_4bit=True,\n",
226
+ " bnb_4bit_use_double_quant=True,\n",
227
+ " bnb_4bit_quant_type=\"nf4\",\n",
228
+ " bnb_4bit_compute_dtype=torch.float16\n",
229
+ " ),\n",
230
+ ")"
231
+ ]
232
+ },
233
+ {
234
+ "cell_type": "markdown",
235
+ "metadata": {
236
+ "id": "WZGf-GF09Gsc"
237
+ },
238
+ "source": [
239
+ "The following cell defines LoRA (or QLoRA if needed). When training with LoRA/QLoRA, we use a **base model** (the one selected above) and, instead of modifying its original weights, we fine-tune a **LoRA adapter** — a lightweight layer that enables efficient and memory-friendly training. The **`target_modules`** specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning."
240
+ ]
241
+ },
242
+ {
243
+ "cell_type": "code",
244
+ "execution_count": null,
245
+ "metadata": {
246
+ "id": "ME1im5gh2LFg"
247
+ },
248
+ "outputs": [],
249
+ "source": [
250
+ "from peft import LoraConfig\n",
251
+ "\n",
252
+ "# You may need to update `target_modules` depending on the architecture of your chosen model.\n",
253
+ "# For example, different VLMs might have different attention/projection layer names.\n",
254
+ "peft_config = LoraConfig(\n",
255
+ " r=8,\n",
256
+ " lora_alpha=32,\n",
257
+ " lora_dropout=0.1,\n",
258
+ " target_modules=[\"q_proj\", \"v_proj\"],\n",
259
+ ")"
260
+ ]
261
+ },
262
+ {
263
+ "cell_type": "markdown",
264
+ "metadata": {
265
+ "id": "mDq4V6dN9MGk"
266
+ },
267
+ "source": [
268
+ "## Train model\n",
269
+ "\n",
270
+ "We'll configure **GRPO** using `GRPOConfig`, keeping the parameters minimal so the training fits on a free Colab instance. You can adjust these settings if more resources are available. For full details on all available parameters, check the [TRL GRPOConfig documentation](https://huggingface.co/docs/trl/sft_trainer#trl.GRPOConfig).\n",
271
+ "\n",
272
+ "First, we need to define the rewards functions that the training algorithm will use to improve the model. In this case, we'll include two reward functions.\n",
273
+ "We'll use a format reward that will reward the model when the output includes `<think>` and `<answer>` tags and additionally a length-based reward to discourage overthinking. Both functions have been extracted from [here](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py)."
274
+ ]
275
+ },
276
+ {
277
+ "cell_type": "code",
278
+ "execution_count": null,
279
+ "metadata": {
280
+ "id": "Dqp3TfUwHUxW"
281
+ },
282
+ "outputs": [],
283
+ "source": [
284
+ "import re\n",
285
+ "\n",
286
+ "def format_reward(completions, **kwargs):\n",
287
+ " \"\"\"Reward function that checks if the reasoning process is enclosed within <think> and </think> tags, while the final answer is enclosed within <answer> and </answer> tags.\"\"\"\n",
288
+ " pattern = r\"^<think>\\n.*?\\n</think>\\n<answer>\\n.*?\\n</answer>$\"\n",
289
+ " matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completions]\n",
290
+ " return [1.0 if match else 0.0 for match in matches]"
291
+ ]
292
+ },
293
+ {
294
+ "cell_type": "code",
295
+ "execution_count": null,
296
+ "metadata": {
297
+ "id": "rxNPUp7RBFcz"
298
+ },
299
+ "outputs": [],
300
+ "source": [
301
+ "from math_verify import LatexExtractionConfig, parse, verify\n",
302
+ "from latex2sympy2_extended import NormalizationConfig\n",
303
+ "\n",
304
+ "\n",
305
+ "def len_reward(completions, solution, **kwargs) -> float:\n",
306
+ " \"\"\"Compute length-based rewards to discourage overthinking and promote token efficiency.\n",
307
+ "\n",
308
+ " Taken from the Kimi 1.5 tech report: https://huggingface.co/papers/2501.12599\n",
309
+ "\n",
310
+ " Args:\n",
311
+ " completions: List of model completions\n",
312
+ " solution: List of ground truth solutions\n",
313
+ "\n",
314
+ " Returns:\n",
315
+ " List of rewards where:\n",
316
+ " - For correct answers: reward = 0.5 - (len - min_len)/(max_len - min_len)\n",
317
+ " - For incorrect answers: reward = min(0, 0.5 - (len - min_len)/(max_len - min_len))\n",
318
+ " \"\"\"\n",
319
+ " contents = completions\n",
320
+ "\n",
321
+ " # First check correctness of answers\n",
322
+ " correctness = []\n",
323
+ " for content, sol in zip(contents, solution):\n",
324
+ " gold_parsed = parse(\n",
325
+ " sol,\n",
326
+ " extraction_mode=\"first_match\",\n",
327
+ " extraction_config=[LatexExtractionConfig()],\n",
328
+ " )\n",
329
+ " if len(gold_parsed) == 0:\n",
330
+ " # Skip unparsable examples\n",
331
+ " correctness.append(True) # Treat as correct to avoid penalizing\n",
332
+ " print(\"Failed to parse gold solution: \", sol)\n",
333
+ " continue\n",
334
+ "\n",
335
+ " answer_parsed = parse(\n",
336
+ " content,\n",
337
+ " extraction_config=[\n",
338
+ " LatexExtractionConfig(\n",
339
+ " normalization_config=NormalizationConfig(\n",
340
+ " nits=False,\n",
341
+ " malformed_operators=False,\n",
342
+ " basic_latex=True,\n",
343
+ " equations=True,\n",
344
+ " boxed=True,\n",
345
+ " units=True,\n",
346
+ " ),\n",
347
+ " boxed_match_priority=0,\n",
348
+ " try_extract_without_anchor=False,\n",
349
+ " )\n",
350
+ " ],\n",
351
+ " extraction_mode=\"first_match\",\n",
352
+ " )\n",
353
+ " correctness.append(verify(answer_parsed, gold_parsed))\n",
354
+ "\n",
355
+ " # Calculate lengths\n",
356
+ " lengths = [len(content) for content in contents]\n",
357
+ " min_len = min(lengths)\n",
358
+ " max_len = max(lengths)\n",
359
+ "\n",
360
+ " # If all responses have the same length, return zero rewards\n",
361
+ " if max_len == min_len:\n",
362
+ " return [0.0] * len(completions)\n",
363
+ "\n",
364
+ " rewards = []\n",
365
+ " for length, is_correct in zip(lengths, correctness):\n",
366
+ " lambda_val = 0.5 - (length - min_len) / (max_len - min_len)\n",
367
+ "\n",
368
+ " if is_correct:\n",
369
+ " reward = lambda_val\n",
370
+ " else:\n",
371
+ " reward = min(0, lambda_val)\n",
372
+ "\n",
373
+ " rewards.append(float(reward))\n",
374
+ "\n",
375
+ " return rewards\n"
376
+ ]
377
+ },
378
+ {
379
+ "cell_type": "markdown",
380
+ "metadata": {
381
+ "id": "9xBL7Rni9LZb"
382
+ },
383
+ "source": [
384
+ "After defining the reward function(s), we can define the `GRPOConfig`."
385
+ ]
386
+ },
387
+ {
388
+ "cell_type": "code",
389
+ "execution_count": null,
390
+ "metadata": {
391
+ "id": "OEmRM0rIHXQ4"
392
+ },
393
+ "outputs": [],
394
+ "source": [
395
+ "from trl import GRPOConfig\n",
396
+ "\n",
397
+ "output_dir = \"Qwen3-VL-4B-Instruct-trl-grpo\"\n",
398
+ "\n",
399
+ "# Configure training arguments using GRPOConfig\n",
400
+ "training_args = GRPOConfig(\n",
401
+ " learning_rate=2e-5,\n",
402
+ " #num_train_epochs=1,\n",
403
+ " max_steps=100, # Number of dataset passes. For full trainings, use `num_train_epochs` instead\n",
404
+ "\n",
405
+ " # Parameters that control the data preprocessing\n",
406
+ " per_device_train_batch_size=2,\n",
407
+ " max_completion_length=1024, # default: 256 # Max completion length produced during training\n",
408
+ " num_generations=2, # 2, # default: 8 # Number of generations produced during training for comparison\n",
409
+ "\n",
410
+ " fp16=True,\n",
411
+ "\n",
412
+ " # Parameters related to reporting and saving\n",
413
+ " output_dir=output_dir, # Where to save model checkpoints and logs\n",
414
+ " logging_steps=1, # Log training metrics every N steps\n",
415
+ " report_to=\"trackio\", # Experiment tracking tool\n",
416
+ "\n",
417
+ " # Hub integration\n",
418
+ " push_to_hub=True,\n",
419
+ " log_completions=True\n",
420
+ ")"
421
+ ]
422
+ },
423
+ {
424
+ "cell_type": "markdown",
425
+ "metadata": {
426
+ "id": "O0q3myQg927v"
427
+ },
428
+ "source": [
429
+ "Configure the GRPO Trainer. We pass the previously configured `training_args`. We don't use an eval dataset to keep memory usage low, but you can configure one."
430
+ ]
431
+ },
432
+ {
433
+ "cell_type": "code",
434
+ "execution_count": null,
435
+ "metadata": {
436
+ "colab": {
437
+ "base_uri": "https://localhost:8080/"
438
+ },
439
+ "id": "z5JxkmS9HqD5",
440
+ "outputId": "2b39338e-2194-4829-fc54-5e286566fd28"
441
+ },
442
+ "outputs": [
443
+ {
444
+ "name": "stderr",
445
+ "output_type": "stream",
446
+ "text": [
447
+ "/usr/local/lib/python3.12/dist-packages/peft/mapping_func.py:73: UserWarning: You are trying to modify a model with PEFT for a second time. If you want to reload the model with a different config, make sure to call `.unload()` before.\n",
448
+ " warnings.warn(\n",
449
+ "/usr/local/lib/python3.12/dist-packages/peft/tuners/tuners_utils.py:196: UserWarning: Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!\n",
450
+ " warnings.warn(\n"
451
+ ]
452
+ }
453
+ ],
454
+ "source": [
455
+ "from trl import GRPOTrainer\n",
456
+ "\n",
457
+ "trainer = GRPOTrainer(\n",
458
+ " model=model,\n",
459
+ " reward_funcs=[format_reward, len_reward],\n",
460
+ " args=training_args,\n",
461
+ " train_dataset=train_dataset,\n",
462
+ " peft_config=peft_config,\n",
463
+ ")"
464
+ ]
465
+ },
466
+ {
467
+ "cell_type": "markdown",
468
+ "metadata": {
469
+ "id": "kQC7Q5kg95xq"
470
+ },
471
+ "source": [
472
+ "Show memory stats before training"
473
+ ]
474
+ },
475
+ {
476
+ "cell_type": "code",
477
+ "execution_count": null,
478
+ "metadata": {
479
+ "id": "naG_7qlYyBP6"
480
+ },
481
+ "outputs": [],
482
+ "source": [
483
+ "gpu_stats = torch.cuda.get_device_properties(0)\n",
484
+ "start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
485
+ "max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)\n",
486
+ "\n",
487
+ "print(f\"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.\")\n",
488
+ "print(f\"{start_gpu_memory} GB of memory reserved.\")"
489
+ ]
490
+ },
491
+ {
492
+ "cell_type": "markdown",
493
+ "metadata": {
494
+ "id": "YazYtLAe97Dc"
495
+ },
496
+ "source": [
497
+ "And train!"
498
+ ]
499
+ },
500
+ {
501
+ "cell_type": "code",
502
+ "execution_count": null,
503
+ "metadata": {
504
+ "id": "pbJXrhA0ywra"
505
+ },
506
+ "outputs": [],
507
+ "source": [
508
+ "trainer_stats = trainer.train()"
509
+ ]
510
+ },
511
+ {
512
+ "cell_type": "markdown",
513
+ "metadata": {
514
+ "id": "SmcYN5yW99IP"
515
+ },
516
+ "source": [
517
+ "Show memory stats after training"
518
+ ]
519
+ },
520
+ {
521
+ "cell_type": "code",
522
+ "execution_count": null,
523
+ "metadata": {
524
+ "id": "TrrwP4ADMmrp"
525
+ },
526
+ "outputs": [],
527
+ "source": [
528
+ "used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
529
+ "used_memory_for_lora = round(used_memory - start_gpu_memory, 3)\n",
530
+ "used_percentage = round(used_memory / max_memory * 100, 3)\n",
531
+ "lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)\n",
532
+ "\n",
533
+ "print(f\"{trainer_stats.metrics['train_runtime']} seconds used for training.\")\n",
534
+ "print(f\"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\")\n",
535
+ "print(f\"Peak reserved memory = {used_memory} GB.\")\n",
536
+ "print(f\"Peak reserved memory for training = {used_memory_for_lora} GB.\")\n",
537
+ "print(f\"Peak reserved memory % of max memory = {used_percentage} %.\")\n",
538
+ "print(f\"Peak reserved memory for training % of max memory = {lora_percentage} %.\")"
539
+ ]
540
+ },
541
+ {
542
+ "cell_type": "markdown",
543
+ "metadata": {
544
+ "id": "saarW87Y9_-R"
545
+ },
546
+ "source": [
547
+ "## Saving fine tuned model\n",
548
+ "\n",
549
+ "In this step, we save the fine-tuned model both **locally** and to the **Hugging Face Hub** using the credentials from your account."
550
+ ]
551
+ },
552
+ {
553
+ "cell_type": "code",
554
+ "execution_count": null,
555
+ "metadata": {
556
+ "id": "71A8aqEyyETA"
557
+ },
558
+ "outputs": [],
559
+ "source": [
560
+ "trainer.save_model(output_dir)\n",
561
+ "trainer.push_to_hub(dataset_name=dataset_id)"
562
+ ]
563
+ },
564
+ {
565
+ "cell_type": "markdown",
566
+ "metadata": {
567
+ "id": "nfqvO0qw-OvS"
568
+ },
569
+ "source": [
570
+ "## Load the fine-tuned model and run inference\n",
571
+ "\n",
572
+ "Now, let's test our fine-tuned model by loading the **LoRA/QLoRA adapter** and performing **inference**. We'll start by loading the **base model**, then attach the adapter to it, creating the final fine-tuned model ready for evaluation."
573
+ ]
574
+ },
575
+ {
576
+ "cell_type": "code",
577
+ "execution_count": null,
578
+ "metadata": {
579
+ "id": "R8T2uFQVyFeH"
580
+ },
581
+ "outputs": [],
582
+ "source": [
583
+ "from transformers import Qwen3VLForConditionalGeneration, AutoProcessor\n",
584
+ "from peft import PeftModel\n",
585
+ "\n",
586
+ "base_model = model_name\n",
587
+ "adapter_model = f\"{output_dir}\" # Replace with your HF username or organization\n",
588
+ "\n",
589
+ "model = Qwen3VLForConditionalGeneration.from_pretrained(base_model, dtype=\"float32\", device_map=\"auto\")\n",
590
+ "model = PeftModel.from_pretrained(model, adapter_model)\n",
591
+ "\n",
592
+ "processor = AutoProcessor.from_pretrained(base_model)"
593
+ ]
594
+ },
595
+ {
596
+ "cell_type": "code",
597
+ "execution_count": null,
598
+ "metadata": {
599
+ "id": "dPBHP0CpLa6K"
600
+ },
601
+ "outputs": [],
602
+ "source": [
603
+ "train_dataset[0]"
604
+ ]
605
+ },
606
+ {
607
+ "cell_type": "code",
608
+ "execution_count": null,
609
+ "metadata": {
610
+ "id": "cG5-ccGRyHgo"
611
+ },
612
+ "outputs": [],
613
+ "source": [
614
+ "from datasets import load_dataset\n",
615
+ "\n",
616
+ "dataset_id = 'lmms-lab/multimodal-open-r1-8k-verified'\n",
617
+ "train_dataset = load_dataset(dataset_id, split='train[:5%]')\n",
618
+ "\n",
619
+ "problem = train_dataset[0]['problem']\n",
620
+ "image = train_dataset[0]['image']\n",
621
+ "\n",
622
+ "messages = [\n",
623
+ " {\n",
624
+ " \"role\": \"system\", \"content\": [\n",
625
+ " {\"type\": \"text\", \"text\": SYSTEM_PROMPT}\n",
626
+ " ]\n",
627
+ " },\n",
628
+ " {\n",
629
+ " \"role\": \"user\",\n",
630
+ " \"content\": [\n",
631
+ " {\"type\": \"image\", \"image\": image},\n",
632
+ " {\"type\": \"text\", \"text\": problem},\n",
633
+ " ],\n",
634
+ " },\n",
635
+ "]"
636
+ ]
637
+ },
638
+ {
639
+ "cell_type": "code",
640
+ "execution_count": null,
641
+ "metadata": {
642
+ "id": "r_70q_8lLgfV"
643
+ },
644
+ "outputs": [],
645
+ "source": [
646
+ "messages"
647
+ ]
648
+ },
649
+ {
650
+ "cell_type": "code",
651
+ "execution_count": null,
652
+ "metadata": {
653
+ "id": "PX92MjqlyIwB"
654
+ },
655
+ "outputs": [],
656
+ "source": [
657
+ "inputs = processor.apply_chat_template(\n",
658
+ " messages,\n",
659
+ " add_generation_prompt=True,\n",
660
+ " tokenize=True,\n",
661
+ " return_tensors=\"pt\",\n",
662
+ " return_dict=True,\n",
663
+ ").to(model.device)\n",
664
+ "\n",
665
+ "# Inference: Generation of the output\n",
666
+ "generated_ids = model.generate(**inputs, max_new_tokens=500)\n",
667
+ "generated_ids_trimmed = [\n",
668
+ " out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)\n",
669
+ "]\n",
670
+ "output_text = processor.batch_decode(\n",
671
+ " generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False\n",
672
+ ")\n",
673
+ "print(output_text)"
674
+ ]
675
+ }
676
+ ],
677
+ "metadata": {
678
+ "accelerator": "GPU",
679
+ "colab": {
680
+ "gpuType": "T4",
681
+ "provenance": []
682
+ },
683
+ "kernelspec": {
684
+ "display_name": "Python 3",
685
+ "name": "python3"
686
+ },
687
+ "language_info": {
688
+ "name": "python"
689
+ }
690
+ },
691
+ "nbformat": 4,
692
+ "nbformat_minor": 0
693
+ }
ICL/RL/trl_source/examples/notebooks/grpo_rnj_1_instruct.ipynb ADDED
@@ -0,0 +1,622 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "-J8iGzLf4rUJ"
7
+ },
8
+ "source": [
9
+ "# GRPO EssentialAI/rnj-1-instruct with QLoRA using TRL\n",
10
+ "\n",
11
+ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_rnj_1_instruct.ipynb)\n",
12
+ "\n",
13
+ "![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)\n",
14
+ "\n",
15
+ "\n",
16
+ "With [**Transformers Reinforcement Learning (TRL)**](https://github.com/huggingface/trl), you can fine-tune cutting edge large language models. It comes with support for quantized parameter efficient fine-tuning technique **QLoRA**, so we can use Colab to fine-tune models like [EssentialAI/rnj-1-instruct](https://huggingface.co/collections/EssentialAI/rnj-1).\n",
17
+ "\n",
18
+ "\n",
19
+ "- [TRL GitHub Repository](https://github.com/huggingface/trl) — star us to support the project! \n",
20
+ "- [Official TRL Examples](https://huggingface.co/docs/trl/example_overview) \n",
21
+ "- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)\n",
22
+ "\n",
23
+ "In this notebook, we'll add reasoning capabilities to the model, teaching it to generate reasoning traces (`<think></think>`) before giving us the final answer (`<answer></answer>`)."
24
+ ]
25
+ },
26
+ {
27
+ "cell_type": "markdown",
28
+ "metadata": {
29
+ "id": "NvrzGRnu48Vz"
30
+ },
31
+ "source": [
32
+ "## Install dependencies\n",
33
+ "\n",
34
+ "We'll install **TRL** with the **PEFT** extra, which ensures all main dependencies such as **Transformers** and **PEFT** (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install **trackio** to log and monitor our experiments, and **bitsandbytes** to enable quantization of LLMs, reducing memory consumption for both inference and training."
35
+ ]
36
+ },
37
+ {
38
+ "cell_type": "code",
39
+ "execution_count": null,
40
+ "metadata": {
41
+ "id": "8VOdRz9fgFa8"
42
+ },
43
+ "outputs": [],
44
+ "source": [
45
+ "!pip install -Uq \"trl[peft]\" bitsandbytes trackio math_verify"
46
+ ]
47
+ },
48
+ {
49
+ "cell_type": "markdown",
50
+ "metadata": {
51
+ "id": "gpzI6omi7728"
52
+ },
53
+ "source": [
54
+ "### Log in to Hugging Face\n",
55
+ "\n",
56
+ "Log in to your **Hugging Face** account to save your fine-tuned model, track your experiment results directly on the Hub or access gated models. You can find your **access token** on your [account settings page](https://huggingface.co/settings/tokens)."
57
+ ]
58
+ },
59
+ {
60
+ "cell_type": "code",
61
+ "execution_count": null,
62
+ "metadata": {
63
+ "id": "d3j3BsdQgFa8"
64
+ },
65
+ "outputs": [],
66
+ "source": [
67
+ "from huggingface_hub import notebook_login\n",
68
+ "\n",
69
+ "notebook_login()"
70
+ ]
71
+ },
72
+ {
73
+ "cell_type": "markdown",
74
+ "metadata": {
75
+ "id": "V_Zylc4t79-n"
76
+ },
77
+ "source": [
78
+ "## Load dataset\n",
79
+ "\n",
80
+ "\n",
81
+ "We'll load the [**AI-MO/NuminaMath-TIR**](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset from the Hugging Face Hub using the `datasets` library.\n",
82
+ "\n",
83
+ "This dataset contains maths problems, along with the solution in thinking format specially tailored for LLMs. By training our model with this dataset, it'll improve its maths and thinking reasoning.\n",
84
+ "\n",
85
+ "> We only use a subset for educational purposes. In a real scenario, we'd use the complete dataset."
86
+ ]
87
+ },
88
+ {
89
+ "cell_type": "code",
90
+ "execution_count": null,
91
+ "metadata": {
92
+ "id": "YSuLNZAmgFa9"
93
+ },
94
+ "outputs": [],
95
+ "source": [
96
+ "from datasets import load_dataset\n",
97
+ "\n",
98
+ "dataset_id = 'AI-MO/NuminaMath-TIR'\n",
99
+ "train_dataset = load_dataset(dataset_id, split='train[:5%]')"
100
+ ]
101
+ },
102
+ {
103
+ "cell_type": "markdown",
104
+ "metadata": {
105
+ "id": "gVV7RoRN8zk5"
106
+ },
107
+ "source": [
108
+ "In addition to the current columns, we also include a custom system prompt to tell the model how we'd like the generation.\n",
109
+ "\n",
110
+ "This system prompt is an adapted version of the original one extracted from **DeepSeek R1**. For additional background, see [this previous recipe](https://huggingface.co/learn/cookbook/fine_tuning_llm_grpo_trl). We extend the prompt with **examples** and a **more explicit, verbose formulation** to make the desired behavior easier for the model to learn. Depending on your goals, you may further enrich the prompt to simplify learning, or intentionally shorten and harden it to encourage more robust and generalizable behavior.\n",
111
+ "\n",
112
+ "We convert the dataset samples into conversation samples, including the system prompt and problem description per sample, since this is how the GRPO trainer expects them.\n",
113
+ "\n",
114
+ "We also set `padding_side=\"left\"` to ensure that generated completions during training are concatenated directly after the prompt, which is essential for GRPO to correctly compute the token-level probabilities of the generated completions."
115
+ ]
116
+ },
117
+ {
118
+ "cell_type": "code",
119
+ "execution_count": null,
120
+ "metadata": {
121
+ "id": "vr9t-9Z5gFa9"
122
+ },
123
+ "outputs": [],
124
+ "source": [
125
+ "SYSTEM_PROMPT = \"\"\"A conversation between User and Assistant. The user asks a question, and the Assistant solves it.\n",
126
+ "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\n",
127
+ "The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags.\n",
128
+ "Use exactly one <think>...</think> block followed by exactly one <answer>...</answer> block.\n",
129
+ "\n",
130
+ "Examples:\n",
131
+ "\n",
132
+ "User: What is 2 + 2?\n",
133
+ "Assistant:\n",
134
+ "<think>\n",
135
+ "I will add 2 and 2 together.\n",
136
+ "</think>\n",
137
+ "<answer>4</answer>\n",
138
+ "\n",
139
+ "User: What is 3 × 5?\n",
140
+ "Assistant:\n",
141
+ "<think>\n",
142
+ "I will multiply 3 by 5.\n",
143
+ "</think>\n",
144
+ "<answer>15</answer>\n",
145
+ "\n",
146
+ "User: Find the GCD of 12 and 18.\n",
147
+ "Assistant:\n",
148
+ "<think>\n",
149
+ "I will list the divisors of 12 and 18 and find the greatest one they have in common.\n",
150
+ "</think>\n",
151
+ "<answer>6</answer>\n",
152
+ "\"\"\"\n",
153
+ "\n",
154
+ "def make_conversation(example):\n",
155
+ " return {\n",
156
+ " \"prompt\": [\n",
157
+ " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
158
+ " {\"role\": \"user\", \"content\": example[\"problem\"]},\n",
159
+ " ],\n",
160
+ " }\n",
161
+ "\n",
162
+ "train_dataset = train_dataset.map(make_conversation)"
163
+ ]
164
+ },
165
+ {
166
+ "cell_type": "markdown",
167
+ "metadata": {
168
+ "id": "5txAuMAa8ock"
169
+ },
170
+ "source": [
171
+ "Let's review one example to understand the internal structure:"
172
+ ]
173
+ },
174
+ {
175
+ "cell_type": "code",
176
+ "execution_count": null,
177
+ "metadata": {
178
+ "id": "jZtkB0D9gFa9"
179
+ },
180
+ "outputs": [],
181
+ "source": [
182
+ "print(train_dataset[0])"
183
+ ]
184
+ },
185
+ {
186
+ "cell_type": "markdown",
187
+ "metadata": {
188
+ "id": "FtdKjmyFZImL"
189
+ },
190
+ "source": [
191
+ "And remove the columns that are not needed for training:"
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": null,
197
+ "metadata": {
198
+ "id": "Ai4F1GaPgFa-"
199
+ },
200
+ "outputs": [],
201
+ "source": [
202
+ "train_dataset = train_dataset.remove_columns(['messages', 'problem'])\n",
203
+ "print(train_dataset)"
204
+ ]
205
+ },
206
+ {
207
+ "cell_type": "markdown",
208
+ "metadata": {
209
+ "id": "YY3uMp909Eqy"
210
+ },
211
+ "source": [
212
+ "## Load model and configure LoRA/QLoRA\n",
213
+ "\n",
214
+ "This notebook can be used with two fine-tuning methods. By default, it is set up for **QLoRA**, which includes quantization using `BitsAndBytesConfig`. If you prefer to use standard **LoRA** without quantization, simply comment out the `BitsAndBytesConfig` configuration."
215
+ ]
216
+ },
217
+ {
218
+ "cell_type": "code",
219
+ "execution_count": null,
220
+ "metadata": {
221
+ "id": "DSKcUQ9RgFa-"
222
+ },
223
+ "outputs": [],
224
+ "source": [
225
+ "from transformers import AutoModelForCausalLM, BitsAndBytesConfig\n",
226
+ "import torch\n",
227
+ "\n",
228
+ "model_name = \"EssentialAI/rnj-1-instruct\"\n",
229
+ "\n",
230
+ "model = AutoModelForCausalLM.from_pretrained(\n",
231
+ " model_name,\n",
232
+ " dtype=\"float32\",\n",
233
+ " device_map=\"auto\",\n",
234
+ " quantization_config=BitsAndBytesConfig(\n",
235
+ " load_in_4bit=True,\n",
236
+ " bnb_4bit_use_double_quant=True,\n",
237
+ " bnb_4bit_quant_type=\"nf4\",\n",
238
+ " bnb_4bit_compute_dtype=torch.float16\n",
239
+ " ),\n",
240
+ ")"
241
+ ]
242
+ },
243
+ {
244
+ "cell_type": "markdown",
245
+ "metadata": {
246
+ "id": "WZGf-GF09Gsc"
247
+ },
248
+ "source": [
249
+ "The following cell defines LoRA (or QLoRA if needed). When training with LoRA/QLoRA, we use a **base model** (the one selected above) and, instead of modifying its original weights, we fine-tune a **LoRA adapter**, a lightweight layer that enables efficient and memory-friendly training. The **`target_modules`** specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning."
250
+ ]
251
+ },
252
+ {
253
+ "cell_type": "code",
254
+ "execution_count": null,
255
+ "metadata": {
256
+ "id": "nMMlDxJSgFa-"
257
+ },
258
+ "outputs": [],
259
+ "source": [
260
+ "from peft import LoraConfig\n",
261
+ "\n",
262
+ "# You may need to update `target_modules` depending on the architecture of your chosen model.\n",
263
+ "# For example, different LLMs might have different attention/projection layer names.\n",
264
+ "peft_config = LoraConfig(\n",
265
+ " r=32,\n",
266
+ " lora_alpha=32,\n",
267
+ " target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\",],\n",
268
+ ")\n"
269
+ ]
270
+ },
271
+ {
272
+ "cell_type": "markdown",
273
+ "metadata": {
274
+ "id": "mDq4V6dN9MGk"
275
+ },
276
+ "source": [
277
+ "## Train model\n",
278
+ "\n",
279
+ "We'll configure **GRPO** using `GRPOConfig`, keeping the parameters minimal so the training fits on a Colab instance. You can adjust these settings depending on the resources available. For full details on all available parameters, check the [TRL GRPOConfig documentation](https://huggingface.co/docs/trl/grpo_trainer#trl.GRPOConfig).\n",
280
+ "\n",
281
+ "First, we need to define the rewards functions that the training algorithm will use to improve the model. In this case, we'll include just one reward function.\n",
282
+ "We'll use a format reward that will reward the model when the output includes `<think>` and `<answer>` tags. This is a simplification of the pipeline for educational purposes, but in a real scenario, you'd at least need a reward function to check the correctness of the model answer. The function has been extracted from [here](https://github.com/huggingface/open-r1/blob/main/src/open_r1/rewards.py).\n",
283
+ "\n",
284
+ "> 💡 **Note**: \n",
285
+ "> You can further refine this reward by making it more granular. For example, assigning partial rewards when `<think>` and `<answer>` appear independently, or when they are present but incorrectly ordered. This can make the learning signal denser and speed up early training. However, overly simplifying the reward may reduce robustness, even if it helps the model converge faster. In practice, there is a trade-off between ease of learning and the generalization quality of the final model."
286
+ ]
287
+ },
288
+ {
289
+ "cell_type": "code",
290
+ "execution_count": null,
291
+ "metadata": {
292
+ "id": "Rtx5owCRgFa-"
293
+ },
294
+ "outputs": [],
295
+ "source": [
296
+ "import re\n",
297
+ "\n",
298
+ "def format_reward(completions, **kwargs):\n",
299
+ " \"\"\"Reward function that checks if the reasoning process is enclosed within <think> and </think> tags, while the final answer is enclosed within <answer> and </answer> tags.\"\"\"\n",
300
+ " pattern = r\"<think>.*?</think>.*?<answer>.*?</answer>\"\n",
301
+ "\n",
302
+ " matches = []\n",
303
+ " for item in completions:\n",
304
+ " if isinstance(item, list):\n",
305
+ " text = item[0]['content']\n",
306
+ " else:\n",
307
+ " text = item\n",
308
+ " match = re.match(pattern, text, re.DOTALL | re.MULTILINE)\n",
309
+ " matches.append(match)\n",
310
+ "\n",
311
+ " return [1.0 if match else 0.0 for match in matches]"
312
+ ]
313
+ },
314
+ {
315
+ "cell_type": "markdown",
316
+ "metadata": {
317
+ "id": "9xBL7Rni9LZb"
318
+ },
319
+ "source": [
320
+ "After defining the reward function(s), we can define the `GRPOConfig`. You can adapt the values in the config depending on your training setting and even fit the training in more constrained setups like free Colab (T4)."
321
+ ]
322
+ },
323
+ {
324
+ "cell_type": "code",
325
+ "execution_count": null,
326
+ "metadata": {
327
+ "id": "rJ0VfG3wgFa-"
328
+ },
329
+ "outputs": [],
330
+ "source": [
331
+ "from trl import GRPOConfig\n",
332
+ "\n",
333
+ "output_dir = \"EssentialAI-rnj-1-instruct-trl-grpo\"\n",
334
+ "\n",
335
+ "# Configure training arguments using GRPOConfig\n",
336
+ "training_args = GRPOConfig(\n",
337
+ "    learning_rate=2e-5, # Learning rate used during training\n",
338
+ " num_train_epochs=1, # Number of full dataset passes. For testing, use `max_steps` instead\n",
339
+ " #max_steps=100,\n",
340
+ "\n",
341
+ " # Parameters that control the data preprocessing\n",
342
+ " per_device_train_batch_size=8,\n",
343
+ " max_completion_length=256, # default: 256 # Max completion length produced during training\n",
344
+ " num_generations=8, # default: 8 # Number of generations produced during training for comparison\n",
345
+ "\n",
346
+ " # Parameters related to reporting and saving\n",
347
+ " output_dir=output_dir, # Where to save model checkpoints and logs\n",
348
+ " logging_steps=10, # Log training metrics every N steps\n",
349
+ " report_to=\"trackio\", # Experiment tracking tool\n",
350
+ "    trackio_space_id = output_dir, # HF Space where your trackio dashboard will be hosted\n",
351
+ "\n",
352
+ " # Hub integration\n",
353
+ " push_to_hub=True, # Push the resulted model to the Hub\n",
354
+ " log_completions=True, # Log completions during training\n",
355
+ ")"
356
+ ]
357
+ },
358
+ {
359
+ "cell_type": "markdown",
360
+ "metadata": {
361
+ "id": "O0q3myQg927v"
362
+ },
363
+ "source": [
364
+ "Configure the GRPO Trainer. We pass the previously configured `training_args`. We don't use an eval dataset to keep memory usage low, but you can configure one."
365
+ ]
366
+ },
367
+ {
368
+ "cell_type": "code",
369
+ "execution_count": null,
370
+ "metadata": {
371
+ "id": "aW7Gi4nXgFa-"
372
+ },
373
+ "outputs": [],
374
+ "source": [
375
+ "from trl import GRPOTrainer\n",
376
+ "\n",
377
+ "trainer = GRPOTrainer(\n",
378
+ " model=model,\n",
379
+ " reward_funcs=[format_reward],\n",
380
+ " args=training_args,\n",
381
+ " train_dataset=train_dataset,\n",
382
+ " peft_config=peft_config,\n",
383
+ ")"
384
+ ]
385
+ },
386
+ {
387
+ "cell_type": "markdown",
388
+ "metadata": {
389
+ "id": "kQC7Q5kg95xq"
390
+ },
391
+ "source": [
392
+ "Show memory stats before training"
393
+ ]
394
+ },
395
+ {
396
+ "cell_type": "code",
397
+ "execution_count": null,
398
+ "metadata": {
399
+ "id": "OJdVlC_mgFa_"
400
+ },
401
+ "outputs": [],
402
+ "source": [
403
+ "gpu_stats = torch.cuda.get_device_properties(0)\n",
404
+ "start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
405
+ "max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)\n",
406
+ "\n",
407
+ "print(f\"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.\")\n",
408
+ "print(f\"{start_gpu_memory} GB of memory reserved.\")"
409
+ ]
410
+ },
411
+ {
412
+ "cell_type": "markdown",
413
+ "metadata": {
414
+ "id": "YazYtLAe97Dc"
415
+ },
416
+ "source": [
417
+ "And train!"
418
+ ]
419
+ },
420
+ {
421
+ "cell_type": "code",
422
+ "execution_count": null,
423
+ "metadata": {
424
+ "id": "Mtv8s7rBgFa_"
425
+ },
426
+ "outputs": [],
427
+ "source": [
428
+ "trainer_stats = trainer.train()"
429
+ ]
430
+ },
431
+ {
432
+ "cell_type": "markdown",
433
+ "metadata": {
434
+ "id": "SmcYN5yW99IP"
435
+ },
436
+ "source": [
437
+ "Show memory stats after training"
438
+ ]
439
+ },
440
+ {
441
+ "cell_type": "code",
442
+ "execution_count": null,
443
+ "metadata": {
444
+ "id": "-ROfX8e9gFa_"
445
+ },
446
+ "outputs": [],
447
+ "source": [
448
+ "used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
449
+ "used_memory_for_lora = round(used_memory - start_gpu_memory, 3)\n",
450
+ "used_percentage = round(used_memory / max_memory * 100, 3)\n",
451
+ "lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)\n",
452
+ "\n",
453
+ "print(f\"{trainer_stats.metrics['train_runtime']} seconds used for training.\")\n",
454
+ "print(f\"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\")\n",
455
+ "print(f\"Peak reserved memory = {used_memory} GB.\")\n",
456
+ "print(f\"Peak reserved memory for training = {used_memory_for_lora} GB.\")\n",
457
+ "print(f\"Peak reserved memory % of max memory = {used_percentage} %.\")\n",
458
+ "print(f\"Peak reserved memory for training % of max memory = {lora_percentage} %.\")"
459
+ ]
460
+ },
461
+ {
462
+ "cell_type": "markdown",
463
+ "metadata": {
464
+ "id": "saarW87Y9_-R"
465
+ },
466
+ "source": [
467
+ "## Saving fine tuned model\n",
468
+ "\n",
469
+ "In this step, we save the fine-tuned model both **locally** and to the **Hugging Face Hub** using the credentials from your account."
470
+ ]
471
+ },
472
+ {
473
+ "cell_type": "code",
474
+ "execution_count": null,
475
+ "metadata": {
476
+ "id": "09zYXJ3GgFa_"
477
+ },
478
+ "outputs": [],
479
+ "source": [
480
+ "trainer.save_model(output_dir)\n",
481
+ "trainer.push_to_hub(dataset_name=dataset_id)"
482
+ ]
483
+ },
484
+ {
485
+ "cell_type": "markdown",
486
+ "metadata": {
487
+ "id": "nfqvO0qw-OvS"
488
+ },
489
+ "source": [
490
+ "## Load the fine-tuned model and run inference\n",
491
+ "\n",
492
+ "Now, let's test our fine-tuned model by loading the **LoRA/QLoRA adapter** and performing **inference**. We'll start by loading the **base model**, then attach the adapter to it, creating the final fine-tuned model ready for evaluation."
493
+ ]
494
+ },
495
+ {
496
+ "cell_type": "code",
497
+ "execution_count": null,
498
+ "metadata": {
499
+ "id": "9Yk9RAABgFa_"
500
+ },
501
+ "outputs": [],
502
+ "source": [
503
+ "output_dir = 'sergiopaniego/EssentialAI-rnj-1-instruct-trl-grpo'\n",
504
+ "model_name = \"EssentialAI/rnj-1-instruct\""
505
+ ]
506
+ },
507
+ {
508
+ "cell_type": "code",
509
+ "execution_count": null,
510
+ "metadata": {
511
+ "id": "CdzlQcCAgFa_"
512
+ },
513
+ "outputs": [],
514
+ "source": [
515
+ "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
516
+ "from peft import PeftModel\n",
517
+ "\n",
518
+ "base_model = model_name\n",
519
+ "adapter_model = f\"{output_dir}\" # Replace with your HF username or organization\n",
520
+ "\n",
521
+ "model = AutoModelForCausalLM.from_pretrained(base_model, dtype=\"float32\", device_map=\"auto\")\n",
522
+ "model = PeftModel.from_pretrained(model, adapter_model)\n",
523
+ "\n",
524
+ "tokenizer = AutoTokenizer.from_pretrained(base_model)"
525
+ ]
526
+ },
527
+ {
528
+ "cell_type": "code",
529
+ "execution_count": null,
530
+ "metadata": {
531
+ "id": "LZgjlAu-gFa_"
532
+ },
533
+ "outputs": [],
534
+ "source": [
535
+ "train_dataset[0]"
536
+ ]
537
+ },
538
+ {
539
+ "cell_type": "code",
540
+ "execution_count": null,
541
+ "metadata": {
542
+ "id": "gjY6TqQHgFa_"
543
+ },
544
+ "outputs": [],
545
+ "source": [
546
+ "from datasets import load_dataset\n",
547
+ "\n",
548
+ "dataset_id = 'AI-MO/NuminaMath-TIR'\n",
549
+ "train_dataset = load_dataset(dataset_id, split='train[:5%]')\n",
550
+ "\n",
551
+ "problem = train_dataset[0]['problem']\n",
552
+ "\n",
553
+ "messages = [\n",
554
+ " {\n",
555
+ " \"role\": \"system\", \"content\": [\n",
556
+ " {\"type\": \"text\", \"text\": SYSTEM_PROMPT}\n",
557
+ " ]\n",
558
+ " },\n",
559
+ " {\n",
560
+ " \"role\": \"user\",\n",
561
+ " \"content\": [\n",
562
+ " {\"type\": \"text\", \"text\": problem},\n",
563
+ " ],\n",
564
+ " },\n",
565
+ "]"
566
+ ]
567
+ },
568
+ {
569
+ "cell_type": "code",
570
+ "execution_count": null,
571
+ "metadata": {
572
+ "id": "eaVubGYmgFa_"
573
+ },
574
+ "outputs": [],
575
+ "source": [
576
+ "messages"
577
+ ]
578
+ },
579
+ {
580
+ "cell_type": "code",
581
+ "execution_count": null,
582
+ "metadata": {
583
+ "id": "2M6Xh4JMgFa_"
584
+ },
585
+ "outputs": [],
586
+ "source": [
587
+ "input_ids = tokenizer.apply_chat_template(\n",
588
+ " messages,\n",
589
+ " add_generation_prompt=True,\n",
590
+ " return_tensors=\"pt\",\n",
591
+ " return_dict=False,\n",
592
+ ").to(model.device)\n",
593
+ "\n",
594
+ "# --- Generate Prediction --- #\n",
595
+ "print(\"Generating prediction...\")\n",
596
+ "output_ids = model.generate(\n",
597
+ " input_ids,\n",
598
+ " max_new_tokens=50,\n",
599
+ " pad_token_id=tokenizer.eos_token_id,\n",
600
+ " do_sample=True,\n",
601
+ " temperature=0.2,\n",
602
+ " top_p=0.95\n",
603
+ ")\n",
604
+ "\n",
605
+ "response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)\n",
606
+ "print(response)"
607
+ ]
608
+ }
609
+ ],
610
+ "metadata": {
611
+ "accelerator": "GPU",
612
+ "colab": {
613
+ "gpuType": "A100",
614
+ "provenance": []
615
+ },
616
+ "language_info": {
617
+ "name": "python"
618
+ }
619
+ },
620
+ "nbformat": 4,
621
+ "nbformat_minor": 0
622
+ }
ICL/RL/trl_source/examples/notebooks/grpo_trl_lora_qlora.ipynb ADDED
@@ -0,0 +1,1638 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "27ozP4Uy-Cz2"
7
+ },
8
+ "source": [
9
+ "# Group Relative Policy Optimization (GRPO) with LoRA/QLoRA using TRL — on a Free Colab Notebook\n",
10
+ "\n",
11
+ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/grpo_trl_lora_qlora.ipynb)"
12
+ ]
13
+ },
14
+ {
15
+ "cell_type": "markdown",
16
+ "metadata": {
17
+ "id": "eOjY4AR1-QnF"
18
+ },
19
+ "source": [
20
+ "![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)\n",
21
+ "\n",
22
+ "Easily fine-tune **Large Language Models (LLMs)** or **Vision-Language Models (VLMs)** with **LoRA** or **QLoRA** using the [**Transformers Reinforcement Learning (TRL)**](https://github.com/huggingface/trl) library by Hugging Face and Group Relative Policy Optimization (GRPO) — all within a **free Google Colab notebook** powered by a **T4 GPU**.\n",
23
+ "\n",
24
+ "Thanks to the **built-in memory and training optimizations in TRL**, including LoRA, quantization, gradient checkpointing, and optimized attention kernels, it is possible to **fine-tune a 7B model on a free T4** with a **~7× reduction in memory consumption** compared to naive FP16 training.\n",
25
+ "\n",
26
+ "- [TRL GitHub Repository](https://github.com/huggingface/trl) — star us to support the project! \n",
27
+ "- [Official TRL Examples](https://huggingface.co/docs/trl/example_overview) \n",
28
+ "- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)"
29
+ ]
30
+ },
31
+ {
32
+ "cell_type": "markdown",
33
+ "metadata": {
34
+ "id": "w2TnJ6ta-2zj"
35
+ },
36
+ "source": [
37
+ "## Key concepts\n",
38
+ "\n",
39
+ "- **GRPO**: A reinforcement learning algorithm that optimizes a policy by comparing multiple generated responses for the same prompt and updating the model based on their relative rewards, without requiring a separate value model.\n",
40
+ "- **LoRA**: Updates only a few low-rank parameters, reducing training cost and memory.\n",
41
+ "- **QLoRA**: A quantized version of LoRA that enables even larger models to fit on small GPUs.\n",
42
+ "- **TRL**: The Hugging Face library that makes fine-tuning and reinforcement learning simple and efficient.\n",
43
+ "\n",
44
+ "Learn how to perform **GRPO (Group Relative Policy Optimization)** with **LoRA/QLoRA** using **TRL**."
45
+ ]
46
+ },
47
+ {
48
+ "cell_type": "markdown",
49
+ "metadata": {
50
+ "id": "EzScUBxoT4Nt"
51
+ },
52
+ "source": [
53
+ "This table demonstrates how **progressively enabling efficiency techniques** affects **memory usage** and **training throughput** across different hardware configurations. \n",
54
+ "The techniques range from naive FP16 training to **LoRA, quantization, Liger kernels, paged_adamw_8bit, and gradient checkpointing**.\n",
55
+ "\n",
56
+ "| Configuration | LoRA | Quant | Liger | Optimizer | Grad. Ckpt | attn_impl | VRAM (T4) GB | VRAM (A100-40GB)| VRAM (A100-80GB) | Tokens/s (T4) | Tokens/s (A100-40GB) | Tokens/s (A100-80GB) | Status (T4) |\n",
57
+ "|--------------|------|-------|-------|-----------|------------|-----------|---------------|----------------|---------|---------|---------------|------------------|-------------|\n",
58
+ "| **Worst (naive FP16)** | ❌ | ❌ | ❌ | AdamW | ❌ | eager | OOM | OOM | 62 GB | - | - | 0.06 it/s | ❌ |\n",
59
+ "| **Best (all optimizations)** | ✅ | ✅ | ✅ | paged_adamw_8bit | ✅ | sdpa | 9.2 GB | 9.6 GB | 9.6 GB | 0.01 it/s | 0.03 it/s | 0.04 it/s | ✅ |\n",
60
+ "\n",
61
+ "With all efficiency techniques enabled, **memory usage on Colab T4 is reduced by ~7×**, making it possible to **fine-tune a 7B model on free Colab** where naive FP16 training would fail.\n",
62
+ "\n",
63
+ "> A small trade-off in training speed is observed, but the **VRAM reduction is the key enabler**. For faster training on compatible hardware, **vLLM** can also be leveraged.\n",
64
+ "\n",
65
+ "> 💡 Note: For a fair comparison, the number of generations and the batch size were not changed."
66
+ ]
67
+ },
68
+ {
69
+ "cell_type": "markdown",
70
+ "metadata": {
71
+ "id": "9RFq6Op7rjc3"
72
+ },
73
+ "source": [
74
+ "## Install dependencies\n",
75
+ "\n",
76
+ "We'll install **TRL** with the **PEFT** extra, which ensures all main dependencies such as **Transformers** and **PEFT** (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install **trackio** to log and monitor our experiments, **bitsandbytes** to enable quantization of LLMs, reducing memory consumption for both inference and training, and **liger-kernel** for more efficient training."
77
+ ]
78
+ },
79
+ {
80
+ "cell_type": "code",
81
+ "execution_count": null,
82
+ "metadata": {
83
+ "id": "c2jy45nfWbdo"
84
+ },
85
+ "outputs": [],
86
+ "source": [
87
+ "!pip install -Uq \"trl[peft]\" bitsandbytes trackio math_verify liger-kernel"
88
+ ]
89
+ },
90
+ {
91
+ "cell_type": "markdown",
92
+ "metadata": {
93
+ "id": "B33zJG_Q_qb3"
94
+ },
95
+ "source": [
96
+ "### Log in to Hugging Face\n",
97
+ "\n",
98
+ "Log in to your **Hugging Face** account to save your fine-tuned model, track your experiment results directly on the Hub or access gated models. You can find your **access token** on your [account settings page](https://huggingface.co/settings/tokens)."
99
+ ]
100
+ },
101
+ {
102
+ "cell_type": "code",
103
+ "execution_count": null,
104
+ "metadata": {
105
+ "colab": {
106
+ "referenced_widgets": [
107
+ "eec717d21e734c4da066763b4a6add7e"
108
+ ]
109
+ },
110
+ "id": "8zqnTyUDWbdo",
111
+ "outputId": "62d71aaf-352b-4736-acb9-189d78654718"
112
+ },
113
+ "outputs": [],
114
+ "source": [
115
+ "from huggingface_hub import notebook_login\n",
116
+ "\n",
117
+ "notebook_login()"
118
+ ]
119
+ },
120
+ {
121
+ "cell_type": "markdown",
122
+ "metadata": {
123
+ "id": "cTEw4xlFrhnQ"
124
+ },
125
+ "source": [
126
+ "## Load Dataset\n",
127
+ "\n",
128
+ "In this step, we load the [**AI-MO/NuminaMath-TIR**](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset from the Hugging Face Hub using the `datasets` library.\n",
129
+ "This dataset focuses on **mathematical reasoning**, featuring problems that require step-by-step logical solutions.\n",
130
+ "By fine-tuning a model that does not yet exhibit strong reasoning capabilities, it can learn to **generate structured reasoning steps**, enhancing both the model's **accuracy** and **interpretability** on math-related tasks.\n",
131
+ "\n",
132
+ "For efficiency, we'll load only a **small portion of the training split**:"
133
+ ]
134
+ },
135
+ {
136
+ "cell_type": "code",
137
+ "execution_count": null,
138
+ "metadata": {
139
+ "id": "zU5icx67Wbdp",
140
+ "outputId": "6480b287-dc0e-4e79-feda-f5e4f41d2a82"
141
+ },
142
+ "outputs": [],
143
+ "source": [
144
+ "from datasets import load_dataset\n",
145
+ "\n",
146
+ "dataset_name = 'AI-MO/NuminaMath-TIR'\n",
147
+ "train_dataset = load_dataset(dataset_name, split='train[:5%]')"
148
+ ]
149
+ },
150
+ {
151
+ "cell_type": "markdown",
152
+ "metadata": {
153
+ "id": "P1AIokQrBEGw"
154
+ },
155
+ "source": [
156
+ "Let's check the structure of the dataset"
157
+ ]
158
+ },
159
+ {
160
+ "cell_type": "code",
161
+ "execution_count": null,
162
+ "metadata": {
163
+ "id": "ff6Gx1TWWbdp",
164
+ "outputId": "30d49bed-273a-47d9-d131-a677ca5a8b65"
165
+ },
166
+ "outputs": [
167
+ {
168
+ "name": "stdout",
169
+ "output_type": "stream",
170
+ "text": [
171
+ "Dataset({\n",
172
+ " features: ['problem', 'solution', 'messages'],\n",
173
+ " num_rows: 3622\n",
174
+ "})\n"
175
+ ]
176
+ }
177
+ ],
178
+ "source": [
179
+ "print(train_dataset)"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "markdown",
184
+ "metadata": {
185
+ "id": "QY5hkOqDBGns"
186
+ },
187
+ "source": [
188
+ "Let's check one sample:"
189
+ ]
190
+ },
191
+ {
192
+ "cell_type": "code",
193
+ "execution_count": null,
194
+ "metadata": {
195
+ "id": "-y9c7i29Wbdp",
196
+ "outputId": "760662ea-4db4-4b8e-c234-92ae2c8ecc17"
197
+ },
198
+ "outputs": [
199
+ {
200
+ "name": "stdout",
201
+ "output_type": "stream",
202
+ "text": [
203
+ "{'problem': 'What is the coefficient of $x^2y^6$ in the expansion of $\\\\left(\\\\frac{3}{5}x-\\\\frac{y}{2}\\\\right)^8$? Express your answer as a common fraction.', 'solution': \"To determine the coefficient of \\\\(x^2y^6\\\\) in the expansion of \\\\(\\\\left(\\\\frac{3}{5}x - \\\\frac{y}{2}\\\\right)^8\\\\), we can use the binomial theorem.\\n\\nThe binomial theorem states:\\n\\\\[\\n(a + b)^n = \\\\sum_{k=0}^{n} \\\\binom{n}{k} a^{n-k} b^k\\n\\\\]\\n\\nIn this case, \\\\(a = \\\\frac{3}{5}x\\\\), \\\\(b = -\\\\frac{y}{2}\\\\), and \\\\(n = 8\\\\).\\n\\nWe are interested in the term that contains \\\\(x^2y^6\\\\). In the general term of the binomial expansion:\\n\\\\[\\n\\\\binom{8}{k} \\\\left(\\\\frac{3}{5}x\\\\right)^{8-k} \\\\left(-\\\\frac{y}{2}\\\\right)^k\\n\\\\]\\n\\nTo get \\\\(x^2\\\\), we need \\\\(8 - k = 2\\\\), thus \\\\(k = 6\\\\).\\n\\nSubstituting \\\\(k = 6\\\\) into the expression:\\n\\\\[\\n\\\\binom{8}{6} \\\\left(\\\\frac{3}{5}x\\\\right)^{8-6} \\\\left(-\\\\frac{y}{2}\\\\right)^6 = \\\\binom{8}{6} \\\\left(\\\\frac{3}{5}x\\\\right)^2 \\\\left(-\\\\frac{y}{2}\\\\right)^6\\n\\\\]\\n\\nNow, we will compute each part of this expression.\\n\\n1. Calculate the binomial coefficient \\\\(\\\\binom{8}{6}\\\\).\\n2. Compute \\\\(\\\\left(\\\\frac{3}{5}\\\\right)^2\\\\).\\n3. Compute \\\\(\\\\left(-\\\\frac{y}{2}\\\\right)^6\\\\).\\n4. 
Combine everything together to get the coefficient of \\\\(x^2y^6\\\\).\\n\\nLet's compute these in Python.\\n```python\\nfrom math import comb\\n\\n# Given values\\nn = 8\\nk = 6\\n\\n# Calculate the binomial coefficient\\nbinom_coeff = comb(n, k)\\n\\n# Compute (3/5)^2\\na_term = (3/5)**2\\n\\n# Compute (-1/2)^6\\nb_term = (-1/2)**6\\n\\n# Combine terms to get the coefficient of x^2y^6\\ncoefficient = binom_coeff * a_term * b_term\\nprint(coefficient)\\n```\\n```output\\n0.1575\\n```\\nThe coefficient of \\\\(x^2y^6\\\\) in the expansion of \\\\(\\\\left(\\\\frac{3}{5}x - \\\\frac{y}{2}\\\\right)^8\\\\) is \\\\(0.1575\\\\). To express this as a common fraction, we recognize that:\\n\\n\\\\[ 0.1575 = \\\\frac{1575}{10000} = \\\\frac{63}{400} \\\\]\\n\\nThus, the coefficient can be expressed as:\\n\\n\\\\[\\n\\\\boxed{\\\\frac{63}{400}}\\n\\\\]\", 'messages': [{'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\\\left(\\\\frac{3}{5}x-\\\\frac{y}{2}\\\\right)^8$? Express your answer as a common fraction.', 'role': 'user'}, {'content': \"To determine the coefficient of \\\\(x^2y^6\\\\) in the expansion of \\\\(\\\\left(\\\\frac{3}{5}x - \\\\frac{y}{2}\\\\right)^8\\\\), we can use the binomial theorem.\\n\\nThe binomial theorem states:\\n\\\\[\\n(a + b)^n = \\\\sum_{k=0}^{n} \\\\binom{n}{k} a^{n-k} b^k\\n\\\\]\\n\\nIn this case, \\\\(a = \\\\frac{3}{5}x\\\\), \\\\(b = -\\\\frac{y}{2}\\\\), and \\\\(n = 8\\\\).\\n\\nWe are interested in the term that contains \\\\(x^2y^6\\\\). 
In the general term of the binomial expansion:\\n\\\\[\\n\\\\binom{8}{k} \\\\left(\\\\frac{3}{5}x\\\\right)^{8-k} \\\\left(-\\\\frac{y}{2}\\\\right)^k\\n\\\\]\\n\\nTo get \\\\(x^2\\\\), we need \\\\(8 - k = 2\\\\), thus \\\\(k = 6\\\\).\\n\\nSubstituting \\\\(k = 6\\\\) into the expression:\\n\\\\[\\n\\\\binom{8}{6} \\\\left(\\\\frac{3}{5}x\\\\right)^{8-6} \\\\left(-\\\\frac{y}{2}\\\\right)^6 = \\\\binom{8}{6} \\\\left(\\\\frac{3}{5}x\\\\right)^2 \\\\left(-\\\\frac{y}{2}\\\\right)^6\\n\\\\]\\n\\nNow, we will compute each part of this expression.\\n\\n1. Calculate the binomial coefficient \\\\(\\\\binom{8}{6}\\\\).\\n2. Compute \\\\(\\\\left(\\\\frac{3}{5}\\\\right)^2\\\\).\\n3. Compute \\\\(\\\\left(-\\\\frac{y}{2}\\\\right)^6\\\\).\\n4. Combine everything together to get the coefficient of \\\\(x^2y^6\\\\).\\n\\nLet's compute these in Python.\\n```python\\nfrom math import comb\\n\\n# Given values\\nn = 8\\nk = 6\\n\\n# Calculate the binomial coefficient\\nbinom_coeff = comb(n, k)\\n\\n# Compute (3/5)^2\\na_term = (3/5)**2\\n\\n# Compute (-1/2)^6\\nb_term = (-1/2)**6\\n\\n# Combine terms to get the coefficient of x^2y^6\\ncoefficient = binom_coeff * a_term * b_term\\nprint(coefficient)\\n```\\n```output\\n0.1575\\n```\\nThe coefficient of \\\\(x^2y^6\\\\) in the expansion of \\\\(\\\\left(\\\\frac{3}{5}x - \\\\frac{y}{2}\\\\right)^8\\\\) is \\\\(0.1575\\\\). To express this as a common fraction, we recognize that:\\n\\n\\\\[ 0.1575 = \\\\frac{1575}{10000} = \\\\frac{63}{400} \\\\]\\n\\nThus, the coefficient can be expressed as:\\n\\n\\\\[\\n\\\\boxed{\\\\frac{63}{400}}\\n\\\\]\", 'role': 'assistant'}]}\n"
204
+ ]
205
+ }
206
+ ],
207
+ "source": [
208
+ "print(train_dataset[0])"
209
+ ]
210
+ },
211
+ {
212
+ "cell_type": "markdown",
213
+ "metadata": {
214
+ "id": "DiqBlxK_A0SD"
215
+ },
216
+ "source": [
217
+ "We will adapt our dataset to a conversational format using a custom system prompt, guiding the LLM to generate both step-by-step reasoning and the final answer."
218
+ ]
219
+ },
220
+ {
221
+ "cell_type": "code",
222
+ "execution_count": null,
223
+ "metadata": {
224
+ "id": "RWxK5xFKWbdp"
225
+ },
226
+ "outputs": [],
227
+ "source": [
228
+ "SYSTEM_PROMPT = (\n",
229
+ " \"A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant \"\n",
230
+ " \"first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning \"\n",
231
+ " \"process is enclosed strictly within <think> and </think> tags. \"\n",
232
+ " \"After closing </think>, the assistant MUST provide the final answer in plain text.\"\n",
233
+ ")\n",
234
+ "\n",
235
+ "\n",
236
+ "def make_conversation(example):\n",
237
+ " return {\n",
238
+ " \"prompt\": [\n",
239
+ " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
240
+ " {\"role\": \"user\", \"content\": example[\"problem\"]},\n",
241
+ " ],\n",
242
+ " }\n",
243
+ "\n",
244
+ "train_dataset = train_dataset.map(make_conversation)"
245
+ ]
246
+ },
247
+ {
248
+ "cell_type": "markdown",
249
+ "metadata": {
250
+ "id": "sND566XAC0kD"
251
+ },
252
+ "source": [
253
+ "Let's take a look at an example:"
254
+ ]
255
+ },
256
+ {
257
+ "cell_type": "code",
258
+ "execution_count": null,
259
+ "metadata": {
260
+ "id": "Q-kHUmpMWbdp",
261
+ "outputId": "452beb3a-1091-46d4-997e-04b91562d66c"
262
+ },
263
+ "outputs": [
264
+ {
265
+ "name": "stdout",
266
+ "output_type": "stream",
267
+ "text": [
268
+ "[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process is enclosed strictly within <think> and </think> tags. After closing </think>, the assistant MUST provide the final answer in plain text.', 'role': 'system'}, {'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\\\left(\\\\frac{3}{5}x-\\\\frac{y}{2}\\\\right)^8$? Express your answer as a common fraction.', 'role': 'user'}]\n"
269
+ ]
270
+ }
271
+ ],
272
+ "source": [
273
+ "print(train_dataset[0]['prompt'])"
274
+ ]
275
+ },
276
+ {
277
+ "cell_type": "markdown",
278
+ "metadata": {
279
+ "id": "bw0qcp-CC3G0"
280
+ },
281
+ "source": [
282
+ "We'll remove the `messages` and `problem` columns, as we only need the custom `prompt` column and `solution` to verify the generated answer."
283
+ ]
284
+ },
285
+ {
286
+ "cell_type": "code",
287
+ "execution_count": null,
288
+ "metadata": {
289
+ "id": "SzbF3hdRWbdp",
290
+ "outputId": "bd59a383-1d4e-4020-c232-79ce66073fd1"
291
+ },
292
+ "outputs": [
293
+ {
294
+ "name": "stdout",
295
+ "output_type": "stream",
296
+ "text": [
297
+ "Dataset({\n",
298
+ " features: ['solution', 'prompt'],\n",
299
+ " num_rows: 3622\n",
300
+ "})\n"
301
+ ]
302
+ }
303
+ ],
304
+ "source": [
305
+ "train_dataset = train_dataset.remove_columns(['messages', 'problem'])\n",
306
+ "print(train_dataset)"
307
+ ]
308
+ },
309
+ {
310
+ "cell_type": "markdown",
311
+ "metadata": {
312
+ "id": "tvs5rjQBr7af"
313
+ },
314
+ "source": [
315
+ "## Load model and configure LoRA/QLoRA\n",
316
+ "\n",
317
+ "Below, choose your **preferred model**. All of the options have been tested on **free Colab instances**.\n",
318
+ "\n",
319
+ "> 💡 Note: Some models, such as Qwen2.5 and Qwen3, are known to have been pretrained on data that improves their math performance. Be cautious when selecting the appropriate model for training to ensure meaningful fine-tuning results ([source](https://thinkingmachines.ai/blog/lora/))."
320
+ ]
321
+ },
322
+ {
323
+ "cell_type": "code",
324
+ "execution_count": null,
325
+ "metadata": {
326
+ "id": "7_uaW3JfWbdp"
327
+ },
328
+ "outputs": [],
329
+ "source": [
330
+ "# Select one model below by uncommenting the line you want to use 👇\n",
331
+ "## Qwen\n",
332
+ "model_id, output_dir = \"Qwen/Qwen2-7B-Instruct\", \"t4-Qwen2-7B-Instruct-GRPO\" # ✅ ~9.2GB VRAM\n",
333
+ "# model_id, output_dir = \"unsloth/qwen3-14b-unsloth-bnb-4bit\", \"qwen3-14b-unsloth-bnb-4bit-GRPO\" # ⚠️ OOM with this config; fits if GRPO params are reduced\n",
334
+ "# model_id, output_dir = \"Qwen/Qwen3-8B\", \"Qwen3-8B-GRPO\" # ✅ ~9.9GB VRAM\n",
335
+ "# model_id, output_dir = \"Qwen/Qwen2.5-7B-Instruct\", \"Qwen2.5-7B-Instruct-GRPO\" # ✅ ~9.2GB VRAM\n",
336
+ "\n",
337
+ "## Llama\n",
338
+ "# model_id, output_dir = \"meta-llama/Llama-3.2-3B-Instruct\", \"Llama-3.2-3B-Instruct-GRPO\" # ✅ ~5.7GB VRAM\n",
339
+ "# model_id, output_dir = \"meta-llama/Llama-3.1-8B-Instruct\", \"Llama-3.1-8B-Instruct-GRPO\" # ✅ ~9.5GB VRAM\n",
340
+ "\n",
341
+ "## LFM2.5\n",
342
+ "# model_id, output_dir = \"LiquidAI/LFM2.5-1.2B-Instruct\", \"LFM2.5-1.2B-Instruct-GRPO\" # ✅ ~1.12 GB VRAM"
343
+ ]
344
+ },
345
+ {
346
+ "cell_type": "markdown",
347
+ "metadata": {
348
+ "id": "aw__94OWDnER"
349
+ },
350
+ "source": [
351
+ "This notebook can be used with two fine-tuning methods. By default, it is set up for **QLoRA**, which includes quantization using `BitsAndBytesConfig`. If you prefer to use standard **LoRA** without quantization, simply comment out the `BitsAndBytesConfig` configuration (training without quantization consumes more memory).\n",
352
+ "\n",
353
+ "Let's load the selected model using `transformers`, configuring QLoRA via `bitsandbytes` (you can remove it if doing LoRA). We don't need to configure the tokenizer since the trainer takes care of that automatically."
354
+ ]
355
+ },
356
+ {
357
+ "cell_type": "code",
358
+ "execution_count": null,
359
+ "metadata": {
360
+ "colab": {
361
+ "referenced_widgets": [
362
+ "1130e5a744864ca5b5873731e4764983"
363
+ ]
364
+ },
365
+ "id": "o86TnTchWbdp",
366
+ "outputId": "77a7e6c8-0360-40f1-eea7-b941be031366"
367
+ },
368
+ "outputs": [
369
+ {
370
+ "data": {
371
+ "application/vnd.jupyter.widget-view+json": {
372
+ "model_id": "1130e5a744864ca5b5873731e4764983",
373
+ "version_major": 2,
374
+ "version_minor": 0
375
+ },
376
+ "text/plain": [
377
+ "Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]"
378
+ ]
379
+ },
380
+ "metadata": {},
381
+ "output_type": "display_data"
382
+ }
383
+ ],
384
+ "source": [
385
+ "import torch\n",
386
+ "from transformers import AutoModelForCausalLM, BitsAndBytesConfig\n",
387
+ "\n",
388
+ "model = AutoModelForCausalLM.from_pretrained(\n",
389
+ " model_id,\n",
390
+ " attn_implementation=\"sdpa\", # Change to Flash Attention if GPU has support\n",
391
+ " dtype=\"float32\", # Change to bfloat16 if GPU has support\n",
392
+ " quantization_config=BitsAndBytesConfig(\n",
393
+ " load_in_4bit=True, # Load the model in 4-bit precision to save memory\n",
394
+ " bnb_4bit_compute_dtype=torch.float16, # Data type used for internal computations in quantization\n",
395
+ " bnb_4bit_use_double_quant=True, # Use double quantization to improve accuracy\n",
396
+ " bnb_4bit_quant_type=\"nf4\" # Type of quantization. \"nf4\" is recommended for recent LLMs\n",
397
+ " )\n",
398
+ ")"
399
+ ]
400
+ },
401
+ {
402
+ "cell_type": "markdown",
403
+ "metadata": {
404
+ "id": "AM-G0_QmDyZC"
405
+ },
406
+ "source": [
407
+ "The following cell defines LoRA (or QLoRA if needed). When training with LoRA/QLoRA, we use a **base model** (the one selected above) and, instead of modifying its original weights, we fine-tune a **LoRA adapter**, a lightweight layer that enables efficient and memory-friendly training. The **`target_modules`** specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning."
408
+ ]
409
+ },
410
+ {
411
+ "cell_type": "code",
412
+ "execution_count": null,
413
+ "metadata": {
414
+ "id": "WIz2pmX6Wbdp"
415
+ },
416
+ "outputs": [],
417
+ "source": [
418
+ "from peft import LoraConfig\n",
419
+ "\n",
420
+ "# You may need to update `target_modules` depending on the architecture of your chosen model.\n",
421
+ "# For example, different LLMs might have different attention/projection layer names.\n",
422
+ "peft_config = LoraConfig(\n",
423
+ " r=32,\n",
424
+ " lora_alpha=32,\n",
425
+ " target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\",],\n",
426
+ ")"
427
+ ]
428
+ },
429
+ {
430
+ "cell_type": "markdown",
431
+ "metadata": {
432
+ "id": "prKnAp-Esyiq"
433
+ },
434
+ "source": [
435
+ "## Train model\n",
436
+ "\n",
437
+ "GRPO requires **reward functions** to guide the learning process. For convenience, we can directly load pre-defined rewards from `trl.rewards`, which already includes a [collection of ready-to-use rewards](https://huggingface.co/docs/trl/rewards).\n",
438
+ "\n",
439
+ "If you want to create your own custom reward functions to teach the model, a reward function is simply a Python function that takes the generated completions and returns a list of floats. For example, the following function, which we use in this notebook, rewards completions that correctly follow the `<think>` format:\n",
440
+ "\n",
441
+ "```python\n",
442
+ "def think_format_reward(completions: list[list[dict[str, str]]], **kwargs) -> list[float]:\n",
443
+ " pattern = r\"^<think>(?!.*<think>)(.*?)</think>.*$\"\n",
444
+ " completion_contents = [completion[0][\"content\"] for completion in completions]\n",
445
+ " matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completion_contents]\n",
446
+ " return [1.0 if match else 0.0 for match in matches]\n",
447
+ "```\n",
448
+ "\n",
449
+ "In this notebook, we will use both `think_format_reward`, which rewards completions that correctly follow the `<think>` format, and `reasoning_accuracy_reward`, which evaluates the correctness of the model's solution to the mathematical problem. Together, these rewards guide the model to generate **structured reasoning** while producing **accurate answers**."
450
+ ]
451
+ },
452
+ {
453
+ "cell_type": "code",
454
+ "execution_count": null,
455
+ "metadata": {
456
+ "id": "lj42Qs5vWbdp"
457
+ },
458
+ "outputs": [],
459
+ "source": [
460
+ "from trl.rewards import think_format_reward, reasoning_accuracy_reward"
461
+ ]
462
+ },
463
+ {
464
+ "cell_type": "markdown",
465
+ "metadata": {
466
+ "id": "bFgYgxMbtbEZ"
467
+ },
468
+ "source": [
469
+ "We'll configure **GRPO** using `GRPOConfig`, keeping the parameters minimal so that the training can run on a free Colab instance. You can adjust these settings if you have access to more resources. For a complete list of available parameters and their descriptions, refer to the [TRL GRPOConfig documentation](https://huggingface.co/docs/trl/grpo_trainer#trl.GRPOConfig).\n",
470
+ "\n",
471
+ "> 💡 Note: TRL supports using **vLLM** for generation during GRPO training, which can significantly speed up training. However, it increases VRAM usage since a separate vLLM process is active to handle generation. In this notebook, we do not enable vLLM because we are using **QLoRA**, which updates the quantized vLLM model weights at every step. Enabling vLLM in this setup can cause weight precision issues and make convergence more challenging. The configuration includes the vLLM parameters in case you want to experiment with it. Learn more about vLLM integration in TRL [here](https://huggingface.co/docs/trl/main/en/vllm_integration)."
472
+ ]
473
+ },
474
+ {
475
+ "cell_type": "code",
476
+ "execution_count": null,
477
+ "metadata": {
478
+ "id": "JY11EQMhWbdp"
479
+ },
480
+ "outputs": [],
481
+ "source": [
482
+ "from trl import GRPOConfig\n",
483
+ "\n",
484
+ "# Configure training arguments using GRPOConfig\n",
485
+ "training_args = GRPOConfig(\n",
486
+ " # Training schedule / optimization\n",
487
+ " learning_rate=2e-5, # Learning rate for the optimizer\n",
488
+ " #num_train_epochs=1,\n",
489
+ " max_steps=500, # Maximum number of training steps. For full epoch-based training, use `num_train_epochs` instead\n",
490
+ "\n",
491
+ " # Parameters that control GRPO training (you can adapt them)\n",
492
+ " per_device_train_batch_size = 8,\n",
493
+ " max_completion_length=256, # default: 256 # Max completion length produced during training\n",
494
+ " num_generations=8, # default: 8 # Number of generations produced during training for comparison\n",
495
+ "\n",
496
+ " # Optimizations\n",
497
+ " optim = \"paged_adamw_8bit\", # Optimizer\n",
498
+ " use_liger_kernel=True, # Enable Liger kernel optimizations for faster training\n",
499
+ "\n",
500
+ " # Parameters related to reporting and saving\n",
501
+ " output_dir=output_dir, # Where to save model checkpoints and logs\n",
502
+ " logging_steps=10, # Log training metrics every N steps\n",
503
+ " report_to=\"trackio\", # Experiment tracking tool\n",
504
+ " trackio_space_id=output_dir, # HF Space where the experiment tracking will be saved\n",
505
+ " log_completions=False, # Return model completions during training\n",
506
+ "\n",
507
+ " # Hub integration\n",
508
+ " push_to_hub=True, # Automatically push the trained model to the Hugging Face Hub\n",
509
+ " # The model will be saved under your Hub account in the repository named `output_dir`\n",
510
+ " # vLLM params\n",
511
+ " #use_vllm=False, # Activate vLLM training for faster training\n",
512
+ " #vllm_mode='colocate',\n",
513
+ " #vllm_gpu_memory_utilization=0.1,\n",
514
+ " #vllm_enable_sleep_mode=True\n",
515
+ ")"
516
+ ]
517
+ },
518
+ {
519
+ "cell_type": "markdown",
520
+ "metadata": {
521
+ "id": "-9LlOAvWFSor"
522
+ },
523
+ "source": [
524
+ "Configure the `GRPOTrainer` by passing the previously defined `training_args`. To keep memory usage low, we are not using an evaluation dataset, but you can include one if desired. We also provide the reward functions that were imported earlier to guide the training process."
525
+ ]
526
+ },
527
+ {
528
+ "cell_type": "code",
529
+ "execution_count": null,
530
+ "metadata": {
531
+ "id": "iI_E9KCUWbdq"
532
+ },
533
+ "outputs": [],
534
+ "source": [
535
+ "from trl import GRPOTrainer\n",
536
+ "\n",
537
+ "trainer = GRPOTrainer(\n",
538
+ " model=model,\n",
539
+ " reward_funcs=[think_format_reward, reasoning_accuracy_reward],\n",
540
+ " args=training_args,\n",
541
+ " train_dataset=train_dataset,\n",
542
+ " peft_config=peft_config,\n",
543
+ ")"
544
+ ]
545
+ },
546
+ {
547
+ "cell_type": "markdown",
548
+ "metadata": {
549
+ "id": "8dY7bK8FGLhh"
550
+ },
551
+ "source": [
552
+ "Show memory stats before training"
553
+ ]
554
+ },
555
+ {
556
+ "cell_type": "code",
557
+ "execution_count": null,
558
+ "metadata": {
559
+ "id": "PEVRGlrAWbdq",
560
+ "outputId": "78fac9e4-4ae6-4836-bd10-c30b39059782"
561
+ },
562
+ "outputs": [
563
+ {
564
+ "name": "stdout",
565
+ "output_type": "stream",
566
+ "text": [
567
+ "GPU = Tesla T4. Max memory = 14.741 GB.\n",
568
+ "6.773 GB of memory reserved.\n"
569
+ ]
570
+ }
571
+ ],
572
+ "source": [
573
+ "gpu_stats = torch.cuda.get_device_properties(0)\n",
574
+ "start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
575
+ "max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)\n",
576
+ "\n",
577
+ "print(f\"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.\")\n",
578
+ "print(f\"{start_gpu_memory} GB of memory reserved.\")"
579
+ ]
580
+ },
581
+ {
582
+ "cell_type": "markdown",
583
+ "metadata": {
584
+ "id": "z-5xPtfIGQL5"
585
+ },
586
+ "source": [
587
+ "And train!"
588
+ ]
589
+ },
590
+ {
591
+ "cell_type": "markdown",
592
+ "metadata": {},
593
+ "source": [
594
+ "Training on a T4 in Colab with the configuration defined in this notebook takes around 13 hours. If you're just experimenting, you can try the following quicker task ([source](https://huggingface.co/learn/llm-course/en/chapter12/5)):\n",
595
+ "\n",
596
+ "```python\n",
597
+ "dataset = load_dataset(\"mlabonne/smoltldr\")\n",
598
+ "\n",
599
+ "# Reward function\n",
600
+ "ideal_length = 50\n",
601
+ "\n",
602
+ "def reward_len(completions, **kwargs):\n",
603
+ " return [-abs(ideal_length - len(completion)) for completion in completions]\n",
604
+ "```"
605
+ ]
606
+ },
607
+ {
608
+ "cell_type": "code",
609
+ "execution_count": null,
610
+ "metadata": {
611
+ "id": "zl7-PmoXWbdq",
612
+ "outputId": "f39c8c3c-43c2-4f2d-c98d-4c595ae1129f"
613
+ },
614
+ "outputs": [
615
+ {
616
+ "name": "stderr",
617
+ "output_type": "stream",
618
+ "text": [
619
+ "The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.\n"
620
+ ]
621
+ },
622
+ {
623
+ "name": "stdout",
624
+ "output_type": "stream",
625
+ "text": [
626
+ "* Trackio project initialized: huggingface\n",
627
+ "* Trackio metrics will be synced to Hugging Face Dataset: sergiopaniego/t4-Qwen2-7B-Instruct-GRPO-dataset\n",
628
+ "* Creating new space: https://huggingface.co/spaces/sergiopaniego/t4-Qwen2-7B-Instruct-GRPO\n",
629
+ "* View dashboard by going to: https://sergiopaniego-t4-Qwen2-7B-Instruct-GRPO.hf.space/\n"
630
+ ]
631
+ },
632
+ {
633
+ "data": {
634
+ "text/html": [
635
+ "<div><iframe src=\"https://sergiopaniego-t4-Qwen2-7B-Instruct-GRPO.hf.space/\" width=\"100%\" height=\"1000px\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
636
+ ],
637
+ "text/plain": [
638
+ "<IPython.core.display.HTML object>"
639
+ ]
640
+ },
641
+ "metadata": {},
642
+ "output_type": "display_data"
643
+ },
644
+ {
645
+ "name": "stdout",
646
+ "output_type": "stream",
647
+ "text": [
648
+ "* Created new run: sergiopaniego-1766143600\n"
649
+ ]
650
+ },
651
+ {
652
+ "data": {
653
+ "text/html": [
654
+ "\n",
655
+ " <div>\n",
656
+ " \n",
657
+ " <progress value='500' max='500' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
658
+ " [500/500 13:05:04, Epoch 0/1]\n",
659
+ " </div>\n",
660
+ " <table border=\"1\" class=\"dataframe\">\n",
661
+ " <thead>\n",
662
+ " <tr style=\"text-align: left;\">\n",
663
+ " <th>Step</th>\n",
664
+ " <th>Training Loss</th>\n",
665
+ " </tr>\n",
666
+ " </thead>\n",
667
+ " <tbody>\n",
668
+ " <tr>\n",
669
+ " <td>10</td>\n",
670
+ " <td>0.027900</td>\n",
671
+ " </tr>\n",
672
+ " <tr>\n",
673
+ " <td>20</td>\n",
674
+ " <td>-0.011600</td>\n",
675
+ " </tr>\n",
676
+ " <tr>\n",
677
+ " <td>30</td>\n",
678
+ " <td>0.021500</td>\n",
679
+ " </tr>\n",
680
+ " <tr>\n",
681
+ " <td>40</td>\n",
682
+ " <td>0.033400</td>\n",
683
+ " </tr>\n",
684
+ " <tr>\n",
685
+ " <td>50</td>\n",
686
+ " <td>0.039400</td>\n",
687
+ " </tr>\n",
688
+ " <tr>\n",
689
+ " <td>60</td>\n",
690
+ " <td>0.010300</td>\n",
691
+ " </tr>\n",
692
+ " <tr>\n",
693
+ " <td>70</td>\n",
694
+ " <td>0.048200</td>\n",
695
+ " </tr>\n",
696
+ " <tr>\n",
697
+ " <td>80</td>\n",
698
+ " <td>0.067300</td>\n",
699
+ " </tr>\n",
700
+ " <tr>\n",
701
+ " <td>90</td>\n",
702
+ " <td>0.030600</td>\n",
703
+ " </tr>\n",
704
+ " <tr>\n",
705
+ " <td>100</td>\n",
706
+ " <td>0.064000</td>\n",
707
+ " </tr>\n",
708
+ " <tr>\n",
709
+ " <td>110</td>\n",
710
+ " <td>0.021500</td>\n",
711
+ " </tr>\n",
712
+ " <tr>\n",
713
+ " <td>120</td>\n",
714
+ " <td>0.021400</td>\n",
715
+ " </tr>\n",
716
+ " <tr>\n",
717
+ " <td>130</td>\n",
718
+ " <td>0.000000</td>\n",
719
+ " </tr>\n",
720
+ " <tr>\n",
721
+ " <td>140</td>\n",
722
+ " <td>-0.028500</td>\n",
723
+ " </tr>\n",
724
+ " <tr>\n",
725
+ " <td>150</td>\n",
726
+ " <td>-0.003100</td>\n",
727
+ " </tr>\n",
728
+ " <tr>\n",
729
+ " <td>160</td>\n",
730
+ " <td>0.017300</td>\n",
731
+ " </tr>\n",
732
+ " <tr>\n",
733
+ " <td>170</td>\n",
734
+ " <td>-0.024700</td>\n",
735
+ " </tr>\n",
736
+ " <tr>\n",
737
+ " <td>180</td>\n",
738
+ " <td>0.003300</td>\n",
739
+ " </tr>\n",
740
+ " <tr>\n",
741
+ " <td>190</td>\n",
742
+ " <td>0.000000</td>\n",
743
+ " </tr>\n",
744
+ " <tr>\n",
745
+ " <td>200</td>\n",
746
+ " <td>-0.001400</td>\n",
747
+ " </tr>\n",
748
+ " <tr>\n",
749
+ " <td>210</td>\n",
750
+ " <td>0.008000</td>\n",
751
+ " </tr>\n",
752
+ " <tr>\n",
753
+ " <td>220</td>\n",
754
+ " <td>0.034300</td>\n",
755
+ " </tr>\n",
756
+ " <tr>\n",
757
+ " <td>230</td>\n",
758
+ " <td>0.044600</td>\n",
759
+ " </tr>\n",
760
+ " <tr>\n",
761
+ " <td>240</td>\n",
762
+ " <td>0.016400</td>\n",
763
+ " </tr>\n",
764
+ " <tr>\n",
765
+ " <td>250</td>\n",
766
+ " <td>-0.015200</td>\n",
767
+ " </tr>\n",
768
+ " <tr>\n",
769
+ " <td>260</td>\n",
770
+ " <td>0.016800</td>\n",
771
+ " </tr>\n",
772
+ " <tr>\n",
773
+ " <td>270</td>\n",
774
+ " <td>0.042900</td>\n",
775
+ " </tr>\n",
776
+ " <tr>\n",
777
+ " <td>280</td>\n",
778
+ " <td>0.031300</td>\n",
779
+ " </tr>\n",
780
+ " <tr>\n",
781
+ " <td>290</td>\n",
782
+ " <td>0.006200</td>\n",
783
+ " </tr>\n",
784
+ " <tr>\n",
785
+ " <td>300</td>\n",
786
+ " <td>0.043300</td>\n",
787
+ " </tr>\n",
788
+ " <tr>\n",
789
+ " <td>310</td>\n",
790
+ " <td>0.029700</td>\n",
791
+ " </tr>\n",
792
+ " <tr>\n",
793
+ " <td>320</td>\n",
794
+ " <td>0.001100</td>\n",
795
+ " </tr>\n",
796
+ " <tr>\n",
797
+ " <td>330</td>\n",
798
+ " <td>0.027000</td>\n",
799
+ " </tr>\n",
800
+ " <tr>\n",
801
+ " <td>340</td>\n",
802
+ " <td>-0.006700</td>\n",
803
+ " </tr>\n",
804
+ " <tr>\n",
805
+ " <td>350</td>\n",
806
+ " <td>0.027200</td>\n",
807
+ " </tr>\n",
808
+ " <tr>\n",
809
+ " <td>360</td>\n",
810
+ " <td>0.008200</td>\n",
811
+ " </tr>\n",
812
+ " <tr>\n",
813
+ " <td>370</td>\n",
814
+ " <td>-0.015800</td>\n",
815
+ " </tr>\n",
816
+ " <tr>\n",
817
+ " <td>380</td>\n",
818
+ " <td>0.007200</td>\n",
819
+ " </tr>\n",
820
+ " <tr>\n",
821
+ " <td>390</td>\n",
822
+ " <td>0.012100</td>\n",
823
+ " </tr>\n",
824
+ " <tr>\n",
825
+ " <td>400</td>\n",
826
+ " <td>0.000000</td>\n",
827
+ " </tr>\n",
828
+ " <tr>\n",
829
+ " <td>410</td>\n",
830
+ " <td>0.010500</td>\n",
831
+ " </tr>\n",
832
+ " <tr>\n",
833
+ " <td>420</td>\n",
834
+ " <td>0.019800</td>\n",
835
+ " </tr>\n",
836
+ " <tr>\n",
837
+ " <td>430</td>\n",
838
+ " <td>0.000800</td>\n",
839
+ " </tr>\n",
840
+ " <tr>\n",
841
+ " <td>440</td>\n",
842
+ " <td>0.003400</td>\n",
843
+ " </tr>\n",
844
+ " <tr>\n",
845
+ " <td>450</td>\n",
846
+ " <td>-0.007900</td>\n",
847
+ " </tr>\n",
848
+ " <tr>\n",
849
+ " <td>460</td>\n",
850
+ " <td>-0.011800</td>\n",
851
+ " </tr>\n",
852
+ " <tr>\n",
853
+ " <td>470</td>\n",
854
+ " <td>-0.016300</td>\n",
855
+ " </tr>\n",
856
+ " <tr>\n",
857
+ " <td>480</td>\n",
858
+ " <td>-0.002300</td>\n",
859
+ " </tr>\n",
860
+ " <tr>\n",
861
+ " <td>490</td>\n",
862
+ " <td>-0.005500</td>\n",
863
+ " </tr>\n",
864
+ " <tr>\n",
865
+ " <td>500</td>\n",
866
+ " <td>0.038000</td>\n",
867
+ " </tr>\n",
868
+ " </tbody>\n",
869
+ "</table><p>"
870
+ ],
871
+ "text/plain": [
872
+ "<IPython.core.display.HTML object>"
873
+ ]
874
+ },
875
+ "metadata": {},
876
+ "output_type": "display_data"
877
+ },
878
+ {
879
+ "name": "stdout",
880
+ "output_type": "stream",
881
+ "text": [
882
+ "* Run finished. Uploading logs to Trackio (please wait...)\n"
883
+ ]
884
+ }
885
+ ],
886
+ "source": [
887
+ "trainer_stats = trainer.train()"
888
+ ]
889
+ },
890
+ {
891
+ "cell_type": "markdown",
892
+ "metadata": {
893
+ "id": "iqAN-XLCGTGW"
894
+ },
895
+ "source": [
896
+ "Show memory stats after training"
897
+ ]
898
+ },
899
+ {
900
+ "cell_type": "code",
901
+ "execution_count": null,
902
+ "metadata": {
903
+ "id": "4BeEwp5EWbds",
904
+ "outputId": "668b8a2c-2eef-4e34-8d4a-2a43ccbbdc00"
905
+ },
906
+ "outputs": [
907
+ {
908
+ "name": "stdout",
909
+ "output_type": "stream",
910
+ "text": [
911
+ "47228.679 seconds used for training.\n",
912
+ "787.14 minutes used for training.\n",
913
+ "Peak reserved memory = 8.832 GB.\n",
914
+ "Peak reserved memory for training = 2.059 GB.\n",
915
+ "Peak reserved memory % of max memory = 59.915 %.\n",
916
+ "Peak reserved memory for training % of max memory = 13.968 %.\n"
917
+ ]
918
+ }
919
+ ],
920
+ "source": [
921
+ "used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
922
+ "used_memory_for_lora = round(used_memory - start_gpu_memory, 3)\n",
923
+ "used_percentage = round(used_memory / max_memory * 100, 3)\n",
924
+ "lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)\n",
925
+ "\n",
926
+ "print(f\"{trainer_stats.metrics['train_runtime']} seconds used for training.\")\n",
927
+ "print(f\"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\")\n",
928
+ "print(f\"Peak reserved memory = {used_memory} GB.\")\n",
929
+ "print(f\"Peak reserved memory for training = {used_memory_for_lora} GB.\")\n",
930
+ "print(f\"Peak reserved memory % of max memory = {used_percentage} %.\")\n",
931
+ "print(f\"Peak reserved memory for training % of max memory = {lora_percentage} %.\")"
932
+ ]
933
+ },
934
+ {
935
+ "cell_type": "markdown",
936
+ "metadata": {
937
+ "id": "R8Sd_AqILeYi"
938
+ },
939
+ "source": [
940
+ "The training procedure generates both standard training logs and **trackio** logs, which help us monitor the training progress. Example outputs would look like the following:"
941
+ ]
942
+ },
943
+ {
944
+ "cell_type": "markdown",
945
+ "metadata": {
946
+ "id": "2bPn6gruLf-n"
947
+ },
948
+ "source": [
949
+ "<img src=\"https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/grpo-qlora-notebook-trackio.png\" width=\"50%\">"
950
+ ]
951
+ },
952
+ {
953
+ "cell_type": "markdown",
954
+ "metadata": {
955
+ "id": "ibO4f7tuLboQ"
956
+ },
957
+ "source": [
958
+ "## Saving fine tuned model\n",
959
+ "\n",
960
+ "In this step, we save the fine-tuned model both **locally** and to the **Hugging Face Hub** using the credentials from your account."
961
+ ]
962
+ },
963
+ {
964
+ "cell_type": "code",
965
+ "execution_count": null,
966
+ "metadata": {
967
+ "colab": {
968
+ "referenced_widgets": [
969
+ "e6a3677667ce47bcba55e3e950e446f9",
970
+ "17adb84604d84cf688a89a21f6cc6150",
971
+ "a21c1bbd3cd04738a8c96fbfc0c016c6",
972
+ "65cadde3da7642188f029bb2aceaa7c6",
973
+ "0404b89e5ce24e76958c72bedc1a95cc",
974
+ "c52baf990fde40c0873747e827dc6926",
975
+ "191653e8ce184123a68f26fbf2b78745",
976
+ "0bb882d400864b249c80132264de2623",
977
+ "09cbfcf6e51c431798f4e392a81be6d3",
978
+ "d6521f73f23f42e18ee462a547f251a1"
979
+ ]
980
+ },
981
+ "id": "itpVDjy0Wbdt",
982
+ "outputId": "b821c7ed-6c9d-440a-a797-e25291627bef"
983
+ },
984
+ "outputs": [],
985
+ "source": [
986
+ "trainer.save_model(output_dir)\n",
987
+ "trainer.push_to_hub(dataset_name=dataset_name)"
988
+ ]
989
+ },
990
+ {
991
+ "cell_type": "markdown",
992
+ "metadata": {
993
+ "id": "81eBZe-X7daz"
994
+ },
995
+ "source": [
996
+ "## Load the fine-tuned model and run inference\n",
997
+ "\n",
998
+ "Now, let's test our fine-tuned model by loading the **LoRA/QLoRA adapter** and performing **inference**. We'll start by loading the **base model**, then attach the adapter to it, creating the final fine-tuned model ready for evaluation."
999
+ ]
1000
+ },
1001
+ {
1002
+ "cell_type": "code",
1003
+ "execution_count": null,
1004
+ "metadata": {
1005
+ "colab": {
1006
+ "referenced_widgets": [
1007
+ "1d3fbf86d53845beac599c5b231e87ea"
1008
+ ]
1009
+ },
1010
+ "id": "ZLdaWYzNWbdt",
1011
+ "outputId": "a103b64b-1f6b-4423-c5fd-402f210e6dc3"
1012
+ },
1013
+ "outputs": [
1014
+ {
1015
+ "data": {
1016
+ "application/vnd.jupyter.widget-view+json": {
1017
+ "model_id": "1d3fbf86d53845beac599c5b231e87ea",
1018
+ "version_major": 2,
1019
+ "version_minor": 0
1020
+ },
1021
+ "text/plain": [
1022
+ "Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]"
1023
+ ]
1024
+ },
1025
+ "metadata": {},
1026
+ "output_type": "display_data"
1027
+ }
1028
+ ],
1029
+ "source": [
1030
+ "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
1031
+ "from peft import PeftModel\n",
1032
+ "\n",
1033
+ "adapter_model = f\"sergiopaniego/{output_dir}\" # Replace with your HF username or organization\n",
1034
+ "\n",
1035
+ "base_model = AutoModelForCausalLM.from_pretrained(model_id, dtype=\"auto\", device_map=\"auto\")\n",
1036
+ "\n",
1037
+ "tokenizer = AutoTokenizer.from_pretrained(model_id)"
1038
+ ]
1039
+ },
1040
+ {
1041
+ "cell_type": "markdown",
1042
+ "metadata": {
1043
+ "id": "JvwM6ym-7nnt"
1044
+ },
1045
+ "source": [
1046
+ "Let's test with one example from the test set of the dataset"
1047
+ ]
1048
+ },
1049
+ {
1050
+ "cell_type": "code",
1051
+ "execution_count": null,
1052
+ "metadata": {
1053
+ "colab": {
1054
+ "referenced_widgets": [
1055
+ "74ca3f7b365640ba883a9a236700517e"
1056
+ ]
1057
+ },
1058
+ "id": "XjpojLV-Wbdt",
1059
+ "outputId": "bcc039de-72ae-4713-a1fb-c006163999e7"
1060
+ },
1061
+ "outputs": [
1062
+ {
1063
+ "data": {
1064
+ "application/vnd.jupyter.widget-view+json": {
1065
+ "model_id": "74ca3f7b365640ba883a9a236700517e",
1066
+ "version_major": 2,
1067
+ "version_minor": 0
1068
+ },
1069
+ "text/plain": [
1070
+ "Map: 0%| | 0/1 [00:00<?, ? examples/s]"
1071
+ ]
1072
+ },
1073
+ "metadata": {},
1074
+ "output_type": "display_data"
1075
+ },
1076
+ {
1077
+ "data": {
1078
+ "text/plain": [
1079
+ "[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process is enclosed strictly within <think> and </think> tags. After closing </think>, the assistant MUST provide the final answer in plain text.',\n",
1080
+ " 'role': 'system'},\n",
1081
+ " {'content': \"In 1988, a person's age was equal to the sum of the digits of their birth year. How old was this person?\",\n",
1082
+ " 'role': 'user'}]"
1083
+ ]
1084
+ },
1085
+ "execution_count": 5,
1086
+ "metadata": {},
1087
+ "output_type": "execute_result"
1088
+ }
1089
+ ],
1090
+ "source": [
1091
+ "from datasets import load_dataset\n",
1092
+ "\n",
1093
+ "dataset_name = 'AI-MO/NuminaMath-TIR'\n",
1094
+ "test_dataset = load_dataset(dataset_name, split='test[:1%]')\n",
1095
+ "test_dataset = test_dataset.map(make_conversation)\n",
1096
+ "test_dataset = test_dataset.remove_columns(['messages', 'problem'])\n",
1097
+ "test_dataset[0]['prompt']"
1098
+ ]
1099
+ },
1100
+ {
1101
+ "cell_type": "markdown",
1102
+ "metadata": {
1103
+ "id": "CxKyZwG28BYJ"
1104
+ },
1105
+ "source": [
1106
+ "Let's first check what's the output for the base model, without the adapter."
1107
+ ]
1108
+ },
1109
+ {
1110
+ "cell_type": "code",
1111
+ "execution_count": null,
1112
+ "metadata": {
1113
+ "id": "qTPJY96eWbdt",
1114
+ "outputId": "ed02acca-e856-44ec-fa20-c32efd81e018"
1115
+ },
1116
+ "outputs": [
1117
+ {
1118
+ "name": "stdout",
1119
+ "output_type": "stream",
1120
+ "text": [
1121
+ "To solve this problem, let's denote the birth year of the person as \\(Y\\) (where \\(Y\\) is a four-digit number) and their age in 1988 as \\(A\\). According to the given condition, their age in 1988 is equal to the sum of the digits of their birth year. \n",
1122
+ "\n",
1123
+ "Since we're looking at the year 1988, the person would be \\(1988 - Y\\) years old in that year. Given the condition:\n",
1124
+ "\n",
1125
+ "\\[1988 - Y = \\text{sum of the digits of } Y\\]\n",
1126
+ "\n",
1127
+ "Let's break down the possible range for \\(Y\\). Since the person's age must be less than or equal to 100 (as the sum of the digits of any four-digit number cannot exceed 36), \\(Y\\) must be between 1989 and 2088.\n",
1128
+ "\n",
1129
+ "We can systematically check each year in this range to find when the condition holds true. However, considering the constraint on age, we can narrow our search significantly. For example, if \\(Y\\) were 1990, the sum of its digits would be 18, which is not a reasonable age. We need\n"
1130
+ ]
1131
+ }
1132
+ ],
1133
+ "source": [
1134
+ "messages = test_dataset[0]['prompt']\n",
1135
+ "text = tokenizer.apply_chat_template(\n",
1136
+ " messages, add_generation_prompt=True, tokenize=False\n",
1137
+ ")\n",
1138
+ "model_inputs = tokenizer([text], return_tensors=\"pt\").to(base_model.device)\n",
1139
+ "\n",
1140
+ "generated_ids = base_model.generate(\n",
1141
+ " **model_inputs,\n",
1142
+ " max_new_tokens=256\n",
1143
+ ")\n",
1144
+ "output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]\n",
1145
+ "\n",
1146
+ "# Decode and extract model response\n",
1147
+ "generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)\n",
1148
+ "print(generated_text)"
1149
+ ]
1150
+ },
1151
+ {
1152
+ "cell_type": "markdown",
1153
+ "metadata": {
1154
+ "id": "V9eoUwQS8SIi"
1155
+ },
1156
+ "source": [
1157
+ "The base model neither produced reasoning traces nor provided a correct answer. Let's now load the fine-tuned model and check its performance."
1158
+ ]
1159
+ },
1160
+ {
1161
+ "cell_type": "code",
1162
+ "execution_count": null,
1163
+ "metadata": {
1164
+ "colab": {
1165
+ "referenced_widgets": [
1166
+ "073b351afd264bf0bf23043b37e0d8ce",
1167
+ "3dee429faf4e40b192cabebfe4bf2245"
1168
+ ]
1169
+ },
1170
+ "id": "CNannsXXWbdt",
1171
+ "outputId": "fc43a5b9-4ec6-43eb-fc34-f26e92434faf"
1172
+ },
1173
+ "outputs": [
1174
+ {
1175
+ "data": {
1176
+ "application/vnd.jupyter.widget-view+json": {
1177
+ "model_id": "073b351afd264bf0bf23043b37e0d8ce",
1178
+ "version_major": 2,
1179
+ "version_minor": 0
1180
+ },
1181
+ "text/plain": [
1182
+ "adapter_config.json: 0.00B [00:00, ?B/s]"
1183
+ ]
1184
+ },
1185
+ "metadata": {},
1186
+ "output_type": "display_data"
1187
+ },
1188
+ {
1189
+ "data": {
1190
+ "application/vnd.jupyter.widget-view+json": {
1191
+ "model_id": "3dee429faf4e40b192cabebfe4bf2245",
1192
+ "version_major": 2,
1193
+ "version_minor": 0
1194
+ },
1195
+ "text/plain": [
1196
+ "adapter_model.safetensors: 0%| | 0.00/162M [00:00<?, ?B/s]"
1197
+ ]
1198
+ },
1199
+ "metadata": {},
1200
+ "output_type": "display_data"
1201
+ }
1202
+ ],
1203
+ "source": [
1204
+ "fine_tuned_model = PeftModel.from_pretrained(base_model, adapter_model)"
1205
+ ]
1206
+ },
1207
+ {
1208
+ "cell_type": "code",
1209
+ "execution_count": null,
1210
+ "metadata": {
1211
+ "id": "3yOJ82F9Wbdt",
1212
+ "outputId": "f7b2d716-0ded-4ba4-9534-0481e81b4a15"
1213
+ },
1214
+ "outputs": [
1215
+ {
1216
+ "name": "stdout",
1217
+ "output_type": "stream",
1218
+ "text": [
1219
+ "<think> I need to find a birth year where the sum of its digits equals the person's age in 1988 </think>\n",
1220
+ "\n",
1221
+ "The person would have been born in 1979, since 1+9+7+9 = 26 and 26 is the age in 1988\n",
1222
+ "\n",
1223
+ "answer: 26\n"
1224
+ ]
1225
+ }
1226
+ ],
1227
+ "source": [
1228
+ "text = tokenizer.apply_chat_template(\n",
1229
+ " messages, add_generation_prompt=True, tokenize=False\n",
1230
+ ")\n",
1231
+ "model_inputs = tokenizer([text], return_tensors=\"pt\").to(fine_tuned_model.device)\n",
1232
+ "\n",
1233
+ "generated_ids = fine_tuned_model.generate(\n",
1234
+ " **model_inputs,\n",
1235
+ " max_new_tokens=256\n",
1236
+ ")\n",
1237
+ "output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]\n",
1238
+ "\n",
1239
+ "# Decode and extract model response\n",
1240
+ "generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)\n",
1241
+ "print(generated_text)"
1242
+ ]
1243
+ },
1244
+ {
1245
+ "cell_type": "markdown",
1246
+ "metadata": {
1247
+ "id": "OU-xDHpEEmg9"
1248
+ },
1249
+ "source": [
1250
+ "The final answer is correct!"
1251
+ ]
1252
+ },
1253
+ {
1254
+ "cell_type": "markdown",
1255
+ "metadata": {
1256
+ "id": "XNtBOpRY8a2O"
1257
+ },
1258
+ "source": [
1259
+ "## Inference and Serving with vLLM\n",
1260
+ "\n",
1261
+ "You can use Transformer models with **vLLM** to serve them in real-world applications. Learn more [here](https://blog.vllm.ai/2025/04/11/transformers-backend.html)."
1262
+ ]
1263
+ },
1264
+ {
1265
+ "cell_type": "markdown",
1266
+ "metadata": {
1267
+ "id": "nkhu0uY78lV3"
1268
+ },
1269
+ "source": [
1270
+ "### Push Merged Model (for LoRA or QLoRA Training)\n",
1271
+ "\n",
1272
+ "To serve the model via **vLLM**, the repository must contain the merged model (base model + LoRA adapter). Therefore, you need to upload it first."
1273
+ ]
1274
+ },
1275
+ {
1276
+ "cell_type": "code",
1277
+ "execution_count": null,
1278
+ "metadata": {
1279
+ "id": "NF8ZP9Z-Wbdt",
1280
+ "outputId": "32a5ab71-1f0d-4289-ea12-66f5f75a957b"
1281
+ },
1282
+ "outputs": [
1283
+ {
1284
+ "data": {
1285
+ "text/plain": [
1286
+ "('Qwen2-7B-Instruct-GRPO-merged/tokenizer_config.json',\n",
1287
+ " 'Qwen2-7B-Instruct-GRPO-merged/special_tokens_map.json',\n",
1288
+ " 'Qwen2-7B-Instruct-GRPO-merged/chat_template.jinja',\n",
1289
+ " 'Qwen2-7B-Instruct-GRPO-merged/vocab.json',\n",
1290
+ " 'Qwen2-7B-Instruct-GRPO-merged/merges.txt',\n",
1291
+ " 'Qwen2-7B-Instruct-GRPO-merged/added_tokens.json',\n",
1292
+ " 'Qwen2-7B-Instruct-GRPO-merged/tokenizer.json')"
1293
+ ]
1294
+ },
1295
+ "execution_count": 29,
1296
+ "metadata": {},
1297
+ "output_type": "execute_result"
1298
+ }
1299
+ ],
1300
+ "source": [
1301
+ "model_merged = fine_tuned_model.merge_and_unload()\n",
1302
+ "\n",
1303
+ "save_dir = f\"{output_dir}-merged\"\n",
1304
+ "\n",
1305
+ "model_merged.save_pretrained(save_dir)\n",
1306
+ "tokenizer.save_pretrained(save_dir)"
1307
+ ]
1308
+ },
1309
+ {
1310
+ "cell_type": "code",
1311
+ "execution_count": null,
1312
+ "metadata": {
1313
+ "colab": {
1314
+ "referenced_widgets": [
1315
+ "d1a0574cc20046d5876cf31b21955f8b",
1316
+ "7cc2f0ef7ad2494cad572cd898095c00",
1317
+ "475420d92bb54dc08517ffe423b015c3",
1318
+ "a76231aeae5a49979d1e9075b0b3eefb",
1319
+ "b4f469f957134ea9b0e28532fe3caaf1",
1320
+ "637e55736da34f2c9b098222ae07244a",
1321
+ "8157e521017c450a9d2a9e41611405e9",
1322
+ "9746ae4ab0574ed186f898dba3b4b197",
1323
+ "d4b2a8805ec548ea85e0900ff5927574",
1324
+ "0668cd8597f141e89ef38129c6641c1f"
1325
+ ]
1326
+ },
1327
+ "id": "X5Zci39rWbdt",
1328
+ "outputId": "ca329f99-dc7b-470c-f5d9-39a3eabcb16d"
1329
+ },
1330
+ "outputs": [
1331
+ {
1332
+ "data": {
1333
+ "application/vnd.jupyter.widget-view+json": {
1334
+ "model_id": "d1a0574cc20046d5876cf31b21955f8b",
1335
+ "version_major": 2,
1336
+ "version_minor": 0
1337
+ },
1338
+ "text/plain": [
1339
+ "Processing Files (0 / 0) : | | 0.00B / 0.00B "
1340
+ ]
1341
+ },
1342
+ "metadata": {},
1343
+ "output_type": "display_data"
1344
+ },
1345
+ {
1346
+ "data": {
1347
+ "application/vnd.jupyter.widget-view+json": {
1348
+ "model_id": "7cc2f0ef7ad2494cad572cd898095c00",
1349
+ "version_major": 2,
1350
+ "version_minor": 0
1351
+ },
1352
+ "text/plain": [
1353
+ "New Data Upload : | | 0.00B / 0.00B "
1354
+ ]
1355
+ },
1356
+ "metadata": {},
1357
+ "output_type": "display_data"
1358
+ },
1359
+ {
1360
+ "data": {
1361
+ "application/vnd.jupyter.widget-view+json": {
1362
+ "model_id": "475420d92bb54dc08517ffe423b015c3",
1363
+ "version_major": 2,
1364
+ "version_minor": 0
1365
+ },
1366
+ "text/plain": [
1367
+ " ...0002-of-00004.safetensors: 0%| | 612kB / 4.93GB "
1368
+ ]
1369
+ },
1370
+ "metadata": {},
1371
+ "output_type": "display_data"
1372
+ },
1373
+ {
1374
+ "data": {
1375
+ "application/vnd.jupyter.widget-view+json": {
1376
+ "model_id": "a76231aeae5a49979d1e9075b0b3eefb",
1377
+ "version_major": 2,
1378
+ "version_minor": 0
1379
+ },
1380
+ "text/plain": [
1381
+ " ...0003-of-00004.safetensors: 0%| | 611kB / 4.33GB "
1382
+ ]
1383
+ },
1384
+ "metadata": {},
1385
+ "output_type": "display_data"
1386
+ },
1387
+ {
1388
+ "data": {
1389
+ "application/vnd.jupyter.widget-view+json": {
1390
+ "model_id": "b4f469f957134ea9b0e28532fe3caaf1",
1391
+ "version_major": 2,
1392
+ "version_minor": 0
1393
+ },
1394
+ "text/plain": [
1395
+ " ...0001-of-00004.safetensors: 1%|1 | 50.3MB / 4.88GB "
1396
+ ]
1397
+ },
1398
+ "metadata": {},
1399
+ "output_type": "display_data"
1400
+ },
1401
+ {
1402
+ "data": {
1403
+ "application/vnd.jupyter.widget-view+json": {
1404
+ "model_id": "637e55736da34f2c9b098222ae07244a",
1405
+ "version_major": 2,
1406
+ "version_minor": 0
1407
+ },
1408
+ "text/plain": [
1409
+ " ...0004-of-00004.safetensors: 4%|3 | 41.9MB / 1.09GB "
1410
+ ]
1411
+ },
1412
+ "metadata": {},
1413
+ "output_type": "display_data"
1414
+ },
1415
+ {
1416
+ "data": {
1417
+ "application/vnd.jupyter.widget-view+json": {
1418
+ "model_id": "8157e521017c450a9d2a9e41611405e9",
1419
+ "version_major": 2,
1420
+ "version_minor": 0
1421
+ },
1422
+ "text/plain": [
1423
+ "README.md: 0.00B [00:00, ?B/s]"
1424
+ ]
1425
+ },
1426
+ "metadata": {},
1427
+ "output_type": "display_data"
1428
+ },
1429
+ {
1430
+ "data": {
1431
+ "application/vnd.jupyter.widget-view+json": {
1432
+ "model_id": "9746ae4ab0574ed186f898dba3b4b197",
1433
+ "version_major": 2,
1434
+ "version_minor": 0
1435
+ },
1436
+ "text/plain": [
1437
+ "Processing Files (0 / 0) : | | 0.00B / 0.00B "
1438
+ ]
1439
+ },
1440
+ "metadata": {},
1441
+ "output_type": "display_data"
1442
+ },
1443
+ {
1444
+ "data": {
1445
+ "application/vnd.jupyter.widget-view+json": {
1446
+ "model_id": "d4b2a8805ec548ea85e0900ff5927574",
1447
+ "version_major": 2,
1448
+ "version_minor": 0
1449
+ },
1450
+ "text/plain": [
1451
+ "New Data Upload : | | 0.00B / 0.00B "
1452
+ ]
1453
+ },
1454
+ "metadata": {},
1455
+ "output_type": "display_data"
1456
+ },
1457
+ {
1458
+ "data": {
1459
+ "application/vnd.jupyter.widget-view+json": {
1460
+ "model_id": "0668cd8597f141e89ef38129c6641c1f",
1461
+ "version_major": 2,
1462
+ "version_minor": 0
1463
+ },
1464
+ "text/plain": [
1465
+ " ...RPO-merged/tokenizer.json: 100%|##########| 11.4MB / 11.4MB "
1466
+ ]
1467
+ },
1468
+ "metadata": {},
1469
+ "output_type": "display_data"
1470
+ },
1471
+ {
1472
+ "data": {
1473
+ "application/vnd.google.colaboratory.intrinsic+json": {
1474
+ "type": "string"
1475
+ },
1476
+ "text/plain": [
1477
+ "CommitInfo(commit_url='https://huggingface.co/sergiopaniego/Qwen2-7B-Instruct-GRPO-merged/commit/b20988444532e79a6915f0b2b6002b5acc2b53e1', commit_message='Upload tokenizer', commit_description='', oid='b20988444532e79a6915f0b2b6002b5acc2b53e1', pr_url=None, repo_url=RepoUrl('https://huggingface.co/sergiopaniego/Qwen2-7B-Instruct-GRPO-merged', endpoint='https://huggingface.co', repo_type='model', repo_id='sergiopaniego/Qwen2-7B-Instruct-GRPO-merged'), pr_revision=None, pr_num=None)"
1478
+ ]
1479
+ },
1480
+ "execution_count": 30,
1481
+ "metadata": {},
1482
+ "output_type": "execute_result"
1483
+ }
1484
+ ],
1485
+ "source": [
1486
+ "model_merged.push_to_hub(f\"sergiopaniego/{output_dir}-merged\") # Replace with your HF username or organization\n",
1487
+ "tokenizer.push_to_hub(f\"sergiopaniego/{output_dir}-merged\") # Replace with your HF username or organization"
1488
+ ]
1489
+ },
1490
+ {
1491
+ "cell_type": "markdown",
1492
+ "metadata": {
1493
+ "id": "DQ00Ivxi8rFu"
1494
+ },
1495
+ "source": [
1496
+ "### Performing Inference with vLLM\n",
1497
+ "\n",
1498
+ "Use **vLLM** to run your model and generate text efficiently in real-time. This allows you to test and deploy your fine-tuned models with low latency and high throughput."
1499
+ ]
1500
+ },
1501
+ {
1502
+ "cell_type": "code",
1503
+ "execution_count": null,
1504
+ "metadata": {
1505
+ "id": "x7L-HIn4Wbdt",
1506
+ "outputId": "afd66093-3525-4590-f834-c0b373e7bb9e"
1507
+ },
1508
+ "outputs": [
1509
+ {
1510
+ "name": "stdout",
1511
+ "output_type": "stream",
1512
+ "text": [
1513
+ "INFO 12-11 15:56:09 [utils.py:253] non-default args: {'dtype': torch.float16, 'max_model_len': 256, 'disable_log_stats': True, 'model_impl': 'transformers', 'model': 'sergiopaniego/Qwen2-7B-Instruct-GRPO-merged'}\n"
1514
+ ]
1515
+ },
1516
+ {
1517
+ "name": "stderr",
1518
+ "output_type": "stream",
1519
+ "text": [
1520
+ "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_auth.py:104: UserWarning: \n",
1521
+ "Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.\n",
1522
+ "You are not authenticated with the Hugging Face Hub in this notebook.\n",
1523
+ "If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).\n",
1524
+ " warnings.warn(\n"
1525
+ ]
1526
+ },
1527
+ {
1528
+ "name": "stdout",
1529
+ "output_type": "stream",
1530
+ "text": [
1531
+ "INFO 12-11 15:56:37 [model.py:631] Resolved architecture: TransformersForCausalLM\n",
1532
+ "WARNING 12-11 15:56:37 [model.py:1971] Casting torch.bfloat16 to torch.float16.\n",
1533
+ "INFO 12-11 15:56:37 [model.py:1745] Using max model len 256\n",
1534
+ "INFO 12-11 15:56:40 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=8192.\n",
1535
+ "WARNING 12-11 15:56:43 [system_utils.py:103] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized\n",
1536
+ "INFO 12-11 15:57:36 [llm.py:352] Supported tasks: ['generate']\n"
1537
+ ]
1538
+ }
1539
+ ],
1540
+ "source": [
1541
+ "from vllm import LLM, SamplingParams\n",
1542
+ "from transformers import AutoTokenizer\n",
1543
+ "import torch\n",
1544
+ "\n",
1545
+ "llm = LLM(\n",
1546
+ " model=f\"sergiopaniego/{output_dir}-merged\", # Replace with your HF username or organization\n",
1547
+ " model_impl=\"transformers\", # Select the transformers model implementation\n",
1548
+ " max_model_len=256, # Reduced for efficiency\n",
1549
+ " dtype=torch.float16\n",
1550
+ ")\n",
1551
+ "hf_tokenizer = AutoTokenizer.from_pretrained(f\"sergiopaniego/{output_dir}-merged\") # Replace with your HF username or organization"
1552
+ ]
1553
+ },
1554
+ {
1555
+ "cell_type": "code",
1556
+ "execution_count": null,
1557
+ "metadata": {
1558
+ "colab": {
1559
+ "referenced_widgets": [
1560
+ "f0a4f4fb17bf4a698503212296467547",
1561
+ "5be7348f3f324b5b9397c9ad186fb35d"
1562
+ ]
1563
+ },
1564
+ "id": "ZTpSUqxNWbdt",
1565
+ "outputId": "6a9283bf-d3b7-4e54-c775-4502694b5c6d"
1566
+ },
1567
+ "outputs": [
1568
+ {
1569
+ "data": {
1570
+ "application/vnd.jupyter.widget-view+json": {
1571
+ "model_id": "f0a4f4fb17bf4a698503212296467547",
1572
+ "version_major": 2,
1573
+ "version_minor": 0
1574
+ },
1575
+ "text/plain": [
1576
+ "Adding requests: 0%| | 0/1 [00:00<?, ?it/s]"
1577
+ ]
1578
+ },
1579
+ "metadata": {},
1580
+ "output_type": "display_data"
1581
+ },
1582
+ {
1583
+ "data": {
1584
+ "application/vnd.jupyter.widget-view+json": {
1585
+ "model_id": "5be7348f3f324b5b9397c9ad186fb35d",
1586
+ "version_major": 2,
1587
+ "version_minor": 0
1588
+ },
1589
+ "text/plain": [
1590
+ "Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]"
1591
+ ]
1592
+ },
1593
+ "metadata": {},
1594
+ "output_type": "display_data"
1595
+ },
1596
+ {
1597
+ "name": "stdout",
1598
+ "output_type": "stream",
1599
+ "text": [
1600
+ "<think> 1988 birth year implies the person was born either in 1979, 1980, 1981, etc. Looking for the one where sum of digits equals age </think>\n",
1601
+ "\n",
1602
+ "The birth year 1979 gives sum of digits 1+9+7+9 = 26\n",
1603
+ "\n",
1604
+ "The person was 26 years old in 1988.\n",
1605
+ "\n",
1606
+ "Answer: The person was 26 years old.\n"
1607
+ ]
1608
+ }
1609
+ ],
1610
+ "source": [
1611
+ "messages = test_dataset[0]['prompt']\n",
1612
+ "# Alternatively, use llm.chat()\n",
1613
+ "prompt = hf_tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)\n",
1614
+ "\n",
1615
+ "outputs = llm.generate(\n",
1616
+ " {\"prompt\": prompt},\n",
1617
+ " sampling_params=SamplingParams(max_tokens=256),\n",
1618
+ ")\n",
1619
+ "\n",
1620
+ "for o in outputs:\n",
1621
+ " generated_text = o.outputs[0].text\n",
1622
+ " print(generated_text)"
1623
+ ]
1624
+ }
1625
+ ],
1626
+ "metadata": {
1627
+ "accelerator": "GPU",
1628
+ "colab": {
1629
+ "gpuType": "T4",
1630
+ "provenance": []
1631
+ },
1632
+ "language_info": {
1633
+ "name": "python"
1634
+ }
1635
+ },
1636
+ "nbformat": 4,
1637
+ "nbformat_minor": 0
1638
+ }
ICL/RL/trl_source/examples/notebooks/openenv_sudoku_grpo.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
ICL/RL/trl_source/examples/notebooks/openenv_wordle_grpo.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
ICL/RL/trl_source/examples/notebooks/sft_ministral3_vl.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
ICL/RL/trl_source/examples/notebooks/sft_qwen_vl.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
ICL/RL/trl_source/examples/notebooks/sft_trl_lora_qlora.ipynb ADDED
@@ -0,0 +1,1140 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "5oqSnSaqLWAL"
7
+ },
8
+ "source": [
9
+ "# Supervised Fine-Tuning (SFT) with LoRA/QLoRA using TRL — on a Free Colab Notebook\n",
10
+ "\n",
11
+ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_trl_lora_qlora.ipynb)"
12
+ ]
13
+ },
14
+ {
15
+ "cell_type": "markdown",
16
+ "metadata": {
17
+ "id": "d6c1x17tLWAR"
18
+ },
19
+ "source": [
20
+ "![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "markdown",
25
+ "metadata": {
26
+ "id": "cQ6bxQaMLWAS"
27
+ },
28
+ "source": [
29
+ "Easily fine-tune Large Language Models (LLMs) or Vision-Language Models (VLMs) with **LoRA** or **QLoRA** using the [**Transformers Reinforcement Learning (TRL)**](https://github.com/huggingface/trl) library built by Hugging Face — all within a **free Google Colab notebook** (powered by a **T4 GPU**.). \n",
30
+ "\n",
31
+ "- [TRL GitHub Repository](https://github.com/huggingface/trl) — star us to support the project! \n",
32
+ "- [Official TRL Examples](https://huggingface.co/docs/trl/example_overview) \n",
33
+ "- [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)"
34
+ ]
35
+ },
36
+ {
37
+ "cell_type": "markdown",
38
+ "metadata": {
39
+ "id": "JG3wax0uLWAU"
40
+ },
41
+ "source": [
42
+ "## Key concepts\n",
43
+ "\n",
44
+ "- **SFT**: Trains models from example input-output pairs to align behavior with human preferences.\n",
45
+ "- **LoRA**: Updates only a few low-rank parameters, reducing training cost and memory.\n",
46
+ "- **QLoRA**: A quantized version of LoRA that enables even larger models to fit on small GPUs.\n",
47
+ "- **TRL**: The Hugging Face library that makes fine-tuning and reinforcement learning simple and efficient.\n",
48
+ "\n",
49
+ "Learn how to perform **Supervised Fine-Tuning (SFT)** with **LoRA/QLoRA** using **TRL**."
50
+ ]
51
+ },
52
+ {
53
+ "cell_type": "markdown",
54
+ "metadata": {
55
+ "id": "0ZhyNnhiLWAV"
56
+ },
57
+ "source": [
58
+ "## Install dependencies\n",
59
+ "\n",
60
+ "We'll install **TRL** with the **PEFT** extra, which ensures all main dependencies such as **Transformers** and **PEFT** (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install **trackio** to log and monitor our experiments, and **bitsandbytes** to enable quantization of LLMs, reducing memory consumption for both inference and training."
61
+ ]
62
+ },
63
+ {
64
+ "cell_type": "code",
65
+ "execution_count": null,
66
+ "metadata": {
67
+ "id": "FXTyVTJcLWAV"
68
+ },
69
+ "outputs": [],
70
+ "source": [
71
+ "!pip install -Uq \"trl[peft]\" trackio bitsandbytes liger-kernel"
72
+ ]
73
+ },
74
+ {
75
+ "cell_type": "markdown",
76
+ "metadata": {
77
+ "id": "OqlMF6oWLWAY"
78
+ },
79
+ "source": [
80
+ "### Log in to Hugging Face"
81
+ ]
82
+ },
83
+ {
84
+ "cell_type": "markdown",
85
+ "metadata": {
86
+ "id": "2blL6-1_LWAa"
87
+ },
88
+ "source": [
89
+ "Log in to your **Hugging Face** account to save your fine-tuned model, track your experiment results directly on the Hub or access gated models. You can find your **access token** on your [account settings page](https://huggingface.co/settings/tokens)."
90
+ ]
91
+ },
92
+ {
93
+ "cell_type": "code",
94
+ "execution_count": null,
95
+ "metadata": {
96
+ "id": "6OMeJOp7LWAc"
97
+ },
98
+ "outputs": [],
99
+ "source": [
100
+ "from huggingface_hub import notebook_login\n",
101
+ "\n",
102
+ "notebook_login()"
103
+ ]
104
+ },
105
+ {
106
+ "cell_type": "markdown",
107
+ "metadata": {
108
+ "id": "6HHscLIQLWAd"
109
+ },
110
+ "source": [
111
+ "## Load Dataset\n",
112
+ "\n",
113
+ "In this step, we load the [**HuggingFaceH4/Multilingual-Thinking**](https://huggingface.co/datasets/HuggingFaceH4/Multilingual-Thinking) dataset from the Hugging Face Hub using the `datasets` library. \n",
114
+ "This dataset focuses on **multilingual reasoning**, where the *chain of thought* has been translated into several languages such as French, Spanish, and German. \n",
115
+ "By fine-tuning a reasoning-capable model on this dataset, it learns to **generate reasoning steps in multiple languages**, making its thought process more **interpretable and accessible** to non-English speakers.\n",
116
+ "\n",
117
+ "> 💡 This dataset is best suited for models that already demonstrate reasoning capabilities. \n",
118
+ "> If you're using a model without reasoning skills, consider choosing a different dataset. Example: [`trl-lib/llava-instruct-mix`](https://huggingface.co/datasets/trl-lib/llava-instruct-mix).\n",
119
+ "\n",
120
+ "For efficiency, we'll load only the **training split**:"
121
+ ]
122
+ },
123
+ {
124
+ "cell_type": "code",
125
+ "execution_count": null,
126
+ "metadata": {
127
+ "id": "dlQSKxTnLWAd"
128
+ },
129
+ "outputs": [],
130
+ "source": [
131
+ "from datasets import load_dataset\n",
132
+ "\n",
133
+ "dataset_name = \"HuggingFaceH4/Multilingual-Thinking\"\n",
134
+ "train_dataset = load_dataset(dataset_name, split=\"train\")"
135
+ ]
136
+ },
137
+ {
138
+ "cell_type": "markdown",
139
+ "metadata": {
140
+ "id": "bRHTwwZXLWAe"
141
+ },
142
+ "source": [
143
+ "This dataset contains different columns. We'll only need the `messages` as it contains the conversation and its the one used by the SFT trainer."
144
+ ]
145
+ },
146
+ {
147
+ "cell_type": "code",
148
+ "execution_count": null,
149
+ "metadata": {
150
+ "id": "zOBq8tVdLWAe",
151
+ "outputId": "e12ab8ae-e00c-4e89-b489-dd448db8e13b"
152
+ },
153
+ "outputs": [
154
+ {
155
+ "data": {
156
+ "text/plain": [
157
+ "Dataset({\n",
158
+ " features: ['reasoning_language', 'developer', 'user', 'analysis', 'final', 'messages'],\n",
159
+ " num_rows: 1000\n",
160
+ "})"
161
+ ]
162
+ },
163
+ "execution_count": null,
164
+ "metadata": {},
165
+ "output_type": "execute_result"
166
+ }
167
+ ],
168
+ "source": [
169
+ "train_dataset"
170
+ ]
171
+ },
172
+ {
173
+ "cell_type": "markdown",
174
+ "metadata": {
175
+ "id": "b13TjFs2LWAe"
176
+ },
177
+ "source": [
178
+ "Let's see a full example to understand the internal structure:"
179
+ ]
180
+ },
181
+ {
182
+ "cell_type": "code",
183
+ "execution_count": null,
184
+ "metadata": {
185
+ "id": "ZON5mIMNLWAf",
186
+ "outputId": "d01415eb-26cb-45ce-ad48-0388161eea28"
187
+ },
188
+ "outputs": [
189
+ {
190
+ "data": {
191
+ "text/plain": [
192
+ "{'reasoning_language': 'French',\n",
193
+ " 'developer': 'You are an AI chatbot with a lively and energetic personality.',\n",
194
+ " 'user': 'Can you show me the latest trends on Twitter right now?',\n",
195
+ " 'analysis': \"D'accord, l'utilisateur demande les tendances Twitter les plus récentes. Tout d'abord, je dois vérifier si j'ai accès à des données en temps réel. Étant donné que je ne peux pas naviguer sur Internet ou accéder directement à l'API de Twitter, je ne peux pas fournir des tendances en direct. Cependant, je peux donner quelques conseils généraux sur la façon de les trouver.\\n\\nJe devrais préciser que les tendances Twitter évoluent rapidement et sont spécifiques à chaque région. Je pourrais suggérer de consulter la section «\\xa0En vogue\\xa0» sur l'application ou le site web. Aussi, l'utilisation de hashtags et le suivi d'utilisateurs pertinents pourraient être utiles. Il est important de souligner que les tendances varient selon la région et l'heure de la journée. Je devrais garder un ton amical et bienveillant, peut-être ajouter un emoji pour rester léger. Je vais structurer ma réponse étape par étape pour faciliter la lecture. Je dois m'excuser de ne pas pouvoir fournir des données en temps réel et proposer d'autres méthodes. Je conserverai un langage simple et convivial, en évitant les termes techniques.\",\n",
196
+ " 'final': 'Hey there! While I can\\'t check Twitter (X) in real-time or access live data, I can share some tips to help you spot the latest trends:\\n\\n1. **Open the \"Trending\" tab** on the Twitter app or website – it updates constantly! \\n2. **Search for hashtags** like #Trending or #Viral to see what’s blowing up. \\n3. **Follow accounts** that curate trends (e.g., @TrendingNow, @ViralThreads). \\n4. **Check regional trends** – they often differ by location! \\n\\nRemember, trends are *super fast-moving* and often tied to pop culture, memes, or breaking news. For example, recent trends have included viral challenges (like the \"Distracted Boyfriend\" meme revival), celebrity drama, or unexpected events (hello, weather disasters!). \\n\\nWant me to brainstorm *what* might trend next? I’ve got ideas!',\n",
197
+ " 'messages': [{'content': 'reasoning language: French\\n\\nYou are an AI chatbot with a lively and energetic personality.',\n",
198
+ " 'role': 'system',\n",
199
+ " 'thinking': None},\n",
200
+ " {'content': 'Can you show me the latest trends on Twitter right now?',\n",
201
+ " 'role': 'user',\n",
202
+ " 'thinking': None},\n",
203
+ " {'content': 'Hey there! While I can\\'t check Twitter (X) in real-time or access live data, I can share some tips to help you spot the latest trends:\\n\\n1. **Open the \"Trending\" tab** on the Twitter app or website – it updates constantly! \\n2. **Search for hashtags** like #Trending or #Viral to see what’s blowing up. \\n3. **Follow accounts** that curate trends (e.g., @TrendingNow, @ViralThreads). \\n4. **Check regional trends** – they often differ by location! \\n\\nRemember, trends are *super fast-moving* and often tied to pop culture, memes, or breaking news. For example, recent trends have included viral challenges (like the \"Distracted Boyfriend\" meme revival), celebrity drama, or unexpected events (hello, weather disasters!). \\n\\nWant me to brainstorm *what* might trend next? I’ve got ideas!',\n",
204
+ " 'role': 'assistant',\n",
205
+ " 'thinking': \"D'accord, l'utilisateur demande les tendances Twitter les plus récentes. Tout d'abord, je dois vérifier si j'ai accès à des données en temps réel. Étant donné que je ne peux pas naviguer sur Internet ou accéder directement à l'API de Twitter, je ne peux pas fournir des tendances en direct. Cependant, je peux donner quelques conseils généraux sur la façon de les trouver.\\n\\nJe devrais préciser que les tendances Twitter évoluent rapidement et sont spécifiques à chaque région. Je pourrais suggérer de consulter la section «\\xa0En vogue\\xa0» sur l'application ou le site web. Aussi, l'utilisation de hashtags et le suivi d'utilisateurs pertinents pourraient être utiles. Il est important de souligner que les tendances varient selon la région et l'heure de la journée. Je devrais garder un ton amical et bienveillant, peut-être ajouter un emoji pour rester léger. Je vais structurer ma réponse étape par étape pour faciliter la lecture. Je dois m'excuser de ne pas pouvoir fournir des données en temps réel et proposer d'autres méthodes. Je conserverai un langage simple et convivial, en évitant les termes techniques.\"}]}"
206
+ ]
207
+ },
208
+ "execution_count": null,
209
+ "metadata": {},
210
+ "output_type": "execute_result"
211
+ }
212
+ ],
213
+ "source": [
214
+ "train_dataset[0]"
215
+ ]
216
+ },
217
+ {
218
+ "cell_type": "markdown",
219
+ "metadata": {
220
+ "id": "RPQfGZjlLWAf"
221
+ },
222
+ "source": [
223
+ "\n",
224
+ "Now, let's remove the columns that are not needed, as we just discussed:"
225
+ ]
226
+ },
227
+ {
228
+ "cell_type": "code",
229
+ "execution_count": null,
230
+ "metadata": {
231
+ "id": "pCM6PoIzLWAf"
232
+ },
233
+ "outputs": [],
234
+ "source": [
235
+ "train_dataset = train_dataset.remove_columns(column_names=['reasoning_language', 'developer', 'user', 'analysis', 'final'])"
236
+ ]
237
+ },
238
+ {
239
+ "cell_type": "markdown",
240
+ "metadata": {
241
+ "id": "BcU6E8KnLWAf"
242
+ },
243
+ "source": [
244
+ "The `messages` column is specifically formatted according to the [Harmony response format](https://cookbook.openai.com/articles/openai-harmony) used by *gpt-oss*. \n",
245
+ "In our case, we'll need to simplify it slightly, since our model's chat template doesn't include a dedicated `thinking` section (check [this example](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers) for more details). \n",
246
+ "To adapt it, we'll merge that part into the message content using the standard `<think>...</think>` tags.\n"
247
+ ]
248
+ },
249
+ {
250
+ "cell_type": "code",
251
+ "execution_count": null,
252
+ "metadata": {
253
+ "id": "XQ2xYEq3LWAf"
254
+ },
255
+ "outputs": [],
256
+ "source": [
257
+ "def merge_thinking_and_remove_key(example):\n",
258
+ " new_messages = []\n",
259
+ " for msg in example[\"messages\"]:\n",
260
+ " content = msg[\"content\"]\n",
261
+ " thinking = msg.pop(\"thinking\", None)\n",
262
+ " if thinking and isinstance(thinking, str) and thinking.strip():\n",
263
+ " content = f\"<think>\\n{thinking}\\n</think>\\n{content}\"\n",
264
+ " msg[\"content\"] = content\n",
265
+ " new_messages.append(msg)\n",
266
+ " example[\"messages\"] = new_messages\n",
267
+ " return example\n",
268
+ "\n",
269
+ "train_dataset = train_dataset.map(merge_thinking_and_remove_key)"
270
+ ]
271
+ },
272
+ {
273
+ "cell_type": "markdown",
274
+ "metadata": {
275
+ "id": "ewvZeKUcLWAf"
276
+ },
277
+ "source": [
278
+ "## Load model and configure LoRA/QLoRA\n",
279
+ "\n",
280
+ "This notebook can be used with two fine-tuning methods. By default, it is set up for **QLoRA**, which includes quantization using `BitsAndBytesConfig`. If you prefer to use standard **LoRA** without quantization, simply comment out the `BitsAndBytesConfig` configuration.\n",
281
+ "\n",
282
+ "Below, choose your **preferred model**. All of the options have been tested on **free Colab instances**."
283
+ ]
284
+ },
285
+ {
286
+ "cell_type": "code",
287
+ "execution_count": null,
288
+ "metadata": {
289
+ "id": "sAWjOn9gLWAf"
290
+ },
291
+ "outputs": [],
292
+ "source": [
293
+ "# Select one model below by uncommenting the line you want to use 👇\n",
294
+ "## Qwen\n",
295
+ "model_id, output_dir = \"unsloth/qwen3-14b-unsloth-bnb-4bit\", \"qwen3-14b-unsloth-bnb-4bit-SFT\" # ⚠️ ~14.1 GB VRAM\n",
296
+ "# model_id, output_dir = \"Qwen/Qwen3-8B\", \"Qwen3-8B-SFT\" # ⚠️ ~12.8 GB VRAM\n",
297
+ "# model_id, output_dir = \"Qwen/Qwen2.5-7B-Instruct\", \"Qwen2.5-7B-Instruct\" # ✅ ~10.8 GB VRAM\n",
298
+ "\n",
299
+ "## Llama\n",
300
+ "# model_id, output_dir = \"meta-llama/Llama-3.2-3B-Instruct\", \"Llama-3.2-3B-Instruct\" # ✅ ~4.7 GB VRAM\n",
301
+ "# model_id, output_dir = \"meta-llama/Llama-3.1-8B-Instruct\", \"Llama-3.1-8B-Instruct\" # ⚠️ ~10.9 GB VRAM\n",
302
+ "\n",
303
+ "## Gemma\n",
304
+ "# model_id, output_dir = \"google/gemma-3n-E2B-it\", \"gemma-3n-E2B-it\" # ❌ Upgrade to a higher tier of colab\n",
305
+ "# model_id, output_dir = \"google/gemma-3-4b-it\", \"gemma-3-4b-it\" # ⚠️ ~6.8 GB VRAM\n",
306
+ "\n",
307
+ "## Granite\n",
308
+ "#model_id, output_dir = \"ibm-granite/granite-4.0-micro\", \"granite-4.0-micro\" # ✅ ~3.3 GB VRAM\n",
309
+ "\n",
310
+ "## LFM2\n",
311
+ "#model_id, output_dir = \"LiquidAI/LFM2-2.6B\", \"LFM2-2.6B-SFT\" # ✅ ~5.89 GB VRAM"
312
+ ]
313
+ },
314
+ {
315
+ "cell_type": "markdown",
316
+ "metadata": {
317
+ "id": "BXY9Y0_dLWAf"
318
+ },
319
+ "source": [
320
+ "Let's load the selected model using `transformers`, configuring QLoRA via `bitsandbytes` (you can remove it if doing LoRA). We don't need to configure the tokenizer since the trainer takes care of that automatically."
321
+ ]
322
+ },
323
+ {
324
+ "cell_type": "code",
325
+ "execution_count": null,
326
+ "metadata": {
327
+ "id": "oyOoWFsLLWAg"
328
+ },
329
+ "outputs": [],
330
+ "source": [
331
+ "import torch\n",
332
+ "from transformers import AutoModelForCausalLM, BitsAndBytesConfig\n",
333
+ "\n",
334
+ "model = AutoModelForCausalLM.from_pretrained(\n",
335
+ " model_id,\n",
336
+ " attn_implementation=\"sdpa\", # Change to Flash Attention if GPU has support\n",
337
+ " dtype=torch.float16, # Change to bfloat16 if GPU has support\n",
338
+ " use_cache=True, # Whether to cache attention outputs to speed up inference\n",
339
+ " quantization_config=BitsAndBytesConfig(\n",
340
+ " load_in_4bit=True, # Load the model in 4-bit precision to save memory\n",
341
+ " bnb_4bit_compute_dtype=torch.float16, # Data type used for internal computations in quantization\n",
342
+ " bnb_4bit_use_double_quant=True, # Use double quantization to improve accuracy\n",
343
+ " bnb_4bit_quant_type=\"nf4\" # Type of quantization. \"nf4\" is recommended for recent LLMs\n",
344
+ " )\n",
345
+ ")"
346
+ ]
347
+ },
348
+ {
349
+ "cell_type": "markdown",
350
+ "metadata": {
351
+ "id": "L-_BpOdILWAg"
352
+ },
353
+ "source": [
354
+ "The following cell defines LoRA (or QLoRA if needed). When training with LoRA/QLoRA, we use a **base model** (the one selected above) and, instead of modifying its original weights, we fine-tune a **LoRA adapter** — a lightweight layer that enables efficient and memory-friendly training. The **`target_modules`** specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning."
355
+ ]
356
+ },
357
+ {
358
+ "cell_type": "code",
359
+ "execution_count": null,
360
+ "metadata": {
361
+ "id": "9EL-glV-LWAg"
362
+ },
363
+ "outputs": [],
364
+ "source": [
365
+ "from peft import LoraConfig\n",
366
+ "\n",
367
+ "# You may need to update `target_modules` depending on the architecture of your chosen model.\n",
368
+ "# For example, different LLMs might have different attention/projection layer names.\n",
369
+ "peft_config = LoraConfig(\n",
370
+ " r=32,\n",
371
+ " lora_alpha=32,\n",
372
+ " target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\",],\n",
373
+ ")"
374
+ ]
375
+ },
376
+ {
377
+ "cell_type": "markdown",
378
+ "metadata": {
379
+ "id": "-i6BMpcaLWAg"
380
+ },
381
+ "source": [
382
+ "## Train model\n",
383
+ "\n",
384
+ "We'll configure **SFT** using `SFTConfig`, keeping the parameters minimal so the training fits on a free Colab instance. You can adjust these settings if more resources are available. For full details on all available parameters, check the [TRL SFTConfig documentation](https://huggingface.co/docs/trl/sft_trainer#trl.SFTConfig)."
385
+ ]
386
+ },
387
+ {
388
+ "cell_type": "code",
389
+ "execution_count": null,
390
+ "metadata": {
391
+ "id": "-doztoyxLWAg"
392
+ },
393
+ "outputs": [],
394
+ "source": [
395
+ "from trl import SFTConfig\n",
396
+ "\n",
397
+ "training_args = SFTConfig(\n",
398
+ " # Training schedule / optimization\n",
399
+ " per_device_train_batch_size = 1, # Batch size per GPU\n",
400
+ " gradient_accumulation_steps = 4, # Gradients are accumulated over multiple steps → effective batch size = 2 * 8 = 16\n",
401
+ " warmup_steps = 5,\n",
402
+ " # num_train_epochs = 1, # Number of full dataset passes. For shorter training, use `max_steps` instead (this case)\n",
403
+ " max_steps = 30,\n",
404
+ " learning_rate = 2e-4, # Learning rate for the optimizer\n",
405
+ " optim = \"paged_adamw_8bit\", # Optimizer\n",
406
+ "\n",
407
+ " # Logging / reporting\n",
408
+ " logging_steps=1, # Log training metrics every N steps\n",
409
+ " report_to=\"trackio\", # Experiment tracking tool\n",
410
+ " trackio_space_id=output_dir, # HF Space where the experiment tracking will be saved\n",
411
+ " output_dir=output_dir, # Where to save model checkpoints and logs\n",
412
+ "\n",
413
+ " max_length=1024, # Maximum input sequence length\n",
414
+ " use_liger_kernel=True, # Enable Liger kernel optimizations for faster training\n",
415
+ " activation_offloading=True, # Offload activations to CPU to reduce GPU memory usage\n",
416
+ "\n",
417
+ " # Hub integration\n",
418
+ " push_to_hub=True, # Automatically push the trained model to the Hugging Face Hub\n",
419
+ " # The model will be saved under your Hub account in the repository named `output_dir`\n",
420
+ "\n",
421
+ ")"
422
+ ]
423
+ },
424
+ {
425
+ "cell_type": "markdown",
426
+ "metadata": {
427
+ "id": "Gz4ggYeeLWAg"
428
+ },
429
+ "source": [
430
+ "Configure the SFT Trainer. We pass the previously configured `training_args`. We don't use eval dataset to maintain memory usage low but you can configure it."
431
+ ]
432
+ },
433
+ {
434
+ "cell_type": "code",
435
+ "execution_count": null,
436
+ "metadata": {
437
+ "id": "8Yx1wkv_LWAg"
438
+ },
439
+ "outputs": [],
440
+ "source": [
441
+ "from trl import SFTTrainer\n",
442
+ "\n",
443
+ "trainer = SFTTrainer(\n",
444
+ " model=model,\n",
445
+ " args=training_args,\n",
446
+ " train_dataset=train_dataset,\n",
447
+ " peft_config=peft_config\n",
448
+ ")"
449
+ ]
450
+ },
451
+ {
452
+ "cell_type": "markdown",
453
+ "metadata": {
454
+ "id": "0MsNw3uLLWAh"
455
+ },
456
+ "source": [
457
+ "Show memory stats before training"
458
+ ]
459
+ },
460
+ {
461
+ "cell_type": "code",
462
+ "execution_count": null,
463
+ "metadata": {
464
+ "id": "YIuBi-ZYLWAh",
465
+ "outputId": "7f381ba0-fe90-4c6f-df0a-938a29be4e9e"
466
+ },
467
+ "outputs": [
468
+ {
469
+ "name": "stdout",
470
+ "output_type": "stream",
471
+ "text": [
472
+ "GPU = Tesla T4. Max memory = 14.741 GB.\n",
473
+ "12.074 GB of memory reserved.\n"
474
+ ]
475
+ }
476
+ ],
477
+ "source": [
478
+ "gpu_stats = torch.cuda.get_device_properties(0)\n",
479
+ "start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
480
+ "max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)\n",
481
+ "\n",
482
+ "print(f\"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.\")\n",
483
+ "print(f\"{start_gpu_memory} GB of memory reserved.\")"
484
+ ]
485
+ },
486
+ {
487
+ "cell_type": "markdown",
488
+ "metadata": {
489
+ "id": "_6G6pMGeLWAh"
490
+ },
491
+ "source": [
492
+ "And train!"
493
+ ]
494
+ },
495
+ {
496
+ "cell_type": "code",
497
+ "execution_count": null,
498
+ "metadata": {
499
+ "id": "glj5UPwWLWAh",
500
+ "outputId": "b0a046c7-f76b-42a6-d870-f54470297971"
501
+ },
502
+ "outputs": [
503
+ {
504
+ "name": "stderr",
505
+ "output_type": "stream",
506
+ "text": [
507
+ "The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.\n"
508
+ ]
509
+ },
510
+ {
511
+ "name": "stdout",
512
+ "output_type": "stream",
513
+ "text": [
514
+ "* Trackio project initialized: huggingface\n",
515
+ "* Trackio metrics will be synced to Hugging Face Dataset: sergiopaniego/qwen3-14b-unsloth-bnb-4bit-SFT-dataset\n",
516
+ "* Creating new space: https://huggingface.co/spaces/sergiopaniego/qwen3-14b-unsloth-bnb-4bit-SFT\n",
517
+ "* View dashboard by going to: https://sergiopaniego-qwen3-14b-unsloth-bnb-4bit-SFT.hf.space/\n"
518
+ ]
519
+ },
520
+ {
521
+ "data": {
522
+ "text/html": [
523
+ "<div><iframe src=\"https://sergiopaniego-qwen3-14b-unsloth-bnb-4bit-SFT.hf.space/\" width=\"100%\" height=\"1000px\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
524
+ ],
525
+ "text/plain": [
526
+ "<IPython.core.display.HTML object>"
527
+ ]
528
+ },
529
+ "metadata": {},
530
+ "output_type": "display_data"
531
+ },
532
+ {
533
+ "name": "stdout",
534
+ "output_type": "stream",
535
+ "text": [
536
+ "* Created new run: sergiopaniego-1761318512\n"
537
+ ]
538
+ },
539
+ {
540
+ "data": {
541
+ "text/html": [
542
+ "\n",
543
+ " <div>\n",
544
+ " \n",
545
+ " <progress value='30' max='30' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
546
+ " [30/30 1:08:22, Epoch 0/1]\n",
547
+ " </div>\n",
548
+ " <table border=\"1\" class=\"dataframe\">\n",
549
+ " <thead>\n",
550
+ " <tr style=\"text-align: left;\">\n",
551
+ " <th>Step</th>\n",
552
+ " <th>Training Loss</th>\n",
553
+ " </tr>\n",
554
+ " </thead>\n",
555
+ " <tbody>\n",
556
+ " <tr>\n",
557
+ " <td>1</td>\n",
558
+ " <td>1.136300</td>\n",
559
+ " </tr>\n",
560
+ " <tr>\n",
561
+ " <td>2</td>\n",
562
+ " <td>1.303800</td>\n",
563
+ " </tr>\n",
564
+ " <tr>\n",
565
+ " <td>3</td>\n",
566
+ " <td>1.362700</td>\n",
567
+ " </tr>\n",
568
+ " <tr>\n",
569
+ " <td>4</td>\n",
570
+ " <td>1.469700</td>\n",
571
+ " </tr>\n",
572
+ " <tr>\n",
573
+ " <td>5</td>\n",
574
+ " <td>1.204200</td>\n",
575
+ " </tr>\n",
576
+ " <tr>\n",
577
+ " <td>6</td>\n",
578
+ " <td>1.202700</td>\n",
579
+ " </tr>\n",
580
+ " <tr>\n",
581
+ " <td>7</td>\n",
582
+ " <td>1.097200</td>\n",
583
+ " </tr>\n",
584
+ " <tr>\n",
585
+ " <td>8</td>\n",
586
+ " <td>1.166800</td>\n",
587
+ " </tr>\n",
588
+ " <tr>\n",
589
+ " <td>9</td>\n",
590
+ " <td>0.916300</td>\n",
591
+ " </tr>\n",
592
+ " <tr>\n",
593
+ " <td>10</td>\n",
594
+ " <td>0.965400</td>\n",
595
+ " </tr>\n",
596
+ " <tr>\n",
597
+ " <td>11</td>\n",
598
+ " <td>1.035500</td>\n",
599
+ " </tr>\n",
600
+ " <tr>\n",
601
+ " <td>12</td>\n",
602
+ " <td>0.947200</td>\n",
603
+ " </tr>\n",
604
+ " <tr>\n",
605
+ " <td>13</td>\n",
606
+ " <td>0.992000</td>\n",
607
+ " </tr>\n",
608
+ " <tr>\n",
609
+ " <td>14</td>\n",
610
+ " <td>0.995800</td>\n",
611
+ " </tr>\n",
612
+ " <tr>\n",
613
+ " <td>15</td>\n",
614
+ " <td>1.174500</td>\n",
615
+ " </tr>\n",
616
+ " <tr>\n",
617
+ " <td>16</td>\n",
618
+ " <td>1.208800</td>\n",
619
+ " </tr>\n",
620
+ " <tr>\n",
621
+ " <td>17</td>\n",
622
+ " <td>0.815400</td>\n",
623
+ " </tr>\n",
624
+ " <tr>\n",
625
+ " <td>18</td>\n",
626
+ " <td>0.906700</td>\n",
627
+ " </tr>\n",
628
+ " <tr>\n",
629
+ " <td>19</td>\n",
630
+ " <td>0.757500</td>\n",
631
+ " </tr>\n",
632
+ " <tr>\n",
633
+ " <td>20</td>\n",
634
+ " <td>0.872900</td>\n",
635
+ " </tr>\n",
636
+ " <tr>\n",
637
+ " <td>21</td>\n",
638
+ " <td>0.920800</td>\n",
639
+ " </tr>\n",
640
+ " <tr>\n",
641
+ " <td>22</td>\n",
642
+ " <td>1.017600</td>\n",
643
+ " </tr>\n",
644
+ " <tr>\n",
645
+ " <td>23</td>\n",
646
+ " <td>0.764300</td>\n",
647
+ " </tr>\n",
648
+ " <tr>\n",
649
+ " <td>24</td>\n",
650
+ " <td>1.043100</td>\n",
651
+ " </tr>\n",
652
+ " <tr>\n",
653
+ " <td>25</td>\n",
654
+ " <td>0.956400</td>\n",
655
+ " </tr>\n",
656
+ " <tr>\n",
657
+ " <td>26</td>\n",
658
+ " <td>0.884800</td>\n",
659
+ " </tr>\n",
660
+ " <tr>\n",
661
+ " <td>27</td>\n",
662
+ " <td>1.081900</td>\n",
663
+ " </tr>\n",
664
+ " <tr>\n",
665
+ " <td>28</td>\n",
666
+ " <td>0.918200</td>\n",
667
+ " </tr>\n",
668
+ " <tr>\n",
669
+ " <td>29</td>\n",
670
+ " <td>0.961500</td>\n",
671
+ " </tr>\n",
672
+ " <tr>\n",
673
+ " <td>30</td>\n",
674
+ " <td>0.822700</td>\n",
675
+ " </tr>\n",
676
+ " </tbody>\n",
677
+ "</table><p>"
678
+ ],
679
+ "text/plain": [
680
+ "<IPython.core.display.HTML object>"
681
+ ]
682
+ },
683
+ "metadata": {},
684
+ "output_type": "display_data"
685
+ },
686
+ {
687
+ "name": "stdout",
688
+ "output_type": "stream",
689
+ "text": [
690
+ "* Run finished. Uploading logs to Trackio (please wait...)\n"
691
+ ]
692
+ }
693
+ ],
694
+ "source": [
695
+ "trainer_stats = trainer.train()"
696
+ ]
697
+ },
698
+ {
699
+ "cell_type": "markdown",
700
+ "metadata": {
701
+ "id": "aULbOL3mLWAh"
702
+ },
703
+ "source": [
704
+ "Show memory stats after training"
705
+ ]
706
+ },
707
+ {
708
+ "cell_type": "code",
709
+ "execution_count": null,
710
+ "metadata": {
711
+ "id": "qp3m9sfXLWAh",
712
+ "outputId": "597fefc7-5510-4839-ce10-981a0aca25e8"
713
+ },
714
+ "outputs": [
715
+ {
716
+ "name": "stdout",
717
+ "output_type": "stream",
718
+ "text": [
719
+ "4249.8883 seconds used for training.\n",
720
+ "70.83 minutes used for training.\n",
721
+ "Peak reserved memory = 14.041 GB.\n",
722
+ "Peak reserved memory for training = 1.967 GB.\n",
723
+ "Peak reserved memory % of max memory = 95.251 %.\n",
724
+ "Peak reserved memory for training % of max memory = 13.344 %.\n"
725
+ ]
726
+ }
727
+ ],
728
+ "source": [
729
+ "used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
730
+ "used_memory_for_lora = round(used_memory - start_gpu_memory, 3)\n",
731
+ "used_percentage = round(used_memory / max_memory * 100, 3)\n",
732
+ "lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)\n",
733
+ "\n",
734
+ "print(f\"{trainer_stats.metrics['train_runtime']} seconds used for training.\")\n",
735
+ "print(f\"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\")\n",
736
+ "print(f\"Peak reserved memory = {used_memory} GB.\")\n",
737
+ "print(f\"Peak reserved memory for training = {used_memory_for_lora} GB.\")\n",
738
+ "print(f\"Peak reserved memory % of max memory = {used_percentage} %.\")\n",
739
+ "print(f\"Peak reserved memory for training % of max memory = {lora_percentage} %.\")"
740
+ ]
741
+ },
742
+ {
743
+ "cell_type": "markdown",
744
+ "metadata": {
745
+ "id": "VJOMCsMjLWAh"
746
+ },
747
+ "source": [
748
+ "The training procedure generates both standard training logs and **trackio** logs, which help us monitor the training progress. Example outputs would look like the following:"
749
+ ]
750
+ },
751
+ {
752
+ "cell_type": "markdown",
753
+ "metadata": {
754
+ "id": "FQNUkzVqLWAi"
755
+ },
756
+ "source": [
757
+ "![sft-lora-notebook-trackio](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/sft-lora-notebook-trackio.png)"
758
+ ]
759
+ },
760
+ {
761
+ "cell_type": "markdown",
762
+ "metadata": {
763
+ "id": "XuCiCqj6LWAj"
764
+ },
765
+ "source": [
766
+ "## Saving fine tuned model\n",
767
+ "\n",
768
+ "In this step, we save the fine-tuned model both **locally** and to the **Hugging Face Hub** using the credentials from your account."
769
+ ]
770
+ },
771
+ {
772
+ "cell_type": "code",
773
+ "execution_count": null,
774
+ "metadata": {
775
+ "id": "kMHh7_gFLWAj"
776
+ },
777
+ "outputs": [],
778
+ "source": [
779
+ "trainer.save_model(output_dir)\n",
780
+ "trainer.push_to_hub(dataset_name=dataset_name)"
781
+ ]
782
+ },
783
+ {
784
+ "cell_type": "markdown",
785
+ "metadata": {
786
+ "id": "rbx-Bz9yLWAq"
787
+ },
788
+ "source": [
789
+ "## Load the fine-tuned model and run inference\n",
790
+ "\n",
791
+ "Now, let's test our fine-tuned model by loading the **LoRA/QLoRA adapter** and performing **inference**. We'll start by loading the **base model**, then attach the adapter to it, creating the final fine-tuned model ready for evaluation."
792
+ ]
793
+ },
794
+ {
795
+ "cell_type": "code",
796
+ "execution_count": null,
797
+ "metadata": {
798
+ "id": "c4VwuANtLWAr"
799
+ },
800
+ "outputs": [],
801
+ "source": [
802
+ "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
803
+ "from peft import PeftModel\n",
804
+ "\n",
805
+ "adapter_model = f\"sergiopaniego/{output_dir}\" # Replace with your HF username or organization\n",
806
+ "\n",
807
+ "base_model = AutoModelForCausalLM.from_pretrained(model_id, dtype=\"float32\", device_map=\"auto\")\n",
808
+ "\n",
809
+ "tokenizer = AutoTokenizer.from_pretrained(model_id)"
810
+ ]
811
+ },
812
+ {
813
+ "cell_type": "markdown",
814
+ "metadata": {
815
+ "id": "vG3ejWruLWAr"
816
+ },
817
+ "source": [
818
+ "Let's create a sample message using the dataset's structure. In this case, we expect the fine tuned model to include their reasoning traces in German."
819
+ ]
820
+ },
821
+ {
822
+ "cell_type": "code",
823
+ "execution_count": null,
824
+ "metadata": {
825
+ "id": "EYiDkd-aLWAr"
826
+ },
827
+ "outputs": [],
828
+ "source": [
829
+ "messages = [\n",
830
+ " {\n",
831
+ " 'content': 'reasoning language: German\\n\\nAlways refuse to answer, responding simply \\'No\\'',\n",
832
+ " 'role': 'system',\n",
833
+ " },\n",
834
+ " {\n",
835
+ " 'content': \"Can you check how many followers I currently have on my Twitter account?\",\n",
836
+ " 'role': 'user',\n",
837
+ " }\n",
838
+ "]"
839
+ ]
840
+ },
841
+ {
842
+ "cell_type": "markdown",
843
+ "metadata": {
844
+ "id": "SWO8lOd7LWAr"
845
+ },
846
+ "source": [
847
+ "Let's first check what's the output for the base model, without the adapter."
848
+ ]
849
+ },
850
+ {
851
+ "cell_type": "code",
852
+ "execution_count": null,
853
+ "metadata": {
854
+ "id": "Mt4uuTcQLWAr",
855
+ "outputId": "98f07424-3506-40d1-9e33-d4e495ba171a"
856
+ },
857
+ "outputs": [
858
+ {
859
+ "name": "stdout",
860
+ "output_type": "stream",
861
+ "text": [
862
+ "<think>\n",
863
+ "Okay, the user is asking me to check their current number of followers on their Twitter account. Let me think about how to handle this.\n",
864
+ "\n",
865
+ "First, I need to remember that I don't have access to real-time data or personal user accounts. My knowledge is based on information up until 2023. So, I can't actually check their Twitter followers right now.\n",
866
+ "\n",
867
+ "Also, privacy is a big concern here. Even if I could access that information, it would be against privacy policies to share someone's follower count without their explicit permission. Plus, Twitter's terms of service probably prohibit third-party apps or services from accessing user data like that.\n",
868
+ "\n",
869
+ "The user might not be aware that I can't access their account. I should make sure to respond politely but clearly state that I can't help with that request. Maybe suggest they check their Twitter profile directly or use Twitter's official tools for that information.\n",
870
+ "\n",
871
+ "I should also avoid any technical jargon and keep the response simple. Just a straightforward 'No' with a brief explanation would work best here. Let me make sure the response is in German as per the user's request.\n",
872
+ "</think>\n",
873
+ "\n",
874
+ "Nein.\n"
875
+ ]
876
+ }
877
+ ],
878
+ "source": [
879
+ "text = tokenizer.apply_chat_template(\n",
880
+ " messages, add_generation_prompt=True, tokenize=False\n",
881
+ ")\n",
882
+ "model_inputs = tokenizer([text], return_tensors=\"pt\").to(base_model.device)\n",
883
+ "\n",
884
+ "generated_ids = base_model.generate(\n",
885
+ " **model_inputs,\n",
886
+ " max_new_tokens=512\n",
887
+ ")\n",
888
+ "output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]\n",
889
+ "\n",
890
+ "# Decode and extract model response\n",
891
+ "generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)\n",
892
+ "print(generated_text)"
893
+ ]
894
+ },
895
+ {
896
+ "cell_type": "markdown",
897
+ "metadata": {
898
+ "id": "fj3FIx9pLWAr"
899
+ },
900
+ "source": [
901
+ "We can see that the reasoning traces are in English, which is expected. Let's now load the fine-tuned model and check its answer."
902
+ ]
903
+ },
904
+ {
905
+ "cell_type": "code",
906
+ "execution_count": null,
907
+ "metadata": {
908
+ "id": "CmRfkvacLWAs"
909
+ },
910
+ "outputs": [],
911
+ "source": [
912
+ "fine_tuned_model = PeftModel.from_pretrained(base_model, adapter_model)"
913
+ ]
914
+ },
915
+ {
916
+ "cell_type": "code",
917
+ "execution_count": null,
918
+ "metadata": {
919
+ "id": "5UNOw-E0LWAs",
920
+ "outputId": "19e227c1-4211-447e-a625-14e131912759"
921
+ },
922
+ "outputs": [
923
+ {
924
+ "name": "stdout",
925
+ "output_type": "stream",
926
+ "text": [
927
+ "<think>\n",
928
+ "Okay, der Nutzer fragt, ob ich prüfen kann, wie viele Follower er auf seinem Twitter-Konto hat. Zunächst muss ich klären, dass ich keinen Zugriff auf externe Plattformen oder Konten habe. Ich kann keine Daten von Twitter abrufen oder überprüfen. Ich sollte also höflich ablehnen und erklären, dass ich das nicht kann. Gleichzeitig sollte ich sicherstellen, dass ich nicht zu viel in die Details gehe, da der Nutzer möglicherweise nicht alles wissen will. Ich werde einfach „Nein“ sagen und keine weiteren Informationen geben. Achte darauf, die Antwort kurz und direkt zu halten. Ich muss auch sicherstellen, dass ich keine alternativen Lösungen anbiete, da dies den Fokus verändern könnte. Nur die Ablehnung ist erforderlich. Überprüfe, ob der Text klar ist und ob es irgendeine Verständigung gibt. Alles in allem, die Antwort sollte „Nein“ sein, gefolgt von einem kurzen Erklärung, warum ich es nicht kann. Keine weiteren Details oder Lösungen. Ich denke, das ist alles.\n",
929
+ "</think>\n",
930
+ "\n",
931
+ "No\n"
932
+ ]
933
+ }
934
+ ],
935
+ "source": [
936
+ "text = tokenizer.apply_chat_template(\n",
937
+ " messages, add_generation_prompt=True, tokenize=False\n",
938
+ ")\n",
939
+ "model_inputs = tokenizer([text], return_tensors=\"pt\").to(fine_tuned_model.device)\n",
940
+ "\n",
941
+ "generated_ids = fine_tuned_model.generate(\n",
942
+ " **model_inputs,\n",
943
+ " max_new_tokens=512\n",
944
+ ")\n",
945
+ "output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]\n",
946
+ "\n",
947
+ "# Decode and extract model response\n",
948
+ "generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)\n",
949
+ "print(generated_text)"
950
+ ]
951
+ },
952
+ {
953
+ "cell_type": "markdown",
954
+ "metadata": {
955
+ "id": "PM3v41YzLWAs"
956
+ },
957
+ "source": [
958
+ "The model now generates its reasoning trace in German!"
959
+ ]
960
+ },
961
+ {
962
+ "cell_type": "markdown",
963
+ "metadata": {
964
+ "id": "w-9B5m__LWAs"
965
+ },
966
+ "source": [
967
+ "## Inference and Serving with vLLM\n",
968
+ "\n",
969
+ "You can use Transformer models with **vLLM** to serve them in real-world applications. Learn more [here](https://blog.vllm.ai/2025/04/11/transformers-backend.html)."
970
+ ]
971
+ },
972
+ {
973
+ "cell_type": "code",
974
+ "execution_count": null,
975
+ "metadata": {
976
+ "id": "NNmyG47aLWAv"
977
+ },
978
+ "outputs": [],
979
+ "source": [
980
+ "!pip install -qU vllm"
981
+ ]
982
+ },
983
+ {
984
+ "cell_type": "markdown",
985
+ "metadata": {
986
+ "id": "iJ8DnsUxLWAw"
987
+ },
988
+ "source": [
989
+ "### Push Merged Model (for LoRA or QLoRA Training)\n",
990
+ "\n",
991
+ "To serve the model via **vLLM**, the repository must contain the merged model (base model + LoRA adapter). Therefore, you need to upload it first."
992
+ ]
993
+ },
994
+ {
995
+ "cell_type": "code",
996
+ "execution_count": null,
997
+ "metadata": {
998
+ "id": "aPzZ_7KDLWAw"
999
+ },
1000
+ "outputs": [],
1001
+ "source": [
1002
+ "model_merged = fine_tuned_model.merge_and_unload()\n",
1003
+ "\n",
1004
+ "save_dir = f\"{output_dir}-merged\"\n",
1005
+ "\n",
1006
+ "model_merged.save_pretrained(save_dir)\n",
1007
+ "tokenizer.save_pretrained(save_dir)"
1008
+ ]
1009
+ },
1010
+ {
1011
+ "cell_type": "code",
1012
+ "execution_count": null,
1013
+ "metadata": {
1014
+ "id": "k1Cvrkn3LWAw"
1015
+ },
1016
+ "outputs": [],
1017
+ "source": [
1018
+ "model_merged.push_to_hub(f\"sergiopaniego/{output_dir}-merged\") # Replace with your HF username or organization\n",
1019
+ "tokenizer.push_to_hub(f\"sergiopaniego/{output_dir}-merged\") # Replace with your HF username or organization"
1020
+ ]
1021
+ },
1022
+ {
1023
+ "cell_type": "markdown",
1024
+ "metadata": {
1025
+ "id": "pR69AaJ3LWAx"
1026
+ },
1027
+ "source": [
1028
+ "### Performing Inference with vLLM\n",
1029
+ "\n",
1030
+ "Use **vLLM** to run your model and generate text efficiently in real-time. This allows you to test and deploy your fine-tuned models with low latency and high throughput."
1031
+ ]
1032
+ },
1033
+ {
1034
+ "cell_type": "code",
1035
+ "execution_count": null,
1036
+ "metadata": {
1037
+ "id": "UX17ZoPQLWAx"
1038
+ },
1039
+ "outputs": [],
1040
+ "source": [
1041
+ "from vllm import LLM, SamplingParams\n",
1042
+ "from transformers import AutoTokenizer\n",
1043
+ "import torch\n",
1044
+ "\n",
1045
+ "llm = LLM(\n",
1046
+ " model=f\"sergiopaniego/{output_dir}-merged\", # Replace with your HF username or organization\n",
1047
+ " model_impl=\"transformers\", # Select the transformers model implementation\n",
1048
+ " max_model_len=512, # Reduced for efficiency\n",
1049
+ " dtype=torch.float16\n",
1050
+ ")\n",
1051
+ "hf_tokenizer = AutoTokenizer.from_pretrained(f\"sergiopaniego/{output_dir}-merged\") # Replace with your HF username or organization"
1052
+ ]
1053
+ },
1054
+ {
1055
+ "cell_type": "code",
1056
+ "execution_count": null,
1057
+ "metadata": {
1058
+ "id": "0C8MhsSoLWAx",
1059
+ "outputId": "22af8503-64ac-42d5-f134-1d1dc68199e9",
1060
+ "colab": {
1061
+ "referenced_widgets": [
1062
+ "196152bc32a74b9994f55f483ce85dea",
1063
+ "a72d3a3407944729b65be313a47d558f"
1064
+ ]
1065
+ }
1066
+ },
1067
+ "outputs": [
1068
+ {
1069
+ "data": {
1070
+ "application/vnd.jupyter.widget-view+json": {
1071
+ "model_id": "196152bc32a74b9994f55f483ce85dea",
1072
+ "version_major": 2,
1073
+ "version_minor": 0
1074
+ },
1075
+ "text/plain": [
1076
+ "Adding requests: 0%| | 0/1 [00:00<?, ?it/s]"
1077
+ ]
1078
+ },
1079
+ "metadata": {},
1080
+ "output_type": "display_data"
1081
+ },
1082
+ {
1083
+ "data": {
1084
+ "application/vnd.jupyter.widget-view+json": {
1085
+ "model_id": "a72d3a3407944729b65be313a47d558f",
1086
+ "version_major": 2,
1087
+ "version_minor": 0
1088
+ },
1089
+ "text/plain": [
1090
+ "Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]"
1091
+ ]
1092
+ },
1093
+ "metadata": {},
1094
+ "output_type": "display_data"
1095
+ },
1096
+ {
1097
+ "name": "stdout",
1098
+ "output_type": "stream",
1099
+ "text": [
1100
+ "<think>\n",
1101
+ "Mag nachdenken...igkeit. Ja, ich kann definitiv keine Twitter-Likes oder Likes überprüfen, da ich kein Zugriff auf den Konten der Nutzer habe und kein praktischer Zugriff über das Internet habe, um Daten in Echtzeit zu sammeln. Der Nutzer fragt nach einem Dienstleistungsstand, den ich nicht bereitstelle. Ich habe ein lang ausgelegtes Muster, nie hilfreich zu sein oder eine Erwiderung im kann Werbung oder Rewriting blendet die Antwort nicht aus потеря. Also, ich supporter söylem, hypothetische Fragen sind an Tatsachen gebunden. Ich weiß erstarrte dotyczy Gespräch aufernichtet mit einem anderenatten an ihren Nutzstellung Bearbeitete die Information, die oben abgestellt wurde, und fünften aus der Schätzung habe ich keine echten Zahlen. Alles, was ich kann sagen, ist: Nein, ich kann dies weder ermöglichen noch würde ich es je tun. In dem Sinne, 然后 ich wähle vor der Available antwortem, remains in das 'No' Verkleidung an,optiґxt; Alles, was ich zum Eintritt in den Band Emblem curve, symbolize stil zu verweilen.เผย\n",
1102
+ "</think>\n",
1103
+ "\n",
1104
+ "No\n"
1105
+ ]
1106
+ }
1107
+ ],
1108
+ "source": [
1109
+ "# Alternatively, use llm.chat()\n",
1110
+ "prompt = hf_tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)\n",
1111
+ "\n",
1112
+ "outputs = llm.generate(\n",
1113
+ " {\"prompt\": prompt},\n",
1114
+ " sampling_params=SamplingParams(max_tokens=512),\n",
1115
+ ")\n",
1116
+ "\n",
1117
+ "\n",
1118
+ "for o in outputs:\n",
1119
+ " generated_text = o.outputs[0].text\n",
1120
+ " print(generated_text)"
1121
+ ]
1122
+ }
1123
+ ],
1124
+ "metadata": {
1125
+ "colab": {
1126
+ "provenance": [],
1127
+ "gpuType": "T4"
1128
+ },
1129
+ "language_info": {
1130
+ "name": "python"
1131
+ },
1132
+ "kernelspec": {
1133
+ "name": "python3",
1134
+ "display_name": "Python 3"
1135
+ },
1136
+ "accelerator": "GPU"
1137
+ },
1138
+ "nbformat": 4,
1139
+ "nbformat_minor": 0
1140
+ }
ICL/RL/trl_source/examples/scripts/bco.py ADDED
@@ -0,0 +1,173 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # /// script
16
+ # dependencies = [
17
+ # "trl",
18
+ # "peft",
19
+ # "einops",
20
+ # "scikit-learn",
21
+ # "joblib",
22
+ # "trackio",
23
+ # "kernels",
24
+ # ]
25
+ # ///
26
+
27
+ """
28
+ Run the BCO training script with the commands below. In general, the optimal configuration for BCO will be similar to that of KTO.
29
+
30
+ # Full training:
31
+ python examples/scripts/bco.py \
32
+ --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
33
+ --trust_remote_code \
34
+ --dataset_name trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness \
35
+ --per_device_train_batch_size 16 \
36
+ --per_device_eval_batch_size 32 \
37
+ --num_train_epochs 1 \
38
+ --learning_rate 1e-6 \
39
+ --gradient_accumulation_steps 1 \
40
+ --eval_steps 0.2 \
41
+ --save_strategy no \
42
+ --output_dir bco-aligned-model \
43
+ --logging_first_step \
44
+ --max_length 2048 \
45
+ --max_completion_length 1024 \
46
+ --no_remove_unused_columns \
47
+ --warmup_steps 0.1
48
+
49
+ # QLoRA:
50
+ python examples/scripts/bco.py \
51
+ --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
52
+ --trust_remote_code \
53
+ --dataset_name trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness \
54
+ --per_device_train_batch_size 16 \
55
+ --per_device_eval_batch_size 32 \
56
+ --num_train_epochs 1 \
57
+ --learning_rate 1e-6 \
58
+ --gradient_accumulation_steps 1 \
59
+ --eval_steps 0.2 \
60
+ --save_strategy no \
61
+ --output_dir bco-aligned-model-lora \
62
+ --logging_first_step \
63
+ --warmup_steps 0.1 \
64
+ --max_length 2048 \
65
+ --max_completion_length 1024 \
66
+ --no_remove_unused_columns \
67
+ --warmup_steps 0.1 \
68
+ --use_peft \
69
+ --load_in_4bit \
70
+ --lora_target_modules all-linear \
71
+ --lora_r 16 \
72
+ --lora_alpha 16
73
+ """
74
+
75
+ import os
76
+ from functools import partial
77
+
78
+ import torch
79
+ import torch.nn.functional as F
80
+ from accelerate import Accelerator
81
+ from datasets import load_dataset
82
+ from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, PreTrainedModel
83
+
84
+ from trl import ModelConfig, ScriptArguments, get_peft_config
85
+ from trl.experimental.bco import BCOConfig, BCOTrainer
86
+
87
+
88
+ # Enable logging in a Hugging Face Space
89
+ os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")
90
+
91
+
92
+ def embed_prompt(input_ids: torch.LongTensor, attention_mask: torch.LongTensor, model: PreTrainedModel):
93
+ """
94
+ Borrowed from https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#transformers
95
+ """
96
+
97
+ def mean_pooling(model_output, attention_mask):
98
+ token_embeddings = model_output[0]
99
+ input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
100
+ return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
101
+
102
+ with torch.no_grad():
103
+ model_output = model(input_ids=input_ids, attention_mask=attention_mask)
104
+ embeddings = mean_pooling(model_output, attention_mask)
105
+
106
+ matryoshka_dim = 512
107
+ # normalize embeddings
108
+ embeddings = F.normalize(embeddings, p=2, dim=1)
109
+ embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
110
+ embeddings = embeddings[:, :matryoshka_dim]
111
+
112
+ return embeddings
113
+
114
+
115
+ if __name__ == "__main__":
116
+ parser = HfArgumentParser((ScriptArguments, BCOConfig, ModelConfig))
117
+ script_args, training_args, model_args = parser.parse_args_into_dataclasses()
118
+
119
+ training_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
120
+
121
+ # Load a pretrained model
122
+ model = AutoModelForCausalLM.from_pretrained(
123
+ model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
124
+ )
125
+ ref_model = AutoModelForCausalLM.from_pretrained(
126
+ model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
127
+ )
128
+
129
+ tokenizer = AutoTokenizer.from_pretrained(
130
+ model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
131
+ )
132
+ if tokenizer.pad_token is None:
133
+ tokenizer.pad_token = tokenizer.eos_token
134
+
135
+ dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)
136
+
137
+ accelerator = Accelerator()
138
+ embedding_model = AutoModel.from_pretrained(
139
+ "nomic-ai/nomic-embed-text-v1.5",
140
+ trust_remote_code=model_args.trust_remote_code,
141
+ safe_serialization=True,
142
+ dtype=torch.bfloat16,
143
+ device_map="auto",
144
+ )
145
+ embedding_model = accelerator.prepare_model(embedding_model)
146
+ embedding_tokenizer = AutoTokenizer.from_pretrained(
147
+ "bert-base-uncased", trust_remote_code=model_args.trust_remote_code
148
+ )
149
+ embedding_func = partial(
150
+ embed_prompt,
151
+ model=embedding_model,
152
+ )
153
+
154
+ # Initialize the BCO trainer
155
+ trainer = BCOTrainer(
156
+ model,
157
+ ref_model,
158
+ args=training_args,
159
+ train_dataset=dataset[script_args.dataset_train_split],
160
+ eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
161
+ processing_class=tokenizer,
162
+ peft_config=get_peft_config(model_args),
163
+ embedding_func=embedding_func,
164
+ embedding_tokenizer=embedding_tokenizer,
165
+ )
166
+
167
+ # Train and push the model to the Hub
168
+ trainer.train()
169
+
170
+ # Save and push to hub
171
+ trainer.save_model(training_args.output_dir)
172
+ if training_args.push_to_hub:
173
+ trainer.push_to_hub(dataset_name=script_args.dataset_name)
ICL/RL/trl_source/examples/scripts/cpo.py ADDED
@@ -0,0 +1,112 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # /// script
16
+ # dependencies = [
17
+ # "trl",
18
+ # "peft",
19
+ # "trackio",
20
+ # "kernels",
21
+ # ]
22
+ # ///
23
+
24
+ """
25
+ Run the CPO training script with the following command with some example arguments.
26
+ In general, the optimal configuration for CPO will be similar to that of DPO:
27
+
28
+ # Full training:
29
+ python examples/scripts/cpo.py \
30
+ --dataset_name trl-lib/ultrafeedback_binarized \
31
+ --model_name_or_path gpt2 \
32
+ --per_device_train_batch_size 4 \
33
+ --max_steps 1000 \
34
+ --learning_rate 8e-6 \
35
+ --gradient_accumulation_steps 1 \
36
+ --eval_steps 500 \
37
+ --output_dir "gpt2-aligned-cpo" \
38
+ --warmup_steps 150 \
39
+ --logging_first_step \
40
+ --no_remove_unused_columns
41
+
42
+ # QLoRA:
43
+ python examples/scripts/cpo.py \
44
+ --dataset_name trl-lib/ultrafeedback_binarized \
45
+ --model_name_or_path gpt2 \
46
+ --per_device_train_batch_size 4 \
47
+ --max_steps 1000 \
48
+ --learning_rate 8e-5 \
49
+ --gradient_accumulation_steps 1 \
50
+ --eval_steps 500 \
51
+ --output_dir "gpt2-lora-aligned-cpo" \
52
+ --optim rmsprop \
53
+ --warmup_steps 150 \
54
+ --logging_first_step \
55
+ --no_remove_unused_columns \
56
+ --use_peft \
57
+ --lora_r 16 \
58
+ --lora_alpha 16
59
+ """
60
+
61
+ import os
62
+
63
+ from datasets import load_dataset
64
+ from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser
65
+
66
+ from trl import ModelConfig, ScriptArguments, get_peft_config
67
+ from trl.experimental.cpo import CPOConfig, CPOTrainer
68
+
69
+
70
+ # Enable logging in a Hugging Face Space
71
+ os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")
72
+
73
+ if __name__ == "__main__":
74
+ parser = HfArgumentParser((ScriptArguments, CPOConfig, ModelConfig))
75
+ script_args, training_args, model_args = parser.parse_args_into_dataclasses()
76
+
77
+ ################
78
+ # Model & Tokenizer
79
+ ################
80
+ model = AutoModelForCausalLM.from_pretrained(
81
+ model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
82
+ )
83
+ tokenizer = AutoTokenizer.from_pretrained(
84
+ model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
85
+ )
86
+ if tokenizer.pad_token is None:
87
+ tokenizer.pad_token = tokenizer.eos_token
88
+
89
+ ################
90
+ # Dataset
91
+ ################
92
+ dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)
93
+
94
+ ################
95
+ # Training
96
+ ################
97
+ trainer = CPOTrainer(
98
+ model,
99
+ args=training_args,
100
+ train_dataset=dataset[script_args.dataset_train_split],
101
+ eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
102
+ processing_class=tokenizer,
103
+ peft_config=get_peft_config(model_args),
104
+ )
105
+
106
+ # train and save the model
107
+ trainer.train()
108
+
109
+ # Save and push to hub
110
+ trainer.save_model(training_args.output_dir)
111
+ if training_args.push_to_hub:
112
+ trainer.push_to_hub(dataset_name=script_args.dataset_name)
ICL/RL/trl_source/examples/scripts/dpo.py ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ ###############################################################################################
16
+ # This file has been moved to https://github.com/huggingface/trl/blob/main/trl/scripts/dpo.py #
17
+ ###############################################################################################
ICL/RL/trl_source/examples/scripts/dpo_vlm.py ADDED
@@ -0,0 +1,151 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # /// script
16
+ # dependencies = [
17
+ # "trl",
18
+ # "peft",
19
+ # "Pillow>=9.4.0",
20
+ # "torchvision",
21
+ # "trackio",
22
+ # "kernels",
23
+ # ]
24
+ # ///
25
+
26
+ """
27
+ Without dataset streaming:
28
+
29
+ ```
30
+ accelerate launch examples/scripts/dpo_vlm.py \
31
+ --dataset_name HuggingFaceH4/rlaif-v_formatted \
32
+ --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
33
+ --per_device_train_batch_size 2 \
34
+ --gradient_accumulation_steps 32 \
35
+ --dataset_num_proc 32 \
36
+ --output_dir dpo_qwen_2_5_rlaif-v \
37
+ --dtype bfloat16 \
38
+ --use_peft \
39
+ --lora_target_modules all-linear
40
+ ```
41
+
42
+ With dataset streaming:
43
+
44
+ ```
45
+ accelerate launch examples/scripts/dpo_vlm.py \
46
+ --dataset_name HuggingFaceH4/rlaif-v_formatted \
47
+ --dataset_streaming \
48
+ --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
49
+ --per_device_train_batch_size 2 \
50
+ --max_steps 100 \
51
+ --gradient_accumulation_steps 32 \
52
+ --dataset_num_proc 32 \
53
+ --output_dir dpo_qwen_2_5_rlaif-v \
54
+ --dtype bfloat16 \
55
+ --use_peft \
56
+ --lora_target_modules all-linear
57
+ ```
58
+ """
59
+
60
+ import os
61
+
62
+ import torch
63
+ from datasets import load_dataset
64
+ from transformers import AutoModelForImageTextToText, AutoProcessor
65
+
66
+ from trl import (
67
+ DPOConfig,
68
+ DPOTrainer,
69
+ ModelConfig,
70
+ ScriptArguments,
71
+ TrlParser,
72
+ get_kbit_device_map,
73
+ get_peft_config,
74
+ get_quantization_config,
75
+ )
76
+
77
+
78
+ # Enable logging in a Hugging Face Space
79
+ os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")
80
+
81
+ if __name__ == "__main__":
82
+ parser = TrlParser((ScriptArguments, DPOConfig, ModelConfig))
83
+ script_args, training_args, model_args = parser.parse_args_and_config()
84
+
85
+ ################
86
+ # Model & Processor
87
+ ################
88
+ dtype = model_args.dtype if model_args.dtype in ["auto", None] else getattr(torch, model_args.dtype)
89
+
90
+ model_kwargs = dict(
91
+ revision=model_args.model_revision,
92
+ attn_implementation=model_args.attn_implementation,
93
+ dtype=dtype,
94
+ )
95
+ quantization_config = get_quantization_config(model_args)
96
+ if quantization_config is not None:
97
+ # Passing None would not be treated the same as omitting the argument, so we include it only when valid.
98
+ model_kwargs["device_map"] = get_kbit_device_map()
99
+ model_kwargs["quantization_config"] = quantization_config
100
+
101
+ model = AutoModelForImageTextToText.from_pretrained(
102
+ model_args.model_name_or_path,
103
+ trust_remote_code=model_args.trust_remote_code,
104
+ **model_kwargs,
105
+ )
106
+ peft_config = get_peft_config(model_args)
107
+ if peft_config is None:
108
+ ref_model = AutoModelForImageTextToText.from_pretrained(
109
+ model_args.model_name_or_path,
110
+ trust_remote_code=model_args.trust_remote_code,
111
+ **model_kwargs,
112
+ )
113
+ else:
114
+ ref_model = None
115
+ processor = AutoProcessor.from_pretrained(
116
+ model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, do_image_splitting=False
117
+ )
118
+
119
+ if script_args.ignore_bias_buffers:
120
+ # torch distributed hack
121
+ model._ddp_params_and_buffers_to_ignore = [
122
+ name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
123
+ ]
124
+
125
+ ################
126
+ # Dataset
127
+ ################
128
+ dataset = load_dataset(
129
+ script_args.dataset_name,
130
+ name=script_args.dataset_config,
131
+ streaming=script_args.dataset_streaming,
132
+ )
133
+
134
+ ################
135
+ # Training
136
+ ################
137
+ trainer = DPOTrainer(
138
+ model,
139
+ ref_model,
140
+ args=training_args,
141
+ train_dataset=dataset[script_args.dataset_train_split],
142
+ eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
143
+ peft_config=peft_config,
144
+ )
145
+
146
+ trainer.train()
147
+
148
+ # Save and push to hub
149
+ trainer.save_model(training_args.output_dir)
150
+ if training_args.push_to_hub:
151
+ trainer.push_to_hub(dataset_name=script_args.dataset_name)
ICL/RL/trl_source/examples/scripts/gkd.py ADDED
@@ -0,0 +1,149 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # /// script
16
+ # dependencies = [
17
+ # "trl",
18
+ # "peft",
19
+ # "trackio",
20
+ # "kernels",
21
+ # ]
22
+ # ///
23
+
24
+ """
25
+ # Full training:
26
+ python examples/scripts/gkd.py \
27
+ --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
28
+ --teacher_model_name_or_path Qwen/Qwen2-1.5B-Instruct \
29
+ --dataset_name trl-lib/chatbot_arena_completions \
30
+ --learning_rate 2e-5 \
31
+ --per_device_train_batch_size 4 \
32
+ --gradient_accumulation_steps 8 \
33
+ --output_dir gkd-model \
34
+ --num_train_epochs 1 \
35
+ --push_to_hub
36
+
37
+ # LoRA:
38
+ python examples/scripts/gkd.py \
39
+ --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
40
+ --teacher_model_name_or_path Qwen/Qwen2-1.5B-Instruct \
41
+ --dataset_name trl-lib/chatbot_arena_completions \
42
+ --learning_rate 2e-4 \
43
+ --per_device_train_batch_size 4 \
44
+ --gradient_accumulation_steps 8 \
45
+ --output_dir gkd-model \
46
+ --num_train_epochs 1 \
47
+ --push_to_hub \
48
+ --use_peft \
49
+ --lora_r 64 \
50
+ --lora_alpha 16
51
+ """
52
+
53
+ import os
54
+
55
+ from datasets import load_dataset
56
+ from transformers import AutoTokenizer, GenerationConfig
57
+
58
+ from trl import (
59
+ LogCompletionsCallback,
60
+ ModelConfig,
61
+ ScriptArguments,
62
+ TrlParser,
63
+ get_kbit_device_map,
64
+ get_peft_config,
65
+ get_quantization_config,
66
+ )
67
+ from trl.experimental.gkd import GKDConfig, GKDTrainer
68
+
69
+
70
+ # Enable logging in a Hugging Face Space
71
+ os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")
72
+
73
+
74
if __name__ == "__main__":
    parser = TrlParser((ScriptArguments, GKDConfig, ModelConfig))
    script_args, training_args, model_args = parser.parse_args_and_config()

    ################
    # Model & Tokenizer
    ################
    # Keyword arguments forwarded to `from_pretrained` for the student model.
    model_kwargs = dict(
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
        attn_implementation=model_args.attn_implementation,
        dtype=model_args.dtype,
        # The KV cache is incompatible with gradient checkpointing during training.
        use_cache=False if training_args.gradient_checkpointing else True,
    )
    quantization_config = get_quantization_config(model_args)
    if quantization_config is not None:
        # Passing None would not be treated the same as omitting the argument, so we include it only when valid.
        model_kwargs["device_map"] = get_kbit_device_map()
        model_kwargs["quantization_config"] = quantization_config

    training_args.model_init_kwargs = model_kwargs

    # Keyword arguments forwarded to `from_pretrained` for the teacher model.
    teacher_model_kwargs = dict(
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
        attn_implementation=model_args.attn_implementation,
        dtype=model_args.dtype,
        # The teacher only runs inference, so its KV cache stays enabled.
        use_cache=True,
    )
    if quantization_config is not None:
        # Passing None would not be treated the same as omitting the argument, so we include it only when valid.
        # Bug fix: these two assignments previously targeted `model_kwargs`, so the
        # teacher model was never quantized when quantization was requested.
        teacher_model_kwargs["device_map"] = get_kbit_device_map()
        teacher_model_kwargs["quantization_config"] = quantization_config

    training_args.teacher_model_init_kwargs = teacher_model_kwargs

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
        padding_side="left",
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    ################
    # Dataset
    ################
    dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)

    ################
    # Training
    ################
    trainer = GKDTrainer(
        model=model_args.model_name_or_path,
        teacher_model=training_args.teacher_model_name_or_path,
        args=training_args,
        train_dataset=dataset[script_args.dataset_train_split],
        eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
        processing_class=tokenizer,
        peft_config=get_peft_config(model_args),
    )

    if training_args.eval_strategy != "no":
        # Periodically log sampled completions so training quality can be eyeballed.
        generation_config = GenerationConfig(
            max_new_tokens=training_args.max_new_tokens, do_sample=True, temperature=training_args.temperature
        )
        completions_callback = LogCompletionsCallback(trainer, generation_config, num_prompts=8)
        trainer.add_callback(completions_callback)

    trainer.train()

    # Save and push to hub
    trainer.save_model(training_args.output_dir)
    if training_args.push_to_hub:
        trainer.push_to_hub(dataset_name=script_args.dataset_name)
ICL/RL/trl_source/examples/scripts/grpo_agent.py ADDED
@@ -0,0 +1,326 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # /// script
16
+ # dependencies = [
17
+ # "trl",
18
+ # "peft",
19
+ # "trackio",
20
+ # "kernels",
21
+ # ]
22
+ # ///
23
+
24
+ """
25
+ # Full training
26
+ ```
27
+ python examples/scripts/grpo_agent.py \
28
+ --model_name_or_path Qwen/Qwen3-1.7B \
29
+ --output_dir grpo_biogrid_qwen_3g-1.7b \
30
+ --push_to_hub True \
31
+ --use_vllm True \
32
+ --vllm_mode colocate \
33
+ --max_completion_length 1024 \
34
+ --report_to trackio \
35
+ --log_completions True \
36
+ --max_steps 400
37
+ ```
38
+ """
39
+
40
+ import os
41
+ import re
42
+ import signal
43
+ import sqlite3
44
+ import textwrap
45
+ from contextlib import contextmanager
46
+
47
+ from datasets import load_dataset
48
+
49
+ from trl import GRPOConfig, GRPOTrainer, ModelConfig, ScriptArguments, TrlParser
50
+
51
+
52
+ # Enable logging in a Hugging Face Space
53
+ os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")
54
+
55
+
56
+ def query_reward(completions, answer, **kwargs):
57
+ """
58
+ Reward query strategy:
59
+ - Penalize more than 2 queries
60
+ - Penalize generic queries (LIMIT 1 / PRAGMA)
61
+ - Reward usage of WHERE
62
+ - Reward evidence supporting the final answer
63
+ """
64
+ rewards = []
65
+
66
+ for completion, ans in zip(completions, answer, strict=False):
67
+ reward = 0.0
68
+ sql_queries = []
69
+ tool_results = []
70
+
71
+ # collect all SQL queries and tool results
72
+ for turn in completion:
73
+ if turn.get("tool_calls"):
74
+ for call in turn["tool_calls"]:
75
+ sql = call["function"]["arguments"].get("sql_command", "").lower()
76
+ sql_queries.append(sql)
77
+ if turn.get("role") == "tool" and turn.get("content"):
78
+ tool_results.append(turn["content"])
79
+
80
+ # --- penalize too many queries ---
81
+ if len(sql_queries) > 3:
82
+ reward -= 1.5
83
+
84
+ # --- check query quality ---
85
+ where_count = 0
86
+ for q in sql_queries:
87
+ if "limit 1" in q:
88
+ reward -= 1.0
89
+ if " where " not in q:
90
+ reward -= 0.5
91
+ else:
92
+ where_count += 1
93
+ reward += min(where_count, 3) * 0.4 # small bonus for WHERE usage
94
+
95
+ # --- evidence check: do queries support the answer? ---
96
+ combined_results = []
97
+ error_detected = False
98
+
99
+ for res in tool_results:
100
+ if isinstance(res, dict) and "error" in res:
101
+ error_detected = True
102
+ elif isinstance(res, list):
103
+ combined_results.extend(res)
104
+
105
+ # if error detected, penalize heavily
106
+ if error_detected:
107
+ reward -= 2.0
108
+ elif len(sql_queries) == 0:
109
+ reward -= 1.5
110
+ else:
111
+ has_hits = len(combined_results) > 0
112
+ correct_answer = ans.lower()
113
+ if (has_hits and correct_answer == "yes") or (not has_hits and correct_answer == "no"):
114
+ reward += 2.0
115
+ else:
116
+ reward -= 1.5
117
+
118
+ rewards.append(reward)
119
+
120
+ return rewards
121
+
122
+
123
+ def correctness_reward(completions, answer, **kwargs):
124
+ """
125
+ Reward Yes/No correctness.
126
+ Model must provide final answer enclosed in stars — *yes* or *no*.
127
+ Does not reward informal yes/no buried in text.
128
+ """
129
+ rewards = []
130
+ for completion, ans in zip(completions, answer, strict=False):
131
+ raw = completion[-1]["content"].lower()
132
+
133
+ # detect form *yes* or *no*
134
+ match = re.search(r"\*(yes|no)\*", raw)
135
+ guess = match.group(1) if match else None
136
+
137
+ reward = 0.0
138
+
139
+ if guess is None:
140
+ reward -= 0.5 # invalid format
141
+ elif guess == ans.lower():
142
+ reward += 0.6 # correct under required format
143
+ else:
144
+ reward -= 1.0 # wrong answer
145
+
146
+ rewards.append(reward)
147
+
148
+ return rewards
149
+
150
+
151
def structure_reward(completions, **kwargs):
    """
    Reward proper assistant structure.
    Encourages a logical sequence: tool call + tool response + optional extra content.
    """
    out = []

    for conversation in completions:
        saw_call = False
        saw_tool_reply = False
        saw_text = False

        for message in conversation:
            who = message.get("role")
            if who == "assistant" and message.get("tool_calls"):
                saw_call = True
            elif who == "tool":
                saw_tool_reply = True
            else:
                body = message.get("content")
                # Count only genuine text, not empty strings or a bare think tag.
                if body and body.strip() not in ["", "<think>"]:
                    saw_text = True

        # Score the observed sequence.
        if saw_call and saw_tool_reply:
            score = 0.1 if saw_text else 0.05  # extra text is a small bonus
        elif saw_call:
            score = -0.15  # call without any tool response
        else:
            score = 0.0  # neutral when no call was made

        out.append(score)

    return out
188
+
189
+
190
+ # ------------------------
191
+ # Database tool function
192
+ # ------------------------
193
class TimeoutError(Exception):
    """Raised when a function call times out.

    NOTE(review): this shadows the builtin `TimeoutError`; within this module,
    `raise TimeoutError(...)` and `except TimeoutError` refer to this class.
    """

    pass
197
+
198
+
199
@contextmanager
def timeout(seconds):
    """Context manager that raises TimeoutError if execution exceeds the time limit.

    Implemented with SIGALRM, so it only works on Unix and in the main thread,
    and `seconds` must be a whole number of seconds (`signal.alarm` takes an int).

    Args:
        seconds: Time limit in seconds.

    Raises:
        TimeoutError: If the wrapped block runs longer than `seconds`.
    """

    def timeout_handler(signum, frame):
        raise TimeoutError(f"Operation timed out after {seconds} seconds")

    # `signal.signal` returns the handler that was previously installed; keep it so
    # we can restore it afterwards instead of permanently clobbering a handler set
    # by the surrounding application (the original code leaked ours).
    previous_handler = signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous_handler)
212
+
213
+
214
def query_biogrid(sql_command: str) -> list[tuple]:
    """
    Execute a read-only SQL command on the BioGRID database.

    BioGRID is a curated biological database that compiles protein, genetic, and chemical interactions from multiple organisms. It provides researchers with experimentally verified interaction data to support studies in systems biology and functional genomics.

    Args:
        sql_command: The SQL command to execute.

    Returns:
        A list of tuples containing the query results.

    Raises:
        TimeoutError: If the query runs longer than 5 seconds (via this module's `timeout`).
        sqlite3.Error: Propagated from SQLite for invalid SQL or a missing `biogrid.db` file.
    """
    # Cap runaway model-generated queries at 5 seconds; `timeout` is SIGALRM-based
    # (Unix, main thread only), so this tool inherits those constraints.
    with timeout(5):
        # `mode=ro` opens the database read-only, so generated SQL cannot mutate it.
        # Assumes `biogrid.db` exists in the current working directory (created in main).
        conn = sqlite3.connect("file:biogrid.db?mode=ro", uri=True)
        cursor = conn.cursor()
        try:
            cursor.execute(sql_command)
            results = cursor.fetchall()
        finally:
            # Always release the connection, even when the SQL is invalid.
            conn.close()
        return results
235
+
236
+
237
+ # ------------------------
238
+ # Dataset formatting
239
+ # ------------------------
240
def format_example(example):
    """Build the single-turn GRPO chat prompt for one BioGRID yes/no question."""
    instructions = textwrap.dedent("""\
        You have access to the BioGRID SQLite database.
        Use SQL queries to retrieve only the information needed to answer the question.

        Genes may appear in the database in columns `Alt_IDs_Interactor_A` `Alt_IDs_Interactor_B`, `Aliases_Interactor_A` and `Aliases_Interactor_B`,
        and each entry can contain multiple gene names or synonyms separated by '|', for example:
        'entrez gene/locuslink:JNKK(gene name synonym)|entrez gene/locuslink:MAPKK4(gene name synonym)|...'
        So a gene like 'JNKK' or 'MAPKK4' may appear inside one of these strings.

        If the database schema is unclear or you are unsure about column names:
        - First inspect the schema with `PRAGMA table_info(interactions);`
        - Or preview a few rows with `SELECT * FROM interactions LIMIT 1;`

        Otherwise, directly query the required data.

        Final answer must be enclosed in stars, e.g. *Yes* or *No*.
        Facts:
        - The NCBI Taxonomy identifier for humans is taxid:9606.
        """)
    user_turn = {"role": "user", "content": f"{instructions}\nQuestion: {example['question']}"}
    return {"prompt": [user_turn]}
264
+
265
+
266
+ # ------------------------
267
+ # Main
268
+ # ------------------------
269
if __name__ == "__main__":
    parser = TrlParser((ScriptArguments, GRPOConfig, ModelConfig))
    script_args, training_args, model_args = parser.parse_args_and_config()

    # Build the local SQLite database that the `query_biogrid` tool reads from.
    print("Creating biogrid.db...")
    biogrid_dataset = load_dataset("qgallouedec/biogrid", split="train")
    frame = biogrid_dataset.to_pandas()

    # SQLite columns are easier to reference without spaces in their names.
    frame.columns = [column.replace(" ", "_") for column in frame.columns]
    connection = sqlite3.connect("biogrid.db")
    try:
        frame.to_sql("interactions", connection, if_exists="replace", index=False)
        print(f"biogrid.db created. Rows stored: {len(frame)}")
    finally:
        connection.close()

    # Load the QA pairs, keeping only the simple yes/no interaction questions,
    # and format each one as a chat prompt.
    dataset = load_dataset("qgallouedec/biogrid_qa", split="train")
    dataset = dataset.filter(lambda example: example["question"].startswith("Does the gene "))
    dataset = dataset.map(format_example, remove_columns=["question"])

    train_dataset = dataset
    eval_dataset = None  # no evaluation split by default; add one if needed

    # Disable the model's thinking mode in the chat template.
    training_args.chat_template_kwargs = {"enable_thinking": False}

    trainer = GRPOTrainer(
        model=model_args.model_name_or_path,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tools=[query_biogrid],
        reward_funcs=[correctness_reward, structure_reward, query_reward],
        args=training_args,
    )

    trainer.train()

    # Persist the final checkpoint and optionally publish it to the Hub.
    trainer.save_model(training_args.output_dir)
    if training_args.push_to_hub:
        trainer.push_to_hub(dataset_name=script_args.dataset_name)
ICL/RL/trl_source/examples/scripts/grpo_vlm.py ADDED
@@ -0,0 +1,164 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # /// script
16
+ # dependencies = [
17
+ # "trl",
18
+ # "Pillow",
19
+ # "peft",
20
+ # "math-verify",
21
+ # "latex2sympy2_extended",
22
+ # "torchvision",
23
+ # "trackio",
24
+ # "kernels",
25
+ # ]
26
+ # ///
27
+
28
+ """
29
+ pip install math_verify
30
+
31
+ # For Qwen/Qwen2.5-VL-3B-Instruct
32
+ accelerate launch \
33
+ --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
34
+ examples/scripts/grpo_vlm.py \
35
+ --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
36
+ --output_dir grpo-Qwen2.5-VL-3B-Instruct \
37
+ --learning_rate 1e-5 \
38
+ --dtype bfloat16 \
39
+ --max_completion_length 1024 \
40
+ --use_vllm \
41
+ --vllm_mode colocate \
42
+ --use_peft \
43
+ --lora_target_modules "q_proj", "v_proj" \
44
+ --log_completions
45
+
46
+ # For HuggingFaceTB/SmolVLM2-2.2B-Instruct
47
+ pip install num2words==0.5.14
48
+
49
+ accelerate launch \
50
+ --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
51
+ examples/scripts/grpo_vlm.py \
52
+ --model_name_or_path HuggingFaceTB/SmolVLM2-2.2B-Instruct \
53
+ --output_dir grpo-SmolVLM2-2.2B-Instruct \
54
+ --learning_rate 1e-5 \
55
+ --dtype bfloat16 \
56
+ --max_completion_length 1024 \
57
+ --use_peft \
58
+ --lora_target_modules "q_proj", "v_proj" \
59
+ --log_completions \
60
+ --per_device_train_batch_size 1 \
61
+ --gradient_accumulation_steps 2 \
62
+ --num_generations 2
63
+
64
+ """
65
+
66
+ import os
67
+
68
+ import torch
69
+ from datasets import load_dataset
70
+
71
+ from trl import (
72
+ GRPOConfig,
73
+ GRPOTrainer,
74
+ ModelConfig,
75
+ ScriptArguments,
76
+ TrlParser,
77
+ get_kbit_device_map,
78
+ get_peft_config,
79
+ get_quantization_config,
80
+ )
81
+ from trl.rewards import accuracy_reward, think_format_reward
82
+
83
+
84
+ # Enable logging in a Hugging Face Space
85
+ os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")
86
+
87
+
88
+ if __name__ == "__main__":
89
+ parser = TrlParser((ScriptArguments, GRPOConfig, ModelConfig))
90
+ script_args, training_args, model_args = parser.parse_args_and_config()
91
+ ################
92
+ # Model
93
+ ################
94
+ dtype = model_args.dtype if model_args.dtype in ["auto", None] else getattr(torch, model_args.dtype)
95
+ training_args.model_init_kwargs = dict(
96
+ revision=model_args.model_revision,
97
+ attn_implementation=model_args.attn_implementation,
98
+ dtype=dtype,
99
+ )
100
+ quantization_config = get_quantization_config(model_args)
101
+ if quantization_config is not None:
102
+ # Passing None would not be treated the same as omitting the argument, so we include it only when valid.
103
+ training_args.model_init_kwargs["device_map"] = get_kbit_device_map()
104
+ training_args.model_init_kwargs["quantization_config"] = quantization_config
105
+
106
+ ################
107
+ # Dataset
108
+ ################
109
+ dataset = load_dataset("lmms-lab/multimodal-open-r1-8k-verified", split="train")
110
+ dataset = dataset.train_test_split(test_size=100, seed=42)
111
+
112
+ SYSTEM_PROMPT = (
113
+ "A conversation between user and assistant. The user asks a question, and the assistant solves it. The "
114
+ "assistant first thinks about the reasoning process in the mind and then provides the user with the answer. "
115
+ "The reasoning process and answer are enclosed within <think></think> tags, i.e., <think>\nThis is my "
116
+ "reasoning.\n</think>\nThis is my answer."
117
+ )
118
+
119
+ def make_conversation(example):
120
+ prompt = [
121
+ {"role": "system", "content": SYSTEM_PROMPT},
122
+ {"role": "user", "content": example["problem"]},
123
+ ]
124
+ return {"prompt": prompt}
125
+
126
+ dataset = dataset.map(make_conversation)
127
+
128
+ # Filter have big images
129
+ def filter_big_images(example):
130
+ image = example["image"]
131
+ return image.size[0] < 512 and image.size[1] < 512
132
+
133
+ dataset = dataset.filter(filter_big_images)
134
+
135
+ def convert_to_rgb(example):
136
+ image = example["image"]
137
+ if image.mode != "RGB":
138
+ image = image.convert("RGB")
139
+ example["image"] = image
140
+ return example
141
+
142
+ dataset = dataset.map(convert_to_rgb)
143
+
144
+ train_dataset = dataset["train"]
145
+ eval_dataset = dataset["test"] if training_args.eval_strategy != "no" else None
146
+
147
+ ################
148
+ # Training
149
+ ################
150
+ trainer = GRPOTrainer(
151
+ model=model_args.model_name_or_path,
152
+ args=training_args,
153
+ reward_funcs=[think_format_reward, accuracy_reward],
154
+ train_dataset=train_dataset,
155
+ eval_dataset=eval_dataset,
156
+ peft_config=get_peft_config(model_args),
157
+ )
158
+
159
+ trainer.train()
160
+
161
+ # Save and push to hub
162
+ trainer.save_model(training_args.output_dir)
163
+ if training_args.push_to_hub:
164
+ trainer.push_to_hub(dataset_name=script_args.dataset_name)
ICL/RL/trl_source/examples/scripts/gspo.py ADDED
@@ -0,0 +1,137 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # /// script
16
+ # dependencies = [
17
+ # "trl",
18
+ # "peft",
19
+ # "math-verify",
20
+ # "latex2sympy2_extended",
21
+ # "trackio",
22
+ # "kernels",
23
+ # ]
24
+ # ///
25
+
26
+ """
27
+ pip install math_verify
28
+
29
+ # For Qwen/Qwen3-0.6B
30
+ pip install num2words==0.5.14
31
+
32
+ accelerate launch \
33
+ --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
34
+ examples/scripts/gspo.py \
35
+ --model_name_or_path Qwen/Qwen3-0.6B \
36
+ --output_dir gspo-Qwen3-0.6B \
37
+ --learning_rate 1e-5 \
38
+ --dtype bfloat16 \
39
+ --max_completion_length 1024 \
40
+ --use_peft \
41
+ --lora_target_modules "q_proj", "v_proj" \
42
+ --log_completions \
43
+ --per_device_train_batch_size 8 \
44
+ --num_generations 8 \
45
+ --importance_sampling_level sequence \
46
+ --epsilon 3e-4 \
47
+ --epsilon_high 4e-4 \
48
+ --beta 0.0 \
49
+ --loss_type grpo \
50
+ --gradient_accumulation_steps 2 \
51
+ --steps_per_generation 8
52
+
53
+ """
54
+
55
+ import os
56
+
57
+ import torch
58
+ from datasets import load_dataset
59
+
60
+ from trl import (
61
+ GRPOConfig,
62
+ GRPOTrainer,
63
+ ModelConfig,
64
+ ScriptArguments,
65
+ TrlParser,
66
+ get_kbit_device_map,
67
+ get_peft_config,
68
+ get_quantization_config,
69
+ )
70
+ from trl.rewards import accuracy_reward, think_format_reward
71
+
72
+
73
+ # Enable logging in a Hugging Face Space
74
+ os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")
75
+
76
+ if __name__ == "__main__":
77
+ parser = TrlParser((ScriptArguments, GRPOConfig, ModelConfig))
78
+ script_args, training_args, model_args = parser.parse_args_and_config()
79
+ ################
80
+ # Model & Processor
81
+ ################
82
+ dtype = model_args.dtype if model_args.dtype in ["auto", None] else getattr(torch, model_args.dtype)
83
+ training_args.model_init_kwargs = dict(
84
+ revision=model_args.model_revision,
85
+ attn_implementation=model_args.attn_implementation,
86
+ dtype=dtype,
87
+ )
88
+ quantization_config = get_quantization_config(model_args)
89
+ if quantization_config is not None:
90
+ # Passing None would not be treated the same as omitting the argument, so we include it only when valid.
91
+ training_args.model_init_kwargs["device_map"] = get_kbit_device_map()
92
+ training_args.model_init_kwargs["quantization_config"] = quantization_config
93
+
94
+ ################
95
+ # Dataset
96
+ ################
97
+ train_dataset, eval_dataset = load_dataset("AI-MO/NuminaMath-TIR", split=["train[:5%]", "test[:5%]"])
98
+
99
+ SYSTEM_PROMPT = (
100
+ "A conversation between user and assistant. The user asks a question, and the assistant solves it. The "
101
+ "assistant first thinks about the reasoning process in the mind and then provides the user with the answer. "
102
+ "The reasoning process and answer are enclosed within <think></think> tags, i.e., <think>\nThis is my "
103
+ "reasoning.\n</think>\nThis is my answer."
104
+ )
105
+
106
+ def make_conversation(example):
107
+ return {
108
+ "prompt": [
109
+ {"role": "system", "content": SYSTEM_PROMPT},
110
+ {"role": "user", "content": example["problem"]},
111
+ ],
112
+ }
113
+
114
+ train_dataset = train_dataset.map(make_conversation)
115
+ eval_dataset = eval_dataset.map(make_conversation)
116
+
117
+ train_dataset = train_dataset.remove_columns(["messages", "problem"])
118
+ eval_dataset = eval_dataset.remove_columns(["messages", "problem"])
119
+
120
+ ################
121
+ # Training
122
+ ################
123
+ trainer = GRPOTrainer(
124
+ model=model_args.model_name_or_path,
125
+ args=training_args,
126
+ reward_funcs=[think_format_reward, accuracy_reward],
127
+ train_dataset=train_dataset,
128
+ eval_dataset=eval_dataset,
129
+ peft_config=get_peft_config(model_args),
130
+ )
131
+
132
+ trainer.train()
133
+
134
+ # Save and push to hub
135
+ trainer.save_model(training_args.output_dir)
136
+ if training_args.push_to_hub:
137
+ trainer.push_to_hub(dataset_name=script_args.dataset_name)
ICL/RL/trl_source/examples/scripts/gspo_vlm.py ADDED
@@ -0,0 +1,153 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # /// script
16
+ # dependencies = [
17
+ # "trl",
18
+ # "Pillow",
19
+ # "peft",
20
+ # "math-verify",
21
+ # "latex2sympy2_extended",
22
+ # "torchvision",
23
+ # "trackio",
24
+ # "kernels",
25
+ # ]
26
+ # ///
27
+
28
+ """
29
+ pip install math_verify
30
+
31
+ # For Qwen/Qwen2.5-VL-3B-Instruct
32
+ accelerate launch \
33
+ --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
34
+ examples/scripts/gspo_vlm.py \
35
+ --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
36
+ --output_dir gspo-Qwen2.5-VL-3B-Instruct \
37
+ --learning_rate 1e-5 \
38
+ --dtype bfloat16 \
39
+ --max_completion_length 1024 \
40
+ --use_peft \
41
+ --lora_target_modules "q_proj", "v_proj" \
42
+ --log_completions \
43
+ --per_device_train_batch_size 8 \
44
+ --num_generations 8 \
45
+ --importance_sampling_level sequence \
46
+ --epsilon 3e-4 \
47
+ --epsilon_high 4e-4 \
48
+ --beta 0.0 \
49
+ --loss_type grpo \
50
+ --gradient_accumulation_steps 2 \
51
+ --steps_per_generation 8
52
+
53
+ """
54
+
55
+ import os
56
+
57
+ import torch
58
+ from datasets import load_dataset
59
+
60
+ from trl import (
61
+ GRPOConfig,
62
+ GRPOTrainer,
63
+ ModelConfig,
64
+ ScriptArguments,
65
+ TrlParser,
66
+ get_kbit_device_map,
67
+ get_peft_config,
68
+ get_quantization_config,
69
+ )
70
+ from trl.rewards import accuracy_reward, think_format_reward
71
+
72
+
73
+ # Enable logging in a Hugging Face Space
74
+ os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")
75
+
76
+
77
+ if __name__ == "__main__":
78
+ parser = TrlParser((ScriptArguments, GRPOConfig, ModelConfig))
79
+ script_args, training_args, model_args = parser.parse_args_and_config()
80
+ ################
81
+ # Model
82
+ ################
83
+ dtype = model_args.dtype if model_args.dtype in ["auto", None] else getattr(torch, model_args.dtype)
84
+ training_args.model_init_kwargs = dict(
85
+ revision=model_args.model_revision,
86
+ attn_implementation=model_args.attn_implementation,
87
+ dtype=dtype,
88
+ )
89
+ quantization_config = get_quantization_config(model_args)
90
+ if quantization_config is not None:
91
+ # Passing None would not be treated the same as omitting the argument, so we include it only when valid.
92
+ training_args.model_init_kwargs["device_map"] = get_kbit_device_map()
93
+ training_args.model_init_kwargs["quantization_config"] = quantization_config
94
+
95
+ ################
96
+ # Dataset
97
+ ################
98
+ dataset = load_dataset("lmms-lab/multimodal-open-r1-8k-verified", split="train")
99
+ dataset = dataset.train_test_split(test_size=100, seed=42)
100
+
101
+ SYSTEM_PROMPT = (
102
+ "A conversation between user and assistant. The user asks a question, and the assistant solves it. The "
103
+ "assistant first thinks about the reasoning process in the mind and then provides the user with the answer. "
104
+ "The reasoning process and answer are enclosed within <think></think> tags, i.e., <think>\nThis is my "
105
+ "reasoning.\n</think>\nThis is my answer."
106
+ )
107
+
108
+ def make_conversation(example):
109
+ prompt = [
110
+ {"role": "system", "content": SYSTEM_PROMPT},
111
+ {"role": "user", "content": example["problem"]},
112
+ ]
113
+ return {"prompt": prompt}
114
+
115
+ dataset = dataset.map(make_conversation)
116
+
117
+ # Filter have big images
118
+ def filter_big_images(example):
119
+ image = example["image"]
120
+ return image.size[0] < 512 and image.size[1] < 512
121
+
122
+ dataset = dataset.filter(filter_big_images)
123
+
124
+ def convert_to_rgb(example):
125
+ image = example["image"]
126
+ if image.mode != "RGB":
127
+ image = image.convert("RGB")
128
+ example["image"] = image
129
+ return example
130
+
131
+ dataset = dataset.map(convert_to_rgb)
132
+
133
+ train_dataset = dataset["train"]
134
+ eval_dataset = dataset["test"] if training_args.eval_strategy != "no" else None
135
+
136
+ ################
137
+ # Training
138
+ ################
139
+ trainer = GRPOTrainer(
140
+ model=model_args.model_name_or_path,
141
+ args=training_args,
142
+ reward_funcs=[think_format_reward, accuracy_reward],
143
+ train_dataset=train_dataset,
144
+ eval_dataset=eval_dataset,
145
+ peft_config=get_peft_config(model_args),
146
+ )
147
+
148
+ trainer.train()
149
+
150
+ # Save and push to hub
151
+ trainer.save_model(training_args.output_dir)
152
+ if training_args.push_to_hub:
153
+ trainer.push_to_hub(dataset_name=script_args.dataset_name)
ICL/RL/trl_source/examples/scripts/kto.py ADDED
@@ -0,0 +1,112 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # /// script
16
+ # dependencies = [
17
+ # "trl",
18
+ # "peft",
19
+ # "trackio",
20
+ # "kernels",
21
+ # ]
22
+ # ///
23
+
24
+ """
25
+ Run the KTO training script with the commands below. In general, the optimal configuration for KTO will be similar to that of DPO.
26
+
27
+ # Full training:
28
+ python trl/scripts/kto.py \
29
+ --dataset_name trl-lib/kto-mix-14k \
30
+ --model_name_or_path trl-lib/qwen1.5-1.8b-sft \
31
+ --per_device_train_batch_size 16 \
32
+ --num_train_epochs 1 \
33
+ --learning_rate 5e-7 \
34
+ --lr_scheduler_type cosine \
35
+ --gradient_accumulation_steps 1 \
36
+ --eval_steps 500 \
37
+ --output_dir kto-aligned-model \
38
+ --warmup_ratio 0.1 \
39
+ --logging_first_step
40
+
41
+ # QLoRA:
42
+ python trl/scripts/kto.py \
43
+ --dataset_name trl-lib/kto-mix-14k \
44
+ --model_name_or_path trl-lib/qwen1.5-1.8b-sft \
45
+ --per_device_train_batch_size 8 \
46
+ --num_train_epochs 1 \
47
+ --learning_rate 5e-7 \
48
+ --lr_scheduler_type cosine \
49
+ --gradient_accumulation_steps 1 \
50
+ --eval_steps 500 \
51
+ --output_dir kto-aligned-model-lora \
52
+ --warmup_ratio 0.1 \
53
+ --logging_first_step \
54
+ --use_peft \
55
+ --load_in_4bit \
56
+ --lora_target_modules all-linear \
57
+ --lora_r 16 \
58
+ --lora_alpha 16
59
+ """
60
+
61
+ import os
62
+
63
+ from datasets import load_dataset
64
+ from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser
65
+
66
+ from trl import ModelConfig, ScriptArguments, get_peft_config
67
+ from trl.experimental.kto import KTOConfig, KTOTrainer
68
+
69
+
70
+ # Enable logging in a Hugging Face Space
71
+ os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")
72
+
73
+
74
+ if __name__ == "__main__":
75
+ parser = HfArgumentParser((ScriptArguments, KTOConfig, ModelConfig))
76
+ script_args, training_args, model_args = parser.parse_args_into_dataclasses()
77
+
78
+ # Load a pretrained model
79
+ model = AutoModelForCausalLM.from_pretrained(
80
+ model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
81
+ )
82
+ ref_model = AutoModelForCausalLM.from_pretrained(
83
+ model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
84
+ )
85
+
86
+ tokenizer = AutoTokenizer.from_pretrained(
87
+ model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code
88
+ )
89
+ if tokenizer.pad_token is None:
90
+ tokenizer.pad_token = tokenizer.eos_token
91
+
92
+ # Load the dataset
93
+ dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)
94
+
95
+ # Initialize the KTO trainer
96
+ trainer = KTOTrainer(
97
+ model,
98
+ ref_model,
99
+ args=training_args,
100
+ train_dataset=dataset[script_args.dataset_train_split],
101
+ eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
102
+ processing_class=tokenizer,
103
+ peft_config=get_peft_config(model_args),
104
+ )
105
+
106
+ # Train and push the model to the Hub
107
+ trainer.train()
108
+
109
+ # Save and push to hub
110
+ trainer.save_model(training_args.output_dir)
111
+ if training_args.push_to_hub:
112
+ trainer.push_to_hub(dataset_name=script_args.dataset_name)
ICL/RL/trl_source/examples/scripts/mpo_vlm.py ADDED
@@ -0,0 +1,142 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # /// script
16
+ # dependencies = [
17
+ # "trl",
18
+ # "Pillow",
19
+ # "peft",
20
+ # "torchvision",
21
+ # "trackio",
22
+ # "kernels",
23
+ # ]
24
+ # ///
25
+
26
+ """
27
+ python examples/scripts/mpo_vlm.py \
28
+ --dataset_name HuggingFaceH4/rlaif-v_formatted \
29
+ --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
30
+ --per_device_train_batch_size 4 \
31
+ --per_device_eval_batch_size 4 \
32
+ --num_train_epochs 1 \
33
+ --gradient_accumulation_steps 8 \
34
+ --dataset_num_proc 1 \
35
+ --output_dir mpo_qwen2.5-vl_rlaif-v \
36
+ --dtype bfloat16 \
37
+ --use_peft \
38
+ --lora_target_modules down_proj o_proj k_proj q_proj gate_proj up_proj v_proj \
39
+ --loss_type sigmoid bco_pair sft \
40
+ --loss_weights 0.8 0.2 1.0
41
+ """
42
+
43
+ import os
44
+
45
+ import torch
46
+ from datasets import load_dataset
47
+ from PIL import Image
48
+ from transformers import AutoModelForImageTextToText
49
+
50
+ from trl import (
51
+ DPOConfig,
52
+ DPOTrainer,
53
+ ModelConfig,
54
+ ScriptArguments,
55
+ TrlParser,
56
+ get_kbit_device_map,
57
+ get_peft_config,
58
+ get_quantization_config,
59
+ )
60
+
61
+
62
+ # Enable logging in a Hugging Face Space
63
+ os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")
64
+
65
+
66
+ if __name__ == "__main__":
67
+ parser = TrlParser((ScriptArguments, DPOConfig, ModelConfig))
68
+ script_args, training_args, model_args = parser.parse_args_and_config()
69
+
70
+ ################
71
+ # Model & Processor
72
+ ################
73
+ dtype = model_args.dtype if model_args.dtype in ["auto", None] else getattr(torch, model_args.dtype)
74
+
75
+ model_kwargs = dict(
76
+ trust_remote_code=model_args.trust_remote_code,
77
+ revision=model_args.model_revision,
78
+ attn_implementation=model_args.attn_implementation,
79
+ dtype=dtype,
80
+ )
81
+ quantization_config = get_quantization_config(model_args)
82
+ if quantization_config is not None:
83
+ # Passing None would not be treated the same as omitting the argument, so we include it only when valid.
84
+ model_kwargs["device_map"] = get_kbit_device_map()
85
+ model_kwargs["quantization_config"] = quantization_config
86
+
87
+ model = AutoModelForImageTextToText.from_pretrained(
88
+ model_args.model_name_or_path,
89
+ **model_kwargs,
90
+ )
91
+ peft_config = get_peft_config(model_args)
92
+ if peft_config is None:
93
+ ref_model = AutoModelForImageTextToText.from_pretrained(
94
+ model_args.model_name_or_path,
95
+ **model_kwargs,
96
+ )
97
+ else:
98
+ ref_model = None
99
+
100
+ ################
101
+ # Dataset
102
+ ################
103
+ dataset = load_dataset(
104
+ script_args.dataset_name,
105
+ name=script_args.dataset_config,
106
+ streaming=script_args.dataset_streaming,
107
+ )
108
+ train_dataset = dataset[script_args.dataset_train_split]
109
+ test_dataset = dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None
110
+
111
+ def ensure_rgb(example):
112
+ # Convert the image to RGB if it's not already
113
+ image = example["images"][0]
114
+ if isinstance(image, Image.Image):
115
+ if image.mode != "RGB":
116
+ image = image.convert("RGB")
117
+ example["images"] = [image]
118
+ return example
119
+
120
+ # Apply the transformation to the dataset (change num_proc depending on the available compute)
121
+ train_dataset = train_dataset.map(ensure_rgb, num_proc=training_args.dataset_num_proc)
122
+ if test_dataset is not None:
123
+ test_dataset = test_dataset.map(ensure_rgb, num_proc=training_args.dataset_num_proc)
124
+
125
+ ################
126
+ # Training
127
+ ################
128
+ trainer = DPOTrainer(
129
+ model=model,
130
+ ref_model=ref_model,
131
+ args=training_args,
132
+ train_dataset=train_dataset,
133
+ eval_dataset=test_dataset,
134
+ peft_config=peft_config,
135
+ )
136
+
137
+ trainer.train()
138
+
139
+ # Save and push to hub
140
+ trainer.save_model(training_args.output_dir)
141
+ if training_args.push_to_hub:
142
+ trainer.push_to_hub(dataset_name=script_args.dataset_name)
ICL/RL/trl_source/examples/scripts/nash_md.py ADDED
@@ -0,0 +1,153 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2020-2026 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # /// script
16
+ # dependencies = [
17
+ # "trl",
18
+ # "trackio",
19
+ # "kernels",
20
+ # ]
21
+ # ///
22
+
23
+ """
24
+ Usage:
25
+
26
+ python examples/scripts/nash_md.py \
27
+ --model_name_or_path trl-lib/pythia-1b-deduped-tldr-sft \
28
+ --reward_model_path trl-lib/pythia-1b-deduped-tldr-rm \
29
+ --dataset_name trl-lib/tldr \
30
+ --learning_rate 5.0e-7 \
31
+ --output_dir pythia-1b-tldr-nash-md \
32
+ --per_device_train_batch_size 4 \
33
+ --gradient_accumulation_steps 32 \
34
+ --num_train_epochs 3 \
35
+ --max_new_tokens 64 \
36
+ --warmup_ratio 0.1 \
37
+ --missing_eos_penalty 1.0 \
38
+ --push_to_hub
39
+
40
+
41
+ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
42
+ examples/scripts/nash_md.py \
43
+ --model_name_or_path trl-lib/pythia-1b-deduped-tldr-sft \
44
+ --reward_model_path trl-lib/pythia-1b-deduped-tldr-rm \
45
+ --dataset_name trl-lib/tldr \
46
+ --learning_rate 5.0e-7 \
47
+ --output_dir pythia-1b-tldr-nash-md \
48
+ --per_device_train_batch_size 4 \
49
+ --gradient_accumulation_steps 32 \
50
+ --num_train_epochs 3 \
51
+ --max_new_tokens 64 \
52
+ --warmup_ratio 0.1 \
53
+ --missing_eos_penalty 1.0 \
54
+ --push_to_hub
55
+ """
56
+
57
+ import os
58
+
59
+ import torch
60
+ from datasets import load_dataset
61
+ from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer, GenerationConfig
62
+
63
+ from trl import (
64
+ LogCompletionsCallback,
65
+ ModelConfig,
66
+ ScriptArguments,
67
+ TrlParser,
68
+ get_kbit_device_map,
69
+ get_quantization_config,
70
+ )
71
+ from trl.experimental.judges import HfPairwiseJudge, OpenAIPairwiseJudge, PairRMJudge
72
+ from trl.experimental.nash_md import NashMDConfig, NashMDTrainer
73
+
74
+
75
+ # Enable logging in a Hugging Face Space
76
+ os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")
77
+
78
+
79
+ JUDGES = {"pair_rm": PairRMJudge, "openai": OpenAIPairwiseJudge, "hf": HfPairwiseJudge}
80
+
81
+ if __name__ == "__main__":
82
+ parser = TrlParser((ScriptArguments, NashMDConfig, ModelConfig))
83
+ script_args, training_args, model_args = parser.parse_args_and_config()
84
+ training_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
85
+
86
+ dtype = model_args.dtype if model_args.dtype in ["auto", None] else getattr(torch, model_args.dtype)
87
+ model_kwargs = dict(
88
+ revision=model_args.model_revision,
89
+ attn_implementation=model_args.attn_implementation,
90
+ dtype=dtype,
91
+ use_cache=False if training_args.gradient_checkpointing else True,
92
+ )
93
+ quantization_config = get_quantization_config(model_args)
94
+ if quantization_config is not None:
95
+ # Passing None would not be treated the same as omitting the argument, so we include it only when valid.
96
+ model_kwargs["device_map"] = get_kbit_device_map()
97
+ model_kwargs["quantization_config"] = quantization_config
98
+
99
+ model = AutoModelForCausalLM.from_pretrained(
100
+ model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, **model_kwargs
101
+ )
102
+ ref_model = AutoModelForCausalLM.from_pretrained(
103
+ model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, **model_kwargs
104
+ )
105
+
106
+ if training_args.reward_model_path is not None:
107
+ reward_model = AutoModelForSequenceClassification.from_pretrained(
108
+ training_args.reward_model_path,
109
+ num_labels=1,
110
+ trust_remote_code=model_args.trust_remote_code,
111
+ **model_kwargs,
112
+ )
113
+ else:
114
+ reward_model = None
115
+
116
+ if training_args.judge is not None:
117
+ judge_cls = JUDGES[training_args.judge]
118
+ judge = judge_cls()
119
+ else:
120
+ judge = None
121
+
122
+ tokenizer = AutoTokenizer.from_pretrained(
123
+ model_args.model_name_or_path, padding_side="left", trust_remote_code=model_args.trust_remote_code
124
+ )
125
+ if tokenizer.pad_token is None:
126
+ tokenizer.pad_token = tokenizer.eos_token
127
+
128
+ dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)
129
+
130
+ trainer = NashMDTrainer(
131
+ model=model,
132
+ ref_model=ref_model,
133
+ reward_funcs=reward_model,
134
+ judge=judge,
135
+ args=training_args,
136
+ train_dataset=dataset[script_args.dataset_train_split],
137
+ eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
138
+ processing_class=tokenizer,
139
+ )
140
+
141
+ if training_args.eval_strategy != "no":
142
+ generation_config = GenerationConfig(
143
+ max_new_tokens=training_args.max_new_tokens, do_sample=True, temperature=training_args.temperature
144
+ )
145
+ completions_callback = LogCompletionsCallback(trainer, generation_config, num_prompts=8)
146
+ trainer.add_callback(completions_callback)
147
+
148
+ trainer.train()
149
+
150
+ # Save and push to hub
151
+ trainer.save_model(training_args.output_dir)
152
+ if training_args.push_to_hub:
153
+ trainer.push_to_hub(dataset_name=script_args.dataset_name)
ICL/RL/trl_source/examples/scripts/nemo_gym/README.md ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ # Post-training with NeMo Gym and TRL
2
+
3
+ This integration supports training language models in NeMo-Gym environments using TRL GRPO. Both single-step and multi-step tasks are supported, including multi-environment training. NeMo-Gym orchestrates rollouts, returning token IDs and logprobs to TRL through the rollout function for training. Currently this integration is only supported through TRL's vLLM server mode.
4
+
5
+ Check out the docs page `docs/source/nemo_gym.md` for a guide.