---
tags:
- reinforcement-learning
- game-theory
- codenames
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
license: mit
---

# Codenames: Graph-Based RL with LLM-Guided Preference Distillation
This repository contains trained **Codenames agents** developed for the **NeurIPS 2025 MindGames Workshop**. The system combines a structured graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, targeting improved risk calibration and decision robustness.

---

## Overview

The approach integrates:
- **Graph Neural Networks** for structured board and history representation
- **Proximal Policy Optimization (PPO)** for policy learning
- **Role-conditioned decoding** for spymaster and operative behaviors
- **Rollout-grounded preference learning** using large language models
- **Supervised fine-tuning (SFT)** and **Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher back into a compact policy

The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while keeping inference efficient enough for interactive play.

---

## Game Configuration
- **Game**: Codenames
- **Board size**: 25 words
- **Roles**: Spymaster and Operative
- **Evaluation games**: 600 full episodes
- **Opponents**: Scripted baseline agents

---

## Policy Architecture
### Graph-Based State Encoder
- Heterogeneous graph with **30–40 nodes**
- Node types include:
  - Word nodes with semantic and state features
  - Historical clue nodes
  - Global summary node
- Node feature dimension: **35**
- Encoder (see the sketch below):
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192
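
The configuration above maps onto a small graph-attention stack. The following is a minimal sketch of such an encoder, assuming a PyTorch Geometric implementation; the class and argument names are illustrative, not the repository's actual module (see the `policy` module and `master_config.json` for the real definitions).

```python
import torch.nn as nn
from torch_geometric.nn import GATConv

class GraphEncoderSketch(nn.Module):
    """Illustrative 3-layer GAT encoder matching the stated sizes:
    35-dim node features, 6 heads, hidden size 192 (hypothetical)."""

    def __init__(self, in_dim: int = 35, hidden: int = 192, heads: int = 6):
        super().__init__()
        per_head = hidden // heads  # 32 dims per head, concatenated back to 192
        self.layers = nn.ModuleList([
            GATConv(in_dim, per_head, heads=heads),
            GATConv(hidden, per_head, heads=heads),
            GATConv(hidden, per_head, heads=heads),
        ])
        self.act = nn.ELU()

    def forward(self, x, edge_index):
        # x: [num_nodes, 35] node features; edge_index: [2, num_edges]
        for conv in self.layers:
            x = self.act(conv(x, edge_index))
        return x  # one 192-dim embedding per word/clue/global node
```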
### Role Conditioning
- Shared policy trunk
- Role-conditioned action decoding (see the sketch below):
  - Clue generation and constraint handling for the spymaster
  - Guess selection and stopping decisions for the operative
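
A shared trunk with per-role output heads is one straightforward way to realize this. The sketch below is hypothetical: it assumes the 192-dim summary embedding from the encoder, and the head sizes and clue vocabulary are illustrative.

```python
import torch
import torch.nn as nn

class RoleConditionedHeads(nn.Module):
    """Hypothetical role-conditioned decoder over a shared trunk output."""

    def __init__(self, hidden: int = 192, clue_vocab: int = 1024, board: int = 25):
        super().__init__()
        self.spymaster_head = nn.Linear(hidden, clue_vocab)  # clue logits
        self.operative_head = nn.Linear(hidden, board + 1)   # 25 guesses + STOP

    def forward(self, summary: torch.Tensor, role: str) -> torch.Tensor:
        if role == "spymaster":
            return self.spymaster_head(summary)
        return self.operative_head(summary)
```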
### Model Size
- Total parameters: **~6.8M**
- Enables fast inference under competitive constraints

---

## Training Pipeline

Training follows a multi-stage curriculum:
1. **Graph PPO Pretraining**
   - PPO with clip ratio 0.2 (see the loss sketch below)
   - Discount factor γ = 0.99
   - GAE λ = 0.95
   - Trained against scripted Codenames agents
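
For reference, the clipped surrogate objective with the stated clip ratio looks like the following; this is the standard PPO loss, not code taken from this repository.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate (maximized, so the negative is returned)."""
    ratio = torch.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * advantages
    return -torch.min(unclipped, clipped).mean()
```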
2. **Preference Generation via Rollouts**
   - ~800 intermediate states sampled
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct
   - Each proposal evaluated with multiple stochastic rollouts
   - Higher-return actions labeled as preferred (see the sketch below)
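
Conceptually, each labeling step compares candidate actions by their mean rollout return. The sketch below assumes a hypothetical `simulate(state, action)` function that plays one stochastic rollout to completion and returns the final reward; the output format mirrors the prompt/chosen/rejected convention used by preference-learning libraries.

```python
def label_preference(state, cand_a, cand_b, simulate, n_rollouts: int = 16) -> dict:
    """Label the higher mean-return candidate as preferred (hypothetical sketch)."""
    def mean_return(action):
        return sum(simulate(state, action) for _ in range(n_rollouts)) / n_rollouts

    score_a, score_b = mean_return(cand_a), mean_return(cand_b)
    chosen, rejected = (cand_a, cand_b) if score_a >= score_b else (cand_b, cand_a)
    return {"prompt": state, "chosen": chosen, "rejected": rejected}
```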
3. **Teacher Alignment**
   - Supervised fine-tuning on the chosen actions
   - Direct Preference Optimization against a frozen reference model (see the sketch below)
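
One common way to run the DPO stage is with TRL's `DPOTrainer`; the snippet below is a minimal sketch under that assumption (the repository does not ship its training script, and the dataset contents here are placeholders). Note that TRL builds the frozen reference model from the policy when `ref_model=None`.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("./sft_model")
tokenizer = AutoTokenizer.from_pretrained("./sft_model")

# Placeholder preference pairs in TRL's prompt/chosen/rejected format.
pairs = Dataset.from_list([
    {"prompt": "<state + role>",
     "chosen": "<higher-return action>",
     "rejected": "<lower-return action>"},
])

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL clones `model` into a frozen reference
    args=DPOConfig(output_dir="dpo_model", beta=0.1),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```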
4. **Policy Distillation**
   - Aligned teacher generates action labels conditioned on state and role
   - Graph policy trained by cross-entropy imitation (see the sketch below)
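
A single imitation step reduces to cross-entropy against the teacher's action label. The sketch below is hypothetical; the batch field names and the policy's call signature are illustrative.

```python
import torch
import torch.nn.functional as F

def distill_step(policy, batch: dict, optimizer: torch.optim.Optimizer) -> float:
    """One cross-entropy imitation step on teacher-labeled actions (sketch)."""
    logits = policy(batch["node_features"], batch["edge_index"])  # action logits
    loss = F.cross_entropy(logits, batch["teacher_action"])       # teacher labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```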
5. **PPO Refinement**
   - PPO resumes using environment rewards
   - Stabilizes the policy after distillation

---

## Evaluation Results
Evaluation uses **600 full games** against scripted opponents.

| Agent | Win Rate | Assassin Rate |
|-------|----------|---------------|
| Graph PPO | 44.8% | 12.6% |
| PPO + Distillation | 52.9% | 6.9% |

- Distillation yields an **8.1-point** absolute win-rate improvement (checked below)
- Assassin-triggered losses fall by **45%** relative (12.6% → 6.9%)
- Improvements arise primarily from **better risk calibration**, not increased guessing aggressiveness
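
As a quick check, the improvement figures follow directly from the two table rows:

```python
win_ppo, win_distill = 44.8, 52.9
assassin_ppo, assassin_distill = 12.6, 6.9

print(f"Win-rate gain: {win_distill - win_ppo:.1f} points")              # 8.1
print(f"Assassin reduction: {1 - assassin_distill / assassin_ppo:.0%}")  # 45%
```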
---

## Repository Contents

### Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`

### Teacher Models
- `sft_model/` – supervised fine-tuned teacher
- `dpo_model/` – preference-aligned teacher

### Configuration and Logs
- `master_config.json`
- `evaluation_results.json` (see the loading snippet below)
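
Both files are plain JSON and can be inspected directly. The exact keys are not documented here, so treat this as a generic loading example that assumes top-level JSON objects.

```python
import json

with open("master_config.json") as f:
    config = json.load(f)
with open("evaluation_results.json") as f:
    results = json.load(f)

print(sorted(config))   # inspect the available configuration keys
print(sorted(results))  # inspect the available result keys
```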
---

## Usage

### Load Policy

```python
import torch
from policy import GraphPolicy

# Constructor arguments should match the architecture settings in
# master_config.json (node feature dim 35, hidden size 192, 6 heads).
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_after_distill.pt",
                                  map_location="cpu"))
policy.eval()
```
### Load Fine-Tuned Teacher

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher (use "./dpo_model" for the preference-aligned one)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Run inference on a game-state prompt (placeholder text)
prompt = "Board: ... Role: spymaster. Propose a clue."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 🎓 Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** and centers on three findings:

- Language models provide useful strategic priors when grounded by rollouts
- Graph-based representations enable structured reasoning in semantic games
- Distillation transfers high-level reasoning into efficient, deployable agents
### Key Innovations

1. **Heterogeneous Graph Representation**: Structured graph encoding of Codenames board states and clue history
2. **Rollout-Grounded Counterfactual Learning**: Scoring LLM-proposed actions by simulated returns
3. **Multi-Scale Representation**: Word-level, turn-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
## 📄 License

MIT License. See the LICENSE file for details.

## 🙏 Acknowledgments

- Built for the **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU
|