---
tags:
- reinforcement-learning
- game-theory
- codenames
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
license: mit
---

# Codenames: Graph-Based RL with LLM-Guided Preference Distillation
This repository contains trained **Codenames agents** developed for the **NeurIPS 2025 MindGames Workshop**. The system combines a structured graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, targeting improved risk calibration and decision robustness.

---

## Overview

The approach integrates:
- **Graph Neural Networks** for structured board and history representation
- **Proximal Policy Optimization (PPO)** for policy learning
- **Role-conditioned decoding** for spymaster and operative behaviors
- **Rollout-grounded preference learning** using large language models
- **Supervised fine-tuning (SFT)** and **Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher back into a compact policy

The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while keeping inference efficient enough for interactive play.

---

## Game Configuration
- **Game**: Codenames
- **Board size**: 25 words
- **Roles**: Spymaster and Operative
- **Evaluation games**: 600 full episodes
- **Opponents**: Scripted baseline agents

---

## Policy Architecture
### Graph-Based State Encoder
- Heterogeneous graph with **30–40 nodes**
- Node types include:
  - Word nodes with semantic and state features
  - Historical clue nodes
  - Global summary node
- Node feature dimension: **35**
- Encoder (see the sketch below):
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192
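
The configuration above maps onto a small graph-attention stack. The following is a minimal sketch of such an encoder, assuming a PyTorch Geometric implementation; the class and argument names are illustrative, not the repository's actual module (see the `policy` module and `master_config.json` for the real definitions).

```python
import torch.nn as nn
from torch_geometric.nn import GATConv

class GraphEncoderSketch(nn.Module):
    """Illustrative 3-layer GAT encoder matching the stated sizes:
    35-dim node features, 6 heads, hidden size 192 (hypothetical)."""

    def __init__(self, in_dim: int = 35, hidden: int = 192, heads: int = 6):
        super().__init__()
        per_head = hidden // heads  # 32 dims per head, concatenated back to 192
        self.layers = nn.ModuleList([
            GATConv(in_dim, per_head, heads=heads),
            GATConv(hidden, per_head, heads=heads),
            GATConv(hidden, per_head, heads=heads),
        ])
        self.act = nn.ELU()

    def forward(self, x, edge_index):
        # x: [num_nodes, 35] node features; edge_index: [2, num_edges]
        for conv in self.layers:
            x = self.act(conv(x, edge_index))
        return x  # one 192-dim embedding per word/clue/global node
```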
### Role Conditioning
- Shared policy trunk
- Role-conditioned action decoding (see the sketch below):
  - Clue generation and constraint handling for the spymaster
  - Guess selection and stopping decisions for the operative
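
A shared trunk with per-role output heads is one straightforward way to realize this. The sketch below is hypothetical: it assumes the 192-dim summary embedding from the encoder, and the head sizes and clue vocabulary are illustrative.

```python
import torch
import torch.nn as nn

class RoleConditionedHeads(nn.Module):
    """Hypothetical role-conditioned decoder over a shared trunk output."""

    def __init__(self, hidden: int = 192, clue_vocab: int = 1024, board: int = 25):
        super().__init__()
        self.spymaster_head = nn.Linear(hidden, clue_vocab)  # clue logits
        self.operative_head = nn.Linear(hidden, board + 1)   # 25 guesses + STOP

    def forward(self, summary: torch.Tensor, role: str) -> torch.Tensor:
        if role == "spymaster":
            return self.spymaster_head(summary)
        return self.operative_head(summary)
```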
### Model Size
- Total parameters: **~6.8M**
- Enables fast inference under competitive constraints

---

## Training Pipeline

Training follows a multi-stage curriculum:
1. **Graph PPO Pretraining**
   - PPO with clip ratio 0.2 (see the loss sketch below)
   - Discount factor γ = 0.99
   - GAE λ = 0.95
   - Trained against scripted Codenames agents
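
For reference, the clipped surrogate objective with the stated clip ratio looks like the following; this is the standard PPO loss, not code taken from this repository.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate (maximized, so the negative is returned)."""
    ratio = torch.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * advantages
    return -torch.min(unclipped, clipped).mean()
```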
2. **Preference Generation via Rollouts**
   - ~800 intermediate states sampled
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct
   - Each proposal evaluated with multiple stochastic rollouts
   - Higher-return actions labeled as preferred (see the sketch below)
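
Conceptually, each labeling step compares candidate actions by their mean rollout return. The sketch below assumes a hypothetical `simulate(state, action)` function that plays one stochastic rollout to completion and returns the final reward; the output format mirrors the prompt/chosen/rejected convention used by preference-learning libraries.

```python
def label_preference(state, cand_a, cand_b, simulate, n_rollouts: int = 16) -> dict:
    """Label the higher mean-return candidate as preferred (hypothetical sketch)."""
    def mean_return(action):
        return sum(simulate(state, action) for _ in range(n_rollouts)) / n_rollouts

    score_a, score_b = mean_return(cand_a), mean_return(cand_b)
    chosen, rejected = (cand_a, cand_b) if score_a >= score_b else (cand_b, cand_a)
    return {"prompt": state, "chosen": chosen, "rejected": rejected}
```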
3. **Teacher Alignment**
   - Supervised fine-tuning on the chosen actions
   - Direct Preference Optimization against a frozen reference model (see the sketch below)
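
One common way to run the DPO stage is with TRL's `DPOTrainer`; the snippet below is a minimal sketch under that assumption (the repository does not ship its training script, and the dataset contents here are placeholders). Note that TRL builds the frozen reference model from the policy when `ref_model=None`.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("./sft_model")
tokenizer = AutoTokenizer.from_pretrained("./sft_model")

# Placeholder preference pairs in TRL's prompt/chosen/rejected format.
pairs = Dataset.from_list([
    {"prompt": "<state + role>",
     "chosen": "<higher-return action>",
     "rejected": "<lower-return action>"},
])

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL clones `model` into a frozen reference
    args=DPOConfig(output_dir="dpo_model", beta=0.1),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```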
4. **Policy Distillation**
   - Aligned teacher generates action labels conditioned on state and role
   - Graph policy trained by cross-entropy imitation (see the sketch below)
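
A single imitation step reduces to cross-entropy against the teacher's action label. The sketch below is hypothetical; the batch field names and the policy's call signature are illustrative.

```python
import torch
import torch.nn.functional as F

def distill_step(policy, batch: dict, optimizer: torch.optim.Optimizer) -> float:
    """One cross-entropy imitation step on teacher-labeled actions (sketch)."""
    logits = policy(batch["node_features"], batch["edge_index"])  # action logits
    loss = F.cross_entropy(logits, batch["teacher_action"])       # teacher labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```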
5. **PPO Refinement**
   - PPO resumes using environment rewards
   - Stabilizes the policy after distillation

---

## Evaluation Results
Evaluation uses **600 full games** against scripted opponents.

| Agent | Win Rate | Assassin Rate |
|-------|----------|---------------|
| Graph PPO | 44.8% | 12.6% |
| PPO + Distillation | 52.9% | 6.9% |

- Distillation yields an **8.1-point** absolute win-rate improvement (checked below)
- Assassin-triggered losses fall by **45%** relative (12.6% → 6.9%)
- Improvements arise primarily from **better risk calibration**, not increased guessing aggressiveness
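
As a quick check, the improvement figures follow directly from the two table rows:

```python
win_ppo, win_distill = 44.8, 52.9
assassin_ppo, assassin_distill = 12.6, 6.9

print(f"Win-rate gain: {win_distill - win_ppo:.1f} points")              # 8.1
print(f"Assassin reduction: {1 - assassin_distill / assassin_ppo:.0%}")  # 45%
```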
---

## Repository Contents

### Policy Checkpoints
- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`

### Teacher Models
- `sft_model/` – supervised fine-tuned teacher
- `dpo_model/` – preference-aligned teacher

### Configuration and Logs
- `master_config.json`
- `evaluation_results.json` (see the loading snippet below)
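
Both files are plain JSON and can be inspected directly. The exact keys are not documented here, so treat this as a generic loading example that assumes top-level JSON objects.

```python
import json

with open("master_config.json") as f:
    config = json.load(f)
with open("evaluation_results.json") as f:
    results = json.load(f)

print(sorted(config))   # inspect the available configuration keys
print(sorted(results))  # inspect the available result keys
```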
---

## Usage

### Load Policy

```python
import torch
from policy import GraphPolicy

# Constructor arguments should match the architecture settings in
# master_config.json (node feature dim 35, hidden size 192, 6 heads).
policy = GraphPolicy(...)
policy.load_state_dict(torch.load("policy_models/policy_after_distill.pt",
                                  map_location="cpu"))
policy.eval()
```
### Load Fine-Tuned Teacher

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT teacher (use "./dpo_model" for the preference-aligned one)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Run inference on a game-state prompt (placeholder text)
prompt = "Board: ... Role: spymaster. Propose a clue."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 🎓 Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** and centers on three findings:

- Language models provide useful strategic priors when grounded by rollouts
- Graph-based representations enable structured reasoning in semantic games
- Distillation transfers high-level reasoning into efficient, deployable agents
### Key Innovations

1. **Heterogeneous Graph Representation**: Structured graph encoding of Codenames board states and clue history
2. **Rollout-Grounded Counterfactual Learning**: Scoring LLM-proposed actions by simulated returns
3. **Multi-Scale Representation**: Word-level, turn-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
## 📄 License

MIT License. See the LICENSE file for details.

## 🙏 Acknowledgments

- Built for the **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU
|