---
tags:
- reinforcement-learning
- game-theory
- colonel-blotto
- neurips-2025
- graph-neural-networks
- meta-learning
license: mit
---

# Colonel Blotto: Advanced RL + LLM System for NeurIPS 2025
This repository contains trained models for the **Colonel Blotto game**, targeting the **NeurIPS 2025 MindGames workshop**. The system combines cutting-edge reinforcement learning with large language model fine-tuning.
## Model Overview

This system tackles Colonel Blotto by combining:
- **Graph Neural Networks** for game state representation
- **FiLM layers** for fast opponent adaptation
- **Meta-learning** for strategy portfolios
- **LLM fine-tuning** (SFT + DPO) for strategic reasoning
- **Distillation** from LLMs back to efficient RL policies
### Game Configuration

- **Fields**: 3
- **Units per round**: 20 (giving 231 possible allocations per round; see the sketch below)
- **Rounds per game**: 5
- **Training episodes**: N/A
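
For this configuration, an action is a split of 20 identical units across 3 fields, so the action space has C(22, 2) = 231 entries by stars and bars; this matches the `n_actions=231` used in the usage snippet below. A quick way to verify the count (the helper name is illustrative):

```python
from math import comb

def n_allocations(units: int, fields: int) -> int:
    """Stars and bars: ways to split `units` identical units across `fields` fields."""
    return comb(units + fields - 1, fields - 1)

print(n_allocations(20, 3))  # 231, matching n_actions in the usage example below
```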
## Performance Results

### Against Scripted Opponents

**Overall Win Rate**: 0.00%

### LLM Elo Ratings

| Model | Elo Rating |
|-------|------------|
## Architecture

### Policy Network

The core policy network combines four components:
1. **Graph Encoder**: multi-layer Graph Attention Networks (GAT)
   - Heterogeneous nodes: field nodes, round nodes, and a summary node
   - Multi-head attention with 6 heads
   - 3 layers of message passing

2. **Opponent Encoder**: MLP-based encoder for opponent modeling
   - Processes opponent history
   - Learns behavioral patterns

3. **FiLM Layers**: Feature-wise Linear Modulation (see the sketch after this list)
   - Fast adaptation to opponent behavior
   - Conditioned on the opponent encoding

4. **Portfolio Head**: multi-strategy selection
   - 6 specialist strategy heads
   - Soft attention-based mixing
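
To make the FiLM conditioning concrete, the sketch below shows feature-wise linear modulation driven by an opponent encoding. It is a minimal illustration; the class and dimension names (`FiLM`, `hidden=128`, `opp_dim=32`) are assumptions, not the repository's actual module.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scales and shifts hidden features
    using parameters predicted from the opponent encoding."""
    def __init__(self, hidden: int, opp_dim: int):
        super().__init__()
        # One linear map produces both the scale (gamma) and shift (beta)
        self.to_gamma_beta = nn.Linear(opp_dim, 2 * hidden)

    def forward(self, h: torch.Tensor, opp_z: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(opp_z).chunk(2, dim=-1)
        # (1 + gamma) keeps the layer near-identity at initialization
        return (1 + gamma) * h + beta

# Example: modulate 128-dim state features with a 32-dim opponent code
film = FiLM(hidden=128, opp_dim=32)
h = torch.randn(4, 128)      # batch of state embeddings
opp_z = torch.randn(4, 32)   # batch of opponent encodings
out = film(h, opp_z)         # same shape as h
```

Because the modulation starts near the identity, the policy behaves like its unconditioned self until the opponent encoder learns useful structure.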
### Training Pipeline

The models were trained through a seven-phase pipeline:
1. **Phase A**: Environment setup and action space generation
2. **Phase B**: PPO training against diverse scripted opponents
3. **Phase C**: Preference dataset generation (LLM vs LLM rollouts)
4. **Phase D**: Supervised Fine-Tuning (SFT) of the base LLM
5. **Phase E**: Direct Preference Optimization (DPO)
6. **Phase F**: Knowledge distillation from LLM to policy (see the sketch after this list)
7. **Phase G**: PPO refinement after distillation
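
The card does not spell out the Phase F objective, but a standard choice is to minimize the KL divergence from an LLM-derived action distribution to the policy's distribution over the 231 allocations. A minimal sketch under that assumption, with `distill_loss` and all tensor shapes hypothetical:

```python
import torch
import torch.nn.functional as F

def distill_loss(policy_logits: torch.Tensor,
                 teacher_probs: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the discrete allocation actions."""
    log_p_student = F.log_softmax(policy_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, teacher_probs, reduction="batchmean")

# Example: a batch of 8 states, 231 allocations (F=3, U=20)
policy_logits = torch.randn(8, 231, requires_grad=True)       # student policy outputs
teacher_probs = torch.softmax(torch.randn(8, 231), dim=-1)    # from LLM rollouts
loss = distill_loss(policy_logits, teacher_probs)
loss.backward()  # a real training step would update the policy here
```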
## Repository Contents

### Policy Models
- `policy_models/policy_final.pt`: final policy checkpoint (PyTorch)
- `policy_models/policy_after_distill.pt`: checkpoint after LLM-to-policy distillation
- `policy_models/policy_after_ppo.pt`: checkpoint after PPO training
### Fine-tuned LLM Models

- `sft_model/`: SFT model (HuggingFace Transformers compatible)
- `dpo_model/`: DPO model (HuggingFace Transformers compatible)
### Configuration & Results

- `master_config.json`: Complete training configuration
- `battleground_eval.json`: Comprehensive evaluation results
- `eval_scripted_after_ppo.json`: Post-PPO evaluation
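
The usage snippets below assume these files are available locally. If they are not, they can be fetched from the Hub first; a minimal sketch using `huggingface_hub`, with the repo id left as a placeholder:

```python
from huggingface_hub import hf_hub_download

# Replace the placeholder with this repository's actual id on the Hub
config_path = hf_hub_download(repo_id="<user>/<repo>", filename="master_config.json")
policy_path = hf_hub_download(repo_id="<user>/<repo>",
                              filename="policy_models/policy_final.pt")
```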
## Usage

### Loading Policy Model
```python
import json
import torch
from your_policy_module import PolicyNet

# Load configuration
with open("master_config.json", "r") as f:
    config = json.load(f)

# Initialize policy
policy = PolicyNet(
    F=config["F"],
    n_actions=231,  # ways to split 20 units over 3 fields: C(22, 2) = 231
    hidden=config["hidden"],
    gnn_layers=config["gnn_layers"],
    gnn_heads=config["gnn_heads"],
    n_strat=config["n_strat"]
)

# Load trained weights
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```
### Loading Fine-tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load SFT or DPO model (swap in "./dpo_model" for the DPO checkpoint)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference
prompt = "You have 20 units to allocate across 3 fields. Give your allocation."  # example prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:

- **Strategic game AI** beyond traditional game-theoretic approaches
- **Hybrid systems** combining neural RL and LLM reasoning
- **Fast adaptation** to diverse opponents through meta-learning
- **Efficient deployment** via distillation
### Key Innovations

1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states
2. **Ground-truth Counterfactual Learning**: Exploiting game determinism (see the sketch below)
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
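
Innovation 2 relies on the game's determinism: once the opponent's allocation for a round is revealed, the payoff of every alternative allocation can be computed exactly, giving ground-truth counterfactual targets instead of sampled estimates. A minimal sketch, assuming a per-field win/loss scoring rule (the helpers here are illustrative, not the repository's code):

```python
def all_allocations(units: int = 20, fields: int = 3):
    """Enumerate every way to split `units` across `fields` (231 for 20 units, 3 fields)."""
    if fields == 1:
        yield (units,)
        return
    for first in range(units + 1):
        for rest in all_allocations(units - first, fields - 1):
            yield (first,) + rest

def round_payoff(mine, theirs):
    """Deterministic Blotto round score: +1 per field won, -1 per field lost."""
    return sum((m > t) - (m < t) for m, t in zip(mine, theirs))

# Once the opponent's allocation is revealed, the payoff of *every*
# alternative allocation is known exactly -- no sampling needed.
opponent = (7, 7, 6)
counterfactuals = {a: round_payoff(a, opponent) for a in all_allocations()}
best = max(counterfactuals, key=counterfactuals.get)
print(len(counterfactuals), best, counterfactuals[best])  # 231 allocations in total
```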
## Citation

If you use this work, please cite:
```bibtex
@misc{colonelblotto2025neurips,
  title={{Advanced Reinforcement Learning System for Colonel Blotto Games}},
  author={{NeurIPS 2025 MindGames Submission}},
  year={2025},
  publisher={HuggingFace Hub},
  howpublished={\url{https://huggingface.co/{repo_id}}},
}
```
## License

MIT License. See the LICENSE file for details.
## Acknowledgments

- Built for the **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU

---
**Generated**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
**Uploaded from**: Notebook Environment