---
tags:
- reinforcement-learning
- game-theory
- colonel-blotto
- neurips-2025
- graph-neural-networks
- meta-learning
license: mit
---

# Colonel Blotto: Advanced RL + LLM System for NeurIPS 2025
This repository contains trained models for the **Colonel Blotto game**, targeting the **NeurIPS 2025 MindGames workshop**. The system combines cutting-edge reinforcement learning with large language model fine-tuning.
## Model Overview

This system tackles Colonel Blotto by combining:
- **Graph Neural Networks** for game state representation
- **FiLM layers** for fast opponent adaptation
- **Meta-learning** for strategy portfolios
- **LLM fine-tuning** (SFT + DPO) for strategic reasoning
- **Distillation** from LLMs back to efficient RL policies
### Game Configuration

- **Fields**: 3
- **Units per round**: 20 (giving 231 possible allocations per round; see the sketch below)
- **Rounds per game**: 5
- **Training episodes**: N/A
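
For this configuration, an action is a split of 20 identical units across 3 fields, so the action space has C(22, 2) = 231 entries by stars and bars; this matches the `n_actions=231` used in the usage snippet below. A quick way to verify the count (the helper name is illustrative):

```python
from math import comb

def n_allocations(units: int, fields: int) -> int:
    """Stars and bars: ways to split `units` identical units across `fields` fields."""
    return comb(units + fields - 1, fields - 1)

print(n_allocations(20, 3))  # 231, matching n_actions in the usage example below
```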
## Performance Results

### Against Scripted Opponents

**Overall Win Rate**: 0.00%

### LLM Elo Ratings

| Model | Elo Rating |
|-------|------------|
## Architecture

### Policy Network

The core policy network combines four components:
1. **Graph Encoder**: multi-layer Graph Attention Networks (GAT)
   - Heterogeneous nodes: field nodes, round nodes, and a summary node
   - Multi-head attention with 6 heads
   - 3 layers of message passing

2. **Opponent Encoder**: MLP-based encoder for opponent modeling
   - Processes opponent history
   - Learns behavioral patterns

3. **FiLM Layers**: Feature-wise Linear Modulation (see the sketch after this list)
   - Fast adaptation to opponent behavior
   - Conditioned on the opponent encoding

4. **Portfolio Head**: multi-strategy selection
   - 6 specialist strategy heads
   - Soft attention-based mixing
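
To make the FiLM conditioning concrete, the sketch below shows feature-wise linear modulation driven by an opponent encoding. It is a minimal illustration; the class and dimension names (`FiLM`, `hidden=128`, `opp_dim=32`) are assumptions, not the repository's actual module.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scales and shifts hidden features
    using parameters predicted from the opponent encoding."""
    def __init__(self, hidden: int, opp_dim: int):
        super().__init__()
        # One linear map produces both the scale (gamma) and shift (beta)
        self.to_gamma_beta = nn.Linear(opp_dim, 2 * hidden)

    def forward(self, h: torch.Tensor, opp_z: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(opp_z).chunk(2, dim=-1)
        # (1 + gamma) keeps the layer near-identity at initialization
        return (1 + gamma) * h + beta

# Example: modulate 128-dim state features with a 32-dim opponent code
film = FiLM(hidden=128, opp_dim=32)
h = torch.randn(4, 128)      # batch of state embeddings
opp_z = torch.randn(4, 32)   # batch of opponent encodings
out = film(h, opp_z)         # same shape as h
```

Because the modulation starts near the identity, the policy behaves like its unconditioned self until the opponent encoder learns useful structure.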
### Training Pipeline

The models were trained through a seven-phase pipeline:
1. **Phase A**: Environment setup and action space generation
2. **Phase B**: PPO training against diverse scripted opponents
3. **Phase C**: Preference dataset generation (LLM vs LLM rollouts)
4. **Phase D**: Supervised Fine-Tuning (SFT) of the base LLM
5. **Phase E**: Direct Preference Optimization (DPO)
6. **Phase F**: Knowledge distillation from LLM to policy (see the sketch after this list)
7. **Phase G**: PPO refinement after distillation
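
The card does not spell out the Phase F objective, but a standard choice is to minimize the KL divergence from an LLM-derived action distribution to the policy's distribution over the 231 allocations. A minimal sketch under that assumption, with `distill_loss` and all tensor shapes hypothetical:

```python
import torch
import torch.nn.functional as F

def distill_loss(policy_logits: torch.Tensor,
                 teacher_probs: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the discrete allocation actions."""
    log_p_student = F.log_softmax(policy_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, teacher_probs, reduction="batchmean")

# Example: a batch of 8 states, 231 allocations (F=3, U=20)
policy_logits = torch.randn(8, 231, requires_grad=True)       # student policy outputs
teacher_probs = torch.softmax(torch.randn(8, 231), dim=-1)    # from LLM rollouts
loss = distill_loss(policy_logits, teacher_probs)
loss.backward()  # a real training step would update the policy here
```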
## Repository Contents

### Policy Models
- `policy_models/policy_final.pt`: final policy checkpoint (PyTorch)
- `policy_models/policy_after_distill.pt`: checkpoint after LLM-to-policy distillation
- `policy_models/policy_after_ppo.pt`: checkpoint after PPO training
### Fine-tuned LLM Models

- `sft_model/`: SFT model (HuggingFace Transformers compatible)
- `dpo_model/`: DPO model (HuggingFace Transformers compatible)
### Configuration & Results

- `master_config.json`: Complete training configuration
- `battleground_eval.json`: Comprehensive evaluation results
- `eval_scripted_after_ppo.json`: Post-PPO evaluation
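
The usage snippets below assume these files are available locally. If they are not, they can be fetched from the Hub first; a minimal sketch using `huggingface_hub`, with the repo id left as a placeholder:

```python
from huggingface_hub import hf_hub_download

# Replace the placeholder with this repository's actual id on the Hub
config_path = hf_hub_download(repo_id="<user>/<repo>", filename="master_config.json")
policy_path = hf_hub_download(repo_id="<user>/<repo>",
                              filename="policy_models/policy_final.pt")
```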
## Usage

### Loading Policy Model
```python
import json
import torch
from your_policy_module import PolicyNet

# Load configuration
with open("master_config.json", "r") as f:
    config = json.load(f)

# Initialize policy
policy = PolicyNet(
    F=config["F"],
    n_actions=231,  # ways to split 20 units over 3 fields: C(22, 2) = 231
    hidden=config["hidden"],
    gnn_layers=config["gnn_layers"],
    gnn_heads=config["gnn_heads"],
    n_strat=config["n_strat"]
)

# Load trained weights
policy.load_state_dict(torch.load("policy_models/policy_final.pt", map_location="cpu"))
policy.eval()
```
### Loading Fine-tuned LLM
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load SFT or DPO model (swap in "./dpo_model" for the DPO checkpoint)
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference
prompt = "You have 20 units to allocate across 3 fields. Give your allocation."  # example prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:

- **Strategic game AI** beyond traditional game-theoretic approaches
- **Hybrid systems** combining neural RL and LLM reasoning
- **Fast adaptation** to diverse opponents through meta-learning
- **Efficient deployment** via distillation
### Key Innovations

1. **Heterogeneous Graph Representation**: Novel graph structure for Blotto game states
2. **Ground-truth Counterfactual Learning**: Exploiting game determinism (see the sketch below)
3. **Multi-scale Representation**: Field-level, round-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies
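
Innovation 2 relies on the game's determinism: once the opponent's allocation for a round is revealed, the payoff of every alternative allocation can be computed exactly, giving ground-truth counterfactual targets instead of sampled estimates. A minimal sketch, assuming a per-field win/loss scoring rule (the helpers here are illustrative, not the repository's code):

```python
def all_allocations(units: int = 20, fields: int = 3):
    """Enumerate every way to split `units` across `fields` (231 for 20 units, 3 fields)."""
    if fields == 1:
        yield (units,)
        return
    for first in range(units + 1):
        for rest in all_allocations(units - first, fields - 1):
            yield (first,) + rest

def round_payoff(mine, theirs):
    """Deterministic Blotto round score: +1 per field won, -1 per field lost."""
    return sum((m > t) - (m < t) for m, t in zip(mine, theirs))

# Once the opponent's allocation is revealed, the payoff of *every*
# alternative allocation is known exactly -- no sampling needed.
opponent = (7, 7, 6)
counterfactuals = {a: round_payoff(a, opponent) for a in all_allocations()}
best = max(counterfactuals, key=counterfactuals.get)
print(len(counterfactuals), best, counterfactuals[best])  # 231 allocations in total
```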
## Citation

If you use this work, please cite:
```bibtex
@misc{colonelblotto2025neurips,
  title={{Advanced Reinforcement Learning System for Colonel Blotto Games}},
  author={{NeurIPS 2025 MindGames Submission}},
  year={2025},
  publisher={HuggingFace Hub},
  howpublished={\url{https://huggingface.co/{repo_id}}},
}
```
## License

MIT License. See the LICENSE file for details.
## Acknowledgments

- Built for the **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU

---
**Generated**: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
**Uploaded from**: Notebook Environment