# PAWN: Playstyle-Agnostic World-model Network for Chess
A small causal transformer that learns legal moves, board-state representations, and game dynamics purely from sequences of random legal moves, absent any form of strategic play.
I've found PAWN to be a viable testbed for finetuning and augmentation methods at small scale. Since it is entirely unopinionated, it's a blank slate ready to be adapted, augmented, and finetuned into arbitrary player models with unique playstyles.
Finetuning PAWN has proven significantly more parameter-efficient than training new models from scratch and requires minimal compute resources.
Feel free to use PAWN in your own experiments. Note that PAWN was developed as a personal project by a single developer and has not been published or audited. If you spot a bug, please help out by creating an issue or PR.
PAWN is under active development and is not yet stable.
## Model Variants
To aid in exploring how model size affects different finetuning methods, I trained three versions of PAWN:
| Variant | d_model | Layers | Heads | Params | Download |
|---|---|---|---|---|---|
| PAWN | 512 | 8 | 8 | ~35.8M | |
| PAWN-Small | 256 | 8 | 4 | ~9.5M | |
| PAWN-Large | 640 | 10 | 8 | ~68.4M | |
All variants share the same architecture: RMSNorm, SwiGLU FFN, RoPE, factored move embeddings, and a 4278-token vocabulary covering:
- all possible (src, dst) pairs for an 8x8 grid (the chess board),
- promotion moves: 4 piece types (queen, bishop, rook, knight) × 44 eligible (source square, destination square) pairs for pawns reaching the 1st & 8th ranks,
- a token for each game outcome (WHITE_CHECKMATES, BLACK_CHECKMATES, STALEMATE, DRAW_BY_RULE, PLY_LIMIT),
- and a padding token.
Notably, the vocabulary includes impossible moves like a1a1 and b1a5. PAWN naturally learns to avoid these since they don't appear in its training examples.
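The 4278-token figure follows directly from the breakdown above; a quick arithmetic sanity check (counts taken from this README, not read from the model):

```python
# Vocabulary breakdown from the list above.
square_pairs = 64 * 64   # every (src, dst) pair on an 8x8 board
promotions = 4 * 44      # 4 promotion pieces x 44 eligible (src, dst) pairs
outcomes = 5             # WHITE_CHECKMATES, ..., PLY_LIMIT
padding = 1

vocab_size = square_pairs + promotions + outcomes + padding
print(vocab_size)  # 4278
```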
Conceptually, each token is best thought of as a move in UCI notation -- effectively a pair of coordinates. Tokens carry no information about piece type, side to move, or board state beyond the factored structure of the embeddings (see the architecture section below for details).
For example, e2e4 is the token that represents the king's pawn opening, but only when it's the first ply in the sequence (moving a rook from e2 to e4 in the late game would use the same token). The model learns to track which type of piece is on each square at any given moment entirely of its own accord.
For that matter, it isn't told what piece types exist, what movement patterns they follow, or indeed the concept of a piece. All of that 'understanding' comes purely from observation and can be isolated via linear probes (Alain & Bengio, 2016).
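The probing idea can be illustrated on synthetic data (a toy sketch with made-up activations, not PAWN's eval_suite code): if a property such as "which piece type occupies a square" is linearly decodable from hidden states, a closed-form linear classifier recovers it well above chance.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_classes, n = 64, 7, 2000   # 7 = six piece types + empty (assumed)

# Synthetic "hidden states": each class gets a mean direction, standing in
# for residual-stream activations; the probe never sees the class means.
means = rng.normal(size=(n_classes, d_model))
labels = rng.integers(0, n_classes, size=n)
acts = means[labels] + 0.5 * rng.normal(size=(n, d_model))

# Linear probe: one-vs-all least squares on frozen features (closed form).
targets = np.eye(n_classes)[labels]            # one-hot targets
W, *_ = np.linalg.lstsq(acts, targets, rcond=None)
accuracy = ((acts @ W).argmax(axis=1) == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")       # chance would be ~0.14
```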
## Quickstart
```bash
# Clone the repo
git clone https://github.com/thomas-schweich/PAWN.git && cd PAWN

# Build the Rust chess engine
cd engine && uv run --with maturin maturin develop --release && cd ..

# Install core dependencies
uv sync --extra cu128  # NVIDIA GPU (or --extra rocm for AMD)

# Install optional (but recommended) extras for tests, result analysis,
# and the training monitoring dashboard
uv sync --extra dev --extra eval --extra dashboard

# Train an adapter on a pre-trained checkpoint
git submodule update --init checkpoints/pawn-base
uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base \
    --pgn data/lichess_1800_1900.pgn \
    --bottleneck-dim 32 --lr 1e-4 --local-checkpoints

# Or pretrain a PAWN variant from scratch (generates random games
# on-the-fly; no dataset required)
uv run python scripts/train.py --variant base --local-checkpoints

# Or train all three variants simultaneously on shared data
uv run python scripts/train_all.py --local-checkpoints

# Launch the real-time monitoring dashboard (requires the dashboard extra)
uv run python -m pawn.dashboard --log-dir logs --port 8765
```
## Architecture
More info: docs/ARCHITECTURE.md
PAWN is a standard decoder-only transformer trained with next-token prediction on chess move sequences. Each training example is:
```
[outcome] [ply_1] [ply_2] ... [ply_N] [PAD] ... [PAD]
```
The outcome token is one of WHITE_CHECKMATES, BLACK_CHECKMATES, STALEMATE, DRAW_BY_RULE, or PLY_LIMIT.
Ply tokens use a factored embedding: each move is decomposed into source square + destination square + promotion piece, with embeddings summed. This gives the model some degree of explicit spatial structure while keeping the vocabulary compact.
The summed embeddings effectively represent UCI strings like e2e4 (a piece moves from e2 to e4) or f7f8q (promotion to queen on f8). In factored form, the vector e2e4 is given by (e2xx + xxe4 + no_promotion). Likewise, f7f8q is given by (f7xx + xxf8 + xxxxq).
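A minimal sketch of the factored scheme, with NumPy arrays standing in for learned embedding tables (the names and table layout here are illustrative, not PAWN's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Three small embedding tables, summed per move -- the factored scheme.
# Sizes follow the vocabulary breakdown: 64 squares each for source and
# destination, 5 promotion states including "no promotion" (index 0).
src_emb = rng.normal(size=(64, d_model))
dst_emb = rng.normal(size=(64, d_model))
promo_emb = rng.normal(size=(5, d_model))

def square(name: str) -> int:
    """Map 'e2' -> 0..63 (file + 8 * rank)."""
    return (ord(name[0]) - ord("a")) + 8 * (int(name[1]) - 1)

def move_embedding(src: str, dst: str, promo: int = 0) -> np.ndarray:
    """Factored move vector: src embedding + dst embedding + promo embedding."""
    return src_emb[square(src)] + dst_emb[square(dst)] + promo_emb[promo]

e2e4 = move_embedding("e2", "e4")   # (e2xx + xxe4 + no_promotion)
assert e2e4.shape == (d_model,)
```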
The context window of all variants is 256 tokens wide. Training examples all include the outcome token followed by up to 255 ply or padding tokens.
During training, simulated games are retroactively prepended with their actual outcome. During inference, the outcome token has a measurable impact on subsequent completions.
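Building one such example can be sketched as follows (token ids are made up for illustration; PAWN's real vocabulary mapping differs):

```python
# Hypothetical token ids for illustration only.
PAD, WHITE_CHECKMATES = 0, 1
CONTEXT = 256

def make_example(outcome: int, plies: list[int]) -> list[int]:
    """[outcome] [ply_1] ... [ply_N] [PAD] ..., padded/truncated to 256 tokens."""
    seq = [outcome] + plies[: CONTEXT - 1]   # outcome + up to 255 ply tokens
    return seq + [PAD] * (CONTEXT - len(seq))

example = make_example(WHITE_CHECKMATES, [101, 202, 303])
assert len(example) == CONTEXT
```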
The model's predictions are not masked to legal moves during training; it has to determine what moves are currently legal based on the sequence of moves so far.
No attempt is made to provide the model with information about the pieces. In other words, it only thinks in moves. There is no equivalent of the multi-plane 8×8×N board representation used by e.g. AlphaZero (Silver et al., 2018) and Lc0. Any and all state representation and geometry is learned by the model internally.
## Adapter Methods
More info: docs/ADAPTERS.md
PAWN ships with five adapter implementations for fine-tuning the frozen backbone:
| Method | Params (typical) | Accuracy (1800 Elo) | Description |
|---|---|---|---|
| Bottleneck | 131K | 41.7% | Houlsby-style residual MLP adapters |
| Sparse | 503K-2.7M | 40.2-44.7% | Random binary mask on frozen weights |
| LoRA | ~65K | 34.1% | Low-rank attention projection adapters |
| Hybrid | ~65K | 34.1% | LoRA + FiLM combined |
| FiLM | ~17K | 30.3% | Per-channel affine modulation |
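For intuition, a Houlsby-style bottleneck adapter is just a down-projection, nonlinearity, and up-projection added residually to a frozen layer's output. A minimal NumPy sketch (the ReLU activation and zero-initialized up-projection are common conventions assumed here, not necessarily PAWN's exact implementation; biases are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, bottleneck_dim = 512, 32   # dims mirror the Quickstart example

# Only these two small matrices are trained; the backbone stays frozen.
W_down = rng.normal(scale=0.02, size=(d_model, bottleneck_dim))
W_up = np.zeros((bottleneck_dim, d_model))  # zero init => identity at start

def adapter(h: np.ndarray) -> np.ndarray:
    """Down-project, ReLU, up-project, then add back residually."""
    z = np.maximum(h @ W_down, 0.0)
    return h + z @ W_up

h = rng.normal(size=(4, d_model))   # a batch of hidden states
out = adapter(h)
assert out.shape == h.shape         # zero-init up-projection: a no-op at start
```

Zero-initializing the up-projection means the adapter starts as an identity function, so fine-tuning begins exactly at the pretrained model's behavior.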
Preliminary results show that a 524K-parameter bottleneck adapter on PAWN reaches 42.2% accuracy when predicting the moves of 1800-rated players on Lichess, vs. 30.9% for a standalone model with the same architecture and parameter count -- an ~11-percentage-point "free" accuracy lift from the frozen backbone.
## Repository Structure
```
pawn/
├── pawn/               # Core Python package
│   ├── config.py       # Model configs (small/base/large)
│   ├── model.py        # PAWN transformer
│   ├── data.py         # Random game data pipeline
│   ├── lichess_data.py # Lichess PGN data pipeline
│   ├── trainer.py      # Pretraining loop
│   ├── gpu.py          # GPU auto-detection
│   ├── adapters/       # Bottleneck, LoRA, FiLM, sparse, hybrid
│   └── eval_suite/     # Probes, generation tests, diagnostics
├── engine/             # Rust chess engine (PyO3 bindings)
├── scripts/            # Training and evaluation scripts
├── deploy/             # Runpod deployment scripts
├── tests/              # Unit tests
└── docs/               # Architecture, training, adapter docs
```
## Chess Engine
PAWN includes a bundled Rust chess engine (engine/) that handles all game simulation, move generation, legal move computation, and PGN parsing. The engine extensively uses shakmaty under the hood, with PyO3 bindings to Python. No Python chess libraries are used. The engine generates training data on-the-fly via chess_engine.generate_random_games(), which is capable of producing well over 100 million random games per hour on my CPU (AMD Ryzen 7800X3D).
## More info
- Architecture -- model design, embeddings, training objective
- Training -- pretraining, adapter training, deployment
- Adapters -- adapter methods, results, quick start
## Acknowledgments
PAWN builds on ideas and tools from the following projects and publications:
## License
Apache 2.0. See LICENSE.