# Pyre PPO Agent – krooz/pyre-ppo-agent

PPO-trained actor-critic agent for the Pyre fire-evacuation environment (OpenEnv Hackathon, Apr 2026).

> ⚠️ This is a raw PyTorch checkpoint, not a `transformers` model. The Hugging Face hosted Inference API cannot run it directly. Use the inference code below to load and run it locally.
## Training summary (artifact run: `pyre_ppo_hard_v2`)

Values below come from `artifacts/pyre_ppo_hard_v2.csv`, `pyre_ppo_hard_v2_eval.csv`, `pyre_ppo_hard_v2.png` (MA-20 curves match `save_training_graph_png` in `train_torch_ppo.py`), and `artifacts/pyre_ppo_hard_v2_training.log` (HTTP trainer via `train_torch_ppo_http.py`, env at http://localhost:8000).
| Metric | Value |
|---|---|
| Total episodes | 600 |
| Wall-clock training time | ~227 s (~2.6 eps/s) |
| Final success rate (MA-20, training graph title) | 55% |
| Final reward mean (MA-20) | +3.21 |
| Final success rate (rolling last 30 ep, CSV s30 / log) | 47% |
| Overall evacuation rate (all 600 ep, CSV) | 52.7% |
| Per-difficulty evacuation (easy / medium / hard) | 67.7% / 59.5% / 10.5% |
| Curriculum | easy → medium → hard with patience gate (0.70 over 20 ep); hard-phase mix hard:0.4, medium:0.4, easy:0.2 |
| Eval cadence | Every 25 episodes, 5 deterministic rollouts |
| Eval difficulty | hard (`pyre_ppo_hard_v2_eval.csv`) |
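The patience gate in the curriculum row can be sketched as a rolling-window check. `make_patience_gate` is a hypothetical helper for illustration, not code from the trainer:

```python
from collections import deque

def make_patience_gate(threshold=0.70, window=20):
    """Return a callable that records one episode outcome and reports True
    once the success rate over the last `window` episodes reaches
    `threshold` (the run used 0.70 over 20 episodes)."""
    history = deque(maxlen=window)

    def record(success):
        history.append(1.0 if success else 0.0)
        # Only fire once a full window has been observed.
        return len(history) == window and sum(history) / window >= threshold

    return record

# 18 successes in the last 20 episodes (0.90 >= 0.70) clears the gate
# on the 20th recorded episode.
gate = make_patience_gate(threshold=0.70, window=20)
fired = [gate(s) for s in [True] * 14 + [False] * 2 + [True] * 4]
print(fired[-1])  # True
```

Once the gate fires, the trainer advances the sampling mix toward the next difficulty rather than switching abruptly (hard-phase mix hard:0.4, medium:0.4, easy:0.2 above).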
## Training command (this run)

```shell
uv run python training/ppo/train_torch_ppo_http.py \
  --episodes 600 \
  --difficulty-schedule easy,medium,hard \
  --patience-threshold 0.70 \
  --patience-window 20 \
  --hard-mix-dist hard:0.4,medium:0.4,easy:0.2 \
  --update-every 8 \
  --update-epochs 6 \
  --eval-every 25 \
  --eval-difficulty hard \
  --eval-episodes 5 \
  --checkpoint-every 50 \
  --entropy-coef 0.05 \
  --step-delay 0 \
  --viz-after-ep 500 \
  --output artifacts/pyre_ppo_hard_v2.pt \
  --log-file artifacts/pyre_ppo_hard_v2_training.log
```
## Network architecture (from training log)

| Property | Value |
|---|---|
| Total parameters | 12,065,650 |
| Input vector dim | 23,140 (encoder base_dim 5785 × 4 stacked frames) |
| Action dim | 41 (4 move + 4 look + 1 wait + 16 door open + 16 door close) |
| Hidden MLP | 512 → 256 → 128 |
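The shapes in this table can be illustrated with a minimal trunk-and-heads MLP. `ActorCriticSketch` is a hypothetical stand-in, not the repo's `ActorCritic` class; it matches the listed dims only, and its raw parameter count will not exactly reproduce the logged 12,065,650:

```python
import torch
import torch.nn as nn

class ActorCriticSketch(nn.Module):
    """Shared 512 -> 256 -> 128 trunk with a 41-way policy head and a
    scalar value head, matching the shapes in the table above."""

    def __init__(self, input_dim=23_140, action_dim=41, hidden=(512, 256, 128)):
        super().__init__()
        layers, last = [], input_dim
        for h in hidden:
            layers += [nn.Linear(last, h), nn.ReLU()]
            last = h
        self.trunk = nn.Sequential(*layers)
        self.policy_head = nn.Linear(last, action_dim)  # 41 action logits
        self.value_head = nn.Linear(last, 1)            # state-value estimate

    def forward(self, x):
        z = self.trunk(x)
        return self.policy_head(z), self.value_head(z).squeeze(-1)

net = ActorCriticSketch()
logits, value = net(torch.zeros(1, 23_140))
print(logits.shape, value.shape)  # torch.Size([1, 41]) torch.Size([1])
```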
## Hyperparameters (this run)

| Param | Value |
|---|---|
| Learning rate | 3×10⁻⁴ (with LR decay toward 0.1× end factor unless disabled) |
| PPO clip ε | 0.2 |
| Entropy coeff | 0.05 |
| Value coeff | 0.5 |
| Gamma | 0.99 |
| GAE λ | 0.95 |
| PPO update every | 8 episodes |
| PPO epochs / minibatch | 6 / 256 |
| Max grad norm | 0.5 |
| Observation mode | visible (partial observability) |
| Device | cuda (`train_torch_ppo.py` default; set `--device cpu` if needed) |
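These coefficients slot into the standard clipped-surrogate PPO objective. A hedged sketch follows; the trainer's actual loss in `train_torch_ppo.py` may normalize advantages or clip the value term differently:

```python
import torch

def ppo_loss(logp_new, logp_old, advantages, values, returns, entropy,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.05):
    """Clipped PPO loss wired with this run's coefficients."""
    ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()     # clipped surrogate
    value_loss = value_coef * (returns - values).pow(2).mean()
    entropy_bonus = entropy_coef * entropy.mean()     # exploration pressure
    return policy_loss + value_loss - entropy_bonus
```

With identical old/new log-probs the ratio is 1, so the policy term reduces to minus the mean advantage; the entropy bonus (0.05 here, passed via `--entropy-coef`) is subtracted to discourage premature policy collapse.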
## Periodic eval on hard (from `pyre_ppo_hard_v2_eval.csv`)

| Episode | Difficulty | Success rate | Reward mean | Steps mean |
|---|---|---|---|---|
| 25 | hard | 0% | -10.124 | 58.0 |
| 50 | hard | 0% | -11.184 | 58.4 |
| 75 | hard | 0% | -11.468 | 35.6 |
| 100 | hard | 0% | -9.827 | 74.0 |
| 125 | hard | 20% | -7.792 | 25.0 |
| 150 | hard | 40% | -4.238 | 28.0 |
| 175 | hard | 20% | -6.674 | 35.2 |
| 200 | hard | 0% | -12.304 | 74.6 |
| 225 | hard | 0% | -11.080 | 100.0 |
| 250 | hard | 20% | -5.648 | 38.4 |
| 275 | hard | 0% | -10.368 | 76.2 |
| 300 | hard | 20% | -4.421 | 72.8 |
| 325 | hard | 0% | -11.180 | 48.2 |
| 350 | hard | 0% | -9.845 | 74.0 |
| 375 | hard | 0% | -11.320 | 26.4 |
| 400 | hard | 0% | -12.256 | 34.0 |
| 425 | hard | 20% | -7.024 | 36.4 |
| 450 | hard | 0% | -10.726 | 56.4 |
| 475 | hard | 0% | -9.072 | 88.6 |
| 500 | hard | 0% | -12.050 | 66.6 |
| 525 | hard | 20% | -5.528 | 41.6 |
| 550 | hard | 0% | -11.274 | 52.4 |
| 575 | hard | 0% | -10.578 | 58.4 |
| 600 | hard | 0% | -12.068 | 36.6 |
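Averaged over all 24 checkpoints, the deterministic hard-eval success rate comes to roughly 6.7% (each point is only 5 rollouts, so individual rows are noisy), in line with the low hard-difficulty evacuation rate in the training summary:

```python
# Success rates (%) for the 24 hard-eval checkpoints listed above.
success = [0, 0, 0, 0, 20, 40, 20, 0, 0, 20, 0, 20,
           0, 0, 0, 0, 20, 0, 0, 0, 20, 0, 0, 0]
mean_success = sum(success) / len(success)
print(f"{mean_success:.1f}%")  # 6.7%
```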
## Files in this repository

| File | Description |
|---|---|
| `model.pt` | PyTorch checkpoint (network_state, optimizer_state, scheduler_state, args, episode) |
| `training_graph.png` | Training curves (reward + success rate vs episode) |
| `episode_metrics.csv` | Per-episode training metrics |
| `eval_metrics.csv` | Periodic eval aggregates |
| `training.log` | Full console transcript of the HTTP training run |
## Running inference locally

```python
import sys
from collections import deque

import numpy as np
import torch
from huggingface_hub import hf_hub_download

# Make the training package importable (a local clone of the pyre_env repo).
sys.path.insert(0, "pyre_env")
from training.ppo.train_torch_ppo import (
    ActorCritic,
    ObservationEncoder,
    action_index_to_env_action,
    build_action_mask,
)

# Download and load the raw PyTorch checkpoint.
ckpt_path = hf_hub_download(repo_id="krooz/pyre-ppo-agent", filename="model.pt")
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
saved_args = ckpt["args"]

# Rebuild the network with the same encoder and architecture settings.
encoder = ObservationEncoder(mode=saved_args.get("observation_mode", "visible"))
hidden_sizes = tuple(int(x) for x in saved_args.get("hidden_sizes", "512,256,128").split(","))
history_length = saved_args.get("history_length", 4)
input_dim = encoder.base_dim * history_length
network = ActorCritic(input_dim, 41, hidden_sizes)
network.load_state_dict(ckpt["network_state"])
network.eval()
print(f"Loaded checkpoint from episode {ckpt.get('episode', '?')}")

# Roll out one deterministic episode with 4-frame stacking.
from openenv_pyre import PyreEnvironment

env = PyreEnvironment()
obs = env.reset(difficulty="medium")
frames = deque([np.zeros(encoder.base_dim, dtype=np.float32)] * history_length,
               maxlen=history_length)
frames.append(encoder.encode(obs))
total_reward = 0.0
with torch.no_grad():
    while True:
        state_vec = np.concatenate(list(frames), dtype=np.float32)
        obs_t = torch.tensor(state_vec, dtype=torch.float32).unsqueeze(0)
        mask_t = torch.tensor(build_action_mask(obs, exclude_look=True),
                              dtype=torch.float32).unsqueeze(0)
        action_t, _, _ = network.act(obs_t, mask_t, deterministic=True)
        obs = env.step(action_index_to_env_action(int(action_t.item())))
        total_reward += float(obs.reward or 0.0)
        frames.append(encoder.encode(obs))
        if obs.done:
            break
print(f"Episode finished: evacuated={obs.agent_evacuated} reward={total_reward:.3f}")
```
## Environment & training resources

- HF Space (live env): Krooz/pyre_env
- PPO training in Colab (HTTP to the Space): Pyre PPO training notebook on Google Colab
- Local HTTP trainer: `training/ppo/train_torch_ppo_http.py`
- Local in-process trainer: `training/ppo/train_torch_ppo.py`
- Notebook source: `training/ppo/pyre_ppo_training.ipynb`