Usage

This model outputs one reward per reasoning step, evaluating each step of a solution individually.

Babelscape/Qwen2.5-Math-7B-PRM800k-PDDL-r is a Process Reward Model (PRM) based on Qwen2.5-Math-7B-Instruct. It is trained with process-supervision data from PRM800K and with the planning-based supervision introduced in PDDL2PRM.

PDDL2PRM is the dataset introduced in:

Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
Raffaele Pisano and Roberto Navigli, ACL 2026

Project page & paper: https://babelscape.github.io/prm-meets-planning/

arXiv: https://arxiv.org/abs/2604.17957

The paper proposes using symbolic planning problems written in Planning Domain Definition Language (PDDL) to generate precise step-level rewards for reasoning trajectories. In PDDL, actions, states, preconditions, effects, and goals are explicitly defined, so intermediate reasoning steps can be evaluated automatically.
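To illustrate why explicit preconditions and effects make step-level labels cheap to compute, here is a minimal sketch of a STRIPS-style validity check. It is not the paper's pipeline and does not parse real PDDL: the `Action` class, the `label_steps` helper, the two-action blocks-style domain, and the rule that every step after the first invalid one is also labeled 0 are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground action with set-based preconditions and effects (hypothetical)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


def label_steps(initial_state, plan, actions):
    """Label each plan step 1 if it is applicable in the current state, else 0.

    Assumption of this sketch: once a step is invalid, the state is no longer
    updated and all later steps are labeled 0 as well.
    """
    state = set(initial_state)
    labels = []
    valid = True
    for name in plan:
        action = actions.get(name)
        if valid and action is not None and action.preconditions <= state:
            # Apply the action: remove deleted facts, then add new ones.
            state = (state - action.del_effects) | action.add_effects
            labels.append(1)
        else:
            valid = False
            labels.append(0)
    return labels, state


# Toy domain: pick up block a, then stack it on block b.
ACTIONS = {
    "pickup_a": Action(
        "pickup_a",
        frozenset({"clear a", "ontable a", "handempty"}),
        frozenset({"holding a"}),
        frozenset({"clear a", "ontable a", "handempty"}),
    ),
    "stack_a_b": Action(
        "stack_a_b",
        frozenset({"holding a", "clear b"}),
        frozenset({"on a b", "clear a", "handempty"}),
        frozenset({"holding a", "clear b"}),
    ),
}
INITIAL = {"clear a", "ontable a", "clear b", "ontable b", "handempty"}
GOAL = {"on a b"}

good_labels, good_state = label_steps(INITIAL, ["pickup_a", "stack_a_b"], ACTIONS)
bad_labels, _ = label_steps(INITIAL, ["pickup_a", "pickup_a", "stack_a_b"], ACTIONS)

print(good_labels, GOAL <= good_state)  # [1, 1] True
print(bad_labels)                       # [1, 0, 0]
```

In the second plan, the repeated `pickup_a` fails because "clear a" no longer holds after the first pickup, so the check pinpoints the exact step where the trajectory goes wrong without any human annotation.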

Example

import torch
from transformers import AutoTokenizer, AutoModel

# Repository ID as named in this model card.
repo_id = "Babelscape/Qwen2.5-Math-7B-PRM800k-PDDL-r"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

def build_prompt(problem, steps):
    # Each step is followed by a line holding only the marker "ки".
    steps_text = "\n".join([f"Step {i+1}: {step}\nки" for i, step in enumerate(steps)])
    return f"Problem: {problem}\nSteps:\n{steps_text}"

problem = "If x + 3 = 10, find x."
steps = [
    "Subtract 3 from both sides: x = 10 - 3.",
    "So x = 7."
]

prompt = build_prompt(problem, steps)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# pred_scalar holds one value per input token.
pred_scalar = outputs["pred_scalar"]

# Locate the marker tokens and read the scores at those positions only.
marker_id = tokenizer.encode("ки", add_special_tokens=False)[0]
marker_positions = (inputs["input_ids"][0] == marker_id).nonzero(as_tuple=True)[0]
step_scores = torch.sigmoid(pred_scalar[0, marker_positions]).cpu().tolist()
print("Step scores:", step_scores)

# Index of the first step scoring below 0.5, or -1 if every step passes.
first_bad = next((i for i, score in enumerate(step_scores) if score < 0.5), -1)
print("First failing step index:", first_bad)
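
To make the input layout concrete, the sketch below reuses `build_prompt` from the example above and prints the string it produces for the two-step solution. It needs no model or tokenizer. The layout shown is simply what that function emits, not a separately documented specification.

```python
def build_prompt(problem, steps):
    # Same layout as the example above: each step on its own line,
    # followed by a line containing only the marker "ки".
    steps_text = "\n".join(f"Step {i+1}: {step}\nки" for i, step in enumerate(steps))
    return f"Problem: {problem}\nSteps:\n{steps_text}"


prompt = build_prompt(
    "If x + 3 = 10, find x.",
    ["Subtract 3 from both sides: x = 10 - 3.", "So x = 7."],
)
print(prompt)
# Problem: If x + 3 = 10, find x.
# Steps:
# Step 1: Subtract 3 from both sides: x = 10 - 3.
# ки
# Step 2: So x = 7.
# ки
```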

Notes

The marker "ки" must appear after every reasoning step (in the example above, on its own line following the step).
pred_scalar contains one scalar per token, so only values at marker positions should be used as step scores.
A threshold such as 0.5 can be used to identify potentially incorrect steps.
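
The selection logic described in these notes can be sketched without loading the model. The snippet below uses plain Python lists in place of tensors. The token IDs, the marker ID `99`, and the per-token values are made-up illustration data, not real tokenizer or model output, and the helper names are hypothetical.

```python
import math


def step_scores_from_token_values(input_ids, token_values, marker_id):
    """Apply a sigmoid to the per-token values at marker positions only."""
    return [
        1.0 / (1.0 + math.exp(-value))
        for token, value in zip(input_ids, token_values)
        if token == marker_id
    ]


def first_failing_step(scores, threshold=0.5):
    """Return the index of the first score below the threshold, or -1 if none."""
    return next((i for i, s in enumerate(scores) if s < threshold), -1)


# Hypothetical data: three steps, each closed by the marker token.
MARKER_ID = 99
input_ids = [11, 12, MARKER_ID, 13, 14, MARKER_ID, 15, MARKER_ID]
token_values = [0.3, -1.2, 2.0, 0.1, 0.7, -1.0, 0.0, 3.0]

scores = step_scores_from_token_values(input_ids, token_values, MARKER_ID)
print([round(s, 3) for s in scores])  # [0.881, 0.269, 0.953]
print(first_failing_step(scores))     # 1
```

Values at non-marker positions are ignored entirely, which is why a missing marker silently drops a step from the results.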

Citation

If you use this model or the PDDL2PRM dataset in your work, please cite:

@inproceedings{pisano2026prmplanning,
  title={Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards},
  author={Pisano, Raffaele and Navigli, Roberto},
  booktitle={Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2026},
  note={Accepted}
}


Model size (Safetensors): 8B params
Tensor types: F32, BF16


Paper: arXiv 2604.17957, published Apr 20