Usage

This model outputs a reward for each reasoning step in a solution; the reward serves as an evaluation of that step.

Babelscape/Qwen2.5-Math-7B-PRM800k-r is a Process Reward Model (PRM) based on Qwen2.5-Math-7B-Instruct. It is trained with process-supervision data from PRM800K.

Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
Raffaele Pisano and Roberto Navigli, ACL 2026

Project page & paper: https://babelscape.github.io/prm-meets-planning/
arXiv: https://arxiv.org/abs/2604.17957

Example

import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "Babelscape/Qwen2.5-Math-7B-PRM800k-r"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

def build_prompt(problem, steps):
    # Each step is followed by the marker "ки" on its own line.
    steps_text = "\n".join(f"Step {i+1}: {step}\nки" for i, step in enumerate(steps))
    return f"Problem: {problem}\nSteps:\n{steps_text}"

problem = "If x + 3 = 10, find x."
steps = [
    "Subtract 3 from both sides: x = 10 - 3.",
    "So x = 7.",
]

prompt = build_prompt(problem, steps)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# pred_scalar holds one scalar per token; keep only the marker positions.
pred_scalar = outputs["pred_scalar"]
marker_id = tokenizer.encode("ки", add_special_tokens=False)[0]
marker_positions = (inputs["input_ids"][0] == marker_id).nonzero(as_tuple=True)[0]
step_scores = torch.sigmoid(pred_scalar[0, marker_positions]).cpu().tolist()
print("Step scores:", step_scores)

# Index of the first step scoring below 0.5, or -1 if no step does.
first_bad = next((i for i, score in enumerate(step_scores) if score < 0.5), -1)
print("First failing step index:", first_bad)

Notes

The marker "ки" must appear after every reasoning step.
pred_scalar contains one scalar per token, so only values at marker positions should be used as step scores.
A threshold such as 0.5 can be used to identify potentially incorrect steps.
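
The notes above can be illustrated without loading the model. What follows is a minimal sketch, assuming the token ids and per-token scalars have already been converted to plain Python lists. The helper names `scores_at_markers` and `first_failing_step`, the toy token ids, and the marker id 99 are hypothetical and are not part of the model's API.

```python
import math
from typing import List, Sequence


def scores_at_markers(
    token_ids: Sequence[int],
    per_token_scalars: Sequence[float],
    marker_id: int,
    num_steps: int,
) -> List[float]:
    """Return sigmoid(scalar) at each marker position, one value per step."""
    if len(token_ids) != len(per_token_scalars):
        raise ValueError("expected one scalar per token")
    scores = [
        1.0 / (1.0 + math.exp(-scalar))
        for token, scalar in zip(token_ids, per_token_scalars)
        if token == marker_id
    ]
    # The marker must follow every step, so the two counts have to agree.
    if len(scores) != num_steps:
        raise ValueError(f"found {len(scores)} markers for {num_steps} steps")
    return scores


def first_failing_step(step_scores: Sequence[float], threshold: float = 0.5) -> int:
    """Return the index of the first score below threshold, or -1 if none."""
    return next((i for i, s in enumerate(step_scores) if s < threshold), -1)


# Toy data (hypothetical): two steps, each closed by the marker id 99.
toy_token_ids = [11, 12, 99, 13, 99]
toy_scalars = [0.3, -1.2, 2.0, 0.1, -2.0]

toy_scores = scores_at_markers(toy_token_ids, toy_scalars, marker_id=99, num_steps=2)
print("Step scores:", toy_scores)
print("First failing step index:", first_failing_step(toy_scores))
```

With the real model, the token ids would come from `inputs["input_ids"][0].tolist()` and the scalars from `outputs["pred_scalar"][0].tolist()`. Checking the marker count against the number of steps is an optional safeguard that catches a prompt in which a marker was dropped or tokenized unexpectedly.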

Citation

If you use this model or the PDDL2PRM dataset in your work, please cite:

@inproceedings{pisano2026prmplanning,
  title={Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards},
  author={Pisano, Raffaele and Navigli, Roberto},
  booktitle={Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2026},
  note={Accepted}
}


Model size: 8B params (Safetensors)
Tensor type: F32, BF16
