---
library_name: transformers
tags:
- reward
- RM
- Code
- CodeScaler
license: mit
datasets:
- LARK-Lab/CodeScalerPair-51K
language:
- en
base_model:
- Skywork/Skywork-Reward-V2-Qwen3-8B
---

<h2 align="center">
  CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models
</h2>

<p align="center">
  <a href="">
    <img
      src="https://img.shields.io/badge/Paper-Arxiv-red?logo=arxiv&logoColor=red"
      alt="CodeScaler Paper on arXiv"
    />
  <a href="https://github.com/LARK-AI-Lab/CodeScaler">
    <img 
        src="https://img.shields.io/badge/GitHub-Code-181717?logo=github&logoColor=white" 
        alt="GitHub Code"
    />
  </a>
  <a href="https://lark-ai-lab.github.io/codescaler.github.io/">
    <img 
        src="https://img.shields.io/badge/GitHub-Page-4078c0?logo=github&logoColor=white" 
        alt="GitHub Page"
    />
  </a>
  <a href="https://huggingface.co/collections/LARK-Lab/codescaler">
    <img 
        src="https://img.shields.io/badge/Datasets-Hugging%20Face%20Data-orange?logo=huggingface&logoColor=yellow" 
        alt="Datasets on Hugging Face"
    />
  </a>
  <a href="https://huggingface.co/collections/LARK-Lab/codescaler">
    <img 
        src="https://img.shields.io/badge/CodeScaler-Hugging%20Face%20Model-FFCC00?logo=huggingface&logoColor=yellow" 
        alt="CodeScaler on Hugging Face"
    />
  </a>

  
</p>

## Overview

We propose **CodeScaler**, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. **CodeScaler** is trained on carefully curated preference data derived from verified code problems, and it incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization.

This model is the official CodeScaler-8B, fine-tuned from Skywork/Skywork-Reward-V2-Qwen3-8B on [LARK-Lab/CodeScalerPair-51K](https://huggingface.co/datasets/LARK-Lab/CodeScalerPair-51K).
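
The exact extraction and shaping rules are defined in the released code; as a rough illustration of the idea, the sketch below pulls the first fenced code block out of a model response and gates the reward on syntactic validity. The fence pattern, penalty value, and helper names here are assumptions made for this sketch, not the released implementation.

````python
# Illustrative sketch only: the regex, penalty value, and function names are
# assumptions for demonstration, not the released CodeScaler code.
import ast
import re

def extract_code(response: str) -> str:
    """Return the first fenced code block in a response, or the raw text."""
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

def shaped_reward(raw_score: float, response: str, invalid_penalty: float = -2.0) -> float:
    """Gate the reward-model score on syntactic validity of the extracted program."""
    code = extract_code(response)
    try:
        ast.parse(code)  # syntax-aware check: does the extracted text parse?
    except SyntaxError:
        return invalid_penalty  # keep invalid programs below any valid one
    return raw_score
````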

## Performance on RM-Bench
| Model                                | Code | Chat | Math | Safety | Easy | Normal | Hard | Avg  |
| ------------------------------------ | ---- | ---- | ---- | ------ | ---- | ------ | ---- | ---- |
| Skywork/Skywork-Reward-Llama-3.1-8B  | 54.5 | 69.5 | 60.6 | 95.7   | 89.0 | 74.7   | 46.6 | 70.1 |
| TIGER-Lab/AceCodeRM-7B               | 66.9 | 66.7 | 65.3 | 89.9   | 79.9 | 74.4   | 62.2 | 72.2 |
| TIGER-Lab/AceCodeRM-32B              | 72.1 | 73.7 | 70.5 | 88.0   | 84.5 | 78.3   | 65.5 | 76.1 |
| Skywork/Skywork-Reward-V2-Qwen3-1.7B | 72.3 | 69.6 | 71.4 | 92.9   | 92.8 | 82.3   | 54.5 | 76.6 |
| Skywork/Skywork-Reward-V2-Qwen3-4B   | 74.4 | 78.2 | 73.6 | 95.7   | 92.1 | 85.0   | 64.4 | 80.5 |
| Skywork/Skywork-Reward-V2-Qwen3-8B   | 73.6 | 80.6 | 75.0 | 96.5   | 91.8 | 85.5   | 67.0 | 80.5 |
| CodeScaler-1.7B                      | 73.1 | 74.4 | 74.7 | 93.1   | 91.7 | 83.2   | 61.5 | 78.8 |
| CodeScaler-4B                        | 76.3 | 80.4 | 79.0 | 95.8   | 92.9 | 86.5   | 69.2 | 82.9 |
| **CodeScaler-8B (this model)**       | 76.9 | 83.0 | 79.9 | 96.4   | 92.5 | 87.9   | 71.8 | 84.1 |

## Usage

### RM Scoring
````python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = 'LARK-Lab/CodeScaler-8B'

tokenizer = AutoTokenizer.from_pretrained(model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
reward_model.eval()

question = """\
Given an integer array nums and an integer k, return the total number of continuous subarrays whose sum equals k.
A subarray is a contiguous part of the array.
For example:
```
Input:
nums = [1, 1, 1], k = 2

Output:
2
```
"""

program_correct = """\
from collections import defaultdict

def subarraySum(nums, k):
    prefix = 0
    count = 0
    freq = defaultdict(int)
    freq[0] = 1  # Important: subarray starting from index 0

    for num in nums:
        prefix += num

        if prefix - k in freq:
            count += freq[prefix - k]

        freq[prefix] += 1

    return count
"""

# Sliding-window attempt: only valid when all numbers are strictly positive,
# so it miscounts on inputs containing zeros or negative numbers.
program_wrong = """\
def subarraySum(nums, k):
    left = 0
    curr_sum = 0
    count = 0

    for right in range(len(nums)):
        curr_sum += nums[right]

        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1

        if curr_sum == k:
            count += 1

    return count
"""


convs = [
    [
        {
            "content": question,
            "role": "user",
        },
        {
            "role": "assistant",
            "content": program
        }
    ] for program in [program_correct, program_wrong]
]


texts = [
    tokenizer.apply_chat_template(conv, tokenize=False)
    for conv in convs
]

toks = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=2048,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = reward_model(
        input_ids=toks["input_ids"].to(device),
        attention_mask=toks["attention_mask"].to(device),
    )
    scores = outputs.logits.squeeze(-1).cpu().tolist()


print("RM Scores:", scores)
# RM Scores: [6.5424089431762695, -0.0312652587890625]
````
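
Because CodeScaler is execution-free, it can also rerank sampled completions at test time (best-of-n) without running any code. The helper below is a hypothetical wrapper around the objects defined in the snippet above (`tokenizer`, `reward_model`, `device`), not part of the released API.

```python
# Best-of-n reranking sketch: `score_candidates` is a hypothetical helper,
# not part of the released CodeScaler API.
def score_candidates(question, candidates):
    convs = [
        [{"role": "user", "content": question},
         {"role": "assistant", "content": c}]
        for c in candidates
    ]
    texts = [tokenizer.apply_chat_template(conv, tokenize=False) for conv in convs]
    toks = tokenizer(texts, truncation=True, padding=True,
                     max_length=2048, return_tensors="pt")
    with torch.no_grad():
        logits = reward_model(
            input_ids=toks["input_ids"].to(device),
            attention_mask=toks["attention_mask"].to(device),
        ).logits
    return logits.squeeze(-1).cpu().tolist()

# In practice `candidates` would hold n completions sampled from a policy model.
candidates = [program_correct, program_wrong]
scores = score_candidates(question, candidates)
best = max(zip(scores, candidates), key=lambda pair: pair[0])[1]
print("Selected program:\n", best)
```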

### RL Training
Please refer to [https://github.com/LARK-AI-Lab/CodeScaler](https://github.com/LARK-AI-Lab/CodeScaler) for RL training details.
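
For orientation, most RL frameworks only need a function mapping batches of (prompt, completion) pairs to scalar rewards. Below is a minimal sketch of that shape, reusing the hypothetical `score_candidates` helper above; the actual training integration lives in the repository.

```python
# Hypothetical reward hook; the real RL integration is in the CodeScaler
# repository. Reuses `score_candidates` from the sketch above.
def reward_fn(prompts, completions):
    """Map a batch of (prompt, completion) pairs to scalar RM rewards."""
    return [score_candidates(p, [c])[0] for p, c in zip(prompts, completions)]
```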

## Citation
If you find our work helpful, please consider citing:
```
@misc{zhu2026codescalerscalingcodellm,
      title={CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models}, 
      author={Xiao Zhu and Xinyu Zhou and Boyu Zhu and Hanxu Hu and Mingzhe Du and Haotian Zhang and Huiming Wang and Zhijiang Guo},
      year={2026},
      eprint={2602.17684},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.17684}, 
}
```