---
library_name: transformers
tags:
- reward
- RM
- Code
- CodeScaler
license: mit
datasets:
- LARK-Lab/CodeScalerPair-51K
language:
- en
base_model:
- Skywork/Skywork-Reward-V2-Qwen3-1.7B
---

<h2 align="center">
CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models
</h2>

<p align="center">
  <a href="https://arxiv.org/abs/2602.17684">
    <img
      src="https://img.shields.io/badge/Paper-Arxiv-red?logo=arxiv&logoColor=red"
      alt="CodeScaler Paper on arXiv"
    />
  </a>
  <a href="https://github.com/LARK-AI-Lab/CodeScaler">
    <img
      src="https://img.shields.io/badge/GitHub-Code-181717?logo=github&logoColor=white"
      alt="GitHub Code"
    />
  </a>
  <a href="https://lark-ai-lab.github.io/codescaler.github.io/">
    <img
      src="https://img.shields.io/badge/GitHub-Page-4078c0?logo=github&logoColor=white"
      alt="GitHub Page"
    />
  </a>
  <a href="https://huggingface.co/collections/LARK-Lab/codescaler">
    <img
      src="https://img.shields.io/badge/Datasets-Hugging%20Face%20Data-orange?logo=huggingface&logoColor=yellow"
      alt="Datasets on Hugging Face"
    />
  </a>
  <a href="https://huggingface.co/collections/LARK-Lab/codescaler">
    <img
      src="https://img.shields.io/badge/CodeScaler-Hugging%20Face%20Model-FFCC00?logo=huggingface&logoColor=yellow"
      alt="CodeScaler on Hugging Face"
    />
  </a>
</p>

## Overview

We propose **CodeScaler**, an execution-free reward model designed to scale both reinforcement learning (RL) training and test-time inference for code generation. **CodeScaler** is trained on curated preference data derived from verified code problems. It incorporates syntax-aware code extraction and validity-preserving reward shaping to keep optimization stable and robust.

This repository hosts the official CodeScaler-1.7B model, trained from `Skywork/Skywork-Reward-V2-Qwen3-1.7B` on [LARK-Lab/CodeScalerPair-51K](https://huggingface.co/datasets/LARK-Lab/CodeScalerPair-51K).

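This card does not spell out how syntax-aware extraction or validity-preserving shaping are implemented. Purely as an illustration, and not as the authors' method, the sketch below shows one way such a step could look for Python responses: pull the last fenced block that parses with `ast.parse`, and fall back to a fixed low reward when nothing parses. The names `extract_code`, `shaped_reward`, `score_fn`, and the constant `INVALID_REWARD` are hypothetical.

```python
import ast
import re

# Build the fence marker programmatically so this snippet can itself live in a fenced block.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)

# Hypothetical floor reward for responses with no parseable code (an assumption).
INVALID_REWARD = -1.0


def extract_code(response):
    """Return the last fenced block that parses as Python, or None if none does."""
    # If the response has no fences, treat the whole response as a candidate.
    candidates = FENCE_RE.findall(response) or [response]
    for candidate in reversed(candidates):
        try:
            ast.parse(candidate)
        except (SyntaxError, ValueError):
            continue
        return candidate.strip()
    return None


def shaped_reward(response, score_fn):
    """Score extracted code with score_fn; use the floor reward when code is invalid."""
    code = extract_code(response)
    if code is None:
        return INVALID_REWARD
    return score_fn(code)
```

Here `score_fn` stands in for a call to the reward model; the RM Scoring example in the Usage section shows how scores are actually obtained.
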
## Performance on RM-Bench

| Model                                | Code | Chat | Math | Safety | Easy | Normal | Hard | Avg  |
| ------------------------------------ | ---- | ---- | ---- | ------ | ---- | ------ | ---- | ---- |
| Skywork/Skywork-Reward-Llama-3.1-8B  | 54.5 | 69.5 | 60.6 | 95.7   | 89.0 | 74.7   | 46.6 | 70.1 |
| TIGER-Lab/AceCodeRM-7B               | 66.9 | 66.7 | 65.3 | 89.9   | 79.9 | 74.4   | 62.2 | 72.2 |
| TIGER-Lab/AceCodeRM-32B              | 72.1 | 73.7 | 70.5 | 88.0   | 84.5 | 78.3   | 65.5 | 76.1 |
| Skywork/Skywork-Reward-V2-Qwen3-1.7B | 72.3 | 69.6 | 71.4 | 92.9   | 92.8 | 82.3   | 54.5 | 76.6 |
| Skywork/Skywork-Reward-V2-Qwen3-4B   | 74.4 | 78.2 | 73.6 | 95.7   | 92.1 | 85.0   | 64.4 | 80.5 |
| Skywork/Skywork-Reward-V2-Qwen3-8B   | 73.6 | 80.6 | 75.0 | 96.5   | 91.8 | 85.5   | 67.0 | 80.5 |
| **CodeScaler-1.7B (this model)**     | 73.1 | 74.4 | 74.7 | 93.1   | 91.7 | 83.2   | 61.5 | 78.8 |
| CodeScaler-4B                        | 76.3 | 80.4 | 79.0 | 95.8   | 92.9 | 86.5   | 69.2 | 82.9 |
| CodeScaler-8B                        | 76.9 | 83.0 | 79.9 | 96.4   | 92.5 | 87.9   | 71.8 | 84.1 |

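To make the comparison with the base model easier to read, the short snippet below computes the per-column difference between CodeScaler-1.7B and Skywork/Skywork-Reward-V2-Qwen3-1.7B. The numbers are copied from the two corresponding rows of the table; the variable names are only illustrative.

```python
COLUMNS = ["Code", "Chat", "Math", "Safety", "Easy", "Normal", "Hard", "Avg"]

# Rows copied from the RM-Bench table above.
base = [72.3, 69.6, 71.4, 92.9, 92.8, 82.3, 54.5, 76.6]        # Skywork-Reward-V2-Qwen3-1.7B
codescaler = [73.1, 74.4, 74.7, 93.1, 91.7, 83.2, 61.5, 78.8]  # CodeScaler-1.7B

# Difference per column, rounded to the table's one-decimal precision.
gains = {
    name: round(new - old, 1)
    for name, old, new in zip(COLUMNS, base, codescaler)
}

for name, gain in gains.items():
    print(f"{name:>6}: {gain:+.1f}")
```

On these rows the largest change is on the Hard split (+7.0), the average rises by 2.2 points, and the Easy split dips by 1.1 points.
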
## Usage

### RM Scoring

````python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "LARK-Lab/CodeScaler-1.7B"

tokenizer = AutoTokenizer.from_pretrained(model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
reward_model.eval()

question = """\
Given an integer array nums and an integer k, return the total number of continuous subarrays whose sum equals k.
A subarray is a contiguous part of the array.
For example:
```
Input:
nums = [1, 1, 1], k = 2

Output:
2
```
"""

# Correct solution: prefix sums with a frequency map.
program_correct = """\
from collections import defaultdict

def subarraySum(nums, k):
    prefix = 0
    count = 0
    freq = defaultdict(int)
    freq[0] = 1  # Important: subarray starting from index 0

    for num in nums:
        prefix += num

        if prefix - k in freq:
            count += freq[prefix - k]

        freq[prefix] += 1

    return count
"""

# Incorrect solution: a sliding window that fails on inputs with zeros or negative numbers.
program_wrong = """\
def subarraySum(nums, k):
    left = 0
    curr_sum = 0
    count = 0

    for right in range(len(nums)):
        curr_sum += nums[right]

        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1

        if curr_sum == k:
            count += 1

    return count
"""

# One conversation per candidate program: the question as the user turn,
# the program as the assistant turn.
convs = [
    [
        {
            "content": question,
            "role": "user",
        },
        {
            "role": "assistant",
            "content": program,
        },
    ]
    for program in [program_correct, program_wrong]
]

# Render each conversation with the model's chat template.
texts = [
    tokenizer.apply_chat_template(conv, tokenize=False)
    for conv in convs
]

toks = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=2048,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = reward_model(
        input_ids=toks["input_ids"].to(device),
        attention_mask=toks["attention_mask"].to(device),
    )
    # One scalar score per conversation; a higher score means a preferred response.
    scores = outputs.logits.squeeze(-1).cpu().tolist()

print("RM Scores:", scores)
# RM Scores: [12.513851165771484, -0.46548914909362793]
````

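For test-time inference, scores like the ones above can be used to rerank sampled programs. The helper below is a minimal best-of-N sketch under that assumption; `best_of_n` and `pair_is_ranked_correctly` are illustrative names, not part of this repository or of `transformers`.

```python
from typing import Sequence


def best_of_n(candidates: Sequence[str], scores: Sequence[float]) -> str:
    """Return the candidate with the highest reward-model score."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("candidates and scores must be non-empty and the same length")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return candidates[best_index]


def pair_is_ranked_correctly(chosen_score: float, rejected_score: float) -> bool:
    """A preference pair counts as correct when the chosen response scores higher."""
    return chosen_score > rejected_score
```

With the example above, `best_of_n([program_correct, program_wrong], scores)` would return `program_correct` for the printed scores, since 12.51 is greater than -0.47.
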
### RL Training

Please refer to [https://github.com/LARK-AI-Lab/CodeScaler](https://github.com/LARK-AI-Lab/CodeScaler) for RL training details.

## Citation

If you find our work helpful, please consider citing:

```bibtex
@misc{zhu2026codescalerscalingcodellm,
      title={CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models},
      author={Xiao Zhu and Xinyu Zhou and Boyu Zhu and Hanxu Hu and Mingzhe Du and Haotian Zhang and Huiming Wang and Zhijiang Guo},
      year={2026},
      eprint={2602.17684},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.17684},
}
```