---
library_name: transformers
tags:
- reward
- RM
- Code
- CodeScaler
license: mit
datasets:
- LARK-Lab/CodeScalerPair-51K
language:
- en
base_model:
- Skywork/Skywork-Reward-V2-Qwen3-1.7B
---

# CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

[Paper on arXiv](https://arxiv.org/abs/2602.17684) | [GitHub Code](https://github.com/LARK-AI-Lab/CodeScaler) | GitHub Page | [Dataset on Hugging Face](https://huggingface.co/datasets/LARK-Lab/CodeScalerPair-51K) | [CodeScaler on Hugging Face](https://huggingface.co/LARK-Lab/CodeScaler-1.7B)

## Overview

We propose **CodeScaler**, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. **CodeScaler** is trained on carefully curated preference data derived from verified code problems, and it incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization.

This model is the official CodeScaler-1.7B, trained from Skywork/Skywork-Reward-V2-Qwen3-1.7B on [LARK-Lab/CodeScalerPair-51K](https://huggingface.co/datasets/LARK-Lab/CodeScalerPair-51K).

## Performance on RM-Bench

| Model | Code | Chat | Math | Safety | Easy | Normal | Hard | Avg |
| ------------------------------------ | ---- | ---- | ---- | ------ | ---- | ------ | ---- | ---- |
| Skywork/Skywork-Reward-Llama-3.1-8B | 54.5 | 69.5 | 60.6 | 95.7 | 89.0 | 74.7 | 46.6 | 70.1 |
| TIGER-Lab/AceCodeRM-7B | 66.9 | 66.7 | 65.3 | 89.9 | 79.9 | 74.4 | 62.2 | 72.2 |
| TIGER-Lab/AceCoder-RM-32B | 72.1 | 73.7 | 70.5 | 88.0 | 84.5 | 78.3 | 65.5 | 76.1 |
| Skywork/Skywork-Reward-V2-Qwen3-1.7B | 72.3 | 69.6 | 71.4 | 92.9 | 92.8 | 82.3 | 54.5 | 76.6 |
| Skywork/Skywork-Reward-V2-Qwen3-4B | 74.4 | 78.2 | 73.6 | 95.7 | 92.1 | 85.0 | 64.4 | 80.5 |
| Skywork/Skywork-Reward-V2-Qwen3-8B | 73.6 | 80.6 | 75.0 | 96.5 | 91.8 | 85.5 | 67.0 | 80.5 |
| **CodeScaler-1.7B (this model)** | 73.1 | 74.4 | 74.7 | 93.1 | 91.7 | 83.2 | 61.5 | 78.8 |
| CodeScaler-4B | 76.3 | 80.4 | 79.0 | 95.8 | 92.9 | 86.5 | 69.2 | 82.9 |
| CodeScaler-8B | 76.9 | 83.0 | 79.9 | 96.4 | 92.5 | 87.9 | 71.8 | 84.1 |

## Usage

### RM Scoring

````python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "LARK-Lab/CodeScaler-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
reward_model.eval()

question = """\
Given an integer array nums and an integer k, return the total number of
continuous subarrays whose sum equals k. A subarray is a contiguous part
of the array. For example:
```
Input: nums = [1, 1, 1], k = 2
Output: 2
```
"""

program_correct = """\
from collections import defaultdict

def subarraySum(nums, k):
    prefix = 0
    count = 0
    freq = defaultdict(int)
    freq[0] = 1  # Important: counts subarrays starting from index 0
    for num in nums:
        prefix += num
        if prefix - k in freq:
            count += freq[prefix - k]
        freq[prefix] += 1
    return count
"""

program_wrong = """\
def subarraySum(nums, k):
    left = 0
    curr_sum = 0
    count = 0
    for right in range(len(nums)):
        curr_sum += nums[right]
        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1
        if curr_sum == k:
            count += 1
    return count
"""

convs = [
    [
        {"role": "user", "content": question},
        {"role": "assistant", "content": program},
    ]
    for program in [program_correct, program_wrong]
]

texts = [tokenizer.apply_chat_template(conv, tokenize=False) for conv in convs]
toks = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=2048,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = reward_model(
        input_ids=toks["input_ids"].to(device),
        attention_mask=toks["attention_mask"].to(device),
    )

scores = outputs.logits.squeeze(-1).cpu().tolist()
print("RM Scores:", scores)
# RM Scores: [12.513851165771484, -0.46548914909362793]
````

### RL Training

Please refer to [https://github.com/LARK-AI-Lab/CodeScaler](https://github.com/LARK-AI-Lab/CodeScaler) for RL training details.

## Citation

If you find our work helpful, please consider citing:

```
@misc{zhu2026codescalerscalingcodellm,
      title={CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models},
      author={Xiao Zhu and Xinyu Zhou and Boyu Zhu and Hanxu Hu and Mingzhe Du and Haotian Zhang and Huiming Wang and Zhijiang Guo},
      year={2026},
      eprint={2602.17684},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.17684},
}
```
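A common way to use a reward model like this at test time is best-of-N reranking: sample several candidate programs, score each with the RM, and keep the highest-scoring one. The sketch below is illustrative only; `rerank_best_of_n` and `toy_score_fn` are hypothetical helpers, and in practice `score_fn` would wrap the batched scoring call shown under RM Scoring above.

```python
# Best-of-N test-time reranking sketch (hypothetical helper, not part of the
# CodeScaler API). `score_fn` stands in for the batched reward-model call.
def rerank_best_of_n(question, candidates, score_fn):
    """Score every candidate program and return the best one with its score."""
    scores = score_fn(question, candidates)
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Toy stand-in scorer for illustration only: prefers candidates that define
# the expected function name. A real scorer returns RM logits as floats.
def toy_score_fn(question, candidates):
    return [1.0 if "def subarraySum" in c else -1.0 for c in candidates]

candidates = [
    "def solve(nums, k): return 0",
    "def subarraySum(nums, k):\n    ...",
]
best_program, best_score = rerank_best_of_n("...", candidates, toy_score_fn)
print(best_score)  # 1.0
```

With the real model, replace `toy_score_fn` with a function that builds the chat-templated texts and returns `outputs.logits.squeeze(-1).cpu().tolist()` exactly as in the RM Scoring snippet.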