---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen3.6-27B
pipeline_tag: reinforcement-learning
tags:
- CUDA
- MUSA
- GPU-Kernel
- Reinforcement-Learning
---
<div align="left">
<img src="./assets/moore_threads_logo.png" width="120" alt="Moore Threads Logo" />
</div>
<!-- <h1 align="center">MusaCoder-27B</h1> -->
<h1 align="center">
<strong>MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU</strong>
</h1>
<!-- <p align="center">
Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, <br>
Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang
</p> -->
<p align="center">
<a href="https://arxiv.org/abs/2606.04847">📄 Paper</a>
</p>

---
<div align="center">
<img src="./assets/kernelbench_bar.png" width="900" alt="KernelBench Benchmark Results" />
</div>

# MusaCoder-27B
> This repository contains model weights and configuration files for **MusaCoder-27B**, a specialized code generation model for native GPU kernel synthesis.
>
> MusaCoder-27B is designed to generate CUDA/MUSA native kernels from PyTorch reference implementations, with a focus on compilability, numerical correctness, anti-fallback legality, and empirical speedup.
## Introduction
**MusaCoder-27B** is a 27B-parameter code model developed by Moore Threads for **PyTorch-to-CUDA/MUSA native kernel generation**. Unlike general-purpose code models, MusaCoder focuses on low-level GPU programming tasks, including tensor shape reasoning, thread/block mapping, memory indexing, boundary handling, reduction strategies, numerical stability, and performance-oriented kernel optimization.
The model is trained through a full-stack post-training pipeline consisting of:
* multi-source supervised fine-tuning data construction;
* verifier-filtered rejection fine-tuning;
* execution-feedback reinforcement learning;
* strict native-kernel verification with MooreEval;
* CUDA/MUSA-oriented kernel repair and optimization data.
MusaCoder-27B is released to promote the development of the MUSA open-source ecosystem, facilitate research on LLM-based code generation and GPU kernel synthesis, and encourage the community to explore cross-platform native kernel optimization.
## Highlights
### Native CUDA/MUSA Kernel Generation
MusaCoder-27B is optimized for generating native GPU kernels from PyTorch reference code. The model is not intended for generic business code generation; instead, it targets low-level kernel authoring where generated code must compile, run correctly, satisfy task constraints, and achieve measurable speedup.
### MUSA-Oriented Kernel Synthesis
MusaCoder-27B supports PyTorch-to-MUSA kernel generation and can be used to explore automatic generation of MUSA native kernels from PyTorch reference programs. This gives the MUSA developer community a foundation-model capability and lowers the barrier to writing, validating, and optimizing MUSA kernels.
### Full-Stack Training Pipeline
MusaCoder-27B is trained with a full-stack pipeline:
* **SFT** teaches the model PyTorch-to-kernel task format, common kernel implementation patterns, GPU programming knowledge, review capability, and performance analysis.
* **RFT** uses execution-based verification to select correct model-generated implementations while preserving implementation diversity.
* **RL** uses real compilation, execution, correctness checking, anti-fallback detection, and runtime measurement as reward signals.
### Execution-Based Verification
MusaCoder is developed together with **MooreEval**, an execution-based verifier and reward environment. MooreEval checks whether generated kernels:
* can be parsed and compiled;
* pass randomized correctness tests against PyTorch reference outputs;
* avoid forbidden PyTorch/ATen computational fallbacks;
* achieve real runtime speedup under synchronized event timing.
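The randomized correctness step can be pictured with a small, pure-Python sketch. Lists stand in for tensors, and the function names, tolerances, trial count, and the fixed edge case are all assumptions made for illustration; MooreEval's actual procedure may differ.

```python
import math
import random

def reference_relu(xs):
    """Reference computation (stand-in for the PyTorch model)."""
    return [max(0.0, x) for x in xs]

def candidate_relu(xs):
    """Candidate under test (stand-in for a generated kernel)."""
    return [x if x > 0.0 else 0.0 for x in xs]

def buggy_relu(xs):
    """Deliberately wrong candidate: forgets to process the last element."""
    return [max(0.0, x) for x in xs[:-1]] + [xs[-1]]

def check_randomized(candidate, reference, trials=20, max_len=257,
                     rtol=1e-5, atol=1e-6, seed=0):
    """Return True if candidate matches reference on every test input."""
    rng = random.Random(seed)
    # One fixed edge case first, then random inputs of varying length.
    cases = [[1.0, 0.0, -1.0]]
    cases += [
        [rng.uniform(-10.0, 10.0) for _ in range(rng.randint(1, max_len))]
        for _ in range(trials)
    ]
    for xs in cases:
        got, want = candidate(xs), reference(xs)
        if len(got) != len(want):
            return False
        for g, w in zip(got, want):
            if not math.isclose(g, w, rel_tol=rtol, abs_tol=atol):
                return False
    return True

print(check_randomized(candidate_relu, reference_relu))
print(check_randomized(buggy_relu, reference_relu))
```

Varying the input length is what exposes boundary bugs such as the one in `buggy_relu`; a single fixed shape would often miss them.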
### RL Stabilization Techniques
The training pipeline incorporates three stabilization techniques:
* **PrimeEcho**: first-turn-anchored multi-turn reward for balancing repair ability and first-attempt quality.
* **Buffered Dynamic Retry**: converts all-failed groups into feedback-conditioned repair tasks.
* **MirrorPop**: sequence-level off-policy filtering based on absolute log-ratio deviation.
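As a rough illustration of sequence-level filtering on absolute log-ratio deviation, the sketch below drops rollouts whose current-policy and behavior-policy log-probabilities have drifted too far apart. This is not the MusaCoder implementation: the mean-per-token aggregation, the threshold value, and the field names `logp_current` and `logp_behavior` are assumptions chosen for illustration only.

```python
def sequence_log_ratio(logp_current, logp_behavior):
    """Mean per-token log(pi_current / pi_behavior) for one sequence."""
    if not logp_current or len(logp_current) != len(logp_behavior):
        raise ValueError("log-prob lists must be non-empty and equal length")
    diffs = [c - b for c, b in zip(logp_current, logp_behavior)]
    return sum(diffs) / len(diffs)

def filter_off_policy(batch, max_abs_log_ratio=0.2):
    """Keep sequences whose absolute log-ratio stays within the threshold.

    Taking the absolute value treats drift in either direction the same way.
    """
    kept = []
    for seq in batch:
        ratio = sequence_log_ratio(seq["logp_current"], seq["logp_behavior"])
        if abs(ratio) <= max_abs_log_ratio:
            kept.append(seq)
    return kept

batch = [
    {"id": "a", "logp_current": [-1.0, -2.0], "logp_behavior": [-1.1, -1.9]},
    {"id": "b", "logp_current": [-0.5, -0.5], "logp_behavior": [-1.5, -1.5]},
    {"id": "c", "logp_current": [-2.0], "logp_behavior": [-1.0]},
]
print([seq["id"] for seq in filter_off_policy(batch)])
```

Filtering whole sequences, rather than clipping individual tokens, removes a stale rollout from the update entirely.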
## Model Details
| Item | Description |
| --------------------- | -------------------------------------------------------- |
| Model name | MusaCoder-27B |
| Developer | Moore Threads |
| Base model | Qwen3.6-27B |
| Model type | Causal language model |
| Primary use | PyTorch-to-CUDA/MUSA native kernel generation |
| License | Apache License 2.0 |
| Training precision | bf16 |
| Recommended framework | Transformers / vLLM / SGLang-compatible inference |
## Intended Use
MusaCoder-27B is intended for research and development in:
* PyTorch-to-CUDA/MUSA kernel generation;
* native GPU kernel synthesis;
* code generation for accelerator programming;
* automatic kernel repair and optimization;
* MUSA ecosystem development;
* execution-feedback reinforcement learning for code models.
A typical input contains a PyTorch reference implementation, input constraints, and generation requirements. The model is expected to produce a `ModelNew` implementation using custom native CUDA/MUSA kernels.
## Quickstart
### Installation
```bash
pip install transformers accelerate torch
```
For high-throughput inference, users may also use vLLM or SGLang depending on their deployment environment.
### Basic Usage with Transformers
````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "MooreThreads/MusaCoder-27B"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = r"""
You are given a PyTorch reference implementation. Write a replacement ModelNew
that implements the same computation using a custom native CUDA/MUSA kernel.
Reference:
```python
import torch
import torch.nn as nn
class Model(nn.Module):
    def forward(self, x):
        return torch.relu(x)
```
Requirements:
* Define class ModelNew(nn.Module).
* Do not use forbidden PyTorch/ATen compute fallback in ModelNew.forward().
* The implementation must be compilable and numerically correct.
"""

messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=32000,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
````
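Before compiling anything, it usually helps to pull the generated code out of the response and check that it defines `ModelNew`. The helper below is a minimal sketch using only the standard library. It assumes the model wraps its answer in a fenced code block and that the last such block is the final answer. Neither assumption is guaranteed by the model, and the function names are hypothetical.

```python
import ast
import re

FENCE = "`" * 3  # built at runtime so this snippet nests cleanly in Markdown
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)

def extract_last_code_block(response):
    """Return the body of the last fenced Python block, or None if absent."""
    blocks = CODE_BLOCK.findall(response)
    return blocks[-1] if blocks else None

def defines_model_new(source):
    """True if the source parses and has a top-level class named ModelNew."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return any(
        isinstance(node, ast.ClassDef) and node.name == "ModelNew"
        for node in tree.body
    )

sample = (
    "Here is the kernel:\n"
    + FENCE + "python\nclass ModelNew:\n    pass\n" + FENCE + "\n"
)
code = extract_last_code_block(sample)
print(code is not None and defines_model_new(code))
```

This only checks the interface; the extracted code still has to be compiled and tested.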
## Prompt Format
We recommend using a structured prompt that includes:
1. PyTorch reference code;
2. input shape and dtype constraints;
3. target backend, e.g., CUDA or MUSA;
4. explicit instruction to define `ModelNew`;
5. anti-fallback constraints;
6. optional correctness and performance requirements.
Example:
```text
Given the following PyTorch reference model, generate a new implementation
class ModelNew(nn.Module) that uses custom native CUDA/MUSA kernels.
The generated implementation must:
- match the PyTorch reference numerically;
- compile successfully;
- avoid forbidden PyTorch/ATen compute fallback in forward();
- handle boundary cases correctly;
- prefer native kernel implementations over high-level library calls.
```
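The recommended components can be assembled programmatically. The builder below is a sketch, not an official template: the function name, argument names, and the "Input constraints" line are assumptions, while the requirement strings are copied from the Quickstart prompt.

```python
def build_kernel_prompt(reference_code, input_spec, backend="CUDA",
                        extra_requirements=()):
    """Assemble a structured prompt from the recommended components."""
    if backend not in ("CUDA", "MUSA"):
        raise ValueError(f"unsupported backend: {backend!r}")
    fence = "`" * 3  # built at runtime so this snippet nests cleanly in Markdown
    requirements = [
        "Define class ModelNew(nn.Module).",
        "Do not use forbidden PyTorch/ATen compute fallback in ModelNew.forward().",
        "The implementation must be compilable and numerically correct.",
        *extra_requirements,  # optional correctness/performance requirements
    ]
    parts = [
        "You are given a PyTorch reference implementation. Write a replacement "
        "ModelNew that implements the same computation using a custom native "
        f"{backend} kernel.",
        "Reference:",
        f"{fence}python\n{reference_code.strip()}\n{fence}",
        f"Input constraints: {input_spec}",
        "Requirements:",
        "\n".join(f"* {r}" for r in requirements),
    ]
    return "\n".join(parts)

prompt = build_kernel_prompt(
    "import torch\nimport torch.nn as nn\n",
    "x: float32, shape (1024, 1024)",
    backend="MUSA",
    extra_requirements=["Target a speedup above 1.1x over eager."],
)
print(prompt)
```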
## Evaluation
MusaCoder-27B is evaluated using the MooreEval protocol on KernelBench-style tasks.
The evaluation checks:
* code extraction and interface validity;
* compilation success;
* randomized correctness against PyTorch reference;
* forbidden PyTorch/ATen fallback detection;
* synchronized runtime measurement;
* Faster Rate with a speedup threshold of `>1.1x`.
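One plausible reading of the Faster Rate check is sketched below. It assumes the rate is a percentage over all evaluated samples and that a sample counts only when it is correct, free of fallbacks, and strictly more than 1.1x faster than the baseline. The exact denominator and the record fields (`correct`, `used_fallback`, `baseline_ms`, `kernel_ms`) are assumptions, not the MooreEval schema.

```python
def speedup(baseline_ms, kernel_ms):
    """Baseline runtime divided by kernel runtime (higher is better)."""
    return baseline_ms / kernel_ms

def faster_rate(results, threshold=1.1):
    """Percentage of results that are valid and exceed the speedup threshold."""
    if not results:
        return 0.0
    hits = sum(
        1
        for r in results
        if r["correct"]
        and not r["used_fallback"]
        and speedup(r["baseline_ms"], r["kernel_ms"]) > threshold
    )
    return 100.0 * hits / len(results)

results = [
    {"correct": True, "used_fallback": False, "baseline_ms": 2.0, "kernel_ms": 1.0},
    {"correct": True, "used_fallback": False, "baseline_ms": 1.0, "kernel_ms": 1.0},
    {"correct": False, "used_fallback": False, "baseline_ms": 3.0, "kernel_ms": 1.0},
    {"correct": True, "used_fallback": True, "baseline_ms": 2.0, "kernel_ms": 1.0},
]
print(faster_rate(results))
```

Gating on correctness and legality first means a fast but wrong kernel, or one that delegates to a library call, never counts as faster.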
### KernelBench Results
| Model | Overall Pass@8 | Overall Avg.@8 | Faster vs. Eager | Faster vs. Compile |
| -------------------- | -------------: | -------------: | ---------------: | -----------------: |
| Kimi K2.6 | 84.0 | 69.10 | 3.3 | 1.4 |
| GLM-5.1 | 85.6 | 76.25 | 7.4 | 3.9 |
| DeepSeek-V4_ProMax | 84.8 | 60.05 | 5.7 | 3.0 |
| Claude Opus 4.7 | 87.2 | 77.30 | 11.8 | 7.5 |
| Qwen3.6-27B | 67.2 | 35.60 | 3.4 | 1.6 |
| MusaCoder-27B-SFT | 84.8 | 79.40 | 6.3 | 4.1 |
| **MusaCoder-27B-RL** | **93.2** | **88.60** | **15.0** | **9.2** |
### MUSA KernelBench Results
| Model | Overall Pass@8 | Overall Avg.@8 | Faster vs. Eager |
| -------------------- | -------------: | -------------: | ---------------: |
| DeepSeek-V4-Pro | 92.0 | 56.9 | 5.7 |
| GLM-5.1 | 88.0 | 66.4 | 6.9 |
| MusaCoder-27B-SFT | 79.6 | 63.5 | 5.2 |
| **MusaCoder-27B-RL** | **92.4** | **81.7** | **12.5** |
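The Pass@8 and Avg.@8 columns can be read as follows, assuming the common convention that each task is sampled 8 times, Pass@8 counts tasks with at least one passing sample, and Avg.@8 averages the per-task pass fraction. The sketch uses toy data and these assumed definitions; it does not reproduce the tables above.

```python
def pass_at_k(outcomes):
    """Percentage of tasks where at least one sample passed."""
    return 100.0 * sum(any(samples) for samples in outcomes) / len(outcomes)

def avg_at_k(outcomes):
    """Mean per-task fraction of passing samples, as a percentage."""
    fractions = [sum(samples) / len(samples) for samples in outcomes]
    return 100.0 * sum(fractions) / len(outcomes)

# Four toy tasks with 8 pass/fail samples each.
outcomes = [
    [True] * 8,
    [False] * 7 + [True],
    [False] * 8,
    [True] * 4 + [False] * 4,
]
print(pass_at_k(outcomes), avg_at_k(outcomes))
```

Under these definitions, a large gap between the two numbers indicates a model that can solve a task occasionally but not reliably.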
## Notes on Generated Code
Generated kernels should always be compiled and tested before use. GPU kernel generation is a high-risk code generation task because small mistakes in indexing, boundary handling, dtype conversion, or memory layout can lead to incorrect outputs, runtime failures, or illegal memory access.
We recommend validating generated code with:
* randomized correctness tests;
* multiple input shapes and dtypes;
* non-contiguous tensor cases when applicable;
* runtime profiling;
* forbidden fallback detection.
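Forbidden fallback detection can be approximated with a static scan of `ModelNew.forward()`. The sketch below uses Python's `ast` module and a hypothetical deny-list. It will miss aliased imports and tensor-method calls such as `x.relu()`, so it is a first filter rather than a replacement for an execution-based verifier.

```python
import ast

# Hypothetical deny-list for illustration; a real verifier defines its own.
FORBIDDEN_CALLS = {"torch.relu", "torch.matmul", "torch.softmax", "F.relu", "F.softmax"}

def dotted_name(node):
    """Turn a Name/Attribute chain such as torch.relu into a dotted string."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # e.g. a call on the result of another call

def find_forbidden_calls(source, class_name="ModelNew", method="forward"):
    """List deny-listed calls found inside class_name.method."""
    hits = []
    for cls in ast.parse(source).body:
        if not (isinstance(cls, ast.ClassDef) and cls.name == class_name):
            continue
        for fn in cls.body:
            if not (isinstance(fn, ast.FunctionDef) and fn.name == method):
                continue
            for node in ast.walk(fn):
                if isinstance(node, ast.Call):
                    name = dotted_name(node.func)
                    if name in FORBIDDEN_CALLS:
                        hits.append(name)
    return hits

bad = "class ModelNew:\n    def forward(self, x):\n        return torch.relu(x)\n"
ok = "class ModelNew:\n    def forward(self, x):\n        return relu_ext.relu_forward(x)\n"
print(find_forbidden_calls(bad), find_forbidden_calls(ok))
```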
## Limitations
MusaCoder-27B is specialized for GPU kernel generation and may not be optimal for general-purpose chat or application development. The model may still generate code that:
* fails to compile;
* produces incorrect results for unseen edge cases;
* uses inefficient thread/block layouts;
* relies on disallowed high-level fallback APIs;
* requires additional engineering adaptation for specific platforms or compiler versions.
Users should treat generated code as a candidate implementation that must be verified before deployment.
## License
MusaCoder-27B is released under the Apache License 2.0.
MusaCoder-27B is initialized from Qwen3.6-27B and further trained on top of it. Users should comply with the license terms of MusaCoder-27B as well as the applicable license terms of upstream models and third-party components.
## Citation
If you find MusaCoder useful, please cite:
```bibtex
@article{cheng2026musacoder,
title={MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU},
author={Cheng, Kun and Lu, Songshuo and Liao, Sicong and Li, Tankun and Zhang, Yafei and Yang, Dong and Lv, Qiheng and Wang, Hua and Chen, Zhi and Tang, Yaohua},
journal={arXiv preprint arXiv:2606.04847},
year={2026},
eprint={2606.04847},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.04847}
}
```
## Acknowledgements
MusaCoder is developed by Moore Threads AI. We thank the open-source community for advancing GPU programming, code generation, and execution-feedback learning. We also acknowledge the upstream base model and software ecosystems that make this work possible.