---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen3.6-27B
pipeline_tag: reinforcement-learning
tags:
- CUDA
- MUSA
- GPU-Kernel
- Reinforcement-Learning
---



<div align="left">
  <img src="./assets/moore_threads_logo.png" width="120" alt="Moore Threads Logo" />
</div>

<!-- <h1 align="center">MusaCoder-27B</h1> -->

<h1 align="center">
  <strong>MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU</strong>
</h1>

<!-- <p align="center">
  Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, <br>
  Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang
</p> -->

<p align="center">
  <a href="https://arxiv.org/abs/2606.04847">📄 Paper</a>
</p>

---

<div align="center">
  <img src="./assets/kernelbench_bar.png" width="900" alt="KernelBench Benchmark Results" />
</div>

# MusaCoder-27B

> This repository contains model weights and configuration files for **MusaCoder-27B**, a specialized code generation model for native GPU kernel synthesis.
>
> MusaCoder-27B is designed to generate CUDA/MUSA native kernels from PyTorch reference implementations, with a focus on compilability, numerical correctness, anti-fallback legality, and empirical speedup.

## Introduction

**MusaCoder-27B** is a 27B-parameter code model developed by Moore Threads for **PyTorch-to-CUDA/MUSA native kernel generation**. Unlike general-purpose code models, MusaCoder focuses on low-level GPU programming tasks, including tensor shape reasoning, thread/block mapping, memory indexing, boundary handling, reduction strategies, numerical stability, and performance-oriented kernel optimization.

The model is trained through a full-stack post-training pipeline consisting of:

* multi-source supervised fine-tuning data construction;
* verifier-filtered rejection fine-tuning;
* execution-feedback reinforcement learning;
* strict native-kernel verification with MooreEval;
* CUDA/MUSA-oriented kernel repair and optimization data.

MusaCoder-27B is released to promote the development of the MUSA open-source ecosystem, facilitate research on LLM-based code generation and GPU kernel synthesis, and encourage the community to explore cross-platform native kernel optimization.

## Highlights

### Native CUDA/MUSA Kernel Generation

MusaCoder-27B is optimized for generating native GPU kernels from PyTorch reference code. The model is not intended for generic business code generation; instead, it targets low-level kernel authoring where generated code must compile, run correctly, satisfy task constraints, and achieve measurable speedup.

### MUSA-Oriented Kernel Synthesis

MusaCoder-27B supports PyTorch-to-MUSA kernel generation and can be used to explore automatic generation of MUSA native kernels from PyTorch reference programs. This gives the MUSA developer community a foundation model for kernel work and lowers the barrier to writing, validating, and optimizing MUSA kernels.

### Full-Stack Training Pipeline

MusaCoder-27B is trained with a full-stack pipeline:

* **SFT** teaches the model the PyTorch-to-kernel task format, common kernel implementation patterns, GPU programming knowledge, code review, and performance analysis.
* **RFT** uses execution-based verification to select correct model-generated implementations while preserving implementation diversity.
* **RL** uses real compilation, execution, correctness checking, anti-fallback detection, and runtime measurement as reward signals.

### Execution-Based Verification

MusaCoder is developed together with **MooreEval**, an execution-based verifier and reward environment. MooreEval checks whether generated kernels:

* can be parsed and compiled;
* pass randomized correctness tests against PyTorch reference outputs;
* avoid forbidden PyTorch/ATen computational fallbacks;
* achieve real runtime speedup under synchronized event timing.

### RL Stabilization Techniques

The training pipeline incorporates three stabilization techniques:

* **PrimeEcho**: first-turn-anchored multi-turn reward for balancing repair ability and first-attempt quality.
* **Buffered Dynamic Retry**: converts all-failed groups into feedback-conditioned repair tasks.
* **MirrorPop**: sequence-level off-policy filtering based on absolute log-ratio deviation.
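
To make the idea of sequence-level filtering by absolute log-ratio deviation concrete, here is a minimal sketch. It assumes the per-sequence statistic is the mean per-token log-ratio between the current and the behavior policy, and that a sequence is dropped when the absolute value exceeds a fixed threshold. The function names, the dictionary keys, and the threshold `0.5` are hypothetical. The exact MirrorPop formulation is given in the paper and may differ.

```python
def sequence_log_ratio(new_logprobs, old_logprobs):
    """Mean per-token log-ratio (log pi_new - log pi_old) for one sequence."""
    if len(new_logprobs) != len(old_logprobs) or not new_logprobs:
        raise ValueError("log-prob lists must be non-empty and of equal length")
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return sum(diffs) / len(diffs)


def filter_off_policy(batch, max_abs_log_ratio=0.5):
    """Keep sequences whose |mean log-ratio| stays within the threshold.

    Using the absolute value treats drift in both directions the same way:
    sequences that became much more likely and much less likely are dropped.
    """
    kept = []
    for seq in batch:
        ratio = sequence_log_ratio(seq["new_logprobs"], seq["old_logprobs"])
        if abs(ratio) <= max_abs_log_ratio:
            kept.append(seq)
    return kept


batch = [
    {"id": "a", "new_logprobs": [-1.0, -2.0], "old_logprobs": [-1.1, -1.9]},
    {"id": "b", "new_logprobs": [-0.2, -0.3], "old_logprobs": [-2.0, -2.5]},
    {"id": "c", "new_logprobs": [-3.0, -3.0], "old_logprobs": [-1.0, -1.0]},
]
print([seq["id"] for seq in filter_off_policy(batch)])
```

In this toy batch only sequence `a` stays close to the behavior policy. Sequences `b` and `c` deviate by a mean log-ratio of about +2 and -2 respectively, so both are filtered.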

## Model Details

| Item                  | Description                                              |
| --------------------- | -------------------------------------------------------- |
| Model name            | MusaCoder-27B                                            |
| Developer             | Moore Threads                                            |
| Base model            | Qwen3.6-27B                                              |
| Model type            | Causal language model                                    |
| Primary use           | PyTorch-to-CUDA/MUSA native kernel generation            |
| License               | Apache License 2.0                                       |
| Training precision    | bf16                                                     |
| Recommended framework | Transformers / vLLM / SGLang-compatible inference        |

## Intended Use

MusaCoder-27B is intended for research and development in:

* PyTorch-to-CUDA/MUSA kernel generation;
* native GPU kernel synthesis;
* code generation for accelerator programming;
* automatic kernel repair and optimization;
* MUSA ecosystem development;
* execution-feedback reinforcement learning for code models.

A typical input contains a PyTorch reference implementation, input constraints, and generation requirements. The model is expected to produce a `ModelNew` implementation using custom native CUDA/MUSA kernels.

## Quickstart

### Installation

```bash
pip install transformers accelerate torch
```

For high-throughput inference, users may also use vLLM or SGLang depending on their deployment environment.

### Basic Usage with Transformers

````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "MooreThreads/MusaCoder-27B"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = r"""
You are given a PyTorch reference implementation. Write a replacement ModelNew
that implements the same computation using a custom native CUDA/MUSA kernel.

Reference:
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def forward(self, x):
        return torch.relu(x)
```

Requirements:

* Define class ModelNew(nn.Module).
* Do not use forbidden PyTorch/ATen compute fallback in ModelNew.forward().
* The implementation must be compilable and numerically correct.
  """

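# Single-turn chat: the whole task description goes into one user message.
messages = [
    {"role": "user", "content": prompt},
]

# Render the chat template to text and append the assistant generation prompt.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Kernel solutions can be long, so leave a generous generation budget.
outputs = model.generate(
    **inputs,
    max_new_tokens=32000,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)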
````
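
Downstream tooling usually needs the generated module rather than the full chat response. The following is a minimal sketch of that extraction step, assuming the model wraps its answer in Markdown code fences and that the outer block is valid Python. `extract_model_new` is a hypothetical helper, not part of any released tooling.

```python
import ast
import re

FENCE = "`" * 3  # built dynamically so this snippet nests cleanly in Markdown
CODE_BLOCK = re.compile(FENCE + r"[a-zA-Z0-9_+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_model_new(response):
    """Return the last fenced block that defines class ModelNew, else None."""
    for block in reversed(CODE_BLOCK.findall(response)):
        try:
            tree = ast.parse(block)
        except SyntaxError:
            continue  # not valid Python, try an earlier block
        for node in tree.body:
            if isinstance(node, ast.ClassDef) and node.name == "ModelNew":
                return block
    return None


demo = (
    "Here is the kernel.\n"
    + FENCE + "python\n"
    + "class ModelNew:\n    def forward(self, x):\n        return x\n"
    + FENCE + "\nDone."
)
code = extract_model_new(demo)
print(code is not None and "class ModelNew" in code)
```

With the script above, `extract_model_new(response)` would return the candidate source or `None` when no block defines `ModelNew`, which is a convenient point to reject a sample before attempting compilation.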

## Prompt Format

We recommend using a structured prompt that includes:

1. PyTorch reference code;
2. input shape and dtype constraints;
3. target backend, e.g., CUDA or MUSA;
4. explicit instruction to define `ModelNew`;
5. anti-fallback constraints;
6. optional correctness and performance requirements.

Example:

```text
Given the following PyTorch reference model, generate a new implementation
class ModelNew(nn.Module) that uses custom native CUDA/MUSA kernels.

The generated implementation must:
- match the PyTorch reference numerically;
- compile successfully;
- avoid forbidden PyTorch/ATen compute fallback in forward();
- handle boundary cases correctly;
- prefer native kernel implementations over high-level library calls.
```
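
The six recommended components can also be assembled programmatically. The sketch below is one possible layout, not an official template: `build_prompt`, its parameters, and the exact wording of the section labels are assumptions for illustration.

```python
def build_prompt(reference_code, input_spec, backend="MUSA", extra_requirements=()):
    """Assemble a generation prompt from the recommended components."""
    if backend not in ("CUDA", "MUSA"):
        raise ValueError("backend must be 'CUDA' or 'MUSA'")
    requirements = [
        "Define class ModelNew(nn.Module).",
        "Do not use forbidden PyTorch/ATen compute fallback in ModelNew.forward().",
        *extra_requirements,  # optional correctness or performance requirements
    ]
    lines = [
        "Given the following PyTorch reference model, generate a new implementation",
        f"class ModelNew(nn.Module) that uses custom native {backend} kernels.",
        "",
        "Reference:",
        reference_code.strip(),
        "",
        f"Input constraints: {input_spec}",
        f"Target backend: {backend}",
        "",
        "The generated implementation must:",
        *[f"- {req}" for req in requirements],
    ]
    return "\n".join(lines)


prompt = build_prompt(
    "class Model(nn.Module):\n    def forward(self, x):\n        return torch.relu(x)",
    "x: float32, shape [1024, 1024], contiguous",
    backend="CUDA",
    extra_requirements=["Handle boundary cases correctly."],
)
print(prompt)
```

Keeping the requirements in a list makes it easy to add or drop constraints per task while the rest of the prompt stays fixed.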

## Evaluation

MusaCoder-27B is evaluated using the MooreEval protocol on KernelBench-style tasks.

The evaluation checks:

* code extraction and interface validity;
* compilation success;
* randomized correctness against PyTorch reference;
* forbidden PyTorch/ATen fallback detection;
* synchronized runtime measurement;
* Faster Rate, where a kernel counts as faster only if its speedup over the baseline exceeds `1.1x`.
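
The sketch below shows how metrics of this kind can be aggregated from per-sample results. The definitions are assumptions made for illustration: Pass@k is taken as the share of tasks with at least one correct sample, Avg.@k as the mean per-sample pass rate, and Faster Rate as the share of samples that are both correct and faster than the threshold. The paper defines the exact aggregation, which may differ. `summarize` and the result layout are hypothetical.

```python
def summarize(results, speedup_threshold=1.1):
    """Aggregate per-sample outcomes.

    results maps task_id -> list of (is_correct, speedup) tuples,
    one tuple per sampled generation (k samples per task).
    """
    n_tasks = len(results)
    # Pass@k: a task counts once if any of its k samples is correct.
    pass_at_k = sum(any(ok for ok, _ in samples) for samples in results.values()) / n_tasks
    # Avg.@k: per-task fraction of correct samples, averaged over tasks.
    avg_at_k = sum(
        sum(ok for ok, _ in samples) / len(samples) for samples in results.values()
    ) / n_tasks
    # Faster Rate: correct samples whose speedup strictly exceeds the threshold.
    all_samples = [s for samples in results.values() for s in samples]
    faster_rate = sum(ok and sp > speedup_threshold for ok, sp in all_samples) / len(all_samples)
    return {"pass_at_k": pass_at_k, "avg_at_k": avg_at_k, "faster_rate": faster_rate}


# Toy data with k = 2 samples per task; speedup is ignored for incorrect samples.
results = {
    "relu": [(True, 1.30), (True, 1.05)],
    "softmax": [(False, 0.0), (True, 1.20)],
    "conv2d": [(False, 0.0), (False, 0.0)],
}
print(summarize(results))
```

Note that a correct kernel with a `1.05x` speedup contributes to the pass metrics but not to Faster Rate, since it does not clear the `1.1x` threshold.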

### KernelBench Results

| Model                | Overall Pass@8 | Overall Avg.@8 | Faster vs. Eager | Faster vs. Compile |
| -------------------- | -------------: | -------------: | ---------------: | -----------------: |
| Kimi K2.6            |           84.0 |          69.10 |              3.3 |                1.4 |
| GLM-5.1              |           85.6 |          76.25 |              7.4 |                3.9 |
| DeepSeek-V4_ProMax   |           84.8 |          60.05 |              5.7 |                3.0 |
| Claude Opus 4.7      |           87.2 |          77.30 |             11.8 |                7.5 |
| Qwen3.6-27B          |           67.2 |          35.60 |              3.4 |                1.6 |
| MusaCoder-27B-SFT    |           84.8 |          79.40 |              6.3 |                4.1 |
| **MusaCoder-27B-RL** |       **93.2** |      **88.60** |         **15.0** |            **9.2** |

### MUSA KernelBench Results

| Model                | Overall Pass@8 | Overall Avg.@8 | Faster vs. Eager |
| -------------------- | -------------: | -------------: | ---------------: |
| DeepSeek-V4-Pro      |           92.0 |           56.9 |              5.7 |
| GLM-5.1              |           88.0 |           66.4 |              6.9 |
| MusaCoder-27B-SFT    |           79.6 |           63.5 |              5.2 |
| **MusaCoder-27B-RL** |       **92.4** |       **81.7** |         **12.5** |

## Notes on Generated Code

Generated kernels should always be compiled and tested before use. GPU kernel generation is a high-risk code generation task because small mistakes in indexing, boundary handling, dtype conversion, or memory layout can lead to incorrect outputs, runtime failures, or illegal memory access.

We recommend validating generated code with:

* randomized correctness tests;
* multiple input shapes and dtypes;
* non-contiguous tensor cases when applicable;
* runtime profiling;
* forbidden fallback detection.
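
A randomized correctness test can be as simple as comparing the candidate against the reference on many random inputs. The sketch below uses plain Python lists so that it runs anywhere. `randomized_check`, `make_vector`, and the tolerances are hypothetical. In practice the two callables would wrap `Model` and `ModelNew`, and the comparison would typically be done on tensors, for example with `torch.allclose`.

```python
import math
import random


def randomized_check(reference, candidate, make_input, trials=100,
                     rel_tol=1e-5, abs_tol=1e-6, seed=0):
    """Return the first input on which candidate disagrees with reference, else None."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        x = make_input(rng)
        expected, actual = reference(x), candidate(x)
        if len(expected) != len(actual):
            return x
        for e, a in zip(expected, actual):
            if not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol):
                return x
    return None


def make_vector(rng):
    # Mix in lengths that are not multiples of a typical block size.
    n = rng.choice([1, 31, 32, 33, 1000])
    return [rng.uniform(-10.0, 10.0) for _ in range(n)]


def relu_reference(xs):
    return [max(0.0, v) for v in xs]


def relu_buggy(xs):
    # Deliberate boundary bug: the last element is never processed.
    return [max(0.0, v) for v in xs[:-1]] + [xs[-1]]


print(randomized_check(relu_reference, relu_reference, make_vector) is None)
print(randomized_check(relu_reference, relu_buggy, make_vector) is not None)
```

The buggy variant mimics a kernel whose grid does not cover the final element, which is the kind of boundary error that a single fixed-shape test can easily miss.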

## Limitations

MusaCoder-27B is specialized for GPU kernel generation and may not be optimal for general-purpose chat or application development. The model may still generate code that:

* fails to compile;
* produces incorrect results for unseen edge cases;
* uses inefficient thread/block layouts;
* relies on disallowed high-level fallback APIs;
* requires additional engineering adaptation for specific platforms or compiler versions.

Users should treat generated code as a candidate implementation that must be verified before deployment.

## License

MusaCoder-27B is released under the Apache License 2.0.

MusaCoder-27B is initialized from Qwen3.6-27B and further trained on top of it. Users should comply with the license terms of MusaCoder-27B as well as the applicable license terms of upstream models and third-party components.

## Citation

If you find MusaCoder useful, please cite:

```bibtex
@article{cheng2026musacoder,
  title={MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU},
  author={Cheng, Kun and Lu, Songshuo and Liao, Sicong and Li, Tankun and Zhang, Yafei and Yang, Dong and Lv, Qiheng and Wang, Hua and Chen, Zhi and Tang, Yaohua},
  journal={arXiv preprint arXiv:2606.04847},
  year={2026},
  eprint={2606.04847},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.04847}
}
```

## Acknowledgements

MusaCoder is developed by Moore Threads AI. We thank the open-source community for advancing GPU programming, code generation, and execution-feedback learning. We also acknowledge the upstream base model and software ecosystems that make this work possible.