+---
+license: apache-2.0
+language:
+- en
+- zh
+base_model:
+- Qwen/Qwen3.6-27B
+pipeline_tag: text-generation
+tags:
+- CUDA
+- MUSA
+- GPU-Kernel
+- Reinforcement-Learning
+---
+<div align="left">
+  <img src="./assets/moore_threads_logo.png" width="120" alt="Moore Threads Logo" />
+</div>
+<!-- <h1 align="center">MusaCoder-27B</h1> -->
+<h1 align="center">
+  <strong>MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU</strong>
+</h1>
+<!-- <p align="center">
+  Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, <br>
+  Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang
+</p> -->
+<p align="center">
+  <a href="https://arxiv.org/abs/2606.04847">📄 Paper</a>
+</p>
+---
+<div align="center">
+  <img src="./assets/kernelbench_bar.png" width="900" alt="KernelBench Benchmark Results" />
+</div>
+# MusaCoder-27B
+> This repository contains model weights and configuration files for **MusaCoder-27B**, a specialized code generation model for native GPU kernel synthesis.
+>
+> MusaCoder-27B generates native CUDA/MUSA kernels from PyTorch reference implementations, with a focus on compilability, numerical correctness, anti-fallback legality (no forbidden PyTorch/ATen compute fallbacks), and measured speedup.
+## Introduction
+**MusaCoder-27B** is a 27B-parameter code model developed by Moore Threads for **PyTorch-to-CUDA/MUSA native kernel generation**. Unlike general-purpose code models, MusaCoder focuses on low-level GPU programming tasks, including tensor shape reasoning, thread/block mapping, memory indexing, boundary handling, reduction strategies, numerical stability, and performance-oriented kernel optimization.
+The model is trained through a full-stack post-training pipeline consisting of:
+* multi-source supervised fine-tuning data construction;
+* verifier-filtered rejection fine-tuning;
+* execution-feedback reinforcement learning;
+* strict native-kernel verification with MooreEval;
+* CUDA/MUSA-oriented kernel repair and optimization data.
+MusaCoder-27B is released to promote the development of the MUSA open-source ecosystem, facilitate research on LLM-based code generation and GPU kernel synthesis, and encourage the community to explore cross-platform native kernel optimization.
+## Highlights
+### Native CUDA/MUSA Kernel Generation
+MusaCoder-27B is optimized for generating native GPU kernels from PyTorch reference code. The model is not intended for generic business code generation; instead, it targets low-level kernel authoring where generated code must compile, run correctly, satisfy task constraints, and achieve measurable speedup.
+### MUSA-Oriented Kernel Synthesis
+MusaCoder-27B supports PyTorch-to-MUSA kernel generation and can be used to explore automatic generation of MUSA native kernels from PyTorch reference programs. This gives the MUSA developer community a foundation model for kernel work and lowers the barrier to writing, validating, and optimizing MUSA kernels.
+### Full-Stack Training Pipeline
+MusaCoder-27B is trained with a full-stack pipeline:
+* **SFT** (supervised fine-tuning) teaches the model the PyTorch-to-kernel task format, common kernel implementation patterns, GPU programming knowledge, code review ability, and performance analysis.
+* **RFT** (rejection fine-tuning) uses execution-based verification to select correct model-generated implementations while preserving implementation diversity.
+* **RL** (reinforcement learning) uses real compilation, execution, correctness checking, anti-fallback detection, and runtime measurement as reward signals.
+### Execution-Based Verification
+MusaCoder is developed together with **MooreEval**, an execution-based verifier and reward environment. MooreEval checks whether generated kernels:
+* can be parsed and compiled;
+* pass randomized correctness tests against PyTorch reference outputs;
+* avoid forbidden PyTorch/ATen computational fallbacks;
+* achieve real runtime speedup under synchronized event timing.
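One of these checks, fallback detection, can be approximated statically. The sketch below is a minimal illustration and not MooreEval's actual implementation: the deny-list prefixes, the allow-list of allocation helpers, and the function names are all assumptions. It walks the AST of a candidate and reports `torch.*` or `F.*` calls made inside `ModelNew.forward()`.

```python
import ast

# Hypothetical deny-list and allow-list; the real MooreEval rules are not given here.
FORBIDDEN_PREFIXES = ("torch.", "F.")
ALLOWED_CALLS = {"torch.empty", "torch.empty_like", "torch.zeros_like"}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_fallback_calls(source, class_name="ModelNew"):
    """List forbidden calls made inside class_name.forward()."""
    hits = []
    for cls in ast.walk(ast.parse(source)):
        if not (isinstance(cls, ast.ClassDef) and cls.name == class_name):
            continue
        for fn in cls.body:
            if not (isinstance(fn, ast.FunctionDef) and fn.name == "forward"):
                continue
            for call in ast.walk(fn):
                if not isinstance(call, ast.Call):
                    continue
                name = dotted_name(call.func)
                if name and name.startswith(FORBIDDEN_PREFIXES) and name not in ALLOWED_CALLS:
                    hits.append(name)
    return hits
```

A static scan like this misses tensor-method calls such as `x.relu()` and anything hidden behind helper functions, so it complements execution-time checks rather than replacing them.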
+### RL Stabilization Techniques
+The training pipeline incorporates three stabilization techniques:
+* **PrimeEcho**: first-turn-anchored multi-turn reward for balancing repair ability and first-attempt quality.
+* **Buffered Dynamic Retry**: converts all-failed groups into feedback-conditioned repair tasks.
+* **MirrorPop**: sequence-level off-policy filtering based on absolute log-ratio deviation.
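The one-line description of MirrorPop can be read as a symmetric, per-sequence filter on how far the current policy has drifted from the policy that sampled the data. The sketch below is only that reading, not the authors' implementation: the use of the mean per-token log-ratio, the threshold value, and the dictionary keys are assumptions made for illustration.

```python
def sequence_log_ratio(new_logprobs, old_logprobs):
    """Mean per-token log-ratio (log pi_new - log pi_old) for one sequence."""
    if not new_logprobs or len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-prob lists must be non-empty and equally long")
    total = sum(n - o for n, o in zip(new_logprobs, old_logprobs))
    return total / len(new_logprobs)


def filter_off_policy(batch, max_abs_log_ratio=0.2):
    """Keep sequences whose absolute mean log-ratio is within the threshold.

    Using the absolute value drops sequences that drifted in either
    direction (much more likely or much less likely under the new policy).
    The threshold 0.2 is an arbitrary placeholder.
    """
    kept = []
    for seq in batch:
        ratio = sequence_log_ratio(seq["new_logprobs"], seq["old_logprobs"])
        if abs(ratio) <= max_abs_log_ratio:
            kept.append(seq)
    return kept
```

Filtering whole sequences, rather than clipping individual tokens, keeps each retained sample internally consistent at the cost of discarding some data.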
+## Model Details
+| Item                  | Description                                              |
+| --------------------- | -------------------------------------------------------- |
+| Model name            | MusaCoder-27B                                            |
+| Developer             | Moore Threads                                            |
+| Base model            | Qwen3.6-27B                                              |
+| Model type            | Causal language model                                    |
+| Primary use           | PyTorch-to-CUDA/MUSA native kernel generation            |
+| License               | Apache License 2.0                                       |
+| Training precision    | bf16                                                     |
+| Recommended framework | Transformers / vLLM / SGLang-compatible inference        |
+## Intended Use
+MusaCoder-27B is intended for research and development in:
+* PyTorch-to-CUDA/MUSA kernel generation;
+* native GPU kernel synthesis;
+* code generation for accelerator programming;
+* automatic kernel repair and optimization;
+* MUSA ecosystem development;
+* execution-feedback reinforcement learning for code models.
+A typical input contains a PyTorch reference implementation, input constraints, and generation requirements. The model is expected to produce a `ModelNew` implementation using custom native CUDA/MUSA kernels.
+## Quickstart
+### Installation
+```bash
+pip install transformers accelerate torch
+```
+For high-throughput inference, users may also use vLLM or SGLang depending on their deployment environment.
+### Basic Usage with Transformers
+````python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+model_name = "MooreThreads/MusaCoder-27B"
+tokenizer = AutoTokenizer.from_pretrained(
+    model_name,
+    trust_remote_code=True,
+)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+prompt = r"""
+You are given a PyTorch reference implementation. Write a replacement ModelNew
+that implements the same computation using a custom native CUDA/MUSA kernel.
+Reference:
+```python
+import torch
+import torch.nn as nn
+class Model(nn.Module):
+    def forward(self, x):
+        return torch.relu(x)
+```
+Requirements:
+* Define class ModelNew(nn.Module).
+* Do not use forbidden PyTorch/ATen compute fallback in ModelNew.forward().
+* The implementation must be compilable and numerically correct.
+  """
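+# Wrap the prompt as a single user turn and render it with the chat template.
+messages = [
+    {"role": "user", "content": prompt},
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+)
+inputs = tokenizer([text], return_tensors="pt").to(model.device)
+# Sample a completion; kernel implementations can be long, hence the large token budget.
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=32000,
+    temperature=0.7,
+    top_p=0.95,
+    do_sample=True,
+)
+# Decode only the newly generated tokens, skipping the prompt.
+response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
+print(response)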
+````
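The decoded response usually mixes explanation with code, so a post-processing step is needed before compiling anything. The following is a minimal sketch under the assumption that the model wraps its answer in a Markdown code fence; `extract_code` and `defines_model_new` are hypothetical helper names, not part of any released tooling.

```python
import ast
import re

FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown
FENCE_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response):
    """Return the last fenced code block, or the whole response if none is fenced."""
    blocks = FENCE_RE.findall(response)
    return (blocks[-1] if blocks else response).strip()


def defines_model_new(code):
    """True if the code parses and has a top-level class ModelNew with forward()."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in tree.body:
        if isinstance(node, ast.ClassDef) and node.name == "ModelNew":
            return any(
                isinstance(item, ast.FunctionDef) and item.name == "forward"
                for item in node.body
            )
    return False
```

Because `ast.parse` only parses and never executes, this interface check is safe to run on untrusted output before any compilation step.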
+## Prompt Format
+We recommend using a structured prompt that includes:
+1. PyTorch reference code;
+2. input shape and dtype constraints;
+3. target backend, e.g., CUDA or MUSA;
+4. explicit instruction to define `ModelNew`;
+5. anti-fallback constraints;
+6. optional correctness and performance requirements.
+Example:
+```text
+Given the following PyTorch reference model, generate a new implementation
+class ModelNew(nn.Module) that uses custom native CUDA/MUSA kernels.
+The generated implementation must:
+- match the PyTorch reference numerically;
+- compile successfully;
+- avoid forbidden PyTorch/ATen compute fallback in forward();
+- handle boundary cases correctly;
+- prefer native kernel implementations over high-level library calls.
+```
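The six recommended components can be assembled programmatically so that every request has the same shape. This is a sketch only: `build_prompt`, its parameters, and the exact wording of the requirement lines are illustrative assumptions, loosely following the example above.

```python
FENCE = "`" * 3  # built at runtime so this example nests cleanly in Markdown


def build_prompt(reference_code, input_spec, backend="CUDA", extra_requirements=()):
    """Assemble a structured kernel-generation prompt (hypothetical helper)."""
    if backend not in ("CUDA", "MUSA"):
        raise ValueError(f"unsupported backend: {backend!r}")
    requirements = [
        "Define class ModelNew(nn.Module).",
        "Match the PyTorch reference numerically.",
        "Compile successfully.",
        "Avoid forbidden PyTorch/ATen compute fallback in forward().",
        "Handle boundary cases correctly.",
        *extra_requirements,  # optional correctness or performance requirements
    ]
    parts = [
        "Given the following PyTorch reference model, generate a new implementation",
        f"class ModelNew(nn.Module) that uses custom native {backend} kernels.",
        "",
        "Reference:",
        f"{FENCE}python",
        reference_code.strip(),
        FENCE,
        "",
        f"Input constraints: {input_spec}",
        f"Target backend: {backend}",
        "",
        "The generated implementation must:",
        *[f"- {item}" for item in requirements],
    ]
    return "\n".join(parts)
```

The returned string can be passed as the `content` of a single user message, as in the Transformers example.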
+## Evaluation
+MusaCoder-27B is evaluated using the MooreEval protocol on KernelBench-style tasks.
+The evaluation checks:
+* code extraction and interface validity;
+* compilation success;
+* randomized correctness against PyTorch reference;
+* forbidden PyTorch/ATen fallback detection;
+* synchronized runtime measurement;
+* Faster Rate, where a kernel counts as faster only if its measured speedup exceeds `1.1x`.
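This card does not spell out how the reported metrics are aggregated, so the sketch below uses common conventions as assumptions: Pass@8 as the share of tasks with at least one correct sample out of eight, Avg.@8 as the mean share of correct samples per task, and Faster Rate as the mean share of samples that are both correct and strictly above the speedup threshold. All three are returned as percentages.

```python
def pass_at_k(results):
    """Percent of tasks with at least one correct sample."""
    solved = sum(any(ok for ok, _ in samples) for samples in results.values())
    return 100.0 * solved / len(results)


def avg_at_k(results):
    """Percent of correct samples, averaged over tasks."""
    per_task = [
        sum(ok for ok, _ in samples) / len(samples)
        for samples in results.values()
    ]
    return 100.0 * sum(per_task) / len(per_task)


def faster_rate(results, threshold=1.1):
    """Percent of samples that are correct and exceed the speedup threshold."""
    per_task = [
        sum(ok and speedup > threshold for ok, speedup in samples) / len(samples)
        for samples in results.values()
    ]
    return 100.0 * sum(per_task) / len(per_task)
```

Here `results` maps each task name to a list of `(correct, speedup)` pairs, one per sampled generation. A sample with a speedup of exactly 1.1 does not count as faster under the strict `>` comparison.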
+### KernelBench Results
+| Model                | Overall Pass@8 | Overall Avg.@8 | Faster vs. Eager | Faster vs. Compile |
+| -------------------- | -------------: | -------------: | ---------------: | -----------------: |
+| Kimi K2.6            |           84.0 |          69.10 |              3.3 |                1.4 |
+| GLM-5.1              |           85.6 |          76.25 |              7.4 |                3.9 |
+| DeepSeek-V4_ProMax   |           84.8 |          60.05 |              5.7 |                3.0 |
+| Claude Opus 4.7      |           87.2 |          77.30 |             11.8 |                7.5 |
+| Qwen3.6-27B          |           67.2 |          35.60 |              3.4 |                1.6 |
+| MusaCoder-27B-SFT    |           84.8 |          79.40 |              6.3 |                4.1 |
+| **MusaCoder-27B-RL** |       **93.2** |      **88.60** |         **15.0** |            **9.2** |
+### MUSA KernelBench Results
+| Model                | Overall Pass@8 | Overall Avg.@8 | Faster vs. Eager |
+| -------------------- | -------------: | -------------: | ---------------: |
+| DeepSeek-V4-Pro      |           92.0 |           56.9 |              5.7 |
+| GLM-5.1              |           88.0 |           66.4 |              6.9 |
+| MusaCoder-27B-SFT    |           79.6 |           63.5 |              5.2 |
+| **MusaCoder-27B-RL** |       **92.4** |       **81.7** |         **12.5** |
+## Notes on Generated Code
+Generated kernels should always be compiled and tested before use. GPU kernel generation is a high-risk code generation task because small mistakes in indexing, boundary handling, dtype conversion, or memory layout can lead to incorrect outputs, runtime failures, or illegal memory access.
+We recommend validating generated code with:
+* randomized correctness tests;
+* multiple input shapes and dtypes;
+* non-contiguous tensor cases when applicable;
+* runtime profiling;
+* forbidden fallback detection.
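The first two recommendations can be combined into one small harness. The sketch below is framework-agnostic and uses plain Python lists in place of tensors so that it runs anywhere; the function name, tolerances, and value range are assumptions. With PyTorch, the element-wise comparison would typically be replaced by `torch.allclose` or `torch.testing.assert_close`.

```python
import math
import random


def check_randomized(reference, candidate, sizes, trials=5, rtol=1e-4, atol=1e-5, seed=0):
    """Compare candidate against reference on random inputs of several sizes.

    Returns (passed, message). Sizes that are not multiples of the kernel's
    tile or vector width are the ones most likely to expose boundary bugs.
    """
    rng = random.Random(seed)
    for n in sizes:
        for _ in range(trials):
            x = [rng.uniform(-10.0, 10.0) for _ in range(n)]
            expected, actual = reference(x), candidate(x)
            if len(expected) != len(actual):
                return False, f"length mismatch for n={n}"
            for i, (e, a) in enumerate(zip(expected, actual)):
                if not math.isclose(a, e, rel_tol=rtol, abs_tol=atol):
                    return False, f"value mismatch at index {i} for n={n}"
    return True, "ok"


def relu_reference(x):
    return [max(v, 0.0) for v in x]


def relu_tail_bug(x):
    # Simulates a kernel that processes 4 elements per thread and forgets the tail.
    return [max(v, 0.0) for v in x[: len(x) // 4 * 4]]
```

Checking `relu_tail_bug` with sizes `[4, 8]` passes, while adding a size such as `7` fails, which is why a single convenient shape is not enough.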
+## Limitations
+MusaCoder-27B is specialized for GPU kernel generation and may not be optimal for general-purpose chat or application development. The model may still generate code that:
+* fails to compile;
+* produces incorrect results for unseen edge cases;
+* uses inefficient thread/block layouts;
+* relies on disallowed high-level fallback APIs;
+* requires additional engineering adaptation for specific platforms or compiler versions.
+Users should treat generated code as a candidate implementation that must be verified before deployment.
+## License
+MusaCoder-27B is released under the Apache License 2.0.
+MusaCoder-27B is initialized from Qwen3.6-27B and further trained on top of it. Users should comply with the license terms of MusaCoder-27B as well as the applicable license terms of upstream models and third-party components.
+## Citation
+If you find MusaCoder useful, please cite:
+```bibtex
+@article{cheng2026musacoder,
+  title={MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU},
+  author={Cheng, Kun and Lu, Songshuo and Liao, Sicong and Li, Tankun and Zhang, Yafei and Yang, Dong and Lv, Qiheng and Wang, Hua and Chen, Zhi and Tang, Yaohua},
+  journal={arXiv preprint arXiv:2606.04847},
+  year={2026},
+  eprint={2606.04847},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV},
+  url={https://arxiv.org/abs/2606.04847}
+}
+```
+## Acknowledgements
+MusaCoder is developed by Moore Threads AI. We thank the open-source community for advancing GPU programming, code generation, and execution-feedback learning. We also acknowledge the upstream base model and software ecosystems that make this work possible.