---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen3.6-27B
pipeline_tag: reinforcement-learning
tags:
- CUDA
- MUSA
- GPU-Kernel
- Reinforcement-Learning
---
|
|
|
|
<div align="left">
<img src="./assets/moore_threads_logo.png" width="120" alt="Moore Threads Logo" />
</div>

<!-- <h1 align="center">MusaCoder-27B</h1> -->

<h1 align="center">
<strong>MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU</strong>
</h1>

<!-- <p align="center">
Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, <br>
Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang
</p> -->

<p align="center">
<a href="https://arxiv.org/abs/2606.04847">📄 Paper</a>
</p>

---

<div align="center">
<img src="./assets/kernelbench_bar.png" width="900" alt="KernelBench Benchmark Results" />
</div>
|
|
# MusaCoder-27B

> This repository contains model weights and configuration files for **MusaCoder-27B**, a specialized code generation model for native GPU kernel synthesis.
>
> MusaCoder-27B is designed to generate CUDA/MUSA native kernels from PyTorch reference implementations, with a focus on compilability, numerical correctness, anti-fallback legality, and empirical speedup.

## Introduction

**MusaCoder-27B** is a 27B-parameter code model developed by Moore Threads for **PyTorch-to-CUDA/MUSA native kernel generation**. Unlike general-purpose code models, MusaCoder focuses on low-level GPU programming tasks, including tensor shape reasoning, thread/block mapping, memory indexing, boundary handling, reduction strategies, numerical stability, and performance-oriented kernel optimization.

The model is trained through a full-stack post-training pipeline consisting of:

* multi-source supervised fine-tuning data construction;
* verifier-filtered rejection fine-tuning;
* execution-feedback reinforcement learning;
* strict native-kernel verification with MooreEval;
* CUDA/MUSA-oriented kernel repair and optimization data.

MusaCoder-27B is released to promote the development of the MUSA open-source ecosystem, facilitate research on LLM-based code generation and GPU kernel synthesis, and encourage the community to explore cross-platform native kernel optimization.
|
|
## Highlights

### Native CUDA/MUSA Kernel Generation

MusaCoder-27B is optimized for generating native GPU kernels from PyTorch reference code. The model is not intended for generic business code generation; instead, it targets low-level kernel authoring where generated code must compile, run correctly, satisfy task constraints, and achieve measurable speedup.

### MUSA-Oriented Kernel Synthesis

MusaCoder-27B supports PyTorch-to-MUSA kernel generation and can be used to explore automatic generation of native MUSA kernels from PyTorch reference programs. This gives the MUSA developer community a foundation model to build on and lowers the barrier to writing, validating, and optimizing MUSA kernels.

### Full-Stack Training Pipeline

MusaCoder-27B is trained with a full-stack pipeline:

* **SFT** teaches the model PyTorch-to-kernel task format, common kernel implementation patterns, GPU programming knowledge, review capability, and performance analysis.
* **RFT** uses execution-based verification to select correct model-generated implementations while preserving implementation diversity.
* **RL** uses real compilation, execution, correctness checking, anti-fallback detection, and runtime measurement as reward signals.

### Execution-Based Verification

MusaCoder is developed together with **MooreEval**, an execution-based verifier and reward environment. MooreEval checks whether generated kernels:

* can be parsed and compiled;
* pass randomized correctness tests against PyTorch reference outputs;
* avoid forbidden PyTorch/ATen computational fallbacks;
* achieve real runtime speedup under synchronized event timing.
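
To make the anti-fallback check concrete, here is a minimal static sketch that scans `ModelNew.forward` for calls rooted at `torch` or `F`. It is not MooreEval's implementation: the function name `find_fallback_calls`, the deny-list `FORBIDDEN_ROOTS`, and the allow-list `ALLOWED_CALLS` are all assumptions made for illustration.

```python
import ast

# Hypothetical deny-list: module aliases whose calls count as compute fallbacks.
FORBIDDEN_ROOTS = {"torch", "F"}
# Hypothetical allow-list: allocation helpers that perform no computation.
ALLOWED_CALLS = {"torch.empty", "torch.empty_like", "torch.zeros_like"}


def _dotted_name(node):
    """Return 'a.b.c' for nested Attribute/Name nodes, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_fallback_calls(source, class_name="ModelNew"):
    """List dotted call names inside class_name.forward that hit a forbidden root."""
    hits = []
    for cls in ast.walk(ast.parse(source)):
        if not (isinstance(cls, ast.ClassDef) and cls.name == class_name):
            continue
        for fn in cls.body:
            if not (isinstance(fn, ast.FunctionDef) and fn.name == "forward"):
                continue
            for call in ast.walk(fn):
                if not isinstance(call, ast.Call):
                    continue
                name = _dotted_name(call.func)
                if (
                    name
                    and name.split(".")[0] in FORBIDDEN_ROOTS
                    and name not in ALLOWED_CALLS
                ):
                    hits.append(name)
    return hits
```

A purely static scan like this misses tensor method calls such as `x.relu()` and anything hidden behind helper functions, so a production verifier would likely need runtime tracing as well.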
|
|
### RL Stabilization Techniques

The training pipeline incorporates three stabilization techniques:

* **PrimeEcho**: first-turn-anchored multi-turn reward for balancing repair ability and first-attempt quality.
* **Buffered Dynamic Retry**: converts all-failed groups into feedback-conditioned repair tasks.
* **MirrorPop**: sequence-level off-policy filtering based on absolute log-ratio deviation.
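
The MirrorPop description above can be read in more than one way; the paper defines the exact rule. As one plausible reading only, the sketch below drops a whole sequence when the absolute value of its mean per-token log-ratio between the current and behavior policies exceeds a threshold. The averaging over tokens, the symmetric bound, the threshold value `0.2`, and all function and key names are assumptions.

```python
def sequence_log_ratio(new_logprobs, old_logprobs):
    """Mean per-token log-ratio, log pi_new(y_t) - log pi_old(y_t), for one sequence."""
    if len(new_logprobs) != len(old_logprobs) or not new_logprobs:
        raise ValueError("log-prob lists must be non-empty and of equal length")
    total = sum(n - o for n, o in zip(new_logprobs, old_logprobs))
    return total / len(new_logprobs)


def filter_off_policy(batch, max_abs_log_ratio=0.2):
    """Keep sequences whose |mean log-ratio| stays within the (assumed) threshold."""
    kept = []
    for seq in batch:
        ratio = sequence_log_ratio(seq["new_logprobs"], seq["old_logprobs"])
        # Symmetric test: drift in either direction removes the sequence.
        if abs(ratio) <= max_abs_log_ratio:
            kept.append(seq)
    return kept
```

Filtering at the sequence level, rather than clipping individual tokens, removes a stale sample from the update entirely instead of down-weighting parts of it.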
|
|
## Model Details

| Item                  | Description                                       |
| --------------------- | ------------------------------------------------- |
| Model name            | MusaCoder-27B                                     |
| Developer             | Moore Threads                                     |
| Base model            | Qwen3.6-27B                                       |
| Model type            | Causal language model                             |
| Primary use           | PyTorch-to-CUDA/MUSA native kernel generation     |
| License               | Apache License 2.0                                |
| Training precision    | bf16                                              |
| Recommended framework | Transformers / vLLM / SGLang-compatible inference |

## Intended Use

MusaCoder-27B is intended for research and development in:

* PyTorch-to-CUDA/MUSA kernel generation;
* native GPU kernel synthesis;
* code generation for accelerator programming;
* automatic kernel repair and optimization;
* MUSA ecosystem development;
* execution-feedback reinforcement learning for code models.

A typical input contains a PyTorch reference implementation, input constraints, and generation requirements. The model is expected to produce a `ModelNew` implementation using custom native CUDA/MUSA kernels.
|
|
## Quickstart

### Installation

```bash
pip install transformers accelerate torch
```

For high-throughput inference, users may also use vLLM or SGLang depending on their deployment environment.

### Basic Usage with Transformers

````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "MooreThreads/MusaCoder-27B"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)

# Load the weights in bf16 and let accelerate place them on available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = r"""
You are given a PyTorch reference implementation. Write a replacement ModelNew
that implements the same computation using a custom native CUDA/MUSA kernel.

Reference:
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def forward(self, x):
        return torch.relu(x)
```

Requirements:

* Define class ModelNew(nn.Module).
* Do not use forbidden PyTorch/ATen compute fallback in ModelNew.forward().
* The implementation must be compilable and numerically correct.
"""

messages = [
    {"role": "user", "content": prompt},
]

# Render the chat template to text and append the assistant generation prompt.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=32000,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the echoed prompt.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
````
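
The decoded response typically mixes explanation with a fenced code block, so a downstream harness needs to pull the code out and check that it declares `ModelNew` before compiling it. The following is a minimal sketch of that step under the assumption that the last fenced block holds the final answer; the helper names `extract_last_code_block` and `defines_model_new` are hypothetical and not part of any released tooling.

```python
import ast
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown


def extract_last_code_block(response):
    """Return the body of the last fenced code block, or None if there is none."""
    pattern = re.compile(
        FENCE + r"[A-Za-z0-9_+-]*[ \t]*\n(.*?)" + FENCE,
        re.DOTALL,
    )
    blocks = pattern.findall(response)
    return blocks[-1] if blocks else None


def defines_model_new(code):
    """True if the code parses as Python and declares a top-level class ModelNew."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    return any(
        isinstance(node, ast.ClassDef) and node.name == "ModelNew"
        for node in tree.body
    )
```

Taking the last block rather than the first is a design choice: models often sketch a draft before the final implementation.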
|
|
## Prompt Format

We recommend using a structured prompt that includes:

1. PyTorch reference code;
2. input shape and dtype constraints;
3. target backend, e.g., CUDA or MUSA;
4. explicit instruction to define `ModelNew`;
5. anti-fallback constraints;
6. optional correctness and performance requirements.

Example:

```text
Given the following PyTorch reference model, generate a new implementation
class ModelNew(nn.Module) that uses custom native CUDA/MUSA kernels.

The generated implementation must:
- match the PyTorch reference numerically;
- compile successfully;
- avoid forbidden PyTorch/ATen compute fallback in forward();
- handle boundary cases correctly;
- prefer native kernel implementations over high-level library calls.
```
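
When many tasks share the same structure, it can help to assemble the prompt programmatically. The helper below is a sketch that follows the six recommended parts; `build_prompt`, its parameters, and the exact wording of the "Input constraints" line are assumptions, not an official template.

```python
FENCE = "`" * 3  # avoids literal nested fences in this example


def build_prompt(reference_code, backend, input_spec="", extra_requirements=()):
    """Assemble a structured prompt; backend must be 'CUDA' or 'MUSA'."""
    if backend not in ("CUDA", "MUSA"):
        raise ValueError(f"unsupported backend: {backend!r}")
    requirements = [
        "match the PyTorch reference numerically",
        "compile successfully",
        "avoid forbidden PyTorch/ATen compute fallback in forward()",
        "handle boundary cases correctly",
        *extra_requirements,  # optional correctness or performance requirements
    ]
    parts = [
        "Given the following PyTorch reference model, generate a new implementation",
        f"class ModelNew(nn.Module) that uses custom native {backend} kernels.",
        "",
        "Reference:",
        FENCE + "python",
        reference_code.strip(),
        FENCE,
    ]
    if input_spec:
        parts += ["", f"Input constraints: {input_spec}"]
    parts += ["", "The generated implementation must:"]
    parts += [f"- {req};" for req in requirements[:-1]]
    parts.append(f"- {requirements[-1]}.")
    return "\n".join(parts)
```

For example, `build_prompt(src, "MUSA", input_spec="x: float32, shape (1024, 1024)")` yields a prompt that names the backend and the input constraints explicitly.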
|
|
## Evaluation

MusaCoder-27B is evaluated using the MooreEval protocol on KernelBench-style tasks.

The evaluation checks:

* code extraction and interface validity;
* compilation success;
* randomized correctness against PyTorch reference;
* forbidden PyTorch/ATen fallback detection;
* synchronized runtime measurement;
* Faster Rate with a speedup threshold of `>1.1x`.
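
To show how such checks could roll up into the reported columns, the sketch below computes Pass@k, Avg.@k, and Faster Rate from per-sample records. The definitions are assumptions and may differ from MooreEval: Pass@k counts a task as solved if any of its k samples is correct, Avg.@k is the share of correct samples, and Faster Rate is the share of samples that are correct with `speedup` (baseline time divided by kernel time) strictly above the threshold.

```python
def pass_at_k(results):
    """Percent of tasks with at least one correct sample among their attempts."""
    solved = sum(any(s["correct"] for s in samples) for samples in results)
    return 100.0 * solved / len(results)


def avg_at_k(results):
    """Percent of correct samples over all tasks and attempts."""
    flags = [s["correct"] for samples in results for s in samples]
    return 100.0 * sum(flags) / len(flags)


def faster_rate(results, threshold=1.1):
    """Percent of samples that are correct and exceed the speedup threshold."""
    flags = [
        s["correct"] and s["speedup"] > threshold
        for samples in results
        for s in samples
    ]
    return 100.0 * sum(flags) / len(flags)
```

Here `results` is a list of tasks, each a list of sample dicts with hypothetical keys `correct` (bool) and `speedup` (float). Incorrect kernels never count as faster, however quickly they run.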
|
|
### KernelBench Results

| Model                | Overall Pass@8 | Overall Avg.@8 | Faster vs. Eager | Faster vs. Compile |
| -------------------- | -------------: | -------------: | ---------------: | -----------------: |
| Kimi K2.6            |           84.0 |          69.10 |              3.3 |                1.4 |
| GLM-5.1              |           85.6 |          76.25 |              7.4 |                3.9 |
| DeepSeek-V4_ProMax   |           84.8 |          60.05 |              5.7 |                3.0 |
| Claude Opus 4.7      |           87.2 |          77.30 |             11.8 |                7.5 |
| Qwen3.6-27B          |           67.2 |          35.60 |              3.4 |                1.6 |
| MusaCoder-27B-SFT    |           84.8 |          79.40 |              6.3 |                4.1 |
| **MusaCoder-27B-RL** |       **93.2** |      **88.60** |         **15.0** |            **9.2** |

### MUSA KernelBench Results

| Model                | Overall Pass@8 | Overall Avg.@8 | Faster vs. Eager |
| -------------------- | -------------: | -------------: | ---------------: |
| DeepSeek-V4-Pro      |           92.0 |           56.9 |              5.7 |
| GLM-5.1              |           88.0 |           66.4 |              6.9 |
| MusaCoder-27B-SFT    |           79.6 |           63.5 |              5.2 |
| **MusaCoder-27B-RL** |       **92.4** |       **81.7** |         **12.5** |

## Notes on Generated Code

Generated kernels should always be compiled and tested before use. GPU kernel generation is a high-risk code generation task because small mistakes in indexing, boundary handling, dtype conversion, or memory layout can lead to incorrect outputs, runtime failures, or illegal memory access.

We recommend validating generated code with:

* randomized correctness tests;
* multiple input shapes and dtypes;
* non-contiguous tensor cases when applicable;
* runtime profiling;
* forbidden fallback detection.
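
The shape of a randomized correctness test can be illustrated without a GPU. The sketch below compares a candidate against a reference over random inputs at several sizes, including empty and ragged ones. It uses plain Python lists as stand-ins for tensors, and `blocked_relu` merely imitates block-wise kernel indexing; in practice one would compare device tensors, for example with `torch.allclose`. All names and tolerances are assumptions.

```python
import math
import random


def reference_relu(xs):
    """Reference computation, standing in for the PyTorch model."""
    return [max(x, 0.0) for x in xs]


def blocked_relu(xs, block=4):
    """Stand-in for a kernel: fixed-size blocks with a boundary guard."""
    out = [0.0] * len(xs)
    n_blocks = (len(xs) + block - 1) // block  # ceil-div, like a grid size
    for b in range(n_blocks):
        for t in range(block):
            i = b * block + t
            if i < len(xs):  # the guard a real kernel needs for ragged tails
                out[i] = xs[i] if xs[i] > 0.0 else 0.0
    return out


def check_random(candidate, reference, sizes, trials=5, rtol=1e-5, atol=1e-6, seed=0):
    """Return the (size, trial) pairs where candidate and reference disagree."""
    rng = random.Random(seed)
    failures = []
    for n in sizes:
        for trial in range(trials):
            xs = [rng.uniform(-10.0, 10.0) for _ in range(n)]
            got, want = candidate(xs), reference(xs)
            ok = len(got) == len(want) and all(
                math.isclose(g, w, rel_tol=rtol, abs_tol=atol)
                for g, w in zip(got, want)
            )
            if not ok:
                failures.append((n, trial))
    return failures
```

Choosing sizes such as 0, 1, and values just above a multiple of the block size targets the boundary-handling mistakes mentioned above.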

## Limitations

MusaCoder-27B is specialized for GPU kernel generation and may not be optimal for general-purpose chat or application development. The model may still generate code that:

* fails to compile;
* produces incorrect results for unseen edge cases;
* uses inefficient thread/block layouts;
* relies on disallowed high-level fallback APIs;
* requires additional engineering adaptation for specific platforms or compiler versions.

Users should treat generated code as a candidate implementation that must be verified before deployment.

## License

MusaCoder-27B is released under the Apache License 2.0.

MusaCoder-27B is initialized from Qwen3.6-27B and further trained on top of it. Users should comply with the license terms of MusaCoder-27B as well as the applicable license terms of upstream models and third-party components.

## Citation

If you find MusaCoder useful, please cite:

```bibtex
@article{cheng2026musacoder,
  title={MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU},
  author={Cheng, Kun and Lu, Songshuo and Liao, Sicong and Li, Tankun and Zhang, Yafei and Yang, Dong and Lv, Qiheng and Wang, Hua and Chen, Zhi and Tang, Yaohua},
  journal={arXiv preprint arXiv:2606.04847},
  year={2026},
  eprint={2606.04847},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.04847}
}
```

## Acknowledgements

MusaCoder is developed by Moore Threads AI. We thank the open-source community for advancing GPU programming, code generation, and execution-feedback learning. We also acknowledge the upstream base model and software ecosystems that make this work possible.