DashLuuu committed on
Commit 7d8f00c · verified · 1 Parent(s): 7ae415a

Create README.md

  1. README.md +308 -0
README.md ADDED
@@ -0,0 +1,308 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ base_model:
+ - Qwen/Qwen3.6-27B
+ pipeline_tag: reinforcement-learning
+ tags:
+ - CUDA
+ - MUSA
+ - GPU-Kernel
+ - Reinforcement-Learning
+ ---
+
+
+
+ <div align="left">
+ <img src="./assets/moore_threads_logo.png" width="120" alt="Moore Threads Logo" />
+ </div>
+
+ <!-- <h1 align="center">MusaCoder-27B</h1> -->
+
+ <h1 align="center">
+ <strong>MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU</strong>
+ </h1>
+
+ <!-- <p align="center">
+ Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, <br>
+ Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang
+ </p> -->
+
+ <p align="center">
+ <a href="https://arxiv.org/abs/2606.04847">📄 Paper</a>
+ </p>
+
+ ---
+
+ <div align="center">
+ <img src="./assets/kernelbench_bar.png" width="900" alt="KernelBench Benchmark Results" />
+ </div>
+
+ # MusaCoder-27B
+
+ > This repository contains model weights and configuration files for **MusaCoder-27B**, a specialized code generation model for native GPU kernel synthesis.
+ >
+ > MusaCoder-27B is designed to generate CUDA/MUSA native kernels from PyTorch reference implementations, with a focus on compilability, numerical correctness, anti-fallback legality, and empirical speedup.
+
+ ## Introduction
+
+ **MusaCoder-27B** is a 27B-parameter code model developed by Moore Threads for **PyTorch-to-CUDA/MUSA native kernel generation**. Unlike general-purpose code models, MusaCoder focuses on low-level GPU programming tasks, including tensor shape reasoning, thread/block mapping, memory indexing, boundary handling, reduction strategies, numerical stability, and performance-oriented kernel optimization.
+
+ The model is trained through a full-stack post-training pipeline consisting of:
+
+ * multi-source supervised fine-tuning data construction;
+ * verifier-filtered rejection fine-tuning;
+ * execution-feedback reinforcement learning;
+ * strict native-kernel verification with MooreEval;
+ * CUDA/MUSA-oriented kernel repair and optimization data.
+
+ MusaCoder-27B is released to promote the development of the MUSA open-source ecosystem, facilitate research on LLM-based code generation and GPU kernel synthesis, and encourage the community to explore cross-platform native kernel optimization.
+
+ ## Highlights
+
+ ### Native CUDA/MUSA Kernel Generation
+
+ MusaCoder-27B is optimized for generating native GPU kernels from PyTorch reference code. The model is not intended for generic business code generation; instead, it targets low-level kernel authoring where generated code must compile, run correctly, satisfy task constraints, and achieve measurable speedup.
+
+ ### MUSA-Oriented Kernel Synthesis
+
+ MusaCoder-27B supports PyTorch-to-MUSA kernel generation and can be used to explore automatic generation of native MUSA kernels from PyTorch reference programs. This gives the MUSA developer community a foundation model for kernel work and lowers the barrier to writing, validating, and optimizing MUSA kernels.
+
+ ### Full-Stack Training Pipeline
+
+ MusaCoder-27B is trained with a full-stack pipeline:
+
+ * **SFT** (supervised fine-tuning) teaches the model the PyTorch-to-kernel task format, common kernel implementation patterns, GPU programming knowledge, review capability, and performance analysis.
+ * **RFT** (rejection fine-tuning) uses execution-based verification to select correct model-generated implementations while preserving implementation diversity.
+ * **RL** (reinforcement learning) uses real compilation, execution, correctness checking, anti-fallback detection, and runtime measurement as reward signals.
+
+ ### Execution-Based Verification
+
+ MusaCoder is developed together with **MooreEval**, an execution-based verifier and reward environment. MooreEval checks whether generated kernels:
+
+ * can be parsed and compiled;
+ * pass randomized correctness tests against PyTorch reference outputs;
+ * avoid forbidden PyTorch/ATen computational fallbacks;
+ * achieve real runtime speedup under synchronized event timing.
+
+ ### RL Stabilization Techniques
+
+ The training pipeline incorporates three stabilization techniques:
+
+ * **PrimeEcho**: first-turn-anchored multi-turn reward for balancing repair ability and first-attempt quality.
+ * **Buffered Dynamic Retry**: converts all-failed groups into feedback-conditioned repair tasks.
+ * **MirrorPop**: sequence-level off-policy filtering based on absolute log-ratio deviation.
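
As a rough illustration of the last item, sequence-level filtering on absolute log-ratio deviation could look like the sketch below. The README does not give MirrorPop's exact formula, so the per-sequence statistic (the mean absolute difference between current-policy and behavior-policy token log-probabilities), the function names, and the threshold value are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def abs_log_ratio(new_logps: Sequence[float], old_logps: Sequence[float]) -> float:
    """Mean |log pi_new(token) - log pi_old(token)| over one sequence."""
    if len(new_logps) != len(old_logps) or not new_logps:
        raise ValueError("log-prob sequences must be non-empty and of equal length")
    return sum(abs(n - o) for n, o in zip(new_logps, old_logps)) / len(new_logps)


def filter_off_policy(
    batch: List[Tuple[Sequence[float], Sequence[float]]],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[int]:
    """Return indices of sequences whose deviation stays within `threshold`."""
    keep = []
    for i, (new_logps, old_logps) in enumerate(batch):
        if abs_log_ratio(new_logps, old_logps) <= threshold:
            keep.append(i)
    return keep


# Hypothetical token log-probabilities: (current policy, behavior policy).
batch = [
    ([-1.0, -2.0, -0.5], [-1.1, -1.9, -0.5]),  # close to the behavior policy
    ([-0.2, -0.1, -0.3], [-2.0, -1.5, -1.8]),  # far off-policy
]
kept = filter_off_policy(batch, threshold=0.5)
print(kept)
```

Because the statistic uses an absolute value, a sequence is dropped whether the current policy has moved toward or away from the sampled tokens; whole sequences are kept or discarded rather than individual tokens.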
+
+ ## Model Details
+
+ | Item | Description |
+ | --------------------- | -------------------------------------------------------- |
+ | Model name | MusaCoder-27B |
+ | Developer | Moore Threads |
+ | Base model | Qwen3.6-27B |
+ | Model type | Causal language model |
+ | Primary use | PyTorch-to-CUDA/MUSA native kernel generation |
+ | License | Apache License 2.0 |
+ | Training precision | bf16 |
+ | Recommended framework | Transformers / vLLM / SGLang-compatible inference |
+
+ ## Intended Use
+
+ MusaCoder-27B is intended for research and development in:
+
+ * PyTorch-to-CUDA/MUSA kernel generation;
+ * native GPU kernel synthesis;
+ * code generation for accelerator programming;
+ * automatic kernel repair and optimization;
+ * MUSA ecosystem development;
+ * execution-feedback reinforcement learning for code models.
+
+ A typical input contains a PyTorch reference implementation, input constraints, and generation requirements. The model is expected to produce a `ModelNew` implementation using custom native CUDA/MUSA kernels.
+
+ ## Quickstart
+
+ ### Installation
+
+ ```bash
+ pip install transformers accelerate torch
+ ```
+
+ For high-throughput inference, users may also use vLLM or SGLang depending on their deployment environment.
+
+ ### Basic Usage with Transformers
+
+ ````python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_name = "MooreThreads/MusaCoder-27B"
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     model_name,
+     trust_remote_code=True,
+ )
+
+ # Load the weights in bf16 and let accelerate place them on the available devices.
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ prompt = r"""
+ You are given a PyTorch reference implementation. Write a replacement ModelNew
+ that implements the same computation using a custom native CUDA/MUSA kernel.
+
+ Reference:
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class Model(nn.Module):
+     def forward(self, x):
+         return torch.relu(x)
+ ```
+
+ Requirements:
+
+ * Define class ModelNew(nn.Module).
+ * Do not use forbidden PyTorch/ATen compute fallback in ModelNew.forward().
+ * The implementation must be compilable and numerically correct.
+ """
+
+ messages = [
+     {"role": "user", "content": prompt},
+ ]
+
+ # Render the chat template as text, ending with the assistant generation prompt.
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+
+ inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=32000,
+     temperature=0.7,
+     top_p=0.95,
+     do_sample=True,
+ )
+
+ # Decode only the newly generated tokens, skipping the prompt.
+ response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
+ print(response)
+ ````
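
The decoded `response` usually mixes explanation with code, so a harness has to pull out the implementation before compiling it. The following is a minimal sketch, assuming the model wraps its answer in Markdown code fences and that the last fenced block defining `ModelNew` is the one to keep. The helper name `extract_model_new` is hypothetical and is not part of MooreEval.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built at runtime so this snippet can itself sit inside a fenced block
BLOCK_RE = re.compile(FENCE + r"[ \t]*([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_model_new(response: str) -> Optional[str]:
    """Return the last fenced block that defines ModelNew, or None if there is none."""
    candidates = [
        body
        for _lang, body in BLOCK_RE.findall(response)
        if re.search(r"^\s*class\s+ModelNew\b", body, re.MULTILINE)
    ]
    return candidates[-1].strip() if candidates else None


# Hypothetical model output used only to exercise the helper.
sample = (
    "Here is the kernel wrapper.\n"
    + FENCE + "python\n"
    + "import torch.nn as nn\n\nclass ModelNew(nn.Module):\n    pass\n"
    + FENCE + "\nDone."
)
code = extract_model_new(sample)
print(code)
```

Returning `None` instead of raising lets the caller record an extraction failure as its own outcome, separate from compilation or correctness failures.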
+
+ ## Prompt Format
+
+ We recommend using a structured prompt that includes:
+
+ 1. PyTorch reference code;
+ 2. input shape and dtype constraints;
+ 3. target backend, e.g., CUDA or MUSA;
+ 4. explicit instruction to define `ModelNew`;
+ 5. anti-fallback constraints;
+ 6. optional correctness and performance requirements.
+
+ Example:
+
+ ```text
+ Given the following PyTorch reference model, generate a new implementation
+ class ModelNew(nn.Module) that uses custom native CUDA/MUSA kernels.
+
+ The generated implementation must:
+ - match the PyTorch reference numerically;
+ - compile successfully;
+ - avoid forbidden PyTorch/ATen compute fallback in forward();
+ - handle boundary cases correctly;
+ - prefer native kernel implementations over high-level library calls.
+ ```
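
The six components listed above can also be assembled programmatically. The sketch below is one possible layout, not an official template: the function name `build_prompt`, its parameters, the section labels, and the sample input constraints are assumptions, and the requirement wording is adapted from the example.

```python
FENCE = "`" * 3  # avoids writing a literal fence inside this snippet


def build_prompt(reference_code, input_spec, backend="CUDA",
                 forbid_fallback=True, extra_requirements=()):
    """Assemble a structured kernel-generation prompt from its parts."""
    if backend not in ("CUDA", "MUSA"):
        raise ValueError("backend must be 'CUDA' or 'MUSA'")
    requirements = [
        "Define class ModelNew(nn.Module).",
        "Match the PyTorch reference numerically and compile successfully.",
    ]
    if forbid_fallback:
        requirements.append(
            "Do not use forbidden PyTorch/ATen compute fallback in ModelNew.forward()."
        )
    requirements.extend(extra_requirements)
    lines = [
        "Given the following PyTorch reference model, write ModelNew using "
        f"custom native {backend} kernels.",
        "",
        "Reference:",
        FENCE + "python",
        reference_code.strip(),
        FENCE,
        "",
        f"Input constraints: {input_spec}",
        f"Target backend: {backend}",
        "",
        "Requirements:",
    ]
    lines += [f"- {r}" for r in requirements]
    return "\n".join(lines)


reference = """
import torch
import torch.nn as nn

class Model(nn.Module):
    def forward(self, x):
        return torch.relu(x)
"""
# The shape and dtype below are made-up example constraints.
prompt = build_prompt(reference, "x: float32, shape (1024, 4096)", backend="MUSA",
                      extra_requirements=["Handle boundary cases correctly."])
print(prompt)
```

Keeping the backend and the anti-fallback rule as explicit parameters makes it harder to send a prompt that silently omits one of the recommended components.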
+
+ ## Evaluation
+
+ MusaCoder-27B is evaluated using the MooreEval protocol on KernelBench-style tasks.
+
+ The evaluation checks:
+
+ * code extraction and interface validity;
+ * compilation success;
+ * randomized correctness against PyTorch reference;
+ * forbidden PyTorch/ATen fallback detection;
+ * synchronized runtime measurement;
+ * Faster Rate with a speedup threshold of `>1.1x`.
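
To make the last criterion concrete, the sketch below computes a Faster Rate from per-sample timings. It assumes that speedup is the reference runtime divided by the generated kernel's runtime, that only correct samples can count as faster, and that the rate is taken over all samples. The README does not spell out these details, and the record layout is invented for illustration.

```python
SPEEDUP_THRESHOLD = 1.1  # strict ">" threshold quoted in the evaluation list


def faster_rate(samples, threshold=SPEEDUP_THRESHOLD):
    """Percentage of samples that are correct and beat the reference by more than threshold."""
    if not samples:
        return 0.0
    faster = 0
    for s in samples:
        speedup = s["ref_ms"] / s["kernel_ms"]
        if s["correct"] and speedup > threshold:
            faster += 1
    return 100.0 * faster / len(samples)


# Hypothetical timing records in milliseconds, not measured results.
samples = [
    {"correct": True,  "ref_ms": 2.0,  "kernel_ms": 1.0},  # 2.0x: counts as faster
    {"correct": True,  "ref_ms": 1.0,  "kernel_ms": 1.0},  # 1.0x: not faster
    {"correct": False, "ref_ms": 3.0,  "kernel_ms": 1.0},  # incorrect: never counts
    {"correct": True,  "ref_ms": 1.05, "kernel_ms": 1.0},  # 1.05x: below threshold
]
rate = faster_rate(samples)
print(rate)
```

Requiring correctness before counting a speedup keeps a fast but wrong kernel from inflating the metric.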
+
+ ### KernelBench Results
+
+ | Model | Overall Pass@8 | Overall Avg.@8 | Faster vs. Eager | Faster vs. Compile |
+ | -------------------- | -------------: | -------------: | ---------------: | -----------------: |
+ | Kimi K2.6 | 84.0 | 69.10 | 3.3 | 1.4 |
+ | GLM-5.1 | 85.6 | 76.25 | 7.4 | 3.9 |
+ | DeepSeek-V4_ProMax | 84.8 | 60.05 | 5.7 | 3.0 |
+ | Claude Opus 4.7 | 87.2 | 77.30 | 11.8 | 7.5 |
+ | Qwen3.6-27B | 67.2 | 35.60 | 3.4 | 1.6 |
+ | MusaCoder-27B-SFT | 84.8 | 79.40 | 6.3 | 4.1 |
+ | **MusaCoder-27B-RL** | **93.2** | **88.60** | **15.0** | **9.2** |
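
For readers unfamiliar with the column names, one common reading is that Pass@8 counts a task as solved when any of its 8 samples passes, while Avg.@8 averages the per-sample pass rate. The sketch below implements that reading. It is an assumption about the metrics, not a definition taken from the paper, and the task names and outcomes are made up.

```python
def pass_at_k(results):
    """Percentage of tasks with at least one passing sample."""
    return 100.0 * sum(any(r) for r in results.values()) / len(results)


def avg_at_k(results):
    """Mean per-sample pass rate across tasks, as a percentage."""
    per_task = [sum(r) / len(r) for r in results.values()]
    return 100.0 * sum(per_task) / len(per_task)


# Hypothetical outcomes: 8 samples per task, True means the sample passed.
results = {
    "relu":    [True] * 8,
    "softmax": [True, False, False, False, True, False, False, False],
    "conv2d":  [False] * 8,
    "matmul":  [True, True, True, True, False, False, False, False],
}
print(pass_at_k(results), avg_at_k(results))
```

Under this reading, a large gap between the two numbers means a model can often solve a task within 8 attempts but rarely does so on any single attempt.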
+
+ ### MUSA KernelBench Results
+
+ | Model | Overall Pass@8 | Overall Avg.@8 | Faster vs. Eager |
+ | -------------------- | -------------: | -------------: | ---------------: |
+ | DeepSeek-V4-Pro | 92.0 | 56.9 | 5.7 |
+ | GLM-5.1 | 88.0 | 66.4 | 6.9 |
+ | MusaCoder-27B-SFT | 79.6 | 63.5 | 5.2 |
+ | **MusaCoder-27B-RL** | **92.4** | **81.7** | **12.5** |
+
+ ## Notes on Generated Code
+
+ Generated kernels should always be compiled and tested before use. GPU kernel generation is a high-risk code generation task because small mistakes in indexing, boundary handling, dtype conversion, or memory layout can lead to incorrect outputs, runtime failures, or illegal memory access.
+
+ We recommend validating generated code with:
+
+ * randomized correctness tests;
+ * multiple input shapes and dtypes;
+ * non-contiguous tensor cases when applicable;
+ * runtime profiling;
+ * forbidden fallback detection.
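
Forbidden fallback detection can start with a static pass over the generated source. The sketch below uses Python's `ast` module to flag `torch.*` and `F.*` compute calls inside `ModelNew.forward`. The deny-list and the function name `find_fallback_calls` are illustrative assumptions; MooreEval's actual rules are not described here and may differ, for example by also checking behavior at runtime.

```python
import ast

# Illustrative deny-list; a real verifier would define its own.
FORBIDDEN = {"relu", "matmul", "softmax", "conv2d", "sum", "mean"}


def find_fallback_calls(source):
    """Return names of forbidden torch/F calls found in ModelNew.forward."""
    hits = []
    tree = ast.parse(source)
    for cls in ast.walk(tree):
        if not (isinstance(cls, ast.ClassDef) and cls.name == "ModelNew"):
            continue
        for fn in cls.body:
            if not (isinstance(fn, ast.FunctionDef) and fn.name == "forward"):
                continue
            for node in ast.walk(fn):
                if (isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Attribute)
                        and isinstance(node.func.value, ast.Name)
                        and node.func.value.id in ("torch", "F")
                        and node.func.attr in FORBIDDEN):
                    hits.append(f"{node.func.value.id}.{node.func.attr}")
    return hits


bad = """
class ModelNew(nn.Module):
    def forward(self, x):
        return torch.relu(x)
"""
# `relu_ext` stands in for a hypothetical compiled extension module.
good = """
class ModelNew(nn.Module):
    def forward(self, x):
        return relu_ext.relu_forward(x)
"""
print(find_fallback_calls(bad), find_fallback_calls(good))
```

A static check like this is cheap but easy to evade, for instance through method calls such as `x.relu()` or aliased imports, so it is best treated as a first filter rather than a complete detector.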
+
+ ## Limitations
+
+ MusaCoder-27B is specialized for GPU kernel generation and may not be optimal for general-purpose chat or application development. The model may still generate code that:
+
+ * fails to compile;
+ * produces incorrect results for unseen edge cases;
+ * uses inefficient thread/block layouts;
+ * relies on disallowed high-level fallback APIs;
+ * requires additional engineering adaptation for specific platforms or compiler versions.
+
+ Users should treat generated code as a candidate implementation that must be verified before deployment.
+
+ ## License
+
+ MusaCoder-27B is released under the Apache License 2.0.
+
+ MusaCoder-27B is initialized from and trained based on Qwen3.6-27B. Users should comply with the license terms of MusaCoder-27B as well as applicable license terms of upstream models and third-party components.
+
+ ## Citation
+
+ If you find MusaCoder useful, please cite:
+
+ ```bibtex
+ @article{cheng2026musacoder,
+   title={MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU},
+   author={Cheng, Kun and Lu, Songshuo and Liao, Sicong and Li, Tankun and Zhang, Yafei and Yang, Dong and Lv, Qiheng and Wang, Hua and Chen, Zhi and Tang, Yaohua},
+   journal={arXiv preprint arXiv:2606.04847},
+   year={2026},
+   eprint={2606.04847},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2606.04847}
+ }
+ ```
+
+ ## Acknowledgements
+
+ MusaCoder is developed by Moore Threads AI. We thank the open-source community for advancing GPU programming, code generation, and execution-feedback learning. We also acknowledge the upstream base model and software ecosystems that make this work possible.