---
language:
- en
---

# AscendKernelGen/KernelGen-LM-4B


<!-- [](https://arxiv.org/abs/2601.07160) -->


KernelGen-LM-4B is a state-of-the-art domain-adaptive large language model specialized for low-level NPU kernel generation, specifically for the Huawei Ascend architecture using the AscendC programming language. Built upon the Qwen3-4B backbone, it is trained on the Ascend-CoT dataset and refined via reinforcement learning with execution feedback.

<!-- **Other artifacts:**
* The **AscendKernelGen Technical Report** is published at https://arxiv.org/abs/2601.07160.
* The **NPUKernelBench** evaluation framework is published at https://git.openi.org.cn/PCL-Benchmark/NPUKernelBench. -->


## Introduction


Our framework, **AscendKernelGen (AKGen)**, bridges the gap between general-purpose code generation and hardware-specific programming through a closed-loop system of data construction, training, and evaluation. Key innovations include:

* **Ascend-CoT Dataset:** A high-quality, domain-specific dataset incorporating **Chain-of-Thought (CoT)** reasoning. It combines documentation-based reasoning, code-centric reasoning derived from real-world kernel implementations, and general reasoning chains to capture the structured logic required for low-level NPU programming.
* **Domain-Adaptive Post-Training:** A two-stage optimization process that yields **KernelGen-LM**. We first employ **Supervised Fine-Tuning (SFT)** with error-derived supervision (correcting API misuse and numerical errors). This is followed by **Reinforcement Learning (RL)** using Direct Preference Optimization (DPO), driven by execution-based correctness and performance signals.
* **Hardware-Grounded Evaluation:** Validated using **NPUKernelBench**, a comprehensive benchmark that assesses compilation success, functional correctness, and performance (latency) on real Ascend hardware across varying complexity levels.
* **Performance:** The model demonstrates significant improvement on complex Level-2 kernels compared to baselines, and effectively solves tasks where general-purpose models (such as Qwen3 and Llama3.1) fail completely.
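
## Usage

Since the model is built on the Qwen3-4B backbone, it can presumably be loaded with the Hugging Face `transformers` library like any Qwen3-family checkpoint. The sketch below is a minimal, unverified example: the repo id `AscendKernelGen/KernelGen-LM-4B` is assumed from the card title, and the prompt wording is illustrative, not a documented prompt format.

```python
MODEL_ID = "AscendKernelGen/KernelGen-LM-4B"  # repo id assumed from the card title


def build_prompt(task: str) -> str:
    """Assemble a kernel-generation request (wording is illustrative only)."""
    return (
        "Write an AscendC kernel for the Huawei Ascend NPU.\n"
        f"Task: {task}\n"
        "Return only the kernel source code."
    )


def generate_kernel(task: str, max_new_tokens: int = 1024) -> str:
    """Generate a kernel with transformers (requires the model weights)."""
    # Heavy dependencies are imported lazily so build_prompt stays standalone.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # Use the Qwen3-style chat template shipped with the tokenizer.
    messages = [{"role": "user", "content": build_prompt(task)}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )


if __name__ == "__main__":
    print(generate_kernel("element-wise addition of two float16 vectors"))
```

Decoding parameters (temperature, top-p) are left at library defaults here; for kernel generation, greedy or low-temperature sampling is a common choice since correctness matters more than diversity.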