---
language:
- en
...
---

# AscendKernelGen/KernelGen-LM-4B

![License](https://img.shields.io/badge/License-Apache-yellow)
<!-- [![arXiv](https://img.shields.io/badge/arXiv-2601.07160-b31b1b.svg)](https://arxiv.org/abs/2601.07160) -->

KernelGen-LM-4B is a state-of-the-art domain-adaptive large language model specialized for low-level NPU kernel generation, specifically for the Huawei Ascend architecture using the AscendC programming language. Built upon the Qwen3-4B backbone, it is trained on the Ascend-CoT dataset and refined via reinforcement learning with execution feedback.

<!-- **Other artifacts:**
* The **AscendKernelGen Technical Report** is published at https://arxiv.org/abs/2601.07160.
* The **NPUKernelBench** evaluation framework is published at https://git.openi.org.cn/PCL-Benchmark/NPUKernelBench. -->

## Introduction

Our framework, **AscendKernelGen (AKGen)**, bridges the gap between general-purpose code generation and hardware-specific programming through a closed-loop system of data construction, training, and evaluation. Key innovations include:

* **Ascend-CoT Dataset:** A high-quality, domain-specific dataset incorporating **Chain-of-Thought (CoT)** reasoning. It combines documentation-based reasoning, code-centric reasoning derived from real-world kernel implementations, and general reasoning chains to capture the structured logic required for low-level NPU programming.
* **Domain-Adaptive Post-Training:** A two-stage optimization process that yields **KernelGen-LM**. We first employ **Supervised Fine-Tuning (SFT)** with error-derived supervision (correcting API misuse and numerical errors). This is followed by **Reinforcement Learning (RL)** using Direct Preference Optimization (DPO), driven by execution-based correctness and performance signals.
* **Hardware-Grounded Evaluation:** Validated using **NPUKernelBench**, a comprehensive benchmark that assesses compilation success, functional correctness, and performance (latency) on real Ascend hardware across varying complexity levels.
* **Performance:** The model demonstrates siginificant improvement on complex Level-2 kernels compared to baselines, and effectively solving tasks where general-purpose models (like Qwen3, Llama3.1) fail completely.