---
language:
- en
---

# AscendKernelGen/KernelGen-LM-4B


<!-- [](https://arxiv.org/abs/2601.07160) -->


KernelGen-LM-4B is a state-of-the-art domain-adaptive large language model specialized for low-level NPU kernel generation, specifically for the Huawei Ascend architecture using the AscendC programming language. Built upon the Qwen3-4B backbone, it is trained on the Ascend-CoT dataset and refined via reinforcement learning with execution feedback.

<!-- **Other artifacts:**
* The **AscendKernelGen Technical Report** is published at https://arxiv.org/abs/2601.07160.
* The **NPUKernelBench** evaluation framework is published at https://git.openi.org.cn/PCL-Benchmark/NPUKernelBench. -->


## Introduction


Our framework, **AscendKernelGen (AKGen)**, bridges the gap between general-purpose code generation and hardware-specific programming through a closed-loop system of data construction, training, and evaluation. Key innovations include:

* **Ascend-CoT Dataset:** A high-quality, domain-specific dataset incorporating **Chain-of-Thought (CoT)** reasoning. It combines documentation-based reasoning, code-centric reasoning derived from real-world kernel implementations, and general reasoning chains to capture the structured logic required for low-level NPU programming.
* **Domain-Adaptive Post-Training:** A two-stage optimization process that yields **KernelGen-LM**. We first employ **Supervised Fine-Tuning (SFT)** with error-derived supervision (correcting API misuse and numerical errors). This is followed by **Reinforcement Learning (RL)** using Direct Preference Optimization (DPO), driven by execution-based correctness and performance signals.
* **Hardware-Grounded Evaluation:** Validated using **NPUKernelBench**, a comprehensive benchmark that assesses compilation success, functional correctness, and performance (latency) on real Ascend hardware across varying complexity levels.
* **Performance:** The model demonstrates significant improvement on complex Level-2 kernels compared to baselines, and effectively solves tasks where general-purpose models (such as Qwen3 and Llama3.1) fail completely.
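
## Usage

Since the model is built on the Qwen3-4B backbone, it can presumably be loaded with the Hugging Face `transformers` library like any Qwen3-family checkpoint. The sketch below is a minimal, unverified example: the repo id `AscendKernelGen/KernelGen-LM-4B` is assumed from the card title, and the prompt wording is illustrative, not a documented prompt format.

```python
MODEL_ID = "AscendKernelGen/KernelGen-LM-4B"  # repo id assumed from the card title


def build_prompt(task: str) -> str:
    """Assemble a kernel-generation request (wording is illustrative only)."""
    return (
        "Write an AscendC kernel for the Huawei Ascend NPU.\n"
        f"Task: {task}\n"
        "Return only the kernel source code."
    )


def generate_kernel(task: str, max_new_tokens: int = 1024) -> str:
    """Generate a kernel with transformers (requires the model weights)."""
    # Heavy dependencies are imported lazily so build_prompt stays standalone.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # Use the Qwen3-style chat template shipped with the tokenizer.
    messages = [{"role": "user", "content": build_prompt(task)}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )


if __name__ == "__main__":
    print(generate_kernel("element-wise addition of two float16 vectors"))
```

Decoding parameters (temperature, top-p) are left at library defaults here; for kernel generation, greedy or low-temperature sampling is a common choice since correctness matters more than diversity.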