atlas-nvfp4-dense-gemm

Dense NVFP4/FP8 GEMM kernels for the attention projections and dense FFN of Qwen3.6-27B on NVIDIA GB10 (DGX Spark, SM121).

Ops

| Op | Use |
| --- | --- |
| `w4a16_gemm` | NVFP4 weight × BF16 activation (standard layout) |
| `w4a16_gemm_t` | NVFP4 weight × BF16 activation (transposed B) |
| `predequant_nvfp4_to_fp8` | Materialize NVFP4 weight as FP8 E4M3 |
| `fp8_gemm_t` | FP8 weight × BF16 activation (transposed B) |
| `bf16_to_fp8` | BF16 → FP8 E4M3 pair-wise conversion |
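
As a reference for what the NVFP4 weight ops have to undo, here is a minimal pure-Python sketch of NVFP4 dequantization: 4-bit E2M1 codes, one scale per block of 16 elements, and an optional per-tensor scale. It is not the kernel code from this repo. The function names, the low-nibble-first packing order, and passing block scales as already-decoded floats (NVFP4 stores them as FP8 E4M3) are assumptions made for illustration.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code (bit 3 is the sign).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # number of NVFP4 elements that share one block scale


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) to a float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequant_nvfp4(packed: bytes, block_scales: list, tensor_scale: float = 1.0) -> list:
    """Dequantize packed NVFP4 codes to floats.

    Assumptions (hypothetical, not taken from this repo):
    - two codes per byte, low nibble first;
    - block_scales holds one already-decoded float per 16 elements.
    """
    out = []
    for i, byte in enumerate(packed):
        for j, code in enumerate((byte & 0xF, byte >> 4)):
            idx = 2 * i + j  # flat element index
            out.append(decode_e2m1(code) * block_scales[idx // BLOCK] * tensor_scale)
    return out
```

For example, `dequant_nvfp4(bytes([0x21, 0xF9]), [2.0])` decodes the codes `0x1, 0x2, 0x9, 0xF` and scales each value by 2.0.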

Hardware

These kernels target GB10 only (`sm_121f`, compute capability 12.1). The tile shapes `M_TILE=64`, `N_TILE_SM=64`, and `N_TILE_LG=128` are tuned for the 27B layout (hidden=5120, intermediate=17408, head_dim=256).
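
The tile shapes above translate into output-tile counts by ceiling division. The sketch below shows that arithmetic only; `ceil_div` and `grid` are hypothetical helpers, and which N tile width each op actually selects is not stated here, so the caller passes it explicitly.

```python
M_TILE, N_TILE_SM, N_TILE_LG = 64, 64, 128  # tile shapes quoted in the text


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def grid(m: int, n: int, n_tile: int) -> tuple:
    """Count (M tiles, N tiles) covering an m x n GEMM output.

    n_tile is N_TILE_SM or N_TILE_LG; the choice per op is an assumption
    left to the caller.
    """
    return ceil_div(m, M_TILE), ceil_div(n, n_tile)
```

Both 5120 and 17408 divide evenly by 64 and by 128, so neither N tile width leaves a partial tile on these dimensions: `grid(1, 5120, N_TILE_SM)` gives 80 N tiles and `grid(1, 17408, N_TILE_LG)` gives 136.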

Model tested

| Model | Hidden | Intermediate | Heads (Q:KV) |
| --- | --- | --- | --- |
| Qwen/Qwen3.6-27B | 5120 | 17408 | 24:4 |
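
The numbers in this table, together with head_dim=256, imply the GEMM shapes sketched below. This assumes a plain grouped-query attention block and a gate/up/down FFN, which this README does not confirm for Qwen3.6-27B. The projection names are hypothetical labels, and the byte estimate counts only the 4-bit codes plus one 1-byte block scale per 16 elements.

```python
HIDDEN, INTERMEDIATE, HEAD_DIM = 5120, 17408, 256  # values from the text
Q_HEADS, KV_HEADS = 24, 4


def projection_shapes() -> dict:
    """(K, N) weight shapes under an assumed GQA + gate/up/down layout."""
    q_out = Q_HEADS * HEAD_DIM    # 6144
    kv_out = KV_HEADS * HEAD_DIM  # 1024
    return {
        "q_proj": (HIDDEN, q_out),
        "k_proj": (HIDDEN, kv_out),
        "v_proj": (HIDDEN, kv_out),
        "o_proj": (q_out, HIDDEN),
        "gate_proj": (HIDDEN, INTERMEDIATE),
        "up_proj": (HIDDEN, INTERMEDIATE),
        "down_proj": (INTERMEDIATE, HIDDEN),
    }


def nvfp4_weight_bytes(k: int, n: int, block: int = 16) -> int:
    """Bytes for a k x n NVFP4 weight: packed 4-bit codes plus block scales."""
    elements = k * n
    return elements // 2 + elements // block
```

Note that the query width (24 × 256 = 6144) differs from the hidden size (5120) under this assumption, so `q_proj` and `o_proj` are not square.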

License

AGPL-3.0-only.

OS: linux

Arch: aarch64