atlas-nvfp4-dense-gemm

Dense NVFP4/FP8 GEMM kernels for the attention projections and dense FFN of Qwen3.6-27B on NVIDIA GB10 (DGX Spark, SM121).

Ops

| Op | Use |
| --- | --- |
| w4a16_gemm | NVFP4 weight × BF16 activation (standard layout) |
| w4a16_gemm_t | NVFP4 weight × BF16 activation (transposed B) |
| predequant_nvfp4_to_fp8 | Materialize NVFP4 weight as FP8 E4M3 |
| fp8_gemm_t | BF16 activation × FP8 weight (transposed B) |
| bf16_to_fp8 | BF16 → FP8 E4M3 pair-wise conversion |
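
As a rough illustration of what the dequantization and FP8 conversion ops compute numerically, the plain-Python sketch below decodes one NVFP4 block (16 FP4 E2M1 values sharing one FP8 E4M3 scale, with an optional per-tensor scale) and rounds a float to the nearest E4M3 code. It is a reference sketch, not the kernel code: the function names, the low-nibble-first packing order, and the brute-force rounding (whose tie-breaking may differ from hardware round-to-nearest-even) are all assumptions made for illustration.

```python
# Reference sketch only: plain-Python numerics, not the CUDA kernels.

# The 8 non-negative magnitudes representable in FP4 E2M1 (codes 0..7).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (bit 3 = sign, bits 0-2 = magnitude index)."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_e4m3(byte):
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")  # E4M3 has no infinities; this pattern is NaN
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequant_nvfp4_block(packed, block_scale_byte, tensor_scale=1.0):
    """Dequantize one 16-element block stored as 8 packed bytes.

    Assumption: the low nibble of each byte holds the even-indexed element.
    """
    assert len(packed) == 8
    scale = decode_e4m3(block_scale_byte) * tensor_scale
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF) * scale)
        values.append(decode_e2m1(byte >> 4) * scale)
    return values


def encode_e4m3_nearest(x):
    """Return the E4M3 byte closest to x (brute force over all 256 codes).

    Out-of-range inputs land on the largest finite magnitude (448).
    """
    best_byte, best_err = 0, float("inf")
    for byte in range(256):
        value = decode_e4m3(byte)
        if value != value:  # skip the NaN patterns
            continue
        err = abs(value - x)
        if err < best_err:
            best_byte, best_err = byte, err
    return best_byte
```

Chaining `dequant_nvfp4_block` with `encode_e4m3_nearest` gives a slow but readable model of materializing an NVFP4 weight as FP8 E4M3, which can be handy for checking kernel output on small inputs.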

Hardware

GB10 only (sm_121f, compute capability 12.1). Tile shapes M_TILE=64, N_TILE_SM=64, N_TILE_LG=128 are tuned for the 27B layout (hidden=5120, intermediate=17408, head_dim=256).
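
To see how these tile shapes divide the 27B dimensions, the sketch below computes the number of output tiles per GEMM with a ceiling division. It assumes that M is the token count and N is the output feature dimension, and it does not claim which N tile the kernels pick for a given shape; `tile_grid` is a hypothetical helper, not part of the package.

```python
# Tile constants as stated in this README.
M_TILE = 64
N_TILE_SM = 64
N_TILE_LG = 128


def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return (a + b - 1) // b


def tile_grid(m, n, n_tile):
    """Number of (M, N) output tiles for an m x n result (hypothetical helper)."""
    return ceil_div(m, M_TILE), ceil_div(n, n_tile)


if __name__ == "__main__":
    for n in (5120, 17408):
        for n_tile in (N_TILE_SM, N_TILE_LG):
            # m=1 models a single-token decode step (an assumption).
            print(n, n_tile, tile_grid(1, n, n_tile))
```

Both 5120 and 17408 are exact multiples of 64 and 128, so neither N tile size leaves a partial tile along N for these dimensions.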

Model tested

| Model | Hidden | Intermediate | Heads (Q:KV) |
| --- | --- | --- | --- |
| Qwen/Qwen3.6-27B | 5120 | 17408 | 24:4 |
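
The dimensions above imply the following GEMM shapes and NVFP4 storage sizes, sketched here under loud assumptions: a conventional grouped-query attention layout in which each projection width is head count × head_dim, a gate/up/down FFN, one byte per two FP4 values, and one FP8 scale byte per 16-element block. The projection names and the `nvfp4_bytes` helper are hypothetical, and the actual model may lay out its projections differently.

```python
HIDDEN = 5120
INTERMEDIATE = 17408
HEAD_DIM = 256
Q_HEADS, KV_HEADS = 24, 4
NVFP4_BLOCK = 16  # elements sharing one FP8 scale

# (K, N) per projection. Assumption: standard GQA plus gate/up/down FFN.
GEMM_SHAPES = {
    "q_proj": (HIDDEN, Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (Q_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def nvfp4_bytes(k, n):
    """Return (packed FP4 bytes, FP8 block-scale bytes) for a k x n weight."""
    elements = k * n
    return elements // 2, elements // NVFP4_BLOCK


if __name__ == "__main__":
    for name, (k, n) in GEMM_SHAPES.items():
        packed, scales = nvfp4_bytes(k, n)
        print(f"{name}: K={k} N={n} packed={packed} B scales={scales} B")
```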

License

AGPL-3.0-only.

OS: linux
Arch: aarch64