atlas-nvfp4-paged-attention

Paged-KV attention kernels for the full-attention layers of Qwen3.6 hybrid models on NVIDIA GB10 (DGX Spark, SM121).
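
To make "paged-KV decode attention" concrete, here is a minimal pure-Python sketch of the computation for a single head. It is not this package's API: the function name `paged_decode_attn_ref`, its argument layout, and the nested-list cache format are all hypothetical, chosen for readability. It assumes the KV cache is split into fixed-size physical blocks and that a per-sequence block table maps logical block indices to physical ones.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block, matching the block_size this README lists


def paged_decode_attn_ref(q, k_cache, v_cache, block_table, seq_len):
    """Hypothetical single-head reference; not the kernel's actual signature.

    q:           query vector for the one token being decoded, [head_dim]
    k_cache:     physical key blocks, [num_blocks][BLOCK_SIZE][head_dim]
    v_cache:     physical value blocks, same shape as k_cache
    block_table: logical block index -> physical block index
    seq_len:     number of cached tokens to attend over (assumed >= 1)
    """
    scale = 1.0 / math.sqrt(len(q))
    scores, values = [], []
    for pos in range(seq_len):
        block = block_table[pos // BLOCK_SIZE]  # physical block holding pos
        slot = pos % BLOCK_SIZE                 # offset inside that block
        k = k_cache[block][slot]
        scores.append(scale * sum(a * b for a, b in zip(q, k)))
        values.append(v_cache[block][slot])

    # Numerically stable softmax over the cached positions.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)

    # Weighted sum of the gathered value vectors.
    out = [0.0] * len(q)
    for w, v in zip(weights, values):
        for i, v_i in enumerate(v):
            out[i] += (w / total) * v_i
    return out
```

The only difference from dense attention is the indirection through `block_table`: logically adjacent tokens may live in non-adjacent physical blocks. Grouped-query attention and quantized KV formats are left out of this sketch.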

Ops

Op                       KV format           Use
paged_decode_attn_bf16   BF16                Reference / debugging
paged_decode_attn_fp8    FP8 E5M2 + scales   Mainline FP8 deployment
paged_decode_attn_nvfp4  Block-scaled E2M1   NVFP4 deployment
rms_norm                 BF16                Pre-attention / pre-FFN norm
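
As an illustration of what "block-scaled E2M1" means, the sketch below quantizes one group of values to the E2M1 grid with a single shared scale and then dequantizes it. It makes several assumptions: a group size of 16 (the usual NVFP4 convention), a scale kept as a plain Python float (NVFP4 normally stores block scales in FP8 E4M3), and simple nearest-value rounding. The helper names are hypothetical, and this package's actual cache layout may differ.

```python
# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP = 16  # elements sharing one scale (assumed, per the usual NVFP4 convention)


def quantize_block(xs):
    """Return (scale, values on the E2M1 grid) for one group of floats."""
    amax = max(abs(x) for x in xs)
    # Map the largest magnitude in the group onto 6.0, the largest E2M1 value.
    scale = amax / 6.0 if amax > 0 else 1.0
    quantized = []
    for x in xs:
        target = abs(x) / scale
        mag = min(E2M1_GRID, key=lambda g: abs(target - g))  # nearest grid point
        quantized.append(-mag if x < 0 else mag)
    return scale, quantized


def dequantize_block(scale, quantized):
    """Undo the scaling; the rounding error from quantize_block remains."""
    return [scale * v for v in quantized]
```

A group whose scaled values land exactly on the grid round-trips without error. Any other input picks up rounding error that grows with the group's largest magnitude.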

Prefill counterparts (inferspark_prefill_paged*) are already compiled into the shared object, but their Torch bindings are not exposed yet. The bindings ship in the next iteration, once the chunked-prefill scheduling design is settled.

Hardware

GB10 only (sm_121f, compute capability 12.1).

  • The NVFP4 path uses Atlas's software E2M1 conversion, because the PTX instruction cvt.rn.satfinite.e2m1x2.f32 is not available on SM121.
  • block_size=16 and head_dim=256 are the layouts that ship today.
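
A software replacement for the missing conversion has to do three things: round each float to the nearest E2M1 value, saturate at the largest finite magnitude (6.0), and pack two 4-bit codes into one byte. The sketch below shows that logic in Python for clarity. It is not Atlas's implementation; the function names are hypothetical, NaN inputs are not modelled, and putting the first argument in the upper nibble is an assumption of this example.

```python
import math

# E2M1 magnitudes indexed by their 3-bit code (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def f32_to_e2m1(x):
    """Round x to a 4-bit E2M1 code: round-to-nearest-even, saturating at 6.0."""
    sign = 0x8 if math.copysign(1.0, x) < 0 else 0x0
    mag = min(abs(x), 6.0)  # satfinite: clamp to the largest finite value
    best = 0
    for code in range(1, 8):
        d_best = abs(mag - E2M1_MAGNITUDES[best])
        d_code = abs(mag - E2M1_MAGNITUDES[code])
        # On an exact tie, prefer the code whose mantissa bit (the LSB) is 0.
        if d_code < d_best or (d_code == d_best and code % 2 == 0):
            best = code
    return sign | best


def pack_e2m1x2(a, b):
    """Pack two values into one byte: a in the upper nibble, b in the lower."""
    return (f32_to_e2m1(a) << 4) | f32_to_e2m1(b)
```

A GPU version would normally replace the loop with bit manipulation or a few comparisons per element, but the rounding and saturation rules are the same.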

Models tested

Model                 Attention layers  Heads (Q:KV)  Head dim
Qwen/Qwen3.6-27B      16                24:4          256
Qwen/Qwen3.6-35B-A3B  10                16:2          256
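
The figures in the table translate directly into grouped-query ratios and KV-cache footprints. The short calculation below works them out. It assumes that only the listed full-attention layers hold paged KV, that K and V are each stored once per KV head, and that BF16 takes 2 bytes per element; scale storage for the FP8 and NVFP4 formats is ignored. The helper name `kv_bytes_per_token` is hypothetical.

```python
def kv_bytes_per_token(attn_layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all full-attention layers."""
    # The factor 2 accounts for storing both K and V.
    return attn_layers * 2 * kv_heads * head_dim * bytes_per_elem


# Values copied from the "Models tested" table.
MODELS = {
    "Qwen/Qwen3.6-27B": dict(attn_layers=16, q_heads=24, kv_heads=4, head_dim=256),
    "Qwen/Qwen3.6-35B-A3B": dict(attn_layers=10, q_heads=16, kv_heads=2, head_dim=256),
}

for name, m in MODELS.items():
    group = m["q_heads"] // m["kv_heads"]  # query heads sharing one KV head
    bf16 = kv_bytes_per_token(m["attn_layers"], m["kv_heads"], m["head_dim"], 2)
    print(f"{name}: {group} query heads per KV head, "
          f"{bf16 // 1024} KiB of BF16 KV per token")
```

Under these assumptions the 27B model has 6 query heads per KV head and needs 64 KiB of BF16 KV per token, while the 35B-A3B model has 8 and needs 20 KiB. Each 4-bit E2M1 element takes a quarter of the space of a BF16 element, before block scales are added.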

License

AGPL-3.0-only.

OS: linux
Arch: aarch64