atlas-nvfp4-paged-attention

Paged-KV attention kernels for the full-attention layers of Qwen3.6 hybrid models on NVIDIA GB10 (DGX Spark, SM121).

Ops

Op	KV format	Use
`paged_decode_attn_bf16`	BF16	Reference / debugging
`paged_decode_attn_fp8`	FP8 E5M2 + scales	Mainline FP8 deployment
`paged_decode_attn_nvfp4`	Block-scaled E2M1	NVFP4 deployment
`rms_norm`	BF16	Pre-attention / pre-FFN norm
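
As a rough illustration of what a paged decode-attention op computes, here is a minimal pure-Python reference for one sequence with grouped-query attention. It is a sketch only: the function name `paged_decode_attention_ref`, the argument names, and the cache layout `[num_blocks][block_size][num_kv_heads][head_dim]` are assumptions made for this example and are not the signatures or layouts of the ops listed above.

```python
import math


def paged_decode_attention_ref(q, k_cache, v_cache, block_table, seq_len, block_size):
    """Reference decode attention over a paged KV cache (hypothetical layout).

    q:           [num_q_heads][head_dim], query of the single new token
    k_cache:     [num_blocks][block_size][num_kv_heads][head_dim]
    v_cache:     same layout as k_cache
    block_table: maps logical block index -> physical block index
    seq_len:     number of valid cached tokens for this sequence
    """
    num_q_heads = len(q)
    head_dim = len(q[0])
    num_kv_heads = len(k_cache[0][0])
    group = num_q_heads // num_kv_heads  # query heads sharing one KV head
    scale = 1.0 / math.sqrt(head_dim)

    out = []
    for h in range(num_q_heads):
        kv_h = h // group
        # Gather scores token by token, following the block table.
        scores = []
        for t in range(seq_len):
            blk = block_table[t // block_size]
            off = t % block_size
            k = k_cache[blk][off][kv_h]
            scores.append(scale * sum(a * b for a, b in zip(q[h], k)))
        # Numerically stable softmax over the cached tokens.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        # Weighted sum of the cached values.
        acc = [0.0] * head_dim
        for t, w in enumerate(weights):
            blk = block_table[t // block_size]
            off = t % block_size
            v = v_cache[blk][off][kv_h]
            for d in range(head_dim):
                acc[d] += (w / denom) * v[d]
        out.append(acc)
    return out
```

The block table is what makes the cache "paged": logically consecutive tokens may live in non-adjacent physical blocks, so every token position is translated through `block_table` before its key or value is read.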

Prefill counterparts (`inferspark_prefill_paged*`) are compiled into the shared object, but their Torch bindings will not ship until the next iteration, once the chunked-prefill scheduling story is settled.

Hardware

GB10 only (sm_121f, compute capability 12.1).

The NVFP4 path uses Atlas's software E2M1 conversion, because the `cvt.rn.satfinite.e2m1x2.f32` instruction is missing on SM121.
`block_size=16` and `head_dim=256` are the only layouts that ship today.
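
To make the software-conversion idea concrete, the following is a minimal Python sketch of float-to-E2M1 rounding with round-to-nearest-even and saturation, which is the behavior the `.rn.satfinite` modifiers name. It is not Atlas's implementation: the helper names are hypothetical, the nibble order in `pack_e2m1x2` is an assumption, NaN inputs are not handled, and the per-block scaling that NVFP4 applies before this step is left out.

```python
import math

# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit),
# indexed by the 3-bit magnitude code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_e2m1(x):
    """Return the 4-bit E2M1 code for x (round-to-nearest-even, saturating)."""
    sign = 0x8 if math.copysign(1.0, x) < 0 else 0x0
    mag = min(abs(x), E2M1_VALUES[-1])  # saturate to the largest finite value
    best = 0
    for code in range(1, 8):
        d_best = abs(mag - E2M1_VALUES[best])
        d_code = abs(mag - E2M1_VALUES[code])
        # On an exact tie, prefer the code whose mantissa bit is 0 (even code).
        if d_code < d_best or (d_code == d_best and code % 2 == 0):
            best = code
    return sign | best


def dequantize_e2m1(code):
    """Map a 4-bit E2M1 code back to a Python float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_VALUES[code & 0x7]


def pack_e2m1x2(lo, hi):
    """Pack two floats into one byte; low-nibble-first order is an assumption."""
    return (quantize_e2m1(hi) << 4) | quantize_e2m1(lo)
```

For example, `quantize_e2m1(2.5)` lands on the tie between 2.0 and 3.0 and resolves to 2.0, while any magnitude above 6.0 saturates to 6.0.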

Models tested

Model	Attention layers	Heads (Q:KV)	Head dim
Qwen/Qwen3.6-27B	16	24:4	256
Qwen/Qwen3.6-35B-A3B	10	16:2	256
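
The table above is enough to estimate the KV-cache payload per token for each format. The sketch below does that arithmetic; the `AttnConfig` type and function name are hypothetical, and the figures deliberately exclude scale factors, whose storage cost depends on details this page does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    attn_layers: int  # full-attention layers only
    q_heads: int
    kv_heads: int
    head_dim: int


# Values copied from the "Models tested" table.
MODELS = {
    "Qwen/Qwen3.6-27B": AttnConfig(16, 24, 4, 256),
    "Qwen/Qwen3.6-35B-A3B": AttnConfig(10, 16, 2, 256),
}


def kv_payload_bytes_per_token(cfg, bits_per_elem):
    """Bytes of K and V payload per cached token, excluding any scales."""
    elems = 2 * cfg.attn_layers * cfg.kv_heads * cfg.head_dim  # K and V
    return elems * bits_per_elem // 8


for name, cfg in MODELS.items():
    bf16 = kv_payload_bytes_per_token(cfg, 16)
    fp8 = kv_payload_bytes_per_token(cfg, 8)
    fp4 = kv_payload_bytes_per_token(cfg, 4)
    print(f"{name}: GQA group={cfg.q_heads // cfg.kv_heads}, "
          f"BF16={bf16} B, FP8={fp8} B, E2M1={fp4} B per token")
# Qwen/Qwen3.6-27B: GQA group=6, BF16=65536 B, FP8=32768 B, E2M1=16384 B per token
# Qwen/Qwen3.6-35B-A3B: GQA group=8, BF16=20480 B, FP8=10240 B, E2M1=5120 B per token
```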

License

AGPL-3.0-only.

OS: linux

Arch: aarch64