Kernels: flashrt/MiniMaxAI-msa-blackwell

liangsu9988 committed on Jun 13

Commit 5a7fea3

1 parent: 513d68d

Update Blackwell MSA native API card

Files changed (3)

CARD.md +12 -8
README.md +12 -8
VALIDATION.md +206 -0

CARD.md CHANGED

@@ -39,7 +39,7 @@ msa = get_kernel(
 |---|---|
 | `sparse_decode_atten_func` | Available. Blackwell paged BF16/FP16 single-token decode wrapper. |
 | `SparseDecodePagedAttentionWrapper` | Available. `plan(...).run(...)` wrapper for the same decode path. |
-| `build_k2q_csr` | Available. Torch CSR construction fallback. |
 | `SparseK2qCsrBuilderSm100` | Available compatibility class; `build()` delegates to `build_k2q_csr`. |
 | `Nvfp4QuantizedTensor` | Available metadata dataclass. |
 | `quantize_bf16_to_nvfp4_128x4` | Available when Transformer Engine NVFP4 support is installed. |
@@ -48,8 +48,8 @@ msa = get_kernel(
 | `swizzle_nvfp4_scale_to_128x4` | Available scale-layout helper. |
 | `nvfp4_global_scale_from_amax` | Available scale helper. |
 | `sparse_atten_func` | Available. Official CSR sparse prefill API backed by the Blackwell Triton BF16/FP16 prefill kernel. |
-| `sparse_atten_nvfp4_kv_func` | Available. NVFP4 KV compatibility path: dequantizes KV with 128x4 metadata, then calls Blackwell sparse prefill. |
-| `fp4_indexer_block_scores` | Available. Correctness-first FP4 block-score fallback returning the official `[Hq, ceil(max_seqlen_k/128), total_q]` score layout. |
 ### FlashRT Blackwell helper names
@@ -59,6 +59,7 @@ path:
 - `flash_decode_with_topk_idx`
 - `flash_decode_with_gqa_share_sparse`
 - `native_topk_from_scores`
 - `has_native_ops`
 - `naive_flash_decode_with_topk_idx`
 - `naive_flash_decode_with_gqa_share_sparse`
@@ -206,14 +207,17 @@ out = msa.flash_decode_with_gqa_share_sparse(
 This package contains:
 - native CUDA score-to-top-k helper;
-- Blackwell-validated Triton CUDA sparse decode and prefill attention;
 - MiniMaxAI/msa-compatible Python API layer for decode, prefill, CSR, NVFP4,
   and FP4 block-score helpers.
-The optimized SM100 CUTE prefill/indexer bodies are not claimed as ported here.
-For Blackwell, this package provides a validated Triton sparse prefill path and
-correctness-first compatibility fallbacks where the original API requires SM100
-FP4/NVFP4-specific machinery.
 Source provenance and validation details are documented in `SYNC.md` and
 `VALIDATION.md`.

 |---|---|
 | `sparse_decode_atten_func` | Available. Blackwell paged BF16/FP16 single-token decode wrapper. |
 | `SparseDecodePagedAttentionWrapper` | Available. `plan(...).run(...)` wrapper for the same decode path. |
+| `build_k2q_csr` | Available. CSR construction helper for the official prefill API. |
 | `SparseK2qCsrBuilderSm100` | Available compatibility class; `build()` delegates to `build_k2q_csr`. |
 | `Nvfp4QuantizedTensor` | Available metadata dataclass. |
 | `quantize_bf16_to_nvfp4_128x4` | Available when Transformer Engine NVFP4 support is installed. |
 | `swizzle_nvfp4_scale_to_128x4` | Available scale-layout helper. |
 | `nvfp4_global_scale_from_amax` | Available scale helper. |
 | `sparse_atten_func` | Available. Official CSR sparse prefill API backed by the Blackwell Triton BF16/FP16 prefill kernel. |
+| `sparse_atten_nvfp4_kv_func` | Available. Built artifacts use native CUDA swizzled NVFP4 -> BF16 dequantization, then call Blackwell sparse prefill. |
+| `fp4_indexer_block_scores` | Available. Built artifacts use the native CUDA Blackwell block-score kernel and return the official `[Hq, ceil(max_seqlen_k/128), total_q]` score layout. |
 ### FlashRT Blackwell helper names
 - `flash_decode_with_topk_idx`
 - `flash_decode_with_gqa_share_sparse`
 - `native_topk_from_scores`
+- `native_nvfp4_dequant_swizzled_to_bf16`
 - `has_native_ops`
 - `naive_flash_decode_with_topk_idx`
 - `naive_flash_decode_with_gqa_share_sparse`
 This package contains:
 - native CUDA score-to-top-k helper;
+- native CUDA tensor-core sparse decode route for the MiniMax-M3 Blackwell shape;
+- native CUDA FP4 block-score indexer;
+- native CUDA swizzled NVFP4 -> BF16 dequantization for the W4A16 quality path;
+- Blackwell-validated sparse prefill attention wrapper;
 - MiniMaxAI/msa-compatible Python API layer for decode, prefill, CSR, NVFP4,
   and FP4 block-score helpers.
+When loaded from Hub built artifacts, the decode, FP4 indexer, and NVFP4
+dequant hot paths use compiled CUDA ops. The source-tree mode keeps reference
+paths so the API and correctness tests remain runnable before a wheel/shared
+object has been built.
 Source provenance and validation details are documented in `SYNC.md` and
 `VALIDATION.md`.

README.md CHANGED

@@ -39,7 +39,7 @@ msa = get_kernel(
 |---|---|
 | `sparse_decode_atten_func` | Available. Blackwell paged BF16/FP16 single-token decode wrapper. |
 | `SparseDecodePagedAttentionWrapper` | Available. `plan(...).run(...)` wrapper for the same decode path. |
-| `build_k2q_csr` | Available. Torch CSR construction fallback. |
 | `SparseK2qCsrBuilderSm100` | Available compatibility class; `build()` delegates to `build_k2q_csr`. |
 | `Nvfp4QuantizedTensor` | Available metadata dataclass. |
 | `quantize_bf16_to_nvfp4_128x4` | Available when Transformer Engine NVFP4 support is installed. |
@@ -48,8 +48,8 @@ msa = get_kernel(
 | `swizzle_nvfp4_scale_to_128x4` | Available scale-layout helper. |
 | `nvfp4_global_scale_from_amax` | Available scale helper. |
 | `sparse_atten_func` | Available. Official CSR sparse prefill API backed by the Blackwell Triton BF16/FP16 prefill kernel. |
-| `sparse_atten_nvfp4_kv_func` | Available. NVFP4 KV compatibility path: dequantizes KV with 128x4 metadata, then calls Blackwell sparse prefill. |
-| `fp4_indexer_block_scores` | Available. Correctness-first FP4 block-score fallback returning the official `[Hq, ceil(max_seqlen_k/128), total_q]` score layout. |
 ### FlashRT Blackwell helper names
@@ -59,6 +59,7 @@ path:
 - `flash_decode_with_topk_idx`
 - `flash_decode_with_gqa_share_sparse`
 - `native_topk_from_scores`
 - `has_native_ops`
 - `naive_flash_decode_with_topk_idx`
 - `naive_flash_decode_with_gqa_share_sparse`
@@ -206,14 +207,17 @@ out = msa.flash_decode_with_gqa_share_sparse(
 This package contains:
 - native CUDA score-to-top-k helper;
-- Blackwell-validated Triton CUDA sparse decode and prefill attention;
 - MiniMaxAI/msa-compatible Python API layer for decode, prefill, CSR, NVFP4,
   and FP4 block-score helpers.
-The optimized SM100 CUTE prefill/indexer bodies are not claimed as ported here.
-For Blackwell, this package provides a validated Triton sparse prefill path and
-correctness-first compatibility fallbacks where the original API requires SM100
-FP4/NVFP4-specific machinery.
 Source provenance and validation details are documented in `SYNC.md` and
 `VALIDATION.md`.

 |---|---|
 | `sparse_decode_atten_func` | Available. Blackwell paged BF16/FP16 single-token decode wrapper. |
 | `SparseDecodePagedAttentionWrapper` | Available. `plan(...).run(...)` wrapper for the same decode path. |
+| `build_k2q_csr` | Available. CSR construction helper for the official prefill API. |
 | `SparseK2qCsrBuilderSm100` | Available compatibility class; `build()` delegates to `build_k2q_csr`. |
 | `Nvfp4QuantizedTensor` | Available metadata dataclass. |
 | `quantize_bf16_to_nvfp4_128x4` | Available when Transformer Engine NVFP4 support is installed. |
 | `swizzle_nvfp4_scale_to_128x4` | Available scale-layout helper. |
 | `nvfp4_global_scale_from_amax` | Available scale helper. |
 | `sparse_atten_func` | Available. Official CSR sparse prefill API backed by the Blackwell Triton BF16/FP16 prefill kernel. |
+| `sparse_atten_nvfp4_kv_func` | Available. Built artifacts use native CUDA swizzled NVFP4 -> BF16 dequantization, then call Blackwell sparse prefill. |
+| `fp4_indexer_block_scores` | Available. Built artifacts use the native CUDA Blackwell block-score kernel and return the official `[Hq, ceil(max_seqlen_k/128), total_q]` score layout. |
 ### FlashRT Blackwell helper names
 - `flash_decode_with_topk_idx`
 - `flash_decode_with_gqa_share_sparse`
 - `native_topk_from_scores`
+- `native_nvfp4_dequant_swizzled_to_bf16`
 - `has_native_ops`
 - `naive_flash_decode_with_topk_idx`
 - `naive_flash_decode_with_gqa_share_sparse`
 This package contains:
 - native CUDA score-to-top-k helper;
+- native CUDA tensor-core sparse decode route for the MiniMax-M3 Blackwell shape;
+- native CUDA FP4 block-score indexer;
+- native CUDA swizzled NVFP4 -> BF16 dequantization for the W4A16 quality path;
+- Blackwell-validated sparse prefill attention wrapper;
 - MiniMaxAI/msa-compatible Python API layer for decode, prefill, CSR, NVFP4,
   and FP4 block-score helpers.
+When loaded from Hub built artifacts, the decode, FP4 indexer, and NVFP4
+dequant hot paths use compiled CUDA ops. The source-tree mode keeps reference
+paths so the API and correctness tests remain runnable before a wheel/shared
+object has been built.
 Source provenance and validation details are documented in `SYNC.md` and
 `VALIDATION.md`.

VALIDATION.md ADDED

@@ -0,0 +1,206 @@

+# Validation
+## Target
+- Kernel family: MiniMax M3 sparse attention (MSA)
+- Package: `flashrt/MiniMaxAI-msa-blackwell`
+- HF Jobs package selector: `MiniMaxAI-msa-blackwell`
+- Package version: v1 Blackwell native-helper package
+- Target GPU family: Blackwell CUDA compute capability 12.x
+- Validated GPU: SM121 / GB10 / DGX Spark
+- Dtype: BF16 inputs with FP32 accumulation references
+- Layout: paged KV cache
+- Model path: FlashRT MiniMax-Spark runtime on DGX Spark / GB10
+## Correctness Gate
+Run quick validation:
+```bash
+PYTHONPATH=MiniMaxAI-msa-blackwell/torch-ext \
+  python MiniMaxAI-msa-blackwell/tests/test_msa_blackwell.py --quick
+```
+Run full validation:
+```bash
+PYTHONPATH=MiniMaxAI-msa-blackwell/torch-ext \
+  python MiniMaxAI-msa-blackwell/tests/test_msa_blackwell.py
+```
+Run standalone long-context validation:
+```bash
+PYTHONPATH=MiniMaxAI-msa-blackwell/torch-ext \
+  python MiniMaxAI-msa-blackwell/tests/test_msa_blackwell.py --long-context
+```
+Expected full coverage:
+| Area | Shapes | Reference | Required |
+|---|---:|---|---|
+| API surface | official `MiniMaxAI/msa` public names | `api_status.py` | all official root names exported; no unsupported public root API entries |
+| Native CUDA top-k helper | heads 64, batch 1-2, blocks 1-256 | PyTorch top-k over valid blocks | exact set match |
+| Decode sparse GQA attention | ctx 128, 2048, 4096, 32768 | paged FP32 PyTorch | cos >= 0.999, max_abs <= 5e-2 |
+| Prefill sparse GQA attention | ctx 512, 4096 | paged causal FP32 PyTorch | cos >= 0.999, max_abs <= 5e-2 |
+| Decode sparse GQA attention with sink | ctx 2048, 32768 | paged FP32 PyTorch | cos >= 0.999, max_abs <= 5e-2 |
+| Official decode API wrapper | ctx 2048, 4096 | direct Blackwell decode kernel | cos = 1.0, max_abs = 0 |
+| Official CSR prefill API wrapper | ctx 512, 2048 | direct Blackwell prefill kernel | cos = 1.0, max_abs = 0 under CSR-preserved block order |
+| Official NVFP4 prefill API wrapper | ctx 512 BF16 dispatch path | `sparse_atten_func` | cos = 1.0, max_abs = 0 |
+| Native CUDA NVFP4 dequant | rows/cols `(1,128)`, `(257,128)`, `(64,4096)` | Python NVFP4 reference | exact BF16 match |
+| Official FP4 indexer API | tiny FP4 packed tensors; native artifact path when built | PyTorch block-score reference | returns official score layout |
+| Decode lightning indexer | ctx 2048, 4096, 32768 | PyTorch blockmax top-k set | overlap >= 0.99 |
+| Standalone long-context decode | ctx 65536, 131072 | paged FP32 PyTorch / direct kernel | cos >= 0.999; wrapper max_abs = 0 |
+| Installed-artifact native long top-k | blocks 512, 1024 | PyTorch top-k over valid blocks | exact set match |
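The `cos >= 0.999, max_abs <= 5e-2` gate in the coverage table can be illustrated with a small standalone sketch. This is not the code in `test_msa_blackwell.py`; the function names below are hypothetical, and plain Python lists stand in for flattened FP32 output tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat two all-zero vectors as identical, anything else as a mismatch.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def max_abs_error(a, b):
    """Largest elementwise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def passes_gate(out, ref, min_cos=0.999, max_abs=5e-2):
    """Hypothetical gate: both the cosine and the max-abs bound must hold."""
    if len(out) != len(ref) or not out:
        raise ValueError("out and ref must be non-empty and the same length")
    return (cosine_similarity(out, ref) >= min_cos
            and max_abs_error(out, ref) <= max_abs)
```

Checking both metrics matters because a cosine close to 1.0 alone can hide a single large elementwise error, which the `max_abs` bound catches.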
+API surface validation:
+```bash
+PYTHONPATH=MiniMaxAI-msa-blackwell/torch-ext \
+  python -m pytest MiniMaxAI-msa-blackwell/tests/test_api_surface.py -q
+```
+The test tracks every official `MiniMaxAI/msa` public API name:
+- `sparse_atten_func`
+- `sparse_atten_nvfp4_kv_func`
+- `sparse_decode_atten_func`
+- `SparseDecodePagedAttentionWrapper`
+- `fp4_indexer_block_scores`
+- `build_k2q_csr`
+- `SparseK2qCsrBuilderSm100`
+- `Nvfp4QuantizedTensor`
+- `quantize_bf16_to_nvfp4_128x4`
+- `quantize_kv_bf16_to_nvfp4_128x4`
+- `dequantize_nvfp4_128x4_to_bf16`
+- `swizzle_nvfp4_scale_to_128x4`
+- `nvfp4_global_scale_from_amax`
+The root module exports every official public name. The decode, CSR prefill,
+NVFP4 prefill compatibility, and FP4 block-scoring entry points, along with the
+CSR and NVFP4 helpers, are all callable. Hub built artifacts use compiled CUDA
+ops for the MiniMax-M3 Blackwell decode route, FP4 block-score indexer, and
+swizzled NVFP4 -> BF16 dequantization path. Source-tree mode keeps reference
+paths so the API remains testable before the extension is built.
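A minimal sketch of an export check over the names listed above follows. It is an illustration only, not the contents of `test_api_surface.py` or `api_status.py`; `missing_public_names` is a hypothetical helper, and a `SimpleNamespace` is used in place of the real module so the example runs without the package installed.

```python
from types import SimpleNamespace

# Official public names, copied from the list in VALIDATION.md.
OFFICIAL_NAMES = (
    "sparse_atten_func",
    "sparse_atten_nvfp4_kv_func",
    "sparse_decode_atten_func",
    "SparseDecodePagedAttentionWrapper",
    "fp4_indexer_block_scores",
    "build_k2q_csr",
    "SparseK2qCsrBuilderSm100",
    "Nvfp4QuantizedTensor",
    "quantize_bf16_to_nvfp4_128x4",
    "quantize_kv_bf16_to_nvfp4_128x4",
    "dequantize_nvfp4_128x4_to_bf16",
    "swizzle_nvfp4_scale_to_128x4",
    "nvfp4_global_scale_from_amax",
)


def missing_public_names(module, names=OFFICIAL_NAMES):
    """Return the names that the given module object does not export."""
    return [name for name in names if not hasattr(module, name)]


# Stand-in module that exports everything except one helper.
fake_module = SimpleNamespace(
    **{name: object() for name in OFFICIAL_NAMES if name != "build_k2q_csr"}
)
print(missing_public_names(fake_module))
```

With the real package, the same helper could be pointed at the object returned by `get_kernel(...)`, and an empty result would correspond to the "all official root names exported" requirement.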
+## FlashRT Integration Note
+FlashRT has validated the decode sparse path on SM121 over context lengths
+128 to 32768 with cosine similarity >= 0.999. The 32768 context length has
+also been exercised in the FlashRT MiniMax-Spark model runtime on DGX Spark /
+GB10, so it is the current end-to-end model validation boundary.
+The standalone package kernel tests additionally cover 65536 and 131072
+context lengths. These long-context rows validate the kernel and API wrapper
+contract outside the full model runtime; they should not be described as
+MiniMax-Spark end-to-end model validation until the full runtime path is rerun
+at those lengths.
+The MiniMax-Spark end-to-end validation described above is intentionally kept
+as a FlashRT runtime validation item, while this Hub package exposes the
+standalone kernel API for community use.
+## Native Helper Compile Smoke
+Before publishing through HF Jobs, the native helper was compiled locally as a
+PyTorch extension using the same source files:
+- `torch-ext/torch_binding.cpp`
+- `csrc/msa_topk_from_scores.cu`
+- `csrc/msa_decode_attn.cu`
+- `csrc/msa_decode_attn_mma.cu`
+- `csrc/msa_indexer_block_scores.cu`
+- `csrc/msa_nvfp4_dequant.cu`
+Environment:
+| Field | Value |
+|---|---|
+| GPU | NVIDIA GeForce RTX 5090 |
+| PyTorch | 2.9.1+cu128 |
+| nvcc | CUDA 13.0 |
+| Target arch | sm_120 |
+Result:
+| Check | Shape | Reference | Verdict |
+|---|---:|---|---|
+| Native score -> top-k | heads 64, batch 1, blocks 256, topk 16 | PyTorch top-k set | PASS |
+| Native FP4 block-score indexer | official `[Hq, blocks, total_q]` score layout | PyTorch block-score reference | PASS |
+| Native NVFP4 swizzled -> BF16 dequant | rows/cols `(1,128)`, `(257,128)`, `(64,4096)` | Python NVFP4 reference | PASS |
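For readers unfamiliar with what a "Python NVFP4 reference" has to compute, here is a rough sketch of the value-decoding step only. It is not the package's reference. The E2M1 magnitude table is the standard 4-bit float value set; the nibble order (low nibble first), the way the two scales are multiplied in, and the function names are assumptions made for illustration. The 128x4 scale swizzle and the final rounding to BF16 are deliberately left out.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    if not 0 <= code <= 0xF:
        raise ValueError(f"E2M1 code out of range: {code}")
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def dequantize_block(packed, block_scale, global_scale=1.0):
    """Dequantize packed FP4 bytes that share one block scale.

    Assumptions (not taken from the package): two codes per byte with the
    low nibble first, and value = fp4 * block_scale * global_scale.
    """
    values = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):
            values.append(decode_e2m1(code) * block_scale * global_scale)
    return values
```

Every decoded value here is a small multiple of the scales, which is one reason an exact match against a reference is a plausible requirement for a dequantization kernel, unlike the tolerance-based gates used for attention outputs.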
+## Blackwell Package Validation
+Remote Blackwell validation environment:
+| Field | Value |
+|---|---|
+| Host | `spark-f517` |
+| GPU | NVIDIA GB10 |
+| Compute capability | 12.1 |
+| Driver | 580.159.03 |
+| Python | 3.12.3 |
+| PyTorch | 2.12.0+cu130 |
+| Triton | 3.7.0 |
+Command:
+```bash
+PY=/home/leadtek/jax/bin/python
+PYTHONPATH=MiniMaxAI-msa-blackwell/torch-ext \
+  $PY MiniMaxAI-msa-blackwell/tests/test_msa_blackwell.py
+```
+Result:
+| Check | Shape | Cosine | Max abs / overlap | Verdict |
+|---|---|---:|---:|---|
+| Decode sparse GQA | ctx128_b1 | 0.999998 | 1.6032e-03 | PASS |
+| Decode sparse GQA | ctx2048_b1 | 0.999996 | 4.9090e-04 | PASS |
+| Decode sparse GQA | ctx2048_b2_sink | 0.999996 | 6.8302e-04 | PASS |
+| Decode sparse GQA | ctx4096_b1 | 0.999996 | 4.5899e-04 | PASS |
+| Decode sparse GQA | ctx4096_b2_mixed | 0.999996 | 7.3129e-04 | PASS |
+| Decode sparse GQA | ctx32768_b1 | 0.999996 | 6.9451e-04 | PASS |
+| Decode sparse GQA | ctx32768_b1_sink | 0.999996 | 5.6115e-04 | PASS |
+| Decode sparse GQA | ctx65536_b1 | 0.999996 | 4.3470e-04 | PASS |
+| Decode sparse GQA | ctx131072_b1 | 0.999996 | 7.1825e-04 | PASS |
+| Decode top-k indexer | ctx2048 | n/a | overlap 1.000 | PASS |
+| Decode top-k indexer | ctx4096 | n/a | overlap 1.000 | PASS |
+| Decode top-k indexer | ctx32768 | n/a | overlap 1.000 | PASS |
+| Decode top-k indexer | ctx65536 | n/a | overlap 1.000 | PASS |
+| Decode top-k indexer | ctx131072 | n/a | overlap 1.000 | PASS |
+| Official decode wrapper | ctx2048 | 1.000000 | 0.0000e+00 | PASS |
+| Official decode wrapper | ctx4096 | 1.000000 | 0.0000e+00 | PASS |
+| Official decode wrapper | ctx65536 | 1.000000 | 0.0000e+00 | PASS |
+| Official decode wrapper | ctx131072 | 1.000000 | 0.0000e+00 | PASS |
+| Native CUDA NVFP4 dequant | rows1_cols128 | 1.000000 | 0.0000e+00 | PASS |
+| Native CUDA NVFP4 dequant | rows257_cols128 | 1.000000 | 0.0000e+00 | PASS |
+| Native CUDA NVFP4 dequant | rows64_cols4096 | 1.000000 | 0.0000e+00 | PASS |
+Installed-artifact native top-k validation on RTX 5090 / torch 2.11 / CUDA
+12.8:
+| Context | Blocks | Overlap | Verdict |
+|---:|---:|---:|---|
+| 32768 | 256 | 1.000 | PASS |
+| 65536 | 512 | 1.000 | PASS |
+| 131072 | 1024 | 1.000 | PASS |
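The Context and Blocks columns are consistent with the 128-token block size that appears in the `ceil(max_seqlen_k/128)` score layout. The sketch below shows that arithmetic together with one possible definition of the overlap metric. The overlap formula and both function names are assumptions; the package tests may define overlap differently.

```python
def num_blocks(context_len, block_size=128):
    """Number of KV blocks covering context_len tokens (ceiling division)."""
    if context_len < 0 or block_size <= 0:
        raise ValueError("context_len must be >= 0 and block_size > 0")
    return -(-context_len // block_size)


def topk_overlap(selected, reference):
    """Fraction of reference block indices that also appear in selected.

    Order is ignored, so 1.0 corresponds to an exact set match when both
    inputs contain the same number of distinct indices.
    """
    ref = set(reference)
    if not ref:
        return 1.0
    return len(set(selected) & ref) / len(ref)


for ctx in (32768, 65536, 131072):
    print(ctx, num_blocks(ctx))
```

Comparing sets rather than ordered lists is a deliberate choice in this sketch: a top-k selection feeds a sparse attention kernel that reads the same blocks regardless of the order in which they were ranked.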
+The warning `tl.make_block_ptr is deprecated` appears with Triton 3.7.0. It is
+a deprecation warning, not a correctness failure.
+## Native Alignment Status
+The upstream `MiniMaxAI/msa` package targets SM100. This package targets
+Blackwell compute capability 12.x, keeps the same public API surface where
+practical, and provides native CUDA implementations for the hot paths needed by
+the FlashRT MiniMax-Spark runtime:
+- score-to-top-k sparse block selection;
+- tensor-core sparse decode for the MiniMax-M3 Blackwell shape;
+- FP4 block-score indexing;
+- swizzled NVFP4 -> BF16 dequantization for the W4A16 path.
+The CSR prefill wrapper remains part of the public compatibility surface and is
+validated against the package reference path. Shape and parameter restrictions
+are explicit errors rather than silent wrong results.
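The "explicit errors rather than silent wrong results" policy can be sketched as an up-front argument check. Everything below is hypothetical: the function name, the parameters, and the specific restrictions are illustrative and are not taken from the package, apart from the BF16/FP16 dtype pair that the card mentions for the decode and prefill paths.

```python
SUPPORTED_DTYPES = ("bf16", "fp16")  # the card lists BF16/FP16 for decode and prefill


def check_decode_config(num_q_heads, num_kv_heads, dtype):
    """Raise ValueError for unsupported configurations instead of running.

    Hypothetical restrictions for illustration: the dtype must be one of the
    supported names, and query heads must divide evenly across KV heads, as
    grouped-query attention requires.
    """
    if dtype not in SUPPORTED_DTYPES:
        raise ValueError(
            f"unsupported dtype {dtype!r}; expected one of {SUPPORTED_DTYPES}"
        )
    if num_q_heads <= 0 or num_kv_heads <= 0:
        raise ValueError("head counts must be positive")
    if num_q_heads % num_kv_heads != 0:
        raise ValueError(
            f"num_q_heads={num_q_heads} is not a multiple of "
            f"num_kv_heads={num_kv_heads}"
        )
    return num_q_heads // num_kv_heads  # query heads sharing each KV head
```

Failing before any kernel launch keeps an unsupported shape from producing output that merely looks plausible, which is the failure mode the validation note is guarding against.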