Apply for a GPU community grant: Academic project

#1
by OzTianlu - opened
NoesisLab org

Hugging Face Community Grant Application

Project: Spartacus-1B-Instruct – Causal Monoid Language Model

Applicant: Zixi Li (Independent Researcher, NoesisLab)

1. Project Overview & Core Innovation

I am an independent AI researcher proposing Spartacus-1B-Instruct, a 1.34B-parameter language model that fundamentally replaces standard softmax attention with causal monoid state compression. The architecture entirely eliminates the $O(T)$ KV-cache bottleneck, achieving strictly $O(1)$ inference time per token and $O(1)$ memory per layer, regardless of sequence length.

While sub-quadratic architectures are gaining traction, Spartacus introduces a unique mathematical approach: it models causality dynamically through a vector decay monoid recurrence. The core associative binary operation is defined as:

$$(\alpha, S) \oplus (\beta, X) = (\alpha \cdot \beta,\ \mathrm{diag}(\beta) \cdot S + X)$$

Unlike scalar decay models, our per-dimension vector decay gate ($α_t \in (0,1)^d$) allows different feature dimensions to possess independent memory lifetimes. Fast-decaying dimensions capture local syntax, while slow-decaying ones maintain global entity memory across massive contexts. Furthermore, by using SiLU-activated non-negative keys, we ensure the compressed state matrix remains positive semi-definite (PSD), preventing feature erasure.
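To make the monoid concrete, here is a minimal NumPy sketch of the binary operation defined above, together with an associativity check. This is illustrative code of my own, not taken from the Spartacus codebase; the state shape $(d, d)$ and the random test values are assumptions for the demo.

```python
import numpy as np

def monoid_op(a, b):
    """Binary operation of the causal decay monoid.

    Each element is a pair (alpha, S): a per-dimension decay vector
    alpha in (0,1)^d and a compressed state matrix S of shape (d, d).
    (alpha, S) ⊕ (beta, X) = (alpha * beta, diag(beta) @ S + X),
    where diag(beta) @ S scales the rows of S by beta.
    """
    alpha, S = a
    beta, X = b
    return alpha * beta, beta[:, None] * S + X

# Illustrative associativity check on random elements (d = 4).
rng = np.random.default_rng(0)
d = 4

def rand_elem():
    return rng.uniform(0.1, 0.9, d), rng.standard_normal((d, d))

x, y, z = rand_elem(), rand_elem(), rand_elem()
left = monoid_op(monoid_op(x, y), z)
right = monoid_op(x, monoid_op(y, z))
assert np.allclose(left[0], right[0]) and np.allclose(left[1], right[1])
```

Associativity is what makes the recurrence scannable in parallel at training time while still admitting a purely sequential $O(1)$-per-token update at decode time.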

2. Current Progress & Benchmarks

Spartacus is not just theoretical; it is fully implemented and SFT-trained.

  • Performance: In zero-shot benchmarks, Spartacus-1B achieves highly competitive results against established sub-quadratic baselines, scoring 0.4610 on HellaSwag and 0.5518 on ARC-Easy, matching or exceeding Mamba-1.4B and RWKV-6-1.6B.
  • Hardware Efficiency: I have engineered a custom Triton JIT-compiled parallel prefix scan (monoid_scan_cuda.py) that parallelizes the forward pass across the batch, head, and dimension axes, maintaining $O(T)$ parallel training efficiency.
  • Stress Testing: In our maximum stress tests on an RTX PRO 6000 Blackwell, Spartacus maintains a flat decode latency (8.2 ms/token) and constant memory usage (5100 MB) all the way to a 128K-token context window. In contrast, the standard Transformer baseline hits catastrophic memory spikes and latency degradation beyond 32K.
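The Triton kernel itself is not included in this post. As a reference for the underlying idea, here is a NumPy sketch of my own (not the project's monoid_scan_cuda.py) of a log-depth Hillis–Steele inclusive scan over the monoid, checked against the sequential recurrence; shapes and the combination rule follow the equation above, everything else is an assumption for illustration.

```python
import numpy as np

def sequential_scan(alphas, xs):
    """Reference O(T) sequential scan of the recurrence
    S_t = diag(alpha_t) S_{t-1} + X_t."""
    T, d = alphas.shape
    S = np.zeros((d, d))
    out = np.empty_like(xs)
    for t in range(T):
        S = alphas[t][:, None] * S + xs[t]
        out[t] = S
    return out

def parallel_scan(alphas, xs):
    """Log-depth (Hillis-Steele) inclusive scan exploiting associativity:
    each pass combines every element with the partial product `shift`
    positions to its left."""
    A = alphas.copy()   # running decay products, shape (T, d)
    S = xs.copy()       # running states, shape (T, d, d)
    T = A.shape[0]
    shift = 1
    while shift < T:
        # (A_left, S_left) ⊕ (A_right, S_right)
        #   = (A_left * A_right, diag(A_right) S_left + S_right)
        new_A, new_S = A.copy(), S.copy()
        new_A[shift:] = A[:-shift] * A[shift:]
        new_S[shift:] = A[shift:, :, None] * S[:-shift] + S[shift:]
        A, S = new_A, new_S
        shift *= 2
    return S

rng = np.random.default_rng(7)
T, d = 13, 5  # non-power-of-two length to exercise the general case
alphas = rng.uniform(0.1, 0.99, (T, d))
xs = rng.standard_normal((T, d, d))
assert np.allclose(sequential_scan(alphas, xs), parallel_scan(alphas, xs))
```

A production kernel would fuse these passes and tile over batch, head, and dimension axes, as the Triton implementation described above does.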

3. Why We Need the GPU Grant

I am building an interactive Gradio Space to publicly showcase Spartacus's $O(1)$ generation capabilities at extreme context lengths (100K+ tokens). To do this justice and let the community experience the flat latency firsthand without queuing timeouts, standard CPU or free T4 tiers are insufficient.

I am requesting an A10G or A100 GPU Grant. This compute will be strictly utilized to:

  1. Host the Gradio Space, allowing users to input massive documents and observe the constant-time decoding speed.
  2. Compile and run our fused Triton kernels at peak efficiency for concurrent user requests.
  3. Serve as a highly visible proof-of-concept for the open-source community that efficient, implicit reasoning without KV-cache explosion is viable.
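The constant-memory property the Space will demonstrate can be sketched in a few lines. This is my own toy illustration (the state dimension, decay range, and update magnitudes are arbitrary assumptions, and the post does not specify the structure of the update matrix $X_t$): after 10,000 decoded tokens, the recurrent state is still a single fixed-size matrix, whereas a KV cache would have grown linearly.

```python
import numpy as np

D = 64  # assumed state dimension for the demo

def decode_step(S, alpha, x):
    """One recurrent decode step: S <- diag(alpha) S + X.
    Cost is O(D^2) per token, independent of context length."""
    return alpha[:, None] * S + x

rng = np.random.default_rng(1)
S = np.zeros((D, D))
for _ in range(10_000):  # 10k tokens of simulated context
    alpha = rng.uniform(0.5, 0.999, D)   # per-dimension decay gate
    x = rng.standard_normal((D, D)) * 0.01
    S = decode_step(S, alpha, x)

# The entire per-layer memory is one (64, 64) float64 matrix:
print(S.nbytes)  # -> 32768 bytes, regardless of how many tokens were seen
```

A KV cache at the same point would hold 10,000 key/value pairs per layer and keep growing; here the footprint is fixed, which is exactly what the flat latency and memory curves in the stress test reflect.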

At NoesisLab, we believe the future of reasoning models lies in structural efficiency, not just scaling brute force. Your support will help democratize access to this new paradigm of infinite-context generation. Thank you for your consideration.
