TemporalMesh Transformer: dynamic kNN graph attention + adaptive exit gates, 29.4 PPL at 48% compute

by vigneshwar234 - opened Jun 7

New open-source transformer architecture — directly relevant to this repo

TMT (TemporalMesh Transformer) reports 29.4 perplexity (PPL) on WikiText-2 at 48% compute (−30.2% vs. vanilla, 120M params). It may interest users comparing efficient-attention and depth-adaptive architectures.

Five innovations: Mesh Attention (O(S·k) dynamic kNN), Temporal Decay (post-softmax multiplicative), Adaptive Exit Gate (per-token depth routing, avg 5.76/12 layers), Dual-Stream FFN, EMA Memory Anchors.
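For readers unfamiliar with the first two ideas, here is a minimal pure-Python sketch of kNN-restricted attention followed by a post-softmax multiplicative decay. It is not taken from the TMT code. The function name `knn_decay_attention`, the dot-product neighbour selection, the exponential decay form `exp(-decay * |i - j|)`, and the renormalisation step are all assumptions made for illustration. The toy also scores every key before picking neighbours, so it does not itself reach the O(S·k) cost quoted above.

```python
import math

def knn_decay_attention(queries, keys, values, k=2, decay=0.1):
    """Toy single-head attention: each query attends only to its k best-scoring keys.

    Hypothetical sketch, not the TMT implementation.
    Returns (outputs, weights), where weights[i] maps key index -> attention weight.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_v = len(values[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Scaled dot-product score against every key (full scan, for clarity only).
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
        # Dynamic kNN: keep the k highest-scoring positions for this query.
        neighbours = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        # Softmax over the selected neighbours only.
        top = max(scores[j] for j in neighbours)
        exps = {j: math.exp(scores[j] - top) for j in neighbours}
        norm = sum(exps.values())
        weights = {j: e / norm for j, e in exps.items()}
        # Post-softmax multiplicative decay by positional distance (assumed form).
        weights = {j: w * math.exp(-decay * abs(i - j)) for j, w in weights.items()}
        # Renormalise so the weights sum to 1 again (an assumption, not confirmed by the post).
        total = sum(weights.values())
        weights = {j: w / total for j, w in weights.items()}
        # Weighted sum of the selected value vectors.
        outputs.append([sum(weights[j] * values[j][d] for j in neighbours) for d in range(dim_v)])
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
outputs, weights = knn_decay_attention(tokens, tokens, tokens, k=2, decay=0.1)
for i, w in enumerate(weights):
    print(i, sorted(w.items()))
```

Each query ends up with exactly `k` non-zero weights, which is where the per-token cost saving would come from once the neighbour search itself is made sub-quadratic.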

vs. models in this category:

Beats Mamba: 29.4 vs 31.8 PPL, same 120M params
Beats Longformer: 29.4 vs 39.6 PPL, same compute class
LongBench: 53.4 vs 51.3 for Mamba
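The figures quoted in this post can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers stated above. The helper `relative_reduction` is a hypothetical name, and reading the average exit depth as a compute fraction assumes cost scales linearly with layers executed, which the post does not state.

```python
def relative_reduction(baseline, new):
    """Fractional reduction of `new` relative to `baseline` (lower perplexity is better)."""
    return (baseline - new) / baseline

tmt_ppl = 29.4
gap_vs_mamba = relative_reduction(31.8, tmt_ppl)
gap_vs_longformer = relative_reduction(39.6, tmt_ppl)

# Average exit depth of 5.76 out of 12 layers, read as a fraction of full depth.
depth_fraction = 5.76 / 12

print(f"PPL reduction vs Mamba:      {gap_vs_mamba:.1%}")       # 7.5%
print(f"PPL reduction vs Longformer: {gap_vs_longformer:.1%}")  # 25.8%
print(f"Average depth fraction:      {depth_fraction:.1%}")     # 48.0%
```

Under that linear-cost assumption, the 5.76/12 average depth lines up with the 48% compute figure in the headline.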

📄 Paper (DOI 10.5281/zenodo.20287197): https://zenodo.org/records/20287390
💻 Code + 226 tests: https://github.com/vignesh2027/TemporalMesh-Transformer
🎮 Live demo: https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo
