TMT v3: 30.2% PPL reduction at 48% compute — full paper, code & live demo released
Discussion #1, opened by vigneshwar234
TemporalMesh Transformer v3 is now fully public 🎉
tl;dr: TMT is a new transformer architecture that reaches 29.4 perplexity (PPL) on WikiText-2, versus 42.1 for the vanilla baseline (−30.2%), at 48% of the baseline's compute and without giving up pairwise attention.
What makes TMT different from every other efficient transformer?
Each of these prior approaches fixes one problem and leaves another open:
- Longformer/BigBird → sparse attention, but static topology
- Mamba/RWKV → linear time, but no pairwise attention
- MoE → high capacity, but uniform depth per token
TMT addresses all three at once, using five components:
| Innovation | What it does | Cost |
|---|---|---|
| Mesh Attention | Dynamic $k$NN graph rebuilt per-layer from cosine similarity | $O(S \cdot k)$ vs $O(S^2)$ |
| Temporal Decay Encoding | Learned multiplicative scalar attenuates semantically distant tokens post-softmax | ~0% overhead |
| Adaptive Depth Routing | Per-token exit gate: punctuation exits at layer 2, rare tokens at layer 12 | −52% avg compute |
| Dual-Stream FFN | Parallel syntax + semantic streams with sigmoid fusion gate | Same FLOPs as standard FFN |
| EMA Memory Anchors | 16 persistent fast-weight vectors, cross-sequence recall without recurrence | 32KB extra params |
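To make the Mesh Attention row concrete, here is a minimal, framework-free sketch of attention restricted to a cosine-similarity $k$NN graph. It is an illustration under stated assumptions, not the TMT implementation: the function names `knn_graph` and `mesh_attention` are hypothetical, each token is assumed to keep itself as a neighbor, and learned projections, multiple heads, causal masking, and the temporal decay term are all left out. The graph build below compares every pair of tokens, so only the attention step is $O(S \cdot k)$ in this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def knn_graph(hidden, k):
    """Return, for each token, itself plus its k-1 most cosine-similar tokens.

    Assumption: self-edges are always kept. This brute-force build is O(S^2).
    """
    neighbors = []
    for i, h_i in enumerate(hidden):
        others = [j for j in range(len(hidden)) if j != i]
        others.sort(key=lambda j: cosine(h_i, hidden[j]), reverse=True)
        neighbors.append([i] + others[:k - 1])
    return neighbors


def mesh_attention(queries, keys, values, neighbors):
    """Scaled dot-product attention where token i only attends to neighbors[i]."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for i, nbrs in enumerate(neighbors):
        scores = [sum(a * b for a, b in zip(queries[i], keys[j])) / scale
                  for j in nbrs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * values[j][c] for w, j in zip(weights, nbrs))
                        for c in range(len(values[0]))])
    return outputs


# Toy hidden states: tokens 0 and 1 point one way, tokens 2 and 3 another.
hidden = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = knn_graph(hidden, k=2)
mixed = mesh_attention(hidden, hidden, hidden, graph)
print(graph)  # [[0, 1], [1, 0], [2, 3], [3, 2]]
```

Rebuilding the graph per layer, as the table describes, would amount to calling `knn_graph` again on each layer's hidden states before the attention step.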
Numbers
| Benchmark | Vanilla Transformer | Mamba | TMT |
|---|---|---|---|
| WikiText-2 PPL ↓ | 42.1 | 31.8 | 29.4 |
| WikiText-103 PPL ↓ | 51.3 | 38.4 | 36.1 |
| LongBench ↑ | 41.2 | 51.3 | 53.4 |
| C4 PPL ↓ | 38.4 | 30.1 | 27.4 |
| Throughput (tokens/s, A100, FP16) ↑ | 94K | 148K | 138K |
| VRAM at sequence length S=4096 ↓ | OOM | 12GB | 18GB |
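The headline percentage can be rechecked directly from the perplexity rows. The short script below copies the numbers from the table and computes TMT's relative reduction against both other columns; the helper name `reduction_pct` is just for this example.

```python
# Perplexities copied from the table above (lower is better).
ppl = {
    "WikiText-2": {"vanilla": 42.1, "mamba": 31.8, "tmt": 29.4},
    "WikiText-103": {"vanilla": 51.3, "mamba": 38.4, "tmt": 36.1},
    "C4": {"vanilla": 38.4, "mamba": 30.1, "tmt": 27.4},
}


def reduction_pct(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


for name, row in ppl.items():
    vs_vanilla = reduction_pct(row["vanilla"], row["tmt"])
    vs_mamba = reduction_pct(row["mamba"], row["tmt"])
    print(f"{name}: {vs_vanilla:.1f}% vs vanilla, {vs_mamba:.1f}% vs Mamba")

# WikiText-2: 30.2% vs vanilla, 7.5% vs Mamba
# WikiText-103: 29.6% vs vanilla, 6.0% vs Mamba
# C4: 28.6% vs vanilla, 9.0% vs Mamba
```

The 30.2% figure in the title is the WikiText-2 reduction against the vanilla column; the margin over Mamba is smaller, between 6% and 9% on these three datasets.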
Resources
- 📄 Paper (Zenodo v3): https://zenodo.org/records/20287390 · DOI: 10.5281/zenodo.20287197
- 💻 Code + 226 tests: https://github.com/vignesh2027/TemporalMesh-Transformer
- 🚀 Live demo: https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo
- 📊 Benchmarks dataset: https://huggingface.co/datasets/vigneshwar234/TMT-Benchmarks
Quick start
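```bash
pip install temporalmesh-transformer
```

```python
import torch

from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel

# 12-layer configuration; 50257 is the GPT-2 vocabulary size
config = TMTConfig(vocab_size=50257, d_model=512, n_heads=8, n_layers=12)
model = TMTModel(config)

# One random sequence of 256 token IDs, shape (batch=1, seq_len=256)
input_ids = torch.randint(0, 50257, (1, 256))
out = model(input_ids)
# out.logits, out.exit_masks, out.graph_edges, out.confidences
```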
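For intuition on how per-token exits translate into the "−52% avg compute" figure, here is a rough sketch that turns a list of exit layers into a relative compute fraction. Several things are assumptions here: the helper `relative_compute` is hypothetical, the input is one 1-based exit layer per token (the actual layout of `out.exit_masks` may differ), compute is taken to be proportional to the number of layers a token passes through, and the exit values are invented for illustration rather than measured from the model.

```python
def relative_compute(exit_layers, n_layers):
    """Fraction of full-depth layer evaluations actually used.

    exit_layers: one 1-based exit layer per token (hypothetical format).
    Assumes cost is proportional to layers executed; gate and graph
    construction overheads are ignored.
    """
    if not exit_layers:
        raise ValueError("need at least one token")
    if any(not 1 <= e <= n_layers for e in exit_layers):
        raise ValueError("exit layers must lie in 1..n_layers")
    return sum(exit_layers) / (n_layers * len(exit_layers))


# Invented exits for an 8-token sequence in a 12-layer model:
# easy tokens leave early, hard tokens run to the last layers.
exits = [2, 2, 4, 4, 6, 6, 10, 12]
print(f"{relative_compute(exits, 12):.2f}")  # 0.48
```

With this made-up mix, 46 of a possible 96 layer evaluations are run, which is the kind of distribution that would average out to roughly 48% relative compute.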
Happy to answer questions about the architecture, training setup, or ablations. All results are reproducible with the provided training scripts and 3 fixed seeds.
— Vigneshwar LK