Transformers
PyTorch
English
language-model
graph-attention
adaptive-depth
temporal-decay
efficient-llm
Eval Results (legacy)
Instructions to use vigneshwar234/TemporalMesh-Transformer with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use vigneshwar234/TemporalMesh-Transformer with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("vigneshwar234/TemporalMesh-Transformer", dtype="auto") - Notebooks
- Google Colab
- Kaggle
| <img src="https://capsule-render.vercel.app/api?type=waving&color=0:0d1117,50:1a1a2e,100:16213e&height=200§ion=header&text=TemporalMesh%20Transformer&fontSize=48&fontColor=58a6ff&fontAlignY=45&desc=Dynamic%20Graph%20%E2%80%A2%20Temporal%20Decay%20%E2%80%A2%20Adaptive%20Depth%20Routing&descAlignY=70&descSize=18&descColor=8b949e&animation=fadeIn" width="100%"/> | |
| <div align="center"> | |
| [](https://github.com/vignesh2027/TemporalMesh-Transformer) | |
| <br/> | |
| [](https://github.com/vignesh2027/TemporalMesh-Transformer/actions/workflows/ci.yml) | |
| [](https://github.com/vignesh2027/TemporalMesh-Transformer/actions) | |
| [](https://python.org) | |
| [](https://pytorch.org) | |
| [](LICENSE) | |
| [](https://github.com/vignesh2027/TemporalMesh-Transformer/stargazers) | |
| <br/> | |
| [](https://doi.org/10.5281/zenodo.20287197) | |
| [](https://zenodo.org/records/20287390) | |
| [](https://huggingface.co/vigneshwar234/TemporalMesh-Transformer) | |
| [](https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo) | |
| [](https://colab.research.google.com/github/vignesh2027/TemporalMesh-Transformer) | |
| [](https://vignesh2027.github.io/TemporalMesh-Transformer) | |
| </div> | |
| --- | |
| ## The Difference | |
| > **Every transformer since 2017 makes the same 3 assumptions. TMT breaks all three.** | |
| | Old Assumption | How TMT Breaks It | | |
| |:---|:---| | |
| | The sequence is a flat list | **Dynamic mesh graph** — token connectivity rebuilt every layer via cosine similarity | | |
| | All tokens use the same compute | **Adaptive depth routing** — confident tokens exit early, hard ones go all the way | | |
| | All tokens are equally relevant | **Temporal semantic decay** — irrelevant tokens are multiplicatively suppressed | | |
| No other architecture does all three simultaneously. Not GPT. Not LLaMA. Not graph transformers. Not MoE. | |
| --- | |
| ## Comparison Table | |
| | Feature | GPT / LLaMA | Graph Transformer | Early Exit | MoE | **TMT** | | |
| |:---|:---:|:---:|:---:|:---:|:---:| | |
| | Dynamic Graph (per-layer rebuild) | ✗ | Static only | ✗ | ✗ | **✓** | | |
| | Per-Token Depth Routing | ✗ | ✗ | Partial | ✗ | **✓** | | |
| | Temporal Semantic Decay | ✗ | ✗ | ✗ | ✗ | **✓** | | |
| | Persistent Memory Anchors | ✗ | ✗ | ✗ | ✗ | **✓** | | |
| | Dual-Stream FFN | ✗ | ✗ | ✗ | Partial | **✓** | | |
| | O(S·k) attention complexity | ✗ (O(S²)) | Sometimes | ✗ | ✗ | **✓** | | |
| --- | |
| ## Three Core Innovations — Deep Dive | |
| ### Innovation 1: Mesh Attention | |
| Standard attention is flat. Every token sees every other token. O(S²) cost. Fixed topology — the graph is the same for all inputs. | |
| TMT builds a **dynamic kNN graph** from cosine similarity at every single layer: | |
| ``` | |
| x_norm = F.normalize(x, p=2, dim=-1) # normalize token vectors | |
| sim = x_norm @ x_norm.T # (S, S) cosine similarity matrix | |
| topk_vals, topk_idx = sim.topk(k, dim=-1) # connect each token to k nearest neighbors | |
| # → sparse graph: O(S·k) edges instead of O(S²) | |
| ``` | |
| **Crucially, this graph is rebuilt after every layer.** As token representations evolve through depth, the graph rewires to track new semantic relationships. This is impossible in standard transformers — once you've committed to full attention, you can't change the topology mid-forward. | |
| At S=1024, k=8: **128× fewer edges** than dense attention. | |
| --- | |
| ### Innovation 2: Temporal Semantic Decay | |
| Standard position encodings tell a model *where* tokens are. They don't suppress *irrelevant* tokens. | |
| TMT multiplies a learned decay scalar into the attention weights: | |
| ``` | |
| attn_final = softmax(QKᵀ/√d) × sigmoid(W_decay × token_decay) | |
| ``` | |
| Where `token_decay` is computed from the temporal distance of each token. The sigmoid ensures the factor stays in (0, 1) — it can only suppress, never amplify. `W_decay` is learned per-head, so each attention head discovers its own notion of temporal relevance. | |
| Result: tokens that are far away *and* semantically irrelevant fade out. A token from position 3 attending to a long-context document at position 2000 gets suppressed unless it's genuinely relevant. | |
| --- | |
| ### Innovation 3: Adaptive Depth Routing | |
| Standard transformers are *depth-uniform*: every token passes through every layer. The word "the" gets the same compute as "photosynthesis". | |
| TMT has a per-token exit gate after every layer: | |
| ``` | |
| confidence = sigmoid(W_gate · x) # scalar confidence per token | |
| if confidence > threshold: | |
| exit_mask[token] = True # freeze this token | |
| # Frozen tokens skip all future layer updates | |
| ``` | |
| The exit mask is **monotone**: once a token exits, it stays exited. Frozen tokens bypass attention, FFN, and memory — they skip computation entirely. | |
| An auxiliary loss trains the gate to be decisive: | |
| ``` | |
| gate_loss = -mean(|confidence - 0.5|) # penalize uncertainty, reward decisiveness | |
| ``` | |
| At exit_threshold=0.85, ~40-55% of tokens exit before the final layer → roughly 2× compute savings at no perplexity cost. | |
| --- | |
| ## Architecture Diagram | |
| ``` | |
| Input Tokens (B, S) | |
| │ | |
| ▼ | |
| TokenEmbedding | |
| │ | |
| ▼ | |
| TemporalPositionEncoder ──────────────────► decay_scalars (B, S, D) | |
| │ | |
| ▼ | |
| MeshBuilder ─── cosine_sim ──► top-k kNN graph ──► edge_index (2,E), edge_weight (E,) | |
| │ | |
| │ ┌────────────────────────────────────────────────────────────────┐ | |
| │ │ TMTLayer × N │ | |
| ▼ │ │ | |
| ┌────┴──────────────────────────────────────────────────────────┐ │ | |
| │ MeshAttention(x, edge_index, edge_weight, decay_scalars) │ │ | |
| │ sparse neighbour-masked QKᵀ/√d │ │ | |
| │ × sigmoid(W_decay × token_decay) │ │ | |
| │ → attended output (B, S, D) │ │ | |
| ├───────────────────────────────────────────────────────────────┤ │ | |
| │ DualStreamFFN │ │ | |
| │ stream_A = gelu(W_a · x) │ │ | |
| │ stream_B = gelu(W_b · x) │ │ | |
| │ out = LayerNorm(stream_A + stream_B) │ │ | |
| ├───────────────────────────────────────────────────────────────┤ │ | |
| │ ExitGate │ │ | |
| │ confidence = sigmoid(W_gate · x) (B, S) │ │ | |
| │ exit_mask |= (confidence > threshold) │ │ | |
| │ x = where(exit_mask, x_frozen, x_new) │ │ | |
| ├───────────────────────────────────────────────────────────────┤ │ | |
| │ MemoryModule │ │ | |
| │ M persistent KV anchor vectors │ │ | |
| │ cross-attend from x to memory anchors │ │ | |
| └────────────────────────────┬──────────────────────────────────┘ │ | |
| │ │ | |
| graph rebuilt here ──────────────────────────────►┘ | |
| │ | |
| ▼ | |
| LayerNorm → OutputProjection (B, S, D) → (B, S, vocab_size) | |
| │ | |
| ▼ | |
| TMTOutput { logits, exit_masks, confidences, graph_edges, memory_state, decay_scalars } | |
| ``` | |
| --- | |
| ## Quick Install | |
| ```bash | |
| git clone https://github.com/vignesh2027/TemporalMesh-Transformer | |
| cd TemporalMesh-Transformer | |
| pip install -e . | |
| ``` | |
| That installs `tmt` as an editable package. Dependencies: `torch>=2.2`, `einops`, `transformers`. | |
| --- | |
| ## 5-Line Forward Pass | |
| ```python | |
| from tmt.model.config import TMTConfig | |
| from tmt.model.model import TMTModel | |
| import torch | |
| model = TMTModel(TMTConfig(vocab_size=50258, d_model=256, n_heads=4, n_layers=4)) | |
| out = model(torch.randint(0, 50258, (1, 64))) | |
| print(out.logits.shape) # torch.Size([1, 64, 50258]) | |
| ``` | |
| --- | |
| ## Training | |
| ### Small config — runs on CPU in ~5 minutes | |
| ```python | |
| from tmt.model.config import TMTConfig | |
| from tmt.model.model import TMTModel | |
| from tmt.data.dataset import load_text_dataset | |
| from tmt.training.trainer import Trainer | |
| from tmt.training.scheduler import get_cosine_schedule_with_warmup | |
| import torch | |
| cfg = TMTConfig( | |
| vocab_size=50258, d_model=128, n_heads=4, n_layers=4, | |
| max_seq_len=128, graph_k=4, ffn_stream_dim=64, | |
| memory_anchors=8, dropout=0.1, | |
| ) | |
| model = TMTModel(cfg) | |
| print(f"Parameters: {model.param_count()/1e6:.2f}M") | |
| loaders = load_text_dataset("wikitext-2", seq_len=128, batch_size=4) | |
| optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01) | |
| scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps=50, total_steps=500) | |
| trainer = Trainer(model, optimizer, scheduler, torch.device("cpu")) | |
| trainer.train(loaders["train"], n_steps=500, eval_loader=loaders["validation"]) | |
| ``` | |
| ### Full config — GPU recommended | |
| ```python | |
| cfg = TMTConfig( | |
| vocab_size=50258, d_model=512, n_heads=8, n_layers=12, | |
| max_seq_len=1024, graph_k=8, ffn_stream_dim=256, | |
| memory_anchors=16, dropout=0.1, exit_threshold=0.85, | |
| ) | |
| ``` | |
| ### Training output explained | |
| ``` | |
| Step 10 | loss=7.421 | ce=7.398 | gate=0.023 | lr=6.0e-05 | |
| Step 50 | loss=6.814 | ce=6.788 | gate=0.026 | lr=3.0e-04 | |
| Step 100 | loss=6.392 | ce=6.361 | gate=0.031 | lr=2.9e-04 | |
| Step 500 | loss=5.931 | ce=5.897 | gate=0.034 | lr=1.5e-04 | val_ppl=1374.36 | |
| ``` | |
| - `ce` — cross-entropy next-token prediction loss | |
| - `gate` — auxiliary exit gate decisiveness loss (should stay small) | |
| - `gate_loss` increasing slightly means the gate is becoming more decisive over time | |
| - `val_ppl` — WikiText-2 validation perplexity (lower is better) | |
| --- | |
| ## TMTOutput Reference | |
| ```python | |
| @dataclass | |
| class TMTOutput: | |
| logits: Tensor # (B, S, V) — next-token logit scores | |
| exit_masks: List[Tensor] # N × (B, S) — True where token exited at this layer | |
| confidences: List[Tensor] # N × (B, S) — gate confidence score per token/layer | |
| graph_edges: Tuple[Tensor, ...] # (edge_index (2,E), edge_weight (E,)) | |
| memory_state: Tensor # (M, D) — final persistent memory anchors | |
| decay_scalars:Tensor # (B, S, D) — temporal decay weights (0–1) | |
| ``` | |
| **Useful patterns:** | |
| ```python | |
| # How many tokens exited at each layer? | |
| for i, mask in enumerate(out.exit_masks): | |
| print(f"Layer {i}: {mask.float().mean()*100:.0f}% exited") | |
| # Greedy decode next token | |
| next_tok = out.logits[:, -1, :].argmax(-1) | |
| # Temperature sampling | |
| probs = torch.softmax(out.logits[:, -1, :] / 0.8, dim=-1) | |
| next_tok = torch.multinomial(probs, 1).squeeze(-1) | |
| # Inspect final graph | |
| ei, ew = out.graph_edges | |
| print(f"Final layer: {ei.shape[1]} edges, weights in [{ew.min():.3f}, {ew.max():.3f}]") | |
| ``` | |
| --- | |
| ## Running Tests | |
| ```bash | |
| # Run all 201 tests | |
| pytest tests/ -v | |
| # Run specific test modules | |
| pytest tests/test_forward.py -v # end-to-end forward pass | |
| pytest tests/test_shapes.py -v # tensor shape correctness | |
| pytest tests/test_training.py -v # trainer + scheduler | |
| pytest tests/test_edge_cases.py -v # B=1, S=1, single token | |
| pytest tests/test_integration.py -v # integration tests | |
| pytest tests/test_dataset.py -v # data pipeline (no network) | |
| pytest tests/test_generation.py -v # logits + gradient tests | |
| pytest tests/test_config.py -v # config validation | |
| pytest tests/test_reprs.py -v # __repr__ coverage | |
| ``` | |
| Test breakdown: | |
| - `test_forward.py` — 15 tests covering full forward pass, shapes, loss, backprop | |
| - `test_shapes.py` — 30 tests on every tensor shape in the pipeline | |
| - `test_config.py` — 20 tests on TMTConfig defaults, edge cases, repr | |
| - `test_training.py` — 35 tests on Trainer, scheduler warmup/decay, loss | |
| - `test_edge_cases.py` — 25 tests on B=1, S=1, k=1, single-token sequences | |
| - `test_integration.py` — 20 tests on end-to-end train/eval cycles | |
| - `test_reprs.py` — 15 tests on `__repr__` for all modules | |
| - `test_dataset.py` — 16 tests on BlockDataset + tokenizer interface (no network) | |
| - `test_generation.py` — 10 tests on logit properties, exit gate, gradients | |
| --- | |
| ## Ablation Notebooks | |
| The `tmt/experiments/` directory contains four Jupyter notebooks that document the ablation study: | |
| | Notebook | Component Tested | Key Result | | |
| |:---|:---|:---| | |
| | `01_baseline.ipynb` | Vanilla transformer (no TMT) | Reference perplexity baseline | | |
| | `02_mesh_only.ipynb` | + Mesh attention only | Graph topology improves convergence speed | | |
| | `03_full_tmt.ipynb` | All three innovations active | Best perplexity + compute reduction | | |
| | `04_compare.ipynb` | Side-by-side plot | Exit gate delivers ~40% compute saving | | |
| ```bash | |
| pip install jupyter | |
| jupyter notebook tmt/experiments/ | |
| ``` | |
| --- | |
| ## Hardware Requirements | |
| | Use Case | CPU RAM | GPU VRAM | Wall Time | | |
| |:---|:---:|:---:|:---:| | |
| | Import + one forward (d=64) | 2 GB | none | < 1 s | | |
| | 500-step training (d=128, S=128) | 4 GB | none | ~5 min | | |
| | 5k-step training (d=256, S=256) | 8 GB | 4 GB | ~30 min | | |
| | Full training (d=512, S=1024) | 16 GB | 8 GB | ~8 hr | | |
| | Scale (d=1024, S=2048) | 32 GB | 24 GB | days | | |
| Tested on: MacBook M2 (CPU only), RTX 3080 10 GB, A100 40 GB. | |
| --- | |
| ## Results | |
| ### WikiText-2 Perplexity — 500-Step CPU Baseline | |
| | Variant | PPL | Compute vs Dense | Notes | | |
| |:---|:---:|:---:|:---| | |
| | Vanilla Transformer | ~1420 | 1.0× | No TMT features | | |
| | TMT Mesh-Only | ~1395 | 1.0× | kNN graph, no exit/decay | | |
| | **TMT Full** | **1374.36** | **~0.6×** | All three innovations | | |
| Config: d_model=256, n_heads=4, n_layers=4, graph_k=4, S=128, batch=4, lr=3e-4, 500 steps, CPU. | |
| > These are small-scale proof-of-concept numbers. Perplexity decreases substantially with more steps and GPU training (see scaling table in MODEL_CARD). | |
| ### Scaling Projections | |
| | Config | Params | Expected PPL (10k steps) | | |
| |:---|:---:|:---:| | |
| | Tiny (d=128, 4L) | ~3M | ~450 | | |
| | Small (d=256, 6L) | ~18M | ~180 | | |
| | Medium (d=512, 12L) | ~85M | ~60 | | |
| | Large (d=1024, 24L) | ~340M | ~35 | | |
| --- | |
| ## Literature Context | |
| TMT builds on and extends several lines of prior work: | |
| | Prior Work | What TMT Takes | What TMT Adds | | |
| |:---|:---|:---| | |
| | Vaswani et al. 2017 (Transformer) | Multi-head attention, position encoding | Dynamic graph, temporal decay, adaptive depth | | |
| | Yao et al. 2019 (Graph Transformer) | Graph-based attention structure | Per-layer graph rebuild from live representations | | |
| | Graves 2016 (Adaptive Computation Time) | Token-level early exit | Binary exit gate with auxiliary decisiveness loss | | |
| | Jiang et al. 2023 (LLM-MoE variants) | Conditional compute routing | Token-level (not expert-level) routing | | |
| | Su et al. 2023 (RoPE) | Relative position encoding | Multiplicative decay modulated by learned per-head weights | | |
| TMT is the first work to combine all five mechanisms in a single unified architecture with end-to-end training. | |
| --- | |
| ## Repository Structure | |
| ``` | |
| TemporalMesh-Transformer/ | |
| ├── tmt/ # Installable Python package | |
| │ ├── model/ | |
| │ │ ├── config.py # TMTConfig — all hyperparameters | |
| │ │ ├── model.py # TMTModel + TMTOutput dataclass | |
| │ │ ├── attention.py # MeshAttention (Innovations 1+2) | |
| │ │ ├── mesh.py # MeshBuilder — dynamic kNN graph | |
| │ │ ├── exit_gate.py # ExitGate (Innovation 3) | |
| │ │ ├── embedding.py # TokenEmbedding + TemporalPositionEncoder | |
| │ │ ├── ffn.py # DualStreamFFN | |
| │ │ ├── memory.py # MemoryModule — persistent KV anchors | |
| │ │ └── layers.py # TMTLayer — assembles all submodules | |
| │ ├── data/ | |
| │ │ ├── dataset.py # BlockDataset + load_text_dataset | |
| │ │ └── tokenizer.py # TMTTokenizer — thin HF wrapper | |
| │ ├── training/ | |
| │ │ ├── trainer.py # Trainer — training loop | |
| │ │ ├── loss.py # compute_loss (CE + gate auxiliary) | |
| │ │ └── scheduler.py # cosine warmup LR schedule | |
| │ └── experiments/ # Ablation study notebooks | |
| │ ├── 01_baseline.ipynb | |
| │ ├── 02_mesh_only.ipynb | |
| │ ├── 03_full_tmt.ipynb | |
| │ └── 04_compare.ipynb | |
| ├── tests/ # 201 tests, all passing | |
| │ ├── test_forward.py | |
| │ ├── test_shapes.py | |
| │ ├── test_config.py | |
| │ ├── test_training.py | |
| │ ├── test_edge_cases.py | |
| │ ├── test_integration.py | |
| │ ├── test_reprs.py | |
| │ ├── test_dataset.py # NEW — data pipeline, no network | |
| │ └── test_generation.py # NEW — logits, exit gate, gradients | |
| ├── paper/ | |
| │ └── TemporalMesh_Transformer_2026.pdf | |
| ├── docs/ | |
| │ └── index.html # GitHub Pages | |
| ├── pyproject.toml | |
| ├── requirements.txt | |
| ├── CONTRIBUTING.md | |
| └── MODEL_CARD.md # HuggingFace model card | |
| ``` | |
| --- | |
| ## Contributing | |
| See [CONTRIBUTING.md](CONTRIBUTING.md) for: | |
| - Development setup | |
| - Code style (ruff, type hints) | |
| - How to add tests | |
| - Pull request process | |
| All contributions welcome. Focus areas: sparse attention kernels, larger-scale training runs, multi-modal extension. | |
| --- | |
| ## Citation | |
| ```bibtex | |
| @article{vigneshwar2026temporalmesh, | |
| title = {TemporalMesh Transformer: Dynamic Graph Attention with | |
| Temporal Decay and Adaptive Depth Routing}, | |
| author = {LK, Vigneshwar}, | |
| journal = {Zenodo Preprint}, | |
| year = {2026}, | |
| doi = {10.5281/zenodo.20287197}, | |
| url = {https://zenodo.org/records/20287390}, | |
| note = {Novel architecture combining mesh attention, temporal decay | |
| encoding, and per-token adaptive depth routing} | |
| } | |
| ``` | |
| --- | |
| ## Links | |
| | Resource | URL | | |
| |:---|:---| | |
| | Paper | https://zenodo.org/records/20287390 | | |
| | DOI | https://doi.org/10.5281/zenodo.20287197 | | |
| | GitHub | https://github.com/vignesh2027/TemporalMesh-Transformer | | |
| | HuggingFace Model | https://huggingface.co/vigneshwar234/TemporalMesh-Transformer | | |
| | HuggingFace Dataset | https://huggingface.co/datasets/vigneshwar234/TMT-Benchmarks | | |
| | Live Demo | https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo | | |
| | GitHub Pages | https://vignesh2027.github.io/TemporalMesh-Transformer/ | | |
| --- | |
| <div align="center"> | |
| **Built from scratch. Every attention head. Every graph edge. Every exit gate.** | |
| *Vigneshwar LK — Takshashila University, CSE 2022–26* | |
| </div> | |
| <img src="https://capsule-render.vercel.app/api?type=waving&color=0:16213e,50:1a1a2e,100:0d1117&height=120§ion=footer" width="100%"/> | |