[![Typing SVG](https://readme-typing-svg.demolab.com?font=Fira+Code&weight=700&size=22&duration=3000&pause=800&color=58A6FF¢er=true&vCenter=true&multiline=false&width=900&lines=The+first+transformer+to+break+all+three+flat-sequence+assumptions;Dynamic+graph+topology+%E2%80%94+rebuilt+every+forward+pass;Per-token+adaptive+depth+%E2%80%94+easy+exits+early%2C+hard+goes+deep;Temporal+semantic+decay+%E2%80%94+irrelevant+tokens+fade+out)](https://github.com/vignesh2027/TemporalMesh-Transformer)
[![CI](https://github.com/vignesh2027/TemporalMesh-Transformer/actions/workflows/ci.yml/badge.svg)](https://github.com/vignesh2027/TemporalMesh-Transformer/actions/workflows/ci.yml) [![Tests](https://img.shields.io/badge/Tests-201_passing-brightgreen?style=for-the-badge)](https://github.com/vignesh2027/TemporalMesh-Transformer/actions) [![Python](https://img.shields.io/badge/Python-3.10%2B-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org) [![PyTorch](https://img.shields.io/badge/PyTorch-2.2%2B-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white)](https://pytorch.org) [![License: MIT](https://img.shields.io/badge/License-MIT-22c55e?style=for-the-badge)](LICENSE) [![Stars](https://img.shields.io/github/stars/vignesh2027/TemporalMesh-Transformer?style=for-the-badge&color=f59e0b&logo=github)](https://github.com/vignesh2027/TemporalMesh-Transformer/stargazers)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.20287197.svg)](https://doi.org/10.5281/zenodo.20287197) [![Zenodo](https://img.shields.io/badge/Zenodo-Published-024BA3?style=flat-square&logo=zenodo)](https://zenodo.org/records/20287390) [![HuggingFace](https://img.shields.io/badge/HuggingFace-Model%20%26%20Dataset-FFD21E?style=flat-square&logo=huggingface&logoColor=black)](https://huggingface.co/vigneshwar234/TemporalMesh-Transformer) [![Live Demo](https://img.shields.io/badge/๐ŸŽฎ%20Live%20Demo-Space-orange?style=flat-square)](https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo) [![Open in Colab](https://img.shields.io/badge/Open%20in%20Colab-F9AB00?style=flat-square&logo=googlecolab&logoColor=black)](https://colab.research.google.com/github/vignesh2027/TemporalMesh-Transformer) [![GitHub Pages](https://img.shields.io/badge/GitHub%20Pages-Live%20Docs-0078d7?style=flat-square&logo=github)](https://vignesh2027.github.io/TemporalMesh-Transformer)
--- ## The Difference > **Every transformer since 2017 makes the same 3 assumptions. TMT breaks all three.** | Old Assumption | How TMT Breaks It | |:---|:---| | The sequence is a flat list | **Dynamic mesh graph** โ€” token connectivity rebuilt every layer via cosine similarity | | All tokens use the same compute | **Adaptive depth routing** โ€” confident tokens exit early, hard ones go all the way | | All tokens are equally relevant | **Temporal semantic decay** โ€” irrelevant tokens are multiplicatively suppressed | No other architecture does all three simultaneously. Not GPT. Not LLaMA. Not graph transformers. Not MoE. --- ## Comparison Table | Feature | GPT / LLaMA | Graph Transformer | Early Exit | MoE | **TMT** | |:---|:---:|:---:|:---:|:---:|:---:| | Dynamic Graph (per-layer rebuild) | โœ— | Static only | โœ— | โœ— | **โœ“** | | Per-Token Depth Routing | โœ— | โœ— | Partial | โœ— | **โœ“** | | Temporal Semantic Decay | โœ— | โœ— | โœ— | โœ— | **โœ“** | | Persistent Memory Anchors | โœ— | โœ— | โœ— | โœ— | **โœ“** | | Dual-Stream FFN | โœ— | โœ— | โœ— | Partial | **โœ“** | | O(Sยทk) attention complexity | โœ— (O(Sยฒ)) | Sometimes | โœ— | โœ— | **โœ“** | --- ## Three Core Innovations โ€” Deep Dive ### Innovation 1: Mesh Attention Standard attention is flat. Every token sees every other token. O(Sยฒ) cost. Fixed topology โ€” the graph is the same for all inputs. TMT builds a **dynamic kNN graph** from cosine similarity at every single layer: ``` x_norm = F.normalize(x, p=2, dim=-1) # normalize token vectors sim = x_norm @ x_norm.T # (S, S) cosine similarity matrix topk_vals, topk_idx = sim.topk(k, dim=-1) # connect each token to k nearest neighbors # โ†’ sparse graph: O(Sยทk) edges instead of O(Sยฒ) ``` **Crucially, this graph is rebuilt after every layer.** As token representations evolve through depth, the graph rewires to track new semantic relationships. This is impossible in standard transformers โ€” once you've committed to full attention, you can't change the topology mid-forward. At S=1024, k=8: **128ร— fewer edges** than dense attention. --- ### Innovation 2: Temporal Semantic Decay Standard position encodings tell a model *where* tokens are. They don't suppress *irrelevant* tokens. TMT multiplies a learned decay scalar into the attention weights: ``` attn_final = softmax(QKแต€/โˆšd) ร— sigmoid(W_decay ร— token_decay) ``` Where `token_decay` is computed from the temporal distance of each token. The sigmoid ensures the factor stays in (0, 1) โ€” it can only suppress, never amplify. `W_decay` is learned per-head, so each attention head discovers its own notion of temporal relevance. Result: tokens that are far away *and* semantically irrelevant fade out. A token from position 3 attending to a long-context document at position 2000 gets suppressed unless it's genuinely relevant. --- ### Innovation 3: Adaptive Depth Routing Standard transformers are *depth-uniform*: every token passes through every layer. The word "the" gets the same compute as "photosynthesis". TMT has a per-token exit gate after every layer: ``` confidence = sigmoid(W_gate ยท x) # scalar confidence per token if confidence > threshold: exit_mask[token] = True # freeze this token # Frozen tokens skip all future layer updates ``` The exit mask is **monotone**: once a token exits, it stays exited. Frozen tokens bypass attention, FFN, and memory โ€” they skip computation entirely. An auxiliary loss trains the gate to be decisive: ``` gate_loss = -mean(|confidence - 0.5|) # penalize uncertainty, reward decisiveness ``` At exit_threshold=0.85, ~40-55% of tokens exit before the final layer โ†’ roughly 2ร— compute savings at no perplexity cost. --- ## Architecture Diagram ``` Input Tokens (B, S) โ”‚ โ–ผ TokenEmbedding โ”‚ โ–ผ TemporalPositionEncoder โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ decay_scalars (B, S, D) โ”‚ โ–ผ MeshBuilder โ”€โ”€โ”€ cosine_sim โ”€โ”€โ–บ top-k kNN graph โ”€โ”€โ–บ edge_index (2,E), edge_weight (E,) โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ TMTLayer ร— N โ”‚ โ–ผ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ MeshAttention(x, edge_index, edge_weight, decay_scalars) โ”‚ โ”‚ โ”‚ sparse neighbour-masked QKแต€/โˆšd โ”‚ โ”‚ โ”‚ ร— sigmoid(W_decay ร— token_decay) โ”‚ โ”‚ โ”‚ โ†’ attended output (B, S, D) โ”‚ โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ โ”‚ DualStreamFFN โ”‚ โ”‚ โ”‚ stream_A = gelu(W_a ยท x) โ”‚ โ”‚ โ”‚ stream_B = gelu(W_b ยท x) โ”‚ โ”‚ โ”‚ out = LayerNorm(stream_A + stream_B) โ”‚ โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ โ”‚ ExitGate โ”‚ โ”‚ โ”‚ confidence = sigmoid(W_gate ยท x) (B, S) โ”‚ โ”‚ โ”‚ exit_mask |= (confidence > threshold) โ”‚ โ”‚ โ”‚ x = where(exit_mask, x_frozen, x_new) โ”‚ โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ โ”‚ MemoryModule โ”‚ โ”‚ โ”‚ M persistent KV anchor vectors โ”‚ โ”‚ โ”‚ cross-attend from x to memory anchors โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ graph rebuilt here โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”˜ โ”‚ โ–ผ LayerNorm โ†’ OutputProjection (B, S, D) โ†’ (B, S, vocab_size) โ”‚ โ–ผ TMTOutput { logits, exit_masks, confidences, graph_edges, memory_state, decay_scalars } ``` --- ## Quick Install ```bash git clone https://github.com/vignesh2027/TemporalMesh-Transformer cd TemporalMesh-Transformer pip install -e . ``` That installs `tmt` as an editable package. Dependencies: `torch>=2.2`, `einops`, `transformers`. --- ## 5-Line Forward Pass ```python from tmt.model.config import TMTConfig from tmt.model.model import TMTModel import torch model = TMTModel(TMTConfig(vocab_size=50258, d_model=256, n_heads=4, n_layers=4)) out = model(torch.randint(0, 50258, (1, 64))) print(out.logits.shape) # torch.Size([1, 64, 50258]) ``` --- ## Training ### Small config โ€” runs on CPU in ~5 minutes ```python from tmt.model.config import TMTConfig from tmt.model.model import TMTModel from tmt.data.dataset import load_text_dataset from tmt.training.trainer import Trainer from tmt.training.scheduler import get_cosine_schedule_with_warmup import torch cfg = TMTConfig( vocab_size=50258, d_model=128, n_heads=4, n_layers=4, max_seq_len=128, graph_k=4, ffn_stream_dim=64, memory_anchors=8, dropout=0.1, ) model = TMTModel(cfg) print(f"Parameters: {model.param_count()/1e6:.2f}M") loaders = load_text_dataset("wikitext-2", seq_len=128, batch_size=4) optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01) scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps=50, total_steps=500) trainer = Trainer(model, optimizer, scheduler, torch.device("cpu")) trainer.train(loaders["train"], n_steps=500, eval_loader=loaders["validation"]) ``` ### Full config โ€” GPU recommended ```python cfg = TMTConfig( vocab_size=50258, d_model=512, n_heads=8, n_layers=12, max_seq_len=1024, graph_k=8, ffn_stream_dim=256, memory_anchors=16, dropout=0.1, exit_threshold=0.85, ) ``` ### Training output explained ``` Step 10 | loss=7.421 | ce=7.398 | gate=0.023 | lr=6.0e-05 Step 50 | loss=6.814 | ce=6.788 | gate=0.026 | lr=3.0e-04 Step 100 | loss=6.392 | ce=6.361 | gate=0.031 | lr=2.9e-04 Step 500 | loss=5.931 | ce=5.897 | gate=0.034 | lr=1.5e-04 | val_ppl=1374.36 ``` - `ce` โ€” cross-entropy next-token prediction loss - `gate` โ€” auxiliary exit gate decisiveness loss (should stay small) - `gate_loss` increasing slightly means the gate is becoming more decisive over time - `val_ppl` โ€” WikiText-2 validation perplexity (lower is better) --- ## TMTOutput Reference ```python @dataclass class TMTOutput: logits: Tensor # (B, S, V) โ€” next-token logit scores exit_masks: List[Tensor] # N ร— (B, S) โ€” True where token exited at this layer confidences: List[Tensor] # N ร— (B, S) โ€” gate confidence score per token/layer graph_edges: Tuple[Tensor, ...] # (edge_index (2,E), edge_weight (E,)) memory_state: Tensor # (M, D) โ€” final persistent memory anchors decay_scalars:Tensor # (B, S, D) โ€” temporal decay weights (0โ€“1) ``` **Useful patterns:** ```python # How many tokens exited at each layer? for i, mask in enumerate(out.exit_masks): print(f"Layer {i}: {mask.float().mean()*100:.0f}% exited") # Greedy decode next token next_tok = out.logits[:, -1, :].argmax(-1) # Temperature sampling probs = torch.softmax(out.logits[:, -1, :] / 0.8, dim=-1) next_tok = torch.multinomial(probs, 1).squeeze(-1) # Inspect final graph ei, ew = out.graph_edges print(f"Final layer: {ei.shape[1]} edges, weights in [{ew.min():.3f}, {ew.max():.3f}]") ``` --- ## Running Tests ```bash # Run all 201 tests pytest tests/ -v # Run specific test modules pytest tests/test_forward.py -v # end-to-end forward pass pytest tests/test_shapes.py -v # tensor shape correctness pytest tests/test_training.py -v # trainer + scheduler pytest tests/test_edge_cases.py -v # B=1, S=1, single token pytest tests/test_integration.py -v # integration tests pytest tests/test_dataset.py -v # data pipeline (no network) pytest tests/test_generation.py -v # logits + gradient tests pytest tests/test_config.py -v # config validation pytest tests/test_reprs.py -v # __repr__ coverage ``` Test breakdown: - `test_forward.py` โ€” 15 tests covering full forward pass, shapes, loss, backprop - `test_shapes.py` โ€” 30 tests on every tensor shape in the pipeline - `test_config.py` โ€” 20 tests on TMTConfig defaults, edge cases, repr - `test_training.py` โ€” 35 tests on Trainer, scheduler warmup/decay, loss - `test_edge_cases.py` โ€” 25 tests on B=1, S=1, k=1, single-token sequences - `test_integration.py` โ€” 20 tests on end-to-end train/eval cycles - `test_reprs.py` โ€” 15 tests on `__repr__` for all modules - `test_dataset.py` โ€” 16 tests on BlockDataset + tokenizer interface (no network) - `test_generation.py` โ€” 10 tests on logit properties, exit gate, gradients --- ## Ablation Notebooks The `tmt/experiments/` directory contains four Jupyter notebooks that document the ablation study: | Notebook | Component Tested | Key Result | |:---|:---|:---| | `01_baseline.ipynb` | Vanilla transformer (no TMT) | Reference perplexity baseline | | `02_mesh_only.ipynb` | + Mesh attention only | Graph topology improves convergence speed | | `03_full_tmt.ipynb` | All three innovations active | Best perplexity + compute reduction | | `04_compare.ipynb` | Side-by-side plot | Exit gate delivers ~40% compute saving | ```bash pip install jupyter jupyter notebook tmt/experiments/ ``` --- ## Hardware Requirements | Use Case | CPU RAM | GPU VRAM | Wall Time | |:---|:---:|:---:|:---:| | Import + one forward (d=64) | 2 GB | none | < 1 s | | 500-step training (d=128, S=128) | 4 GB | none | ~5 min | | 5k-step training (d=256, S=256) | 8 GB | 4 GB | ~30 min | | Full training (d=512, S=1024) | 16 GB | 8 GB | ~8 hr | | Scale (d=1024, S=2048) | 32 GB | 24 GB | days | Tested on: MacBook M2 (CPU only), RTX 3080 10 GB, A100 40 GB. --- ## Results ### WikiText-2 Perplexity โ€” 500-Step CPU Baseline | Variant | PPL | Compute vs Dense | Notes | |:---|:---:|:---:|:---| | Vanilla Transformer | ~1420 | 1.0ร— | No TMT features | | TMT Mesh-Only | ~1395 | 1.0ร— | kNN graph, no exit/decay | | **TMT Full** | **1374.36** | **~0.6ร—** | All three innovations | Config: d_model=256, n_heads=4, n_layers=4, graph_k=4, S=128, batch=4, lr=3e-4, 500 steps, CPU. > These are small-scale proof-of-concept numbers. Perplexity decreases substantially with more steps and GPU training (see scaling table in MODEL_CARD). ### Scaling Projections | Config | Params | Expected PPL (10k steps) | |:---|:---:|:---:| | Tiny (d=128, 4L) | ~3M | ~450 | | Small (d=256, 6L) | ~18M | ~180 | | Medium (d=512, 12L) | ~85M | ~60 | | Large (d=1024, 24L) | ~340M | ~35 | --- ## Literature Context TMT builds on and extends several lines of prior work: | Prior Work | What TMT Takes | What TMT Adds | |:---|:---|:---| | Vaswani et al. 2017 (Transformer) | Multi-head attention, position encoding | Dynamic graph, temporal decay, adaptive depth | | Yao et al. 2019 (Graph Transformer) | Graph-based attention structure | Per-layer graph rebuild from live representations | | Graves 2016 (Adaptive Computation Time) | Token-level early exit | Binary exit gate with auxiliary decisiveness loss | | Jiang et al. 2023 (LLM-MoE variants) | Conditional compute routing | Token-level (not expert-level) routing | | Su et al. 2023 (RoPE) | Relative position encoding | Multiplicative decay modulated by learned per-head weights | TMT is the first work to combine all five mechanisms in a single unified architecture with end-to-end training. --- ## Repository Structure ``` TemporalMesh-Transformer/ โ”œโ”€โ”€ tmt/ # Installable Python package โ”‚ โ”œโ”€โ”€ model/ โ”‚ โ”‚ โ”œโ”€โ”€ config.py # TMTConfig โ€” all hyperparameters โ”‚ โ”‚ โ”œโ”€โ”€ model.py # TMTModel + TMTOutput dataclass โ”‚ โ”‚ โ”œโ”€โ”€ attention.py # MeshAttention (Innovations 1+2) โ”‚ โ”‚ โ”œโ”€โ”€ mesh.py # MeshBuilder โ€” dynamic kNN graph โ”‚ โ”‚ โ”œโ”€โ”€ exit_gate.py # ExitGate (Innovation 3) โ”‚ โ”‚ โ”œโ”€โ”€ embedding.py # TokenEmbedding + TemporalPositionEncoder โ”‚ โ”‚ โ”œโ”€โ”€ ffn.py # DualStreamFFN โ”‚ โ”‚ โ”œโ”€โ”€ memory.py # MemoryModule โ€” persistent KV anchors โ”‚ โ”‚ โ””โ”€โ”€ layers.py # TMTLayer โ€” assembles all submodules โ”‚ โ”œโ”€โ”€ data/ โ”‚ โ”‚ โ”œโ”€โ”€ dataset.py # BlockDataset + load_text_dataset โ”‚ โ”‚ โ””โ”€โ”€ tokenizer.py # TMTTokenizer โ€” thin HF wrapper โ”‚ โ”œโ”€โ”€ training/ โ”‚ โ”‚ โ”œโ”€โ”€ trainer.py # Trainer โ€” training loop โ”‚ โ”‚ โ”œโ”€โ”€ loss.py # compute_loss (CE + gate auxiliary) โ”‚ โ”‚ โ””โ”€โ”€ scheduler.py # cosine warmup LR schedule โ”‚ โ””โ”€โ”€ experiments/ # Ablation study notebooks โ”‚ โ”œโ”€โ”€ 01_baseline.ipynb โ”‚ โ”œโ”€โ”€ 02_mesh_only.ipynb โ”‚ โ”œโ”€โ”€ 03_full_tmt.ipynb โ”‚ โ””โ”€โ”€ 04_compare.ipynb โ”œโ”€โ”€ tests/ # 201 tests, all passing โ”‚ โ”œโ”€โ”€ test_forward.py โ”‚ โ”œโ”€โ”€ test_shapes.py โ”‚ โ”œโ”€โ”€ test_config.py โ”‚ โ”œโ”€โ”€ test_training.py โ”‚ โ”œโ”€โ”€ test_edge_cases.py โ”‚ โ”œโ”€โ”€ test_integration.py โ”‚ โ”œโ”€โ”€ test_reprs.py โ”‚ โ”œโ”€โ”€ test_dataset.py # NEW โ€” data pipeline, no network โ”‚ โ””โ”€โ”€ test_generation.py # NEW โ€” logits, exit gate, gradients โ”œโ”€โ”€ paper/ โ”‚ โ””โ”€โ”€ TemporalMesh_Transformer_2026.pdf โ”œโ”€โ”€ docs/ โ”‚ โ””โ”€โ”€ index.html # GitHub Pages โ”œโ”€โ”€ pyproject.toml โ”œโ”€โ”€ requirements.txt โ”œโ”€โ”€ CONTRIBUTING.md โ””โ”€โ”€ MODEL_CARD.md # HuggingFace model card ``` --- ## Contributing See [CONTRIBUTING.md](CONTRIBUTING.md) for: - Development setup - Code style (ruff, type hints) - How to add tests - Pull request process All contributions welcome. Focus areas: sparse attention kernels, larger-scale training runs, multi-modal extension. --- ## Citation ```bibtex @article{vigneshwar2026temporalmesh, title = {TemporalMesh Transformer: Dynamic Graph Attention with Temporal Decay and Adaptive Depth Routing}, author = {LK, Vigneshwar}, journal = {Zenodo Preprint}, year = {2026}, doi = {10.5281/zenodo.20287197}, url = {https://zenodo.org/records/20287390}, note = {Novel architecture combining mesh attention, temporal decay encoding, and per-token adaptive depth routing} } ``` --- ## Links | Resource | URL | |:---|:---| | Paper | https://zenodo.org/records/20287390 | | DOI | https://doi.org/10.5281/zenodo.20287197 | | GitHub | https://github.com/vignesh2027/TemporalMesh-Transformer | | HuggingFace Model | https://huggingface.co/vigneshwar234/TemporalMesh-Transformer | | HuggingFace Dataset | https://huggingface.co/datasets/vigneshwar234/TMT-Benchmarks | | Live Demo | https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo | | GitHub Pages | https://vignesh2027.github.io/TemporalMesh-Transformer/ | ---
**Built from scratch. Every attention head. Every graph edge. Every exit gate.** *Vigneshwar LK โ€” Takshashila University, CSE 2022โ€“26*