Mixture of Attention Heads: Selecting Attention Heads Per Token Paper • 2210.05144 • Published Oct 11, 2022 • 3 (a generic top-k head-routing sketch follows this list)
MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling Paper • 2602.03359 • Published Feb 3 • 10
MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers Paper • 2602.00398 • Published Jan 30 • 5
Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers Paper • 2602.18292 • Published 17 days ago • 10 (a minimal top-k/top-p sampling sketch follows this list)
RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs Paper • 2602.05367 • Published Feb 5 • 7
DFlash: Block Diffusion for Flash Speculative Decoding Paper • 2602.06036 • Published Feb 5 • 42 (the standard speculative-sampling acceptance rule is sketched after this list)
POP: Prefill-Only Pruning for Efficient Large Model Inference Paper • 2602.03295 • Published Feb 3 • 4
Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in {±1, ±i} Paper • 2512.02901 • Published Dec 2, 2025 • 6
Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models Paper • 2511.23319 • Published Nov 28, 2025 • 24
Metis: Training Large Language Models with Advanced Low-Bit Quantization Paper • 2509.00404 • Published Aug 30, 2025 • 7
Jamba 1.7 Collection The AI21 Jamba family comprises hybrid SSM-Transformer foundation models that blend speed, efficient long-context processing, and accuracy. • 4 items • Updated Jul 2, 2025 • 12
BitVLA Collection 1-bit Vision-Language-Action Models for Robotics Manipulation • 9 items • Updated 7 days ago • 4
BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs Paper • 2504.18415 • Published Apr 25, 2025 • 49
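As background for the Mixture of Attention Heads entry (2210.05144) above: selecting attention heads per token is, in the MoE literature, typically done with a learned top-k router. The sketch below shows only that generic routing pattern, not necessarily the paper's exact formulation; the names (route_heads, W_router) and toy shapes are assumptions for illustration.

```python
import numpy as np

def route_heads(x: np.ndarray, W_router: np.ndarray, k: int):
    """Generic per-token top-k routing: score each head, keep the k best.

    x: (d,) token representation; W_router: (n_heads, d) router weights.
    Returns the indices of the selected heads and their softmax weights,
    so the token's output mixes only k of n_heads attention heads.
    """
    scores = W_router @ x                        # one routing score per head
    top = np.argsort(scores)[-k:]                # indices of the k best heads
    w = np.exp(scores[top] - scores[top].max())  # softmax over selected heads only
    return top, w / w.sum()

rng = np.random.default_rng(0)
d, n_heads, k = 16, 8, 2
x = rng.standard_normal(d)
W_router = rng.standard_normal((n_heads, d))
heads, weights = route_heads(x, W_router, k)     # run only these k heads per token
```

The point of the sparse selection is that the cost of attention scales with the number of active heads, not the total head count.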
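For the decoding-on-the-simplex entry (2602.18292): top-k and top-p (nucleus) are the standard truncated samplers the title starts from. Below is a minimal numpy sketch of those two standard filters, not the paper's optimisation-based view of them; the function names and the toy distribution are invented for the example.

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k most probable tokens, then renormalize."""
    kept = np.argsort(probs)[-k:]          # indices of the k largest probabilities
    out = np.zeros_like(probs)
    out[kept] = probs[kept]
    return out / out.sum()

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = np.argsort(probs)[::-1]             # tokens, most to least probable
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1   # shortest prefix with mass >= p
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

rng = np.random.default_rng(0)
probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])  # toy next-token distribution
print(top_k_filter(probs, 2))                     # mass restricted to top-2 tokens
print(top_p_filter(probs, 0.90))                  # nucleus: tokens 0-2 cover 0.90
token = rng.choice(len(probs), p=top_p_filter(probs, 0.90))
```

Both filters project the model's distribution onto a face of the probability simplex before sampling, which is the geometric framing the paper's title alludes to.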
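For the DFlash entry (2602.06036): speculative decoding, which the title builds on, verifies cheap draft tokens with an accept/reject rule that leaves the target distribution intact. The sketch below shows that standard rule, as in vanilla speculative sampling, not DFlash's block-diffusion drafter; the toy distributions are assumptions.

```python
import numpy as np

def verify_draft_token(p: np.ndarray, q: np.ndarray, x: int,
                       rng: np.random.Generator) -> int:
    """One step of the standard speculative-sampling accept/reject rule.

    p is the target model's next-token distribution, q the draft model's,
    and x the token the draft proposed (so q[x] > 0). The emitted token
    is distributed exactly according to p, so verification is lossless.
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                                   # accept the cheap draft token
    residual = np.maximum(p - q, 0.0)              # rejected: resample from the
    residual /= residual.sum()                     # normalized leftover mass
    return int(rng.choice(len(p), p=residual))

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])                      # target model
q = np.array([0.3, 0.5, 0.2])                      # draft model
draft = int(rng.choice(3, p=q))                    # draft proposes a token
emitted = verify_draft_token(p, q, draft, rng)
```

Speedups come from verifying several drafted tokens against the target model in one parallel forward pass rather than decoding them one by one.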
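For the BitNet v2 entry (2504.18415): the title pairs native 4-bit activations with a Hadamard transformation, a common device for flattening activation outliers before low-bit quantization. The sketch below illustrates only that generic effect, assuming nothing about BitNet v2's actual quantizer: because the rotation is orthogonal it can be undone after quantization, and spreading an outlier's energy across coordinates shrinks the quantization scale and the overall error. The hadamard and quantize_4bit helpers and the toy vector are illustrative assumptions.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_4bit(x: np.ndarray):
    """Toy symmetric 4-bit quantizer: round onto 15 levels in [-7, 7] * scale."""
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -7, 7), scale

n = 8
H = hadamard(n)
x = np.array([9.0, 0.8, -0.6, 0.4, 0.7, -0.5, 0.3, -0.2])  # one activation outlier

q_plain, s_plain = quantize_4bit(x)      # quantize directly: outlier sets the scale
q_rot, s_rot = quantize_4bit(H @ x)      # rotate first: outlier energy is spread out

err_plain = np.linalg.norm(q_plain * s_plain - x)
err_rot = np.linalg.norm(H.T @ (q_rot * s_rot) - x)  # H is orthogonal: H.T undoes it
print(f"plain: {err_plain:.3f}   rotated: {err_rot:.3f}")
```

On this toy vector the rotated path reconstructs x with roughly a quarter of the error of direct quantization, because the single outlier no longer forces a coarse grid onto the seven small values.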