TTTPilot-Q-5B-Thinking-MAC

A Pilot Implementation of Titans MAC Architecture

Combining TTT-Linear-1.3B-Base-Pile-8k and Qwen3-4B-Thinking-2507 using the Memory as Context (MAC) architecture pattern.


🎯 Pilot Experiment Overview

This is an experimental pilot exploring how test-time training (TTT) memory layers can be combined with standard transformer cores in a modular architecture. The MAC pattern separates:

  • Memory Layers: TTT-Linear's self-adaptation mechanism for dynamic context learning
  • Core Layers: Qwen3's transformer decoder for reasoning and generation

Key Idea: MAC Processing Flow

Input Sequence (Processed in Segments)
    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. Read-Only Retrieval (R-Mode)   β”‚  ← Memory layers (Q-only projection)
β”‚    Generate memory queries        β”‚    Enables parallel computation
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 2. Core Processing                β”‚  ← Qwen3 transformer layers
β”‚    [Fixed Memory + Query + Input] β”‚    Standard attention + MLP
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 3. Memory Update (W-Mode)         β”‚  ← Memory layers (full QKV)
β”‚    Update context representations β”‚    Test-time adaptation
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    ↓
Final Output (Join updated context with core output)
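
The same flow as a minimal Python sketch. Every name below (mac_forward, retrieve_q, memory_read, memory_update, core) is illustrative and only mirrors the three steps above; it is not the pilot's actual API.

import torch

def mac_forward(segments, retrieve_q, memory_read, memory_update, core, fixed_memory):
    """Illustrative MAC loop: read memory, run the core, update memory."""
    outputs = []
    for seg in segments:                           # seg: (seg_len, hidden)
        q = retrieve_q(seg)                        # 1. R-mode: Q-only projection (read-only)
        context = memory_read(q)                   #    retrieve from the current memory state
        core_in = torch.cat([fixed_memory, context, seg], dim=0)
        h = core(core_in)[-seg.shape[0]:]          # 2. core: Qwen3 layers over [memory | context | input]
        memory_update(seg)                         # 3. W-mode: full QKV, test-time adaptation
        outputs.append(torch.cat([memory_read(q), h], dim=-1))  # join updated context with core output
    return outputs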

Architecture Benefits

  1. Parallel Memory Retrieval: Q-only projection in R-mode enables efficient segment processing
  2. Weight Tying: retriever.q shares weights with memory.q for efficiency (see the sketch after this list)
  3. Modular Design: Memory and core can be independently scaled/fine-tuned
  4. Hybrid Capabilities: Combines TTT's adaptive learning with transformer's proven performance
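
Benefit 2 refers to parameter sharing between the retriever's query projection and the memory module's query projection. A minimal PyTorch sketch of that kind of tying (the module layout here is illustrative, not the pilot's actual classes):

import torch.nn as nn

# Illustrative weight tying between retriever.q and memory.q (toy layout).
hidden_size = 2048
memory_q = nn.Linear(hidden_size, hidden_size, bias=False)     # memory.q (read/write path)
retriever_q = nn.Linear(hidden_size, hidden_size, bias=False)  # retriever.q (R-mode only)
retriever_q.weight = memory_q.weight    # tie: both projections share one parameter tensor

assert retriever_q.weight is memory_q.weight   # stored and updated once, used in both modes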

πŸ“Š Model Statistics

Component            Source Model       Layers  Parameters              Intermediate Size
Embedding & LM Head  Qwen3-4B-Thinking  -       ~310M (vocab: 151,936)  -
Memory Layers        TTT-Linear-1.3B    24      ~1.3B                   5,504
Core Layers          Qwen3-4B-Thinking  36      ~4B                     9,728
Total                Combined MAC       60      ~5.6B                   Mixed

Hidden Size: 2048 (from TTT-Linear)
Attention Heads: 32 (memory), 32/8 GQA (core)
Context Length: Up to 262K tokens (Qwen3's max)
Precision: BFloat16
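
As a rough sanity check, the per-component counts quoted above add up to the stated total:

# Approximate figures from the table above.
embed_and_lm_head = 0.31e9   # ~310M  (Qwen3 embedding & LM head)
memory_layers     = 1.3e9    # ~1.3B  (24 TTT-Linear layers)
core_layers       = 4.0e9    # ~4B    (36 Qwen3 layers)
print(f"~{(embed_and_lm_head + memory_layers + core_layers) / 1e9:.1f}B")  # ~5.6B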


πŸ—οΈ Architecture Details

Memory Module (TTT-Linear)

  • Purpose: Dynamic context adaptation through test-time training
  • Key Components:
    • Self-adaptation layers with learnable neural memory
    • Momentum-based learning rate gates
    • Q/K/V projections with RoPE
    • Mini-batch processing (chunk_size=16)
  • Special Features:
    • Weight tying between retriever and full memory
    • Shared Q/K projections with separate conv layers
    • Learnable token-wise learning rates
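
The adaptation mechanism can be illustrated with a deliberately simplified sketch of one test-time training step: the memory is a weight matrix that takes a gradient step on a self-supervised loss for every chunk it reads, then serves reads through the updated weights. This shows the concept only; the real TTT-Linear layer adds the momentum gates, learnable token-wise learning rates, normalization, and Q/K/V projections listed above.

import torch

def ttt_linear_step(W, chunk, base_lr=1.0):
    """One conceptual TTT step: adapt memory weights W on a chunk, then read through them."""
    W = W.detach().requires_grad_(True)
    loss = ((chunk @ W - chunk) ** 2).mean()     # simplified reconstruction objective
    (grad,) = torch.autograd.grad(loss, W)
    W_new = (W - base_lr * grad).detach()        # inner-loop SGD step (no momentum here)
    return chunk @ W_new, W_new                  # read output, updated memory state

W = torch.zeros(2048, 2048)                      # memory state (hidden_size x hidden_size)
chunk = torch.randn(16, 2048)                    # one mini-batch of 16 tokens (chunk_size=16)
out, W = ttt_linear_step(W, chunk)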

Core Module (Qwen3)

  • Purpose: High-capacity reasoning and generation
  • Key Components:
    • Multi-head attention with Grouped Query Attention (GQA)
    • SwiGLU MLP activations
    • RMSNorm for layer normalization
    • RoPE with theta=5,000,000 for long context
  • Special Features:
    • 8 KV heads for efficient inference
    • No attention bias
    • Sliding window attention support
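
For intuition, a shape-level sketch of the 32/8 grouped-query split (dimensions are illustrative; RoPE, masking, and normalization are omitted):

import torch

# Grouped Query Attention: 32 query heads share 8 KV heads (4 per group).
batch, seq, head_dim = 1, 16, 64
num_q_heads, num_kv_heads = 32, 8
group = num_q_heads // num_kv_heads              # 4 query heads per KV head

q = torch.randn(batch, num_q_heads, seq, head_dim)
k = torch.randn(batch, num_kv_heads, seq, head_dim)   # KV cache is 4x smaller than full MHA
v = torch.randn(batch, num_kv_heads, seq, head_dim)

k = k.repeat_interleave(group, dim=1)            # expand KV heads to match the query heads
v = v.repeat_interleave(group, dim=1)
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1) @ v
print(attn.shape)                                # torch.Size([1, 32, 16, 64])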

Fixed Persistent Memory

  • Size: 64 tokens Γ— 2048 dimensions
  • Purpose: Store global context/knowledge across segments
  • Initialization: Zeros (trainable parameter)
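
In code terms this is just a trainable tensor prepended (together with the retrieved context) to each segment before core processing; a minimal sketch:

import torch
import torch.nn as nn

fixed_memory = nn.Parameter(torch.zeros(64, 2048))       # 64 tokens x 2048 dims, zero-init, trainable

segment = torch.randn(512, 2048)                         # toy input segment
core_input = torch.cat([fixed_memory, segment], dim=0)   # persistent memory goes in front
print(core_input.shape)                                  # torch.Size([576, 2048])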

πŸš€ Usage

Installation

# Install dependencies
pip install torch transformers safetensors accelerate

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./TTTPilot-Q-5B-Thinking-MAC",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("./TTTPilot-Q-5B-Thinking-MAC")

# Generate text
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Advanced: Segment Processing

# MAC processes inputs in segments for memory efficiency
# Segment size controlled by mini_batch_size (default: 16)

# For long sequences, the model automatically:
# 1. Chunks input into mini-batches
# 2. Processes each chunk through R-mode β†’ Core β†’ W-mode
# 3. Accumulates context updates across chunks

long_text = "..." * 1000  # Very long input
inputs = tokenizer(long_text, return_tensors="pt", truncation=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)

πŸ”§ Weight Conversion

To recreate this model from source checkpoints:

python convert_weights.py

The script:

  1. Loads TTT-Linear-1.3B and Qwen3-4B-Thinking
  2. Maps TTT weights β†’ memory layers (preserving exact key names)
  3. Maps Qwen3 weights β†’ core layers (preserving exact key names)
  4. Uses Qwen3's embedding & lm_head (for vocab compatibility)
  5. Copies tokenizer files from Qwen3
  6. Saves combined model in HuggingFace format
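
A condensed sketch of the key-mapping idea in steps 2-4. The local paths and the memory_layers/core_layers prefixes below are placeholders; the actual script preserves the exact key names expected by the MAC model class.

from safetensors.torch import save_file
from transformers import AutoModelForCausalLM

# Load both source checkpoints (paths are placeholders for local copies).
ttt = AutoModelForCausalLM.from_pretrained("./TTT-Linear-1.3B-Base-Pile-8k",
                                           trust_remote_code=True).state_dict()
qwen = AutoModelForCausalLM.from_pretrained("./Qwen3-4B-Thinking-2507").state_dict()

combined = {}
for name, tensor in ttt.items():
    if ".layers." in name:                       # TTT decoder blocks -> memory layers
        combined[name.replace("model.layers.", "model.memory_layers.")] = tensor.clone()
for name, tensor in qwen.items():
    if ".layers." in name:                       # Qwen3 decoder blocks -> core layers
        combined[name.replace("model.layers.", "model.core_layers.")] = tensor.clone()
    else:                                        # embeddings, lm_head, final norm from Qwen3
        combined[name] = tensor.clone()          # clone avoids shared-storage issues on save

save_file(combined, "model.safetensors")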

βš™οΈ Configuration

Key hyperparameters in config.json (the // annotations below are for illustration and are not part of the actual file):

{
  "model_type": "tttpilot_mac",
  "vocab_size": 151936,          // Qwen3
  "hidden_size": 2048,            // TTT-Linear
  "num_memory_layers": 24,        // TTT-Linear
  "num_core_layers": 36,          // Qwen3
  "memory_intermediate_size": 5504,   // TTT MLP
  "core_intermediate_size": 9728,     // Qwen3 MLP
  "num_attention_heads": 32,
  "num_key_value_heads": 8,       // GQA in cores
  "mini_batch_size": 16,          // TTT chunk size
  "ttt_base_lr": 1.0,             // TTT learning rate
  "fixed_memory_size": 64,        // Persistent memory tokens
  "rope_theta": 5000000,          // Long context RoPE
  "max_position_embeddings": 262144   // Max sequence length
}
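
With trust_remote_code enabled, the same values are exposed as attributes on the loaded config object (assuming the custom config class follows the standard PretrainedConfig pattern of mapping JSON keys to attributes):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("./TTTPilot-Q-5B-Thinking-MAC", trust_remote_code=True)

print(config.model_type)                                  # tttpilot_mac
print(config.num_memory_layers, config.num_core_layers)   # 24 36
print(config.mini_batch_size, config.fixed_memory_size)   # 16 64
print(config.max_position_embeddings)                     # 262144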

πŸ§ͺ Pilot Experiment Status

What Works

βœ… Model architecture defined
βœ… Weight conversion pipeline
βœ… Configuration files generated
βœ… Tokenizer compatibility (Qwen3)
βœ… Basic forward pass structure

What's Experimental

⚠️ MAC segment processing: Simplified in pilot, needs full TTT integration
⚠️ Retriever implementation: Placeholder, requires Q-only inference mode
⚠️ Weight tying: Defined but not verified in practice
⚠️ Memory update logic: Full TTT adaptation step needs integration

Known Limitations

  • Memory layers use placeholder identity functions (need full TTT code)
  • No actual segment-based MAC flow yet (processes like standard transformer)
  • TTT cache and Qwen3 KV cache not properly integrated
  • No training/fine-tuning tested
  • Generation quality not benchmarked

πŸŽ“ Research Context

This pilot implements concepts from:

  1. Test-Time Training (TTT): Self-supervised adaptation during inference

  2. Titans Architecture: Modular memory-augmented patterns

    • Inspiration: Memory-as-X design patterns (MAC, MAG, MAL, etc.)
    • Idea: Separate stateful memory from stateless reasoning
  3. Grouped Query Attention: Efficient multi-head attention

    • From: Qwen3 and modern LLMs
    • Benefit: Faster inference with minimal quality loss

πŸ“ Citation

If you use this pilot or build upon it:

@misc{tttpilot-mac-2026,
  title={TTTPilot-MAC: A Pilot Implementation of the Memory as Context (MAC) Architecture},
  author={Your Name},
  year={2026},
  note={Pilot experiment combining TTT-Linear and Qwen3},
  howpublished={\url{https://github.com/...}}
}

Source Models:

  • TTT-Linear-1.3B-Base-Pile-8k
  • Qwen3-4B-Thinking-2507

πŸ“„ License

This project combines:

  • TTT-Linear (MIT License)
  • Qwen3 (Apache 2.0 License)

Final license: Apache 2.0 (compatible with both)

See LICENSE for details.


🀝 Contributing

This is a pilot experiment for research exploration. Contributions welcome:

  1. Full MAC segment processing implementation
  2. TTT-Linear integration (replace placeholders)
  3. Retriever Q-only mode
  4. Training scripts
  5. Benchmark evaluations
  6. Documentation improvements

πŸ› Issues & Feedback

Found a bug or have suggestions? Open an issue!

Important Notes:

  • This is NOT a production-ready model
  • Use for research/experimentation only
  • Performance not guaranteed
  • May require significant compute resources (5.6B parameters)

🌟 Acknowledgments

  • TTT Team for the test-time training paradigm
  • Qwen Team for Qwen3-Thinking model
  • Titans Architecture inspiration from modular design patterns

Status: 🚧 Experimental Pilot - Use with caution!
