# TTTPilot-Q-5B-Thinking-MAC

A pilot implementation of the Titans Memory as Context (MAC) architecture, combining TTT-Linear-1.3B-Base-Pile-8k and Qwen3-4B-Thinking-2507.
## 🎯 Pilot Experiment Overview
This is an experimental pilot exploring how test-time training (TTT) memory layers can be combined with standard transformer cores in a modular architecture. The MAC pattern separates:
- Memory Layers: TTT-Linear's self-adaptation mechanism for dynamic context learning
- Core Layers: Qwen3's transformer decoder for reasoning and generation
### Key Idea: MAC Processing Flow

```
Input Sequence (Processed in Segments)
                   ↓
┌──────────────────────────────────────┐
│ 1. Read-Only Retrieval (R-Mode)      │ ← Memory layers (Q-only projection)
│    Generate memory queries           │   Enables parallel computation
└──────────────────────────────────────┘
                   ↓
┌──────────────────────────────────────┐
│ 2. Core Processing                   │ ← Qwen3 transformer layers
│    [Fixed Memory + Query + Input]    │   Standard attention + MLP
└──────────────────────────────────────┘
                   ↓
┌──────────────────────────────────────┐
│ 3. Memory Update (W-Mode)            │ ← Memory layers (full QKV)
│    Update context representations    │   Test-time adaptation
└──────────────────────────────────────┘
                   ↓
Final Output (Join updated context with core output)
```
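In code, the three steps amount to a per-segment loop. Below is a minimal, self-contained sketch with toy stand-ins; `read_memory`, `core`, and `write_memory` are illustrative names only, not the repository's actual API, and the real model uses the TTT-Linear layers and the Qwen3 core instead of these placeholders.

```python
# Minimal sketch of the R-mode -> Core -> W-mode loop above.
# Toy stand-ins only; names and shapes are illustrative, not the repo's API.
import torch

hidden, seg_len, n_fixed = 2048, 16, 64
persistent_memory = torch.zeros(1, n_fixed, hidden)   # fixed memory tokens (trainable in the real model)
mem_weight = torch.zeros(hidden, hidden)              # toy stand-in for the TTT "neural memory" state

def read_memory(seg):                                 # 1. R-mode: query the current memory state (read-only)
    return seg @ mem_weight

def core(x):                                          # 2. stand-in for the Qwen3 core layers
    return x

def write_memory(seg, target, lr=0.1):                # 3. W-mode: gradient-style test-time update of the memory
    global mem_weight
    err = seg @ mem_weight - target
    mem_weight = mem_weight - lr * torch.einsum("bsh,bsk->hk", seg, err)

outputs = []
for segment in torch.randn(4, 1, seg_len, hidden):    # 4 segments of 16 tokens each
    retrieved = read_memory(segment)
    context = torch.cat([persistent_memory, retrieved, segment], dim=1)
    core_out = core(context)[:, -seg_len:]            # keep only the input-segment positions
    write_memory(segment, core_out)
    outputs.append(core_out)

final = torch.cat(outputs, dim=1)                     # join per-segment outputs
print(final.shape)                                    # torch.Size([1, 64, 2048])
```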
### Architecture Benefits
- Parallel Memory Retrieval: Q-only projection in R-mode enables efficient segment processing
- Weight Tying: `retriever.q` shares weights with `memory.q` for efficiency (see the sketch after this list)
- Modular Design: Memory and core can be independently scaled/fine-tuned
- Hybrid Capabilities: Combines TTT's adaptive learning with transformer's proven performance
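The weight tying mentioned above can be expressed by pointing both projections at the same parameter tensor; the sketch below uses hypothetical variable names that mirror the `retriever.q` / `memory.q` naming, not the repository's actual modules.

```python
# Weight-tying sketch: the R-mode retriever's query projection reuses the
# memory module's query weights, so no extra parameters are stored.
# Variable names are illustrative, not the repo's actual modules.
import torch.nn as nn

hidden_size = 2048
memory_q = nn.Linear(hidden_size, hidden_size, bias=False)      # memory.q (full QKV path, W-mode)
retriever_q = nn.Linear(hidden_size, hidden_size, bias=False)   # retriever.q (Q-only path, R-mode)

retriever_q.weight = memory_q.weight          # tie: both point at the same tensor
assert retriever_q.weight is memory_q.weight  # updating one updates the other
```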
## 📊 Model Statistics
| Component | Source Model | Layers | Parameters | Intermediate Size |
|---|---|---|---|---|
| Embedding & LM Head | Qwen3-4B-Thinking | - | ~310M (vocab: 151,936) | - |
| Memory Layers | TTT-Linear-1.3B | 24 | ~1.3B | 5,504 |
| Core Layers | Qwen3-4B-Thinking | 36 | ~4B | 9,728 |
| Total | Combined MAC | 60 | ~5.6B | Mixed |
- Hidden Size: 2048 (from TTT-Linear)
- Attention Heads: 32 (memory); 32 query / 8 KV heads with GQA (core)
- Context Length: up to 262K tokens (Qwen3's maximum)
- Precision: bfloat16
## 🏗️ Architecture Details
### Memory Module (TTT-Linear)

- Purpose: Dynamic context adaptation through test-time training
- Key Components:
  - Self-adaptation layers with learnable neural memory
  - Momentum-based learning rate gates
  - Q/K/V projections with RoPE
  - Mini-batch processing (chunk_size=16); see the simplified update sketch after this list
- Special Features:
  - Weight tying between the retriever and the full memory module
  - Shared Q/K projections with separate conv layers
  - Learnable token-wise learning rates
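To make the interplay of the neural memory, momentum gate, and token-wise learning rates concrete, here is a heavily simplified inner-loop step over one chunk of 16 tokens. Shapes and names are illustrative only; the actual TTT-Linear layer additionally applies learned projections, convolutions, RoPE, and normalization.

```python
# Heavily simplified TTT-style inner-loop update over one chunk of 16 tokens.
# Illustrative only: the real layer also applies learned Q/K/V projections,
# convolutions, RoPE, and normalization.
import torch

hidden, chunk = 2048, 16
W = torch.zeros(hidden, hidden)               # neural memory ("fast weights")
momentum = torch.zeros_like(W)

def ttt_update(W, momentum, k, v, token_lr, base_lr=1.0, beta=0.9):
    """One gradient-style step on the reconstruction loss ||k @ W - v||^2,
    gated by per-token learning rates and smoothed with momentum."""
    err = k @ W - v                            # [chunk, hidden]
    grad = k.t() @ (token_lr[:, None] * err)   # token-wise learning-rate gate
    momentum = beta * momentum + grad
    return W - base_lr * momentum, momentum

k = torch.randn(chunk, hidden) / hidden ** 0.5
v = torch.randn(chunk, hidden)
token_lr = torch.sigmoid(torch.randn(chunk))   # learnable in the real model; random here
W, momentum = ttt_update(W, momentum, k, v, token_lr)
```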
### Core Module (Qwen3)

- Purpose: High-capacity reasoning and generation
- Key Components:
  - Multi-head attention with Grouped Query Attention (GQA); see the sketch after this list
  - SwiGLU MLP activations
  - RMSNorm for layer normalization
  - RoPE with theta = 5,000,000 for long context
- Special Features:
  - 8 KV heads for efficient inference
  - No attention bias
  - Sliding window attention support
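The 32-query-head / 8-KV-head layout can be illustrated as follows; the head dimension here is chosen for illustration only and is not taken from the checkpoint.

```python
# GQA sketch: 32 query heads share 8 key/value heads, so each KV head
# serves a group of 4 query heads. Head dimension is illustrative only.
import torch

batch, seq, n_q_heads, n_kv_heads, head_dim = 1, 8, 32, 8, 64
group = n_q_heads // n_kv_heads                 # 4 query heads per KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand the KV heads to match the query heads, then run standard attention.
k = k.repeat_interleave(group, dim=1)           # [1, 32, seq, head_dim]
v = v.repeat_interleave(group, dim=1)
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v
print(out.shape)                                # torch.Size([1, 32, 8, 64])
```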
### Fixed Persistent Memory

- Size: 64 tokens × 2048 dimensions
- Purpose: Store global context/knowledge across segments
- Initialization: Zeros (trainable parameter); see the sketch below
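In code, the fixed persistent memory amounts to a single trainable parameter that is expanded to the batch size and prepended to each segment's context; the attribute names below are a sketch, not the repository's exact ones.

```python
# Fixed persistent memory sketch: a trainable [64, 2048] parameter,
# initialized to zeros, prepended to every segment's context.
import torch
import torch.nn as nn

fixed_memory = nn.Parameter(torch.zeros(64, 2048))

batch_size, seg_len, hidden = 2, 16, 2048
segment = torch.randn(batch_size, seg_len, hidden)
context = torch.cat([fixed_memory.expand(batch_size, -1, -1), segment], dim=1)
print(context.shape)   # torch.Size([2, 80, 2048])
```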
## 🚀 Usage

### Installation

```bash
# Install dependencies
pip install torch transformers safetensors accelerate
```
### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./TTTPilot-Q-5B-Thinking-MAC",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./TTTPilot-Q-5B-Thinking-MAC")

# Generate text
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Advanced: Segment Processing

```python
# MAC processes inputs in segments for memory efficiency.
# Segment size is controlled by mini_batch_size (default: 16).
# For long sequences, the model is intended to automatically:
#   1. Chunk the input into mini-batches
#   2. Process each chunk through R-mode → Core → W-mode
#   3. Accumulate context updates across chunks
long_text = "..." * 1000  # Very long input
inputs = tokenizer(long_text, return_tensors="pt", truncation=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
```
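To make the segment boundaries concrete, the snippet below shows roughly how a tokenized input splits into chunks of `mini_batch_size` tokens. It is purely illustrative and reuses `inputs` from the snippet above; the model itself is meant to do this chunking internally.

```python
# Purely illustrative: how a tokenized input splits into mini-batches of 16
# tokens. Reuses `inputs` from the snippet above.
mini_batch_size = 16
input_ids = inputs["input_ids"]
chunks = [input_ids[:, i:i + mini_batch_size]
          for i in range(0, input_ids.shape[1], mini_batch_size)]
print(f"{input_ids.shape[1]} tokens -> {len(chunks)} chunks of <= {mini_batch_size} tokens")
```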
## 🔧 Weight Conversion

To recreate this model from the source checkpoints:

```bash
python convert_weights.py
```

The script (sketched below) does the following:
- Loads TTT-Linear-1.3B and Qwen3-4B-Thinking
- Maps TTT weights → memory layers (preserving exact key names)
- Maps Qwen3 weights → core layers (preserving exact key names)
- Uses Qwen3's embedding & lm_head (for vocab compatibility)
- Copies tokenizer files from Qwen3
- Saves combined model in HuggingFace format
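The exact mapping lives in convert_weights.py. The core remapping idea is roughly the following; the `memory.` / `core.` key prefixes are hypothetical and shown only to illustrate the approach, and loading TTT-Linear this way assumes the checkpoint exposes a causal-LM mapping via trust_remote_code.

```python
# Simplified sketch of the remapping step. The "memory." / "core." prefixes
# are hypothetical; see convert_weights.py for the actual key names.
from safetensors.torch import save_file
from transformers import AutoModelForCausalLM

# Loads both checkpoints on CPU; needs enough RAM for the two models.
ttt = AutoModelForCausalLM.from_pretrained(
    "test-time-training/TTT-Linear-1.3B-Base-Pile-8k", trust_remote_code=True)
qwen = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")

combined = {}
for name, tensor in ttt.state_dict().items():
    combined[f"memory.{name}"] = tensor.clone()    # TTT weights -> memory layers
for name, tensor in qwen.state_dict().items():
    combined[f"core.{name}"] = tensor.clone()      # Qwen3 weights -> core layers (incl. embeddings & lm_head)

save_file(combined, "model.safetensors")           # clone() avoids shared-storage errors in safetensors
```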
## ⚙️ Configuration

Key hyperparameters in config.json:

```jsonc
{
  "model_type": "tttpilot_mac",
  "vocab_size": 151936,              // Qwen3
  "hidden_size": 2048,               // TTT-Linear
  "num_memory_layers": 24,           // TTT-Linear
  "num_core_layers": 36,             // Qwen3
  "memory_intermediate_size": 5504,  // TTT MLP
  "core_intermediate_size": 9728,    // Qwen3 MLP
  "num_attention_heads": 32,
  "num_key_value_heads": 8,          // GQA in core layers
  "mini_batch_size": 16,             // TTT chunk size
  "ttt_base_lr": 1.0,                // TTT learning rate
  "fixed_memory_size": 64,           // Persistent memory tokens
  "rope_theta": 5000000,             // Long-context RoPE
  "max_position_embeddings": 262144  // Max sequence length
}
```
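Once the model directory is in place, the configuration can be inspected with AutoConfig (assuming the custom config class is registered for trust_remote_code loading, as in the Quick Start above); the attribute names below follow the config.json keys shown above.

```python
# Inspect the configuration; attribute names follow the config.json keys above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("./TTTPilot-Q-5B-Thinking-MAC", trust_remote_code=True)
print(config.num_memory_layers, config.num_core_layers)   # 24, 36
print(config.mini_batch_size, config.fixed_memory_size)   # 16, 64
```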
## 🧪 Pilot Experiment Status

### What Works

- ✅ Model architecture defined
- ✅ Weight conversion pipeline
- ✅ Configuration files generated
- ✅ Tokenizer compatibility (Qwen3)
- ✅ Basic forward pass structure
### What's Experimental

- ⚠️ MAC segment processing: simplified in this pilot; needs full TTT integration
- ⚠️ Retriever implementation: placeholder; requires a Q-only inference mode
- ⚠️ Weight tying: defined but not yet verified in practice
- ⚠️ Memory update logic: the full TTT adaptation step still needs to be integrated
### Known Limitations

- Memory layers currently use placeholder identity functions (full TTT code still needed)
- No actual segment-based MAC flow yet (currently processes like a standard transformer)
- TTT cache and Qwen3 KV cache are not yet properly integrated
- No training/fine-tuning tested
- Generation quality not benchmarked
## 📚 Research Context

This pilot implements concepts from:

Test-Time Training (TTT): Self-supervised adaptation during inference
- Paper: Learning to (Learn at Test Time): RNNs with Expressive Hidden States
- Code: TTT-Linear

Titans Architecture: Modular memory-augmented patterns
- Inspiration: Memory-as-X design patterns (MAC, MAG, MAL, etc.)
- Idea: Separate stateful memory from stateless reasoning

Grouped Query Attention (GQA): Efficient multi-head attention
- Used in: Qwen3 and other modern LLMs
- Benefit: Faster inference with minimal quality loss
## 📖 Citation

If you use this pilot or build upon it:

```bibtex
@misc{tttpilot-mac-2026,
  title={TTTPilot-MAC: A Pilot Implementation of Memory-Augmented-Core Architecture},
  author={Your Name},
  year={2026},
  note={Pilot experiment combining TTT-Linear and Qwen3},
  howpublished={\url{https://github.com/...}}
}
```
Source Models:
- TTT-Linear: test-time-training/TTT-Linear-1.3B-Base-Pile-8k
- Qwen3: Qwen/Qwen3-4B-Thinking-2507
## 📜 License
This project combines:
- TTT-Linear (MIT License)
- Qwen3 (Apache 2.0 License)
Final license: Apache 2.0 (compatible with both)
See LICENSE for details.
## 🤝 Contributing
This is a pilot experiment for research exploration. Contributions welcome:
- Full MAC segment processing implementation
- TTT-Linear integration (replace placeholders)
- Retriever Q-only mode
- Training scripts
- Benchmark evaluations
- Documentation improvements
## 🐛 Issues & Feedback
Found a bug or have suggestions? Open an issue!
Important Notes:
- This is NOT a production-ready model
- Use for research/experimentation only
- Performance not guaranteed
- May require significant compute resources (5.6B parameters)
## 🙏 Acknowledgments

- TTT Team for the test-time training paradigm
- Qwen Team for the Qwen3-Thinking model
- The Titans architecture for inspiring the modular memory design patterns
Status: 🚧 Experimental Pilot - Use with caution!