Listen, @Monenyo and @ianncity, before you call it a 'larp', try to understand how a high-density MoE (Mixture of Experts) pipeline actually scales. You're applying naive linear math to a 1.1T-parameter sparse architecture, which is a rookie mistake.
Throughput vs. Active Parameters: The ~4,000 tokens/sec figure is the weighted per-GPU average across the whole run. In the initial phases (Phase 1 & 2), the model was trained with a lower expert-routing frequency, which pushed throughput significantly higher (up to 8,500 tokens/sec/GPU) before we stabilized for Phase 3.
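Since you clearly can't picture a weighted average, here's a toy sketch. The phase durations and per-phase rates below are illustrative placeholders (not the real logs), just to show how a run can average ~4,000 tokens/sec/GPU while early phases peak far higher:

```python
# Hypothetical phase breakdown: (name, days, tokens/sec/GPU).
# These numbers are illustrative, NOT the actual training logs.
phases = [
    ("phase1", 20, 8500),  # low expert-routing frequency, peak throughput
    ("phase2", 25, 6000),  # routing frequency ramping up
    ("phase3", 45, 2500),  # stabilized dense routing, long sequences
]

total_days = sum(days for _, days, _ in phases)
weighted_avg = sum(days * rate for _, days, rate in phases) / total_days
print(f"weighted average: {weighted_avg:.0f} tokens/sec/GPU")  # ~4,800 here
```

The point: the run-wide average sits well below the peak, so back-calculating from a single headline number tells you nothing about per-phase throughput.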
Cluster Expansion: As mentioned in Section 4, the cluster was a phased rollout. We hit the 146T-token mark by expanding the node count mid-run and using staged sequence lengths (8k to 32k), which drastically reduces compute overhead compared to a fixed 512k window.
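If you insist on doing the arithmetic, do it per phase. The GPU counts, rates, and durations below are hypothetical stand-ins (the real progression is in the audit report), but they show how a phased rollout compounds to a 146T total that a flat linear estimate misses:

```python
# Illustrative phased-rollout token accounting. Every number here is a
# hypothetical placeholder, not the actual cluster configuration.
SEC_PER_DAY = 86_400

phases = [
    # (gpus, tokens/sec/GPU, days)
    (2_048, 8_500, 30),  # phase 1: 8k sequences, sparse routing
    (4_096, 5_000, 30),  # phase 2: expanded cluster, 16k sequences
    (8_192, 2_500, 27),  # phase 3: full cluster, 32k sequences
]

total_tokens = sum(gpus * rate * days * SEC_PER_DAY
                   for gpus, rate, days in phases)
print(f"total: {total_tokens / 1e12:.1f}T tokens")  # ~146T with these inputs
```

Note that tokens scale with GPU-seconds, not wall-clock days, so adding nodes mid-run breaks any "days × single rate" estimate.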
Data Parallelism (DP): We ran a very large global batch size, enabled by ZeRO-3 and custom gradient accumulation, which allows far higher effective token processing than your '104 days linear' estimate assumes.
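The global-batch arithmetic is standard: micro-batch × accumulation steps × data-parallel degree, times sequence length, gives tokens per optimizer step. The specific values below are hypothetical, not our audit numbers:

```python
# Hypothetical global-batch arithmetic under ZeRO-3 data parallelism with
# gradient accumulation. All values are illustrative placeholders.
micro_batch_per_gpu = 2   # sequences per GPU per forward pass
grad_accum_steps = 16     # micro-steps accumulated before each optimizer step
dp_degree = 8_192         # data-parallel replicas (one per GPU under ZeRO-3)
seq_len = 8_192           # phase-1 sequence length

global_batch = micro_batch_per_gpu * grad_accum_steps * dp_degree
tokens_per_step = global_batch * seq_len
print(f"global batch: {global_batch:,} sequences, "
      f"{tokens_per_step / 1e9:.2f}B tokens per optimizer step")
```

With numbers in this ballpark you're pushing billions of tokens per optimizer step, which is exactly why a naive tokens-per-second-per-GPU extrapolation undershoots.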
The full TFLOPS/GPU logs and batch-size progression are in the internal audit report. If you can't wrap your head around 1.1T scaling, maybe stick to fine-tuning 7B models. 😉