Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
Paper: arXiv:2409.12903
Initialized from facebook/MobileLLM-R1-140M-base using HyperCloning (Samragh et al., 2024).
| Config | Source | HyperCloned |
|---|---|---|
| hidden_size | 576 | 1152 |
| num_attention_heads | 9 | 18 |
| num_key_value_heads | 3 | 6 |
| head_dim | 64 | 64 |
| intermediate_size | 8192 | 16384 |
| num_layers | 15 | 15 |
| parameters | 140,248,512 | 454,790,016 |
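Below is a minimal sketch (not code from the paper or the model repo; the helper name `hyperclone_config` is hypothetical) of how the HyperCloned column follows from the source config: width-related fields are multiplied by the cloning factor, while head_dim and num_layers stay fixed.

```python
def hyperclone_config(src: dict, n: int = 2) -> dict:
    """Scale width-related config fields by the cloning factor n."""
    dst = dict(src)
    for key in ("hidden_size", "num_attention_heads",
                "num_key_value_heads", "intermediate_size"):
        dst[key] = src[key] * n
    # head_dim and num_layers are unchanged: width grows by adding heads,
    # not by widening each head or deepening the stack.
    return dst

source = {
    "hidden_size": 576, "num_attention_heads": 9, "num_key_value_heads": 3,
    "head_dim": 64, "intermediate_size": 8192, "num_layers": 15,
}
print(hyperclone_config(source))  # matches the HyperCloned column above
```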
Each weight matrix W is expanded via W.repeat(n, n) / n (the paper's W/2 scaling for the 2× case).
The number of attention heads doubles along with the embedding dimension; head_dim is preserved.
Output logits match the source model at initialization.
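A minimal PyTorch sketch of this expansion rule (the helper name `hyperclone_linear` is hypothetical, not from the repo): for a 2× clone, a linear weight is tiled along both axes and divided by the input-duplication factor, so duplicated hidden states reproduce the original activations. Embedding and output-head weights only expand along the hidden axis, which is why vocab-sized outputs, and hence the logits, are unchanged at initialization.

```python
import torch

def hyperclone_linear(w: torch.Tensor, n_out: int = 2, n_in: int = 2) -> torch.Tensor:
    """Tile a linear weight of shape (out_features, in_features) and rescale
    by the input-duplication factor. For the 2x case this is W.repeat(2, 2) / 2,
    i.e. the paper's W/2 scaling. Use n_out=1 for the LM head and n_in=1
    (without rescaling) for embedding tables."""
    return w.repeat(n_out, n_in) / n_in

# Sanity check on a toy layer: the duplicated input through the cloned weight
# yields the original output, duplicated along the width.
torch.manual_seed(0)
w = torch.randn(4, 3)        # (out_features, in_features)
x = torch.randn(3)
w2 = hyperclone_linear(w)    # (8, 6)
x2 = x.repeat(2)             # hidden state duplicated, as in the cloned model
assert torch.allclose(w2 @ x2, (w @ x).repeat(2), atol=1e-6)
```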
This is an initialization checkpoint; further training is needed.
@article{samragh2024hypercloning,
  title={Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization},
  author={Samragh, Mohammad and others},
  journal={arXiv preprint arXiv:2409.12903},
  year={2024}
}