MLP Design Choice

#47
by scissorstail - opened

Why does this model series, unlike other models, always split the projection output into two chunks to obtain the up_states and gate in the MLP? I’m not sure this is the right place to ask, but I didn’t have anywhere else to turn.

https://github.com/huggingface/transformers/blob/40a493c7ed4f19f08eadb0639cf26d49bfa5e180/src/transformers/models/phi4_multimodal/modeling_phi4_multimodal.py#L763

Microsoft org

Hello @scissorstail !

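We took advantage of some performance tricks to increase throughput / MFU (model FLOPs utilization) during pre-training, e.g., using a single matrix to compute both the up and gate states. The output of that single projection is then split into two chunks in the forward pass.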
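To make the trick concrete, here is a minimal sketch (not the actual Phi implementation) comparing a gated MLP with separate gate and up projections against one that uses a single fused matrix followed by `chunk`. The class names `SplitMLP` and `FusedMLP`, the helper `fuse`, the SiLU activation, the absence of biases, and the toy sizes are all assumptions chosen for illustration. Stacking the two weight matrices along the output dimension should give the same result with one matmul instead of two.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SplitMLP(nn.Module):
    """Gated MLP with separate gate and up projections (two matmuls)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class FusedMLP(nn.Module):
    """Gated MLP with one fused gate/up projection (one matmul, then chunk)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # One matrix produces both halves: [gate | up] along the last dim.
        self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
        return self.down_proj(F.silu(gate) * up)


def fuse(split: SplitMLP) -> FusedMLP:
    """Build a FusedMLP whose weights reproduce the given SplitMLP."""
    fused = FusedMLP(split.gate_proj.in_features, split.gate_proj.out_features)
    with torch.no_grad():
        # nn.Linear stores weights as (out_features, in_features), so
        # concatenating along dim 0 stacks gate rows first, then up rows.
        stacked = torch.cat([split.gate_proj.weight, split.up_proj.weight], dim=0)
        fused.gate_up_proj.weight.copy_(stacked)
        fused.down_proj.weight.copy_(split.down_proj.weight)
    return fused


torch.manual_seed(0)
split = SplitMLP(hidden_size=8, intermediate_size=16)
fused = fuse(split)
x = torch.randn(2, 4, 8)  # (batch, sequence, hidden)
max_diff = (split(x) - fused(x)).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The two modules are mathematically equivalent; the fused version launches one larger matmul rather than two smaller ones, which is the kind of change that can help hardware utilization. A checkpoint stored in fused form can be converted to the split layout by slicing `gate_up_proj.weight` in half along dim 0.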

gugarosa changed discussion status to closed
gugarosa changed discussion status to open