Training Errors in the Original Implementation and Their Fixes

by v2ray - opened Mar 29, 2024

https://huggingface.co/v2ray/dbrx-base-fixed
The original DBRX implementation has a few bugs that only affect training, which I fixed in my re-upload.
I re-uploaded because the changes require the weight files to be converted, so anyone who wants to use the fix needs to re-download the entire set of weights!

The issues and how I fixed them:

Error when using gradient checkpointing - Fixed by passing positional arguments instead of keyword arguments, because _gradient_checkpointing_func doesn't support kwargs.
VRAM usage spikes and CUDA runs out of memory when backpropagating through the MLP layer - Fixed by storing each expert's weights in a separate tensor instead of using a single tensor for all the experts. I don't know why this fixes it, but maybe torch is trying to compute gradients for every expert at once, which shouldn't happen since it's a MoE model.
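
For anyone who wants to see both behaviours in isolation, here is a minimal sketch in plain PyTorch. It is not the code from the fixed repo. `TinyBlock`, `gradient_checkpointing_func`, the tensor sizes, and the hard-coded `active` expert are all made up for illustration, and the sketch assumes the checkpoint function is `torch.utils.checkpoint.checkpoint` in reentrant mode, which is the mode that rejects keyword arguments. The second half only shows that a fused expert tensor gets a gradient spanning every expert while separate tensors get gradients only for the experts actually used; it does not prove that this is the cause of the OOM.

```python
import functools

import torch
from torch.utils.checkpoint import checkpoint

# --- Fix 1: positional arguments with gradient checkpointing ---------------
# Stand-in for `_gradient_checkpointing_func` (assumption: reentrant mode).
gradient_checkpointing_func = functools.partial(checkpoint, use_reentrant=True)


class TinyBlock(torch.nn.Module):
    """Hypothetical layer, only here to have something to checkpoint."""

    def __init__(self, dim=8):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, hidden_states, scale=1.0):
        return self.proj(hidden_states) * scale


torch.manual_seed(0)
block = TinyBlock()
x = torch.randn(2, 8, requires_grad=True)

# Keyword arguments are rejected by reentrant checkpointing.
try:
    gradient_checkpointing_func(block.__call__, x, scale=0.5)
    kwargs_error = None
except ValueError as err:
    kwargs_error = str(err)

# The same call with positional arguments works and backpropagates.
out = gradient_checkpointing_func(block.__call__, x, 0.5)
out.sum().backward()

# --- Fix 2: one tensor per expert instead of one fused tensor --------------
num_experts, ffn, hidden = 4, 16, 8

# Fused layout: a single parameter holds every expert's weights.
fused = torch.nn.Parameter(torch.randn(num_experts * ffn, hidden))
# Split layout: one parameter per expert, copied from the fused tensor.
split = torch.nn.ParameterList(
    [torch.nn.Parameter(w.clone())
     for w in fused.detach().view(num_experts, ffn, hidden)]
)

tokens = torch.randn(3, hidden)
active = 1  # pretend the router sent every token to expert 1

(tokens @ fused.view(num_experts, ffn, hidden)[active].T).sum().backward()
(tokens @ split[active].T).sum().backward()

# The fused parameter gets a gradient covering all experts (zeros for unused ones).
fused_grad_shape = tuple(fused.grad.shape)
# With split parameters, unused experts get no gradient tensor at all.
split_has_grad = [p.grad is not None for p in split]
```

Splitting the parameters also changes the parameter names and shapes in the state dict, which is consistent with the weight files needing to be converted.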

Undi95 (Owner) - Mar 29, 2024

Hey thanks for this.
I will not fix this on my side since you have already done it, and I will try to keep the repo 1:1 with the original.
Nice work tho!
