Model Card
Overview
This repository documents the training methodologies and precision strategies of two separate large language models:
Mistral LLM Training
- Fully trained in native FP32 precision
- Optimization performed using standard AdamW
- No Adam8bit, quantized optimizer states, or reduced-precision optimizer approximations were used during training
- Intended to preserve numerical stability and high-fidelity gradient accumulation throughout all training phases
DIT Ernie Model
- Uses a Monte Carlo estimation approach to approximate FP32 behavior
Training Details
Mistral LLM
Precision
- Full FP32 training
- FP32 activations
- FP32 optimizer states
- FP32 gradients
Optimizer
- AdamW
- Weight decay enabled
- No 8-bit optimizer compression
- No low-rank optimizer approximation
Notes
The Mistral configuration prioritizes:
- numerical consistency
- deterministic convergence behavior
- stable long-context optimization
- reduced quantization-induced gradient noise
This setup is computationally expensive but provides high-fidelity optimization dynamics during pretraining and finetuning.
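As a concrete illustration, below is a minimal sketch of such a configuration, assuming PyTorch; the model, learning rate, and weight decay values are placeholders, not the actual Mistral training hyperparameters.

```python
# Minimal sketch: full-FP32 training step with standard AdamW (PyTorch assumed).
# Model, batch shape, lr, and weight decay are illustrative placeholders.
import torch
from torch import nn

torch.set_default_dtype(torch.float32)   # keep parameters and activations in FP32

model = nn.TransformerEncoderLayer(d_model=512, nhead=8)   # placeholder model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=0.01,   # weight decay enabled
    # no 8-bit or low-rank optimizer variants: optimizer states remain FP32
)

x = torch.randn(16, 32, 512)        # dummy batch (seq, batch, d_model)
loss = model(x).pow(2).mean()       # placeholder loss
loss.backward()                     # FP32 gradients
optimizer.step()                    # FP32 optimizer states (exp_avg, exp_avg_sq)
optimizer.zero_grad()
```

No autocast context or gradient scaler appears here: the entire step runs in native FP32, which is the point of this configuration.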
DIT Ernie
Precision Strategy
The DIT Ernie architecture utilizes:
- Monte Carlo estimation techniques
- probabilistic FP32 approximation
- stochastic numerical reconstruction
Rather than maintaining strict FP32 execution across the entire training stack, the model estimates FP32-equivalent statistical behavior through sampling-based computation.
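The exact sampling mechanism is not documented here. As one hedged interpretation only, the sketch below uses stochastic rounding to a coarse grid: the expected value of each rounded sample equals the original FP32 value, so a Monte Carlo average over samples recovers FP32-equivalent statistics (PyTorch assumed; the grid step and sample count are illustrative).

```python
# Illustrative sketch only: one possible sampling-based FP32 approximation,
# not the documented DIT Ernie mechanism. Values are stochastically rounded
# to a coarse grid; the Monte Carlo average converges to the FP32 input.
import torch

def stochastic_round(x: torch.Tensor, step: float) -> torch.Tensor:
    """Round x to multiples of `step` such that E[output] == x."""
    scaled = x / step
    floor = torch.floor(scaled)
    prob_up = scaled - floor                     # probability of rounding up
    return (floor + (torch.rand_like(x) < prob_up).float()) * step

x = torch.randn(4, dtype=torch.float32)          # reference FP32 values
samples = torch.stack([stochastic_round(x, step=1.0 / 128) for _ in range(10_000)])

print(x)                    # native FP32 values
print(samples.mean(dim=0))  # Monte Carlo estimate approaches the FP32 values
```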
Goals
- reduce memory bandwidth requirements
- improve throughput efficiency
- retain approximate FP32 convergence characteristics
- balance numerical quality with hardware scalability
Notes
This methodology may introduce:
- stochastic variance between runs
- approximation noise
- non-deterministic optimization characteristics
However, it can significantly reduce training cost relative to native FP32 execution.
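One way to quantify that run-to-run variance is sketched below, again assuming PyTorch and the illustrative stochastic-rounding step from the previous example; the matrix sizes, grid step, and number of runs are placeholders.

```python
# Illustrative sketch only: measuring run-to-run variance of a sampling-based
# low-precision matmul against a native FP32 reference.
import torch

def stochastic_round(x: torch.Tensor, step: float = 1.0 / 128) -> torch.Tensor:
    scaled = x / step
    floor = torch.floor(scaled)
    return (floor + (torch.rand_like(x) < scaled - floor).float()) * step

def sampled_matmul(a: torch.Tensor, b: torch.Tensor, seed: int) -> torch.Tensor:
    """Matmul on stochastically rounded inputs, seeded per run."""
    torch.manual_seed(seed)
    return stochastic_round(a) @ stochastic_round(b)

a, b = torch.randn(64, 64), torch.randn(64, 64)
exact = a @ b                                        # native FP32 reference
runs = torch.stack([sampled_matmul(a, b, seed=s) for s in range(32)])

print((runs.mean(0) - exact).abs().max())            # bias of the Monte Carlo mean
print(runs.std(0).mean())                            # run-to-run spread per element
```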
Intended Use
This repository is intended for:
- research documentation
- training methodology comparison
- optimizer precision analysis
- numerical stability benchmarking
- transformer architecture experimentation
Limitations
Results can vary depending on:
- sampling strategy
- hardware backend
- distributed training topology
- random seed initialization
License
Apache License 2.0