Model Card

Overview

This repository documents the training methodologies and precision strategies of two separate large language models:


Mistral LLM Training

  • Fully trained in native FP32 precision
  • Optimization performed using standard AdamW
  • No Adam8bit, quantized optimizer states, or reduced-precision optimizer approximations were used during training
  • Intended to preserve numerical stability and high-fidelity gradient accumulation throughout all training phases

DIT Ernie Model

  • Uses a Monte Carlo estimation approach to approximate FP32 behavior

Training Details

Mistral LLM

Precision

  • Full FP32 training
  • FP32 activations
  • FP32 optimizer states
  • FP32 gradients

Optimizer

  • AdamW
  • Weight decay enabled
  • No 8-bit optimizer compression
  • No low-rank optimizer approximation
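For reference, the configuration above corresponds to a setup like the following minimal PyTorch sketch. The module, hyperparameters, and loss are illustrative placeholders rather than the actual Mistral training code; the point is that weights, activations, gradients, and optimizer states all stay in torch.float32.

```python
import torch
from torch import nn

# Stand-in module; PyTorch parameters default to FP32.
model = nn.Linear(4096, 4096)

# Standard AdamW with weight decay. Its moment buffers (exp_avg,
# exp_avg_sq) are allocated in the parameters' dtype, i.e. FP32 here.
# No bitsandbytes Adam8bit or other quantized optimizer is involved.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

x = torch.randn(8, 4096)        # FP32 activations
loss = model(x).pow(2).mean()   # placeholder loss
loss.backward()                 # FP32 gradients
optimizer.step()
optimizer.zero_grad()
```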

Notes

The Mistral configuration prioritizes:

  • numerical consistency
  • deterministic convergence behavior
  • stable long-context optimization
  • reduced quantization-induced gradient noise

This setup is computationally expensive but provides high-fidelity optimization dynamics during pretraining and fine-tuning.
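The card does not state how the deterministic convergence behavior is enforced, but in PyTorch it typically comes down to settings along these lines (illustrative, not taken from the actual training code):

```python
import torch

torch.manual_seed(0)                      # fixed RNG seed
torch.use_deterministic_algorithms(True)  # raise on nondeterministic kernels
torch.backends.cudnn.benchmark = False    # disable autotuning, which can vary across runs
# On CUDA, deterministic cuBLAS additionally requires the
# CUBLAS_WORKSPACE_CONFIG environment variable (e.g. ":4096:8").
```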


DIT Ernie

Precision Strategy

The DIT Ernie training stack uses:

  • Monte Carlo estimation techniques
  • probabilistic FP32 approximation
  • stochastic numerical reconstruction

Rather than maintaining strict FP32 execution across the entire training stack, the model estimates FP32-equivalent statistical behavior through sampling-based computation.
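The card does not name the exact sampling mechanism. One standard sampling-based way to approximate FP32 statistics in a reduced-precision format, shown here purely as an illustrative sketch, is stochastic rounding: each value is rounded up or down at random with probabilities chosen so that the rounded result is an unbiased estimate of the original FP32 value.

```python
import torch

def stochastic_round_bf16(x: torch.Tensor) -> torch.Tensor:
    """Stochastically round FP32 to bf16 so that E[result] equals x.

    bf16 keeps the top 16 bits of an FP32 value, so plain conversion
    rounds deterministically. Adding uniform noise to the discarded
    lower 16 bits before truncation randomizes the rounding direction
    with the right probabilities, making each rounded value an
    unbiased Monte Carlo estimate of the FP32 original.
    """
    bits = x.contiguous().view(torch.int32)
    noise = torch.randint_like(bits, low=0, high=1 << 16)
    truncated = (bits + noise) & ~0xFFFF  # zero the low mantissa bits
    return truncated.view(torch.float32).to(torch.bfloat16)

# Averaging many rounded samples recovers the FP32 values, while any
# single sample carries the stochastic variance noted below.
x = torch.randn(100_000)
samples = torch.stack([stochastic_round_bf16(x).float() for _ in range(64)])
print((samples.mean(dim=0) - x).abs().mean())  # shrinks as the sample count grows
```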

Goals

  • reduce memory bandwidth requirements
  • improve throughput efficiency
  • retain approximate FP32 convergence characteristics
  • balance numerical quality with hardware scalability

Notes

This methodology may introduce:

  • stochastic variance between runs
  • approximation noise
  • non-deterministic optimization characteristics

However, it can significantly reduce training cost relative to native FP32 execution.


Intended Use

This repository is intended for:

  • research documentation
  • training methodology comparison
  • optimizer precision analysis
  • numerical stability benchmarking
  • transformer architecture experimentation

Limitations

Results can vary depending on:

  • sampling strategy
  • hardware backend
  • distributed training topology
  • random seed initialization

License

Apache License 2.0
