Consolidate all training logging through MetricsLogger b103659 thomas-schweich committed on about 13 hours ago
Log patience counter, best val loss/step in val records a050f72 thomas-schweich committed on about 13 hours ago
Per-model early stopping: freeze converged variants individually 190085d thomas-schweich committed on about 14 hours ago
Push metrics to HF at eval intervals, add dashboard HF sync 86ec60c thomas-schweich committed on about 15 hours ago
Add early stopping patience to multi-model training 07c93ac thomas-schweich committed on about 15 hours ago
Remove .item() CUDA sync from hot path, batch size 512, run slugs fc9d7f7 thomas-schweich committed on about 15 hours ago
Add post-training evals, /dev/shm checkpoints, async HF push, and _orig_mod fix 87b2fa6 thomas-schweich committed on about 15 hours ago
Safetensors migration, checkpoint integrity, and multi-model training. (#1) 230508d unverified thomas-schweich committed on about 17 hours ago