tinyllms/aime-1983-2023-trajectories
Viewer • Updated • 1.84k • 55
Fine-tuned from Qwen/Qwen2.5-7B-Instruct using QLoRA (4-bit NF4 quantization + LoRA adapters, merged before upload).
This is the SFT stage of a leave-one-out (LOO) experiment: the model is trained on Game24 and AIME trajectories, deliberately excluding domain knowledge (GPQA) data. The held-out domain is later used to measure cross-domain transfer.
messages to prompt/completion format before trainingTrained on two datasets (domain knowledge held out):
| Dataset | Domain |
|---|---|
tinyllms/game24-trajectories |
Game of 24 — arithmetic reasoning |
tinyllms/aime-1983-2023-trajectories |
AIME — competition math |
Examples exceeding max_seq_len are filtered out. A 10% holdout is used for evaluation (eval runs every 10 steps).
| Domain | Role |
|---|---|
| Game24 | Train |
| AIME | Train |
| Domain Knowledge (GPQA) | Held out |
The GRPO stage follows using tinyllms/qwen2.5-7b-instruct-grpo-loo-domain-knowledge, trained on the same two datasets. Transfer is measured by evaluating on GPQA Diamond.
pocket-sheet-sft)