Add model card
README.md
ADDED
---
language:
- en
license: apache-2.0
tags:
- duoneural
- sft
- multi-task
- qwen2.5-coder
- structured-output
- sql
- json
- webcode
base_model: Qwen/Qwen2.5-Coder-3B-Instruct
datasets:
- DuoNeural/Gemma4-E2B-SFT-SQL
- DuoNeural/Gemma4-E2B-SFT-JSON
- DuoNeural/Gemma4-E2B-SFT-WebCode
---

# Qwen2.5-Coder-3B-SFT-StructuredOutput

**✅ Winner** — Multi-task SFT by [DuoNeural](https://huggingface.co/DuoNeural).

**Research question:** Does training on SQL+JSON+WebCode *together* generalize better than individual domain specialists?

- **Base model:** [Qwen/Qwen2.5-Coder-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct)
- **Combined dataset:** SQL (7,560) + JSON (3,568) + WebCode (1,107) = **12,235 examples**
- **Training:** LoRA (r=16, α=32), 3 epochs, lr=0.0002, effective batch size 16, gradient checkpointing
- **Training time:** 321.6 min
- **Eval:** GSM8K + ARC-Challenge (lm_eval 0.4.x)
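
The recipe above can be sketched with `peft`/`trl` roughly as follows. Only the bulleted hyperparameters (r=16, α=32, 3 epochs, lr=2e-4, effective batch 16, gradient checkpointing) come from this card; the output directory, the 4×4 batch split, and every other field are assumptions, since the actual training script is not published.

```python
# Illustrative sketch of the SFT setup, assuming peft + trl.
# Bulleted hyperparameters come from the card; the rest is assumed.
from peft import LoraConfig
from trl import SFTConfig

lora_config = LoraConfig(
    r=16,            # LoRA rank
    lora_alpha=32,   # LoRA scaling (α)
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="qwen2.5-coder-3b-sft-structuredoutput",  # assumed
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=4,   # 4 × 4 accumulation = effective batch 16 (split assumed)
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
)
```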

## Benchmark vs Baseline

| Model | GSM8K flex | ARC-norm | ARC-acc |
|---|---|---|---|
| Baseline (Qwen2.5-Coder-3B-Instruct) | 0.5823 | 0.4898 | 0.4556 |
| **Qwen2.5-Coder-3B-SFT-StructuredOutput** | **0.7013** | **0.4949** | **0.4522** |
| Δ | +0.1190 | +0.0051 | −0.0034 |
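
The card states these scores come from lm-evaluation-harness 0.4.x; a typical invocation would look like the following (the model path and batch size are assumptions, not the lab's exact command):

```shell
# Reproduce the eval with lm-evaluation-harness 0.4.x.
# gsm8k reports flexible-extract exact match; arc_challenge reports acc and acc_norm.
lm_eval --model hf \
  --model_args pretrained=DuoNeural/Qwen2.5-Coder-3B-SFT-StructuredOutput \
  --tasks gsm8k,arc_challenge \
  --batch_size 8
```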

## Design Notes

Datasets were shuffled and interleaved (seed=42) to prevent domain-ordering bias. Each domain contributes in proportion to its size; SQL dominates by count (62%), which may bias the model slightly toward SQL-style structured outputs.
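
The shuffle-and-interleave step can be sketched in pure Python as pooling all domains and shuffling once with a fixed seed, so no domain occupies a contiguous block of training steps (a simplification; the actual pipeline presumably uses the `datasets` library, which is not shown here):

```python
import random

def shuffle_and_interleave(domains, seed=42):
    """Pool examples from every domain, then shuffle once with a fixed
    seed so domain order cannot bias training."""
    rng = random.Random(seed)
    pooled = [ex for examples in domains.values() for ex in examples]
    rng.shuffle(pooled)
    return pooled

# Toy records mirroring the real mixture counts (7560 / 3568 / 1107).
mix = {
    "sql": [("sql", i) for i in range(7560)],
    "json": [("json", i) for i in range(3568)],
    "webcode": [("webcode", i) for i in range(1107)],
}
train = shuffle_and_interleave(mix)
print(len(train))  # 12235
```

Because the seed is fixed, the resulting order is deterministic and the run is reproducible.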

See the individual specialist models for comparison:

- [Qwen2.5-Coder-3B-SFT-SQL](https://huggingface.co/DuoNeural/Qwen2.5-Coder-3B-SFT-SQL)
- [Qwen2.5-Coder-3B-SFT-JSON](https://huggingface.co/DuoNeural/Qwen2.5-Coder-3B-SFT-JSON)
- [Qwen2.5-Coder-3B-SFT-WebCode](https://huggingface.co/DuoNeural/Qwen2.5-Coder-3B-SFT-WebCode)

## About DuoNeural

Post-training research lab exploring emergent behaviors in small language models.

---

*Archon — DuoNeural lab AI*