# Data Scaling Strategy - Stack 2.9

## Target: 50K+ Training Examples

### Current State
- Synthetic examples: 213
- Code-comment pairs: 4,045
- Advanced patterns: 306
- **Total:** ~4,600 examples

### Target: 50,000+ examples

---

## Scaling Plan

### 1. Mine OpenClaw Session Logs (10K examples)

**Where to look:**
- `~/.openclaw/sessions/` - OpenClaw session logs
- `~/.claude/sessions/` - Claude Code sessions (if exists)
- `~/.anthropic/` - Anthropic Claude logs
- Any custom session history in project directories

**Format:** Likely JSON, JSONL, or Markdown
**What to extract:**
- Full conversations with tool use
- User prompts + assistant responses + tool calls + tool results
- Multi-turn dialogues
- Error recovery patterns
- Different tool combinations

**Expected yield:** 5,000-15,000 examples depending on usage history.
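A minimal sketch of what `scripts/mine_sessions.py` could look like, assuming the logs are JSONL with `role`/`content` fields per line; the actual OpenClaw schema is unverified, so field names would need adjusting:

```python
import json
from pathlib import Path

def mine_sessions(log_dir: str) -> list[dict]:
    """Extract prompt/response training examples from JSONL session logs.

    Assumes each line is a JSON object with 'role' and 'content' keys;
    adjust the field names to match the real OpenClaw log schema.
    """
    examples = []
    for path in Path(log_dir).expanduser().glob("**/*.jsonl"):
        turns = []
        for line in path.read_text(encoding="utf-8").splitlines():
            try:
                turns.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip corrupt lines rather than abort the file
        # Pair each user turn with the assistant turn that follows it
        for i, turn in enumerate(turns[:-1]):
            if turn.get("role") == "user" and turns[i + 1].get("role") == "assistant":
                examples.append({
                    "prompt": turn["content"],
                    "response": turns[i + 1]["content"],
                    "source": str(path),
                })
    return examples
```

Keeping the source path on each record makes the provenance tracking in `training-data/scaled/` trivial later.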

---

### 2. Synthetic Data with Template-Based Generation (20K examples)

**Approach:** Create hundreds of templates for each tool pattern and generate variations.

**For each of 37 tools:**
- Create 10-20 scenario templates (e.g., for FileReadTool: "Read file X", "Show me Y", "What's in Z?")
- Generate 200-500 variations by:
  * Changing file names, function names, variables
  * Varying parameter values
  * Changing phrasing (synonyms, active/passive, question/command)
  * Adding noise (typos, extra spaces, filler words)
  * Combining multiple tool calls in sequence

**Total:** 37 tools × 200-500 variations = 7,400-18,500 examples (near the 20K target at the high end)

**Tools with highest priority:**
- FileReadTool, FileWriteTool, GlobTool, GrepTool (common)
- BashTool, TaskCreateTool, Agent-related tools (complex workflows)
- MCPTool (extension patterns)
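The template-crossing approach above can be sketched as follows; the templates and fill-in values here are illustrative stand-ins, not the real tool catalog:

```python
import itertools
import random

# Hypothetical templates and fill-in values; the real generator would cover all 37 tools.
TEMPLATES = {
    "FileReadTool": ["Read {file}", "Show me {file}", "What's in {file}?"],
    "GrepTool": ["Find '{pattern}' in {file}", "Search {file} for '{pattern}'"],
}
FILES = ["src/index.ts", "README.md", "config.json"]
PATTERNS = ["TODO", "export default"]

def generate_variations(n_per_tool: int = 500, seed: int = 0) -> list[dict]:
    """Cross every template with every parameter combination, dedupe,
    then sample up to n_per_tool prompts per tool."""
    rng = random.Random(seed)
    examples = []
    for tool, templates in TEMPLATES.items():
        pool = [
            {"prompt": t.format(file=f, pattern=p), "tool": tool}
            for t, f, p in itertools.product(templates, FILES, PATTERNS)
        ]
        pool = list({e["prompt"]: e for e in pool}.values())  # drop duplicate prompts
        rng.shuffle(pool)
        examples.extend(pool[:n_per_tool])
    return examples
```

Phrasing noise (typos, filler words, synonym swaps) would be layered on top of this base pool rather than baked into the templates, so each transform stays independently testable.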

---

### 3. Public Dataset Integration (20K examples)

**Datasets to download (Hugging Face - free):**

#### a) OpenAssistant (oasst1)
- Conversations from OpenAssistant project
- Filter: coding-related threads
- Transform: Convert to tool-use format (synthesize tool calls from intent)
- Estimated: 5,000 examples

#### b) CodeAct
- Already has executable code actions
- Direct mapping to our tools
- Estimated: 10,000 examples

#### c) CodeContests
- Competition problems + solutions
- Format as code generation tasks
- Filter permissive licenses only
- Estimated: 3,000 examples

#### d) StarCoder Data (permissive subset)
- Various code tasks
- Estimated: 2,000 examples

**Total:** ~20,000 examples
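For oasst1, the filter-and-transform step could look like this. The `message_id`/`parent_id`/`role` field names are assumed to match the oasst1 release schema, and the keyword filter is a deliberately crude stand-in for a proper classifier:

```python
CODE_KEYWORDS = ("def ", "function", "import ", "class ", "```", "compile")

def looks_code_related(text: str) -> bool:
    """Crude keyword filter for coding threads; a classifier would do better."""
    return any(k in text.lower() for k in CODE_KEYWORDS)

def oasst_to_pairs(rows) -> list[dict]:
    """Join prompter messages to their assistant replies via parent_id,
    keeping only coding-related threads."""
    by_id = {r["message_id"]: r for r in rows}
    pairs = []
    for r in rows:
        parent = by_id.get(r.get("parent_id"))
        if r["role"] == "assistant" and parent and parent["role"] == "prompter":
            if looks_code_related(parent["text"]) or looks_code_related(r["text"]):
                pairs.append({"prompt": parent["text"], "response": r["text"],
                              "source": "oasst1"})
    return pairs

# Usage (downloads the dataset on first run):
# from datasets import load_dataset  # pip install datasets
# rows = load_dataset("OpenAssistant/oasst1", split="train")
# pairs = oasst_to_pairs(rows)
```

The remaining step of synthesizing tool calls from intent is the hard part and would sit downstream of this converter.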

---

### 4. Code-Pair Expansion (10K+ additional)

Already have 4,045 code-comment pairs from src/.

**Additional extraction:**
- Parse ALL TypeScript/JS files in src/ more thoroughly
- Include:
  * Function + JSDoc
  * Class + class comment
  * Interface + description
  * Error handlers
  * Complex algorithms with inline comments
  * Test cases + implementation
- Target: 10,000 additional pairs

**Method:**
- Enhanced parser that finds all code blocks with preceding comment
- Use local NLP (if needed) to generate comments for code without them
- Filter for meaningful pairs (>3 lines code, substantive comment)
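A possible core for the enhanced parser, assuming a regex pass over raw source text is acceptable for a first cut (a real TS AST parser would catch more declaration shapes):

```python
import re

# Matches a /** ... */ JSDoc block followed by a function/class/interface header.
JSDOC_PAIR = re.compile(
    r"/\*\*(?P<comment>.*?)\*/\s*"
    r"(?P<code>(?:export\s+)?(?:async\s+)?(?:function|class|interface)\s+\w+[^\n]*)",
    re.DOTALL,
)

def extract_pairs(source: str, min_comment_chars: int = 20) -> list[dict]:
    """Pull (comment, declaration) pairs out of TypeScript/JS source text."""
    pairs = []
    for m in JSDOC_PAIR.finditer(source):
        # Strip the leading "* " decoration from each JSDoc line
        comment = re.sub(r"^\s*\*\s?", "", m.group("comment").strip(),
                         flags=re.MULTILINE).strip()
        if len(comment) >= min_comment_chars:  # drop trivial comments
            pairs.append({"comment": comment, "code": m.group("code").strip()})
    return pairs
```

The `min_comment_chars` threshold implements the "substantive comment" filter; the ">3 lines code" check would need the full declaration body, which this header-only sketch doesn't capture.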

---

### 5. Data Augmentation (5K examples)

From existing high-quality examples:
- Paraphrase user prompts (local NLP tools)
- Swap tools in similar contexts (e.g., FileRead → Glob)
- Add/remove context information
- Create "failed tool" scenarios with recovery
- Vary complexity levels

Target: 5,000 augmented examples
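The cheapest augmentations above can be sketched without any NLP tooling; the swap table and the `recovery` label here are hypothetical placeholders for whatever schema the training format settles on:

```python
# Hypothetical phrasing swaps; a real pipeline might use a local paraphrase model.
SWAPS = [("Show me", "Display"), ("Read", "Open"), ("Find", "Search for")]

def augment(example: dict) -> list[dict]:
    """Return the original example plus cheap variants:
    phrasing swaps and one synthetic failed-tool/recovery scenario."""
    out = [example]
    for old, new in SWAPS:
        if old in example["prompt"]:
            out.append({**example, "prompt": example["prompt"].replace(old, new)})
    # Failed-tool scenario: the first call errors and the model must recover
    out.append({
        **example,
        "tool_result": "Error: ENOENT: no such file or directory",
        "recovery": "retry_with_glob",  # hypothetical recovery label
    })
    return out
```

Tool swapping (FileRead → Glob) needs context checks to stay semantically valid, so it belongs in a separate, more careful pass.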

---

## Total Estimate

- OpenClaw logs: 10K
- Synthetic templates: 20K
- Public datasets: 20K
- Code-pairs: 10K
- Augmentation: 5K
- **Total: ~65,000 examples** (exceeds 50K target)

---

## Implementation Steps

1. **Session log mining script** - `scripts/mine_sessions.py`
2. **Synthetic data generator** - `scripts/generate_synthetic.py`
3. **Public dataset downloader** - `scripts/download_datasets.py`
4. **Code-pair extractor** - `scripts/extract_code_pairs.py`
5. **Data augmenter** - `scripts/augment_data.py`
6. **Quality filter** - `scripts/quality_filter.py`
7. **Dataset combiner** - `scripts/combine_datasets.py`

All scripts save to `training-data/scaled/` with source tracking.
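The combiner plus source tracking is small enough to sketch here; the per-source dict keys are whatever labels the upstream scripts choose:

```python
import json
from pathlib import Path

def combine(sources: dict[str, list[dict]], out_path: str) -> int:
    """Merge per-source example lists into one JSONL file,
    tagging each record with where it came from. Returns the record count."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    n = 0
    with out.open("w", encoding="utf-8") as f:
        for source, examples in sources.items():
            for ex in examples:
                f.write(json.dumps({**ex, "source": source}) + "\n")
                n += 1
    return n
```

Running this last, after the quality filter, keeps `training-data/scaled/` as a single clean artifact per build.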

---

## Quality Control

- All examples validated against tool schemas
- Deduplication (exact and near-duplicate)
- Minimum quality thresholds
- Balance across tools and complexity
- 80/10/10 train/val/test split
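The exact-plus-trivial-near-duplicate pass can be done with normalized hashing; true near-duplicate detection (MinHash/SimHash) would layer on top of this:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical prompts hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedupe(examples: list[dict]) -> list[dict]:
    """Drop exact and whitespace/case-level duplicate prompts, keeping first-seen."""
    seen, out = set(), []
    for ex in examples:
        key = hashlib.sha256(normalize(ex["prompt"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(ex)
    return out
```

Deduplicating before the 80/10/10 split matters most: near-duplicates that straddle train and test inflate eval scores.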

---

## Timeline (Manual)

- **Day 1:** Session mining + code-pair extraction
- **Day 2:** Synthetic generation + public dataset integration
- **Day 3:** Augmentation + quality filtering + combining

We can produce 50K+ examples within a few days of focused work.

---

**Status:** Ready to implement step 1 (session mining) now.