walidsobhie-code committed on
Commit cee9266 · 1 Parent(s): 29a776a

Exclude large jsonl files from repo

.gitignore CHANGED
@@ -75,4 +75,6 @@ logs/
 
  # Temporary
  tmp/
- temp/
+ temp/training-data/**/*.jsonl
+ training-data-expanded/**/*.jsonl
+ *.jsonl
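
Of the three added patterns, the slash-free `*.jsonl` is the broadest, since a `.gitignore` pattern without a slash is matched against the file name at any depth. The sketch below approximates only that one rule with Python's `fnmatch` and applies it to the paths deleted in this commit. It is not a full implementation of gitignore semantics, and the helper name `ignored_by_basename_glob` is hypothetical.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def ignored_by_basename_glob(path: str, pattern: str = "*.jsonl") -> bool:
    """Approximate a slash-free .gitignore pattern by matching the file name only."""
    return fnmatchcase(basename(path), pattern)


# Paths taken from the file list of this commit.
deleted_paths = [
    "training-data-expanded/tool_examples.jsonl",
    "training-data/README.md",
    "training-data/tool_examples.json",
    "training-data/tool_examples.jsonl",
    "training-data/tool_examples_combined.jsonl",
]

for path in deleted_paths:
    print(f"{path}: {ignored_by_basename_glob(path)}")
```

Under this approximation the three `.jsonl` paths match, while `README.md` and `tool_examples.json` do not, so those two are removed by the commit but not covered by the new ignore rules. Ignore rules only affect untracked files, which is consistent with the tracked `.jsonl` files being deleted in the same commit. The previous `temp/` entry no longer appears on its own line after this change; the commit does not say whether that is intended.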
training-data-expanded/tool_examples.jsonl DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:62e9ca4a94ef5c4c4b3d00c87d669ac33e23efb7bd6468d9c71304acc89cd553
- size 18893234
training-data/README.md DELETED
@@ -1,182 +0,0 @@
- # Stack 2.9 Training Data
-
- This directory contains synthetic training data for fine-tuning code generation models.
-
- ## Directory Structure
-
- ```
- training-data/
- ├── README.md                    # This file
- ├── tool_examples.jsonl          # Tool-calling examples (Qwen2.5-Coder format)
- ├── tool_examples.json           # Same as above in JSON format
- ├── code_completion/             # Pure code completion examples
- │   ├── code_completion.jsonl
- │   └── code_completion.json
- └── training-data-expanded/      # Additional generated data
-     └── tool_examples.jsonl      # 5000 expanded tool-calling examples
- ```
-
- ## Data Formats
-
- ### Tool-Calling Examples
-
- **Format:** Qwen2.5-Coder style with `tool_calls`
-
- Each example contains:
- - `messages`: Array of conversation messages (system, user, assistant, tool)
- - `tools`: Array of tool definitions
-
- **Example structure:**
- ```json
- {
-   "messages": [
-     {"role": "system", "content": "You are a helpful AI assistant..."},
-     {"role": "user", "content": "Read the file at src/main.py..."},
-     {
-       "role": "assistant",
-       "content": null,
-       "tool_calls": [
-         {
-           "id": "call_1234",
-           "type": "function",
-           "function": {
-             "name": "FileRead",
-             "arguments": "{\"path\": \"src/main.py\"}"
-           }
-         }
-       ]
-     },
-     {
-       "role": "tool",
-       "content": "Successfully read file: src/main.py\n...",
-       "tool_call_id": "call_1234",
-       "name": "FileRead"
-     },
-     {"role": "assistant", "content": "Here's the contents..."}
-   ],
-   "tools": [...]
- }
- ```
-
- **Available Tools:**
- - `Bash` - Execute bash commands
- - `FileRead` - Read file contents
- - `FileWrite` - Write/create files
- - `WebSearch` - Search the web
- - `Grep` - Search patterns in files
-
- ### Code Completion Examples
-
- **Format:** Chat-based with context and completion
-
- Each example contains:
- - `messages`: Array of conversation messages
- - `language`: Programming language (python, javascript, go, rust, typescript)
- - `difficulty`: easy, medium, hard
- - `variant`: basic, explain, debug, optimize
- - `context`: The code context to complete
- - `completion`: The expected completion
-
- **Example structure:**
- ```json
- {
-   "messages": [
-     {"role": "system", "content": "You are a helpful AI assistant..."},
-     {"role": "user", "content": "Complete the following code:\n```python\ndef greet(name):\n```"},
-     {"role": "assistant", "content": "Here's the completed code:\n```python\ndef greet(name):\n    return f\"Hello, {name}!\"\n```"}
-   ],
-   "language": "python",
-   "difficulty": "easy",
-   "variant": "basic",
-   "description": "Simple function that returns a greeting",
-   "context": "def greet(name):",
-   "completion": "    return f\"Hello, {name}!\""
- }
- ```
-
- ## Generation Scripts
-
- ### Tool Data Generator
-
- ```bash
- python3 scripts/generate_tool_data.py \
-   --num-examples 5000 \
-   --output-dir training-data-expanded \
-   --output-format jsonl
- ```
-
- ### Code Completion Generator
-
- ```bash
- python3 scripts/generate_code_completion_data.py \
-   --num-examples 1000 \
-   --output-dir training-data/code-completion \
-   --languages python javascript go rust typescript \
-   --difficulties easy medium hard \
-   --variants basic explain debug optimize
- ```
-
- ## Difficulty Levels
-
- | Level | Description |
- |-------|-------------|
- | **easy** | Simple functions, basic operations, single concepts |
- | **medium** | Intermediate patterns, async operations, error handling |
- | **hard** | Complex algorithms, data structures, design patterns |
-
- ## Variants
-
- | Variant | Description |
- |---------|-------------|
- | **basic** | Standard code completion |
- | **explain** | Code completion with explanation |
- | **debug** | Bug fixing and completion |
- | **optimize** | Performance optimization and completion |
-
- ## Supported Languages
-
- - Python
- - JavaScript
- - Go
- - Rust
- - TypeScript
-
- ## Usage
-
- ### Training with MLflow
-
- ```bash
- mlflow run . -P num_examples=5000
- ```
-
- ### Loading Data for Training
-
- ```python
- import json
-
- # Load JSONL
- with open("training-data/tool_examples.jsonl", "r") as f:
-     for line in f:
-         example = json.loads(line)
-         # Process example
-         pass
-
- # Load JSON
- with open("training-data/tool_examples.json", "r") as f:
-     data = json.load(f)
- ```
-
- ## Augmentation
-
- The tool-calling generator applies augmentation to create diversity:
- - Varying file paths
- - Varying command options
- - Varying search queries
- - Varying code snippets
-
- ## Quality Guidelines
-
- - All generated code is syntactically correct
- - Examples include realistic context
- - Tools have proper arguments and responses
- - Code completions are deterministic and correct
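
The tool-calling layout described in the deleted README (a `messages` array, a `tools` array, and `arguments` encoded as a JSON string) can be checked per record with a sketch like the one below. The function name `check_tool_example` and the specific checks are assumptions made for illustration; nothing in the commit indicates the repository validates its data this way.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def check_tool_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL record (empty if none)."""
    problems = []
    example = json.loads(line)

    # Both top-level fields named in the README should be arrays.
    for key in ("messages", "tools"):
        if not isinstance(example.get(key), list):
            problems.append(f"missing or non-list field: {key}")

    messages = example.get("messages")
    if not isinstance(messages, list):
        messages = []

    seen_call_ids = set()
    for msg in messages:
        if not isinstance(msg, dict):
            problems.append("message is not an object")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"unexpected role: {role!r}")
        # `arguments` is a JSON document stored inside a string.
        for call in msg.get("tool_calls") or []:
            seen_call_ids.add(call.get("id"))
            try:
                json.loads(call["function"]["arguments"])
            except (KeyError, TypeError, ValueError):
                problems.append(f"bad arguments in call {call.get('id')!r}")
        # A tool reply should refer back to an earlier call.
        if role == "tool" and msg.get("tool_call_id") not in seen_call_ids:
            problems.append(f"tool reply without matching call: {msg.get('tool_call_id')!r}")
    return problems


# A shortened record modeled on the README's example structure.
sample = json.dumps({
    "messages": [
        {"role": "user", "content": "Read the file at src/main.py..."},
        {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_1234", "type": "function",
             "function": {"name": "FileRead", "arguments": "{\"path\": \"src/main.py\"}"}},
        ]},
        {"role": "tool", "content": "...", "tool_call_id": "call_1234", "name": "FileRead"},
    ],
    "tools": [],
})
print(check_tool_example(sample))
```

Returning a list of problems instead of raising on the first one makes it easier to scan a whole JSONL file and report every malformed record in a single pass.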
training-data/tool_examples.json DELETED
The diff for this file is too large to render. See raw diff
 
training-data/tool_examples.jsonl DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:1043720a918f5fe0f70cc013c108710570c37ae6c9cee6f504e49dc359af5a2a
- size 3779800

training-data/tool_examples_combined.jsonl DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:32da2f0f67ba3fd83d180ec2c1a323e77d4263ff5aeb1e8062cf596b070691d5
- size 5669209
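
The three `.jsonl` entries removed in this commit are Git LFS pointer files, not the data itself: each holds a spec version, a SHA-256 object ID, and a size in bytes. As a rough illustration, the sketch below parses one pointer and totals the sizes listed in this diff. The helper name `parse_lfs_pointer` is hypothetical and it handles only the three keys seen here.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" ")
        if sep:  # skip blank or malformed lines
            fields[key] = value
    # Raises KeyError if a required key is absent.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents of training-data/tool_examples.jsonl as shown in this diff.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:1043720a918f5fe0f70cc013c108710570c37ae6c9cee6f504e49dc359af5a2a
size 3779800
"""
info = parse_lfs_pointer(pointer)
print(info["algo"], info["size"])

# Sizes of the three removed LFS objects, in bytes, as listed in this diff.
removed_sizes = [18893234, 3779800, 5669209]
print(sum(removed_sizes), "bytes")
```

The total covers only the three LFS-tracked files; the size of `training-data/tool_examples.json` is not shown in the diff and is left out.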