walidsobhie-code committed
Commit · cee9266
Parent(s): 29a776a
Exclude large jsonl files from repo
.gitignore
CHANGED
@@ -75,4 +75,6 @@ logs/
 
 # Temporary
 tmp/
-temp/
+temp/training-data/**/*.jsonl
+training-data-expanded/**/*.jsonl
+*.jsonl
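The effect of the new `.gitignore` entries can be sanity-checked with a small sketch. It relies on one gitignore rule only: a pattern with no slash, such as `*.jsonl`, is matched against the file name at any depth. Under that rule the catch-all `*.jsonl` already covers every `.jsonl` path deleted in this commit, so the two directory-scoped patterns appear redundant. The helper name `ignored_by_basename_pattern` is hypothetical, and `fnmatch` only approximates Git's matcher for this simple case; `git check-ignore -v <path>` remains the authoritative check.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def ignored_by_basename_pattern(path: str, pattern: str = "*.jsonl") -> bool:
    """Approximate a slash-free gitignore pattern by matching the file name only."""
    return fnmatchcase(PurePosixPath(path).name, pattern)


# Paths taken from the file headers of this commit.
deleted_paths = [
    "training-data-expanded/tool_examples.jsonl",
    "training-data/tool_examples.jsonl",
    "training-data/tool_examples_combined.jsonl",
    "training-data/tool_examples.json",
    "training-data/README.md",
]

for path in deleted_paths:
    status = "ignored" if ignored_by_basename_pattern(path) else "not ignored"
    print(f"{path}: {status}")
```

Note also that the hunk replaces the `temp/` entry rather than adding to it, so a `temp/` directory is no longer ignored after this commit.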
training-data-expanded/tool_examples.jsonl
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:62e9ca4a94ef5c4c4b3d00c87d669ac33e23efb7bd6468d9c71304acc89cd553
-size 18893234
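The deleted file is a Git LFS pointer rather than the data itself: three `key value` lines giving the spec version, the object ID (a SHA-256 digest), and the size in bytes. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its version, hash algorithm, digest and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "oid": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from the hunk above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:62e9ca4a94ef5c4c4b3d00c87d669ac33e23efb7bd6468d9c71304acc89cd553
size 18893234
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size_bytes"])
```

The `size` field shows that the object behind this pointer is 18,893,234 bytes, roughly 18.9 MB.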
training-data/README.md
DELETED
@@ -1,182 +0,0 @@
-# Stack 2.9 Training Data
-
-This directory contains synthetic training data for fine-tuning code generation models.
-
-## Directory Structure
-
-```
-training-data/
-├── README.md                    # This file
-├── tool_examples.jsonl          # Tool-calling examples (Qwen2.5-Coder format)
-├── tool_examples.json           # Same as above in JSON format
-├── code_completion/             # Pure code completion examples
-│   ├── code_completion.jsonl
-│   └── code_completion.json
-└── training-data-expanded/      # Additional generated data
-    └── tool_examples.jsonl      # 5000 expanded tool-calling examples
-```
-
-## Data Formats
-
-### Tool-Calling Examples
-
-**Format:** Qwen2.5-Coder style with `tool_calls`
-
-Each example contains:
-- `messages`: Array of conversation messages (system, user, assistant, tool)
-- `tools`: Array of tool definitions
-
-**Example structure:**
-```json
-{
-  "messages": [
-    {"role": "system", "content": "You are a helpful AI assistant..."},
-    {"role": "user", "content": "Read the file at src/main.py..."},
-    {
-      "role": "assistant",
-      "content": null,
-      "tool_calls": [
-        {
-          "id": "call_1234",
-          "type": "function",
-          "function": {
-            "name": "FileRead",
-            "arguments": "{\"path\": \"src/main.py\"}"
-          }
-        }
-      ]
-    },
-    {
-      "role": "tool",
-      "content": "Successfully read file: src/main.py\n...",
-      "tool_call_id": "call_1234",
-      "name": "FileRead"
-    },
-    {"role": "assistant", "content": "Here's the contents..."}
-  ],
-  "tools": [...]
-}
-```
-
-**Available Tools:**
-- `Bash` - Execute bash commands
-- `FileRead` - Read file contents
-- `FileWrite` - Write/create files
-- `WebSearch` - Search the web
-- `Grep` - Search patterns in files
-
-### Code Completion Examples
-
-**Format:** Chat-based with context and completion
-
-Each example contains:
-- `messages`: Array of conversation messages
-- `language`: Programming language (python, javascript, go, rust, typescript)
-- `difficulty`: easy, medium, hard
-- `variant`: basic, explain, debug, optimize
-- `context`: The code context to complete
-- `completion`: The expected completion
-
-**Example structure:**
-```json
-{
-  "messages": [
-    {"role": "system", "content": "You are a helpful AI assistant..."},
-    {"role": "user", "content": "Complete the following code:\n```python\ndef greet(name):\n```"},
-    {"role": "assistant", "content": "Here's the completed code:\n```python\ndef greet(name):\n return f\"Hello, {name}!\"\n```"}
-  ],
-  "language": "python",
-  "difficulty": "easy",
-  "variant": "basic",
-  "description": "Simple function that returns a greeting",
-  "context": "def greet(name):",
-  "completion": " return f\"Hello, {name}!\""
-}
-```
-
-## Generation Scripts
-
-### Tool Data Generator
-
-```bash
-python3 scripts/generate_tool_data.py \
-    --num-examples 5000 \
-    --output-dir training-data-expanded \
-    --output-format jsonl
-```
-
-### Code Completion Generator
-
-```bash
-python3 scripts/generate_code_completion_data.py \
-    --num-examples 1000 \
-    --output-dir training-data/code-completion \
-    --languages python javascript go rust typescript \
-    --difficulties easy medium hard \
-    --variants basic explain debug optimize
-```
-
-## Difficulty Levels
-
-| Level | Description |
-|-------|-------------|
-| **easy** | Simple functions, basic operations, single concepts |
-| **medium** | Intermediate patterns, async operations, error handling |
-| **hard** | Complex algorithms, data structures, design patterns |
-
-## Variants
-
-| Variant | Description |
-|---------|-------------|
-| **basic** | Standard code completion |
-| **explain** | Code completion with explanation |
-| **debug** | Bug fixing and completion |
-| **optimize** | Performance optimization and completion |
-
-## Supported Languages
-
-- Python
-- JavaScript
-- Go
-- Rust
-- TypeScript
-
-## Usage
-
-### Training with MLflow
-
-```bash
-mlflow run . -P num_examples=5000
-```
-
-### Loading Data for Training
-
-```python
-import json
-
-# Load JSONL
-with open("training-data/tool_examples.jsonl", "r") as f:
-    for line in f:
-        example = json.loads(line)
-        # Process example
-        pass
-
-# Load JSON
-with open("training-data/tool_examples.json", "r") as f:
-    data = json.load(f)
-```
-
-## Augmentation
-
-The tool-calling generator applies augmentation to create diversity:
-- Varying file paths
-- Varying command options
-- Varying search queries
-- Varying code snippets
-
-## Quality Guidelines
-
-- All generated code is syntactically correct
-- Examples include realistic context
-- Tools have proper arguments and responses
-- Code completions are deterministic and correct
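The tool-calling format documented in the deleted README implies a consistency rule: each `tool` message carries a `tool_call_id` that should refer to an `id` from an earlier assistant `tool_calls` entry, and `arguments` is a JSON-encoded string rather than a nested object. The sketch below checks both. The rule is inferred from the README's example, not stated by it, and `check_tool_example` is a hypothetical helper; the `tools` list is left empty because the README elides it.

```python
import json


def check_tool_example(example: dict) -> list:
    """Return a list of problems found in one tool-calling example (empty if none)."""
    problems = []
    seen_call_ids = set()
    for i, msg in enumerate(example.get("messages", [])):
        # Assistant messages may carry tool calls; record their ids.
        for call in msg.get("tool_calls") or []:
            seen_call_ids.add(call["id"])
            try:
                json.loads(call["function"]["arguments"])
            except (TypeError, json.JSONDecodeError):
                problems.append(f"message {i}: arguments is not a JSON string")
        # Tool results must answer a call that was already issued.
        if msg.get("role") == "tool" and msg.get("tool_call_id") not in seen_call_ids:
            problems.append(f"message {i}: unknown tool_call_id {msg.get('tool_call_id')!r}")
    return problems


# Example mirroring the README's structure; "tools" is a placeholder.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant..."},
        {"role": "user", "content": "Read the file at src/main.py..."},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": "call_1234",
                    "type": "function",
                    "function": {"name": "FileRead", "arguments": "{\"path\": \"src/main.py\"}"},
                }
            ],
        },
        {
            "role": "tool",
            "content": "Successfully read file: src/main.py\n...",
            "tool_call_id": "call_1234",
            "name": "FileRead",
        },
        {"role": "assistant", "content": "Here's the contents..."},
    ],
    "tools": [],
}

print(check_tool_example(example))
```

Running a check of this kind over each line of a JSONL file before training is one way to catch mismatched IDs introduced by augmentation.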
training-data/tool_examples.json
DELETED
The diff for this file is too large to render.
training-data/tool_examples.jsonl
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:1043720a918f5fe0f70cc013c108710570c37ae6c9cee6f504e49dc359af5a2a
-size 3779800
training-data/tool_examples_combined.jsonl
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:32da2f0f67ba3fd83d180ec2c1a323e77d4263ff5aeb1e8062cf596b070691d5
-size 5669209