Exclude large jsonl files from repo

Files changed (6)

.gitignore +3 -1
training-data-expanded/tool_examples.jsonl +0 -3
training-data/README.md +0 -182
training-data/tool_examples.json +0 -0
training-data/tool_examples.jsonl +0 -3
training-data/tool_examples_combined.jsonl +0 -3

.gitignore CHANGED

@@ -75,4 +75,6 @@ logs/
 # Temporary
 tmp/
-temp/
+temp/training-data/**/*.jsonl
+training-data-expanded/**/*.jsonl
+*.jsonl

training-data-expanded/tool_examples.jsonl DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:62e9ca4a94ef5c4c4b3d00c87d669ac33e23efb7bd6468d9c71304acc89cd553
-size 18893234

training-data/README.md DELETED

@@ -1,182 +0,0 @@
-# Stack 2.9 Training Data
-This directory contains synthetic training data for fine-tuning code generation models.
-## Directory Structure
-```
-training-data/
-├── README.md                           # This file
-├── tool_examples.jsonl                 # Tool-calling examples (Qwen2.5-Coder format)
-├── tool_examples.json                  # Same as above in JSON format
-├── code_completion/                    # Pure code completion examples
-│   ├── code_completion.jsonl
-│   └── code_completion.json
-└── training-data-expanded/            # Additional generated data
-    └── tool_examples.jsonl             # 5000 expanded tool-calling examples
-```
-## Data Formats
-### Tool-Calling Examples
-**Format:** Qwen2.5-Coder style with `tool_calls`
-Each example contains:
-- `messages`: Array of conversation messages (system, user, assistant, tool)
-- `tools`: Array of tool definitions
-**Example structure:**
-```json
-{
-  "messages": [
-    {"role": "system", "content": "You are a helpful AI assistant..."},
-    {"role": "user", "content": "Read the file at src/main.py..."},
-    {
-      "role": "assistant",
-      "content": null,
-      "tool_calls": [
-        {
-          "id": "call_1234",
-          "type": "function",
-          "function": {
-            "name": "FileRead",
-            "arguments": "{\"path\": \"src/main.py\"}"
-          }
-        }
-      ]
-    },
-    {
-      "role": "tool",
-      "content": "Successfully read file: src/main.py\n...",
-      "tool_call_id": "call_1234",
-      "name": "FileRead"
-    },
-    {"role": "assistant", "content": "Here's the contents..."}
-  ],
-  "tools": [...]
-}
-```
-**Available Tools:**
-- `Bash` - Execute bash commands
-- `FileRead` - Read file contents
-- `FileWrite` - Write/create files
-- `WebSearch` - Search the web
-- `Grep` - Search patterns in files
-### Code Completion Examples
-**Format:** Chat-based with context and completion
-Each example contains:
-- `messages`: Array of conversation messages
-- `language`: Programming language (python, javascript, go, rust, typescript)
-- `difficulty`: easy, medium, hard
-- `variant`: basic, explain, debug, optimize
-- `context`: The code context to complete
-- `completion`: The expected completion
-**Example structure:**
-```json
-{
-  "messages": [
-    {"role": "system", "content": "You are a helpful AI assistant..."},
-    {"role": "user", "content": "Complete the following code:\n```python\ndef greet(name):\n```"},
-    {"role": "assistant", "content": "Here's the completed code:\n```python\ndef greet(name):\n    return f\"Hello, {name}!\"\n```"}
-  ],
-  "language": "python",
-  "difficulty": "easy",
-  "variant": "basic",
-  "description": "Simple function that returns a greeting",
-  "context": "def greet(name):",
-  "completion": "    return f\"Hello, {name}!\""
-}
-```
-## Generation Scripts
-### Tool Data Generator
-```bash
-python3 scripts/generate_tool_data.py \
-    --num-examples 5000 \
-    --output-dir training-data-expanded \
-    --output-format jsonl
-```
-### Code Completion Generator
-```bash
-python3 scripts/generate_code_completion_data.py \
-    --num-examples 1000 \
-    --output-dir training-data/code-completion \
-    --languages python javascript go rust typescript \
-    --difficulties easy medium hard \
-    --variants basic explain debug optimize
-```
-## Difficulty Levels
-| Level | Description |
-|-------|-------------|
-| **easy** | Simple functions, basic operations, single concepts |
-| **medium** | Intermediate patterns, async operations, error handling |
-| **hard** | Complex algorithms, data structures, design patterns |
-## Variants
-| Variant | Description |
-|---------|-------------|
-| **basic** | Standard code completion |
-| **explain** | Code completion with explanation |
-| **debug** | Bug fixing and completion |
-| **optimize** | Performance optimization and completion |
-## Supported Languages
-- Python
-- JavaScript
-- Go
-- Rust
-- TypeScript
-## Usage
-### Training with MLflow
-```bash
-mlflow run . -P num_examples=5000
-```
-### Loading Data for Training
-```python
-import json
-# Load JSONL
-with open("training-data/tool_examples.jsonl", "r") as f:
-    for line in f:
-        example = json.loads(line)
-        # Process example
-        pass
-# Load JSON
-with open("training-data/tool_examples.json", "r") as f:
-    data = json.load(f)
-```
-## Augmentation
-The tool-calling generator applies augmentation to create diversity:
-- Varying file paths
-- Varying command options
-- Varying search queries
-- Varying code snippets
-## Quality Guidelines
-- All generated code is syntactically correct
-- Examples include realistic context
-- Tools have proper arguments and responses
-- Code completions are deterministic and correct

training-data/tool_examples.json DELETED

The diff for this file is too large to render.

training-data/tool_examples.jsonl DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:1043720a918f5fe0f70cc013c108710570c37ae6c9cee6f504e49dc359af5a2a
-size 3779800

training-data/tool_examples_combined.jsonl DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:32da2f0f67ba3fd83d180ec2c1a323e77d4263ff5aeb1e8062cf596b070691d5
-size 5669209
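
For reference, the three deleted `.jsonl` entries are Git LFS pointer files, not the data itself: each one records a spec version, a SHA-256 object id, and the size in bytes of the real file. The sketch below is a minimal, illustrative parser for that three-line format, fed with the pointer contents copied from the hunks above; the helper name `parse_lfs_pointer` and the `POINTERS` mapping are hypothetical and not part of this repository. It adds up the `size` fields to estimate how much LFS-tracked data this commit drops. `training-data/tool_examples.json` is not included, since its diff is not rendered here.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is a key, one space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid line has the form '<algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


# Pointer contents copied from the deleted files shown in this diff.
POINTERS = {
    "training-data-expanded/tool_examples.jsonl": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:62e9ca4a94ef5c4c4b3d00c87d669ac33e23efb7bd6468d9c71304acc89cd553\n"
        "size 18893234\n"
    ),
    "training-data/tool_examples.jsonl": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:1043720a918f5fe0f70cc013c108710570c37ae6c9cee6f504e49dc359af5a2a\n"
        "size 3779800\n"
    ),
    "training-data/tool_examples_combined.jsonl": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:32da2f0f67ba3fd83d180ec2c1a323e77d4263ff5aeb1e8062cf596b070691d5\n"
        "size 5669209\n"
    ),
}

parsed = {path: parse_lfs_pointer(text) for path, text in POINTERS.items()}
total_bytes = sum(p["size"] for p in parsed.values())

for path, p in parsed.items():
    print(f"{path}: {p['size']} bytes ({p['oid_algo']})")
print(f"total LFS-tracked size removed: {total_bytes} bytes")
```

The total only reflects what the pointers declare; it says nothing about whether the underlying LFS objects are still stored on the remote.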