Stack-2-9-finetuned / docs /DATA_FORMAT.md
walidsobhie-code
feat: add production infrastructure - CI/CD, Docker, code quality, and monitoring
b5998ff
# Stack 2.9 Training Data Format
This document describes the format and structure of training data for Stack 2.9.
## Overview
Training data is stored in JSONL format (JSON Lines), where each line is a valid JSON object representing a single training example.
## File Structure
```
training-data/
├── tool_examples.jsonl # Original examples (1000)
├── augmented_tool_examples.jsonl # Augmented examples (2-5x)
└── scaled/ # Processed datasets
├── train.jsonl
└── val.jsonl
```
## Example Format
```json
{
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant that can use tools to help users solve problems."
},
{
"role": "user",
"content": "Can you show me the tests/test_main.py file?"
},
{
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_$1180",
"type": "function",
"function": {
"name": "FileRead",
"arguments": "{\"path\": \"src/main.py\"}"
}
}
]
},
{
"role": "tool",
"content": "Successfully read file: README.md\n```markdown\n# My Project\n\nA sample project for Stack 2.9.\n```",
"tool_call_id": "call_$1180",
"name": "FileRead"
},
{
"role": "assistant",
"content": "Here's the README.md:\n\n```markdown\n# My Project\n\nA sample project for Stack 2.9.\n```"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "Bash",
"description": "Execute bash commands in the terminal.",
"parameters": {
"type": "object",
"properties": {
"command": {"type": "string", "description": "The bash command to execute"},
"timeout": {"type": "integer", "description": "Timeout in seconds"}
},
"required": ["command"]
}
}
},
{
"type": "function",
"function": {
"name": "FileRead",
"description": "Read the contents of a file.",
"parameters": {
"type": "object",
"properties": {
"path": {"type": "string", "description": "Path to the file to read"},
"offset": {"type": "integer", "description": "Line number to start from"},
"limit": {"type": "integer", "description": "Max lines to read"}
},
"required": ["path"]
}
}
}
]
}
```
## Field Definitions
### Top-Level Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `messages` | array | Yes | Array of message objects |
| `tools` | array | Yes | Available tools/functions |
| `source` | string | No | Data source identifier |
### Message Object
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `role` | string | Yes | One of: system, user, assistant, tool |
| `content` | string | Yes* | Message content (null if tool_calls present) |
| `tool_calls` | array | No* | Tool call requests |
| `tool_call_id` | string | No* | ID linking to tool response |
| `name` | string | No* | Tool name (for tool messages) |
*Content is required unless `tool_calls` is present. `tool_call_id` and `name` required for role="tool".
### Tool Call Object
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | string | Yes | Unique call identifier |
| `type` | string | Yes | Always "function" |
| `function` | object | Yes | Function name and arguments |
| `function.name` | string | Yes | Tool/function name |
| `function.arguments` | object/string | Yes | JSON arguments |
## Data Sources
- **random_synthetic**: Auto-generated with random parameters
- **synthetic_template**: Template-based synthetic examples
- **augmented_***: Augmented from other sources
- **original**: Human-curated examples
## Augmentation
The augmentation script applies these transformations:
1. **Paraphrasing**: Reword user prompts (70% chance)
2. **Difficulty scaling**: Add complexity modifiers
3. **Parameter variation**: Change file paths, commands
4. **Filler words**: Add "please", "thanks" (30% chance)
5. **Edge cases**: Empty input, multi-step, error handling
Run augmentation:
```bash
python scripts/augment_training_data.py \
--input training-data/tool_examples.jsonl \
--output training-data/augmented.jsonl \
--multiplier 3
```
## Validation
Run validation to check data quality:
```bash
python scripts/validate_training_data.py --input training-data/tool_examples.jsonl
```
Checks include:
- Required fields present
- Valid JSON syntax
- Message role ordering
- Tool call structure
- No empty entries
## Converting to Training Format
For training, convert to standard format:
```python
# Example conversion
python scripts/combine_datasets.py \
--input training-data/augmented.jsonl \
--output data/final/train.jsonl \
--format chatml
```