Stack 2.9 Training Data Format

This document describes the format and structure of training data for Stack 2.9.

Overview

Training data is stored in JSONL format (JSON Lines), where each line is a valid JSON object representing a single training example.
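
As a minimal sketch of what this means in practice, the helper below parses JSONL one line at a time. The name `parse_jsonl` is hypothetical and not part of the Stack 2.9 scripts, and skipping blank lines is an assumption rather than documented behavior.

```python
import json
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed training example per non-blank line."""
    for line_number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # assumption: blank lines are ignored, not treated as errors
        try:
            yield json.loads(text)
        except json.JSONDecodeError as exc:
            # Report the offending line so a bad record is easy to locate.
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
```

Because an open file is an iterable of lines, a file handle opened on `training-data/tool_examples.jsonl` can be passed to `parse_jsonl` directly.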

File Structure

training-data/
├── tool_examples.jsonl           # Original examples (1000)
├── augmented_tool_examples.jsonl # Augmented examples (2-5x)
└── scaled/                       # Processed datasets
    ├── train.jsonl
    └── val.jsonl

Example Format

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant that can use tools to help users solve problems."
    },
    {
      "role": "user",
      "content": "Can you show me the README.md file?"
    },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_$1180",
          "type": "function",
          "function": {
            "name": "FileRead",
            "arguments": "{\"path\": \"README.md\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "Successfully read file: README.md\n```markdown\n# My Project\n\nA sample project for Stack 2.9.\n```",
      "tool_call_id": "call_$1180",
      "name": "FileRead"
    },
    {
      "role": "assistant",
      "content": "Here's the README.md:\n\n```markdown\n# My Project\n\nA sample project for Stack 2.9.\n```"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "Bash",
        "description": "Execute bash commands in the terminal.",
        "parameters": {
          "type": "object",
          "properties": {
            "command": {"type": "string", "description": "The bash command to execute"},
            "timeout": {"type": "integer", "description": "Timeout in seconds"}
          },
          "required": ["command"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "FileRead",
        "description": "Read the contents of a file.",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {"type": "string", "description": "Path to the file to read"},
            "offset": {"type": "integer", "description": "Line number to start from"},
            "limit": {"type": "integer", "description": "Max lines to read"}
          },
          "required": ["path"]
        }
      }
    }
  ]
}

Field Definitions

Top-Level Fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| messages | array | Yes | Array of message objects |
| tools | array | Yes | Available tools/functions |
| source | string | No | Data source identifier |

Message Object

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| role | string | Yes | One of: system, user, assistant, tool |
| content | string | Yes* | Message content (null if tool_calls present) |
| tool_calls | array | No* | Tool call requests |
| tool_call_id | string | No* | ID of the tool call this message responds to |
| name | string | No* | Tool name (for tool messages) |

*content is required unless tool_calls is present, in which case it is null. tool_call_id and name are required when role is "tool".
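
The rules in this table can be checked mechanically. The following is a sketch only: `message_errors` is a hypothetical helper, not the logic of `scripts/validate_training_data.py`, and it assumes each message is already a parsed dict.

```python
from __future__ import annotations

VALID_ROLES = {"system", "user", "assistant", "tool"}


def message_errors(message: dict) -> list[str]:
    """Return the rule violations for one message object (empty list if valid)."""
    errors = []
    role = message.get("role")
    if role not in VALID_ROLES:
        errors.append(f"invalid role: {role!r}")
    # content may be null only when the message carries tool calls.
    has_tool_calls = bool(message.get("tool_calls"))
    if message.get("content") is None and not has_tool_calls:
        errors.append("content is required when tool_calls is absent")
    # Tool responses must identify the call they answer and the tool used.
    if role == "tool":
        for field in ("tool_call_id", "name"):
            if not message.get(field):
                errors.append(f"{field} is required for role='tool'")
    return errors
```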

Tool Call Object

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| id | string | Yes | Unique call identifier |
| type | string | Yes | Always "function" |
| function | object | Yes | Function name and arguments |
| function.name | string | Yes | Tool/function name |
| function.arguments | object/string | Yes | JSON arguments |
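
Since `function.arguments` may be either an object or a JSON-encoded string, consumers may want to normalize it before use. A minimal sketch, assuming a hypothetical helper named `parse_arguments` that is not part of the repository:

```python
import json


def parse_arguments(tool_call: dict) -> dict:
    """Return function.arguments as a dict, whichever of the two forms it is stored in."""
    arguments = tool_call["function"]["arguments"]
    if isinstance(arguments, str):
        # String form: the arguments are a JSON document embedded in a string.
        arguments = json.loads(arguments)
    if not isinstance(arguments, dict):
        raise TypeError("function.arguments must decode to a JSON object")
    return arguments
```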

Data Sources

  • random_synthetic: Auto-generated with random parameters
  • synthetic_template: Template-based synthetic examples
  • augmented_*: Augmented from other sources
  • original: Human-curated examples
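
To see how a dataset is distributed across these sources, the optional `source` field can be tallied. This is an illustrative sketch: `source_family` and `count_sources` are hypothetical names, and grouping every `augmented_*` value under one label and missing values under "unspecified" are assumptions.

```python
from collections import Counter
from typing import Iterable, Optional


def source_family(source: Optional[str]) -> str:
    """Collapse a source identifier into one of the families listed above."""
    if source is None:
        return "unspecified"  # assumption: source is optional, so it may be absent
    if source.startswith("augmented_"):
        return "augmented"
    return source


def count_sources(examples: Iterable[dict]) -> Counter:
    """Count examples per source family."""
    return Counter(source_family(example.get("source")) for example in examples)
```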

Augmentation

The augmentation script applies these transformations:

  1. Paraphrasing: Reword user prompts (70% chance)
  2. Difficulty scaling: Add complexity modifiers
  3. Parameter variation: Change file paths, commands
  4. Filler words: Add "please", "thanks" (30% chance)
  5. Edge cases: Empty input, multi-step, error handling

Run augmentation:

python scripts/augment_training_data.py \
  --input training-data/tool_examples.jsonl \
  --output training-data/augmented.jsonl \
  --multiplier 3
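
For intuition, the filler-word transformation (30% chance) might be sketched as follows. This is not the code in `scripts/augment_training_data.py`: the function name `add_filler`, the appended " Thanks!" suffix, and the choice to apply the chance per user message are all assumptions made for illustration.

```python
import copy
import random

FILLER_PROBABILITY = 0.3  # the 30% chance stated in the transformation list


def add_filler(example: dict, rng: random.Random) -> dict:
    """Return a copy of the example whose user prompts may gain a polite suffix."""
    augmented = copy.deepcopy(example)  # never mutate the original example
    for message in augmented["messages"]:
        if message["role"] == "user" and rng.random() < FILLER_PROBABILITY:
            # Assumption: the filler is appended; the real script may insert it elsewhere.
            message["content"] = message["content"].rstrip() + " Thanks!"
    return augmented
```

Passing a seeded generator such as `random.Random(42)` keeps the augmented output reproducible between runs.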

Validation

Run validation to check data quality:

python scripts/validate_training_data.py --input training-data/tool_examples.jsonl

Checks include:

  • Required fields present
  • Valid JSON syntax
  • Message role ordering
  • Tool call structure
  • No empty entries
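
Some of these checks can be sketched in a few lines. The function below is a hypothetical illustration, not the actual validator: it covers required top-level fields, empty entries, and one plausible reading of the ordering check, namely that every tool message must reference an ID issued by an earlier tool call.

```python
from __future__ import annotations


def example_errors(example: dict) -> list[str]:
    """Return structural problems found in one training example (empty list if none)."""
    errors = []
    for field in ("messages", "tools"):
        if not isinstance(example.get(field), list):
            errors.append(f"missing or non-array field: {field}")
    messages = example.get("messages")
    if not isinstance(messages, list):
        return errors
    if not messages:
        errors.append("messages is empty")
    issued_ids = set()
    for index, message in enumerate(messages):
        # Record the IDs of tool calls as they are issued.
        for call in message.get("tool_calls") or []:
            issued_ids.add(call.get("id"))
        # Assumption: a tool response must point back at an already-issued call.
        if message.get("role") == "tool" and message.get("tool_call_id") not in issued_ids:
            errors.append(f"messages[{index}]: tool_call_id does not match an earlier tool call")
    return errors
```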

Converting to Training Format

For training, convert the data to the chat format the trainer expects (ChatML in this example):

# Example conversion
python scripts/combine_datasets.py \
  --input training-data/augmented.jsonl \
  --output data/final/train.jsonl \
  --format chatml
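
As a rough illustration of the target format, ChatML wraps each message in `<|im_start|>` and `<|im_end|>` markers. The sketch below is not the implementation of `combine_datasets.py`: the function name `to_chatml` is hypothetical, and serializing `tool_calls` as JSON when `content` is null is an assumption about how tool calls might be rendered.

```python
from __future__ import annotations

import json


def to_chatml(messages: list[dict]) -> str:
    """Render a list of message objects as a single ChatML string."""
    parts = []
    for message in messages:
        content = message.get("content")
        if content is None:
            # Assumption: a tool-calling turn is rendered as its tool_calls in JSON.
            content = json.dumps(message.get("tool_calls", []))
        parts.append(f"<|im_start|>{message['role']}\n{content}<|im_end|>")
    return "\n".join(parts) + "\n"
```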