Zerolang Editing

zerolang-editing is a Verifiers/Prime RL environment for training coding agents to edit Zerolang programs through checked graph edits instead of loose text replacement.

The RL harness is built around Roder, using a custom zero-coder plugin/distribution that exposes a Zerolang-only graph toolset to the model. In training, generic source editing tools are disabled; the agent is expected to use the zero_* tools below, especially zero_graph_summary and zero_graph_patch, against the rollout file on disk.

The core task is intentionally narrow: each rollout starts with a .0 source file already written to disk, asks the model for a semantic code edit, and scores the edited file after the model uses Zerolang tooling. The intended successful behavior is:

  1. Inspect the file with Zerolang graph/check tools.
  2. Identify the relevant graph hash and semantic node.
  3. Apply a checked zero graph patch operation to the on-disk file.
  4. Finish with a compact JSON response pointing at the edited path.
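
As a small illustration of step 4, a compact final response could be produced like this. The key name "path" is a hypothetical choice for this sketch; the environment's actual response schema is not specified here.

```python
import json


def final_response(edited_path: str) -> str:
    """Build a compact JSON string that points at the edited file.

    The "path" key is an assumption made for illustration only.
    """
    # separators=(",", ":") drops the optional spaces so the payload stays compact.
    return json.dumps({"path": edited_path}, separators=(",", ":"))


print(final_response("/tmp/rollout/program.0"))
```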

This repository contains the environment source package, synthetic task builders, tool wrappers, and documentation. The trained checkpoint from hosted RL runs is published separately by the training service when a run is finalized.

Training Report: Laguna XS.2 Zerolang Editing LoRA

This report covers the hosted Prime RL runs used to select the published LoRA adapter at loras/laguna-xs2-zerolang-editing-step75-hdznmf/.

Selected checkpoint: run hdznmfje3xv0clwhu9sx4b0n, checkpoint qyxg7ya6x53ntmfjerp11gah, step 75. It reached the best held-out eval score, Avg@1 0.6604, on pandelis/zerolang-editing@0.1.11 with poolside/Laguna-XS.2.

Run Summary

| Run | Env | LR | Max tokens | Eval examples | Peak Avg@1 | Final Avg@1 | Finding |
| --- | --- | --- | --- | --- | --- | --- | --- |
| r0pbp4 | 0.1.9 | 1e-04 | 2048 | 16 | 0.5844 @ step 20 | 0.1500 | Early lift, then collapsed to no-tool behavior. |
| impksa | 0.1.10 | 1e-04 | 2048 | 16 | 0.5094 @ step 20 | 0.1500 | File-path task format worked, but LR/run length still collapsed. |
| hdznmf | 0.1.11 | 2e-05 | 4096 | 24 | 0.6604 @ step 75 | 0.5542 | Lower LR and stricter 0.1.11 rewards produced the selected LoRA. |
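
To compare how much of each run's peak survived to the end, the peak and final Avg@1 values from the summary can be turned into a drop and a retention ratio. This is only arithmetic on the reported numbers; the helper names are illustrative.

```python
# (peak Avg@1, final Avg@1) per run, copied from the run summary.
RUNS = {
    "r0pbp4": (0.5844, 0.1500),
    "impksa": (0.5094, 0.1500),
    "hdznmf": (0.6604, 0.5542),
}


def drop(peak: float, final: float) -> float:
    """Absolute Avg@1 lost between the peak and the final eval."""
    return peak - final


def retention(peak: float, final: float) -> float:
    """Fraction of the peak Avg@1 still present at the final eval."""
    return final / peak


for name, (peak, final) in RUNS.items():
    print(f"{name}: drop={drop(peak, final):.4f} retained={retention(peak, final):.1%}")
```

The two 1e-4 runs keep well under a third of their peak score, while the selected run keeps most of it.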

Curves

Plots (images not reproduced in this text): held-out eval Avg@1, training reward mean, collapse diagnostics, and best-run zero tool behavior.

Findings

  • The two 1e-4 overnight runs showed early learning, then collapsed. r0pbp4 peaked at 0.5844 on step 20 and finished at 0.1500; impksa peaked at 0.5094 on step 20 and also finished at 0.1500.
  • The collapse correlated with no-tool behavior: both 200-step runs ended with stop_condition/all/no_tools_called = 1.0, zero_graph_patch_calls = 0.0, and graph_patch_success = 0.0.
  • The selected 0.1.11 run changed the training setup: a lower learning rate (2e-5), a shorter 80-step run, a larger decode budget (4096 max tokens), stricter rewards, and evals every 5 steps. It held up much better, peaking at 0.6604 on step 75 and finishing at 0.5542 instead of collapsing.
  • The best run still has room to improve: at step 79, the last step with training metrics, the no-tool stop rate was 0.6719, average graph patch calls were 1.4531, and graph patch success was only 0.0312. Supporting signals were stronger: path argument validity 0.8750, target source match 0.7500, and zero-check pass 0.8281.
  • Prime hosted metrics for these runs do not expose a training-loss series, so the plots use held-out eval score, reward, filtering, stop conditions, and tool telemetry instead of loss.
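
The collapse signature described above can be checked mechanically from a step's metrics. The sketch below uses the metric names quoted in the findings; the 0.99 threshold and the function name are assumptions for illustration, not part of the environment.

```python
NO_TOOLS_KEY = "stop_condition/all/no_tools_called"


def looks_collapsed(metrics: dict, no_tool_threshold: float = 0.99) -> bool:
    """Flag a step where nearly every rollout ended without calling a tool
    and no graph patch was attempted or landed.

    The threshold is a hypothetical choice, not a value from the training service.
    """
    return (
        metrics.get(NO_TOOLS_KEY, 0.0) >= no_tool_threshold
        and metrics.get("zero_graph_patch_calls", 0.0) == 0.0
        and metrics.get("graph_patch_success", 0.0) == 0.0
    )


# End state of the two 1e-4 runs, as reported in the findings.
collapsed_end = {
    NO_TOOLS_KEY: 1.0,
    "zero_graph_patch_calls": 0.0,
    "graph_patch_success": 0.0,
}

# Step 79 of the selected run, as reported in the findings.
best_run_step_79 = {
    NO_TOOLS_KEY: 0.6719,
    "zero_graph_patch_calls": 1.4531,
    "graph_patch_success": 0.0312,
}

print(looks_collapsed(collapsed_end), looks_collapsed(best_run_step_79))
```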

Recommendation

Use the step-75 LoRA as the current best artifact. For the next full run, keep the 0.1.11 reward direction and the lower LR, but add stronger pressure against no-tool endings and make successful checked zero_graph_patch application a larger part of the reward. That way, high eval scores come from the intended graph-edit behavior rather than from partial credit for checking and summarization.

Trajectory Samples

Selected Prime RL rollout trajectories from the best run are published under trajectories/hdznmfje3xv0clwhu9sx4b0n/. The bundle includes normalized JSONL rows and the exact raw prime train rollouts pages for retained steps 0, 20, and 70.

Why This Exists

Most code-editing agents learn to patch source through line-oriented text operations. Zerolang exposes a graph-level editing surface where a patch is guarded by the expected graph hash and the expected field value. That makes edits auditable and harder to apply to stale or mismatched code.
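
The guard idea can be illustrated with a toy compare-and-set over an in-memory graph. This is a minimal sketch, not Zerolang's implementation: the dict-based graph, the SHA-256 hash, and the function names are all assumptions made for illustration.

```python
import hashlib
import json


class StalePatchError(Exception):
    """Raised when a patch's expectations do not match the current graph."""


def graph_hash(graph: dict) -> str:
    # Toy hash over a canonical JSON dump; real Zerolang graph hashes are computed differently.
    payload = json.dumps(graph, sort_keys=True).encode("utf-8")
    return "graph:" + hashlib.sha256(payload).hexdigest()[:16]


def checked_set(graph: dict, expect_graph_hash: str, node: str,
                field: str, expect: str, value: str) -> dict:
    """Return a patched copy of the graph, or raise if either guard fails."""
    if graph_hash(graph) != expect_graph_hash:
        raise StalePatchError("graph hash mismatch: the file changed since inspection")
    if graph[node][field] != expect:
        raise StalePatchError("field value mismatch: the node does not hold the expected value")
    patched = {node_id: dict(fields) for node_id, fields in graph.items()}
    patched[node][field] = value
    return patched


graph = {"#78ac4364": {"value": "66"}}
seen_hash = graph_hash(graph)
patched = checked_set(graph, seen_hash, "#78ac4364", "value", "66", "65")

# Replaying the same patch against the already-patched graph is rejected as stale.
try:
    checked_set(patched, seen_hash, "#78ac4364", "value", "66", "65")
except StalePatchError as err:
    print("rejected:", err)
```

Both guards fail closed: a patch computed against an older view of the file cannot silently land on newer code.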

This environment is designed to train that behavior directly. It rewards successful checked graph patches, while still checking that the resulting file compiles and matches the hidden target source.

Environment Summary

  • Package name: zerolang-editing
  • Prime environment ID: pandelis/zerolang-editing
  • Version in this repo: 0.1.8
  • Task type: multi-turn tool-use code editing
  • Agent harness: Roder with a custom Zero graph-only plugin/tool allowlist
  • Language under edit: Zerolang .0
  • Train split: 209 deterministic synthetic tasks
  • Eval split: 67 held-out deterministic synthetic tasks
  • Primary reward target: successful zero_graph_patch on the rollout file

Roder Harness

The intended RL setup runs the model inside Roder rather than a generic chat loop. Roder provides the coding-agent harness, while a custom zero-coder plugin configures the available tool surface for this environment.

That plugin is deliberately restrictive:

  • It exposes only Zerolang graph/check/fix/skills tools.
  • It removes generic text edit tools from the training harness.
  • It routes tool calls to on-disk .0 files using path arguments.
  • It keeps checked graph edits as the primary affordance for code changes.

This matters because the behavior we want to train is not "rewrite this source string". The target behavior is "inspect the Zerolang graph and apply a checked semantic graph patch to the file Roder is managing". The Verifiers environment then grades the resulting file from disk.

Rollout Contract

Each task row includes an initial Zerolang source program and a hidden target program. At rollout setup time, the environment writes the initial source to:

<temporary rollout workspace>/program.0

The model receives that path in the user prompt. Tools must operate on path arguments that point to this .0 file. Pasting the full source into tool calls is rejected because the training target is disk-backed graph editing, not source-string rewriting.

The environment canonicalizes recoverable path mistakes, such as missing paths or paths outside the rollout workspace, back to the rollout file and records those corrections. The path_argument_valid metric rewards clean tool calls that did not require correction.
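
A minimal sketch of that canonicalization, assuming a single rollout file per workspace. The function name and the rule that anything not resolving to the rollout file is redirected are simplifying assumptions; the environment's real logic may be more permissive.

```python
from pathlib import Path
from typing import Optional, Tuple


def canonicalize_tool_path(raw: Optional[str], rollout_file: str) -> Tuple[Path, bool]:
    """Return (path_to_use, was_corrected) for a tool call's path argument."""
    target = Path(rollout_file).resolve()
    if not raw:
        # Missing path argument: fall back to the rollout file and record the correction.
        return target, True
    candidate = Path(raw)
    if not candidate.is_absolute():
        # Interpret relative paths against the rollout workspace.
        candidate = target.parent / candidate
    if candidate.resolve() != target:
        # Outside the workspace (or a different file): redirect and record the correction.
        return target, True
    return target, False


def path_argument_valid(was_corrected: bool) -> float:
    # Clean calls earn the metric; corrected calls do not.
    return 0.0 if was_corrected else 1.0
```

For example, calling canonicalize_tool_path(None, rollout_file) returns the rollout file with was_corrected set to True, so the call still runs but earns no clean-path credit.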

Tools

The environment exposes only Zerolang-specific tools:

| Tool | Purpose |
| --- | --- |
| zero_check(path) | Run zero check --json against a .0 file. |
| zero_graph_summary(path) | Return compact graph hash and patchable node facts. |
| zero_graph_dump(path) | Run zero graph dump for detailed graph inspection. |
| zero_graph_json(path) | Run zero graph --json. |
| zero_fix_plan(path) | Run zero fix --plan --json. |
| zero_graph_patch(path, expect_graph_hash, op) | Apply one checked graph patch operation to the file. |
| zero_skills_get(skill) | Load version-matched Zerolang guidance such as language, diagnostics, or stdlib. |

Example checked patch shape:

zero graph patch program.0 \
  --expect-graph-hash graph:49dd208f8361c221 \
  --op 'set node="#78ac4364" field="value" expect="66" value="65"'
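
The same command can be assembled programmatically as an argument list, which avoids shell quoting problems around the op string. The helper below is a sketch that mirrors only the shape shown above; it does not escape quotes inside values, and how the real CLI expects such values to be escaped is not covered here.

```python
import shlex
from typing import List


def build_patch_argv(path: str, graph_hash: str, node: str,
                     field: str, expect: str, value: str) -> List[str]:
    """Build argv for a checked `set` patch, following the example shape."""
    # Assumption: values contain no double quotes, so no escaping is attempted.
    op = f'set node="{node}" field="{field}" expect="{expect}" value="{value}"'
    return ["zero", "graph", "patch", path,
            "--expect-graph-hash", graph_hash,
            "--op", op]


argv = build_patch_argv("program.0", "graph:49dd208f8361c221",
                        "#78ac4364", "value", "66", "65")

# shlex.join renders a shell-safe string for logging; pass argv itself to subprocess.run.
print(shlex.join(argv))
```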

Reward Metrics

The main rubric is weighted toward actually patching the graph and producing the hidden target program.

| Metric | Weight | Meaning |
| --- | --- | --- |
| graph_patch_success | 0.50 | A successful zero_graph_patch call edited the file to the hidden target. |
| target_source_match | 0.20 | The final on-disk source matches the target after whitespace normalization. |
| zero_check_pass | 0.15 | The edited file passes zero check --json. |
| zerolang_surface_used | 0.10 | The rollout used graph hashes, node IDs, expect, or graph-patch semantics. |
| path_argument_valid | 0.05 | Tool calls used the rollout .0 path without harness-side correction. |

The reward is intentionally not fully binary. A model can get partial credit for producing compilable code and using the right interface, but the highest reward requires the checked graph patch to land correctly.
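
To make the partial-credit structure concrete, the sketch below combines per-metric scores using the weights from the table. It assumes each metric is scored in [0, 1] and that the rubric total is a plain weighted sum; the function name is illustrative.

```python
# Weights copied from the reward metrics table.
WEIGHTS = {
    "graph_patch_success": 0.50,
    "target_source_match": 0.20,
    "zero_check_pass": 0.15,
    "zerolang_surface_used": 0.10,
    "path_argument_valid": 0.05,
}


def total_reward(scores: dict) -> float:
    """Weighted sum of per-metric scores; metrics missing from `scores` count as 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


# Every metric satisfied: the checked patch landed on the target.
full = {name: 1.0 for name in WEIGHTS}

# Everything satisfied except the checked patch itself.
no_patch = dict(full, graph_patch_success=0.0)

print(round(total_reward(full), 4), round(total_reward(no_patch), 4))
```

Under these assumptions, a rollout that reaches the target without a successful checked patch tops out at half of the maximum reward.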

Dataset Construction

The synthetic tasks are generated from canonical Zerolang snippets:

  1. Build an initial .0 program.
  2. Select a patchable semantic node, usually a literal, function value, call target, or printed diagnostic string.
  3. Mutate the semantic value to produce the target program.
  4. Store the target source and task metadata.
  5. During rollout, require the model to recover the target through graph tools.

The environment currently focuses on deterministic editing families where zero graph patch support is reliable. The task builders live in:

  • zerolang_editing/tasks.py
  • zerolang_editing/train_tasks.py
  • zerolang_editing/task_builders.py

Installation

Install from Prime Hub:

prime env install pandelis/zerolang-editing@0.1.8

Install from this repository:

uv sync
uv run python -m compileall zerolang_editing

Zerolang is required at runtime. If zero is not already on PATH, the tool wrapper checks $HOME/.zero/bin/zero and can download a release binary into a temporary install directory.
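
The lookup order described above (PATH first, then $HOME/.zero/bin/zero) can be sketched as follows. The function name and parameters are hypothetical, and the download fallback is deliberately left out.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def find_zero_binary(home: Optional[Path] = None,
                     search_path: Optional[str] = None) -> Optional[Path]:
    """Locate the zero executable, or return None if it is not installed.

    `home` and `search_path` exist only to make the sketch testable; by default
    the user's home directory and the PATH environment variable are used.
    """
    on_path = shutil.which("zero", path=search_path)
    if on_path:
        return Path(on_path)
    fallback = (home or Path.home()) / ".zero" / "bin" / "zero"
    if fallback.is_file() and os.access(fallback, os.X_OK):
        return fallback
    # The real wrapper can download a release binary at this point; not sketched here.
    return None


print(find_zero_binary())
```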

Local Eval

prime eval run ./environments/zerolang_editing \
  -m poolside/laguna-xs.2 \
  -n 3 -r 1 -t 2048 -T 0.4 \
  -a '{"split":"eval","max_turns":10}' \
  -s -d -A

For quick package-level validation:

cd environments/zerolang_editing
uv run python -m compileall zerolang_editing
uv run python - <<'PY'
from zerolang_editing.zerolang_editing import load_environment
env = load_environment(split="eval", max_examples=1, max_turns=2)
print(type(env).__name__, len(env.dataset))
PY

Hosted RL Configuration

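The overnight Laguna XS.2 runs (the 200-step, 1e-4 runs in the training report) use the settings below. The selected hdznmf run differed: learning rate 2e-5, 80 steps, and max_tokens 4096.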

model = "poolside/Laguna-XS.2"
max_steps = 200
batch_size = 64
rollouts_per_example = 8
learning_rate = 1e-4

[sampling]
max_tokens = 2048
temperature = 0.4
enable_thinking = true

The config is stored in:

configs/rl/zerolang-editing-laguna-xs2-overnight.toml

Previous Training Signal

A 20-step stress run on poolside/Laguna-XS.2 completed successfully before the overnight scale-up:

  • Baseline eval Avg@1: 0.1500
  • Step 15 eval Avg@1: 0.2357
  • Final eval Avg@1: 0.2250
  • First 10 train-step reward average: 0.1606
  • Last 10 train-step reward average: 0.2056
  • No fatal orchestrator errors, no eval truncation, and no rollouts without a response.

The main failure signatures were invalid tool paths: missing path arguments and paths outside the rollout workspace. Version 0.1.8 keeps the path sandbox but converts recoverable path mistakes into canonicalized calls against the rollout file and adds a small clean-path reward term.

Repository Contents

README.md
pyproject.toml
uv.lock
configs/
  rl/
    zerolang-editing-laguna-xs2-20step.toml
    zerolang-editing-laguna-xs2-overnight.toml
zerolang_editing/
  __init__.py
  task_builders.py
  tasks.py
  train_tasks.py
  zero_tools.py
  zerolang_editing.py

Build artifacts, local virtualenvs, Zerolang caches, rollout outputs, and compiled Python caches are intentionally excluded from the Hugging Face repo.

Limitations

  • The task distribution is synthetic and should be expanded before treating the trained behavior as general Zerolang editing competence.
  • Current graph-edit families focus on reliable literal/value style patches.
  • The environment is designed for RL tool-use behavior, not as a standalone benchmark of general coding ability.
  • This repo contains the environment source, not final model weights.