SWE-Bench Trajectory Eval Bundle (v1)

Companion artifact for the trajectory-probe downstream eval of the code-graph-v7 encoders (W1, I6, ...).

traj_full_bundle.tar.gz (488 MB) — contains:
- specs.jsonl: 2456 SWE-Bench Verified agent trajectories harvested from the swe-bench-submissions S3 bucket. Fields: instance_id, traj_id, repo, base_commit, patches (one entry: the final model patch), resolved.
- repos/: partial, blobless (--filter=blob:none) clones of the 12 target repos (django, sympy, sphinx, matplotlib, scikit-learn, astropy, xarray, pytest, pylint, requests, seaborn, flask). ~671 MB uncompressed. Blobs are fetched lazily from each clone's remote when a base_commit is checked out, so checkouts need network access to that remote.
- graphjepa/: pipeline code (trajectory_pipeline, trajectory_realize, trajectory_probe, trajectory_harvest) plus scripts/trajectory_full.sh.
harvest.log — stdout from the S3 harvester that produced specs.jsonl.
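
Before running the pipeline, it can help to confirm that specs.jsonl parses and carries the fields listed above. The following is a minimal sketch, not part of the bundled graphjepa code: load_specs and EXPECTED_FIELDS are hypothetical names introduced here, and the only assumption about the file is that each non-empty line is one JSON object with those six keys.

```python
import json
from pathlib import Path

# Field names as listed for specs.jsonl in the bundle description.
EXPECTED_FIELDS = {"instance_id", "traj_id", "repo", "base_commit", "patches", "resolved"}


def load_specs(path):
    """Read a JSONL file into a list of dicts, checking each record's fields.

    Hypothetical helper for illustration; it is not shipped in graphjepa/.
    """
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = EXPECTED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
            records.append(record)
    return records
```

Called on the extracted specs.jsonl, this would be expected to return 2456 records if the file matches the description above.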

Downstream workflow

tar -xzf traj_full_bundle.tar.gz
rsync -a traj_full/graphjepa/ graphjepa/
mkdir -p outputs/traj_real
cp traj_full/specs.jsonl outputs/traj_real/
mv traj_full/repos outputs/traj_real/repos

# realize (4 sharded workers by repo)
SHARDS=4 bash graphjepa/scripts/trajectory_full.sh
tail -f outputs/traj_real/logs/realize_shard*.log

# merge manifests + probe with each encoder
cat outputs/traj_real/manifest_shard*.jsonl > outputs/traj_real/manifest.jsonl
for NAME in W1_softplus_s0 I6_joint_s0; do
  .venv/bin/python -m graphjepa.trajectory_probe \
    --manifest outputs/traj_real/manifest.jsonl \
    --ckpt outputs/${NAME}/ckpt_final.pt \
    --pool mean --split-by repo \
    --output outputs/traj_real/probe_${NAME}.json
done
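
The cat merge above copies shard manifests byte for byte, so a shard whose last line was cut short (for example, if a realize worker were interrupted mid-write) would pass through unnoticed. As a hedged alternative, the sketch below performs the same concatenation in Python while checking that every line is valid JSON. merge_manifests is a hypothetical helper, not part of graphjepa, and it assumes nothing about the manifest schema beyond one JSON value per line.

```python
import json
from pathlib import Path


def merge_manifests(out_dir, pattern="manifest_shard*.jsonl", merged_name="manifest.jsonl"):
    """Concatenate shard manifests into one JSONL file and return the record count.

    Hypothetical helper for illustration; file names follow the workflow above.
    """
    out_dir = Path(out_dir)
    # Resolve the shard list before the merged file is opened for writing.
    shards = sorted(out_dir.glob(pattern))
    if not shards:
        raise FileNotFoundError(f"no files matching {pattern} in {out_dir}")
    count = 0
    with (out_dir / merged_name).open("w", encoding="utf-8") as out:
        for shard in shards:
            with shard.open(encoding="utf-8") as fh:
                for lineno, line in enumerate(fh, start=1):
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    try:
                        json.loads(line)  # validate only; the schema is not inspected
                    except json.JSONDecodeError as exc:
                        raise ValueError(f"{shard.name}:{lineno}: invalid JSON ({exc})") from exc
                    out.write(line + "\n")
                    count += 1
    return count
```

For the layout used here, the call would be merge_manifests("outputs/traj_real"). Shards are merged in sorted filename order, which matches the order the shell glob produces in most locales.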

Provenance

Specs harvested from 5 SWE-Bench Verified submissions:

Submission	N	Resolved	Rate
20240620_sweagent_claude3.5sonnet	485	168	34.6%
20241022_tools_claude-3-5-sonnet-updated	483	245	50.7%
20241028_agentless-1.5_gpt4o	495	194	39.2%
20241029_OpenHands-CodeAct-2.1-sonnet	493	265	53.8%
20250405_amazon-q-developer-2025	500	330	66.0%
total	2456	1202	48.9%

500 unique instance_ids, 499 unique base_commits (median 5 trajectories per commit — different agents attempting the same task).
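
The aggregate figures above can be rechecked from specs.jsonl using only the listed fields. The sketch below is illustrative rather than the harvester's own accounting: summarize is a hypothetical helper, and it assumes resolved is a boolean (or otherwise truthy/falsy) value on each record.

```python
from collections import Counter
from statistics import median


def summarize(specs):
    """Summarize trajectory records: totals, resolve rate, and per-commit multiplicity.

    `specs` is a list of dicts with instance_id, base_commit, and resolved keys.
    Assumes `resolved` is truthy for resolved trajectories.
    """
    per_commit = Counter(s["base_commit"] for s in specs)
    resolved = sum(1 for s in specs if s["resolved"])
    return {
        "n": len(specs),
        "resolved": resolved,
        "rate_pct": round(100 * resolved / len(specs), 1) if specs else 0.0,
        "unique_instance_ids": len({s["instance_id"] for s in specs}),
        "unique_base_commits": len(per_commit),
        "median_traj_per_commit": median(per_commit.values()) if per_commit else 0,
    }
```

Run on the full set of records, the result would be expected to match the total row of the table (2456 trajectories, 1202 resolved, 48.9%) and the unique counts stated above. A mismatch would suggest a truncated or modified specs.jsonl.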
