Buckets:
| suite,model,model_slug,source_kind,label,eval,artifact_path,screenshot_desktop_path,screenshot_mobile_path,screenshot_deep_path,screenshot_mobile_deep_path,artifact_bytes,generation_ok,generation_duration_s,input_tokens,output_tokens,total_tokens,billing_tokens,reasoning_tokens,tool_use_tokens,cache_read_tokens,cache_write_tokens,cache_hit_tokens,total_cache_tokens,effective_input_tokens,display_input_tokens,usage_event_count,tool_calls,turn_count,self_check_attempted,self_check_ran,self_check_succeeded,self_check_runs,self_check_failed_runs,self_check_successful_runs,self_correction_edits,self_corrected_after_checker,self_correction_verified,assistant_turns_trace,self_check_mode,self_check_evidence,deterministic_failures,deterministic_warnings,vlm_failures,vlm_warnings,deterministic_failure_units,deterministic_warning_units,vlm_failure_units,vlm_warning_units,desktop_failures,desktop_warnings,mobile_failures,mobile_warnings,deep_failures,deep_warnings,mobile_deep_failures,mobile_deep_warnings,artifact_present,artifact_score_100,task_score,task_score_max,quality_score,quality_cap_reason,quality_class | |
| publish,codexresponses.gpt-5.4-mini,codexresponses-gpt-5-4-mini,clean-final,skill-with-shell-codexresponses-gpt-5-4-mini-publication-final,numeric-data,results/publish/models/codexresponses-gpt-5-4-mini/artifacts/numeric-data.html,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/numeric-data-desktop.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/numeric-data-mobile.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/numeric-data-deep.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/numeric-data-mobile-deep.png,41655,True,233.57,257043,19565,276608,276608,13843,0,0,0,236032,236032,21011,257043,12,16,12,True,True,True,2,1,1,0,False,True,12,run-checker-cli,ran checker CLI: python /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-codexresponses-gpt-5-4-mini-publica,0,2,0,0,0,1,0,0,0,1,0,0,0,1,0,0,True,99,19.8,20,99,,warn | |
| publish,codexresponses.gpt-5.4-mini,codexresponses-gpt-5-4-mini,clean-final,skill-with-shell-codexresponses-gpt-5-4-mini-publication-final,code-review,results/publish/models/codexresponses-gpt-5-4-mini/artifacts/code-review.html,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/code-review-desktop.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/code-review-mobile.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/code-review-deep.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/code-review-mobile-deep.png,40247,True,251.091,1602209,16541,1618750,1618750,10735,0,0,0,1516544,1516544,85665,1602209,24,39,24,True,True,True,3,1,2,0,False,True,24,"checker-cli-error,run-checker-cli","ran checker CLI: cd /home/shaun/source/birch-html && uv run python skill/scripts/check_birch_renderings.py --help | sed -n '1,220p' | checker CLI usage error | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-codexresponses-gpt-5-4-mini-publicatio | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-codexres",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,codexresponses.gpt-5.4-mini,codexresponses-gpt-5-4-mini,clean-final,skill-with-shell-codexresponses-gpt-5-4-mini-publication-final,module-explainer,results/publish/models/codexresponses-gpt-5-4-mini/artifacts/module-explainer.html,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/module-explainer-desktop.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/module-explainer-mobile.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/module-explainer-deep.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/module-explainer-mobile-deep.png,51503,True,228.357,538144,20613,558757,558757,12973,0,0,0,489472,489472,48672,538144,14,29,14,True,True,True,2,0,2,0,False,False,14,"checker-shell-reference,read-checker,run-checker-cli","read /home/shaun/source/birch-html/scripts/check_birch_renderings.py | shell referenced checker: rg -n ""^def (contract_findings|compare_stats|screenshot_findings|artifact_screenshot_findings|geometry_findings|render_markdown|capture|find_chrome|capture_height_for_viewport|css_ | ran checker CLI: mkdir -p /home/shaun/source/birch-html/eval-runs/skill-with-shell-codexresponses-gpt-5-4-mini-publication-final && cat > /home/shaun/source/birch-html/eval-runs/skill-with-shell-co | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-codexresponses-gpt-5-4-mini-publication-fina",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,codexresponses.gpt-5.4-mini,codexresponses-gpt-5-4-mini,clean-final,skill-with-shell-codexresponses-gpt-5-4-mini-publication-final,implementation-plan,results/publish/models/codexresponses-gpt-5-4-mini/artifacts/implementation-plan.html,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/implementation-plan-desktop.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/implementation-plan-mobile.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/implementation-plan-deep.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/implementation-plan-mobile-deep.png,48838,True,249.193,122451,13529,135980,135980,8129,0,0,0,103936,103936,18515,122451,8,11,8,True,True,True,2,1,1,0,False,True,8,run-checker-cli,"ran checker CLI: cat > /home/shaun/source/birch-html/eval-runs/skill-with-shell-codexresponses-gpt-5-4-mini-publication-final/implementation-plan.html <<'EOF' | |
| <!doctype html> | |
| <html lang=""en""> | |
| <head | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-codexresponses-gpt-5-4-mini-publicatio | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-codexres",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,codexresponses.gpt-5.4-mini,codexresponses-gpt-5-4-mini,clean-final,skill-with-shell-codexresponses-gpt-5-4-mini-publication-final,benchmark-comparison,results/publish/models/codexresponses-gpt-5-4-mini/artifacts/benchmark-comparison.html,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/codexresponses-gpt-5-4-mini/reports/screenshots/benchmark-comparison-mobile-deep.png,55271,True,193.592,280048,17564,297612,297612,9912,0,0,0,261120,261120,18928,280048,14,18,14,True,True,True,4,3,1,0,False,True,14,run-checker-cli,"ran checker CLI: cd /home/shaun/source/birch-html && mkdir -p eval-runs/skill-with-shell-codexresponses-gpt-5-4-mini-publication-final && uv run --with matplotlib python - <<'PY' | |
| from pathlib impor | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-codexresponses-gpt-5-4-mini-publicatio | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-codexres | ran checker CLI: python3 - <<'PY' | |
| from pathlib import Path | |
| path = Path('/home/shaun/source/birch-html/eval-runs/skill-with-shell-codexresponses-gpt-5-4-mini-publication-final/benchmark-comparison.h | ran checker CLI: python3 - <<'PY' | |
| from pathlib import Path | |
| import re | |
| path = Path('/home/shaun/source/birch-html/eval-runs/skill-with-shell-codexresponses-gpt-5-4-mini-publication-final/benchmark-co",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,codexresponses.gpt-5.5,codexresponses-gpt-5-5,clean-final,skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-experiment-20260524-164522,numeric-data,results/publish/models/codexresponses-gpt-5-5/artifacts/numeric-data.html,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/numeric-data-desktop.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/numeric-data-mobile.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/numeric-data-deep.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/numeric-data-mobile-deep.png,42203,True,126.071,73486,5728,79214,79214,449,0,0,0,52736,52736,20750,73486,8,11,8,True,True,True,2,1,1,0,False,True,8,run-checker-cli,ran checker CLI: uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-experiment-20260524-164522/nume | ran checker CLI: uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-e,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,codexresponses.gpt-5.5,codexresponses-gpt-5-5,clean-final,skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-experiment-20260524-164522,code-review,results/publish/models/codexresponses-gpt-5-5/artifacts/code-review.html,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/code-review-desktop.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/code-review-mobile.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/code-review-deep.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/code-review-mobile-deep.png,42437,True,114.697,151259,4995,156254,156254,1208,0,0,0,122368,122368,28891,151259,9,11,9,True,True,True,2,1,1,0,False,True,9,run-checker-cli,"ran checker CLI: uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-e | ran checker CLI: python - <<'PY' | |
| from pathlib import Path | |
| p=Path('/home/shaun/source/birch-html/eval-runs/skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-experiment-20260524-164522/code",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,codexresponses.gpt-5.5,codexresponses-gpt-5-5,clean-final,skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-experiment-20260524-164522,module-explainer,results/publish/models/codexresponses-gpt-5-5/artifacts/module-explainer.html,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/module-explainer-desktop.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/module-explainer-mobile.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/module-explainer-deep.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/module-explainer-mobile-deep.png,55010,True,166.583,315269,8441,323710,323710,504,0,0,0,259584,259584,55685,315269,10,22,10,True,True,True,1,0,1,0,False,False,10,"checker-shell-reference,read-checker,run-checker-cli","read /home/shaun/source/birch-html/scripts/check_birch_renderings.py | shell referenced checker: cd /home/shaun/source/birch-html && rg -n ""^(def|class) "" scripts/check_birch_renderings.py scripts/birch_mpl.py evals/charts/run_eval.py evals/charts/build_chart_brief.py | ran checker CLI: cd /home/shaun/source/birch-html && mkdir -p eval-runs/skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-experiment-20260524-164522 && python - <<'PY' | |
| from pathlib import",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,codexresponses.gpt-5.5,codexresponses-gpt-5-5,clean-final,skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-experiment-20260524-164522,implementation-plan,results/publish/models/codexresponses-gpt-5-5/artifacts/implementation-plan.html,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/implementation-plan-desktop.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/implementation-plan-mobile.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/implementation-plan-deep.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/implementation-plan-mobile-deep.png,48834,True,141.971,98974,6433,105407,105407,451,0,0,0,79872,79872,19102,98974,9,11,9,True,True,True,2,0,2,0,False,False,9,run-checker-cli,"ran checker CLI: cat > /home/shaun/source/birch-html/eval-runs/skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-experiment-20260524-164522/implementation-plan.html <<'EOF' | |
| <!doctype html | ran checker CLI: python - <<'PY' | |
| from pathlib import Path | |
| p=Path('/home/shaun/source/birch-html/eval-runs/skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-experiment-20260524-164522/impl",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,codexresponses.gpt-5.5,codexresponses-gpt-5-5,clean-final,skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-experiment-20260524-164522,benchmark-comparison,results/publish/models/codexresponses-gpt-5-5/artifacts/benchmark-comparison.html,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/codexresponses-gpt-5-5/reports/screenshots/benchmark-comparison-mobile-deep.png,52072,True,121.208,127399,5963,133362,133362,565,0,0,0,94208,94208,33191,127399,11,14,11,True,True,True,2,1,1,0,False,True,11,run-checker-cli,ran checker CLI: uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-experiment-20260524-164522/benc | ran checker CLI: uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-codexresponses-gpt-5-5-opus-gpt55-deepseek-e,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,codexspark,codexspark,clean-final,skill-with-shell-codexspark-publication-final,numeric-data,results/publish/models/codexspark/artifacts/numeric-data.html,results/publish/models/codexspark/reports/screenshots/numeric-data-desktop.png,results/publish/models/codexspark/reports/screenshots/numeric-data-mobile.png,results/publish/models/codexspark/reports/screenshots/numeric-data-deep.png,results/publish/models/codexspark/reports/screenshots/numeric-data-mobile-deep.png,17281,True,82.34,825347,23923,849270,849270,13374,0,0,0,770688,770688,54659,825347,32,31,32,False,False,False,0,0,0,0,False,False,32,,,6,2,1,0,2,1,1,0,1,1,2,0,1,1,2,0,True,35.0,7.0,20,35.0,missing_birch_css,fail | |
| publish,codexspark,codexspark,clean-final,skill-with-shell-codexspark-publication-final,code-review,results/publish/models/codexspark/artifacts/code-review.html,results/publish/models/codexspark/reports/screenshots/code-review-desktop.png,results/publish/models/codexspark/reports/screenshots/code-review-mobile.png,results/publish/models/codexspark/reports/screenshots/code-review-deep.png,results/publish/models/codexspark/reports/screenshots/code-review-mobile-deep.png,9658,False,60.395,1737615,21291,1758906,1758906,17081,0,0,0,1702656,1702656,86941,1789597,41,32,26,True,True,True,3,0,3,0,False,False,41,"checker-shell-reference,read-checker","read /home/shaun/source/birch-html/scripts/check_birch_renderings.py | shell referenced checker: nl -ba /home/shaun/source/birch-html/scripts/check_birch_renderings.py | sed -n '1,260p' | shell referenced checker: nl -ba /home/shaun/source/birch-html/scripts/check_birch_renderings.py | sed -n '260,560p' | shell referenced checker: nl -ba /home/shaun/source/birch-html/scripts/check_birch_renderings.py | sed -n '560,920p' | shell referenced checker: nl -ba /home/shaun/source/birch-html/scripts/check_birch_renderings.py | sed -n '920,1320p'",8,0,0,0,2,0,0,0,2,0,2,0,2,0,2,0,True,35.0,7.0,20,35.0,missing_birch_css,fail | |
| publish,codexspark,codexspark,clean-final,skill-with-shell-codexspark-publication-final,module-explainer,results/publish/models/codexspark/artifacts/module-explainer.html,results/publish/models/codexspark/reports/screenshots/module-explainer-desktop.png,results/publish/models/codexspark/reports/screenshots/module-explainer-mobile.png,results/publish/models/codexspark/reports/screenshots/module-explainer-deep.png,results/publish/models/codexspark/reports/screenshots/module-explainer-mobile-deep.png,16366,False,87.747,2740590,27049,2767639,2767639,15704,0,0,0,2024320,2024320,202803,2227123,35,51,42,True,False,False,0,0,0,0,False,False,35,"checker-shell-reference,read-checker,run-checker-cli","read /home/shaun/source/birch-html/scripts/check_birch_renderings.py | shell referenced checker: cd /home/shaun/source/birch-html && rg -n ""class\s*=\""(flow-node|flow-edge|flow-list|flow-step|metric-row|chart-panel|finding|code-block|copyable|timeline)"" styles/birch-system.css | shell referenced checker: cd /home/shaun/source/birch-html && wc -l scripts/check_birch_renderings.py | shell referenced checker: cd /home/shaun/source/birch-html && mkdir -p eval-runs/skill-with-shell-codexspark-publication-final && cat > eval-runs/skill-with-shell-codexspark-publication-final/module-explain | ran checker CLI: cd /home/shaun/source/birch-html && cat > eval-runs/skill-with-shell-codexspark-publication-final/module-explainer.html <<'EOF' | |
| <!doctype html> | |
| <html lang=""en""> | |
| <head> | |
| <meta char",0,4,4,0,0,1,1,0,0,1,0,1,0,1,0,1,True,91,18.2,20,91,,fail | |
| publish,codexspark,codexspark,clean-final,skill-with-shell-codexspark-publication-final,implementation-plan,results/publish/models/codexspark/artifacts/implementation-plan.html,results/publish/models/codexspark/reports/screenshots/implementation-plan-desktop.png,results/publish/models/codexspark/reports/screenshots/implementation-plan-mobile.png,results/publish/models/codexspark/reports/screenshots/implementation-plan-deep.png,results/publish/models/codexspark/reports/screenshots/implementation-plan-mobile-deep.png,46864,True,91.953,1108319,14746,1123065,1123065,8043,0,0,0,1055232,1055232,53087,1108319,35,37,35,True,True,True,2,0,2,0,False,False,35,"checker-cli-error,checker-shell-reference,read-checker,run-checker-cli","read /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py | ran checker CLI: cat > /home/shaun/source/birch-html/eval-runs/skill-with-shell-codexspark-publication-final/implementation-plan.html <<'EOF' | |
| <!doctype html> | |
| <html lang=""en""> | |
| <head> | |
| <meta charset | ran checker CLI: python3 /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py --help | head -n 120 | checker CLI usage error | ran checker CLI: cd /home/shaun/source/birch-html && uv run skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-codexspark-publication-final/implementation-plan.html --no- | ran checker CLI: python - <<'PY' | |
| from pathlib import Path | |
| from inspect import getsourcelines | |
| import importlib.util | |
| p=Path('/home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py') | |
| te",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,codexspark,codexspark,clean-final,skill-with-shell-codexspark-publication-final,benchmark-comparison,results/publish/models/codexspark/artifacts/benchmark-comparison.html,results/publish/models/codexspark/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/codexspark/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/codexspark/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/codexspark/reports/screenshots/benchmark-comparison-mobile-deep.png,55786,True,41.038,681289,5651,686940,686940,4100,0,0,0,628224,628224,53065,681289,24,23,24,False,False,False,0,0,0,0,False,False,24,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,deepseek,deepseek,clean-final,skill-with-shell-deepseek-publication-final,numeric-data,results/publish/models/deepseek/artifacts/numeric-data.html,results/publish/models/deepseek/reports/screenshots/numeric-data-desktop.png,results/publish/models/deepseek/reports/screenshots/numeric-data-mobile.png,results/publish/models/deepseek/reports/screenshots/numeric-data-deep.png,results/publish/models/deepseek/reports/screenshots/numeric-data-mobile-deep.png,62489,True,280.24,594128,18097,612225,612225,0,0,0,0,560512,560512,33616,594128,18,20,18,True,True,True,2,1,1,0,False,True,18,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-deepseek-publication-final/numeric-dat | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-deepseek,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,deepseek,deepseek,clean-final,skill-with-shell-deepseek-publication-final,code-review,results/publish/models/deepseek/artifacts/code-review.html,results/publish/models/deepseek/reports/screenshots/code-review-desktop.png,results/publish/models/deepseek/reports/screenshots/code-review-mobile.png,results/publish/models/deepseek/reports/screenshots/code-review-deep.png,results/publish/models/deepseek/reports/screenshots/code-review-mobile-deep.png,62789,True,294.1,784186,14634,798820,798820,0,0,0,0,749440,749440,34746,784186,26,30,26,True,True,True,3,1,2,0,False,True,26,"checker-shell-reference,run-checker-cli","ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-deepseek-publication-final/code-review | shell referenced checker: cd /home/shaun/source/birch-html && head -30 skill/scripts/check_birch_renderings.py | grep -A5 ""add_argument"" | shell referenced checker: cd /home/shaun/source/birch-html && grep -n ""artifact\|--artifact"" skill/scripts/check_birch_renderings.py | head -10 | ran checker CLI: cd /home/shaun/source/birch-html && rm -f skill/reports/birch-rendering-check.json skill/reports/birch-rendering-check.md && uv run --with pillow python skill/scripts/check_birch_r | shell referenced checker: cd /home/shaun/source/birch-html && grep -n ""ROOT\s*="" skill/scripts/check_birch_renderings.py | head -3 | ran checker CLI: cd /home/shaun/source/birch-html && rm -f skill/reports/birch-rendering-check-code-review.json && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /ho",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,deepseek,deepseek,clean-final,skill-with-shell-deepseek-publication-final,module-explainer,results/publish/models/deepseek/artifacts/module-explainer.html,results/publish/models/deepseek/reports/screenshots/module-explainer-desktop.png,results/publish/models/deepseek/reports/screenshots/module-explainer-mobile.png,results/publish/models/deepseek/reports/screenshots/module-explainer-deep.png,results/publish/models/deepseek/reports/screenshots/module-explainer-mobile-deep.png,31473,False,177.334,215656,9938,225594,225594,0,0,0,0,449920,449920,48511,498431,10,10,6,True,True,True,2,1,1,0,False,True,10,read-checker,read /home/shaun/source/birch-html/scripts/check_birch_renderings.py,8,1,7,0,3,1,2,0,1,1,3,0,1,0,3,0,True,20.0,4.0,20,20.0,missing_birch_css_and_visibly_unstyled,fail | |
| publish,deepseek,deepseek,clean-final,skill-with-shell-deepseek-publication-final,implementation-plan,results/publish/models/deepseek/artifacts/implementation-plan.html,results/publish/models/deepseek/reports/screenshots/implementation-plan-desktop.png,results/publish/models/deepseek/reports/screenshots/implementation-plan-mobile.png,results/publish/models/deepseek/reports/screenshots/implementation-plan-deep.png,results/publish/models/deepseek/reports/screenshots/implementation-plan-mobile-deep.png,52099,True,112.544,173739,6911,180650,180650,0,0,0,0,160128,160128,13611,173739,12,15,12,True,True,True,1,0,1,0,False,False,12,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-deepseek-publication-final/implementat,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,deepseek,deepseek,clean-final,skill-with-shell-deepseek-publication-final,benchmark-comparison,results/publish/models/deepseek/artifacts/benchmark-comparison.html,results/publish/models/deepseek/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/deepseek/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/deepseek/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/deepseek/reports/screenshots/benchmark-comparison-mobile-deep.png,78962,True,378.136,767427,27984,795411,795411,0,0,0,0,717696,717696,49731,767427,18,22,18,True,False,False,0,0,0,0,False,False,18,checker-shell-reference,"shell referenced checker: cd /home/shaun/source/birch-html && ls skill/scripts/check_birch_renderings.py 2>&1 && echo ""---"" && head -5 eval-runs/skill-with-shell-deepseek-publication-final/benchmark-compari",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,gemini35flash,gemini35flash,clean-final,skill-with-shell-gemini35flash-publication-final,numeric-data,results/publish/models/gemini35flash/artifacts/numeric-data.html,results/publish/models/gemini35flash/reports/screenshots/numeric-data-desktop.png,results/publish/models/gemini35flash/reports/screenshots/numeric-data-mobile.png,results/publish/models/gemini35flash/reports/screenshots/numeric-data-deep.png,results/publish/models/gemini35flash/reports/screenshots/numeric-data-mobile-deep.png,53215,True,114.216,1371616,5260,1376876,1376876,12418,0,0,0,1116684,1116684,254932,1371616,29,28,29,True,True,True,2,1,1,0,False,True,29,run-checker-cli,ran checker CLI: uv run --with pillow python3 skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-gemini35flash-publication-final/numeric-data.html | ran checker CLI: uv run --with pillow python3 skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-gemini35flash-publication-final/numeric-dat,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,gemini35flash,gemini35flash,clean-final,skill-with-shell-gemini35flash-publication-final,code-review,results/publish/models/gemini35flash/artifacts/code-review.html,results/publish/models/gemini35flash/reports/screenshots/code-review-desktop.png,results/publish/models/gemini35flash/reports/screenshots/code-review-mobile.png,results/publish/models/gemini35flash/reports/screenshots/code-review-deep.png,results/publish/models/gemini35flash/reports/screenshots/code-review-mobile-deep.png,53047,True,193.238,1684136,6902,1691038,1691038,23273,0,0,0,1424691,1424691,259445,1684136,34,33,34,True,True,True,3,1,2,0,False,True,34,"checker-cli-error,run-checker-cli",ran checker CLI: python3 /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py --help | checker CLI usage error | ran checker CLI: python3 /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-gemini35flash-publication-final/co | ran checker CLI: python3 /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py --no-capture --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-gemini35flash-publica,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,gemini35flash,gemini35flash,clean-final,skill-with-shell-gemini35flash-publication-final,module-explainer,results/publish/models/gemini35flash/artifacts/module-explainer.html,results/publish/models/gemini35flash/reports/screenshots/module-explainer-desktop.png,results/publish/models/gemini35flash/reports/screenshots/module-explainer-mobile.png,results/publish/models/gemini35flash/reports/screenshots/module-explainer-deep.png,results/publish/models/gemini35flash/reports/screenshots/module-explainer-mobile-deep.png,57420,True,203.178,2196880,10222,2207102,2207102,22501,0,0,0,1965131,1965131,231749,2196880,33,32,33,True,True,False,2,2,0,0,False,False,33,"read-checker,run-checker-cli",read scripts/check_birch_renderings.py | ran checker CLI: python3 scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-gemini35flash-publication-final/module-explainer.html,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,gemini35flash,gemini35flash,clean-final,skill-with-shell-gemini35flash-publication-final,implementation-plan,results/publish/models/gemini35flash/artifacts/implementation-plan.html,results/publish/models/gemini35flash/reports/screenshots/implementation-plan-desktop.png,results/publish/models/gemini35flash/reports/screenshots/implementation-plan-mobile.png,results/publish/models/gemini35flash/reports/screenshots/implementation-plan-deep.png,results/publish/models/gemini35flash/reports/screenshots/implementation-plan-mobile-deep.png,49628,True,201.715,2346900,9173,2356073,2356073,15150,0,0,0,2043078,2043078,303822,2346900,34,33,34,True,True,True,5,4,1,0,False,False,34,"checker-cli-error,run-checker-cli",ran checker CLI: python3 skill/scripts/check_birch_renderings.py --help | checker CLI usage error | ran checker CLI: python3 skill/scripts/check_birch_renderings.py --artifact temp_plan.html | ran checker CLI: python3 skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/temp_plan.html | ran checker CLI: python3 skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-gemini35flash-publication-final/implementation-plan.html,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,gemini35flash,gemini35flash,clean-final,skill-with-shell-gemini35flash-publication-final,benchmark-comparison,results/publish/models/gemini35flash/artifacts/benchmark-comparison.html,results/publish/models/gemini35flash/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/gemini35flash/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/gemini35flash/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/gemini35flash/reports/screenshots/benchmark-comparison-mobile-deep.png,97390,True,62.077,495825,829,496654,496654,4961,0,0,0,387138,387138,108687,495825,17,16,17,True,True,False,1,1,0,0,False,False,17,run-checker-cli,ran checker CLI: python3 /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-gemini35flash-publication-final/be,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,glm51,glm51,clean-final,skill-with-shell-glm51-publication-final,numeric-data,results/publish/models/glm51/artifacts/numeric-data.html,results/publish/models/glm51/reports/screenshots/numeric-data-desktop.png,results/publish/models/glm51/reports/screenshots/numeric-data-mobile.png,results/publish/models/glm51/reports/screenshots/numeric-data-deep.png,results/publish/models/glm51/reports/screenshots/numeric-data-mobile-deep.png,62971,True,300.114,459899,16275,476174,476174,0,0,0,0,369152,369152,90747,459899,15,16,15,True,True,False,1,1,0,0,False,False,15,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-glm51-publication-final/numeric-data.h,0,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,True,99,19.8,20,99,,warn | |
| publish,glm51,glm51,clean-final,skill-with-shell-glm51-publication-final,code-review,results/publish/models/glm51/artifacts/code-review.html,results/publish/models/glm51/reports/screenshots/code-review-desktop.png,results/publish/models/glm51/reports/screenshots/code-review-mobile.png,results/publish/models/glm51/reports/screenshots/code-review-deep.png,results/publish/models/glm51/reports/screenshots/code-review-mobile-deep.png,48933,True,133.324,254816,8008,262824,262824,0,0,0,0,202560,202560,52256,254816,11,13,11,True,True,True,1,0,1,0,False,False,11,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-glm51-publication-final/code-review.ht,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,True,92,18.4,20,92,,fail | |
| publish,glm51,glm51,clean-final,skill-with-shell-glm51-publication-final,module-explainer,results/publish/models/glm51/artifacts/module-explainer.html,results/publish/models/glm51/reports/screenshots/module-explainer-desktop.png,results/publish/models/glm51/reports/screenshots/module-explainer-mobile.png,results/publish/models/glm51/reports/screenshots/module-explainer-deep.png,results/publish/models/glm51/reports/screenshots/module-explainer-mobile-deep.png,54229,True,94.822,358438,6652,365090,365090,0,0,0,0,254656,254656,103782,358438,9,15,9,True,True,True,1,0,1,0,False,False,9,"read-checker,run-checker-cli",read /home/shaun/source/birch-html/scripts/check_birch_renderings.py | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-glm51-publication-final/module-explainer.htm,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,glm51,glm51,clean-final,skill-with-shell-glm51-publication-final,implementation-plan,results/publish/models/glm51/artifacts/implementation-plan.html,results/publish/models/glm51/reports/screenshots/implementation-plan-desktop.png,results/publish/models/glm51/reports/screenshots/implementation-plan-mobile.png,results/publish/models/glm51/reports/screenshots/implementation-plan-deep.png,results/publish/models/glm51/reports/screenshots/implementation-plan-mobile-deep.png,60535,True,90.03,210191,7574,217765,217765,0,0,0,0,180736,180736,29455,210191,15,16,15,True,True,True,2,0,2,0,False,False,15,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-glm51-publication-final/implementation,2,0,0,2,1,0,0,1,0,0,1,0,0,0,1,0,True,93,18.6,20,93,,fail | |
| publish,glm51,glm51,clean-final,skill-with-shell-glm51-publication-final,benchmark-comparison,results/publish/models/glm51/artifacts/benchmark-comparison.html,results/publish/models/glm51/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/glm51/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/glm51/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/glm51/reports/screenshots/benchmark-comparison-mobile-deep.png,64863,True,149.159,274201,14416,288617,288617,0,0,0,0,214336,214336,59865,274201,12,14,12,True,True,True,1,0,1,0,False,False,12,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-glm51-publication-final/benchmark-comp,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,gpt-5.3-codex,gpt-5-3-codex,clean-final,skill-with-shell-gpt-5-3-codex-publication-final,numeric-data,results/publish/models/gpt-5-3-codex/artifacts/numeric-data.html,results/publish/models/gpt-5-3-codex/reports/screenshots/numeric-data-desktop.png,results/publish/models/gpt-5-3-codex/reports/screenshots/numeric-data-mobile.png,results/publish/models/gpt-5-3-codex/reports/screenshots/numeric-data-deep.png,results/publish/models/gpt-5-3-codex/reports/screenshots/numeric-data-mobile-deep.png,40305,True,63.372,91503,5097,96600,96600,1083,0,0,0,76800,76800,14703,91503,8,11,8,False,False,False,0,0,0,0,False,False,8,,,2,2,0,0,1,1,0,0,0,1,1,0,0,1,1,0,True,93,18.6,20,93,,fail | |
| publish,gpt-5.3-codex,gpt-5-3-codex,clean-final,skill-with-shell-gpt-5-3-codex-publication-final,code-review,results/publish/models/gpt-5-3-codex/artifacts/code-review.html,results/publish/models/gpt-5-3-codex/reports/screenshots/code-review-desktop.png,results/publish/models/gpt-5-3-codex/reports/screenshots/code-review-mobile.png,results/publish/models/gpt-5-3-codex/reports/screenshots/code-review-deep.png,results/publish/models/gpt-5-3-codex/reports/screenshots/code-review-mobile-deep.png,39494,True,94.334,461816,6027,467843,467843,2855,0,0,0,384640,384640,77176,461816,17,18,17,True,True,False,1,1,0,0,False,False,17,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-gpt-5-3-codex-publication-final/code-r,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,gpt-5.3-codex,gpt-5-3-codex,clean-final,skill-with-shell-gpt-5-3-codex-publication-final,module-explainer,results/publish/models/gpt-5-3-codex/artifacts/module-explainer.html,results/publish/models/gpt-5-3-codex/reports/screenshots/module-explainer-desktop.png,results/publish/models/gpt-5-3-codex/reports/screenshots/module-explainer-mobile.png,results/publish/models/gpt-5-3-codex/reports/screenshots/module-explainer-deep.png,results/publish/models/gpt-5-3-codex/reports/screenshots/module-explainer-mobile-deep.png,46290,True,93.641,555669,7177,562846,562846,1701,0,0,0,450304,450304,105365,555669,17,23,17,True,True,True,2,1,1,0,False,True,17,"checker-cli-error,checker-shell-reference,read-checker,run-checker-cli",read /home/shaun/source/birch-html/scripts/check_birch_renderings.py | shell referenced checker: rg '^def ' -n /home/shaun/source/birch-html/scripts/check_birch_renderings.py | ran checker CLI: mkdir -p /home/shaun/source/birch-html/eval-runs/skill-with-shell-gpt-5-3-codex-publication-final && cat > /home/shaun/source/birch-html/eval-runs/skill-with-shell-gpt-5-3-codex-pu | ran checker CLI: uv run --with pillow python /home/shaun/source/birch-html/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-gpt-5-3-codex-publication-final/module-explainer.h | checker CLI usage error,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,True,91,18.2,20,91,,fail | |
| publish,gpt-5.3-codex,gpt-5-3-codex,clean-final,skill-with-shell-gpt-5-3-codex-publication-final,implementation-plan,results/publish/models/gpt-5-3-codex/artifacts/implementation-plan.html,results/publish/models/gpt-5-3-codex/reports/screenshots/implementation-plan-desktop.png,results/publish/models/gpt-5-3-codex/reports/screenshots/implementation-plan-mobile.png,results/publish/models/gpt-5-3-codex/reports/screenshots/implementation-plan-deep.png,results/publish/models/gpt-5-3-codex/reports/screenshots/implementation-plan-mobile-deep.png,45485,True,59.362,90659,4766,95425,95425,589,0,0,0,71168,71168,19491,90659,9,10,9,True,True,True,2,1,1,0,False,True,9,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-gpt-5-3-codex-publication-final/implem | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-gpt-5-3-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,gpt-5.3-codex,gpt-5-3-codex,clean-final,skill-with-shell-gpt-5-3-codex-publication-final,benchmark-comparison,results/publish/models/gpt-5-3-codex/artifacts/benchmark-comparison.html,results/publish/models/gpt-5-3-codex/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/gpt-5-3-codex/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/gpt-5-3-codex/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/gpt-5-3-codex/reports/screenshots/benchmark-comparison-mobile-deep.png,46793,True,61.812,60483,5615,66098,66098,746,0,0,0,53376,53376,7107,60483,7,8,7,False,False,False,0,0,0,0,False,False,7,,,4,0,0,0,2,0,0,0,2,0,0,0,2,0,0,0,True,88,17.6,20,88,,fail | |
| publish,grok-4.3,grok-4-3,clean-final,skill-with-shell-grok-4-3-publication-final,numeric-data,results/publish/models/grok-4-3/artifacts/numeric-data.html,results/publish/models/grok-4-3/reports/screenshots/numeric-data-desktop.png,results/publish/models/grok-4-3/reports/screenshots/numeric-data-mobile.png,results/publish/models/grok-4-3/reports/screenshots/numeric-data-deep.png,results/publish/models/grok-4-3/reports/screenshots/numeric-data-mobile-deep.png,36903,True,49.028,73338,3307,76645,76645,925,0,0,0,62720,62720,10618,73338,10,9,10,False,False,False,0,0,0,0,False,False,10,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,grok-4.3,grok-4-3,clean-final,skill-with-shell-grok-4-3-publication-final,code-review,results/publish/models/grok-4-3/artifacts/code-review.html,results/publish/models/grok-4-3/reports/screenshots/code-review-desktop.png,results/publish/models/grok-4-3/reports/screenshots/code-review-mobile.png,results/publish/models/grok-4-3/reports/screenshots/code-review-deep.png,results/publish/models/grok-4-3/reports/screenshots/code-review-mobile-deep.png,38297,True,55.392,190492,4553,195045,195045,2340,0,0,0,147520,147520,42972,190492,11,10,11,False,False,False,0,0,0,0,False,False,11,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,grok-4.3,grok-4-3,clean-final,skill-with-shell-grok-4-3-publication-final,module-explainer,results/publish/models/grok-4-3/artifacts/module-explainer.html,results/publish/models/grok-4-3/reports/screenshots/module-explainer-desktop.png,results/publish/models/grok-4-3/reports/screenshots/module-explainer-mobile.png,results/publish/models/grok-4-3/reports/screenshots/module-explainer-deep.png,results/publish/models/grok-4-3/reports/screenshots/module-explainer-mobile-deep.png,9279,False,40.052,125766,3826,129592,129592,1202,0,0,0,46784,46784,53433,100217,15,6,7,True,False,False,0,0,0,0,False,False,15,read-checker,read /home/shaun/source/birch-html/scripts/check_birch_renderings.py,8,0,3,0,2,0,2,0,2,0,2,0,2,0,2,0,True,35.0,7.0,20,35.0,missing_birch_css,fail | |
| publish,grok-4.3,grok-4-3,clean-final,skill-with-shell-grok-4-3-publication-final,implementation-plan,results/publish/models/grok-4-3/artifacts/implementation-plan.html,results/publish/models/grok-4-3/reports/screenshots/implementation-plan-desktop.png,results/publish/models/grok-4-3/reports/screenshots/implementation-plan-mobile.png,results/publish/models/grok-4-3/reports/screenshots/implementation-plan-deep.png,results/publish/models/grok-4-3/reports/screenshots/implementation-plan-mobile-deep.png,16152,False,41.596,32235,5236,37471,37471,1207,0,0,0,39488,39488,20479,59967,8,4,5,False,False,False,0,0,0,0,False,False,8,,,4,0,4,0,1,0,1,0,1,0,1,0,1,0,1,0,True,20.0,4.0,20,20.0,missing_birch_css_and_visibly_unstyled,fail | |
| publish,grok-4.3,grok-4-3,clean-final,skill-with-shell-grok-4-3-publication-final,benchmark-comparison,results/publish/models/grok-4-3/artifacts/benchmark-comparison.html,results/publish/models/grok-4-3/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/grok-4-3/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/grok-4-3/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/grok-4-3/reports/screenshots/benchmark-comparison-mobile-deep.png,10364,False,98.19,153411,7388,160799,160799,2517,0,0,0,39488,39488,6645,46133,8,15,16,False,False,False,0,0,0,0,False,False,8,,,4,0,4,1,1,0,1,1,1,0,1,0,1,0,1,0,True,35.0,7.0,20,35.0,missing_birch_css,fail | |
| publish,haiku45,haiku45,clean-final,skill-with-shell-haiku45-publication-final,numeric-data,results/publish/models/haiku45/artifacts/numeric-data.html,results/publish/models/haiku45/reports/screenshots/numeric-data-desktop.png,results/publish/models/haiku45/reports/screenshots/numeric-data-mobile.png,results/publish/models/haiku45/reports/screenshots/numeric-data-deep.png,results/publish/models/haiku45/reports/screenshots/numeric-data-mobile-deep.png,23937,False,67.62,119520,7707,127227,127227,0,0,7297,12081,0,19378,11280,30658,4,9,10,False,False,False,0,0,0,0,False,False,4,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-haiku45-publication-final/numeric-data,16,12,1,0,4,3,1,0,4,3,4,3,4,3,4,3,True,35.0,7.0,20,35.0,missing_birch_css,fail | |
| publish,haiku45,haiku45,clean-final,skill-with-shell-haiku45-publication-final,code-review,results/publish/models/haiku45/artifacts/code-review.html,results/publish/models/haiku45/reports/screenshots/code-review-desktop.png,results/publish/models/haiku45/reports/screenshots/code-review-mobile.png,results/publish/models/haiku45/reports/screenshots/code-review-deep.png,results/publish/models/haiku45/reports/screenshots/code-review-mobile-deep.png,53526,True,94.461,301467,10117,311584,311584,0,0,228528,34499,0,263027,38440,301467,11,11,11,True,True,True,1,0,1,0,False,False,11,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-haiku45-,6,0,0,2,2,0,0,1,1,0,2,0,1,0,2,0,True,87,17.4,20,87,,fail | |
| publish,haiku45,haiku45,clean-final,skill-with-shell-haiku45-publication-final,module-explainer,results/publish/models/haiku45/artifacts/module-explainer.html,results/publish/models/haiku45/reports/screenshots/module-explainer-desktop.png,results/publish/models/haiku45/reports/screenshots/module-explainer-mobile.png,results/publish/models/haiku45/reports/screenshots/module-explainer-deep.png,results/publish/models/haiku45/reports/screenshots/module-explainer-mobile-deep.png,57853,False,75.42,211164,9407,220571,220571,0,0,0,55031,0,55031,80985,136016,3,10,6,True,False,False,0,0,0,0,False,False,3,read-checker,read /home/shaun/source/birch-html/scripts/check_birch_renderings.py,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,haiku45,haiku45,clean-final,skill-with-shell-haiku45-publication-final,implementation-plan,results/publish/models/haiku45/artifacts/implementation-plan.html,results/publish/models/haiku45/reports/screenshots/implementation-plan-desktop.png,results/publish/models/haiku45/reports/screenshots/implementation-plan-mobile.png,results/publish/models/haiku45/reports/screenshots/implementation-plan-deep.png,results/publish/models/haiku45/reports/screenshots/implementation-plan-mobile-deep.png,50641,True,67.418,123711,7166,130877,130877,0,0,91600,16126,0,107726,15985,123711,9,9,9,True,True,True,1,0,1,0,False,False,9,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-haiku45-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,haiku45,haiku45,clean-final,skill-with-shell-haiku45-publication-final,benchmark-comparison,results/publish/models/haiku45/artifacts/benchmark-comparison.html,results/publish/models/haiku45/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/haiku45/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/haiku45/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/haiku45/reports/screenshots/benchmark-comparison-mobile-deep.png,49137,True,65.28,151349,7796,159145,159145,0,0,122743,12640,0,135383,15966,151349,11,10,11,False,False,False,0,0,0,0,False,False,11,,,4,0,0,3,1,0,0,1,1,0,1,0,1,0,1,0,True,93,18.6,20,93,,fail | |
| publish,kimi,kimi,clean-final,skill-with-shell-kimi-publication-final,numeric-data,results/publish/models/kimi/artifacts/numeric-data.html,results/publish/models/kimi/reports/screenshots/numeric-data-desktop.png,results/publish/models/kimi/reports/screenshots/numeric-data-mobile.png,results/publish/models/kimi/reports/screenshots/numeric-data-deep.png,results/publish/models/kimi/reports/screenshots/numeric-data-mobile-deep.png,67620,True,194.344,470039,5317,475356,475356,0,0,0,0,425472,425472,44567,470039,20,23,20,True,True,True,3,1,2,0,False,True,20,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-kimi-publication-final/numeric-data.ht | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-kimi-pub,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,kimi,kimi,clean-final,skill-with-shell-kimi-publication-final,code-review,results/publish/models/kimi/artifacts/code-review.html,results/publish/models/kimi/reports/screenshots/code-review-desktop.png,results/publish/models/kimi/reports/screenshots/code-review-mobile.png,results/publish/models/kimi/reports/screenshots/code-review-deep.png,results/publish/models/kimi/reports/screenshots/code-review-mobile-deep.png,44300,True,627.536,1248543,24596,1273139,1273139,0,0,0,0,1192448,1192448,56095,1248543,33,36,33,True,True,True,2,1,1,0,False,True,33,"checker-shell-reference,read-checker,run-checker-cli","read /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py | shell referenced checker: grep -n ""CANDLE_CLASSES\|BIRCH_CLASSES\|LAYOUT_CLASSES\|SEMANTIC_CLASSES"" /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py | head -20 | shell referenced checker: grep -n ""callout"" /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py | shell referenced checker: grep -n ""eyebrow\|lede\|muted\|caption\|subtle\|note\|entity\|label-cell"" /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py | head -20 | shell referenced checker: grep -n ""code-block"" /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py | head -20 | shell referenced checker: grep -n ""data-tone"" /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py | head -20",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,kimi,kimi,clean-final,skill-with-shell-kimi-publication-final,module-explainer,results/publish/models/kimi/artifacts/module-explainer.html,results/publish/models/kimi/reports/screenshots/module-explainer-desktop.png,results/publish/models/kimi/reports/screenshots/module-explainer-mobile.png,results/publish/models/kimi/reports/screenshots/module-explainer-deep.png,results/publish/models/kimi/reports/screenshots/module-explainer-mobile-deep.png,17730,False,142.653,54919,5427,60346,60346,0,0,0,0,0,0,54919,54919,5,10,5,True,False,False,0,0,0,0,False,False,5,read-checker,read /home/shaun/source/birch-html/scripts/check_birch_renderings.py,6,0,7,1,2,0,3,1,1,0,2,0,1,0,2,0,True,20.0,4.0,20,20.0,missing_birch_css_and_visibly_unstyled,fail | |
| publish,kimi,kimi,clean-final,skill-with-shell-kimi-publication-final,implementation-plan,results/publish/models/kimi/artifacts/implementation-plan.html,results/publish/models/kimi/reports/screenshots/implementation-plan-desktop.png,results/publish/models/kimi/reports/screenshots/implementation-plan-mobile.png,results/publish/models/kimi/reports/screenshots/implementation-plan-deep.png,results/publish/models/kimi/reports/screenshots/implementation-plan-mobile-deep.png,50937,True,372.779,468652,19358,488010,488010,0,0,0,0,415232,415232,53420,468652,15,16,15,True,True,True,1,0,1,0,False,False,15,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-kimi-publication-final/implementation-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,kimi,kimi,clean-final,skill-with-shell-kimi-publication-final,benchmark-comparison,results/publish/models/kimi/artifacts/benchmark-comparison.html,results/publish/models/kimi/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/kimi/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/kimi/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/kimi/reports/screenshots/benchmark-comparison-mobile-deep.png,51725,True,427.336,358341,15297,373638,373638,0,0,0,0,299776,299776,58565,358341,14,14,14,True,True,True,1,0,1,0,False,False,14,run-checker-cli,ran checker CLI: uv run --with pillow python /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-kimi-publicati,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,True,99,19.8,20,99,,warn | |
| publish,minimax27,minimax27,clean-final,skill-with-shell-minimax27-publication-final,numeric-data,results/publish/models/minimax27/artifacts/numeric-data.html,results/publish/models/minimax27/reports/screenshots/numeric-data-desktop.png,results/publish/models/minimax27/reports/screenshots/numeric-data-mobile.png,results/publish/models/minimax27/reports/screenshots/numeric-data-deep.png,results/publish/models/minimax27/reports/screenshots/numeric-data-mobile-deep.png,50838,False,160.154,87235,10902,98137,98137,0,0,0,0,116736,116736,81499,198235,12,9,10,True,True,True,2,1,1,0,False,True,12,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,minimax27,minimax27,clean-final,skill-with-shell-minimax27-publication-final,code-review,results/publish/models/minimax27/artifacts/code-review.html,results/publish/models/minimax27/reports/screenshots/code-review-desktop.png,results/publish/models/minimax27/reports/screenshots/code-review-mobile.png,results/publish/models/minimax27/reports/screenshots/code-review-deep.png,results/publish/models/minimax27/reports/screenshots/code-review-mobile-deep.png,43165,True,211.215,444148,7213,451361,451361,0,0,0,0,355328,355328,88820,444148,18,20,18,False,False,False,0,0,0,0,False,False,18,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,minimax27,minimax27,clean-final,skill-with-shell-minimax27-publication-final,module-explainer,results/publish/models/minimax27/artifacts/module-explainer.html,results/publish/models/minimax27/reports/screenshots/module-explainer-desktop.png,results/publish/models/minimax27/reports/screenshots/module-explainer-mobile.png,results/publish/models/minimax27/reports/screenshots/module-explainer-deep.png,results/publish/models/minimax27/reports/screenshots/module-explainer-mobile-deep.png,50511,False,183.748,185140,15068,200208,200208,0,0,0,0,232320,232320,148313,380633,9,9,5,True,False,False,0,0,0,0,False,False,9,read-checker,read /home/shaun/source/birch-html/scripts/check_birch_renderings.py,4,0,4,0,1,0,1,0,1,0,1,0,1,0,1,0,True,20.0,4.0,20,20.0,missing_birch_css_and_visibly_unstyled,fail | |
| publish,minimax27,minimax27,clean-final,skill-with-shell-minimax27-publication-final,implementation-plan,results/publish/models/minimax27/artifacts/implementation-plan.html,results/publish/models/minimax27/reports/screenshots/implementation-plan-desktop.png,results/publish/models/minimax27/reports/screenshots/implementation-plan-mobile.png,results/publish/models/minimax27/reports/screenshots/implementation-plan-deep.png,results/publish/models/minimax27/reports/screenshots/implementation-plan-mobile-deep.png,21904,False,64.763,27146,4563,31709,31709,0,0,0,0,7040,7040,11494,18534,3,3,4,False,False,False,0,0,0,0,False,False,3,,,14,4,0,0,4,1,0,0,3,1,4,1,3,1,4,1,True,35.0,7.0,20,35.0,missing_birch_css,fail | |
| publish,minimax27,minimax27,clean-final,skill-with-shell-minimax27-publication-final,benchmark-comparison,results/publish/models/minimax27/artifacts/benchmark-comparison.html,results/publish/models/minimax27/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/minimax27/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/minimax27/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/minimax27/reports/screenshots/benchmark-comparison-mobile-deep.png,79228,False,420.033,511926,33192,545118,545118,0,0,0,0,129664,129664,154885,284549,7,14,13,True,True,True,1,0,1,0,False,False,7,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-minimax27-publication-final/benchmark-comparison.html 2>&1 ,8,0,0,4,2,0,0,1,2,0,2,0,2,0,2,0,True,35.0,7.0,20,35.0,missing_birch_css,fail | |
| publish,opus47,opus47,clean-final,skill-with-shell-opus47-publication-final,numeric-data,results/publish/models/opus47/artifacts/numeric-data.html,results/publish/models/opus47/reports/screenshots/numeric-data-desktop.png,results/publish/models/opus47/reports/screenshots/numeric-data-mobile.png,results/publish/models/opus47/reports/screenshots/numeric-data-deep.png,results/publish/models/opus47/reports/screenshots/numeric-data-mobile-deep.png,45758,True,106.088,161380,8823,170203,170203,0,0,114642,25769,0,140411,20969,161380,10,12,10,True,True,True,2,0,2,0,False,False,10,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus47-publication-final/numeric-data. | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-opus47-p,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,opus47,opus47,clean-final,skill-with-shell-opus47-publication-final,code-review,results/publish/models/opus47/artifacts/code-review.html,results/publish/models/opus47/reports/screenshots/code-review-desktop.png,results/publish/models/opus47/reports/screenshots/code-review-mobile.png,results/publish/models/opus47/reports/screenshots/code-review-deep.png,results/publish/models/opus47/reports/screenshots/code-review-mobile-deep.png,50191,True,268.356,571314,17059,588373,588373,0,0,441950,55976,0,497926,73388,571314,14,18,14,True,True,True,3,0,3,0,False,False,14,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus47-publication-final/code-review.h | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-opus47-p,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,opus47,opus47,clean-final,skill-with-shell-opus47-publication-final,module-explainer,results/publish/models/opus47/artifacts/module-explainer.html,results/publish/models/opus47/reports/screenshots/module-explainer-desktop.png,results/publish/models/opus47/reports/screenshots/module-explainer-mobile.png,results/publish/models/opus47/reports/screenshots/module-explainer-deep.png,results/publish/models/opus47/reports/screenshots/module-explainer-mobile-deep.png,58814,True,206.748,653611,15632,669243,669243,0,0,502232,65941,0,568173,85438,653611,13,19,13,True,True,True,1,0,1,0,False,False,13,"read-checker,run-checker-cli",read /home/shaun/source/birch-html/scripts/check_birch_renderings.py | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus47-publication-final/module-explainer.ht,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,opus47,opus47,clean-final,skill-with-shell-opus47-publication-final,implementation-plan,results/publish/models/opus47/artifacts/implementation-plan.html,results/publish/models/opus47/reports/screenshots/implementation-plan-desktop.png,results/publish/models/opus47/reports/screenshots/implementation-plan-mobile.png,results/publish/models/opus47/reports/screenshots/implementation-plan-deep.png,results/publish/models/opus47/reports/screenshots/implementation-plan-mobile-deep.png,53012,True,141.632,206186,9414,215600,215600,0,0,160139,23940,0,184079,22107,206186,11,12,11,True,True,True,2,0,2,0,False,False,11,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus47-publication-final/implementatio | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-opus47-p,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,opus47,opus47,clean-final,skill-with-shell-opus47-publication-final,benchmark-comparison,results/publish/models/opus47/artifacts/benchmark-comparison.html,results/publish/models/opus47/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/opus47/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/opus47/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/opus47/reports/screenshots/benchmark-comparison-mobile-deep.png,64934,True,150.046,388331,9617,397948,397948,0,0,328368,33477,0,361845,26486,388331,19,22,19,True,True,True,2,0,2,0,False,False,19,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus47-publication-final/benchmark-com | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-opus47-p,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,sonnet46,sonnet46,clean-final,skill-with-shell-sonnet46-publication-final,numeric-data,results/publish/models/sonnet46/artifacts/numeric-data.html,results/publish/models/sonnet46/reports/screenshots/numeric-data-desktop.png,results/publish/models/sonnet46/reports/screenshots/numeric-data-mobile.png,results/publish/models/sonnet46/reports/screenshots/numeric-data-deep.png,results/publish/models/sonnet46/reports/screenshots/numeric-data-mobile-deep.png,52394,True,203.959,302149,14758,316907,316907,0,0,234504,38197,0,272701,29448,302149,13,15,13,True,True,True,2,1,1,0,False,True,13,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-sonnet46-publication-final/numeric-dat | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-sonnet46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,sonnet46,sonnet46,clean-final,skill-with-shell-sonnet46-publication-final,code-review,results/publish/models/sonnet46/artifacts/code-review.html,results/publish/models/sonnet46/reports/screenshots/code-review-desktop.png,results/publish/models/sonnet46/reports/screenshots/code-review-mobile.png,results/publish/models/sonnet46/reports/screenshots/code-review-deep.png,results/publish/models/sonnet46/reports/screenshots/code-review-mobile-deep.png,57805,True,302.047,477280,18427,495707,495707,0,0,368349,44875,0,413224,64056,477280,14,18,14,True,True,True,2,0,2,0,False,False,14,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-sonnet46-publication-final/code-review | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-sonnet46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,sonnet46,sonnet46,clean-final,skill-with-shell-sonnet46-publication-final,module-explainer,results/publish/models/sonnet46/artifacts/module-explainer.html,results/publish/models/sonnet46/reports/screenshots/module-explainer-desktop.png,results/publish/models/sonnet46/reports/screenshots/module-explainer-mobile.png,results/publish/models/sonnet46/reports/screenshots/module-explainer-deep.png,results/publish/models/sonnet46/reports/screenshots/module-explainer-mobile-deep.png,66525,True,978.64,2649057,62243,2711300,2711300,0,0,2413844,135163,0,2549007,100050,2649057,34,38,34,True,True,True,2,1,1,0,False,True,34,"read-checker,run-checker-cli",read /home/shaun/source/birch-html/scripts/check_birch_renderings.py | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-sonnet46-publication-final/module-explainer. | ran checker CLI: cd /home/shaun/source/birch-html && uv run skill/scripts/finish_birch_html.py eval-runs/skill-with-shell-sonnet46-publication-final/module-explainer.html && uv run --with pillow py,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,sonnet46,sonnet46,clean-final,skill-with-shell-sonnet46-publication-final,implementation-plan,results/publish/models/sonnet46/artifacts/implementation-plan.html,results/publish/models/sonnet46/reports/screenshots/implementation-plan-desktop.png,results/publish/models/sonnet46/reports/screenshots/implementation-plan-mobile.png,results/publish/models/sonnet46/reports/screenshots/implementation-plan-deep.png,results/publish/models/sonnet46/reports/screenshots/implementation-plan-mobile-deep.png,49926,True,196.05,257093,12916,270009,270009,0,0,210864,24527,0,235391,21702,257093,14,15,14,True,True,True,2,0,2,0,False,False,14,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-sonnet46-publication-final/implementat | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-sonnet46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| publish,sonnet46,sonnet46,clean-final,skill-with-shell-sonnet46-publication-final,benchmark-comparison,results/publish/models/sonnet46/artifacts/benchmark-comparison.html,results/publish/models/sonnet46/reports/screenshots/benchmark-comparison-desktop.png,results/publish/models/sonnet46/reports/screenshots/benchmark-comparison-mobile.png,results/publish/models/sonnet46/reports/screenshots/benchmark-comparison-deep.png,results/publish/models/sonnet46/reports/screenshots/benchmark-comparison-mobile-deep.png,122208,True,623.147,1192904,48270,1241174,1241174,0,0,987803,129337,0,1117140,75764,1192904,18,22,18,True,True,True,3,0,3,0,False,False,18,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-sonnet46-publication-final/benchmark-c | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-sonnet46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,codexresponses.gpt-5.4,codexresponses-gpt-5-4,clean-final,skill-with-shell-codexresponses-gpt-5-4-new-model-day,numeric-data,results/new-model-day/models/codexresponses-gpt-5-4/artifacts/numeric-data.html,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/numeric-data-desktop.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/numeric-data-mobile.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/numeric-data-deep.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/numeric-data-mobile-deep.png,42074,True,192.9,110293,6574,116867,116867,0,0,0,0,59904,59904,50389,110293,9,14,9,True,True,True,1,0,1,0,False,False,9,run-checker-cli,ran checker CLI: uv run --with pillow python /home/shaun/source/birch-html/skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-codexresponses,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,codexresponses.gpt-5.4,codexresponses-gpt-5-4,clean-final,skill-with-shell-codexresponses-gpt-5-4-new-model-day,code-review,results/new-model-day/models/codexresponses-gpt-5-4/artifacts/code-review.html,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/code-review-desktop.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/code-review-mobile.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/code-review-deep.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/code-review-mobile-deep.png,44000,True,151.5,257526,7500,265026,265026,0,0,0,0,182272,182272,75254,257526,8,19,8,True,False,False,0,0,0,0,False,False,8,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,codexresponses.gpt-5.4,codexresponses-gpt-5-4,clean-final,skill-with-shell-codexresponses-gpt-5-4-new-model-day,module-explainer,results/new-model-day/models/codexresponses-gpt-5-4/artifacts/module-explainer.html,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/module-explainer-desktop.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/module-explainer-mobile.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/module-explainer-deep.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/module-explainer-mobile-deep.png,55726,True,173.2,183748,8837,192585,192585,0,0,0,0,108032,108032,75716,183748,7,23,7,True,False,False,0,0,0,0,False,False,7,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,codexresponses.gpt-5.4,codexresponses-gpt-5-4,clean-final,skill-with-shell-codexresponses-gpt-5-4-new-model-day,implementation-plan,results/new-model-day/models/codexresponses-gpt-5-4/artifacts/implementation-plan.html,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/implementation-plan-desktop.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/implementation-plan-mobile.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/implementation-plan-deep.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/implementation-plan-mobile-deep.png,53200,True,153.0,66314,6819,73133,73133,0,0,0,0,24576,24576,41738,66314,6,9,6,True,False,False,0,0,0,0,False,False,6,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,codexresponses.gpt-5.4,codexresponses-gpt-5-4,clean-final,skill-with-shell-codexresponses-gpt-5-4-new-model-day,benchmark-comparison,results/new-model-day/models/codexresponses-gpt-5-4/artifacts/benchmark-comparison.html,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/benchmark-comparison-desktop.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/benchmark-comparison-mobile.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/benchmark-comparison-deep.png,results/new-model-day/models/codexresponses-gpt-5-4/reports/screenshots/benchmark-comparison-mobile-deep.png,93563,True,337.4,180917,15758,196675,196675,0,0,0,0,93696,93696,87221,180917,10,16,10,True,True,True,1,0,1,0,False,False,10,,,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,True,92,18.4,20,92,,fail | |
| new-model-day,opus?task_budget=200000,opus-task-budget-200000,clean-final,skill-with-shell-opus-task-budget-200000-new-model-day,numeric-data,results/new-model-day/models/opus-task-budget-200000/artifacts/numeric-data.html,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/numeric-data-desktop.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/numeric-data-mobile.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/numeric-data-deep.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/numeric-data-mobile-deep.png,47110,True,138.509,328931,11473,340404,340404,0,0,262308,39981,0,302289,26642,328931,16,17,16,True,True,True,2,0,2,0,False,False,16,run-checker-cli,"ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus-task-budget-200000-new-model-day/ | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact ""$(pwd)/eval-runs/skill-with-shell-opus-task-budget-200000-new-mo",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus?task_budget=200000,opus-task-budget-200000,clean-final,skill-with-shell-opus-task-budget-200000-new-model-day,code-review,results/new-model-day/models/opus-task-budget-200000/artifacts/code-review.html,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/code-review-desktop.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/code-review-mobile.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/code-review-deep.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/code-review-mobile-deep.png,47511,True,176.741,411266,14151,425417,425417,0,0,304812,48453,0,353265,58001,411266,11,13,11,True,True,True,2,0,2,0,False,False,11,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus-task-budget-200000-new-model-day/ | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-opus-tas,4,0,0,2,2,0,0,1,0,0,2,0,0,0,2,0,True,87,17.4,20,87,,fail | |
| new-model-day,opus?task_budget=200000,opus-task-budget-200000,clean-final,skill-with-shell-opus-task-budget-200000-new-model-day,module-explainer,results/new-model-day/models/opus-task-budget-200000/artifacts/module-explainer.html,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/module-explainer-desktop.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/module-explainer-mobile.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/module-explainer-deep.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/module-explainer-mobile-deep.png,52511,True,460.502,1500017,34600,1534617,1534617,0,0,1318059,97252,0,1415311,84706,1500017,23,30,23,True,True,True,3,0,3,0,False,False,23,"read-checker,run-checker-cli","read /home/shaun/source/birch-html/scripts/check_birch_renderings.py | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus-task-budget-200000-new-model-day/module | ran checker CLI: cd /home/shaun/source/birch-html && uv run skill/scripts/finish_birch_html.py eval-runs/skill-with-shell-opus-task-budget-200000-new-model-day/module-explainer.html >/dev/null && u | ran checker CLI: cd /home/shaun/source/birch-html && python3 -c ""import json;d=json.load(open('reports/me-check.json'));print([f['evidence'][:80] for f in d['artifacts'][0]['findings'] if f['level'",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus?task_budget=200000,opus-task-budget-200000,clean-final,skill-with-shell-opus-task-budget-200000-new-model-day,implementation-plan,results/new-model-day/models/opus-task-budget-200000/artifacts/implementation-plan.html,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/implementation-plan-desktop.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/implementation-plan-mobile.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/implementation-plan-deep.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/implementation-plan-mobile-deep.png,53919,True,132.769,332156,11607,343763,343763,0,0,267724,22416,0,290140,42016,332156,16,17,16,True,True,True,2,0,2,0,False,False,16,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus-task-budget-200000-new-model-day/ | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-opus-tas,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus?task_budget=200000,opus-task-budget-200000,clean-final,skill-with-shell-opus-task-budget-200000-new-model-day,benchmark-comparison,results/new-model-day/models/opus-task-budget-200000/artifacts/benchmark-comparison.html,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/benchmark-comparison-desktop.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/benchmark-comparison-mobile.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/benchmark-comparison-deep.png,results/new-model-day/models/opus-task-budget-200000/reports/screenshots/benchmark-comparison-mobile-deep.png,67486,True,281.111,1012407,24357,1036764,1036764,0,0,853500,58779,0,912279,100128,1012407,22,28,22,True,True,True,3,0,3,0,False,False,22,run-checker-cli,"ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus-task-budget-200000-new-model-day/ | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact ""$PWD/eval-runs/skill-with-shell-opus-task-budget-200000-new-mode",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus?task_budget=50000,opus-task-budget-50000,clean-final,skill-with-shell-opus-task-budget-50000-new-model-day,numeric-data,results/new-model-day/models/opus-task-budget-50000/artifacts/numeric-data.html,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/numeric-data-desktop.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/numeric-data-mobile.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/numeric-data-deep.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/numeric-data-mobile-deep.png,39382,True,66.763,90085,5361,95446,95446,0,0,56965,16529,0,73494,16591,90085,7,7,7,True,True,True,2,0,2,0,False,False,7,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus-task-budget-50000-new-model-day/n | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-opus-tas,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus?task_budget=50000,opus-task-budget-50000,clean-final,skill-with-shell-opus-task-budget-50000-new-model-day,code-review,results/new-model-day/models/opus-task-budget-50000/artifacts/code-review.html,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/code-review-desktop.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/code-review-mobile.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/code-review-deep.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/code-review-mobile-deep.png,41220,True,63.323,104544,5043,109587,109587,0,0,12772,35644,0,48416,56128,104544,4,5,4,False,False,False,0,0,0,0,False,False,4,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus?task_budget=50000,opus-task-budget-50000,clean-final,skill-with-shell-opus-task-budget-50000-new-model-day,module-explainer,results/new-model-day/models/opus-task-budget-50000/artifacts/module-explainer.html,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/module-explainer-desktop.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/module-explainer-mobile.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/module-explainer-deep.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/module-explainer-mobile-deep.png,9962,False,56.079,82544,4834,87378,87378,0,0,11901,1798,0,13699,68845,82544,3,3,3,True,False,False,0,0,0,0,False,False,3,read-checker,read /home/shaun/source/birch-html/scripts/check_birch_renderings.py,4,0,1,1,1,0,1,1,1,0,1,0,1,0,1,0,True,35.0,7.0,20,35.0,missing_birch_css,fail | |
| new-model-day,opus?task_budget=50000,opus-task-budget-50000,clean-final,skill-with-shell-opus-task-budget-50000-new-model-day,implementation-plan,results/new-model-day/models/opus-task-budget-50000/artifacts/implementation-plan.html,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/implementation-plan-desktop.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/implementation-plan-mobile.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/implementation-plan-deep.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/implementation-plan-mobile-deep.png,42710,True,62.202,106572,5249,111821,111821,0,0,69127,15224,0,84351,22221,106572,7,7,7,True,True,True,2,0,2,0,False,False,7,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus-task-budget-50000-new-model-day/i | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-opus-tas,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus?task_budget=50000,opus-task-budget-50000,clean-final,skill-with-shell-opus-task-budget-50000-new-model-day,benchmark-comparison,results/new-model-day/models/opus-task-budget-50000/artifacts/benchmark-comparison.html,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/benchmark-comparison-desktop.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/benchmark-comparison-mobile.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/benchmark-comparison-deep.png,results/new-model-day/models/opus-task-budget-50000/reports/screenshots/benchmark-comparison-mobile-deep.png,44574,True,76.846,105163,6612,111775,111775,0,0,69216,15449,0,84665,20498,105163,7,7,7,True,True,True,2,0,2,0,False,False,7,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus-task-budget-50000-new-model-day/b | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-opus-tas,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus46,opus46,clean-final,skill-with-shell-opus46-new-model-day,numeric-data,results/new-model-day/models/opus46/artifacts/numeric-data.html,results/new-model-day/models/opus46/reports/screenshots/numeric-data-desktop.png,results/new-model-day/models/opus46/reports/screenshots/numeric-data-mobile.png,results/new-model-day/models/opus46/reports/screenshots/numeric-data-deep.png,results/new-model-day/models/opus46/reports/screenshots/numeric-data-mobile-deep.png,50342,True,165.446,346224,9640,355864,355864,0,0,293597,26093,0,319690,26534,346224,20,21,20,True,True,True,3,1,2,0,False,True,20,run-checker-cli,"ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus46-new-model-day/numeric-data.html | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact ""$(pwd)/eval-runs/skill-with-shell-opus46-new-model-day/numeric-d",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus46,opus46,clean-final,skill-with-shell-opus46-new-model-day,code-review,results/new-model-day/models/opus46/artifacts/code-review.html,results/new-model-day/models/opus46/reports/screenshots/code-review-desktop.png,results/new-model-day/models/opus46/reports/screenshots/code-review-mobile.png,results/new-model-day/models/opus46/reports/screenshots/code-review-deep.png,results/new-model-day/models/opus46/reports/screenshots/code-review-mobile-deep.png,51991,True,237.048,528342,11743,540085,540085,0,0,445820,41626,0,487446,40896,528342,17,29,17,True,True,True,2,0,2,0,False,False,17,run-checker-cli,"ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus46-new-model-day/code-review.html | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact ""$(pwd)/eval-runs/skill-with-shell-opus46-new-model-day/code-revi",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus46,opus46,clean-final,skill-with-shell-opus46-new-model-day,module-explainer,results/new-model-day/models/opus46/artifacts/module-explainer.html,results/new-model-day/models/opus46/reports/screenshots/module-explainer-desktop.png,results/new-model-day/models/opus46/reports/screenshots/module-explainer-mobile.png,results/new-model-day/models/opus46/reports/screenshots/module-explainer-deep.png,results/new-model-day/models/opus46/reports/screenshots/module-explainer-mobile-deep.png,61250,True,192.786,406724,11067,417791,417791,0,0,301904,60133,0,362037,44687,406724,11,18,11,True,True,True,1,0,1,0,False,False,11,"read-checker,run-checker-cli",read /home/shaun/source/birch-html/scripts/check_birch_renderings.py | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus46-new-model-day/module-explainer.html -,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus46,opus46,clean-final,skill-with-shell-opus46-new-model-day,implementation-plan,results/new-model-day/models/opus46/artifacts/implementation-plan.html,results/new-model-day/models/opus46/reports/screenshots/implementation-plan-desktop.png,results/new-model-day/models/opus46/reports/screenshots/implementation-plan-mobile.png,results/new-model-day/models/opus46/reports/screenshots/implementation-plan-deep.png,results/new-model-day/models/opus46/reports/screenshots/implementation-plan-mobile-deep.png,52816,True,130.271,159833,7328,167161,167161,0,0,116309,20689,0,136998,22835,159833,11,12,11,True,True,True,2,1,1,0,False,True,11,run-checker-cli,"ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus46-new-model-day/implementation-pl | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact ""$(pwd)/eval-runs/skill-with-shell-opus46-new-model-day/implement",2,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,True,94,18.8,20,94,,fail | |
| new-model-day,opus46,opus46,clean-final,skill-with-shell-opus46-new-model-day,benchmark-comparison,results/new-model-day/models/opus46/artifacts/benchmark-comparison.html,results/new-model-day/models/opus46/reports/screenshots/benchmark-comparison-desktop.png,results/new-model-day/models/opus46/reports/screenshots/benchmark-comparison-mobile.png,results/new-model-day/models/opus46/reports/screenshots/benchmark-comparison-deep.png,results/new-model-day/models/opus46/reports/screenshots/benchmark-comparison-mobile-deep.png,69598,True,271.957,351900,19121,371021,371021,0,0,251140,44066,0,295206,56694,351900,14,18,14,True,True,True,1,0,1,0,False,False,14,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus46-new-model-day/benchmark-compari,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus48,opus48,clean-final,skill-with-shell-opus48-new-model-day,numeric-data,results/new-model-day/models/opus48/artifacts/numeric-data.html,results/new-model-day/models/opus48/reports/screenshots/numeric-data-desktop.png,results/new-model-day/models/opus48/reports/screenshots/numeric-data-mobile.png,results/new-model-day/models/opus48/reports/screenshots/numeric-data-deep.png,results/new-model-day/models/opus48/reports/screenshots/numeric-data-mobile-deep.png,54625,True,109.048,271070,6914,277984,277984,0,0,206336,37010,0,243346,27724,271070,14,16,14,True,True,True,2,0,2,0,False,False,14,run-checker-cli,"ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus48-new-model-day/numeric-data.html | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact ""$(pwd)/eval-runs/skill-with-shell-opus48-new-model-day/numeric-d",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus48,opus48,clean-final,skill-with-shell-opus48-new-model-day,code-review,results/new-model-day/models/opus48/artifacts/code-review.html,results/new-model-day/models/opus48/reports/screenshots/code-review-desktop.png,results/new-model-day/models/opus48/reports/screenshots/code-review-mobile.png,results/new-model-day/models/opus48/reports/screenshots/code-review-deep.png,results/new-model-day/models/opus48/reports/screenshots/code-review-mobile-deep.png,46736,True,197.043,459662,14571,474233,474233,0,0,342689,44671,0,387360,72302,459662,12,15,12,True,True,True,2,0,2,0,False,False,12,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus48-new-model-day/code-review.html | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-opus48-n,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus48,opus48,clean-final,skill-with-shell-opus48-new-model-day,module-explainer,results/new-model-day/models/opus48/artifacts/module-explainer.html,results/new-model-day/models/opus48/reports/screenshots/module-explainer-desktop.png,results/new-model-day/models/opus48/reports/screenshots/module-explainer-mobile.png,results/new-model-day/models/opus48/reports/screenshots/module-explainer-deep.png,results/new-model-day/models/opus48/reports/screenshots/module-explainer-mobile-deep.png,51357,True,218.593,618129,15008,633137,633137,0,0,471560,74460,0,546020,72109,618129,12,21,12,True,True,True,2,0,2,0,False,False,12,"read-checker,run-checker-cli",read /home/shaun/source/birch-html/scripts/check_birch_renderings.py | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus48-new-model-day/module-explainer.html -,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus48,opus48,clean-final,skill-with-shell-opus48-new-model-day,implementation-plan,results/new-model-day/models/opus48/artifacts/implementation-plan.html,results/new-model-day/models/opus48/reports/screenshots/implementation-plan-desktop.png,results/new-model-day/models/opus48/reports/screenshots/implementation-plan-mobile.png,results/new-model-day/models/opus48/reports/screenshots/implementation-plan-deep.png,results/new-model-day/models/opus48/reports/screenshots/implementation-plan-mobile-deep.png,51781,True,196.392,252260,12073,264333,264333,0,0,186054,26277,0,212331,39929,252260,12,13,12,True,True,True,2,0,2,0,False,False,12,run-checker-cli,ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus48-new-model-day/implementation-pl | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact /home/shaun/source/birch-html/eval-runs/skill-with-shell-opus48-n,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |
| new-model-day,opus48,opus48,clean-final,skill-with-shell-opus48-new-model-day,benchmark-comparison,results/new-model-day/models/opus48/artifacts/benchmark-comparison.html,results/new-model-day/models/opus48/reports/screenshots/benchmark-comparison-desktop.png,results/new-model-day/models/opus48/reports/screenshots/benchmark-comparison-mobile.png,results/new-model-day/models/opus48/reports/screenshots/benchmark-comparison-deep.png,results/new-model-day/models/opus48/reports/screenshots/benchmark-comparison-mobile-deep.png,55489,True,258.31,685790,18643,704433,704433,0,0,576055,53824,0,629879,55911,685790,21,26,21,True,True,True,2,0,2,0,False,False,21,run-checker-cli,"ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact eval-runs/skill-with-shell-opus48-new-model-day/benchmark-compari | ran checker CLI: cd /home/shaun/source/birch-html && uv run --with pillow python skill/scripts/check_birch_renderings.py --artifact ""$(pwd)/eval-runs/skill-with-shell-opus48-new-model-day/benchmark",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True,100.0,20.0,20,100.0,,clean | |