FlakyGym Space

FlakyGym Control Center

This console runs flaky-test benchmark episodes and streams live logs. Use it to configure runs, estimate runtime, and review grader outcomes quickly.

Quick Brief: Dataset + Graders

Dataset: `dataset/py_tasks.csv`

Each row is one flaky-test investigation task derived from `py-data.csv`: a repository, a commit SHA, a target test, labels, and an optional known-fix diff.

Headers:

`repo_url`, `sha`, `test_name`, `test_file`, `category`, `label`, `status`, `pr_link`, `task_types`, `test_code`, `known_fix_diff`
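
As a rough illustration of how a row might be read, the sketch below checks the header row before loading tasks. It assumes the header line splits into the eleven column names shown in `EXPECTED_COLUMNS`; the `read_tasks` helper and the one-row sample are hypothetical and not part of FlakyGym.

```python
import csv
import io

# Assumed column names, read off the header line of the dataset.
EXPECTED_COLUMNS = [
    "repo_url", "sha", "test_name", "test_file", "category", "label",
    "status", "pr_link", "task_types", "test_code", "known_fix_diff",
]


def read_tasks(handle):
    """Read task rows from an open CSV handle after checking the header row."""
    reader = csv.DictReader(handle)
    present = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Hypothetical one-row sample, for illustration only.
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "https://example.com/repo.git,abc123,test_x,tests/test_x.py,"
    "async_wait,flaky,open,,classify,,\n"
)
rows = read_tasks(sample)
print(rows[0]["test_name"])
```

To read the real file, the same function could be called on `open("dataset/py_tasks.csv", newline="", encoding="utf-8")`.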

3 Graders (short)

Task 1 (`classify`): exact-match flaky vs stable.
Task 2 (`root_cause`): category similarity matrix (partial credit allowed).
Task 3 (`fix_proposal`): weighted score from pattern match, patch applicability, and LLM judge.
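
The three scoring styles can be sketched as follows. This is a minimal illustration, not the graders' actual code: the category names, the similarity values, and the weights `(0.25, 0.25, 0.5)` are all hypothetical, since the real matrix and weighting are not shown on this page.

```python
# Hypothetical similarity values between root-cause categories.
SIMILARITY = {
    ("async_wait", "concurrency"): 0.5,
    ("network", "time"): 0.25,
}


def grade_classify(predicted, expected):
    """Task 1: exact match, full credit or none."""
    return 1.0 if predicted == expected else 0.0


def grade_root_cause(predicted, expected):
    """Task 2: partial credit looked up in a symmetric similarity table."""
    if predicted == expected:
        return 1.0
    return SIMILARITY.get(
        (predicted, expected), SIMILARITY.get((expected, predicted), 0.0)
    )


def grade_fix_proposal(pattern, applicability, judge, weights=(0.25, 0.25, 0.5)):
    """Task 3: weighted sum of three component scores, each in [0, 1]."""
    w_pattern, w_apply, w_judge = weights
    return w_pattern * pattern + w_apply * applicability + w_judge * judge


print(grade_classify("flaky", "stable"))
print(grade_root_cause("concurrency", "async_wait"))
print(grade_fix_proposal(1.0, 1.0, 0.5))
```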

Run Configuration

Dataset Path

Episodes Per Task

1 episode(s) (range: 1-100)

Max Steps

20 step(s) (range: 1-100)

Task Types

Add task types from the dropdown; remove one by clicking the x on its chip.

Benchmark Label

Runtime ETA

~09m 00s

3 task(s) × 1 episode(s) × 180s/episode
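
The estimate is the product of the three factors above. A small sketch of the arithmetic, assuming the fixed 180 seconds per episode shown in the formula (the function names are illustrative):

```python
def estimate_runtime_seconds(num_tasks, episodes_per_task, seconds_per_episode=180):
    """Total runtime: tasks x episodes per task x seconds per episode."""
    return num_tasks * episodes_per_task * seconds_per_episode


def format_eta(total_seconds):
    """Render a number of seconds in the '~MMm SSs' style used by the console."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"~{minutes:02d}m {seconds:02d}s"


total = estimate_runtime_seconds(num_tasks=3, episodes_per_task=1)
print(total)              # 540
print(format_eta(total))  # ~09m 00s
```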

API Base URL (optional)

Model Name (optional)

API Key (optional)

Open API Docs

Tip: if no API key is provided, inference.py falls back to its heuristic agent.
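
The fallback described in the tip could look roughly like this. It is a hedged sketch only: `select_agent` and the returned labels are hypothetical, and the actual logic in `inference.py` may differ.

```python
from typing import Optional


def select_agent(api_key: Optional[str]) -> str:
    """Pick the model-backed agent only when a non-empty API key is supplied."""
    if api_key and api_key.strip():
        return "llm"
    # No usable key: fall back to the heuristic agent.
    return "heuristic"


print(select_agent(None))
print(select_agent("example-key"))
```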

Run Status

idle

Job ID: -

Return Code: -

Started: -

Finished: -

Live Logs (0 lines)