FlakyGym Control Center
This console runs flaky-test benchmark episodes and streams live logs. Use it to configure runs, estimate runtime, and review grader outcomes quickly.
Quick Brief: Dataset + Graders
Dataset: dataset/py_tasks.csv
Each row is one flaky-test investigation task created from py-data.csv (repo + SHA + target test + labels + optional known fix diff).
Headers:
repo_url, sha, test_name, test_file,
category, label, status, pr_link,
task_types, test_code, known_fix_diff
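As a rough sketch, the rows could be read with Python's standard csv module. This assumes the file is comma-delimited with a header row; the helper name load_tasks and the sample row (including its category value) are made-up placeholders, not real data from py_tasks.csv.

```python
import csv
import io

# Column names as listed in the Headers section; their order is an assumption.
EXPECTED_COLUMNS = [
    "repo_url", "sha", "test_name", "test_file",
    "category", "label", "status", "pr_link",
    "task_types", "test_code", "known_fix_diff",
]


def load_tasks(handle):
    """Read task rows from an open CSV handle and verify the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


# Placeholder row for illustration only (hypothetical values).
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "https://example.com/repo,abc123,test_foo,tests/test_foo.py,"
    "async_wait,flaky,open,,classify,,\n"
)
tasks = load_tasks(sample)
```

For the real file, the same helper would be called on `open("dataset/py_tasks.csv", newline="")`. Optional fields such as known_fix_diff come back as empty strings when absent.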
3 Graders (short)
- Task 1 (`classify`): exact match on the flaky vs. stable label.
- Task 2 (`root_cause`): scored with a category similarity matrix, so a related but not identical category earns partial credit.
- Task 3 (`fix_proposal`): weighted score combining pattern match, patch applicability, and an LLM judge.
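The three scoring styles can be sketched as follows. This is an illustration only: the category names, similarity values, and weights are hypothetical, since the actual matrix and weighting are not listed here.

```python
# Hypothetical similarity entries and weights; the real values are not given here.
CATEGORY_SIMILARITY = {
    ("async_wait", "concurrency"): 0.5,
}
FIX_WEIGHTS = {"pattern": 0.5, "applies": 0.25, "judge": 0.25}


def grade_classify(predicted, expected):
    """Task 1: full credit only for an exact label match."""
    return 1.0 if predicted == expected else 0.0


def grade_root_cause(predicted, expected):
    """Task 2: exact match scores 1.0; otherwise look up partial credit."""
    if predicted == expected:
        return 1.0
    # Treat the matrix as symmetric by checking both key orders.
    forward = CATEGORY_SIMILARITY.get((predicted, expected))
    if forward is not None:
        return forward
    return CATEGORY_SIMILARITY.get((expected, predicted), 0.0)


def grade_fix_proposal(pattern_score, patch_applies, judge_score):
    """Task 3: weighted sum of the three component scores, each in [0, 1]."""
    parts = {
        "pattern": pattern_score,
        "applies": 1.0 if patch_applies else 0.0,
        "judge": judge_score,
    }
    return sum(FIX_WEIGHTS[name] * parts[name] for name in FIX_WEIGHTS)
```

With these placeholder weights, a proposal that matches the pattern fully but neither applies cleanly nor satisfies the judge would score 0.5.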
Run Configuration
Tip: if no API key is provided, inference.py falls back to its heuristic agent.
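The fallback described in the tip might look roughly like this. The function name select_agent_mode and the mode strings are hypothetical and do not reflect the actual contents of inference.py; treating a blank or whitespace-only key as missing is also an assumption.

```python
def select_agent_mode(api_key):
    """Return "llm" when a usable API key is supplied, otherwise "heuristic"."""
    # Assumption: a None, empty, or whitespace-only key counts as "not provided".
    if api_key is None or not api_key.strip():
        return "heuristic"
    return "llm"
```

Under this sketch, `select_agent_mode("")` and `select_agent_mode(None)` both select the heuristic path, so a run can still start without credentials.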
Run Status
idle
Live Logs
0 lines
No run started yet.