sql_env / REVIEW_REPORT.md
hjerpe's picture
Upload folder using huggingface_hub
9e64e71 verified
# Code Review Report: F011 Step 1.3 (`notebooks/compare_methods.ipynb`)
**Risk Tier:** Medium
**Status:** Passed with Warnings
**Verdict:** APPROVE
## Summary
`LLMToolCallingPolicy` is implemented per Step 1.3 intent: it builds episode messages, uses chat-template tool calling, forces ANSWER at low budget, and falls back to `parse_error` on unparseable output. No correctness or security blockers were found in the scoped notebook change.
## Evidence
### Tests
- **Status:** Mixed (targeted checks passed; existing unrelated smoke failures persist)
- **Commands:**
- `uv run python - <<'PY' ... compile notebook cells ... PY`
- `uv run python - <<'PY' ... runtime checks for valid action / budget fallback / parse fallback ... PY`
- `uv run pytest tests/test_smoke.py -v`
- **Results:**
- Notebook code-cell compilation: passed (`Compiled 6 code cells successfully`)
- Policy runtime checks: passed (`QUERY` valid path, `ANSWER budget_exhausted`, `ANSWER parse_error`)
- Smoke tests: `21 passed, 4 failed` (pre-existing reward expectation mismatches in environment tests)
### Security (Medium)
- **Status:** Clear
- **Checks:** Medium-tier quick checks on parsing/generation fallback paths; no secret handling, auth, or privilege-sensitive paths added.
## Issues
### Critical
None.
### Important
None.
### Minor
1. **Episode reset heuristic is question-text based and can theoretically leak history if two consecutive episodes start with identical question text.**
- **Location:** `notebooks/compare_methods.ipynb:313-316`
- **Recommendation:** Consider adding a stronger episode boundary signal (e.g., explicit wrapper reset hook or observation-based reset trigger).
## Next Actions
1. Proceed to Step 1.4.
2. Optionally harden reset boundary logic before large-scale eval runs.