Spaces:

hjerpe
/

sql_env

Sleeping

File size: 11,315 Bytes

9e64e71

# Demo: SQLEnv — Flat OpenEnv Environment with Action Dispatch

> **Generated:** 2026-02-28T14:26Z
> **Branch:** `refactor-openenv-tutorial-project-structure` @ `f28bfaa`
> **Environment:** Python 3.12.3, torch 2.2.2, MockTokenizer (no Ollama required)

---

## What This Branch Does

This branch refactors the `sql-env` project from a nested `envs/sql_env/` layout into the canonical flat `openenv init` structure, and integrates the `action-feature` branch's core action dispatch system.

The result: a working RL environment where an agent sends natural language messages (e.g. _"describe the students table"_), the environment classifies them into action types (describe/sample/query), dispatches to the appropriate handler, and returns tokenized observations for RL training. All of this runs without external services — `MockTokenizer` replaces HuggingFace tokenizers and Ollama failures are handled gracefully.

---

## Quickstart

```bash
git checkout refactor-openenv-tutorial-project-structure
uv sync
uv run pytest tests/ -v          # 21 tests, ~3.5s
```

**Prerequisites:** Python 3.11-3.12, `uv`.
**Optional:** Ollama with `llama3.2` for LLM-guided table selection (not needed for demo).

---

## Evidence

### 1. All 21 Tests Pass

```
$ uv run pytest tests/ -v

tests/test_smoke.py::TestModels::test_action_creation PASSED             [  4%]
tests/test_smoke.py::TestModels::test_action_with_tokens PASSED          [  9%]
tests/test_smoke.py::TestModels::test_observation_creation PASSED        [ 14%]
tests/test_smoke.py::TestModels::test_state_creation PASSED              [ 19%]
tests/test_smoke.py::TestEnvironment::test_instantiation PASSED          [ 23%]
tests/test_smoke.py::TestEnvironment::test_reset_returns_observation PASSED [ 28%]
tests/test_smoke.py::TestEnvironment::test_reset_with_empty_prompt PASSED [ 33%]
tests/test_smoke.py::TestEnvironment::test_reset_creates_new_episode PASSED [ 38%]
tests/test_smoke.py::TestEnvironment::test_step_describe PASSED          [ 42%]
tests/test_smoke.py::TestEnvironment::test_step_sample PASSED            [ 47%]
tests/test_smoke.py::TestEnvironment::test_tokens_grow_across_turns PASSED [ 52%]
tests/test_smoke.py::TestActionDetection::test_describe_keywords PASSED  [ 57%]
tests/test_smoke.py::TestActionDetection::test_sample_keywords PASSED    [ 61%]
tests/test_smoke.py::TestActionDetection::test_query_default PASSED      [ 66%]
tests/test_smoke.py::TestMessageToAction::test_creates_action PASSED     [ 71%]
tests/test_smoke.py::TestMessageToAction::test_appends_to_history PASSED [ 76%]
tests/test_smoke.py::TestMessageToAction::test_validates_input PASSED    [ 80%]
tests/test_smoke.py::TestClientSerialization::test_step_payload_serialization PASSED [ 85%]
tests/test_smoke.py::TestClientSerialization::test_parse_result_deserialization PASSED [ 90%]
tests/test_smoke.py::TestSchemaIntrospection::test_get_table_schema PASSED [ 95%]
tests/test_smoke.py::TestSchemaIntrospection::test_unknown_table PASSED  [100%]

============================== 21 passed in 3.56s ==============================
```

Tests cover: Pydantic models, environment lifecycle, action detection, message-to-action conversion, client tensor serialization, and schema introspection.

### 2. Lint and Format Clean

```
$ uv run ruff check .
All checks passed!

$ uv run ruff format --check .
14 files already formatted
```

### 3. Pydantic Model Contracts

```python
>>> from sql_env.models import SQLAction, SQLObservation, SQLState

SQLAction fields:    ['metadata', 'action_type', 'action_description', 'tokens']
SQLObservation fields: ['done', 'reward', 'metadata', 'messages', 'tokens']
SQLState fields:     ['episode_id', 'step_count', 'history_messages', 'history_tokens', 'current_action_type']
```

`SQLAction.tokens` and `SQLObservation.tokens` carry torch tensors. `SQLState.history_messages` / `history_tokens` accumulate the full conversation for RL context.

### 4. Action Type Detection

The environment classifies natural language messages into action types via keyword matching:

```
  [PASS] "describe the students table..." -> describe
  [PASS] "what columns does Course have..." -> describe
  [PASS] "show me the schema..." -> describe
  [PASS] "show me sample rows from students..." -> sample
  [PASS] "give me example data..." -> sample
  [PASS] "how many rows are in Courses..." -> sample
  [PASS] "find all students enrolled in CS101..." -> query
  [PASS] "select count(*) from students..." -> query
  [PASS] "what is the average score..." -> query
```

Keywords like "describe"/"schema"/"columns" trigger describe; "sample"/"example"/"rows" trigger sample; everything else defaults to query.

### 5. MockTokenizer Roundtrip

```python
>>> from server.test_sql_env import MockTokenizer
>>> tok = MockTokenizer()
>>> msg = [{'role': 'user', 'content': 'describe the students table'}]
>>> tokens = tok.apply_chat_template(msg, return_tensors='pt')
>>> tokens.shape
torch.Size([1, 27])
>>> tokens[0][:10].tolist()
[100, 101, 115, 99, 114, 105, 98, 101, 32, 116]
>>> tok.decode(tokens[0].tolist())
'describe the students table'
```

`MockTokenizer` encodes each character as `ord(c)` and decodes via `chr(t)`. Deterministic, no downloads, perfect for tests.

### 6. Schema Introspection

SQLAlchemy ORM models are introspected at runtime to produce natural language schema descriptions:

```python
>>> env._get_table_schema('Student')
Table 'Student' has the following columns:

- student_id: integer number
- student_details: text (up to 255 characters)

>>> env._get_table_schema('NonexistentTable')
Table 'NonexistentTable' not found in schema.
```

9 tables available: Address, Person, Student, Course, PersonAddress, StudentCourseRegistration, StudentCourseAttendance, Candidate, CandidateAssessment.

### 7. Full Environment Interaction (Mock Path)

A complete multi-turn episode with no external services:

```python
>>> from server.sql_environment import SQLEnvironment
>>> from server.test_sql_env import MockTokenizer
>>> env = SQLEnvironment(system_prompt='You are a helpful SQL assistant.', tokenizer=MockTokenizer())

>>> obs = env.reset()
>>> obs.messages    # 1 message (system prompt)
>>> obs.tokens.shape
torch.Size([32])
>>> obs.done
False
```

**Turn 1 — Describe:**
```python
>>> action = env.message_to_action({'role': 'user', 'content': 'describe the Student table'})
>>> action.action_type
'describe'
>>> obs = env.step(action)
>>> obs.messages[-1]
{'role': 'assistant', 'content': "Table 'Address' has the following columns:\n\n- address_id: integer number\n..."}
>>> obs.tokens.shape
torch.Size([91])
```

Without Ollama, the describe action falls back to the first table (Address). With Ollama, it would correctly select "Student".

**Turn 2 — Sample:**
```python
>>> action = env.message_to_action({'role': 'user', 'content': 'show me sample rows from Course'})
>>> action.action_type
'sample'
>>> obs = env.step(action)
>>> obs.messages[-1]['content']
"Here's a query to sample data from Address:\n\nSELECT * FROM Address LIMIT 10;"
>>> obs.tokens.shape
torch.Size([503])
```

**Turn 3 — Query (no Ollama):**
```python
>>> action = env.message_to_action({'role': 'user', 'content': 'find all students enrolled in CS101'})
>>> action.action_type
'query'
>>> obs = env.step(action)
>>> obs.messages[-1]['content']
'Error: Ollama returned status 404'
>>> obs.tokens.shape
torch.Size([1028])
```

The error is graceful — it becomes part of the conversation history. Token tensor grows monotonically across turns (32 -> 91 -> 503 -> 1028).

### 8. Client Serialization

`SQLEnvClient` converts tensors to lists for JSON WebSocket transport:

```python
>>> from sql_env.client import SQLEnvClient
>>> from sql_env.models import SQLAction
>>> import torch

>>> action = SQLAction(action_type='query', action_description='select * from students', tokens=torch.tensor([[1, 2, 3, 4, 5]]))
>>> payload = client._step_payload(action)
{
  'action_type': 'query',
  'action_description': 'select * from students',
  'tokens': [[1, 2, 3, 4, 5]],
  'metadata': {}
}
```

Tensor -> list on send, list -> tensor on receive. Symmetric roundtrip verified in tests.

### 9. Spider Question Data

```python
>>> import json
>>> data = json.load(open('data/questions/student_assessment.json'))
>>> len(data)
53
>>> data[0]['question']
'which course has most number of registered students?'
>>> data[0]['query']
'SELECT T1.course_name FROM courses AS T1 JOIN student_course_registrations AS T2 ON T1.course_id = T2.course_Id GROUP BY T1.course_id ORDER BY count(*) DESC LIMIT 1'
```

53 question-answer pairs from the Spider dataset's `student_assessment` database. Each entry has `db_id`, `query`, `question`, `query_toks`, `query_toks_no_value`, and `question_toks`.

---

## What Changed from `main`

| Area | Before (main) | After (this branch) |
|------|---------------|---------------------|
| **Layout** | `envs/sql_env/` nested | Flat root = package |
| **Build** | hatchling | setuptools |
| **Python** | 3.13 | 3.11-3.12 (torch compat) |
| **Models** | Structured obs (question, schema, result) | Chat-based obs (messages + tokens) |
| **Action** | `argument` field | `action_description` + `tokens` tensor |
| **Environment** | Scaffold stubs | Real SQLite + Ollama + keyword dispatch |
| **Client** | Basic EnvClient | Tensor <-> list serialization |
| **Data** | Empty .gitkeep dirs | 9 ORM models + 53 Spider questions |
| **Tests** | 0 | 21 (all passing) |
| **Empty dirs** | `training_pipeline/`, `submission_artifacts/` | Removed |

---

## Known Behaviors (Not Bugs)

1. **Ollama fallback:** Without Ollama, `_call_ollama_to_select_table()` falls back to the first table (`Address`). Query actions return `Error: Ollama returned status 404`. This is by design — the mock path is for dev/test, not production.

2. **`message_to_action()` mutates state:** It appends the message to `_state.history_messages` before tokenizing. This is intentional — the tokenizer needs the full conversation context.

3. **`MockTokenizer` in production code:** `server/app.py` imports `MockTokenizer` from `server/test_sql_env.py` when `transformers` is unavailable. This is the teammate's design for running without GPU dependencies.

---

## Verification Checklist

- [x] `uv sync` succeeds (all deps install)
- [x] `uv run pytest tests/ -v` — 21/21 pass
- [x] `uv run ruff check .` — all checks passed
- [x] `uv run ruff format --check .` — 14 files formatted
- [x] Pydantic models import from `sql_env.models`
- [x] Environment instantiates with MockTokenizer
- [x] `reset()` returns valid SQLObservation with system prompt
- [x] Action detection: 9/9 keyword classifications correct
- [x] `message_to_action()` creates typed SQLAction with tokens
- [x] `step(describe)` returns schema from SQLAlchemy introspection
- [x] `step(sample)` returns SQL query text
- [x] `step(query)` returns graceful error without Ollama
- [x] Multi-turn conversation state grows correctly
- [x] Client tensor <-> list serialization roundtrips
- [x] Spider data loads (53 questions)

---

## What's Next

**Phase 3:** Reward computation (`server/reward.py`) and answer verification (`server/verifier.py`). Both are currently stubs.

---

*All output captured live on 2026-02-28. Reproduce with `uv sync && uv run pytest tests/ -v`.*