---
license: apache-2.0
language:
- en
- vi
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- text-generation-inference
---

# Text-to-SQL Evaluation Pipeline

## Overview

This repository implements a **Text-to-SQL evaluation pipeline** using the OpenAI API. The system is designed for the **banking domain**, where strict business rules and deterministic SQL generation are required.

Key goals:

- Translate **natural language questions** into **valid SQLite SQL queries**
- Enforce **domain-specific constraints** via prompt engineering
- Benchmark model outputs using **multiple evaluation metrics**

---
## Code Preservation Policy

> **Important:** The existing `MyModel` class and its formatting are **kept exactly as-is**.\
> This project does **not** modify, refactor, or reformat the demo code. The README only documents how the current implementation works.

- The `MyModel` class structure, method names, and prompt formatting remain unchanged
- No code auto-formatting or refactoring is applied
- All behavior described below reflects the **original demo code**

## Key Features

- Deterministic SQL generation (`temperature = 0`)
- Strong prompt constraints (no markdown, no explanations, SQL only)
- Banking-specific metric grouping and unit conversion logic
- Multi-metric evaluation for both syntactic and semantic quality

---

## High-level Architecture

```text
User Question
      │
      ▼
Prompt Builder (System + User)
      │
      ▼
OpenAI ChatCompletion API
      │
      ▼
Generated SQL
      │
      ▼
Evaluation Metrics
```

---

## Suggested Project Structure

```text
.
├── main.py          # Entry point, runs inference and prints metrics
├── model.py         # OpenAI client wrapper (MyModel)
├── evaluator.py     # Evaluation metrics implementation
├── prompts/
│   └── text2sql.txt # System prompt with banking rules
├── README.md
└── requirements.txt
```

---

## Prompt Design

### System Prompt Responsibilities

The system prompt enforces the following rules:

- **Always perform** an `INNER JOIN` between `system_data` and `branch`
- **Always SELECT** the following columns:
  - `system_data.data_code`
  - `system_data.year`
  - `system_data.branch_id`
  - `branch.name`
  - `system_data.value`
- SQL keywords must be **UPPERCASE**
- Text filters must use `LIKE '%keyword%'`
- Vietnamese location names must use **exact accents**
- Output **SQL only** (no markdown, no explanations)
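
The structural rules above lend themselves to a mechanical spot-check before running the heavier metrics. The sketch below is illustrative only; the `check_sql_rules` helper is not part of the repository:

```python
import re

# Columns that rule 2 of the system prompt requires in every query.
REQUIRED_COLUMNS = [
    "system_data.data_code",
    "system_data.year",
    "system_data.branch_id",
    "branch.name",
    "system_data.value",
]

def check_sql_rules(sql: str) -> list[str]:
    """Return a list of rule violations found in a generated query."""
    violations = []
    if "INNER JOIN" not in sql:
        violations.append("missing INNER JOIN between system_data and branch")
    for col in REQUIRED_COLUMNS:
        if col not in sql:
            violations.append(f"missing required column: {col}")
    # Case-sensitive search: lowercase keywords indicate an UPPERCASE-rule violation.
    if re.search(r"\b(select|from|where|join)\b", sql):
        violations.append("SQL keywords must be UPPERCASE")
    return violations
```

A check like this only catches surface-level violations; semantic correctness still requires the evaluation metrics described in this README.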

### Metric Grouping Logic

Metrics are classified by `metric_code` prefix:

| Group | Description                                               |
| ----- | --------------------------------------------------------- |
| A     | Inbound metrics (`MET_A_%`)                               |
| B     | Outbound metrics (`MET_B_%`)                              |
| C     | Stock / snapshot metrics (`MET_C_%`)                      |
| D     | Exposure / obligation metrics (`MET_D_%`, incl. `MET_D_TOTAL_%`, `MET_D_SPECIAL`) |
| E     | Resource mobilization metrics (`MET_E_%`)                 |
| F     | Ratio & efficiency metrics (no prefix rule in the prompt) |

### Unit Conversion Rule

- Stored unit: **Million VND**
- If the question mentions **"Billion VND"** ("Tỷ" in Vietnamese), multiply the value by `1000`
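
Applied literally, the documented rule looks like this. The `apply_unit_rule` helper and its keyword check are illustrative assumptions, not repository code:

```python
def apply_unit_rule(value: float, question: str) -> float:
    """Apply the documented rule verbatim: values are stored in Million VND;
    if the question mentions Billion VND ('Tỷ'), multiply by 1000."""
    if "Tỷ" in question or "Billion" in question:
        return value * 1000
    return value
```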

---

## Example Input (Schema)

The example schema below uses anonymized table names (`entity_a`, `entity_b`); the system-prompt rules refer to the corresponding production tables `system_data` and `branch`.

```sql
CREATE TABLE entity_a (
    id INTEGER,
    group_id INTEGER,
    org_id INTEGER,
    code VARCHAR(100),
    name VARCHAR(255),
    attr_1 VARCHAR(255),
    attr_2 VARCHAR(255),
    attr_3 TEXT
);

CREATE TABLE entity_b (
    id INTEGER,
    group_id INTEGER,
    entity_a_id INTEGER,
    time_key INTEGER,
    metric_name VARCHAR(255),
    metric_code VARCHAR(100),
    metric_value REAL,
    metric_unit VARCHAR(100)
);
```

---

## Evaluation Metrics

Evaluation results are printed **at the very top of the output**, for example:

| Label          | Value |
| -------------- | ----- |
| rouge          | 0.96  |
| meteor         | 0.95  |
| binary         | 0.65  |
| llm-as-a-judge | 0.82  |

### Metric Definitions

- **ROUGE**: Token-level overlap between generated and reference SQL
- **METEOR**: Semantic similarity with synonym awareness
- **Binary Match**: Exact string match (0 or 1)
- **LLM-as-a-Judge**: LLM-based holistic judgment of correctness
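
The two string-based metrics are simple enough to sketch; ROUGE and METEOR would typically come from dedicated libraries. The helpers below are illustrative only, not the repository's `evaluator.py`:

```python
def binary_match(pred: str, ref: str) -> int:
    """Exact match (0 or 1) after collapsing whitespace."""
    return int(" ".join(pred.split()) == " ".join(ref.split()))

def token_overlap(pred: str, ref: str) -> float:
    """Rough unigram-overlap F1 between two SQL strings (ROUGE-1 flavour)."""
    p, r = set(pred.upper().split()), set(ref.upper().split())
    if not p or not r:
        return 0.0
    common = len(p & r)
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)
```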

---

## Full Demo Code (Kept Exactly As-Is)

The following is the **original demo code**, included verbatim for clarity and ease of understanding. No refactoring, no reformatting, and no behavioral changes have been applied.

```python
import argparse

from openai import OpenAI

DEFAULT_QUESTION = """CREATE TABLE entity_a (
    id INTEGER,
    group_id INTEGER,
    org_id INTEGER,
    code VARCHAR(100),
    name VARCHAR(255),
    attr_1 VARCHAR(255),
    attr_2 VARCHAR(255),
    attr_3 TEXT
);
CREATE TABLE entity_b (
    id INTEGER,
    group_id INTEGER,
    entity_a_id INTEGER,
    time_key INTEGER,
    metric_name VARCHAR(255),
    metric_code VARCHAR(100),
    metric_value REAL,
    metric_unit VARCHAR(100)
);
ENTITIES = {
    "metric": {
        "metric_code": "METRIC_X",
        "metric_unit": "UNIT_A"
    },
    "entity_a_field": {
        "attr_1": [],
        "attr_2": [],
        "attr_3": [],
        "id": []
    },
    "time_key": [year],
    Query:
}
"""


class MyModel(object):
    def __init__(self, model_name: str, api_key: str):
        self.model_name = model_name
        self.client = OpenAI(base_url="", api_key=api_key)

    def get_prompt(
        self,
        question: str,
    ) -> list[dict[str, str]]:
        return [
            {
                "role": "system",
                "content": """
You are a problem solving model working on task_description XML block:
<task_description>You are a specialized Text-to-SQL assistant in the banking domain. Your objective is to translate natural language questions into valid SQLite queries using the provided schema and banking business logic.
### Input:
- Schema: Table definitions in SQL DDL format.
- Relationships: Key linking logic between tables (system_data.branch_id = branch.id).
- Data Content Context:
  Indicator_Categories:
    Group_A:
      description: Primary metrics – inbound type
      rule:
        - metric_code LIKE 'MET_A_%'

    Group_B:
      description: Primary metrics – outbound type
      rule:
        - metric_code LIKE 'MET_B_%'

    Group_C:
      description: Stock / snapshot metrics
      rule:
        - metric_code LIKE 'MET_C_%'

    Group_D:
      description: Exposure / obligation related metrics
      rule:
        - metric_code LIKE 'MET_D_%'
        - metric_code LIKE 'MET_D_TOTAL_%'
        - metric_code = 'MET_D_SPECIAL'

    Group_E:
      description: Resource mobilization metrics
      rule:
        - metric_code LIKE 'MET_E_%'

    Group_F:
      description: Ratio & efficiency indicators
      rule:
- Unit Logic: {Which dmain} data is stored in 'Triệu VND'. If the Question mentions 'Tỷ', multiply the value by 1000.
- Entities: Extracted key information including data_code, year, and branch filtering criteria.
### Rules:
1. ALWAYS perform an INNER JOIN between system_data and branch on system_data.branch_id = branch.id.
2. ALWAYS SELECT system_data.data_code, system_data.year, system_data.branch_id, branch.name, system_data.value.
3. Use exact Vietnamese accents for location values.
4. Use LIKE '%keyword%' for text matching.
5. Use UPPERCASE for SQL keywords.
6. Output ONLY the SQL query. No explanations or markdown blocks.</task_description>
You will be given a single task in the question XML block
Solve only the task in question block.
Generate only the answer, do not generate anything else
""",
            },
            {
                "role": "user",
                "content": f"""
Now for the real task, solve the task in question block.
Generate only the solution, do not generate anything else
<question>{question}</question>
""",
            },
        ]

    def invoke(self, question: str) -> str:
        chat_response = self.client.chat.completions.create(
            model=self.model_name,
            messages=self.get_prompt(question),
            temperature=0,
            reasoning_effort="none",
        )
        return chat_response.choices[0].message.content


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--question", type=str, default=DEFAULT_QUESTION, required=False)
    parser.add_argument("--api-key", type=str, default="", required=False)
    parser.add_argument("--model", type=str, default="model", required=False)
    args = parser.parse_args()
    client = MyModel(model_name=args.model, api_key=args.api_key)
    print(client.invoke(args.question))
```

---

## How to Run

```bash
python main.py \
  --question "<QUESTION_TEXT>" \
  --api-key "YOUR_OPENAI_API_KEY" \
  --model "gpt-4.1-mini"
```

---

## Important Notes

- `temperature = 0` ensures reproducible results
- Function calling is intentionally avoided to prevent JSON-wrapped SQL
- The prompt is optimized for the **SQLite dialect**
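
Even with the "SQL only" rule in the prompt, chat models occasionally wrap answers in markdown fences anyway; a defensive post-processing step (illustrative, not part of the demo code) could strip them before execution:

```python
def strip_markdown_fences(output: str) -> str:
    """Remove an accidental ``` fence wrapper so downstream code sees raw SQL."""
    text = output.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        lines = lines[1:]  # drop the opening fence (possibly ```sql)
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]  # drop the closing fence
        text = "\n".join(lines).strip()
    return text
```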

---

## Possible Extensions

- Multi-year queries using `IN` or ranges
- Queries combining multiple metric groups
- Execution-based evaluation (SQL result comparison)
- Support for additional SQL dialects (PostgreSQL, MySQL)