---
license: apache-2.0
language:
- en
- vi
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- text-generation-inference
---

# Text-to-SQL Evaluation Pipeline

## Overview

This repository implements a **Text-to-SQL evaluation pipeline** using the OpenAI API. The system is designed for the **banking domain**, where strict business rules and deterministic SQL generation are required.

Key goals:

- Translate **natural language questions** into **valid SQLite SQL queries**
- Enforce **domain-specific constraints** via prompt engineering
- Benchmark model outputs using **multiple evaluation metrics**

---
## Code Preservation Policy

> **Important:** The existing `MyModel` class and its formatting are **kept exactly as-is**.\
> This project does **not** modify, refactor, or reformat the demo code. The README only documents how the current implementation works.

- The `MyModel` class structure, method names, and prompt formatting remain unchanged
- No code auto-formatting or refactoring is applied
- All behavior described below reflects the **original demo code**

## Key Features

- Deterministic SQL generation (`temperature = 0`)
- Strong prompt constraints (no markdown, no explanations, SQL only)
- Banking-specific metric grouping and unit conversion logic
- Multi-metric evaluation for both syntactic and semantic quality

---

## High-level Architecture

```text
User Question
      │
      ▼
Prompt Builder (System + User)
      │
      ▼
OpenAI ChatCompletion API
      │
      ▼
Generated SQL
      │
      ▼
Evaluation Metrics
```

---

## Suggested Project Structure

```text
.
├── main.py          # Entry point, runs inference and prints metrics
├── model.py         # OpenAI client wrapper (MyModel)
├── evaluator.py     # Evaluation metrics implementation
├── prompts/
│   └── text2sql.txt # System prompt with banking rules
├── README.md
└── requirements.txt
```

---

## Prompt Design

### System Prompt Responsibilities

The system prompt enforces the following rules:

- **Always perform** an `INNER JOIN` between `system_data` and `branch`
- **Always SELECT** the following columns:
  - `system_data.data_code`
  - `system_data.year`
  - `system_data.branch_id`
  - `branch.name`
  - `system_data.value`
- SQL keywords must be **UPPERCASE**
- Text filters must use `LIKE '%keyword%'`
- Vietnamese location names must use **exact accents**
- Output **SQL only** (no markdown, no explanations)
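
The structural rules above lend themselves to a mechanical spot-check before running the heavier metrics. The sketch below is illustrative only; the `check_sql_rules` helper is not part of the repository:

```python
import re

# Columns that rule 2 of the system prompt requires in every query.
REQUIRED_COLUMNS = [
    "system_data.data_code",
    "system_data.year",
    "system_data.branch_id",
    "branch.name",
    "system_data.value",
]

def check_sql_rules(sql: str) -> list[str]:
    """Return a list of rule violations found in a generated query."""
    violations = []
    if "INNER JOIN" not in sql:
        violations.append("missing INNER JOIN between system_data and branch")
    for col in REQUIRED_COLUMNS:
        if col not in sql:
            violations.append(f"missing required column: {col}")
    # Case-sensitive search: lowercase keywords indicate an UPPERCASE-rule violation.
    if re.search(r"\b(select|from|where|join)\b", sql):
        violations.append("SQL keywords must be UPPERCASE")
    return violations
```

A check like this only catches surface-level violations; semantic correctness still requires the evaluation metrics described in this README.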

### Metric Grouping Logic

Metrics are classified by `metric_code` prefix:

| Group | Description                                               |
| ----- | --------------------------------------------------------- |
| A     | Inbound metrics (`MET_A_%`)                               |
| B     | Outbound metrics (`MET_B_%`)                              |
| C     | Stock / snapshot metrics (`MET_C_%`)                      |
| D     | Exposure / obligation metrics (`MET_D_%`, incl. `MET_D_TOTAL_%`, `MET_D_SPECIAL`) |
| E     | Resource mobilization metrics (`MET_E_%`)                 |
| F     | Ratio & efficiency metrics (no prefix rule in the prompt) |

### Unit Conversion Rule

- Stored unit: **Million VND**
- If the question mentions **"Billion VND"** ("Tỷ" in Vietnamese), multiply the value by `1000`
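
Applied literally, the documented rule looks like this. The `apply_unit_rule` helper and its keyword check are illustrative assumptions, not repository code:

```python
def apply_unit_rule(value: float, question: str) -> float:
    """Apply the documented rule verbatim: values are stored in Million VND;
    if the question mentions Billion VND ('Tỷ'), multiply by 1000."""
    if "Tỷ" in question or "Billion" in question:
        return value * 1000
    return value
```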

---

## Example Input (Schema)

The example schema below uses anonymized table names (`entity_a`, `entity_b`); the system-prompt rules refer to the corresponding production tables `system_data` and `branch`.

```sql
CREATE TABLE entity_a (
    id INTEGER,
    group_id INTEGER,
    org_id INTEGER,
    code VARCHAR(100),
    name VARCHAR(255),
    attr_1 VARCHAR(255),
    attr_2 VARCHAR(255),
    attr_3 TEXT
);

CREATE TABLE entity_b (
    id INTEGER,
    group_id INTEGER,
    entity_a_id INTEGER,
    time_key INTEGER,
    metric_name VARCHAR(255),
    metric_code VARCHAR(100),
    metric_value REAL,
    metric_unit VARCHAR(100)
);
```

---

## Evaluation Metrics

Evaluation results are printed **at the very top of the output**, for example:

| Label          | Value |
| -------------- | ----- |
| rouge          | 0.96  |
| meteor         | 0.95  |
| binary         | 0.65  |
| llm-as-a-judge | 0.82  |

### Metric Definitions

- **ROUGE**: Token-level overlap between generated and reference SQL
- **METEOR**: Semantic similarity with synonym awareness
- **Binary Match**: Exact string match (0 or 1)
- **LLM-as-a-Judge**: LLM-based holistic judgment of correctness
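
The two string-based metrics are simple enough to sketch; ROUGE and METEOR would typically come from dedicated libraries. The helpers below are illustrative only, not the repository's `evaluator.py`:

```python
def binary_match(pred: str, ref: str) -> int:
    """Exact match (0 or 1) after collapsing whitespace."""
    return int(" ".join(pred.split()) == " ".join(ref.split()))

def token_overlap(pred: str, ref: str) -> float:
    """Rough unigram-overlap F1 between two SQL strings (ROUGE-1 flavour)."""
    p, r = set(pred.upper().split()), set(ref.upper().split())
    if not p or not r:
        return 0.0
    common = len(p & r)
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)
```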

---

## Full Demo Code (Kept Exactly As-Is)

The following is the **original demo code**, included verbatim for clarity and ease of understanding. No refactoring, no reformatting, and no behavioral changes have been applied.

```python
import argparse

from openai import OpenAI

DEFAULT_QUESTION = """CREATE TABLE entity_a (
    id INTEGER,
    group_id INTEGER,
    org_id INTEGER,
    code VARCHAR(100),
    name VARCHAR(255),
    attr_1 VARCHAR(255),
    attr_2 VARCHAR(255),
    attr_3 TEXT
);
CREATE TABLE entity_b (
    id INTEGER,
    group_id INTEGER,
    entity_a_id INTEGER,
    time_key INTEGER,
    metric_name VARCHAR(255),
    metric_code VARCHAR(100),
    metric_value REAL,
    metric_unit VARCHAR(100)
);
ENTITIES = {
    "metric": {
        "metric_code": "METRIC_X",
        "metric_unit": "UNIT_A"
    },
    "entity_a_field": {
        "attr_1": [],
        "attr_2": [],
        "attr_3": [],
        "id": []
    },
    "time_key": [year],
    Query:
}
"""


class MyModel(object):
    def __init__(self, model_name: str, api_key: str):
        self.model_name = model_name
        self.client = OpenAI(base_url="", api_key=api_key)

    def get_prompt(
        self,
        question: str,
    ) -> list[dict[str, str]]:
        return [
            {
                "role": "system",
                "content": """
You are a problem solving model working on task_description XML block:
<task_description>You are a specialized Text-to-SQL assistant in the banking domain. Your objective is to translate natural language questions into valid SQLite queries using the provided schema and banking business logic.
### Input:
- Schema: Table definitions in SQL DDL format.
- Relationships: Key linking logic between tables (system_data.branch_id = branch.id).
- Data Content Context:
  Indicator_Categories:
    Group_A:
      description: Primary metrics – inbound type
      rule:
        - metric_code LIKE 'MET_A_%'

    Group_B:
      description: Primary metrics – outbound type
      rule:
        - metric_code LIKE 'MET_B_%'

    Group_C:
      description: Stock / snapshot metrics
      rule:
        - metric_code LIKE 'MET_C_%'

    Group_D:
      description: Exposure / obligation related metrics
      rule:
        - metric_code LIKE 'MET_D_%'
        - metric_code LIKE 'MET_D_TOTAL_%'
        - metric_code = 'MET_D_SPECIAL'

    Group_E:
      description: Resource mobilization metrics
      rule:
        - metric_code LIKE 'MET_E_%'

    Group_F:
      description: Ratio & efficiency indicators
      rule:
- Unit Logic: {Which dmain} data is stored in 'Triệu VND'. If the Question mentions 'Tỷ', multiply the value by 1000.
- Entities: Extracted key information including data_code, year, and branch filtering criteria.
### Rules:
1. ALWAYS perform an INNER JOIN between system_data and branch on system_data.branch_id = branch.id.
2. ALWAYS SELECT system_data.data_code, system_data.year, system_data.branch_id, branch.name, system_data.value.
3. Use exact Vietnamese accents for location values.
4. Use LIKE '%keyword%' for text matching.
5. Use UPPERCASE for SQL keywords.
6. Output ONLY the SQL query. No explanations or markdown blocks.</task_description>
You will be given a single task in the question XML block
Solve only the task in question block.
Generate only the answer, do not generate anything else
""",
            },
            {
                "role": "user",
                "content": f"""
Now for the real task, solve the task in question block.
Generate only the solution, do not generate anything else
<question>{question}</question>
""",
            },
        ]

    def invoke(self, question: str) -> str:
        chat_response = self.client.chat.completions.create(
            model=self.model_name,
            messages=self.get_prompt(question),
            temperature=0,
            reasoning_effort="none",
        )
        return chat_response.choices[0].message.content


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--question", type=str, default=DEFAULT_QUESTION, required=False)
    parser.add_argument("--api-key", type=str, default="", required=False)
    parser.add_argument("--model", type=str, default="model", required=False)
    args = parser.parse_args()
    client = MyModel(model_name=args.model, api_key=args.api_key)
    print(client.invoke(args.question))
```

---

## How to Run

```bash
python main.py \
  --question "<QUESTION_TEXT>" \
  --api-key "YOUR_OPENAI_API_KEY" \
  --model "gpt-4.1-mini"
```

---

## Important Notes

- `temperature = 0` ensures reproducible results
- Function calling is intentionally avoided to prevent JSON-wrapped SQL
- The prompt is optimized for the **SQLite dialect**
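
Even with the "SQL only" rule in the prompt, chat models occasionally wrap answers in markdown fences anyway; a defensive post-processing step (illustrative, not part of the demo code) could strip them before execution:

```python
def strip_markdown_fences(output: str) -> str:
    """Remove an accidental ``` fence wrapper so downstream code sees raw SQL."""
    text = output.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        lines = lines[1:]  # drop the opening fence (possibly ```sql)
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]  # drop the closing fence
        text = "\n".join(lines).strip()
    return text
```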

---

## Possible Extensions

- Multi-year queries using `IN` or ranges
- Queries combining multiple metric groups
- Execution-based evaluation (SQL result comparison)
- Support for additional SQL dialects (PostgreSQL, MySQL)