---
language:
- ko
license: apache-2.0
tags:
- task-specific
- structured-prediction
- korean
- public-sector
- qwen3
- domain-specific
- merge
base_model: Qwen/Qwen3-4B
datasets: []
pipeline_tag: text-generation
model-index:
- name: DLM-NL2JSON-4B
results:
- task:
type: structured-prediction
name: Korean NL-to-JSON Schema Extraction
dataset:
type: custom
name: Busan Public Data Query Test Set
args:
num_samples: 2041
metrics:
- type: exact_match
value: 94.4
name: Exact Match Accuracy (raw)
- type: exact_match
value: 96.8
name: Exact Match Accuracy (adjusted)
---
# DLM-NL2JSON-4B
**A 4B-parameter service-specific LLM that outperforms GPT-4o (+14%p) and Qwen3.5-35B (+22%p) on structured JSON extraction from Korean natural language queries.**
DLM (Domain-specific Language Model) is a series of task-specialized models by [Data Science Lab., Ltd.](https://huggingface.co/dataslab). This model is a LoRA-merged Qwen3-4B fine-tuned for structured JSON extraction in the Busan Metropolitan City public data analytics service.
## Key Results
Evaluated on 2,041 test samples across 10 task categories (field-level exact match, summary excluded):
| Model | Params | Accuracy | Accuracy (adj*) | Avg Latency |
|-------|--------|----------|-----------------|-------------|
| **DLM-NL2JSON-4B** | **4B** | **94.4%** | **96.8%** | 2.59s |
| GPT-4o | ~200B+ | 80.5% | 82.5% | 1.58s |
| Qwen3.5-35B-A3B | 35B | 72.2% | 73.9% | 0.85s |
*\*adj: 64 CSM samples with known gold label noise excluded (see Evaluation section)*
### Per-Category Breakdown
| Category | N | DLM-NL2JSON-4B | GPT-4o | Qwen3.5-35B |
|----------|---|-------------|--------|-------------|
| ALP-A (population pattern) | 250 | **99.6%** | 56.0% | 47.6% |
| ALP-B (population flow) | 250 | **98.4%** | 50.4% | 46.8% |
| CSM (consumer spending) | 700 | **90.6%** | 90.1% | 86.1% |
| CREDIT-Income | 58 | **94.8%** | 53.4% | 34.5% |
| CREDIT-Spending | 77 | **97.4%** | 92.2% | 51.9% |
| CREDIT-Loan/Default | 73 | **98.6%** | 94.5% | 72.6% |
| CPI (business status) | 219 | 86.3% | **87.2%** | 54.8% |
| GIS-Inflow | 72 | **97.2%** | 79.2% | 93.1% |
| GIS-Outflow | 62 | **98.4%** | 77.4% | 98.4% |
| GIS-Consumption | 280 | 98.2% | **99.6%** | 97.5% |
DLM-NL2JSON-4B leads in **8 of 10 categories** (tying Qwen3.5-35B on GIS-Outflow), with the largest gains over GPT-4o on ALP (+43.6%p on ALP-A, +48.0%p on ALP-B) and CREDIT-Income (+41.4%p).
## Important: This is a Service-Specific Model
> **This model is NOT a general-purpose NL-to-JSON converter.** It is trained exclusively for a fixed set of predefined schemas used in a specific production service. It will not generalize to arbitrary JSON schemas or different prompt formats.
To use this model correctly, you **must**:
1. Use the **exact system prompts** it was trained on (one per task category β€” see Usage section)
2. Include the corresponding **special token** (`<TASK_CSM>`, `<TASK_CREDIT>`, `<TASK_GIS>`, `<TASK_ALP>`, `<TASK_CPI>`) in the input
3. Expect output conforming only to the **predefined schemas** listed below
**Why publish a service-specific model?** This model serves as a reference implementation demonstrating that **task-specific LoRA fine-tuning on a 4B model can dramatically outperform GPT-4o and larger open-source models** on constrained structured output tasks. We believe the DLM (Domain-specific Language Model) approach β€” training small, cheap-to-serve models for specific service endpoints β€” is an underexplored but highly practical paradigm.
## Intended Use
This model converts **Korean natural language queries about public/economic data** into **structured JSON** conforming to its predefined schemas. It is designed for and deployed in the **Busan Metropolitan City Big Data Wave** analytics dashboard.
**Input**: Free-form Korean query + task-specific system prompt
**Output**: Single-line JSON with exact schema compliance:
```json
{"summary":"##2025λ…„ 5μ›” λΆ€μ‚°κ΄‘μ—­μ‹œ ν•΄μš΄λŒ€κ΅¬ μœ ν†΅/의료 μ†ŒλΉ„λΆ„μ„##","base_ym":202505,"region_nm":"λΆ€μ‚°κ΄‘μ—­μ‹œ ν•΄μš΄λŒ€κ΅¬","industry_select":{"3":[],"8":[]},"sex_cd":[1],"age_cd":[30],"category":2}
```
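Because the model emits exactly one JSON line, downstream code can parse and sanity-check it directly. A minimal sketch, using the CSM key set from the example above (the validation logic is illustrative, not the service's actual validator):

```python
import json

# Expected key order for the CSM schema, taken from the example output above.
CSM_KEYS = ["summary", "base_ym", "region_nm", "industry_select",
            "sex_cd", "age_cd", "category"]

def parse_csm(line: str) -> dict:
    """Parse one line of model output and verify the CSM key set and order."""
    obj = json.loads(line)
    assert list(obj) == CSM_KEYS, f"unexpected keys: {list(obj)}"
    assert obj["category"] == 2  # CSM outputs always use category 2
    assert isinstance(obj["base_ym"], int)
    return obj

out = ('{"summary":"##...##","base_ym":202505,"region_nm":"...",'
       '"industry_select":{"3":[],"8":[]},"sex_cd":[1],"age_cd":[30],"category":2}')
print(parse_csm(out)["base_ym"])  # 202505
```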
### Task Categories
| ID | Name | Schema Type |
|----|------|-------------|
| 0 | ALP-A | Population pattern (ptrn: residence/work/visit) |
| 1 | ALP-B | Population flow (flow_cd: inflow/outflow) |
| 2 | CSM | Consumer spending by industry |
| 3 | CREDIT-Income | Income statistics |
| 4 | CREDIT-Spending | Spending statistics |
| 5 | CREDIT-Loan | Loan/default statistics |
| 6 | CPI | Business/enterprise status |
| 9 | GIS-Inflow | Geographic inflow analysis |
| 10 | GIS-Outflow | Geographic outflow analysis |
| 11 | GIS-Consumption | Geographic consumption analysis |
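Each category maps onto one of the five special tokens, which must prefix the user query. A minimal dispatch sketch (the table is derived from the category and token lists above; function names are illustrative):

```python
# Map task-category IDs (from the table above) to the model's special tokens.
TASK_TOKEN = {
    0: "<TASK_ALP>", 1: "<TASK_ALP>",
    2: "<TASK_CSM>",
    3: "<TASK_CREDIT>", 4: "<TASK_CREDIT>", 5: "<TASK_CREDIT>",
    6: "<TASK_CPI>",
    9: "<TASK_GIS>", 10: "<TASK_GIS>", 11: "<TASK_GIS>",
}

def tag_query(task_id: int, query: str) -> str:
    """Prefix the raw Korean query with the task's special token."""
    return f"{TASK_TOKEN[task_id]} {query}"

print(tag_query(2, "2024년 1월 해운대구 소비분석"))
# <TASK_CSM> 2024년 1월 해운대구 소비분석
```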
## Training Details
| Item | Value |
|------|-------|
| Base model | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| Method | LoRA SFT β†’ merged full model |
| Training samples | 16,292 (Korean) |
| Validation samples | 2,034 |
| Special tokens | `<TASK_CSM>`, `<TASK_CREDIT>`, `<TASK_GIS>`, `<TASK_ALP>`, `<TASK_CPI>` |
| Max sequence length | 6,144 |
| Architecture | Qwen3ForCausalLM (36 layers, 2560 hidden, 32 heads) |
Training data consists of synthetically generated Korean natural language queries paired with structured JSON outputs, covering the Busan public data analytics domain.
## Evaluation Methodology
- **Metric**: Field-level exact match β€” each JSON key's value is compared against the gold label. The `summary` field is excluded from comparison.
- **Test set**: 2,041 samples, stratified by category
- **Gold label noise**: 64/700 CSM samples have `age_cd` capped at `[10..60]` instead of `[10..70]` for "all ages" queries, conflicting with the prompt specification. These affect all models equally and are excluded in the adjusted metric.
- **Train/Test overlap**: 16/2,041 input strings (0.78%) appear in both sets β€” retained for consistency.
- **All models** received identical system prompts per category.
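The field-level exact-match metric described above can be sketched as follows (a minimal reimplementation for illustration; the actual evaluation harness is not published):

```python
import json

def field_exact_match(pred_line: str, gold_line: str) -> bool:
    """Compare every JSON field except `summary`; a parse failure counts as a miss."""
    try:
        pred, gold = json.loads(pred_line), json.loads(gold_line)
    except json.JSONDecodeError:
        return False
    drop = {"summary"}  # excluded from comparison per the methodology above
    return ({k: v for k, v in pred.items() if k not in drop}
            == {k: v for k, v in gold.items() if k not in drop})

gold = '{"summary":"x","base_ym":202401,"age_cd":[20,30,40]}'
pred = '{"summary":"y","base_ym":202401,"age_cd":[20,30,40]}'
print(field_exact_match(pred, gold))  # True (summary differences are ignored)
```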
### Hardware
| Model | Serving | GPU |
|-------|---------|-----|
| DLM-NL2JSON-4B | TensorRT-LLM | NVIDIA L4 24GB |
| GPT-4o | OpenAI API | N/A |
| Qwen3.5-35B-A3B | vLLM | NVIDIA A6000 48GB |
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "dataslab/DLM-NL2JSON-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
# System prompt (example: CSM consumer spending schema — abbreviated for readability)
# Full prompts per category are available in the repository's eval/prompts.py
# (The Korean prompt below instructs: output exactly one line of JSON — no prose,
#  markdown, code blocks, emoji, or blank lines; follow the TASK_CSM key/type/order
#  strictly; apply the listed defaults; map industries via the major-category code table.)
system_prompt = """λ„ˆλŠ” λ°˜λ“œμ‹œ **JSON ν•œ 쀄**만 좜λ ₯ν•œλ‹€. μ„€λͺ…/ν…μŠ€νŠΈ/μ½”λ©˜νŠΈ/λ§ˆν¬λ‹€μš΄/μ½”λ“œλΈ”λ‘/이λͺ¨μ§€/곡백 쀄 κΈˆμ§€.
좜λ ₯은 항상 { 둜 μ‹œμž‘ν•˜κ³  } 둜 λλ‚œλ‹€.
[μŠ€ν‚€λ§ˆ: TASK_CSM] (ν‚€/νƒ€μž…/μˆœμ„œ μ—„μˆ˜)
{"summary":string,"base_ym":int,"region_nm":string,"industry_select":object,"sex_cd":[int],"age_cd":[int],"category":2}
[κΈ°λ³Έκ°’]
- base_ym: 0, region_nm: "λΆ€μ‚°κ΄‘μ—­μ‹œ"
- industry_select: μ—…μ’… λ―Έμ§€μ • μ‹œ μ „ λŒ€λΆ„λ₯˜ ν‚€λ₯Ό []둜 μ„€μ •
- sex_cd: [0,1], age_cd: [10,20,30,40,50,60,70]
- category: 항상 2
[λŒ€λΆ„λ₯˜ μ½”λ“œν‘œ] 1:μ—¬ν–‰/μˆ™λ°• 2:μ—¬κ°€/λ¬Έν™” 3:μœ ν†΅ 4:μŒμ‹/주점 5:μŒμ‹λ£Œν’ˆ
6:의λ₯˜/μž‘ν™” 7:미용 8:의료 9:ꡐ윑 10:μƒν™œ 11:μžλ™μ°¨"""
# Note: special token <TASK_CSM> must be included in the user message
user_query = "<TASK_CSM> 2024년 1월 해운대구 중동 의류/잡화랑 뷰티 쪽 남성 20~40대 위주로 알려줘"
# ("For Jan 2024, Jung-dong, Haeundae-gu: clothing/accessories and beauty,
#  mainly men in their 20s-40s")
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_query}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # disable Qwen3 thinking mode; reasoning tokens would truncate the JSON
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
# {"summary":"##2024λ…„ 1μ›” λΆ€μ‚°κ΄‘μ—­μ‹œ ν•΄μš΄λŒ€κ΅¬ 쀑동 의λ₯˜/μž‘ν™”/미용 μ†ŒλΉ„λΆ„μ„##","base_ym":202401,"region_nm":"λΆ€μ‚°κ΄‘μ—­μ‹œ ν•΄μš΄λŒ€κ΅¬ 쀑동","industry_select":{"6":[],"7":[]},"sex_cd":[0],"age_cd":[20,30,40],"category":2}
# Note: "λ·°ν‹°" β†’ mapped to 미용(code 7), "ν•΄μš΄λŒ€κ΅¬ 쀑동" β†’ normalized to "λΆ€μ‚°κ΄‘μ—­μ‹œ ν•΄μš΄λŒ€κ΅¬ 쀑동"
```
### vLLM / OpenAI-compatible serving
```python
from openai import OpenAI
client = OpenAI(base_url="http://your-server:8006/v1", api_key="token")
resp = client.chat.completions.create(
model="DLM-NL2JSON-4B",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": "<TASK_CSM> 2024λ…„ 1μ›” ν•΄μš΄λŒ€κ΅¬ 쀑동 의λ₯˜/μž‘ν™”λž‘ λ·°ν‹° μͺ½ 남성 20~40λŒ€ μœ„μ£Όλ‘œ μ•Œλ €μ€˜"}
],
max_tokens=512,
temperature=0.0,
extra_body={"chat_template_kwargs": {"enable_thinking": False}} # disable thinking mode
)
print(resp.choices[0].message.content)
```
> **Important**: When serving with vLLM/TensorRT-LLM, pass `chat_template_kwargs: {"enable_thinking": false}` to disable the Qwen3 thinking mode. Otherwise, reasoning tokens will consume the output budget and truncate the JSON.
## Known Limitations
1. **CPI category** (86.3%) is the weakest β€” complex industry classification codes (A~U with sub-codes) are harder to extract.
2. **CSM training data noise**: ~8% of CSM training samples have `age_cd` capped at 60 instead of 70 for "all ages" queries, introducing inconsistency.
3. **Domain-specific only**: This model is trained exclusively for the Busan public data schema extraction task. It has no general-purpose capabilities and should not be used as a general chatbot.
4. **Korean only**: All training data and prompts are in Korean.
## Citation
If you use this model, please cite:
```bibtex
@misc{dsl-dlm-nl2json-4b,
title={DLM-NL2JSON-4B: A Domain-Specific Language Model for Korean Public Data Schema Extraction},
author={Data Science Lab., Ltd.},
year={2026},
url={https://huggingface.co/dataslab/DLM-NL2JSON-4B}
}
```
## Contact
- **Organization**: Data Science Lab., Ltd.
- **Project**: Busan Metropolitan City Big Data Wave