---
language:
- ko
license: apache-2.0
tags:
- task-specific
- structured-prediction
- korean
- public-sector
- qwen3
- domain-specific
- merge
base_model: Qwen/Qwen3-4B
datasets: []
pipeline_tag: text-generation
model-index:
- name: DLM-NL2JSON-4B
results:
- task:
type: structured-prediction
name: Korean NL-to-JSON Schema Extraction
dataset:
type: custom
name: Busan Public Data Query Test Set
args:
num_samples: 2041
metrics:
- type: exact_match
value: 94.4
name: Exact Match Accuracy (raw)
- type: exact_match
value: 96.8
name: Exact Match Accuracy (adjusted)
---
# DLM-NL2JSON-4B
**A 4B-parameter service-specific LLM that outperforms GPT-4o (+14%p) and Qwen3.5-35B (+22%p) on structured JSON extraction from Korean natural language queries.**
DLM (Domain-specific Language Model) is a series of task-specialized models by [Data Science Lab., Ltd.](https://huggingface.co/dataslab). This model is a LoRA-merged Qwen3-4B fine-tuned for structured JSON extraction in the Busan Metropolitan City public data analytics service.
## Key Results
Evaluated on 2,041 test samples across 10 task categories (field-level exact match, summary excluded):
| Model | Params | Accuracy | Accuracy (adj*) | Avg Latency |
|-------|--------|----------|-----------------|-------------|
| **DLM-NL2JSON-4B** | **4B** | **94.4%** | **96.8%** | 2.59s |
| GPT-4o | ~200B+ | 80.5% | 82.5% | 1.58s |
| Qwen3.5-35B-A3B | 35B | 72.2% | 73.9% | 0.85s |
*\*adj: 64 CSM samples with known gold label noise excluded (see Evaluation section)*
### Per-Category Breakdown
| Category | N | DLM-NL2JSON-4B | GPT-4o | Qwen3.5-35B |
|----------|---|-------------|--------|-------------|
| ALP-A (population pattern) | 250 | **99.6%** | 56.0% | 47.6% |
| ALP-B (population flow) | 250 | **98.4%** | 50.4% | 46.8% |
| CSM (consumer spending) | 700 | **90.6%** | 90.1% | 86.1% |
| CREDIT-Income | 58 | **94.8%** | 53.4% | 34.5% |
| CREDIT-Spending | 77 | **97.4%** | 92.2% | 51.9% |
| CREDIT-Loan/Default | 73 | **98.6%** | 94.5% | 72.6% |
| CPI (business status) | 219 | 86.3% | **87.2%** | 54.8% |
| GIS-Inflow | 72 | **97.2%** | 79.2% | 93.1% |
| GIS-Outflow | 62 | **98.4%** | 77.4% | 98.4% |
| GIS-Consumption | 280 | 98.2% | **99.6%** | 97.5% |
DLM-NL2JSON-4B leads in **8 of 10 categories** (tying Qwen3.5-35B on GIS-Outflow), with the largest gains over GPT-4o on ALP (+43.6%p on ALP-A, +48.0%p on ALP-B) and CREDIT-Income (+41.4%p).
## Important: This is a Service-Specific Model
> **This model is NOT a general-purpose NL-to-JSON converter.** It is trained exclusively for a fixed set of predefined schemas used in a specific production service. It will not generalize to arbitrary JSON schemas or different prompt formats.
To use this model correctly, you **must**:
1. Use the **exact system prompts** it was trained on (one per task category β€” see Usage section)
2. Include the corresponding **special token** (`<TASK_CSM>`, `<TASK_CREDIT>`, `<TASK_GIS>`, `<TASK_ALP>`, `<TASK_CPI>`) in the input
3. Expect output conforming only to the **predefined schemas** listed below
**Why publish a service-specific model?** This model serves as a reference implementation demonstrating that **task-specific LoRA fine-tuning on a 4B model can dramatically outperform GPT-4o and larger open-source models** on constrained structured output tasks. We believe the DLM (Domain-specific Language Model) approach β€” training small, cheap-to-serve models for specific service endpoints β€” is an underexplored but highly practical paradigm.
## Intended Use
This model converts **Korean natural language queries about public/economic data** into **structured JSON** conforming to its predefined schemas. It is designed for and deployed in the **Busan Metropolitan City Big Data Wave** analytics dashboard.
**Input**: Free-form Korean query + task-specific system prompt
**Output**: Single-line JSON with exact schema compliance:
```json
{"summary":"##2025λ…„ 5μ›” λΆ€μ‚°κ΄‘μ—­μ‹œ ν•΄μš΄λŒ€κ΅¬ μœ ν†΅/의료 μ†ŒλΉ„λΆ„μ„##","base_ym":202505,"region_nm":"λΆ€μ‚°κ΄‘μ—­μ‹œ ν•΄μš΄λŒ€κ΅¬","industry_select":{"3":[],"8":[]},"sex_cd":[1],"age_cd":[30],"category":2}
```
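Because the model emits exactly one JSON line, downstream code can parse and sanity-check it directly. A minimal sketch, using the CSM key set from the example above (the validation logic is illustrative, not the service's actual validator):

```python
import json

# Expected key order for the CSM schema, taken from the example output above.
CSM_KEYS = ["summary", "base_ym", "region_nm", "industry_select",
            "sex_cd", "age_cd", "category"]

def parse_csm(line: str) -> dict:
    """Parse one line of model output and verify the CSM key set and order."""
    obj = json.loads(line)
    assert list(obj) == CSM_KEYS, f"unexpected keys: {list(obj)}"
    assert obj["category"] == 2  # CSM outputs always use category 2
    assert isinstance(obj["base_ym"], int)
    return obj

out = ('{"summary":"##...##","base_ym":202505,"region_nm":"...",'
       '"industry_select":{"3":[],"8":[]},"sex_cd":[1],"age_cd":[30],"category":2}')
print(parse_csm(out)["base_ym"])  # 202505
```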
### Task Categories
| ID | Name | Schema Type |
|----|------|-------------|
| 0 | ALP-A | Population pattern (ptrn: residence/work/visit) |
| 1 | ALP-B | Population flow (flow_cd: inflow/outflow) |
| 2 | CSM | Consumer spending by industry |
| 3 | CREDIT-Income | Income statistics |
| 4 | CREDIT-Spending | Spending statistics |
| 5 | CREDIT-Loan | Loan/default statistics |
| 6 | CPI | Business/enterprise status |
| 9 | GIS-Inflow | Geographic inflow analysis |
| 10 | GIS-Outflow | Geographic outflow analysis |
| 11 | GIS-Consumption | Geographic consumption analysis |
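Each category maps onto one of the five special tokens, which must prefix the user query. A minimal dispatch sketch (the table is derived from the category and token lists above; function names are illustrative):

```python
# Map task-category IDs (from the table above) to the model's special tokens.
TASK_TOKEN = {
    0: "<TASK_ALP>", 1: "<TASK_ALP>",
    2: "<TASK_CSM>",
    3: "<TASK_CREDIT>", 4: "<TASK_CREDIT>", 5: "<TASK_CREDIT>",
    6: "<TASK_CPI>",
    9: "<TASK_GIS>", 10: "<TASK_GIS>", 11: "<TASK_GIS>",
}

def tag_query(task_id: int, query: str) -> str:
    """Prefix the raw Korean query with the task's special token."""
    return f"{TASK_TOKEN[task_id]} {query}"

print(tag_query(2, "2024년 1월 해운대구 소비분석"))
# <TASK_CSM> 2024년 1월 해운대구 소비분석
```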
## Training Details
| Item | Value |
|------|-------|
| Base model | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| Method | LoRA SFT β†’ merged full model |
| Training samples | 16,292 (Korean) |
| Validation samples | 2,034 |
| Special tokens | `<TASK_CSM>`, `<TASK_CREDIT>`, `<TASK_GIS>`, `<TASK_ALP>`, `<TASK_CPI>` |
| Max sequence length | 6,144 |
| Architecture | Qwen3ForCausalLM (36 layers, 2560 hidden, 32 heads) |
Training data consists of synthetically generated Korean natural language queries paired with structured JSON outputs, covering the Busan public data analytics domain.
## Evaluation Methodology
- **Metric**: Field-level exact match β€” each JSON key's value is compared against the gold label. The `summary` field is excluded from comparison.
- **Test set**: 2,041 samples, stratified by category
- **Gold label noise**: 64/700 CSM samples have `age_cd` capped at `[10..60]` instead of `[10..70]` for "all ages" queries, conflicting with the prompt specification. These affect all models equally and are excluded in the adjusted metric.
- **Train/Test overlap**: 16/2,041 input strings (0.78%) appear in both sets β€” retained for consistency.
- **All models** received identical system prompts per category.
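The field-level exact-match metric described above can be sketched as follows (a minimal reimplementation for illustration; the actual evaluation harness is not published):

```python
import json

def field_exact_match(pred_line: str, gold_line: str) -> bool:
    """Compare every JSON field except `summary`; a parse failure counts as a miss."""
    try:
        pred, gold = json.loads(pred_line), json.loads(gold_line)
    except json.JSONDecodeError:
        return False
    drop = {"summary"}  # excluded from comparison per the methodology above
    return ({k: v for k, v in pred.items() if k not in drop}
            == {k: v for k, v in gold.items() if k not in drop})

gold = '{"summary":"x","base_ym":202401,"age_cd":[20,30,40]}'
pred = '{"summary":"y","base_ym":202401,"age_cd":[20,30,40]}'
print(field_exact_match(pred, gold))  # True (summary differences are ignored)
```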
### Hardware
| Model | Serving | GPU |
|-------|---------|-----|
| DLM-NL2JSON-4B | TensorRT-LLM | NVIDIA L4 24GB |
| GPT-4o | OpenAI API | N/A |
| Qwen3.5-35B-A3B | vLLM | NVIDIA A6000 48GB |
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "dataslab/DLM-NL2JSON-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
# System prompt (example: CSM consumer spending schema — abbreviated for readability)
# Full prompts per category are available in the repository's eval/prompts.py
# (The Korean prompt below instructs: output exactly one line of JSON — no prose,
#  markdown, code blocks, emoji, or blank lines; follow the TASK_CSM key/type/order
#  strictly; apply the listed defaults; map industries via the major-category code table.)
system_prompt = """λ„ˆλŠ” λ°˜λ“œμ‹œ **JSON ν•œ 쀄**만 좜λ ₯ν•œλ‹€. μ„€λͺ…/ν…μŠ€νŠΈ/μ½”λ©˜νŠΈ/λ§ˆν¬λ‹€μš΄/μ½”λ“œλΈ”λ‘/이λͺ¨μ§€/곡백 쀄 κΈˆμ§€.
좜λ ₯은 항상 { 둜 μ‹œμž‘ν•˜κ³  } 둜 λλ‚œλ‹€.
[μŠ€ν‚€λ§ˆ: TASK_CSM] (ν‚€/νƒ€μž…/μˆœμ„œ μ—„μˆ˜)
{"summary":string,"base_ym":int,"region_nm":string,"industry_select":object,"sex_cd":[int],"age_cd":[int],"category":2}
[κΈ°λ³Έκ°’]
- base_ym: 0, region_nm: "λΆ€μ‚°κ΄‘μ—­μ‹œ"
- industry_select: μ—…μ’… λ―Έμ§€μ • μ‹œ μ „ λŒ€λΆ„λ₯˜ ν‚€λ₯Ό []둜 μ„€μ •
- sex_cd: [0,1], age_cd: [10,20,30,40,50,60,70]
- category: 항상 2
[λŒ€λΆ„λ₯˜ μ½”λ“œν‘œ] 1:μ—¬ν–‰/μˆ™λ°• 2:μ—¬κ°€/λ¬Έν™” 3:μœ ν†΅ 4:μŒμ‹/주점 5:μŒμ‹λ£Œν’ˆ
6:의λ₯˜/μž‘ν™” 7:미용 8:의료 9:ꡐ윑 10:μƒν™œ 11:μžλ™μ°¨"""
# Note: special token <TASK_CSM> must be included in the user message
user_query = "<TASK_CSM> 2024년 1월 해운대구 중동 의류/잡화랑 뷰티 쪽 남성 20~40대 위주로 알려줘"
# ("For Jan 2024, Jung-dong, Haeundae-gu: clothing/accessories and beauty,
#  mainly men in their 20s-40s")
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_query}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # disable Qwen3 thinking mode; reasoning tokens would truncate the JSON
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
# {"summary":"##2024λ…„ 1μ›” λΆ€μ‚°κ΄‘μ—­μ‹œ ν•΄μš΄λŒ€κ΅¬ 쀑동 의λ₯˜/μž‘ν™”/미용 μ†ŒλΉ„λΆ„μ„##","base_ym":202401,"region_nm":"λΆ€μ‚°κ΄‘μ—­μ‹œ ν•΄μš΄λŒ€κ΅¬ 쀑동","industry_select":{"6":[],"7":[]},"sex_cd":[0],"age_cd":[20,30,40],"category":2}
# Note: "λ·°ν‹°" β†’ mapped to 미용(code 7), "ν•΄μš΄λŒ€κ΅¬ 쀑동" β†’ normalized to "λΆ€μ‚°κ΄‘μ—­μ‹œ ν•΄μš΄λŒ€κ΅¬ 쀑동"
```
### vLLM / OpenAI-compatible serving
```python
from openai import OpenAI
client = OpenAI(base_url="http://your-server:8006/v1", api_key="token")
resp = client.chat.completions.create(
model="DLM-NL2JSON-4B",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": "<TASK_CSM> 2024λ…„ 1μ›” ν•΄μš΄λŒ€κ΅¬ 쀑동 의λ₯˜/μž‘ν™”λž‘ λ·°ν‹° μͺ½ 남성 20~40λŒ€ μœ„μ£Όλ‘œ μ•Œλ €μ€˜"}
],
max_tokens=512,
temperature=0.0,
extra_body={"chat_template_kwargs": {"enable_thinking": False}} # disable thinking mode
)
print(resp.choices[0].message.content)
```
> **Important**: When serving with vLLM/TensorRT-LLM, pass `chat_template_kwargs: {"enable_thinking": false}` to disable the Qwen3 thinking mode. Otherwise, reasoning tokens will consume the output budget and truncate the JSON.
## Known Limitations
1. **CPI category** (86.3%) is the weakest β€” complex industry classification codes (A~U with sub-codes) are harder to extract.
2. **CSM training data noise**: ~8% of CSM training samples have `age_cd` capped at 60 instead of 70 for "all ages" queries, introducing inconsistency.
3. **Domain-specific only**: This model is trained exclusively for the Busan public data schema extraction task. It has no general-purpose capabilities and should not be used as a general chatbot.
4. **Korean only**: All training data and prompts are in Korean.
## Citation
If you use this model, please cite:
```bibtex
@misc{dsl-dlm-nl2json-4b,
title={DLM-NL2JSON-4B: A Domain-Specific Language Model for Korean Public Data Schema Extraction},
author={Data Science Lab., Ltd.},
year={2026},
url={https://huggingface.co/dataslab/DLM-NL2JSON-4B}
}
```
## Contact
- **Organization**: Data Science Lab., Ltd.
- **Project**: Busan Metropolitan City Big Data Wave