| --- |
| language: |
| - ko |
| license: apache-2.0 |
| tags: |
| - task-specific |
| - structured-prediction |
| - korean |
| - public-sector |
| - qwen3 |
| - domain-specific |
| - merge |
| base_model: Qwen/Qwen3-4B |
| datasets: [] |
| pipeline_tag: text-generation |
| model-index: |
| - name: DLM-NL2JSON-4B |
| results: |
| - task: |
| type: structured-prediction |
| name: Korean NL-to-JSON Schema Extraction |
| dataset: |
| type: custom |
| name: Busan Public Data Query Test Set |
| args: |
| num_samples: 2041 |
| metrics: |
| - type: exact_match |
| value: 94.4 |
| name: Exact Match Accuracy (raw) |
| - type: exact_match |
| value: 96.8 |
| name: Exact Match Accuracy (adjusted) |
| --- |
| |
| # DLM-NL2JSON-4B |
|
|
| **A 4B-parameter service-specific LLM that outperforms GPT-4o (+14%p) and Qwen3.5-35B (+22%p) on structured JSON extraction from Korean natural language queries.** |
|
|
| DLM (Domain-specific Language Model) is a series of task-specialized models by [Data Science Lab., Ltd.](https://huggingface.co/dataslab). This model is a LoRA-merged Qwen3-4B fine-tuned for structured JSON extraction in the Busan Metropolitan City public data analytics service. |
|
|
| ## Key Results |
|
|
| Evaluated on 2,041 test samples across 10 task categories (field-level exact match, summary excluded): |
|
|
| | Model | Params | Accuracy | Accuracy (adj*) | Avg Latency | |
| |-------|--------|----------|-----------------|-------------| |
| | **DLM-NL2JSON-4B** | **4B** | **94.4%** | **96.8%** | 2.59s | |
| | GPT-4o | ~200B+ | 80.5% | 82.5% | 1.58s | |
| | Qwen3.5-35B-A3B | 35B | 72.2% | 73.9% | 0.85s | |
| |
| *\*adj: 64 CSM samples with known gold label noise excluded (see Evaluation section)* |
|
|
| ### Per-Category Breakdown |
|
|
| | Category | N | DLM-NL2JSON-4B | GPT-4o | Qwen3.5-35B | |
| |----------|---|-------------|--------|-------------| |
| | ALP-A (population pattern) | 250 | **99.6%** | 56.0% | 47.6% | |
| | ALP-B (population flow) | 250 | **98.4%** | 50.4% | 46.8% | |
| | CSM (consumer spending) | 700 | **90.6%** | 90.1% | 86.1% | |
| | CREDIT-Income | 58 | **94.8%** | 53.4% | 34.5% | |
| | CREDIT-Spending | 77 | **97.4%** | 92.2% | 51.9% | |
| | CREDIT-Loan/Default | 73 | **98.6%** | 94.5% | 72.6% | |
| | CPI (business status) | 219 | 86.3% | **87.2%** | 54.8% | |
| | GIS-Inflow | 72 | **97.2%** | 79.2% | 93.1% | |
| | GIS-Outflow | 62 | **98.4%** | 77.4% | 98.4% | |
| | GIS-Consumption | 280 | 98.2% | **99.6%** | 97.5% | |
|
|
DLM-NL2JSON-4B wins **8 out of 10 categories**, with the largest gains on ALP (+43.6%p and +48.0%p vs GPT-4o) and CREDIT-Income (+41.4%p).
|
|
| ## Important: This is a Service-Specific Model |
|
|
| > **This model is NOT a general-purpose NL-to-JSON converter.** It is trained exclusively for a fixed set of predefined schemas used in a specific production service. It will not generalize to arbitrary JSON schemas or different prompt formats. |
|
|
| To use this model correctly, you **must**: |
1. Use the **exact system prompts** it was trained on (one per task category; see the Usage section)
| 2. Include the corresponding **special token** (`<TASK_CSM>`, `<TASK_CREDIT>`, `<TASK_GIS>`, `<TASK_ALP>`, `<TASK_CPI>`) in the input |
| 3. Expect output conforming only to the **predefined schemas** listed below |
|
|
**Why publish a service-specific model?** This model serves as a reference implementation demonstrating that **task-specific LoRA fine-tuning on a 4B model can dramatically outperform GPT-4o and larger open-source models** on constrained structured output tasks. We believe the DLM approach of training small, cheap-to-serve models for specific service endpoints is an underexplored but highly practical paradigm.
|
|
| ## Intended Use |
|
|
| This model converts **Korean natural language queries about public/economic data** into **structured JSON** conforming to its predefined schemas. It is designed for and deployed in the **Busan Metropolitan City Big Data Wave** analytics dashboard. |
|
|
| **Input**: Free-form Korean query + task-specific system prompt |
|
|
| **Output**: Single-line JSON with exact schema compliance: |
```json
{"summary":"##2025년 5월 부산광역시 해운대구 유통/의료 소비분석##","base_ym":202505,"region_nm":"부산광역시 해운대구","industry_select":{"3":[],"8":[]},"sex_cd":[1],"age_cd":[30],"category":2}
```
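Since the service consumes this output programmatically, a thin client-side check helps catch truncated or malformed generations. Below is a minimal sketch: the key set comes from the example above, but the type checks and the `validate_csm` helper are illustrative assumptions, not part of the released service.

```python
import json

# Required keys and expected Python types for the CSM schema shown above.
# Illustrative only; the production service may enforce stricter rules.
CSM_SCHEMA = {
    "summary": str,
    "base_ym": int,
    "region_nm": str,
    "industry_select": dict,
    "sex_cd": list,
    "age_cd": list,
    "category": int,
}

def validate_csm(line: str) -> dict:
    """Parse one line of model output and check it against the CSM schema."""
    obj = json.loads(line)
    missing = set(CSM_SCHEMA) - set(obj)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key, typ in CSM_SCHEMA.items():
        if not isinstance(obj[key], typ):
            raise TypeError(f"{key}: expected {typ.__name__}")
    if obj["category"] != 2:  # CSM outputs always carry category 2
        raise ValueError("category must be 2 for CSM")
    return obj

out = '{"summary":"##...##","base_ym":202505,"region_nm":"부산광역시 해운대구","industry_select":{"3":[],"8":[]},"sex_cd":[1],"age_cd":[30],"category":2}'
parsed = validate_csm(out)
print(parsed["base_ym"])  # 202505
```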
|
|
| ### Task Categories |
|
|
| | ID | Name | Schema Type | |
| |----|------|-------------| |
| | 0 | ALP-A | Population pattern (ptrn: residence/work/visit) | |
| | 1 | ALP-B | Population flow (flow_cd: inflow/outflow) | |
| | 2 | CSM | Consumer spending by industry | |
| | 3 | CREDIT-Income | Income statistics | |
| | 4 | CREDIT-Spending | Spending statistics | |
| | 5 | CREDIT-Loan | Loan/default statistics | |
| | 6 | CPI | Business/enterprise status | |
| | 9 | GIS-Inflow | Geographic inflow analysis | |
| | 10 | GIS-Outflow | Geographic outflow analysis | |
| | 11 | GIS-Consumption | Geographic consumption analysis | |
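Each request must carry the special token for its task group. The category-to-token grouping below is our reading of the five tokens listed under Training Details (e.g. all three CREDIT tasks sharing `<TASK_CREDIT>`), and the `build_user_message` helper is a hypothetical convenience, not part of the released code:

```python
# Hypothetical helper: map a task category name to its special token.
# The five tokens come from this card; the grouping is our assumption.
TASK_TOKEN = {
    "ALP-A": "<TASK_ALP>",
    "ALP-B": "<TASK_ALP>",
    "CSM": "<TASK_CSM>",
    "CREDIT-Income": "<TASK_CREDIT>",
    "CREDIT-Spending": "<TASK_CREDIT>",
    "CREDIT-Loan": "<TASK_CREDIT>",
    "CPI": "<TASK_CPI>",
    "GIS-Inflow": "<TASK_GIS>",
    "GIS-Outflow": "<TASK_GIS>",
    "GIS-Consumption": "<TASK_GIS>",
}

def build_user_message(category: str, query: str) -> str:
    """Prefix the raw Korean query with the category's special token."""
    return f"{TASK_TOKEN[category]} {query}"

print(build_user_message("CSM", "2024년 1월 해운대구 소비 알려줘"))
# <TASK_CSM> 2024년 1월 해운대구 소비 알려줘
```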
| |
| ## Training Details |
| |
| | Item | Value | |
| |------|-------| |
| | Base model | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) | |
| Method | LoRA SFT (adapters merged into the base model) |
| | Training samples | 16,292 (Korean) | |
| | Validation samples | 2,034 | |
| | Special tokens | `<TASK_CSM>`, `<TASK_CREDIT>`, `<TASK_GIS>`, `<TASK_ALP>`, `<TASK_CPI>` | |
| | Max sequence length | 6,144 | |
| | Architecture | Qwen3ForCausalLM (36 layers, 2560 hidden, 32 heads) | |
| |
| Training data consists of synthetically generated Korean natural language queries paired with structured JSON outputs, covering the Busan public data analytics domain. |
| |
| ## Evaluation Methodology |
| |
- **Metric**: field-level exact match. Each JSON key's value is compared against the gold label; the `summary` field is excluded from comparison.
| - **Test set**: 2,041 samples, stratified by category |
| - **Gold label noise**: 64/700 CSM samples have `age_cd` capped at `[10..60]` instead of `[10..70]` for "all ages" queries, conflicting with the prompt specification. These affect all models equally and are excluded in the adjusted metric. |
- **Train/Test overlap**: 16/2,041 input strings (0.78%) appear in both sets; they were retained for consistency.
| - **All models** received identical system prompts per category. |
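The metric above can be sketched as follows. Treating a sample as correct only when every non-`summary` field matches the gold value is our reading of the card; the helpers below are illustrative, not the actual evaluation code:

```python
import json

def field_exact_match(pred_line: str, gold_line: str) -> bool:
    """True if every gold key except 'summary' matches exactly in the prediction."""
    pred, gold = json.loads(pred_line), json.loads(gold_line)
    keys = set(gold) - {"summary"}
    return all(pred.get(k) == gold[k] for k in keys)

def accuracy(preds, golds):
    """Fraction of samples whose non-summary fields all match gold."""
    hits = sum(field_exact_match(p, g) for p, g in zip(preds, golds))
    return hits / len(golds)

gold = '{"summary":"ignored","base_ym":202401,"age_cd":[20,30,40]}'
pred_ok = '{"summary":"different text","base_ym":202401,"age_cd":[20,30,40]}'
pred_bad = '{"summary":"x","base_ym":202401,"age_cd":[20,30]}'
print(field_exact_match(pred_ok, gold), field_exact_match(pred_bad, gold))  # True False
```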
|
|
| ### Hardware |
|
|
| | Model | Serving | GPU | |
| |-------|---------|-----| |
| | DLM-NL2JSON-4B | TensorRT-LLM | NVIDIA L4 24GB | |
| | GPT-4o | OpenAI API | N/A | |
| | Qwen3.5-35B-A3B | vLLM | NVIDIA A6000 48GB | |
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoTokenizer, AutoModelForCausalLM |
| |
| model_id = "dataslab/DLM-NL2JSON-4B" |
| tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) |
| model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True) |
| |
# System prompt (example: CSM consumer spending schema; abbreviated for readability)
| # Full prompts per category are available in the repository's eval/prompts.py |
# The prompt instructs (translated): "Output exactly one line of JSON. No
# explanations/text/comments/markdown/code blocks/emoji/blank lines. Output
# always starts with { and ends with }." It then gives the schema, defaults,
# and the major-category code table.
system_prompt = """너는 반드시 **JSON 한 줄**만 출력한다. 설명/텍스트/코멘트/마크다운/코드블록/이모지/공백 줄 금지.
출력은 항상 { 로 시작하고 } 로 끝난다.

[스키마: TASK_CSM] (키/타입/예시 순서)
{"summary":string,"base_ym":int,"region_nm":string,"industry_select":object,"sex_cd":[int],"age_cd":[int],"category":2}

[기본값]
- base_ym: 0, region_nm: "부산광역시"
- industry_select: 업종 미지정 시 전 대분류 키를 []로 설정
- sex_cd: [0,1], age_cd: [10,20,30,40,50,60,70]
- category: 항상 2

[대분류 코드표] 1:여행/숙박 2:여가/문화 3:유통 4:음식/주점 5:음식료품
6:의류/잡화 7:미용 8:의료 9:교육 10:생활 11:자동차"""
| |
# Note: the special token <TASK_CSM> must be included in the user message.
# Query (translated): "For Jan 2024, Haeundae-gu Jung-dong, tell me about
# clothing/accessories and beauty, mainly men in their 20s-40s."
user_query = "<TASK_CSM> 2024년 1월 해운대구 중동 의류/잡화랑 뷰티 쪽 남성 20~40대 위주로 알려줘"
| |
| messages = [ |
| {"role": "system", "content": system_prompt}, |
| {"role": "user", "content": user_query} |
| ] |
| |
| text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
| inputs = tokenizer(text, return_tensors="pt").to(model.device) |
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding; temperature is ignored when do_sample=False
| print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)) |
# {"summary":"##2024년 1월 부산광역시 해운대구 중동 의류/잡화/미용 소비분석##","base_ym":202401,"region_nm":"부산광역시 해운대구 중동","industry_select":{"6":[],"7":[]},"sex_cd":[0],"age_cd":[20,30,40],"category":2}
# Note: "뷰티" (beauty) → mapped to 미용 (code 7); "해운대구 중동" → normalized to "부산광역시 해운대구 중동"
| ``` |
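The model is trained to emit exactly one line starting with `{` and ending with `}`. Even so, a serving client may want a defensive extraction step before `json.loads`; the helper below is our suggestion, not part of the released code:

```python
import json

def extract_json_line(text: str) -> dict:
    """Defensively pull the JSON object out of a generation.

    Takes the span between the first '{' and the last '}' so that stray
    whitespace or leftover special tokens around the JSON are tolerated.
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])

print(extract_json_line('  {"base_ym":202401,"category":2}\n')["base_ym"])  # 202401
```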
|
|
| ### vLLM / OpenAI-compatible serving |
|
|
| ```python |
| from openai import OpenAI |
| |
| client = OpenAI(base_url="http://your-server:8006/v1", api_key="token") |
| resp = client.chat.completions.create( |
| model="DLM-NL2JSON-4B", |
| messages=[ |
| {"role": "system", "content": system_prompt}, |
        {"role": "user", "content": "<TASK_CSM> 2024년 1월 해운대구 중동 의류/잡화랑 뷰티 쪽 남성 20~40대 위주로 알려줘"}
| ], |
| max_tokens=512, |
| temperature=0.0, |
| extra_body={"chat_template_kwargs": {"enable_thinking": False}} # disable thinking mode |
| ) |
| print(resp.choices[0].message.content) |
| ``` |
|
|
| > **Important**: When serving with vLLM/TensorRT-LLM, pass `chat_template_kwargs: {"enable_thinking": false}` to disable the Qwen3 thinking mode. Otherwise, reasoning tokens will consume the output budget and truncate the JSON. |
| |
| ## Known Limitations |
| |
1. **CPI category** (86.3%) is the weakest: complex industry classification codes (A~U with sub-codes) are harder to extract.
| 2. **CSM training data noise**: ~8% of CSM training samples have `age_cd` capped at 60 instead of 70 for "all ages" queries, introducing inconsistency. |
| 3. **Domain-specific only**: This model is trained exclusively for the Busan public data schema extraction task. It has no general-purpose capabilities and should not be used as a general chatbot. |
| 4. **Korean only**: All training data and prompts are in Korean. |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| @misc{dsl-dlm-nl2json-4b, |
| title={DLM-NL2JSON-4B: A Domain-Specific Language Model for Korean Public Data Schema Extraction}, |
| author={Data Science Lab., Ltd.}, |
| year={2026}, |
| url={https://huggingface.co/dataslab/DLM-NL2JSON-4B} |
| } |
| ``` |
|
|
| ## Contact |
|
|
| - **Organization**: Data Science Lab., Ltd. |
| - **Project**: Busan Metropolitan City Big Data Wave |
|
|