	---
	base_model: unsloth/Qwen2.5-1.5B-Instruct
	library_name: peft
	license: apache-2.0
	language:
	- en
	tags:
	- unsloth
	- lora
	- json
	- extraction
	- structured-output
	- qwen2.5
	pipeline_tag: text-generation
	---

	# json-extract

	A fine-tuned Qwen2.5-1.5B-Instruct model with LoRA adapters for extracting structured JSON from natural language text.

	## What it does

	Give it unstructured text and a target JSON schema, and it returns a JSON object that follows the schema.

	Input:
	```
	Paid 500 to Ravi for lunch on Jan 5
	```

	Output:
	```json
	{
	  "amount": 500,
	  "person": "Ravi",
	  "date": "2025-01-05",
	  "note": "lunch"
	}
	```

	## How to use

	### With Unsloth (recommended)

	```python
	from unsloth import FastLanguageModel
	import json

	# Load the model from the Hub repo in 4-bit.
	model, tokenizer = FastLanguageModel.from_pretrained(
	    "suneeldk/json-extract",
	    load_in_4bit=True,
	)
	FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode

	def extract(text, schema):
	    # Build the prompt in the format documented under "Prompt format".
	    prompt = f"### Input: {text}\n### Schema: {json.dumps(schema)}\n### Output:"
	    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
	    outputs = model.generate(
	        **inputs,
	        max_new_tokens=512,
	        temperature=0.1,
	        do_sample=True,
	        pad_token_id=tokenizer.eos_token_id,
	        use_cache=False,
	    )
	    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
	    # The decoded text includes the prompt, so keep only what follows the marker.
	    output_part = result.split("### Output:")[-1].strip()
	    return json.loads(output_part)

	# Each field maps to a type hint; "|null" marks a field that may be absent from the text.
	schema = {
	    "amount": "number",
	    "person": "string|null",
	    "date": "ISO date|null",
	    "note": "string|null",
	}

	result = extract("Paid 500 to Ravi for lunch on Jan 5", schema)
	print(json.dumps(result, indent=2))
	```
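The `extract` function above passes everything after `### Output:` straight to `json.loads`, which raises if the model emits any text after the closing brace. A minimal sketch of a more tolerant parser follows. `parse_first_json` is a hypothetical helper, not part of this repository, and it uses only the standard library's `json.JSONDecoder.raw_decode`.

```python
import json

def parse_first_json(text):
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one.
            start = text.find("{", start + 1)
    return None
```

In `extract`, `return json.loads(output_part)` could then become `return parse_first_json(output_part)`, with the caller deciding how to handle a `None` result.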

	### With Transformers + PEFT

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	# Load the base model, then attach the LoRA adapter from this repo.
	base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
	model = PeftModel.from_pretrained(base_model, "suneeldk/json-extract")
	tokenizer = AutoTokenizer.from_pretrained("suneeldk/json-extract")
	```
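The block above stops after loading. A possible way to run an extraction with the resulting `model` and `tokenizer` is sketched below, reusing the prompt layout and sampling settings from the Unsloth example. `build_prompt` and `generate_json` are illustrative names, not functions shipped with this repository, and the sketch assumes the model already sits on the device you want to use.

```python
import json

def build_prompt(text, schema):
    """Format text and schema using the README's prompt layout."""
    return f"### Input: {text}\n### Schema: {json.dumps(schema)}\n### Output:"

def generate_json(model, tokenizer, text, schema, max_new_tokens=512):
    """Run one extraction with an already loaded model and tokenizer."""
    prompt = build_prompt(text, schema)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    decoded = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return json.loads(decoded.strip())
```

Calling `generate_json(model, tokenizer, "Paid 500 to Ravi for lunch on Jan 5", {"amount": "number"})` would then return a Python dict, provided the model's output parses as JSON.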

	## Training details

	| Parameter | Value |
	|---|---|
	| Base model | Qwen2.5-1.5B-Instruct |
	| Method | LoRA (r=16, alpha=16) |
	| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
	| Epochs | 3 |
	| Learning rate | 2e-4 |
	| Batch size | 4 (x4 gradient accumulation) |
	| Scheduler | Cosine |
	| Optimizer | AdamW 8-bit |
	| Precision | 4-bit quantized (QLoRA) |
	| Max sequence length | 2048 |
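
Two quantities follow from the table by simple arithmetic. The sketch below assumes the batch size of 4 is per device on a single GPU, and that the adapter uses the standard LoRA scaling of `alpha / r` (not the rank-stabilized variant). Neither assumption is stated in the table.

```python
# Values copied from the training table above.
lora_r = 16
lora_alpha = 16
per_device_batch_size = 4
gradient_accumulation_steps = 4

# Sequences contributing to each optimizer step on one device.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Standard LoRA multiplies the adapter update by alpha / r.
lora_scale = lora_alpha / lora_r

print(effective_batch_size)  # 16
print(lora_scale)            # 1.0
```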

	## Prompt format

	```
	### Input: <your text here>
	### Schema: <json schema>
	### Output:
	```

	The model will generate a JSON object matching the provided schema.
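
Because the schema is a prompt-level hint and not an enforced constraint, it can be worth checking the result before using it. The sketch below is one way to do that, assuming schemas are flat mappings whose values are pipe-separated type hints in the style of the example (`number`, `string|null`, `ISO date|null`). `check_against_schema` is a hypothetical helper, and hints it does not recognize are left unchecked.

```python
from datetime import date

def _matches(value, hint):
    """Check one value against a single type hint such as "number"."""
    hint = hint.strip()
    if hint == "null":
        return value is None
    if hint == "number":
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if hint == "string":
        return isinstance(value, str)
    if hint == "ISO date":
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)
            return True
        except ValueError:
            return False
    return True  # unknown hints are not checked

def check_against_schema(result, schema):
    """Return a list of human-readable problems; empty means no issues found."""
    problems = []
    for key, hint in schema.items():
        if key not in result:
            problems.append(f"missing key: {key}")
        elif not any(_matches(result[key], h) for h in hint.split("|")):
            problems.append(f"{key}: {result[key]!r} does not match {hint!r}")
    for key in result:
        if key not in schema:
            problems.append(f"unexpected key: {key}")
    return problems
```

An empty list means every schema key is present with a value that fits its hint, and no extra keys were returned.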

	## Limitations

	- Optimized for short-to-medium length text inputs
	- Works best with schemas similar to the training data
	- May not handle highly nested or complex JSON structures
	- English language only
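
Since nested schemas are listed as a weak spot, a caller may want to measure how deep a schema is before sending it. The sketch below is a rough illustration only: `schema_depth` is a hypothetical helper, and the README gives no depth at which quality degrades, so any cutoff would be your own choice.

```python
def schema_depth(schema):
    """Nesting depth of dicts and lists; a flat mapping of type hints is 1."""
    if isinstance(schema, dict):
        return 1 + max((schema_depth(v) for v in schema.values()), default=0)
    if isinstance(schema, list):
        return 1 + max((schema_depth(v) for v in schema), default=0)
    return 0  # a plain type hint such as "number"

flat = {"amount": "number", "person": "string|null"}
nested = {"payer": {"name": "string"}, "items": [{"price": "number"}]}

print(schema_depth(flat))    # 1
print(schema_depth(nested))  # 3
```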

	## License

	Apache 2.0