
MiniCPM-V-4 Floor-Plan Element Detection (LoRA + vision tuning)

Fine-tune of MiniCPM-V-4 (4B) for structured extraction from CAD floor plans: walls, doors, windows, stairs, fixtures, and furniture as JSON with bounding boxes normalized to [0, 1000].

{"elements": [{"type": "double_door", "bbox": [623, 730, 710, 763]}, ...]}

Results (held-out FloorPlanCAD, detection F1 @ IoU 0.5, greedy decoding)

Model	JSON valid	Precision	Recall	F1
MiniCPM-V-4 zero-shot	32%	0.027	0.010	0.015
LoRA, LLM-only (300 steps, 800 samples)	16%	0.0	0.0	0.0
LoRA + vision tuning (800 steps, 3.3k samples)	40%	0.439	0.060	0.105

The decisive factor was unfreezing the vision tower. LLM-only LoRA learned the output format but stayed image-blind: it repeated dataset priors with high confidence regardless of the input image. With vision tuning, precision rose about 16x over zero-shot (0.027 to 0.439), which indicates that the model grounds boxes in the drawing. Recall remains the open weakness: element lists are long, and outputs sometimes truncate before the JSON closes.
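For readers who want to reproduce a comparable metric, here is a sketch of detection precision, recall, and F1 at an IoU threshold. The card does not specify the matching rule, so the greedy one-to-one matching and the requirement that element types agree are assumptions, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_prf(preds, golds, thresh=0.5):
    """Greedy one-to-one matching (assumed rule, not confirmed by the card).

    A prediction is a true positive if an unmatched gold element has the
    same type and IoU >= thresh; each gold element can match only once.
    """
    unmatched = list(golds)
    tp = 0
    for p in preds:
        best, best_iou = None, thresh
        for g in unmatched:
            if g["type"] != p["type"]:
                continue
            score = iou(p["bbox"], g["bbox"])
            if score >= best_iou:
                best, best_iou = g, score
        if best is not None:
            unmatched.remove(best)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


golds = [{"type": "door", "bbox": [0, 0, 100, 100]},
         {"type": "window", "bbox": [200, 200, 300, 300]}]
preds = [{"type": "door", "bbox": [0, 0, 100, 90]},         # IoU 0.9 with the door
         {"type": "window", "bbox": [500, 500, 600, 600]}]  # no overlap
print(detection_prf(preds, golds))  # -> (0.5, 0.5, 0.5)
```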

Training

Base: openbmb/MiniCPM-V-4, trained with the official MiniCPM-V finetune harness (llm_type ChatML). Note: the harness's qwen2 target-span detection needs a patch for V-4's tokenizer, because it locates spans by token-id lookup of '<|im_start|>'/'assistant', and those token ids only exist in Qwen2's vocab.
LoRA on LLM attention projections (q/k/v/o) + full vision-tower tuning
3,281 train / held-out eval from FloorPlanCAD (FiftyOne detections converted to conversation JSON)
800 steps, effective batch 8, lr 1e-5 cosine, bf16, single NVIDIA L4 (Modal), ~3.5 h
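The data conversion mentioned above can be sketched as follows. FiftyOne stores each detection box as [x, y, width, height] relative to the image in [0, 1], so it has to be turned into corner coordinates on the [0, 1000] grid. The record layout (an `id`, an `image` path, and a `conversations` list with an `<image>` placeholder) is an assumption about what the finetune harness expects and should be checked against the harness version in use. `PROMPT` and `to_record` are hypothetical names, and plain dicts stand in for FiftyOne objects to keep the sketch self-contained.

```python
import json

# Hypothetical instruction text; the actual training prompt is not given in this card.
PROMPT = "Detect the architectural elements in this floor plan. Return only JSON."


def to_record(sample_id, image_path, detections):
    """Build one conversation record from FiftyOne-style detections.

    Each detection is assumed to be {"label": str, "bounding_box": [x, y, w, h]}
    with values relative to the image in [0, 1].
    """
    elements = []
    for det in detections:
        x, y, w, h = det["bounding_box"]
        # Top-left + size -> [x1, y1, x2, y2] on the integer [0, 1000] grid.
        bbox = [round(v * 1000) for v in (x, y, x + w, y + h)]
        elements.append({"type": det["label"], "bbox": bbox})
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            {"role": "user", "content": "<image>\n" + PROMPT},
            {"role": "assistant", "content": json.dumps({"elements": elements})},
        ],
    }


record = to_record("0", "plans/0.png",
                   [{"label": "door", "bounding_box": [0.1, 0.2, 0.3, 0.4]}])
print(record["conversations"][1]["content"])
```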

Usage

import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoModel, AutoTokenizer

base = "openbmb/MiniCPM-V-4"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModel.from_pretrained(base, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16, device_map="cuda")
# Attach the adapter, then merge it into the base weights for inference.
model = PeftModel.from_pretrained(model, "Barath/minicpmv4-floorplan-lora")
model = model.merge_and_unload().eval()

img = Image.open("floorplan.png").convert("RGB")
prompt = ('Detect the architectural elements in this floor plan (walls, doors, '
          'windows, stairs, fixtures, furniture). Return only JSON: '
          '{"elements": [{"type": str, "bbox": [x1, y1, x2, y2]}]} with integer '
          'coordinates normalized to [0, 1000].')
# Greedy decoding (sampling=False), as used for the reported results.
out = model.chat(msgs=[{"role": "user", "content": [img, prompt]}],
                 tokenizer=tokenizer, sampling=False,
                 max_new_tokens=1500, repetition_penalty=1.1)
print(out)

Limitations

Recall is low (0.06): dense plans with dozens of elements are only partially enumerated, and long outputs may truncate mid-JSON. Use a truncation-tolerant parser in production.
Trained on CAD-style monochrome drawings (FloorPlanCAD); photographed or hand-drawn plans are out of distribution.
Coordinates are model estimates, not measurements.
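One possible shape for the truncation-tolerant parser recommended above is sketched here. It first tries a normal parse and, if that fails, salvages every complete element object that appears before the cut-off. The function name `parse_elements` is hypothetical, and the sketch assumes the output follows the {"elements": [...]} layout shown earlier.

```python
import json


def parse_elements(text):
    """Return every complete element object in a possibly truncated reply."""
    # Fast path: the reply is valid JSON with the expected top-level key.
    try:
        return json.loads(text)["elements"]
    except (json.JSONDecodeError, KeyError, TypeError):
        pass

    decoder = json.JSONDecoder()
    start = text.find("[")  # opening bracket of the "elements" array
    pos = start + 1 if start != -1 else len(text)
    elements = []
    while True:
        pos = text.find("{", pos)
        if pos == -1:
            break
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            break  # object was cut off mid-way: keep what was recovered
        if isinstance(obj, dict) and "type" in obj and "bbox" in obj:
            elements.append(obj)
        pos = end
    return elements


truncated = ('{"elements": [{"type": "door", "bbox": [1, 2, 3, 4]}, '
             '{"type": "window", "bbox": [5, 6')
print(parse_elements(truncated))
# -> [{'type': 'door', 'bbox': [1, 2, 3, 4]}]
```

Stopping at the first undecodable object is a deliberate choice: in a truncated reply, everything after the cut is incomplete, so there is nothing further to recover.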

Trained for the Hugging Face Build Small Hackathon 2026.
