MiniCPM-V-4 Floor-Plan Element Detection (LoRA + vision tuning)

Fine-tune of MiniCPM-V-4 (4B) for structured extraction from CAD floor plans: walls, doors, windows, stairs, fixtures, and furniture as JSON with bounding boxes normalized to [0, 1000].

{"elements": [{"type": "double_door", "bbox": [623, 730, 710, 763]}, ...]}
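The [0, 1000] grid is independent of image resolution, so boxes must be rescaled before drawing or cropping. A minimal sketch, assuming x is normalized by image width, y by image height, and the box is ordered [x1, y1, x2, y2] from the top-left corner; `bbox_to_pixels` is a hypothetical helper, not part of the model's API.

```python
def bbox_to_pixels(bbox, width, height, scale=1000):
    """Map [x1, y1, x2, y2] from the [0, scale] grid to pixel coordinates.

    Assumes x is normalized by image width and y by image height.
    """
    x1, y1, x2, y2 = bbox
    return [
        round(x1 * width / scale),
        round(y1 * height / scale),
        round(x2 * width / scale),
        round(y2 * height / scale),
    ]


# The example box above, on a hypothetical 2000x1000 px rendering.
print(bbox_to_pixels([623, 730, 710, 763], width=2000, height=1000))
# -> [1246, 730, 1420, 763]
```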

Results (held-out FloorPlanCAD, detection F1 @ IoU 0.5, greedy decoding)

Model                                             JSON valid   Precision   Recall   F1
MiniCPM-V-4 zero-shot                             32%          0.027       0.010    0.015
LoRA, LLM-only (300 steps, 800 samples)           16%          0.0         0.0      0.0
LoRA + vision tuning (800 steps, 3.3k samples)    40%          0.439       0.060    0.105

The decisive factor was unfreezing the vision tower. LLM-only LoRA learned the output format but stayed image-blind: it repeated dataset priors with high confidence instead of reading the drawing. With vision tuning, precision rose about 16x over zero-shot (0.027 to 0.439), which indicates that the model grounds its boxes in the drawing. Recall remains the open weakness: element lists are long, and outputs sometimes truncate before the JSON closes.
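For readers who want to reproduce a comparable metric, detection F1 at IoU 0.5 could be computed roughly as follows. This is a sketch under stated assumptions, not the evaluation script behind the table: it assumes a prediction counts as a true positive when it has the same type as a not-yet-matched ground-truth box and IoU >= 0.5, with greedy one-to-one matching. The function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_prf(preds, golds, iou_thresh=0.5):
    """Precision, recall, and F1 with greedy one-to-one matching.

    Assumption: a match requires the same "type" and IoU >= iou_thresh.
    """
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    for p in preds:
        best_j, best_iou = None, iou_thresh
        for j, g in enumerate(golds):
            if j in matched or g["type"] != p["type"]:
                continue
            score = iou(p["bbox"], g["bbox"])
            if score >= best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Whether counts are pooled over the whole eval set or averaged per image changes the numbers, so fix that choice before comparing against the table.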

Training

  • Base: openbmb/MiniCPM-V-4, trained with the official MiniCPM-V finetune harness (llm_type ChatML). Note: the harness's qwen2 target-span detection needs a patch for V-4's tokenizer, because it locates spans by looking up the token ids of '<|im_start|>' and 'assistant', and those ids only exist in Qwen2's vocab.
  • LoRA on LLM attention projections (q/k/v/o) + full vision-tower tuning
  • 3,281 training samples plus a held-out eval split, both from FloorPlanCAD (FiftyOne detections converted to conversation JSON)
  • 800 steps, effective batch 8, lr 1e-5 cosine, bf16, single NVIDIA L4 (Modal), ~3.5 h
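The conversion from FiftyOne detections to conversation JSON mentioned in the list above might look like the sketch below. It is not the author's script. It assumes detections arrive as plain dicts with FiftyOne's `label` and `bounding_box` fields (relative [x, y, width, height] in [0, 1]) and that the target is the `id` / `image` / `conversations` record layout used by the MiniCPM-V finetune data format. The function name and arguments are hypothetical.

```python
import json


def detections_to_conversation(sample_id, image_path, detections, prompt, scale=1000):
    """Build one training record from FiftyOne-style detections.

    Each detection is a dict with "label" and "bounding_box", where
    bounding_box is relative [x, y, width, height] in [0, 1].
    """
    elements = []
    for det in detections:
        x, y, w, h = det["bounding_box"]
        elements.append({
            "type": det["label"],
            # Convert top-left + size to corner format on the [0, scale] grid.
            "bbox": [round(x * scale), round(y * scale),
                     round((x + w) * scale), round((y + h) * scale)],
        })
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            {"role": "user", "content": "<image>\n" + prompt},
            {"role": "assistant", "content": json.dumps({"elements": elements})},
        ],
    }


record = detections_to_conversation(
    "0", "plans/0.png",
    [{"label": "door", "bounding_box": [0.25, 0.5, 0.125, 0.25]}],
    prompt="Detect the architectural elements in this floor plan.",
)
print(record["conversations"][1]["content"])
```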

Usage

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

# Load the base model and its tokenizer (the repo ships custom modeling code).
base = "openbmb/MiniCPM-V-4"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModel.from_pretrained(base, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16, device_map="cuda")

# Attach the adapter, then merge it into the base weights for inference.
model = PeftModel.from_pretrained(model, "Barath/minicpmv4-floorplan-lora")
model = model.merge_and_unload().eval()

# Build the request: one RGB image plus the extraction prompt.
img = Image.open("floorplan.png").convert("RGB")
prompt = ('Detect the architectural elements in this floor plan (walls, doors, '
          'windows, stairs, fixtures, furniture). Return only JSON: '
          '{"elements": [{"type": str, "bbox": [x1, y1, x2, y2]}]} with integer '
          'coordinates normalized to [0, 1000].')

# sampling=False disables sampling; max_new_tokens bounds long element lists.
out = model.chat(msgs=[{"role": "user", "content": [img, prompt]}],
                 tokenizer=tokenizer, sampling=False,
                 max_new_tokens=1500, repetition_penalty=1.1)
print(out)

Limitations

  • Recall is low (0.06): dense plans with dozens of elements are only partially enumerated, and long outputs may truncate mid-JSON. Use a truncation-tolerant parser in production.
  • Trained on CAD-style monochrome drawings (FloorPlanCAD); photographed or hand-drawn plans are out of distribution.
  • Coordinates are model estimates, not measurements.
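One way to build the truncation-tolerant parser suggested above is to try a normal parse first and, if that fails, salvage every complete element object that precedes the cut-off. The sketch below uses only the standard library. `parse_elements` and `_is_element` are illustrative names, and the validation rules (a string `type` and a four-number `bbox`) are assumptions based on the output schema shown earlier.

```python
import json


def _is_element(obj):
    """True if obj looks like {"type": str, "bbox": [x1, y1, x2, y2]}."""
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("type"), str)
        and isinstance(obj.get("bbox"), list)
        and len(obj["bbox"]) == 4
        and all(isinstance(v, (int, float)) for v in obj["bbox"])
    )


def parse_elements(text):
    """Return all well-formed elements, even if the JSON is cut off."""
    # Fast path: the output is complete, valid JSON.
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    if isinstance(data, dict) and isinstance(data.get("elements"), list):
        return [e for e in data["elements"] if _is_element(e)]

    # Fallback: decode element objects one by one after the first "[".
    decoder = json.JSONDecoder()
    elements = []
    start = text.find("[")
    pos = start + 1 if start != -1 else len(text)
    while True:
        pos = text.find("{", pos)
        if pos == -1:
            break
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            pos += 1  # skip a malformed or truncated object
            continue
        if _is_element(obj):
            elements.append(obj)
        pos = end
    return elements


truncated = ('{"elements": [{"type": "door", "bbox": [1, 2, 3, 4]}, '
             '{"type": "window", "bbox": [5, 6, 7, 8]}, {"type": "wall", "bbox": [9, 1')
print(parse_elements(truncated))  # keeps the two complete elements
```

Dropping the trailing partial object discards at most one detection per output, which is usually preferable to losing the whole response.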

Trained for the Hugging Face Build Small Hackathon 2026.
