Instructions to use Barath/minicpmv4-floorplan-lora with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use Barath/minicpmv4-floorplan-lora with PEFT:
Task type is invalid.
- Notebooks
- Google Colab
- Kaggle
Configuration Parsing Warning:In adapter_config.json: "peft.task_type" must be a string
MiniCPM-V-4 Floor-Plan Element Detection (LoRA + vision tuning)
Fine-tune of MiniCPM-V-4 (4B) for structured extraction from CAD floor
plans: walls, doors, windows, stairs, fixtures, and furniture as JSON with
bounding boxes normalized to [0, 1000].
{"elements": [{"type": "double_door", "bbox": [623, 730, 710, 763]}, ...]}
Results (held-out FloorPlanCAD, detection F1 @ IoU 0.5, greedy decoding)
| JSON valid | Precision | Recall | F1 | |
|---|---|---|---|---|
| MiniCPM-V-4 zero-shot | 32% | 0.027 | 0.010 | 0.015 |
| LoRA, LLM-only (300 steps, 800 samples) | 16% | 0.0 | 0.0 | 0.0 |
| LoRA + vision tuning (800 steps, 3.3k samples) | 40% | 0.439 | 0.060 | 0.105 |
The decisive factor was unfreezing the vision tower: LLM-only LoRA learned the output format but stayed image-blind (high-confidence repetition of dataset priors). With vision tuning, precision rose 16x over zero-shot — the model genuinely grounds boxes in the drawing. Recall remains the open weakness (long element lists; outputs sometimes truncate before the JSON closes).
Training
- Base:
openbmb/MiniCPM-V-4, official MiniCPM-V finetune harness (llm_typeChatML; note: the harness's qwen2 target-span detection needs a patch for V-4's tokenizer — spans located by token-id lookup of'<|im_start|>'/'assistant'only exist in Qwen2's vocab) - LoRA on LLM attention projections (q/k/v/o) + full vision-tower tuning
- 3,281 train / held-out eval from FloorPlanCAD (FiftyOne detections converted to conversation JSON)
- 800 steps, effective batch 8, lr 1e-5 cosine, bf16, single NVIDIA L4 (Modal), ~3.5 h
Usage
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel
base = "openbmb/MiniCPM-V-4"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModel.from_pretrained(base, trust_remote_code=True,
torch_dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(model, "Barath/minicpmv4-floorplan-lora")
model = model.merge_and_unload().eval()
from PIL import Image
img = Image.open("floorplan.png").convert("RGB")
prompt = ('Detect the architectural elements in this floor plan (walls, doors, '
'windows, stairs, fixtures, furniture). Return only JSON: '
'{"elements": [{"type": str, "bbox": [x1, y1, x2, y2]}]} with integer '
'coordinates normalized to [0, 1000].')
out = model.chat(msgs=[{"role": "user", "content": [img, prompt]}],
tokenizer=tokenizer, sampling=False,
max_new_tokens=1500, repetition_penalty=1.1)
print(out)
Limitations
- Recall is low (0.06): dense plans with dozens of elements are only partially enumerated, and long outputs may truncate mid-JSON. Use a truncation-tolerant parser in production.
- Trained on CAD-style monochrome drawings (FloorPlanCAD); photographed or hand-drawn plans are out of distribution.
- Coordinates are model estimates, not measurements.
Trained for the Hugging Face Build Small Hackathon 2026.
- Downloads last month
- -
Model tree for Barath/minicpmv4-floorplan-lora
Base model
openbmb/MiniCPM-V-4