Update ScreenParser for ScreenParse v2

	---
	license: apache-2.0
	datasets:
	- docling-project/screenparse
	tags:
	- object-detection
	- yolo
	- ui-understanding
	- screen-parsing
	- grounding
	- web
	- ultralytics
	language:
	- en
	pipeline_tag: object-detection
	library_name: ultralytics
	---

	# ScreenParser

	ScreenParser is a YOLO-based UI element detector fine-tuned on [ScreenParse](https://huggingface.co/datasets/docling-project/screenparse), a large-scale dataset of web page screenshots with dense UI annotations across 55 UI element classes. Given a screenshot, it detects and classifies visible UI components with bounding boxes and confidence scores.

	## News

	- May 2026: Updated `main` with the detector trained on ScreenParse v2, which contains 1,447,100 high-quality training screenshots, leaf-element annotations, and varied viewport resolutions. The original detector trained on ScreenParse v1 is retained on the `v1` branch.

	- Developed by: IBM Research - ETH Zurich
	- Model type: Object detection (YOLO11-L)
	- License: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
	- Paper: [ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision](https://arxiv.org/abs/2602.14276)
	- Code: https://github.com/Saidgurbuz/screenparse
	- Dataset: [docling-project/screenparse](https://huggingface.co/datasets/docling-project/screenparse)

	## Model Summary

	ScreenParser is a [YOLO11-Large](https://docs.ultralytics.com/models/yolo11/) model fine-tuned at 1280px resolution on ScreenParse v2. The v2 detector uses the filtered leaf-element annotations released with the current dataset `main` branch.
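To make the 1280px setting concrete, the sketch below estimates the input size a screenshot is resized to under a generic letterbox scheme: an aspect-preserving resize so the longer side fits `imgsz`, then padding up to a multiple of the stride. The helper name `letterbox_shape` and the stride of 32 are assumptions for illustration. The preprocessing Ultralytics actually applies may differ by version and by how images are batched.

```python
def letterbox_shape(width, height, imgsz=1280, stride=32):
    """Estimate the (width, height) of the network input for one screenshot.

    Assumption: aspect-preserving resize to fit inside imgsz x imgsz,
    followed by minimal padding to a multiple of `stride`.
    """
    # Scale factor that makes the longer side equal to imgsz.
    scale = min(imgsz / width, imgsz / height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Pad only as much as needed to reach a stride multiple.
    pad_w = (imgsz - new_w) % stride
    pad_h = (imgsz - new_h) % stride
    return scale, (new_w + pad_w, new_h + pad_h)


if __name__ == "__main__":
    scale, shape = letterbox_shape(1920, 1080)
    print(f"scale={scale:.4f} input_shape={shape}")
```

Under these assumptions a 1920x1080 screenshot is scaled by about 0.667, so UI elements smaller than a few pixels after scaling may be hard to detect.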

	### Supported Classes (55)

	Table, Column/Browser, Button, Utility Button, App Icon, Navigation Bar, Status Bar, Search Field, Toolbar, Tooltip, Video, Tab Bar, Side Bar, Slider, Picker, ContextMenu, DockMenu, EditMenu, Image, Scroll, Switch, File Icon, Chart, Window, Screen, List, List Item, PopUp Menu, Steppers, Toggles, Text Input, Rating Indicator, Checkbox, Radiobox, Select, Avatar, Badge, Alert, Progress bar, Bottom navigation, Breadcrumb, Page control, Link, Menu, Pagination, Tab, Search Bar, Date-Time picker, Calendar, Text, Heading, Code snippet, Carousel, Notification, Logo

	## Usage

	### Single Image Inference

	```python
	from ultralytics import YOLO

	model = YOLO("docling-project/ScreenParser")
	results = model.predict("screenshot.png", imgsz=1280, conf=0.10, iou=0.10)

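	for r in results:
	    # Each detection has corner coordinates (xyxy), a class id, and a confidence score.
	    for box, cls_id, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
	        x1, y1, x2, y2 = box.tolist()
	        label = model.names[int(cls_id)]
	        # Report the box as (x, y, width, height).
	        print(f"{label:20s} conf={conf:.2f} bbox=({int(x1)}, {int(y1)}, {int(x2-x1)}, {int(y2-y1)})")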
	```
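The `iou=0.10` argument above is the overlap threshold used by non-maximum suppression: when two candidate boxes overlap by more than this intersection-over-union, the lower-scoring one is dropped. The sketch below computes IoU for two boxes in xyxy format to show what a value near 0.10 looks like. `box_iou` is a hypothetical helper written for this illustration, not part of Ultralytics, and the boxes are made-up values.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 100x40 boxes that share a 20-pixel-wide strip.
    left_button = (0, 0, 100, 40)
    right_button = (80, 0, 180, 40)
    print(f"IoU = {box_iou(left_button, right_button):.3f}")
```

These two boxes share only a fifth of their width, yet their IoU is already slightly above 0.10, so a threshold this low keeps overlapping detections to a minimum. Raising `iou` would allow more overlapping boxes to survive.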

	### Batch Inference

	```python
	import os
	from ultralytics import YOLO

	model = YOLO("docling-project/ScreenParser")
	image_dir = "screenshots/"
	images = sorted(
	    os.path.join(image_dir, f)
	    for f in os.listdir(image_dir)
	    if f.lower().endswith((".png", ".jpg", ".jpeg"))
	)

	results = model.predict(images, imgsz=1280, conf=0.10, iou=0.10, batch=16)
	```
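The batch call returns one result object per image. One way to consume them is to flatten the detections into plain, JSON-serializable records. The sketch below uses a hypothetical helper, `to_records`, that takes plain Python lists so it can be tried without a model; the record field names and the sample values are assumptions for illustration.

```python
import json


def to_records(image_path, xyxy, class_ids, confidences, names):
    """Convert one image's detections into a list of plain dicts.

    `names` maps class ids to labels, as `model.names` does.
    Boxes are stored as [x, y, width, height] in integer pixels.
    """
    records = []
    for (x1, y1, x2, y2), cls_id, conf in zip(xyxy, class_ids, confidences):
        records.append({
            "image": image_path,
            "label": names[int(cls_id)],
            "confidence": round(float(conf), 4),
            "bbox_xywh": [int(x1), int(y1), int(x2 - x1), int(y2 - y1)],
        })
    return records


if __name__ == "__main__":
    # Made-up detection and class mapping, for illustration only.
    demo = to_records("a.png", [[10.5, 20.0, 110.5, 60.0]], [0.0], [0.87654], {0: "Button"})
    print(json.dumps(demo, indent=2))
```

With real predictions, each result `r` could be passed as `to_records(r.path, r.boxes.xyxy.tolist(), r.boxes.cls.tolist(), r.boxes.conf.tolist(), model.names)`.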

	### Save Visualizations

	```python
	from ultralytics import YOLO

	model = YOLO("docling-project/ScreenParser")
	results = model.predict("screenshot.png", imgsz=1280, conf=0.10, iou=0.10, save=True)
	# By default the annotated image is saved under runs/detect/predict/ (later runs use predict2/, predict3/, ...)
	```

	## Training Data

	The current `main` checkpoint was trained on [ScreenParse v2](https://huggingface.co/datasets/docling-project/screenparse), which provides 1,447,100 high-quality training screenshots and 25,575,213 UI element annotations. The dataset uses filtered leaf-element annotations to reduce noisy nested boxes and includes multiple viewport resolutions.

	The original ScreenParser checkpoint, trained on ScreenParse v1, remains available on the `v1` branch of this repository; pass `revision="v1"` when downloading it from the Hub.
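A minimal sketch of fetching the `v1` revision with `huggingface_hub.snapshot_download` and loading it with Ultralytics follows. The weight filename inside the repository is not stated here, so the sketch assumes the checkpoint is stored as a `.pt` file and picks the first one it finds. `find_checkpoints` and `load_revision` are hypothetical helper names.

```python
from pathlib import Path


def find_checkpoints(repo_dir):
    """Return all .pt files under a downloaded repo snapshot, sorted by path."""
    return sorted(Path(repo_dir).rglob("*.pt"))


def load_revision(repo_id="docling-project/ScreenParser", revision="v1"):
    """Download one revision of the model repo and load its first .pt checkpoint.

    Assumption: the repository stores its weights as a .pt file.
    """
    # Imported lazily so find_checkpoints works without these packages installed.
    from huggingface_hub import snapshot_download
    from ultralytics import YOLO

    local_dir = snapshot_download(repo_id=repo_id, revision=revision)
    checkpoints = find_checkpoints(local_dir)
    if not checkpoints:
        raise FileNotFoundError(f"no .pt checkpoint found in {local_dir}")
    return YOLO(str(checkpoints[0]))
```

Calling `load_revision()` requires network access. If the repository holds several `.pt` files, pass the intended path to `YOLO` explicitly instead of relying on the first match.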

	## Limitations

	- The model produces bounding boxes and element labels only; it does not extract the text content of detected elements. Pair it with OCR or [ScreenVLM](https://huggingface.co/docling-project/ScreenVLM) when text extraction is needed.
	- The model is trained on rendered web screenshots, so accuracy may degrade on native desktop, mobile, or application screenshots that fall outside the training distribution.
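When pairing the detector with OCR, each detected box is typically cropped from the screenshot first. The sketch below turns a float xyxy box into an integer crop rectangle with a small margin, clamped to the image bounds. `crop_box` and the 4-pixel padding are assumptions for illustration; the OCR engine itself is left open.

```python
import math


def crop_box(xyxy, image_size, pad=4):
    """Return an integer (left, top, right, bottom) crop around a detection.

    `image_size` is (width, height). The box is expanded by `pad` pixels
    on each side and clamped so it never leaves the image.
    """
    width, height = image_size
    x1, y1, x2, y2 = xyxy
    left = max(0, math.floor(x1) - pad)
    top = max(0, math.floor(y1) - pad)
    right = min(width, math.ceil(x2) + pad)
    bottom = min(height, math.ceil(y2) + pad)
    return left, top, right, bottom


if __name__ == "__main__":
    # Made-up detection near the corner of a 60x32 image.
    print(crop_box((2.3, 10.0, 50.7, 30.2), (60, 32)))
```

The returned tuple follows the (left, upper, right, lower) order expected by Pillow's `Image.crop`, so a region can be extracted with `image.crop(crop_box(box, image.size))` before being handed to an OCR engine.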

	## Citation

	```bibtex
	@misc{gurbuz2026movingsparsegroundingcomplete,
	      title={ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision},
	      author={A. Said Gurbuz and Sunghwan Hong and Ahmed Nassar and Marc Pollefeys and Peter Staar},
	      year={2026},
	      eprint={2602.14276},
	      archivePrefix={arXiv},
	      primaryClass={cs.CV},
	      url={https://arxiv.org/abs/2602.14276},
	}
	```