---
license: apache-2.0
datasets:
- docling-project/screenparse
tags:
- object-detection
- yolo
- ui-understanding
- screen-parsing
- grounding
- web
- ultralytics
language:
- en
pipeline_tag: object-detection
library_name: ultralytics
---

# ScreenParser

**ScreenParser** is a YOLO-based UI element detector fine-tuned on [ScreenParse](https://huggingface.co/datasets/docling-project/screenparse), a large-scale dataset of web page screenshots with dense UI annotations across **55 UI element classes**. Given a screenshot, it detects and classifies visible UI components with bounding boxes and confidence scores.

## News

- **May 2026**: Updated `main` with the detector trained on ScreenParse v2, which contains 1,447,100 high-quality training screenshots, leaf-element annotations, and varied viewport resolutions. The original detector trained on ScreenParse v1 is retained on the `v1` branch.

- **Developed by**: IBM Research - ETH Zurich
- **Model type**: Object detection (YOLO11-L)
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Paper**: [ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing](https://arxiv.org/abs/2602.14276)
- **Code**: https://github.com/Saidgurbuz/screenparse
- **Dataset**: [docling-project/screenparse](https://huggingface.co/datasets/docling-project/screenparse)

## Model Summary

ScreenParser is a [YOLO11-Large](https://docs.ultralytics.com/models/yolo11/) model fine-tuned at 1280px resolution on ScreenParse v2. The v2 detector uses the filtered leaf-element annotations released with the current dataset `main` branch.

### Supported Classes (55)

Table, Column/Browser, Button, Utility Button, App Icon, Navigation Bar, Status Bar, Search Field, Toolbar, Tooltip, Video, Tab Bar, Side Bar, Slider, Picker, ContextMenu, DockMenu, EditMenu, Image, Scroll, Switch, File Icon, Chart, Window, Screen, List, List Item, PopUp Menu, Steppers, Toggles, Text Input, Rating Indicator, Checkbox, Radiobox, Select, Avatar, Badge, Alert, Progress bar, Bottom navigation, Breadcrumb, Page control, Link, Menu, Pagination, Tab, Search Bar, Date-Time picker, Calendar, Text, Heading, Code snippet, Carousel, Notification, Logo

## Usage

### Single Image Inference

```python
from ultralytics import YOLO

model = YOLO("docling-project/ScreenParser")
results = model.predict("screenshot.png", imgsz=1280, conf=0.10, iou=0.10)

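# Each result holds the detections for one input image.
for r in results:
    for box, cls_id, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        # Boxes are corner coordinates (x1, y1, x2, y2) in pixels.
        x1, y1, x2, y2 = box.tolist()
        label = model.names[int(cls_id)]
        # Printed as (x, y, width, height).
        print(f"{label:20s}  conf={conf:.2f}  bbox=({int(x1)}, {int(y1)}, {int(x2-x1)}, {int(y2-y1)})")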
```
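
The loop above prints each detection. If structured output is more convenient, the same fields could be collected into plain dictionaries. The following is a minimal sketch, not part of the ScreenParser API: `xyxy_to_xywh` and `to_records` are hypothetical helper names, and the inputs are assumed to be plain Python lists.

```python
from typing import Dict, List, Sequence, Tuple


def xyxy_to_xywh(box: Sequence[float]) -> Tuple[int, int, int, int]:
    """Convert (x1, y1, x2, y2) corners to (x, y, width, height) in whole pixels."""
    x1, y1, x2, y2 = box
    return int(x1), int(y1), int(x2 - x1), int(y2 - y1)


def to_records(boxes, class_ids, confidences, names, min_conf: float = 0.10) -> List[Dict]:
    """Build one dict per detection, dropping those below min_conf."""
    records = []
    for box, cls_id, conf in zip(boxes, class_ids, confidences):
        if conf < min_conf:
            continue
        records.append(
            {
                "label": names[int(cls_id)],
                "conf": round(float(conf), 4),
                "bbox_xywh": xyxy_to_xywh(box),
            }
        )
    # Highest-confidence detections first.
    records.sort(key=lambda rec: rec["conf"], reverse=True)
    return records
```

With an Ultralytics result `r`, the inputs would be `r.boxes.xyxy.tolist()`, `r.boxes.cls.tolist()`, `r.boxes.conf.tolist()`, and `model.names`.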

### Batch Inference

```python
import os
from ultralytics import YOLO

model = YOLO("docling-project/ScreenParser")
image_dir = "screenshots/"
images = sorted(
    os.path.join(image_dir, f)
    for f in os.listdir(image_dir)
    if f.lower().endswith((".png", ".jpg", ".jpeg"))
)

results = model.predict(images, imgsz=1280, conf=0.10, iou=0.10, batch=16)
```
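
Passing a very large list to `predict` returns every result at once, which may use a lot of memory. One possible workaround is to feed the paths in fixed-size chunks. The sketch below is only an illustration: `chunked` is a hypothetical helper and the chunk size is an arbitrary choice.

```python
from typing import Iterator, List, Sequence


def chunked(paths: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])
```

Each chunk can then be passed to `model.predict(chunk, imgsz=1280, conf=0.10, iou=0.10, batch=16)` and its results written out before the next chunk is processed. Ultralytics also accepts `stream=True` in `predict`, which returns a generator of results and may be the simpler option.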

### Save Visualizations

```python
from ultralytics import YOLO

model = YOLO("docling-project/ScreenParser")
results = model.predict("screenshot.png", imgsz=1280, conf=0.10, iou=0.10, save=True)
# By default, the annotated image is saved under runs/detect/predict/
# (repeated runs write to predict2/, predict3/, and so on).
```

## Training Data

The current `main` checkpoint was trained on [ScreenParse v2](https://huggingface.co/datasets/docling-project/screenparse), which provides 1,447,100 high-quality training screenshots and 25,575,213 UI element annotations. The dataset uses filtered leaf-element annotations to reduce noisy nested boxes and includes multiple viewport resolutions.

The original ScreenParser checkpoint trained on ScreenParse v1 remains available on the `v1` branch, for example by passing `revision="v1"` when downloading from the Hugging Face Hub.
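
One way to fetch the `v1` branch is `snapshot_download` from `huggingface_hub`. The sketch below rests on assumptions: this card does not state the weight filename, so the code simply searches the downloaded snapshot for `.pt` files, and `download_revision` and `find_weight_files` are hypothetical helper names.

```python
from pathlib import Path
from typing import List

REPO_ID = "docling-project/ScreenParser"


def download_revision(revision: str = "v1") -> str:
    """Download the given branch of the model repo and return the local folder."""
    # Imported lazily so find_weight_files works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, revision=revision)


def find_weight_files(local_dir: str) -> List[Path]:
    """Return all .pt files under local_dir (assumes weights use the .pt extension)."""
    return sorted(Path(local_dir).rglob("*.pt"))
```

A located file could then be loaded with `YOLO(str(path))`. If several `.pt` files are present, check the repository file listing to pick the right one.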

## Limitations

- Produces bounding boxes and element labels only; it does not produce text content for detected elements. Pair it with OCR or [ScreenVLM](https://huggingface.co/docling-project/ScreenVLM) when text extraction is needed.
- The model is trained on rendered web screenshots, so performance may vary on native desktop, mobile, or application screenshots outside the training distribution.
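
When pairing the detector with OCR, one possible approach is to crop the regions of text-bearing classes and pass only those crops to the OCR engine. The sketch below is illustrative: the `TEXT_BEARING` set is an assumed selection from the class list above, the padding value is arbitrary, and `padded_crop` and `ocr_regions` are hypothetical helpers.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# Assumed selection of classes likely to contain readable text.
TEXT_BEARING = {"Text", "Heading", "Link", "Button"}


def padded_crop(
    bbox_xyxy: Sequence[float], image_size: Tuple[int, int], pad: int = 4
) -> Tuple[int, int, int, int]:
    """Expand a box by `pad` pixels per side, clamped to the image bounds."""
    x1, y1, x2, y2 = bbox_xyxy
    width, height = image_size
    left = max(0, int(x1) - pad)
    top = max(0, int(y1) - pad)
    right = min(width, int(x2) + pad)
    bottom = min(height, int(y2) + pad)
    return left, top, right, bottom


def ocr_regions(
    detections: Iterable[Dict], image_size: Tuple[int, int], pad: int = 4
) -> List[Tuple[int, int, int, int]]:
    """Return padded crop boxes for detections whose label is text-bearing."""
    return [
        padded_crop(det["bbox_xyxy"], image_size, pad)
        for det in detections
        if det["label"] in TEXT_BEARING
    ]
```

Each returned tuple is in `(left, upper, right, lower)` order, which is the form Pillow's `Image.crop` expects. The resulting crops can be handed to whichever OCR engine is in use.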

## Citation

```bibtex
@misc{gurbuz2026movingsparsegroundingcomplete,
      title={ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision},
      author={A. Said Gurbuz and Sunghwan Hong and Ahmed Nassar and Marc Pollefeys and Peter Staar},
      year={2026},
      eprint={2602.14276},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.14276},
}
```