---
license: apache-2.0
datasets:
- docling-project/screenparse
tags:
- text-generation
- screen-parsing
- ui-understanding
- object-detection
- grounding
- web
- screentag
- docling
- granite
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
---

# ScreenVLM

**ScreenVLM** is a compact multimodal vision-language model for **complete screen parsing**: detecting, classifying, localizing, and transcribing UI elements on web page screenshots. Given an image, it produces a structured **ScreenTag** representation with bounding boxes, semantic labels across 55 UI element classes, and text content for visible elements.

## News

- **May 2026**: Updated `main` with the ScreenVLM checkpoint trained on ScreenParse v2. This release uses the v2 training data with more robust quality filtering, 1,447,100 high-quality screenshots, and varied viewport resolutions. The original ScreenVLM checkpoint trained on ScreenParse v1 is retained on the `v1` branch.

- **Developed by**: IBM Research Zurich - ETH Zurich
- **Model type**: Multi-modal model (image+text-to-text)
- **Language(s)**: English
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Paper**: [ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing](https://arxiv.org/abs/2602.14276)
- **Code**: https://github.com/Saidgurbuz/screenparse
- **Dataset**: [docling-project/screenparse](https://huggingface.co/datasets/docling-project/screenparse)

## Model Summary

ScreenVLM builds upon the [Idefics3](https://huggingface.co/docs/transformers/en/model_doc/idefics3) architecture with [siglip2-base-patch16-512](https://huggingface.co/google/siglip2-base-patch16-512) as the vision encoder and a Granite 165M LLM as the language backbone. The current `main` checkpoint was trained on ScreenParse v2 full-element screen parsing supervision across 55 semantic UI classes.

### Key Features

- **Complete screen parsing**: Detects all visible UI elements on a page, not only sparse grounding targets
- **55 UI element classes**: Buttons, links, inputs, navigation bars, menus, images, text, and more
- **ScreenTag output format**: Structured representation with semantic tags, location tokens, and text content
- **Compact size**: Single-file safetensors checkpoint suitable for fast inference

## Output Format

ScreenVLM generates output in **ScreenTag** format, where each UI element is wrapped in semantic tags with location tokens:

```html
<screentag>
<button><loc_10><loc_20><loc_50><loc_35>Submit</button>
<link><loc_100><loc_200><loc_180><loc_210>Learn more</link>
<navigation_bar><loc_0><loc_0><loc_500><loc_30>
<link><loc_10><loc_5><loc_60><loc_25>Home</link>
<link><loc_70><loc_5><loc_120><loc_25>About</link>
</navigation_bar>
</screentag>
```

Each `<loc_X>` token represents a coordinate in the normalized `[0, 500]` space. Four consecutive location tokens define `<left><top><right><bottom>` of the bounding box.
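
To illustrate how this format can be consumed, the following minimal sketch parses ScreenTag text into a nested tree and converts a normalized box to pixel coordinates. It assumes well-formed output in which every located element has a matching closing tag, as in the example above. `parse_screentag_tree` and `to_pixels` are hypothetical helper names used only for this illustration, not part of any released API.

```python
import re

NORM_SIZE = 500  # location tokens live in the normalized [0, 500] space

# One token is either: an opening tag followed by four location tokens,
# a bare opening tag (such as <screentag>), a closing tag, or plain text.
TOKEN = re.compile(
    r"<(?P<open>[A-Za-z][A-Za-z0-9_]*)>\s*"
    r"<loc_(?P<l>\d+)><loc_(?P<t>\d+)><loc_(?P<r>\d+)><loc_(?P<b>\d+)>"
    r"|<(?P<bare>[A-Za-z][A-Za-z0-9_]*)>"
    r"|</(?P<close>[A-Za-z][A-Za-z0-9_]*)>"
    r"|(?P<text>[^<]+)"
)


def parse_screentag_tree(text):
    """Return the top-level elements, each with nested `children`."""
    root = {"label": "root", "box": None, "text": "", "children": []}
    stack = [root]
    for m in TOKEN.finditer(text):
        if m.group("open"):
            node = {
                "label": m.group("open"),
                "box": tuple(int(m.group(k)) for k in "ltrb"),
                "text": "",
                "children": [],
            }
            stack[-1]["children"].append(node)
            stack.append(node)
        elif m.group("close"):
            # Pop only when the closer matches the innermost open element;
            # this ignores wrappers such as </screentag>.
            if len(stack) > 1 and stack[-1]["label"] == m.group("close"):
                stack.pop()
        elif m.group("text"):
            stack[-1]["text"] += m.group("text").strip()
    return root["children"]


def to_pixels(box, width, height):
    """Scale a normalized (left, top, right, bottom) box to pixel units."""
    l, t, r, b = box
    return (l / NORM_SIZE * width, t / NORM_SIZE * height,
            r / NORM_SIZE * width, b / NORM_SIZE * height)


SAMPLE = """<screentag>
<button><loc_10><loc_20><loc_50><loc_35>Submit</button>
<link><loc_100><loc_200><loc_180><loc_210>Learn more</link>
<navigation_bar><loc_0><loc_0><loc_500><loc_30>
<link><loc_10><loc_5><loc_60><loc_25>Home</link>
<link><loc_70><loc_5><loc_120><loc_25>About</link>
</navigation_bar>
</screentag>"""

tree = parse_screentag_tree(SAMPLE)
for node in tree:
    print(node["label"], to_pixels(node["box"], 1280, 720), len(node["children"]))
```

Keeping the hierarchy is a design choice: a flat list is enough for detection-style evaluation, while the tree preserves which links belong to which container.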

## Usage

### Inference with Transformers

```python
import re

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_PATH = "docling-project/ScreenVLM"
NORM_SIZE = 500  # location tokens are normalized to the [0, 500] range

# Replace with the URL or local path of your own screenshot.
image = load_image("https://example.com/screenshot.png")

processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    # FlashAttention 2 requires the flash-attn package; SDPA is used on CPU.
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "sdpa",
).to(DEVICE)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Generate the screen representation for this UI:"},
        ],
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(**inputs, max_new_tokens=6192)

# Decode only the newly generated tokens and keep the <loc_*> special tokens.
prompt_length = inputs.input_ids.shape[1]
output = processor.batch_decode(
    generated_ids[:, prompt_length:],
    skip_special_tokens=False,
)[0].lstrip()


def parse_screentag(text, width, height):
    """Parse ScreenTag text into a flat list of elements with pixel boxes."""
    pattern = re.compile(
        r"<(?P<tag>[a-zA-Z][a-zA-Z0-9_]*)>"
        r"\s*<loc_(?P<l>\d+)><loc_(?P<t>\d+)><loc_(?P<r>\d+)><loc_(?P<b>\d+)>"
        r"(?P<text>[^<]*)"
    )
    elements = []
    for m in pattern.finditer(text):
        # Clamp every coordinate to the normalized range.
        l, t, r, b = [max(0, min(int(m.group(k)), NORM_SIZE)) for k in ("l", "t", "r", "b")]
        # Repair boxes whose corners were emitted in the wrong order.
        if r < l:
            l, r = r, l
        if b < t:
            t, b = b, t
        # Convert to pixel units as (x, y, width, height).
        x = l / NORM_SIZE * width
        y = t / NORM_SIZE * height
        w = (r - l) / NORM_SIZE * width
        h = (b - t) / NORM_SIZE * height
        elements.append({
            "label": m.group("tag"),
            "bbox": (x, y, w, h),
            "text": m.group("text").strip() or None,
        })
    return elements


elements = parse_screentag(output, *image.size)
for el in elements:
    x, y, w, h = (int(v) for v in el["bbox"])
    print(f"{el['label']:20s} bbox=({x:4d},{y:4d},{w:4d},{h:4d}) text={el['text']!r}")
```
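
To inspect a parse visually, the boxes can be drawn back onto the screenshot. The sketch below uses Pillow and assumes elements shaped like the dictionaries returned by `parse_screentag` above, with `bbox` given as `(x, y, width, height)` in pixels. `draw_elements` is a hypothetical helper, and the blank image with one hand-written element only stands in for a real screenshot and real model output.

```python
from PIL import Image, ImageDraw


def draw_elements(image, elements, color=(255, 0, 0)):
    """Return an annotated copy of `image`; the input image is left untouched."""
    canvas = image.copy().convert("RGB")
    draw = ImageDraw.Draw(canvas)
    for el in elements:
        x, y, w, h = (int(round(v)) for v in el["bbox"])
        # Box outline, then the class label just inside the top-left corner.
        draw.rectangle([x, y, x + w, y + h], outline=color, width=2)
        draw.text((x + 4, y + 4), el["label"], fill=color)
    return canvas


# Stand-in data for demonstration: a white canvas and one made-up element.
blank = Image.new("RGB", (200, 100), (255, 255, 255))
demo_elements = [{"label": "button", "bbox": (20.0, 10.0, 60.0, 30.0), "text": "Submit"}]

annotated = draw_elements(blank, demo_elements)
annotated.save("annotated.png")
```

With real data, `draw_elements(image, elements)` can be called directly on the `image` and `elements` variables from the inference example.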

### Batch Inference with vLLM

```python
import os
import re
import time

from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image

MODEL_PATH = "docling-project/ScreenVLM"
IMAGE_DIR = "screenshots/"
PROMPT_TEXT = "Generate the screen representation for this UI:"
NORM_SIZE = 500

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": PROMPT_TEXT},
        ],
    },
]

llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})
processor = AutoProcessor.from_pretrained(MODEL_PATH)

# Greedy decoding; keep special tokens so the <loc_*> tokens survive decoding.
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=6192,
    skip_special_tokens=False,
)

# The text prompt is identical for every image, so build it once.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

batched_inputs = []
image_sizes = []  # (file name, (width, height)) for mapping boxes back to pixels

for img_file in sorted(os.listdir(IMAGE_DIR)):
    if img_file.lower().endswith((".png", ".jpg", ".jpeg")):
        img_path = os.path.join(IMAGE_DIR, img_file)
        image = Image.open(img_path).convert("RGB")
        batched_inputs.append({"prompt": prompt, "multi_modal_data": {"image": image}})
        image_sizes.append((img_file, image.size))

start = time.time()
outputs = llm.generate(batched_inputs, sampling_params=sampling_params)
print(f"Total: {time.time() - start:.1f}s for {len(batched_inputs)} images")
```
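
The batch script above stops after generation. As a possible follow-up, this sketch pairs each generated string with its file name and image size and serializes the result as JSON Lines. It relies on `llm.generate` returning results in input order and on the generated text being available as `outputs[0].text` on each result. `collect_results` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic the shape of vLLM results so the example runs without a GPU.

```python
import json
from types import SimpleNamespace


def collect_results(image_sizes, outputs):
    """Pair each generated ScreenTag string with its source file and image size."""
    records = []
    for (file_name, (width, height)), out in zip(image_sizes, outputs):
        records.append({
            "file": file_name,
            "width": width,
            "height": height,
            # Text of the first (and here only) sample for this request.
            "screentag": out.outputs[0].text.strip(),
        })
    return records


# Stand-in objects for demonstration only; real runs pass the vLLM outputs.
demo_sizes = [("home.png", (1280, 720))]
demo_outputs = [SimpleNamespace(outputs=[SimpleNamespace(
    text=" <screentag><button><loc_10><loc_20><loc_50><loc_35>Submit</button></screentag>"
)])]

records = collect_results(demo_sizes, demo_outputs)
jsonl = "\n".join(json.dumps(r) for r in records)  # one JSON object per line
print(jsonl)
```

Storing the raw ScreenTag string next to the image size keeps the conversion to pixel boxes as a separate, repeatable step.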

## Training

ScreenVLM was trained using the [nanoVLM](https://github.com/huggingface/nanoVLM) framework with 16 NVIDIA H100 GPUs.

**Training data**: ScreenParse v2 full-element annotations for complete screen parsing. The v2 training data contains 1,447,100 high-quality web page screenshots across varied viewport resolutions with dense UI element supervision, including bounding boxes, semantic labels, text content, interactability flags, and reading order.

The original ScreenVLM checkpoint trained on ScreenParse v1 remains available with `revision="v1"`.
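
For reference, selecting a branch might look like the sketch below, which passes the standard `revision` argument of `from_pretrained` for both the processor and the model. `load_checkpoint` is a hypothetical wrapper written for this example, and the import sits inside the function so that merely defining the helper does not load Transformers.

```python
MODEL_PATH = "docling-project/ScreenVLM"


def load_checkpoint(revision="main"):
    """Load the processor and model from a given branch of the model repository."""
    from transformers import AutoModelForVision2Seq, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_PATH, revision=revision)
    model = AutoModelForVision2Seq.from_pretrained(MODEL_PATH, revision=revision)
    return processor, model


# Uncomment to fetch the ScreenParse v1 checkpoint (downloads weights on first use):
# processor, model = load_checkpoint("v1")
```

Loading the processor from the same revision as the model is a cautious choice in case tokenizer or preprocessing files differ between branches.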

## Limitations

- Optimized for **web page screenshots**; performance on mobile or desktop application UIs may vary
- May struggle with very dense or highly dynamic UIs, such as complex dashboards with hundreds of elements
- Produces structured screen parses, but downstream applications should still validate coordinates and text before using them for high-stakes automation
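
As one way to act on the last point, a downstream application could run simple sanity checks before trusting a parse. The sketch below assumes elements in the `{"label", "bbox", "text"}` shape used in the Usage section, with `bbox` as `(x, y, width, height)` in pixels. `validate_elements` and the `min_size` threshold are illustrative assumptions, not recommended values from the authors.

```python
def validate_elements(elements, width, height, min_size=2.0):
    """Split parsed elements into (kept, rejected) using basic geometry checks."""
    kept, rejected = [], []
    for el in elements:
        x, y, w, h = el["bbox"]
        # The box must lie inside the image, in case coordinates were not
        # clamped upstream, and must not be degenerate.
        inside = x >= 0 and y >= 0 and x + w <= width and y + h <= height
        big_enough = w >= min_size and h >= min_size
        (kept if inside and big_enough else rejected).append(el)
    return kept, rejected


# Hand-written elements for demonstration, on an assumed 1280x720 screenshot.
demo = [
    {"label": "button", "bbox": (25.6, 28.8, 102.4, 21.6), "text": "Submit"},
    {"label": "link", "bbox": (1270.0, 700.0, 40.0, 30.0), "text": "Next"},  # overflows
    {"label": "image", "bbox": (100.0, 100.0, 0.0, 0.0), "text": None},      # zero area
]
kept, rejected = validate_elements(demo, 1280, 720)
print([el["label"] for el in kept], [el["label"] for el in rejected])
```

Geometry checks of this kind catch only obviously broken boxes; verifying transcribed text still needs an application-specific comparison.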

## Citation

```bibtex
@misc{gurbuz2026movingsparsegroundingcomplete,
  title={ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision},
  author={A. Said Gurbuz and Sunghwan Hong and Ahmed Nassar and Marc Pollefeys and Peter Staar},
  year={2026},
  eprint={2602.14276},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.14276},
}
```