---
license: mit
base_model:
- microsoft/Florence-2-large
library_name: transformers
tags:
- GUI
- VLM
- Agent
- GUI-Grounding
---
# 🎯 GoClick-Large: A Super-Fast, Lightweight GUI Grounding Expert
<div align="center">
[GitHub](https://github.com/ZJULiHongxin/GoClick)
[Paper](https://arxiv.org/abs/2604.23941)
[GoClick-Large](https://huggingface.co/HongxinLi/GoClick-Large)
[GoClick-Base](https://huggingface.co/HongxinLi/GoClick-Base)
[Coreset (3,814k)](https://huggingface.co/datasets/HongxinLi/GoClick_Coreset_3814k)
[SFT Data](https://huggingface.co/datasets/HongxinLi/GoClick_sft_data)
</div>
GoClick is a state-of-the-art two-stage framework for precise UI element grounding. Built on the Florence-2 architecture, it bridges the gap between high-level intent and low-level pixel coordinates by separating the Planning and Grounding tasks.
## 🏗️ Agent Architecture Overview
1. Stage 1 (Planning): UI screenshot + goal -> function description of the target element.
2. Stage 2 (Grounding): UI screenshot + function description -> precise pixel coordinates.

Note: This model is the specialized Stage 2 Grounder, fine-tuned for extreme precision in locating elements based on their described functionality.
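The two-stage flow above can be sketched as follows. The `planner` and `grounder` callables are placeholders for illustration only; this released checkpoint implements just the grounding step.

```python
def locate(planner, grounder, screenshot, goal):
    """Two-stage GUI grounding: plan first, then ground.

    `planner` and `grounder` are hypothetical callables standing in for
    the two stages; only the grounder corresponds to this checkpoint.
    """
    # Stage 1: turn the high-level goal into a functional description
    # of the target UI element.
    description = planner(screenshot, goal)
    # Stage 2: map the description to pixel coordinates on the screenshot.
    x, y = grounder(screenshot, description)
    return x, y
```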
## 🚀 Quick Start (Model Inference)
### Prerequisites
```bash
pip install transformers==4.45.0 timm
```
Note: Newer versions of Transformers may be incompatible with the custom model code. If model loading fails, downgrade to the pinned version above.
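A minimal helper for checking the installed version against the pin above; the `(4, 45)` upper bound mirrors the pinned install and is an assumption, so adjust it if a different version works for you.

```python
def transformers_version_ok(version: str, pin: tuple[int, int] = (4, 45)) -> bool:
    """Return True if `version` (e.g. "4.45.0") is at or below the pinned major.minor."""
    major, minor = (int(x) for x in version.split(".")[:2])
    return (major, minor) <= pin

# Check the running install with:
#   import transformers
#   transformers_version_ok(transformers.__version__)
print(transformers_version_ok("4.45.0"))  # True
```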
### Usage Example
```python
import re

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def postprocess(text: str) -> tuple[int, int]:
    """Decode the model's generation into a click point.

    Args:
        text: a single generated sample containing `<loc_x>,<loc_y>` tokens.
    """
    point_pattern = r"<loc_(\d+)>,<loc_(\d+)>"
    try:
        location = re.findall(point_pattern, text)[0]
        point = tuple(int(loc) for loc in location)
    except IndexError:  # no coordinates found in the generation
        point = (0, 0)
    return point


# Load model and processor
model = AutoModelForCausalLM.from_pretrained("HongxinLi/GoClick-Base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("HongxinLi/GoClick-Base", trust_remote_code=True)

# Load UI screenshot
image = Image.open("ui_screenshot.png")
goal_info = "search for GUI agents"  # replace with your target description or intent

# Stage 2: Grounding. Pick the prompt template that matches your task:
# Functionality Grounding (for the AutoGUI FuncPred benchmark)
prompt = f"Locate the element according to its detailed functionality description. {goal_info} (Output the center coordinates of the target)"
# Intent Grounding (for RefExp, MOTIF, and VisualWebBench Action Grounding)
prompt = f"I want to {goal_info}. Please locate the target element I should interact with. (Output the center coordinates of the target)"
# Description Grounding (for ScreenSpot/v2 and VisualWebBench Element Grounding)
prompt = f"Where is the {goal_info} element? (Output the center coordinates of the target)"

inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt",
    do_resize=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=256,  # coordinates need only a short generation
    use_cache=True,
)

text_output = processor.tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
point = postprocess(text_output)
```
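If the checkpoint follows the Florence-2 convention of quantizing coordinates into 1000 location bins per axis, the decoded point can be mapped back to pixels as sketched below. The bin count is an assumption; verify it against the released model before relying on this.

```python
def bins_to_pixels(point, image_size, num_bins=1000):
    """Map quantized <loc_*> indices to pixel coordinates.

    Assumes Florence-2-style quantization into `num_bins` bins per axis
    (an assumption -- check the model's config before relying on it).
    """
    w, h = image_size
    x, y = point
    return (round(x / num_bins * w), round(y / num_bins * h))


print(bins_to_pixels((500, 250), (1920, 1080)))  # (960, 270)
```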
## 📊 Benchmarks
Both GoClick variants achieve a strong tradeoff between GUI element grounding accuracy and inference latency (TTFT: time to first token; TPOT: time per output token):
| Model | Size | TTFT β (ms) | TPOT β (ms/token) | FuncPred (F; M, W) | ScreenSpot (B; M, W, D) | ScreenSpot-v2 (B; M, W, D) | MOTIF (I; M) | RefExp (I; M) | VWB EG (T; W) | VWB AG (I; W) |
|-------|------|-------------|-------------------|--------------------|-------------------------|---------------------------|--------------|---------------|---------------|---------------|
| GPT-4o | - | - | - | 9.8 | 17.8 | 20.4 | 30.5 | 21.8 | 5.6 | 6.8 |
| Qwen2VL-7B | 8B | 118.9 | 21.2 | 38.7 | 66.4 | 66.9 | 75.1 | 64.8 | 55.9 | 62.1 |
| CogAgent | 18B | 1253.2 | 208.8 | 29.3 | 47.4 | 49.2 | 46.7 | 35.0 | 55.7 | 59.2 |
| SeeClick | 10B | 160.4 | 184.4 | 19.8 | 53.4 | 54.0 | 11.1 | 58.1 | 39.2 | 27.2 |
| Ferret-UI | 8B | 152.5 | 22.9 | 1.2 | 7.1 | 7.8 | 15.9 | 5.5 | 3.9 | 1.9 |
| UGround | 7B | 1034.6 | 27.9 | 48.8 | 74.8 | 76.5 | 72.4 | 73.6 | 85.2 | 63.1 |
| OS-ATLAS-8B | 8B | 137.5 | 19.9 | 52.1 | 82.5 | 84.1 | 78.8 | 66.5 | 82.6 | 69.9 |
| Aguvis | 8B | 119.7 | 21.2 | 52.0 | 83.8 | 85.6 | 73.8 | 80.9 | 91.3 | 68.0 |
| Qwen2-VL | 2B | 58.8 | 16.4 | 7.1 | 17.9 | 18.6 | 28.8 | 29.2 | 17.9 | 17.5 |
| OS-ATLAS-4B | 4B | 137.3 | 31.4 | 44.6 | 66.8 | 68.7 | 75.4 | 77.1 | 47.7 | 58.3 |
| Ferret-UI | 3B | 69.5 | 9.8 | 1.3 | 2.1 | 1.9 | 5.5 | 1.1 | 0.7 | 1.0 |
| ShowUI | 2B | 79.7 | 14.7 | 39.9 | 76.1 | 77.4 | 72.3 | 58.4 | 64.2 | 55.3 |
| **GoClick-L (ours)** | 0.8B | 91.1 | 8.3 | **69.5** | **78.5** | **81.1** | **80.4** | **78.2** | **90.3** | **68.0** |
| **GoClick-B (ours)** | 0.2B | **37.7** | **4.1** | 64.4 | 74.1 | 75.2 | 76.8 | 71.9 | 90.3 | 61.2 |
## 📜 Citation
If you use GoClick in your research, please cite our paper:
```bibtex
@misc{li2026goclicklightweightelementgrounding,
title={GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction},
author={Hongxin Li and Yuntao Chen and Zhaoxiang Zhang},
year={2026},
eprint={2604.23941},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.23941},
}
```