---
license: mit
base_model:
- microsoft/Florence-2-large
library_name: transformers
tags:
- GUI
- VLM
- Agent
- GUI-Grounding
---


# 🎯 GoClick-Large: Super Fast Lightweight GUI Grounding Expert


<div align="center">
  
[![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/ZJULiHongxin/GoClick)
[![Paper](https://img.shields.io/badge/Paper-GoClick-blue?logo=adobeacrobatreader)](https://arxiv.org/abs/2604.23941)
[![GoClickLarge](https://img.shields.io/badge/πŸ€—%20GoClickLarge-Model-yellow)](https://huggingface.co/HongxinLi/GoClick-Large)
[![GoClickBase](https://img.shields.io/badge/πŸ€—%20GoClickBase-Model-yellow)](https://huggingface.co/HongxinLi/GoClick-Base)
[![SFTData](https://img.shields.io/badge/πŸ€—%20SFT-Dataset-yellow)](https://huggingface.co/datasets/HongxinLi/GoClick_Coreset_3814k)
[![SFTZipData](https://img.shields.io/badge/πŸ€—%20SFTZip-SFTData-yellow)](https://huggingface.co/datasets/HongxinLi/GoClick_sft_data)

</div>


GoClick is a state-of-the-art two-stage framework for precise UI element grounding. Built on the Florence-2 architecture, it bridges the gap between high-level intent and low-level pixel coordinates by separating the Planning and Grounding tasks.

## πŸ—οΈ Agent Architecture Overview

1. Stage 1 (Planning): UI screenshot + high-level goal -> a functionality description of the target element.
2. Stage 2 (Grounding): UI screenshot + functionality description -> precise pixel coordinates.

Note: this model is the specialized Stage 2 Grounder, fine-tuned for precisely locating elements based on their described functionality.
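
The two stages compose into a simple pipeline. Here is a minimal sketch of that composition; `run_planner` and `run_grounder` are hypothetical placeholders standing in for the actual model calls shown in the Quick Start below, not part of this repository's API:

```python
def run_planner(screenshot: str, goal: str) -> str:
    """Stage 1 (hypothetical stub): map a high-level goal to a functionality description.

    A real planner would be a separate VLM call; here we just rephrase the goal.
    """
    return f"element that lets the user {goal}"


def run_grounder(screenshot: str, func_desc: str) -> tuple:
    """Stage 2 (hypothetical stub): map a functionality description to coordinates.

    The real grounder is this model; see the Quick Start usage example.
    """
    return (0, 0)


def go_click(screenshot: str, goal: str) -> tuple:
    func_desc = run_planner(screenshot, goal)   # Stage 1: Planning
    return run_grounder(screenshot, func_desc)  # Stage 2: Grounding


point = go_click("ui_screenshot.png", "submit the form")
```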

## πŸš€ Quick Start (Inference of The Model)

### Prerequisites

```bash
pip install transformers==4.45.0 timm
```

Note: newer Transformers releases may break model loading; if loading fails, pin `transformers` to the version above (4.45.0).

### Usage Example

```python
import re

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def postprocess(text: str, image_size: tuple[int, int]):
    """Decode the model's generation into a click point.

    Args:
        text: a single generated sample.
        image_size: the (width, height) of the corresponding image.
    """
    point_pattern = r"<loc_(\d+)>,<loc_(\d+)>"
    try:
        location = re.findall(point_pattern, text)[0]
        point = [int(loc) for loc in location]
    except Exception:
        point = (0, 0)
    return point


# Load model and processor
model = AutoModelForCausalLM.from_pretrained("HongxinLi/GoClick-Base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("HongxinLi/GoClick-Base", trust_remote_code=True)

# Load UI screenshot
image = Image.open("ui_screenshot.png")

goal_info = "..."  # fill in the functionality description, intent, or element description

# Stage 2: Grounding — choose the prompt template that matches your task:

# Functionality grounding (for the AutoGUI FuncPred benchmark)
prompt = f"Locate the element according to its detailed functionality description. {goal_info} (Output the center coordinates of the target)"

# Intent grounding (for RefExp, MOTIF, and VisualWebBench Action Grounding)
prompt = f"I want to {goal_info}. Please locate the target element I should interact with. (Output the center coordinates of the target)"

# Description grounding (for ScreenSpot/v2 and VisualWebBench Element Grounding)
prompt = f"Where is the {goal_info} element? (Output the center coordinates of the target)"

inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt",
    do_resize=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=128,
    use_cache=True,
)

text_output = processor.tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
point = postprocess(text_output, image.size)
```
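
The decoded `<loc_k>` values are bin indices, not raw pixels. Assuming they follow Florence-2's convention of quantizing coordinates into 1,000 normalized bins (an assumption worth verifying against the GoClick repository if your points land off-screen), a small helper can map the decoded point back to pixel coordinates:

```python
def to_pixels(point, image_size, num_bins=1000):
    """Map an (x, y) point in normalized bin coordinates to pixel coordinates.

    Assumes Florence-2-style <loc_k> tokens, where k is in [0, num_bins).
    """
    width, height = image_size
    x, y = point
    return (round(x / num_bins * width), round(y / num_bins * height))


# Example: bin (500, 250) on a 1920x1080 screenshot
to_pixels((500, 250), (1920, 1080))  # -> (960, 270)
```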

### πŸ“Š Benchmarks

Both GoClick variants achieve a strong tradeoff between GUI element grounding accuracy and inference latency:

| Model | Size | TTFT ↓ (ms) | TPOT ↓ (ms/token) | FuncPred (F; M, W) | ScreenSpot (B; M, W, D) | ScreenSpot-v2 (B; M, W, D) | MOTIF (I; M) | RefExp (I; M) | VWB EG (T; W) | VWB AG (I; W) |
|-------|------|-------------|-------------------|--------------------|-------------------------|---------------------------|--------------|---------------|---------------|---------------|
| GPT-4o | - | - | - | 9.8 | 17.8 | 20.4 | 30.5 | 21.8 | 5.6 | 6.8 |
| Qwen2VL-7B | 8B | 118.9 | 21.2 | 38.7 | 66.4 | 66.9 | 75.1 | 64.8 | 55.9 | 62.1 |
| CogAgent | 18B | 1253.2 | 208.8 | 29.3 | 47.4 | 49.2 | 46.7 | 35.0 | 55.7 | 59.2 |
| SeeClick | 10B | 160.4 | 184.4 | 19.8 | 53.4 | 54.0 | 11.1 | 58.1 | 39.2 | 27.2 |
| Ferret-UI | 8B | 152.5 | 22.9 | 1.2 | 7.1 | 7.8 | 15.9 | 5.5 | 3.9 | 1.9 |
| UGround | 7B | 1034.6 | 27.9 | 48.8 | 74.8 | 76.5 | 72.4 | 73.6 | 85.2 | 63.1 |
| OS-ATLAS-8B | 8B | 137.5 | 19.9 | 52.1 | 82.5 | 84.1 | 78.8 | 66.5 | 82.6 | 69.9 |
| Aguvis | 8B | 119.7 | 21.2 | 52.0 | 83.8 | 85.6 | 73.8 | 80.9 | 91.3 | 68.0 |
| Qwen2-VL | 2B | 58.8 | 16.4 | 7.1 | 17.9 | 18.6 | 28.8 | 29.2 | 17.9 | 17.5 |
| OS-ATLAS-4B | 4B | 137.3 | 31.4 | 44.6 | 66.8 | 68.7 | 75.4 | 77.1 | 47.7 | 58.3 |
| Ferret-UI | 3B | 69.5 | 9.8 | 1.3 | 2.1 | 1.9 | 5.5 | 1.1 | 0.7 | 1.0 |
| ShowUI | 2B | 79.7 | 14.7 | 39.9 | 76.1 | 77.4 | 72.3 | 58.4 | 64.2 | 55.3 |
| **GoClick-L (ours)** | 0.8B | 91.1 | 8.3 | **69.5** | **78.5** | **81.1** | **80.4** | **78.2** | **90.3** | **68.0** |
| **GoClick-B (ours)** | 0.2B | **37.7** | **4.1** | 64.4 | 74.1 | 75.2 | 76.8 | 71.9 | 90.3 | 61.2 |


## πŸ“ Citation
If you use GoClick in your research, please cite our paper:

```
@misc{li2026goclicklightweightelementgrounding,
      title={GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction}, 
      author={Hongxin Li and Yuntao Chen and Zhaoxiang Zhang},
      year={2026},
      eprint={2604.23941},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.23941}, 
}
```