---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- GUI-Libra/GUI-Libra-81K-RL
- GUI-Libra/GUI-Libra-81K-SFT
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- VLM
- GUI
- agent
---

# GUI-Libra-3B

[**Project Page**](https://gui-libra.github.io) | [**Paper**](https://huggingface.co/papers/2602.22190) | [**GitHub**](https://github.com/GUI-Libra/GUI-Libra)

GUI-Libra is a post-training framework that turns open-source VLMs into strong native GUI agents: models that see a screenshot, think step-by-step, and output an executable action, all within a single forward pass.

This model is fine-tuned from [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) using action-aware SFT and conservative reinforcement learning (GRPO). It addresses challenges such as action-grounding alignment and partial verifiability in GUI navigation tasks.
# Usage

## 1) Start an OpenAI-compatible vLLM server

```bash
pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-3B --port 8000 --api-key token-abc123
```

* Endpoint: `http://localhost:8000/v1`
* The client's `api_key` must match the server's `--api-key` (here, `token-abc123`).
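Before sending requests, it can help to confirm the server is reachable. Below is a minimal readiness check using only the Python standard library; the URL and key are the ones assumed from the `vllm serve` command above, so adjust them if you changed the port or key:

```python
import urllib.request
import urllib.error

def server_ready(base_url: str, api_key: str, timeout: float = 3.0) -> bool:
    """Return True if an OpenAI-compatible /models endpoint answers with 200."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx HTTP status
        return False

print(server_ready("http://localhost:8000/v1", "token-abc123"))
```

If this prints `False`, check that the server finished loading the model before moving on to step 2.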
## 2) Minimal Python example (prompt + image → request)

Install dependencies (`pillow` is used below to read the screenshot size):

```bash
pip install -U openai pillow
```

Create `minimal_infer.py`:
```python
import base64

from PIL import Image  # pillow: used to read the screenshot dimensions
from openai import OpenAI

MODEL = "GUI-Libra/GUI-Libra-3B"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

def b64_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# 1) Your screenshot path
IMG_PATH = "screen.png"
img_b64 = b64_image(IMG_PATH)
img_size = Image.open(IMG_PATH).size  # (width, height)

system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the following list:
action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
## Explanation: Tap or click a specific UI element and provide its coordinates

action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
## Explanation: Select an item from a list or dropdown menu

action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
## Explanation: Enter text into a specific input field or at the current focus if coordinate is None

action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
## Explanation: Press a specified key on the keyboard

action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None
## Explanation: Scroll a view or container in the specified direction
"""

# 2) Your prompt (instruction + desired output format)
task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.

Instruction: {}

Interaction History: {}
'''
img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)

query = query + '\n' + '''The response should be structured in the following format:
<think>Your step-by-step thought process here...</think>
<answer>
{
  "action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.",
  "action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
  "value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
  "point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
}
</answer>'''

resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
            {"type": "text", "text": query},
        ]},
    ],
    temperature=0.0,
    max_completion_tokens=1024,
)

print(resp.choices[0].message.content)
```
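The model replies in the `<think>…</think><answer>…</answer>` format requested above, so the action JSON has to be extracted before it can be executed. Below is a minimal parser sketch, assuming the model emits valid JSON inside `<answer>`; the `sample` string is an illustrative response in the documented format, not captured model output:

```python
import json
import re

def parse_action(response: str) -> dict:
    """Extract the JSON action dict from an <answer>...</answer> block."""
    m = re.search(r"<answer>\s*(\{.*?\})\s*</answer>", response, re.DOTALL)
    if m is None:
        raise ValueError("no <answer> block found in model output")
    return json.loads(m.group(1))

# Illustrative response in the documented format (not real model output)
sample = """<think>The search box is at the top of the page.</think>
<answer>
{
  "action_type": "Click",
  "action_target": "search box at the top of the page",
  "value": "None",
  "point_2d": [512, 64]
}
</answer>"""

action = parse_action(sample)
print(action["action_type"], action["point_2d"])  # → Click [512, 64]
```

From here, `action["action_type"]` and `action["point_2d"]` can be dispatched to whatever automation backend drives your GUI.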

## Citation

If you find GUI-Libra useful for your research, please cite:

```bibtex
@misc{yang2026guilibratrainingnativegui,
      title={GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL},
      author={Rui Yang and Qianhui Wu and Zhaoyang Wang and Hanyang Chen and Ke Yang and Hao Cheng and Huaxiu Yao and Baoling Peng and Huan Zhang and Jianfeng Gao and Tong Zhang},
      year={2026},
      eprint={2602.22190},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.22190},
}
```