---
title: OmniParser v2.0 API
emoji: 🖼️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.0.0
app_file: app_launcher.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# OmniParser v2.0 API

This is a public API endpoint for Microsoft's OmniParser v2.0, which parses UI screenshots and returns structured data.

## Features

- Parses UI screenshots into structured JSON data
- Identifies interactive elements (buttons, menus, icons, etc.)
- Provides captions describing the functionality of each element
- Returns a visualization of detected elements
- Accessible via a simple REST API

## Enhancement Opportunities

The current implementation provides a solid foundation, but there are several opportunities for enhancement:

### Data Fusion

- **Current:** YOLO detection and VLM captioning run separately
- **Enhancement:** Implement a more integrated approach that combines YOLO, VLM, OCR, and SAM
- **Benefits:** More accurate detection, better context understanding, and more precise segmentation

### OCR Integration

- **Current:** OCR runs separately from YOLO detection
- **Enhancement:** Use OCR results to refine YOLO detections and merge overlapping text and UI elements
- **Benefits:** Better text recognition within UI elements and improved element classification
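One common way to merge overlapping OCR and detection boxes is an intersection-over-union (IoU) test. The sketch below is illustrative only; the box format (`(x1, y1, x2, y2)` tuples), the dictionary keys, and the 0.5 threshold are assumptions, not the project's actual code:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def attach_ocr_text(detections, ocr_boxes, threshold=0.5):
    """Copy OCR text onto any detection whose box overlaps an OCR box."""
    for det in detections:
        for ocr in ocr_boxes:
            if iou(det["box"], ocr["box"]) >= threshold:
                det["text"] = ocr["text"]
    return detections
```

The same IoU check can also drive deduplication, e.g. suppressing a YOLO box that is almost entirely covered by an OCR text region.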

### SAM Integration

- **Current:** No segmentation model is used
- **Enhancement:** Integrate SAM (Segment Anything Model) for precise segmentation of UI elements
- **Benefits:** Better handling of complex UI layouts and irregularly shaped elements

### Confidence Scoring

- **Current:** Simple confidence scores from individual models
- **Enhancement:** Combine confidence scores from multiple models and consider element context
- **Benefits:** More reliable confidence scores and better prioritization of elements
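A simple way to combine per-model scores is a weighted average; the helper below is a minimal sketch, with the model names and weights chosen purely for illustration:

```python
def combined_confidence(scores, weights=None):
    """Weighted average of per-model confidence scores.

    `scores` maps model name -> confidence in [0, 1]; `weights` maps
    model name -> relative weight (defaults to equal weighting).
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total
```

For example, weighting the detector twice as heavily as OCR, `combined_confidence({"yolo": 0.9, "ocr": 0.7}, {"yolo": 2.0, "ocr": 1.0})` pulls the combined score toward the detector's estimate.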

### Predictive Monitoring

- **Current:** No verification of detected elements
- **Enhancement:** Verify that detected elements make sense in the UI context
- **Benefits:** Identify missing or incorrectly detected elements and improve detection accuracy
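Verification can start with cheap sanity checks before any model-based reasoning. The sketch below assumes the element dictionaries from the Response Format section (normalized `[x1, y1, x2, y2]` coordinates) and is not part of the current implementation:

```python
def verify_elements(elements):
    """Flag detections that are implausible in a normalized UI layout."""
    issues = []
    for el in elements:
        x1, y1, x2, y2 = el["coordinates"]
        # Boxes must lie inside the unit square with positive area.
        if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
            issues.append((el["id"], "coordinates outside the unit square"))
        # An interactable element with no text or caption is suspicious.
        if el["is_interactable"] and not (el["text"] or el["caption"]):
            issues.append((el["id"], "interactable element with no label"))
    return issues
```

Flagged elements could then be re-run through the captioner or OCR pass rather than being returned as-is.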

## API Usage

You can use this API by sending a POST request with a file upload:

```python
import requests

# Replace with your actual API URL after deployment
OMNIPARSER_API_URL = "https://your-username-omniparser-api.hf.space/api/parse"

# Upload the screenshot and send the request
with open("screenshot.png", "rb") as f:
    response = requests.post(OMNIPARSER_API_URL, files={"image": f})
response.raise_for_status()

# Parse the JSON result
result = response.json()

# Access parsed elements
for element in result["elements"]:
    print(f"Element {element['id']}: {element['text']} - {element['caption']}")
    print(f"Coordinates: {element['coordinates']}")
    print(f"Interactable: {element['is_interactable']}")
    print(f"Confidence: {element['confidence']}")
    print("---")

# Access the visualization (base64-encoded image)
visualization_base64 = result["visualization"]
```
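To view the visualization, decode the base64 string and write it to disk. The helper below assumes the string is raw base64 (no `data:` URI prefix) encoding a PNG, as the response format suggests:

```python
import base64

def save_visualization(visualization_base64, path="visualization.png"):
    """Decode the API's base64 visualization string and save it as a file."""
    with open(path, "wb") as f:
        f.write(base64.b64decode(visualization_base64))
```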

## Response Format

The API returns a JSON object with the following structure:

```json
{
  "status": "success",
  "elements": [
    {
      "id": 0,
      "text": "Button 1",
      "caption": "Click to submit form",
      "coordinates": [0.1, 0.1, 0.3, 0.2],
      "is_interactable": true,
      "confidence": 0.95
    },
    {
      "id": 1,
      "text": "Menu",
      "caption": "Navigation menu",
      "coordinates": [0.4, 0.5, 0.6, 0.6],
      "is_interactable": true,
      "confidence": 0.87
    }
  ],
  "visualization": "base64_encoded_image_string"
}
```
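In client code it can help to convert the raw dictionaries into typed records. The dataclass below simply mirrors the schema shown above; it is a convenience sketch, not part of the API itself:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ParsedElement:
    """One detected UI element, mirroring the API's JSON schema."""
    id: int
    text: str
    caption: str
    coordinates: List[float]  # [x1, y1, x2, y2], normalized to 0-1
    is_interactable: bool
    confidence: float

def parse_elements(result):
    """Convert the API's JSON payload into typed records."""
    return [ParsedElement(**el) for el in result["elements"]]
```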

## Deployment

This API is deployed on Hugging Face Spaces using Gradio. The deployment is free and provides a public URL that you can use in your applications.

## Credits

This API uses Microsoft's OmniParser v2.0, a screen parsing tool for pure vision-based GUI agents. For more information, visit the OmniParser GitHub repository.

## License

Please note that the OmniParser models have specific licenses:

- The `icon_detect` model is under the AGPL license
- The `icon_caption` model is under the MIT license

Please refer to the LICENSE file in each model's folder in the original repository.