---
title: OmniParser v2.0 API
emoji: 🖼️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.0.0
app_file: app_launcher.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# OmniParser v2.0 API

This is a public API endpoint for Microsoft's OmniParser v2.0, which parses UI screenshots and returns structured data.

## Features

- Parses UI screenshots into structured JSON data
- Identifies interactive elements (buttons, menus, icons, etc.)
- Provides captions describing the functionality of each element
- Returns a visualization of detected elements
- Accessible via a simple REST API

## Enhancement Opportunities

The current implementation provides a solid foundation, but there are several opportunities for enhancement:

### Data Fusion

- **Current:** YOLO detection and VLM captioning run separately
- **Enhancement:** Implement a more integrated approach that combines YOLO, VLM, OCR, and SAM
- **Benefits:** More accurate detection, better context understanding, and more precise segmentation

### OCR Integration

- **Current:** OCR runs separately from YOLO detection
- **Enhancement:** Use OCR results to refine YOLO detections and merge overlapping text and UI elements
- **Benefits:** Better text recognition within UI elements and improved element classification
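One common way to merge overlapping OCR and detection boxes is an intersection-over-union (IoU) test. The sketch below is illustrative only; the box format (`(x1, y1, x2, y2)` tuples), the dictionary keys, and the 0.5 threshold are assumptions, not the project's actual code:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def attach_ocr_text(detections, ocr_boxes, threshold=0.5):
    """Copy OCR text onto any detection whose box overlaps an OCR box."""
    for det in detections:
        for ocr in ocr_boxes:
            if iou(det["box"], ocr["box"]) >= threshold:
                det["text"] = ocr["text"]
    return detections
```

The same IoU check can also drive deduplication, e.g. suppressing a YOLO box that is almost entirely covered by an OCR text region.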

### SAM Integration

- **Current:** No segmentation model is used
- **Enhancement:** Integrate SAM (Segment Anything Model) for precise segmentation of UI elements
- **Benefits:** Better handling of complex UI layouts and irregularly shaped elements

### Confidence Scoring

- **Current:** Simple confidence scores from individual models
- **Enhancement:** Combine confidence scores from multiple models and consider element context
- **Benefits:** More reliable confidence scores and better prioritization of elements
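A simple way to combine per-model scores is a weighted average; the helper below is a minimal sketch, with the model names and weights chosen purely for illustration:

```python
def combined_confidence(scores, weights=None):
    """Weighted average of per-model confidence scores.

    `scores` maps model name -> confidence in [0, 1]; `weights` maps
    model name -> relative weight (defaults to equal weighting).
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total
```

For example, weighting the detector twice as heavily as OCR, `combined_confidence({"yolo": 0.9, "ocr": 0.7}, {"yolo": 2.0, "ocr": 1.0})` pulls the combined score toward the detector's estimate.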

### Predictive Monitoring

- **Current:** No verification of detected elements
- **Enhancement:** Verify that detected elements make sense in the UI context
- **Benefits:** Identify missing or incorrectly detected elements and improve detection accuracy
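Verification can start with cheap sanity checks before any model-based reasoning. The sketch below assumes the element dictionaries from the Response Format section (normalized `[x1, y1, x2, y2]` coordinates) and is not part of the current implementation:

```python
def verify_elements(elements):
    """Flag detections that are implausible in a normalized UI layout."""
    issues = []
    for el in elements:
        x1, y1, x2, y2 = el["coordinates"]
        # Boxes must lie inside the unit square with positive area.
        if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
            issues.append((el["id"], "coordinates outside the unit square"))
        # An interactable element with no text or caption is suspicious.
        if el["is_interactable"] and not (el["text"] or el["caption"]):
            issues.append((el["id"], "interactable element with no label"))
    return issues
```

Flagged elements could then be re-run through the captioner or OCR pass rather than being returned as-is.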

## API Usage

You can use this API by sending a POST request with a file upload:

```python
import requests

# Replace with your actual API URL after deployment
OMNIPARSER_API_URL = "https://your-username-omniparser-api.hf.space/api/parse"

# Upload the screenshot and send the request
with open("screenshot.png", "rb") as f:
    response = requests.post(OMNIPARSER_API_URL, files={"image": f})
response.raise_for_status()

# Parse the JSON result
result = response.json()

# Access parsed elements
for element in result["elements"]:
    print(f"Element {element['id']}: {element['text']} - {element['caption']}")
    print(f"Coordinates: {element['coordinates']}")
    print(f"Interactable: {element['is_interactable']}")
    print(f"Confidence: {element['confidence']}")
    print("---")

# Access the visualization (base64-encoded image)
visualization_base64 = result["visualization"]
```
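To view the visualization, decode the base64 string and write it to disk. The helper below assumes the string is raw base64 (no `data:` URI prefix) encoding a PNG, as the response format suggests:

```python
import base64

def save_visualization(visualization_base64, path="visualization.png"):
    """Decode the API's base64 visualization string and save it as a file."""
    with open(path, "wb") as f:
        f.write(base64.b64decode(visualization_base64))
```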

## Response Format

The API returns a JSON object with the following structure:

```json
{
  "status": "success",
  "elements": [
    {
      "id": 0,
      "text": "Button 1",
      "caption": "Click to submit form",
      "coordinates": [0.1, 0.1, 0.3, 0.2],
      "is_interactable": true,
      "confidence": 0.95
    },
    {
      "id": 1,
      "text": "Menu",
      "caption": "Navigation menu",
      "coordinates": [0.4, 0.5, 0.6, 0.6],
      "is_interactable": true,
      "confidence": 0.87
    }
  ],
  "visualization": "base64_encoded_image_string"
}
```
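In client code it can help to convert the raw dictionaries into typed records. The dataclass below simply mirrors the schema shown above; it is a convenience sketch, not part of the API itself:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ParsedElement:
    """One detected UI element, mirroring the API's JSON schema."""
    id: int
    text: str
    caption: str
    coordinates: List[float]  # [x1, y1, x2, y2], normalized to 0-1
    is_interactable: bool
    confidence: float

def parse_elements(result):
    """Convert the API's JSON payload into typed records."""
    return [ParsedElement(**el) for el in result["elements"]]
```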

## Deployment

This API is deployed on Hugging Face Spaces using Gradio. The deployment is free and provides a public URL that you can use in your applications.

## Credits

This API uses Microsoft's OmniParser v2.0, a screen parsing tool for pure vision-based GUI agents. For more information, visit the OmniParser GitHub repository.

## License

Please note that the OmniParser models have specific licenses:

- The `icon_detect` model is under the AGPL license
- The `icon_caption` model is under the MIT license

Please refer to the LICENSE file in each model's folder in the original repository.