---
title: OmniParser v2.0 API
emoji: 🖼️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.0.0
app_file: app_launcher.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# OmniParser v2.0 API

This is a public API endpoint for Microsoft's OmniParser v2.0, which parses UI screenshots and returns structured data.

## Features

- Parses UI screenshots into structured JSON data
- Identifies interactive elements (buttons, menus, icons, etc.)
- Provides captions describing the functionality of each element
- Returns a visualization of the detected elements
- Accessible via a simple REST API

## Enhancement Opportunities

The current implementation provides a solid foundation, but there are several opportunities for enhancement:

### Data Fusion

- **Current**: YOLO for detection and a VLM for captioning are used separately
- **Enhancement**: Implement a more integrated approach that combines YOLO, VLM, OCR, and SAM
- **Benefits**: More accurate detection, better context understanding, and more precise segmentation

### OCR Integration

- **Current**: OCR is used separately from YOLO detection
- **Enhancement**: Use OCR results to refine YOLO detections and merge overlapping text and UI elements
- **Benefits**: Better text recognition in UI elements and improved element classification

### SAM Integration

- **Current**: No segmentation model is used
- **Enhancement**: Integrate SAM (Segment Anything Model) for precise segmentation of UI elements
- **Benefits**: Better handling of complex UI layouts and irregularly shaped elements

### Confidence Scoring

- **Current**: Simple confidence scores from individual models
- **Enhancement**: Combine confidence scores from multiple models and consider element context
- **Benefits**: More reliable confidence scores and better prioritization of elements

### Predictive Monitoring

- **Current**: No verification of detected elements
- **Enhancement**: Verify that detected elements make sense in the UI context
- **Benefits**: Identify missing or incorrectly detected elements and improved detection accuracy

## API Usage

You can use this API by sending a POST request with a file upload:

```python
import requests

# Replace with your actual API URL after deployment
OMNIPARSER_API_URL = "https://your-username-omniparser-api.hf.space/api/parse"

# Upload a screenshot and send the request
with open('screenshot.png', 'rb') as f:
    response = requests.post(OMNIPARSER_API_URL, files={'image': f})
response.raise_for_status()

# Get the JSON result
result = response.json()

# Access the parsed elements
elements = result["elements"]
for element in elements:
    print(f"Element {element['id']}: {element['text']} - {element['caption']}")
    print(f"Coordinates: {element['coordinates']}")
    print(f"Interactable: {element['is_interactable']}")
    print(f"Confidence: {element['confidence']}")
    print("---")

# Access the visualization (base64-encoded image)
visualization_base64 = result["visualization"]
```

## Response Format

The API returns a JSON object with the following structure:

```json
{
  "status": "success",
  "elements": [
    {
      "id": 0,
      "text": "Button 1",
      "caption": "Click to submit form",
      "coordinates": [0.1, 0.1, 0.3, 0.2],
      "is_interactable": true,
      "confidence": 0.95
    },
    {
      "id": 1,
      "text": "Menu",
      "caption": "Navigation menu",
      "coordinates": [0.4, 0.5, 0.6, 0.6],
      "is_interactable": true,
      "confidence": 0.87
    }
  ],
  "visualization": "base64_encoded_image_string"
}
```

## Deployment

This API is deployed on Hugging Face Spaces using Gradio. The deployment is free and provides a public URL that you can use in your applications.

## Credits

This API uses Microsoft's OmniParser v2.0, a screen parsing tool for pure vision-based GUI agents. For more information, visit the [OmniParser GitHub repository](https://github.com/microsoft/OmniParser).
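The `visualization` field described in the response format above is a base64-encoded image. A minimal sketch of decoding it and writing it to disk (the helper name and output path are illustrative, not part of the API):

```python
import base64

def save_visualization(result: dict, path: str = "visualization.png") -> int:
    """Decode the base64-encoded "visualization" field from an API
    response dict and write the image bytes to disk.

    Returns the number of bytes written.
    """
    image_bytes = base64.b64decode(result["visualization"])
    with open(path, "wb") as f:
        f.write(image_bytes)
    return len(image_bytes)
```

The same approach works for displaying the image in memory (e.g. wrapping `image_bytes` in `io.BytesIO` and opening it with Pillow) without touching the filesystem.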
## License

Please note that the OmniParser models have specific licenses:

- The icon_detect model is released under the AGPL license
- The icon_caption model is released under the MIT license

Please refer to the LICENSE file in each model's folder in the original repository.