---
title: OmniParser v2.0 API
emoji: 🖼️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.0.0
app_file: app_launcher.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# OmniParser v2.0 API

This is a public API endpoint for Microsoft's OmniParser v2.0, which can parse UI screenshots and return structured data.

## Features

- Parses UI screenshots into structured JSON data
- Identifies interactive elements (buttons, menus, icons, etc.)
- Provides captions describing the functionality of each element
- Returns a visualization of the detected elements
- Accessible via a simple REST API

## Enhancement Opportunities

The current implementation provides a solid foundation, but there are several opportunities for enhancement:
### Data Fusion

- **Current**: YOLO for detection and a VLM for captioning are used separately
- **Enhancement**: Implement a more integrated approach that combines YOLO, VLM, OCR, and SAM
- **Benefits**: More accurate detection, better context understanding, and more precise segmentation

### OCR Integration

- **Current**: OCR is run separately from YOLO detection
- **Enhancement**: Use OCR results to refine YOLO detections and merge overlapping text and UI elements
- **Benefits**: Better text recognition in UI elements and improved element classification
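As a sketch of how OCR output could refine YOLO detections, the helper below attaches OCR text to the detection box it overlaps most. The box format (`[x1, y1, x2, y2]`, normalized) matches the API's `coordinates` field, but the function names and the 0.5 IoU threshold are illustrative assumptions, not part of OmniParser:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def merge_ocr_into_detections(detections, ocr_results, threshold=0.5):
    """Attach each OCR snippet to the YOLO detection it overlaps.

    detections: list of dicts with a "coordinates" box.
    ocr_results: list of (box, text) pairs from the OCR pass.
    """
    for det in detections:
        texts = [text for box, text in ocr_results
                 if iou(det["coordinates"], box) >= threshold]
        det["text"] = " ".join(texts)
    return detections
```

A production version would also need to resolve conflicts when one text box spans several detections, but the IoU-based assignment above captures the core idea.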
### SAM Integration

- **Current**: No segmentation model is used
- **Enhancement**: Integrate SAM (Segment Anything Model) for precise segmentation of UI elements
- **Benefits**: Better handling of complex UI layouts and irregularly shaped elements

### Confidence Scoring

- **Current**: Simple confidence scores from individual models
- **Enhancement**: Combine confidence scores from multiple models and consider element context
- **Benefits**: More reliable confidence scores and better prioritization of elements
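One simple way to combine per-model scores is a weighted average. The sketch below is illustrative only; the model names and the equal default weights are assumptions, not OmniParser's actual scoring:

```python
def fused_confidence(scores, weights=None):
    """Weighted average of per-model confidence scores.

    scores: maps a model name (e.g. "yolo", "ocr") to its confidence
    in [0, 1]. Names are hypothetical, not OmniParser's API.
    weights: optional per-model weights; defaults to equal weighting.
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total
```

Element context (e.g. boosting the score when OCR text and the VLM caption agree) could then be layered on top of this base fusion.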
### Predictive Monitoring

- **Current**: No verification of detected elements
- **Enhancement**: Verify that detected elements make sense in the UI context
- **Benefits**: Identifies missing or incorrectly detected elements and improves detection accuracy
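A minimal form of such verification is a set of geometric sanity checks run on each detection before it is returned. The heuristics below are illustrative assumptions (the coordinate format matches the API response, but the thresholds are arbitrary), not part of the current implementation:

```python
def verify_element(element):
    """Return a list of plausibility problems for one detection.

    Assumes "coordinates" are normalized [x1, y1, x2, y2], as in the
    API response format; the checks themselves are illustrative.
    """
    x1, y1, x2, y2 = element["coordinates"]
    problems = []
    if not all(0.0 <= v <= 1.0 for v in (x1, y1, x2, y2)):
        problems.append("coordinates outside the image")
    if x2 <= x1 or y2 <= y1:
        problems.append("degenerate box")
    elif (x2 - x1) * (y2 - y1) > 0.9:
        problems.append("box covers almost the whole screen")
    return problems
```

Elements that fail these checks could be dropped, re-scored, or flagged for a second detection pass.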
## API Usage

You can use this API by sending a POST request with a file upload:
```python
import requests

# Replace with your actual API URL after deployment
OMNIPARSER_API_URL = "https://your-username-omniparser-api.hf.space/api/parse"

# Upload the screenshot and send the request
with open('screenshot.png', 'rb') as f:
    response = requests.post(OMNIPARSER_API_URL, files={'image': f})

# Get the JSON result
result = response.json()

# Access the parsed elements
for element in result["elements"]:
    print(f"Element {element['id']}: {element['text']} - {element['caption']}")
    print(f"Coordinates: {element['coordinates']}")
    print(f"Interactable: {element['is_interactable']}")
    print(f"Confidence: {element['confidence']}")
    print("---")

# Access the visualization (base64-encoded image)
visualization_base64 = result["visualization"]
```
## Response Format

The API returns a JSON object with the following structure:
```json
{
  "status": "success",
  "elements": [
    {
      "id": 0,
      "text": "Button 1",
      "caption": "Click to submit form",
      "coordinates": [0.1, 0.1, 0.3, 0.2],
      "is_interactable": true,
      "confidence": 0.95
    },
    {
      "id": 1,
      "text": "Menu",
      "caption": "Navigation menu",
      "coordinates": [0.4, 0.5, 0.6, 0.6],
      "is_interactable": true,
      "confidence": 0.87
    }
  ],
  "visualization": "base64_encoded_image_string"
}
```
## Deployment

This API is deployed on Hugging Face Spaces using Gradio. The deployment is free and provides a public URL that you can use in your applications.

## Credits

This API uses Microsoft's OmniParser v2.0, a screen parsing tool for pure vision-based GUI agents. For more information, visit the [OmniParser GitHub repository](https://github.com/microsoft/OmniParser).
## License

Please note that the OmniParser models have specific licenses:

- The icon_detect model is under the AGPL license
- The icon_caption model is under the MIT license

Please refer to the LICENSE file in each model's folder in the original repository.