---
title: OmniParser v2.0 API
emoji: 🖼️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.0.0
app_file: app_launcher.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# OmniParser v2.0 API

This is a public API endpoint for Microsoft's OmniParser v2.0, which can parse UI screenshots and return structured data.

## Features

- Parses UI screenshots into structured JSON data
- Identifies interactive elements (buttons, menus, icons, etc.)
- Provides captions describing the functionality of each element
- Returns a visualization of the detected elements
- Accessible via a simple REST API

## Enhancement Opportunities

The current implementation provides a solid foundation, but there are several opportunities for enhancement:
### Data Fusion

- **Current**: YOLO for detection and a VLM for captioning are used separately
- **Enhancement**: Implement a more integrated approach that combines YOLO, VLM, OCR, and SAM
- **Benefits**: More accurate detection, better context understanding, and more precise segmentation

### OCR Integration

- **Current**: OCR is run separately from YOLO detection
- **Enhancement**: Use OCR results to refine YOLO detections and merge overlapping text and UI elements
- **Benefits**: Better text recognition in UI elements and improved element classification
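As a sketch of how OCR output could refine YOLO detections, the helper below attaches OCR text to the detection box it overlaps most. The box format (`[x1, y1, x2, y2]`, normalized) matches the API's `coordinates` field, but the function names and the 0.5 IoU threshold are illustrative assumptions, not part of OmniParser:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def merge_ocr_into_detections(detections, ocr_results, threshold=0.5):
    """Attach each OCR snippet to the YOLO detection it overlaps.

    detections: list of dicts with a "coordinates" box.
    ocr_results: list of (box, text) pairs from the OCR pass.
    """
    for det in detections:
        texts = [text for box, text in ocr_results
                 if iou(det["coordinates"], box) >= threshold]
        det["text"] = " ".join(texts)
    return detections
```

A production version would also need to resolve conflicts when one text box spans several detections, but the IoU-based assignment above captures the core idea.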
### SAM Integration

- **Current**: No segmentation model is used
- **Enhancement**: Integrate SAM (Segment Anything Model) for precise segmentation of UI elements
- **Benefits**: Better handling of complex UI layouts and irregularly shaped elements

### Confidence Scoring

- **Current**: Simple confidence scores from individual models
- **Enhancement**: Combine confidence scores from multiple models and consider element context
- **Benefits**: More reliable confidence scores and better prioritization of elements
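One simple way to combine per-model scores is a weighted average. The sketch below is illustrative only; the model names and the equal default weights are assumptions, not OmniParser's actual scoring:

```python
def fused_confidence(scores, weights=None):
    """Weighted average of per-model confidence scores.

    scores: maps a model name (e.g. "yolo", "ocr") to its confidence
    in [0, 1]. Names are hypothetical, not OmniParser's API.
    weights: optional per-model weights; defaults to equal weighting.
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total
```

Element context (e.g. boosting the score when OCR text and the VLM caption agree) could then be layered on top of this base fusion.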
### Predictive Monitoring

- **Current**: No verification of detected elements
- **Enhancement**: Verify that detected elements make sense in the UI context
- **Benefits**: Identifies missing or incorrectly detected elements and improves detection accuracy
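A minimal form of such verification is a set of geometric sanity checks run on each detection before it is returned. The heuristics below are illustrative assumptions (the coordinate format matches the API response, but the thresholds are arbitrary), not part of the current implementation:

```python
def verify_element(element):
    """Return a list of plausibility problems for one detection.

    Assumes "coordinates" are normalized [x1, y1, x2, y2], as in the
    API response format; the checks themselves are illustrative.
    """
    x1, y1, x2, y2 = element["coordinates"]
    problems = []
    if not all(0.0 <= v <= 1.0 for v in (x1, y1, x2, y2)):
        problems.append("coordinates outside the image")
    if x2 <= x1 or y2 <= y1:
        problems.append("degenerate box")
    elif (x2 - x1) * (y2 - y1) > 0.9:
        problems.append("box covers almost the whole screen")
    return problems
```

Elements that fail these checks could be dropped, re-scored, or flagged for a second detection pass.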
## API Usage

You can use this API by sending a POST request with a file upload:
```python
import requests

# Replace with your actual API URL after deployment
OMNIPARSER_API_URL = "https://your-username-omniparser-api.hf.space/api/parse"

# Upload the screenshot and send the request
with open('screenshot.png', 'rb') as f:
    response = requests.post(OMNIPARSER_API_URL, files={'image': f})

# Get the JSON result
result = response.json()

# Access the parsed elements
for element in result["elements"]:
    print(f"Element {element['id']}: {element['text']} - {element['caption']}")
    print(f"Coordinates: {element['coordinates']}")
    print(f"Interactable: {element['is_interactable']}")
    print(f"Confidence: {element['confidence']}")
    print("---")

# Access the visualization (base64-encoded image)
visualization_base64 = result["visualization"]
```
## Response Format

The API returns a JSON object with the following structure:
```json
{
  "status": "success",
  "elements": [
    {
      "id": 0,
      "text": "Button 1",
      "caption": "Click to submit form",
      "coordinates": [0.1, 0.1, 0.3, 0.2],
      "is_interactable": true,
      "confidence": 0.95
    },
    {
      "id": 1,
      "text": "Menu",
      "caption": "Navigation menu",
      "coordinates": [0.4, 0.5, 0.6, 0.6],
      "is_interactable": true,
      "confidence": 0.87
    }
  ],
  "visualization": "base64_encoded_image_string"
}
```
## Deployment

This API is deployed on Hugging Face Spaces using Gradio. The deployment is free and provides a public URL that you can use in your applications.

## Credits

This API uses Microsoft's OmniParser v2.0, a screen parsing tool for pure vision-based GUI agents. For more information, visit the [OmniParser GitHub repository](https://github.com/microsoft/OmniParser).
## License

Please note that the OmniParser models have specific licenses:

- The icon_detect model is under the AGPL license
- The icon_caption model is under the MIT license

Please refer to the LICENSE file in each model's folder in the original repository.