Upload 4 files

- README.md +76 -8
- app.py +165 -0
- app_simplified.py +100 -1
- requirements.txt +2 -1
README.md
CHANGED
@@ -1,28 +1,57 @@
 ---
 title: OmniParser v2.0 API
 emoji: 🖼️
 colorFrom: blue
 colorTo: indigo
 sdk: gradio
 sdk_version: 4.0.0
-app_file:
+app_file: app.py
 pinned: false
 ---

 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 # OmniParser v2.0 API

-This is a
+This is a public API endpoint for Microsoft's OmniParser v2.0, which can parse UI screenshots and return structured data.

 ## Features

+- Parses UI screenshots into structured JSON data
 - Identifies interactive elements (buttons, menus, icons, etc.)
 - Provides captions describing the functionality of each element
 - Returns visualization of detected elements
 - Accessible via a simple REST API

+## Enhancement Opportunities
+
+The current implementation provides a solid foundation, but there are several opportunities for enhancement:
+
+### Data Fusion
+- **Current**: YOLO for detection and VLM for captioning are used separately
+- **Enhancement**: Implement a more integrated approach that combines YOLO, VLM, OCR, and SAM
+- **Benefits**: More accurate detection, better context understanding, and more precise segmentation
+
+### OCR Integration
+- **Current**: OCR is used separately from YOLO detection
+- **Enhancement**: Use OCR results to refine YOLO detections and merge overlapping text and UI elements
+- **Benefits**: Better text recognition in UI elements and improved element classification
+
+### SAM Integration
+- **Current**: No segmentation model is used
+- **Enhancement**: Integrate SAM (Segment Anything Model) for precise segmentation of UI elements
+- **Benefits**: Better handling of complex UI layouts and irregular-shaped elements
+
+### Confidence Scoring
+- **Current**: Simple confidence scores from individual models
+- **Enhancement**: Combine confidence scores from multiple models and consider element context
+- **Benefits**: More reliable confidence scores and better prioritization of elements
+
+### Predictive Monitoring
+- **Current**: No verification of detected elements
+- **Enhancement**: Verify that detected elements make sense in the UI context
+- **Benefits**: Identify missing or incorrectly detected elements and improve detection accuracy
+
 ## API Usage

 You can use this API by sending a POST request with a file upload:
@@ -55,8 +84,47 @@ for element in elements:
 visualization_base64 = result["visualization"]
 ```

-##
+## Response Format
+
+The API returns a JSON object with the following structure:
+
+```json
+{
+  "status": "success",
+  "elements": [
+    {
+      "id": 0,
+      "text": "Button 1",
+      "caption": "Click to submit form",
+      "coordinates": [0.1, 0.1, 0.3, 0.2],
+      "is_interactable": true,
+      "confidence": 0.95
+    },
+    {
+      "id": 1,
+      "text": "Menu",
+      "caption": "Navigation menu",
+      "coordinates": [0.4, 0.5, 0.6, 0.6],
+      "is_interactable": true,
+      "confidence": 0.87
+    }
+  ],
+  "visualization": "base64_encoded_image_string"
+}
+```
+
+## Deployment
+
+This API is deployed on Hugging Face Spaces using Gradio. The deployment is free and provides a public URL that you can use in your applications.
+
+## Credits
+
+This API uses Microsoft's OmniParser v2.0, which is a screen parsing tool for pure vision-based GUI agents. For more information, visit the [OmniParser GitHub repository](https://github.com/microsoft/OmniParser).
+
+## License
+
+Please note that the OmniParser models have specific licenses:
+- icon_detect model is under AGPL license
+- icon_caption is under MIT license
+
+Please refer to the LICENSE file in the folder of each model in the original repository.
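The README's usage snippet (its lines 29-54) is elided from this diff, but the Response Format section above is enough to work against. A minimal sketch of client-side response handling, assuming only the documented JSON shape (the helper name `extract_interactable` and the sample payload are illustrative, not part of the API):

```python
import base64

def extract_interactable(result):
    """Pick out (id, caption) pairs for elements flagged interactable,
    and decode the base64 visualization into raw image bytes."""
    pairs = [(e["id"], e["caption"])
             for e in result["elements"] if e["is_interactable"]]
    image_bytes = base64.b64decode(result["visualization"])
    return pairs, image_bytes

# Sample payload matching the documented response format
sample = {
    "status": "success",
    "elements": [
        {"id": 0, "text": "Button 1", "caption": "Click to submit form",
         "coordinates": [0.1, 0.1, 0.3, 0.2],
         "is_interactable": True, "confidence": 0.95},
    ],
    "visualization": base64.b64encode(b"\x89PNG").decode(),
}

pairs, image_bytes = extract_interactable(sample)
```

The decoded bytes can then be written to a file or fed to an image library of your choice.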
app.py
CHANGED
@@ -154,13 +154,57 @@ print(f"Using device: {device}")

 # Initialize models with correct paths
 try:
+    # YOLO model for object detection
     yolo_model = get_yolo_model(model_path='OmniParser/weights/icon_detect/model.pt')
+
+    # VLM (Vision Language Model) for captioning
     caption_model_processor = get_caption_model_processor(
         model_name="florence2",
         model_name_or_path="OmniParser/weights/icon_caption_florence"
     )
+
     print("Models initialized successfully")
     models_initialized = True
+
+    # ENHANCEMENT OPPORTUNITY: Data Fusion
+    # The current implementation uses YOLO for detection and VLM for captioning separately.
+    # A more integrated approach could:
+    # 1. Use YOLO for initial detection of UI elements
+    # 2. Use VLM to refine the detections and provide more context
+    # 3. Implement a confidence-based merging strategy for overlapping detections
+    # 4. Use SAM (Segment Anything Model) for more precise segmentation of UI elements
+    #
+    # Example implementation:
+    # ```
+    # def enhanced_detection(image, yolo_model, vlm_model, sam_model):
+    #     # Get YOLO detections
+    #     yolo_boxes = yolo_model(image)
+    #
+    #     # Use VLM to analyze the entire image for context
+    #     global_context = vlm_model.analyze_image(image)
+    #
+    #     # For each YOLO box, use VLM to get more detailed information
+    #     refined_detections = []
+    #     for box in yolo_boxes:
+    #         # Crop the region
+    #         region = crop_image(image, box)
+    #
+    #         # Get VLM description
+    #         description = vlm_model.describe_region(region, context=global_context)
+    #
+    #         # Use SAM for precise segmentation
+    #         mask = sam_model.segment(image, box)
+    #
+    #         refined_detections.append({
+    #             "box": box,
+    #             "description": description,
+    #             "mask": mask,
+    #             "confidence": combine_confidence(box.conf, description.conf)
+    #         })
+    #
+    #     return refined_detections
+    # ```
+
 except Exception as e:
     print(f"Error initializing models: {str(e)}")
     # Create dummy models for graceful failure
@@ -270,6 +314,38 @@ def process_image(

     # Run OCR to detect text
     try:
+        # ENHANCEMENT OPPORTUNITY: OCR Integration
+        # The current implementation uses OCR separately from YOLO detection.
+        # A more integrated approach could:
+        # 1. Use OCR results to refine YOLO detections
+        # 2. Merge overlapping text and UI element detections
+        # 3. Use text content to improve element classification
+        #
+        # Example implementation:
+        # ```
+        # def integrated_ocr_detection(image, ocr_results, yolo_detections):
+        #     merged_detections = []
+        #
+        #     # For each YOLO detection
+        #     for yolo_box in yolo_detections:
+        #         # Find overlapping OCR text
+        #         overlapping_text = []
+        #         for text, text_box in ocr_results:
+        #             if calculate_iou(yolo_box, text_box) > threshold:
+        #                 overlapping_text.append(text)
+        #
+        #         # Use text content to refine element classification
+        #         element_type = classify_element_with_text(yolo_box, overlapping_text)
+        #
+        #         merged_detections.append({
+        #             "box": yolo_box,
+        #             "text": " ".join(overlapping_text),
+        #             "type": element_type
+        #         })
+        #
+        #     return merged_detections
+        # ```
+
         ocr_bbox_rslt, is_goal_filtered = check_ocr_box(
             image,
             display_img=False,
@@ -291,6 +367,41 @@ def process_image(

     # Process image with OmniParser
     try:
+        # ENHANCEMENT OPPORTUNITY: SAM Integration
+        # The current implementation doesn't use SAM (Segment Anything Model).
+        # Integrating SAM could:
+        # 1. Provide more precise segmentation of UI elements
+        # 2. Better handle complex UI layouts with overlapping elements
+        # 3. Improve detection of irregular-shaped elements
+        #
+        # Example implementation:
+        # ```
+        # def integrate_sam(image, boxes, sam_model):
+        #     # Initialize SAM predictor
+        #     predictor = SamPredictor(sam_model)
+        #     predictor.set_image(np.array(image))
+        #
+        #     refined_elements = []
+        #     for box in boxes:
+        #         # Convert box to SAM input format
+        #         input_box = np.array([box[0], box[1], box[2], box[3]])
+        #
+        #         # Get SAM mask
+        #         masks, scores, _ = predictor.predict(
+        #             box=input_box,
+        #             multimask_output=False
+        #         )
+        #
+        #         # Use the mask to refine the element boundaries
+        #         refined_elements.append({
+        #             "box": box,
+        #             "mask": masks[0],
+        #             "mask_confidence": scores[0]
+        #         })
+        #
+        #     return refined_elements
+        # ```
+
         dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
             image,
             yolo_model,
@@ -315,6 +426,31 @@ def process_image(
     # Create structured output
     elements = []
     for i, element in enumerate(parsed_content_list):
+        # ENHANCEMENT OPPORTUNITY: Confidence Scoring
+        # The current implementation uses a simple confidence score.
+        # A more sophisticated approach could:
+        # 1. Combine confidence scores from multiple models (YOLO, VLM, OCR)
+        # 2. Consider element context and relationships
+        # 3. Use historical data to improve confidence scoring
+        #
+        # Example implementation:
+        # ```
+        # def calculate_confidence(yolo_conf, vlm_conf, ocr_conf, element_type):
+        #     # Base confidence from YOLO
+        #     base_conf = yolo_conf
+        #
+        #     # Adjust based on VLM confidence
+        #     if vlm_conf > 0.8:
+        #         base_conf = (base_conf + vlm_conf) / 2
+        #
+        #     # Adjust based on element type
+        #     if element_type == "button" and ocr_conf > 0.9:
+        #         base_conf = (base_conf + ocr_conf) / 2
+        #
+        #     # Normalize to 0-1 range
+        #     return min(1.0, base_conf)
+        # ```
+
         elements.append({
             "id": i,
             "text": element.get("text", ""),
@@ -324,6 +460,35 @@ def process_image(
             "confidence": element.get("confidence", 0.0)
         })

+    # ENHANCEMENT OPPORTUNITY: Predictive Monitoring
+    # The current implementation doesn't include predictive monitoring.
+    # Adding this could:
+    # 1. Verify that detected elements make sense in the UI context
+    # 2. Identify missing or incorrectly detected elements
+    # 3. Provide feedback for improving detection accuracy
+    #
+    # Example implementation:
+    # ```
+    # def verify_detections(elements, image, vlm_model):
+    #     # Use VLM to analyze the entire image
+    #     global_description = vlm_model.describe_image(image)
+    #
+    #     # Check if detected elements match the global description
+    #     expected_elements = extract_expected_elements(global_description)
+    #
+    #     # Compare detected vs expected
+    #     missing_elements = [e for e in expected_elements if not any(
+    #         similar_element(e, detected) for detected in elements
+    #     )]
+    #
+    #     # Provide feedback
+    #     return {
+    #         "verified_elements": elements,
+    #         "missing_elements": missing_elements,
+    #         "confidence": calculate_overall_confidence(elements, expected_elements)
+    #     }
+    # ```
+
     # Return structured data and visualization
     return {
         "elements": elements,
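The OCR-integration comment above leans on a hypothetical `calculate_iou` helper and an unspecified `threshold`. A minimal concrete version, with boxes as plain `(x1, y1, x2, y2)` tuples (the helper names and the 0.5 default are illustrative assumptions, not part of the commit):

```python
def calculate_iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def merge_ocr_text(yolo_box, ocr_results, threshold=0.5):
    """Join the OCR strings whose boxes overlap the detection box enough.
    ocr_results is a list of (text, box) pairs, as in the comment sketch."""
    return " ".join(text for text, text_box in ocr_results
                    if calculate_iou(yolo_box, text_box) > threshold)
```

With a working IoU in hand, the merging loop in the comment block reduces to a call to `merge_ocr_text` per YOLO box.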
app_simplified.py
CHANGED
@@ -30,6 +30,47 @@ def process_image(image):
     # Define some mock UI element types
     element_types = ["Button", "Text Field", "Checkbox", "Dropdown", "Menu Item", "Icon", "Link"]

+    # ENHANCEMENT OPPORTUNITY: Data Fusion
+    # In a real implementation, we would integrate multiple models:
+    # 1. YOLO for initial detection of UI elements
+    # 2. OCR for text detection
+    # 3. VLM for captioning and context understanding
+    # 4. SAM for precise segmentation
+    #
+    # Example architecture:
+    # ```
+    # def integrated_detection(image):
+    #     # 1. Run YOLO to detect UI elements
+    #     yolo_boxes = yolo_model(image)
+    #
+    #     # 2. Run OCR to detect text
+    #     ocr_results = ocr_model(image)
+    #
+    #     # 3. Use VLM to understand the overall context
+    #     context = vlm_model.analyze_image(image)
+    #
+    #     # 4. For each detected element, use SAM for precise segmentation
+    #     elements = []
+    #     for box in yolo_boxes:
+    #         # Get SAM mask
+    #         mask = sam_model.segment(image, box)
+    #
+    #         # Find overlapping text from OCR
+    #         element_text = find_overlapping_text(box, ocr_results)
+    #
+    #         # Use VLM to caption the element with context
+    #         caption = vlm_model.caption_region(image, box, context)
+    #
+    #         elements.append({
+    #             "box": box,
+    #             "mask": mask,
+    #             "text": element_text,
+    #             "caption": caption
+    #         })
+    #
+    #     return elements
+    # ```
+
     # Generate some random elements
     elements = []
     num_elements = min(15, int(image.width * image.height / 40000))  # Scale with image size
@@ -57,6 +98,34 @@ def process_image(image):
         text = random.choice(captions[element_type])
         caption = f"{element_type}: {text}"

+        # ENHANCEMENT OPPORTUNITY: Confidence Scoring
+        # In a real implementation, confidence scores would be calculated based on:
+        # 1. Detection confidence from YOLO
+        # 2. Text recognition confidence from OCR
+        # 3. Caption confidence from VLM
+        # 4. Segmentation confidence from SAM
+        #
+        # Example implementation:
+        # ```
+        # def calculate_confidence(detection_conf, ocr_conf, vlm_conf, sam_conf):
+        #     # Weighted average of confidence scores
+        #     weights = {
+        #         "detection": 0.4,
+        #         "ocr": 0.2,
+        #         "vlm": 0.3,
+        #         "sam": 0.1
+        #     }
+        #
+        #     confidence = (
+        #         weights["detection"] * detection_conf +
+        #         weights["ocr"] * ocr_conf +
+        #         weights["vlm"] * vlm_conf +
+        #         weights["sam"] * sam_conf
+        #     )
+        #
+        #     return confidence
+        # ```
+
         # Add to elements list
         elements.append({
             "id": i,
@@ -71,10 +140,40 @@ def process_image(image):
         draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
         draw.text((x1, y1 - 10), f"{i}: {text}", fill="red")

+    # ENHANCEMENT OPPORTUNITY: Predictive Monitoring
+    # In a real implementation, we would verify the detected elements:
+    # 1. Check if the detected elements make sense in the UI context
+    # 2. Verify that interactive elements have appropriate labels
+    # 3. Ensure that the UI structure is coherent
+    #
+    # Example implementation:
+    # ```
+    # def verify_ui_elements(elements, image):
+    #     # Use VLM to analyze the entire UI
+    #     ui_analysis = vlm_model.analyze_ui(image)
+    #
+    #     # Check if detected elements match the expected UI structure
+    #     verified_elements = []
+    #     for element in elements:
+    #         # Verify element type based on appearance and context
+    #         verified_type = verify_element_type(element, ui_analysis)
+    #
+    #         # Verify interactability
+    #         verified_interactable = verify_interactability(element, verified_type)
+    #
+    #         verified_elements.append({
+    #             **element,
+    #             "verified_type": verified_type,
+    #             "verified_interactable": verified_interactable
+    #         })
+    #
+    #     return verified_elements
+    # ```
+
     return {
         "elements": elements,
         "visualization": vis_img,
-        "note": "This is a simplified implementation that simulates OmniParser functionality."
+        "note": "This is a simplified implementation that simulates OmniParser functionality. For a real implementation, consider integrating YOLO, VLM, OCR, and SAM models as described in the code comments."
     }

 # API endpoint function
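The weighted-average sketch in the Confidence Scoring comment above runs as-is once lifted out of the comment block. A directly executable version with the same weights (0.4/0.2/0.3/0.1, which the commit presents as illustrative rather than tuned):

```python
def calculate_confidence(detection_conf, ocr_conf, vlm_conf, sam_conf):
    """Weighted average of per-model confidence scores.
    Weights follow the example in the code comments and are illustrative."""
    weights = {
        "detection": 0.4,  # YOLO detection confidence
        "ocr": 0.2,        # OCR text recognition confidence
        "vlm": 0.3,        # VLM caption confidence
        "sam": 0.1,        # SAM segmentation confidence
    }
    return (weights["detection"] * detection_conf
            + weights["ocr"] * ocr_conf
            + weights["vlm"] * vlm_conf
            + weights["sam"] * sam_conf)
```

Because the weights sum to 1, the combined score stays in [0, 1] whenever the inputs do, so no extra clamping is needed.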
requirements.txt
CHANGED
@@ -7,7 +7,8 @@ numpy>=1.24.0
 easyocr>=1.7.0
 # Use a specific version of paddleocr that works with our patch
 paddleocr==2.6.0.3
-paddlepaddle
+# Use a version of paddlepaddle that is available
+paddlepaddle>=2.5.0
 opencv-python>=4.7.0
 huggingface_hub>=0.16.0
 peft>=0.4.0