| # Scratch Vision Game - Technical Documentation | |
| ## Overview | |
| The Scratch Vision Game is an AI-powered system that converts visual Scratch programming blocks from images/PDFs into functional Scratch 3.0 projects (.sb3 files). The system uses computer vision, OCR, and large language models to analyze, interpret, and reconstruct Scratch programs from visual inputs. | |
| ## System Architecture | |
| ### Core Components | |
| 1. **Image Processing Pipeline** (`app.py`) | |
| - PDF extraction and image preprocessing | |
| - Multi-modal image enhancement using OpenCV | |
| - OCR text extraction with Tesseract | |
| - Visual similarity matching using multiple algorithms | |
| 2. **Block Recognition System** (`utils/block_relation_builder.py`) | |
| - Scratch block catalog management | |
| - Pseudocode to JSON conversion | |
| - Block relationship building and validation | |
| - Project structure generation | |
| 3. **AI Processing Layer** | |
| - LLM-based code interpretation using Groq/LLaMA | |
| - Multi-modal vision models for image captioning | |
| - Semantic understanding of Scratch programming concepts | |
| ## Process Flow & System Tree Structure | |
| ### Complete User Journey Tree | |
| ``` | |
| USER INPUT (PDF File via Web Interface) | |
| │ | |
| ├── /process_pdf [POST] - Flask Route Handler | |
| │   │ | |
| │   ├── PDF Validation & Security | |
| │   │   ├── secure_filename() - Sanitize filename | |
| │   │   ├── tempfile.mkdtemp() - Create temp directory | |
| │   │   └── pdf_file.save() - Save to temp location | |
| │   │ | |
| │   ├── PDF Processing Pipeline | |
| │   │   │ | |
| │   │   └── extract_images_from_pdf() | |
| │   │       ├── partition_pdf() - Unstructured library extraction | |
| │   │       │   ├── strategy="hi_res" | |
| │   │       │   ├── extract_image_block_types=["Image"] | |
| │   │       │   └── extract_image_block_to_payload=True | |
| │   │       │ | |
| │   │       ├── Save extracted.json | |
| │   │       │   └── /outputs/EXTRACTED_JSON/{pdf_name}/extracted.json | |
| │   │       │ | |
| │   │       └── For Each Extracted Image: | |
| │   │           │ | |
| │   │           ├── Image Processing Branch | |
| │   │           │   ├── base64.b64decode() - Decode image data | |
| │   │           │   ├── Image.open() - PIL image creation | |
| │   │           │   └── image.save() - Save as PNG | |
| │   │           │       └── /outputs/DETECTED_IMAGE/{pdf_name}/Sprite_{i}.png | |
| │   │           │ | |
| │   │           └── AI Analysis Branch (Parallel) | |
| │   │               │ | |
| │   │               ├── Description Generation | |
| │   │               │   ├── LangGraph Agent (Groq LLaMA) | |
| │   │               │   ├── Prompt: "Give a brief Captioning." | |
| │   │               │   └── response["messages"][-1].content | |
| │   │               │ | |
| │   │               ├── Name Generation | |
| │   │               │   ├── LangGraph Agent (Groq LLaMA) | |
| │   │               │   ├── Prompt: "give a short name caption" | |
| │   │               │   └── response["messages"][-1].content | |
| │   │               │ | |
| │   │               └── Metadata Assembly | |
| │   │                   └── extracted_sprites.json | |
| │   │                       ├── "Sprite {count}": { | |
| │   │                       │   ├── "name": AI_generated_name | |
| │   │                       │   ├── "base64": image_data | |
| │   │                       │   ├── "file-path": pdf_directory | |
| │   │                       │   └── "description": AI_description | |
| │   │                       └── } | |
| │   │ | |
| │   └── Project Generation Pipeline | |
| │       │ | |
| │       └── similarity_matching() | |
| │           │ | |
| │           ├── Embedding Generation Branch | |
| │           │   │ | |
| │           │   ├── Query Processing | |
| │           │   │   ├── base64.b64decode() - Decode sprite images | |
| │           │   │   ├── tempfile.mkdtemp() - Create temp workspace | |
| │           │   │   └── Image.save() - Save temp sprite files | |
| │           │   │ | |
| │           │   ├── CLIP Embeddings | |
| │           │   │   ├── OpenCLIPEmbeddings() - Initialize embedder | |
| │           │   │   ├── clip_embd.embed_image() - Generate embeddings | |
| │           │   │   └── sprite_features = np.array() | |
| │           │   │ | |
| │           │   └── Similarity Computation | |
| │           │       ├── Load: /outputs/embeddings.json | |
| │           │       ├── np.matmul(sprite_matrix, img_matrix.T) | |
| │           │       └── np.argmax(similarity, axis=1) | |
| │           │ | |
| │           ├── Asset Matching & Collection | |
| │           │   │ | |
| │           │   ├── Sprite Assets Branch | |
| │           │   │   ├── Match: /blocks/sprites/{matched_folder}/ | |
| │           │   │   ├── Load: sprite.json | |
| │           │   │   ├── Copy: All files except matched image & sprite.json | |
| │           │   │   └── Append to: project_data[] | |
| │           │   │ | |
| │           │   └── Backdrop Assets Branch (Parallel) | |
| │           │       ├── Match: /blocks/Backdrops/{matched_folder}/ | |
| │           │       ├── Load: project.json | |
| │           │       ├── Copy: All files except matched image & project.json | |
| │           │       └── Extract: Stage targets → backdrop_data[] | |
| │           │ | |
| │           └── Project Assembly | |
| │               │ | |
| │               ├── JSON Structure Creation | |
| │               │   ├── final_project = { | |
| │               │   │   ├── "targets": [] | |
| │               │   │   ├── "monitors": [] | |
| │               │   │   ├── "extensions": [] | |
| │               │   │   └── "meta": {...} | |
| │               │   └── } | |
| │               │ | |
| │               ├── Sprite Integration | |
| │               │   └── For sprite in project_data: | |
| │               │       └── if not sprite.get("isStage"): | |
| │               │           └── final_project["targets"].append(sprite) | |
| │               │ | |
| │               ├── Stage/Backdrop Integration | |
| │               │   └── if backdrop_data: | |
| │               │       ├── Merge: all_costumes.extend() | |
| │               │       ├── Merge: sounds from first backdrop | |
| │               │       └── Create: Stage target with merged assets | |
| │               │ | |
| │               └── Final Output | |
| │                   ├── /outputs/project_{uuid}/project.json | |
| │                   └── Return: project_json_path | |
| │ | |
| ├── Response Generation | |
| │   └── JSON Response: | |
| │       ├── "message": "PDF processed successfully" | |
| │       ├── "output_json": extracted_sprites_path | |
| │       ├── "sprites": sprite_metadata | |
| │       ├── "project_output_json": final_project_path | |
| │       └── "test_url": download_link | |
| │ | |
| └── /download_sb3/{project_id} [GET] - Download Endpoint | |
|     ├── Locate: /game_samples/{project_id}.sb3 | |
|     ├── Validate: File existence | |
|     └── send_from_directory() - Serve .sb3 file | |
| ``` | |
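The per-image branch of the tree (decode, save, describe, assemble metadata) can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the code in `app.py`: `build_sprite_metadata` and `fake_describe` are hypothetical names, the `describe` callable stands in for the two LangGraph agent calls, raw bytes are written directly instead of going through PIL, and numbering sprites from 1 is an assumption.

```python
import base64
import json
import tempfile
from pathlib import Path
from typing import Callable


def build_sprite_metadata(
    image_payloads: list[str],
    out_dir: Path,
    describe: Callable[[bytes], tuple[str, str]],
) -> dict:
    """Decode each base64 payload, save it, and collect per-sprite metadata."""
    out_dir.mkdir(parents=True, exist_ok=True)
    sprites = {}
    for count, payload in enumerate(image_payloads, start=1):
        raw = base64.b64decode(payload)
        (out_dir / f"Sprite_{count}.png").write_bytes(raw)
        # Stand-in for the description and name agents (assumption).
        name, description = describe(raw)
        sprites[f"Sprite {count}"] = {
            "name": name,
            "base64": payload,
            "file-path": str(out_dir),
            "description": description,
        }
    (out_dir / "extracted_sprites.json").write_text(json.dumps(sprites, indent=2))
    return sprites


def fake_describe(raw: bytes) -> tuple[str, str]:
    # Placeholder for the captioning model; returns (name, description).
    return "cat", f"an image of {len(raw)} bytes"


demo_dir = Path(tempfile.mkdtemp())
demo_payloads = [base64.b64encode(b"not-a-real-png").decode("ascii")]
demo_sprites = build_sprite_metadata(demo_payloads, demo_dir, fake_describe)
```

Passing the describer in as a callable keeps the file handling testable without a Groq API key.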
| ### Parallel Processing Branches | |
| ``` | |
| CONCURRENT OPERATIONS DURING PDF PROCESSING: | |
| ├── Image Processing Thread | |
| │   ├── OpenCV Enhancement Pipeline | |
| │   │   ├── upscale_image_cv() - 2x cubic interpolation | |
| │   │   ├── reduce_noise_cv() - Non-local means denoising | |
| │   │   ├── sharpen_cv() - Kernel-based sharpening | |
| │   │   └── enhance_contrast_cv() - Contrast enhancement | |
| │   │ | |
| │   └── Multi-Algorithm Similarity Matching | |
| │       ├── DINOv2 Embeddings (Semantic) | |
| │       ├── PHash (Perceptual Hashing) | |
| │       └── Image Signatures (Goldberg Algorithm) | |
| ├── AI Processing Thread | |
| │   ├── SmolVLM Vision Model | |
| │   │   ├── Image Captioning | |
| │   │   └── Name Generation | |
| │   │ | |
| │   └── Groq LLaMA Language Model | |
| │       ├── OCR Text Refinement | |
| │       ├── Pseudocode Generation | |
| │       └── JSON Structure Validation | |
| └── I/O Operations Thread | |
|     ├── File System Operations | |
|     │   ├── Directory Creation | |
|     │   ├── Image Saving/Loading | |
|     │   └── JSON Serialization | |
|     │ | |
|     └── Asset Management | |
|         ├── Reference Asset Loading | |
|         ├── Project Asset Copying | |
|         └── Final Project Assembly | |
| ``` | |
| ### Data Flow Diagram | |
| ``` | |
| DATA TRANSFORMATION PIPELINE: | |
| PDF Bytes → Images     → Enhanced Images → Embeddings → Similarities → Assets  → .sb3 | |
| │           │            │                 │            │              │         │ | |
| [Binary]    [PIL.Image]  [np.ndarray]      [np.float32] [indices]      [JSON]    [ZIP] | |
| │           │            │                 │            │              │         │ | |
| ├─ OCR ─────┼─ AI ───────┼─ Models ────────┼─ Search ───┼─ Match ──────┼─ Build ─┤ | |
| │           │            │                 │            │              │         │ | |
| └─ Text ────┴─ Metadata ─┴─ Features ──────┴─ Ranking ──┴─ Select ─────┴─ Pack ──┘ | |
| ``` | |
| ### Key Processing Functions | |
| **Input Processing:** | |
| - `extract_images_from_pdf()` - Extracts images from PDF using unstructured library | |
| - `process_image_cv2_from_pil()` - Enhances images using OpenCV (upscaling, denoising, sharpening) | |
| ### 2. Visual Similarity Matching | |
| ``` | |
| Query Image → Multi-Algorithm Matching → Asset Selection → Project Assembly | |
| ``` | |
| **Algorithms Used:** | |
| - **DINOv2 Embeddings**: Deep learning-based semantic similarity | |
| - **Perceptual Hashing (PHash)**: Structural image comparison | |
| - **Image Signatures**: Goldberg algorithm for visual fingerprinting | |
| **Implementation:** | |
| ```python | |
| def run_query_search_flow(query_b64, embeddings_dict, hash_dict, signature_obj_map): | |
|     # Condensed excerpt: phash, gis, and intermediates such as prepped, | |
|     # query_hash_arr, and query_sig_path come from steps not shown here. | |
|     # 1. Preprocess query image | |
|     enhanced_query_pil = process_image_cv2_from_pil(query_from_b64, scale=2) | |
|     # 2. Generate embeddings | |
|     query_emb = get_dinov2_embedding_from_pil(prepped) | |
|     query_phash = phash.encode_image(image_array=query_hash_arr) | |
|     query_sig = gis.generate_signature(query_sig_path) | |
|     # 3. Compute similarities against each stored reference | |
|     emb_sim = cosine_similarity(query_emb, stored_emb) | |
|     ph_sim = 1.0 - (hamming_distance / MAX_PHASH_BITS) | |
|     im_sim = 1.0 - gis.normalized_distance(stored_sig, query_sig) | |
|     # 4. Combine scores (emb_clamped: emb_sim clamped to [0, 1], assumed from the name) | |
|     combined = (emb_clamped + ph_sim + im_sim) / 3.0 | |
| ``` | |
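To make the perceptual-hash term concrete, here is a small sketch of how `ph_sim` and the combined score could be computed. It assumes hashes are 16-character hexadecimal strings (64 bits), which is what `MAX_PHASH_BITS` is set to below; the function names and the clamping of the embedding similarity to [0, 1] are illustrative assumptions, not the project's code.

```python
MAX_PHASH_BITS = 64  # assumption: a 16-character hex hash encodes 64 bits


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def phash_similarity(hash_a: str, hash_b: str) -> float:
    """Map Hamming distance to a similarity in [0, 1] (1.0 means identical)."""
    return 1.0 - hamming_distance(hash_a, hash_b) / MAX_PHASH_BITS


def combined_score(emb_sim: float, ph_sim: float, im_sim: float) -> float:
    """Average the three similarities after clamping the cosine term."""
    emb_clamped = max(0.0, min(1.0, emb_sim))  # cosine can be negative
    return (emb_clamped + ph_sim + im_sim) / 3.0
```

Clamping matters because cosine similarity ranges over [-1, 1], while the other two terms are already in [0, 1].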
| ### 3. Code Block Recognition | |
| ``` | |
| OCR Text → LLM Processing → Pseudocode → Block Mapping → JSON Generation | |
| ``` | |
| **LLM System Prompt:** | |
| ```python | |
| SYSTEM_PROMPT = """Your task is to process OCR-extracted text from images of Scratch 3.0 code blocks and produce precisely formatted pseudocode JSON. | |
| ### Core Role | |
| - Treat this as an OCR refinement task: the input may contain typos or spacing issues. | |
| - Intelligently correct OCR mistakes to align with valid Scratch 3.0 block syntax. | |
| ### Universal Rules | |
| 1. Code Detection: If no Scratch blocks are detected, the `pseudocode` value must be "No Code-blocks". | |
| 2. Script Ownership: Determine the target from "Script for:". If it matches a `Stage_costumes` name, set `name_variable` to "Stage". | |
| 3. Pseudocode Structure: The pseudocode must be a single JSON string with `\n` for newlines. | |
| """ | |
| ``` | |
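A reply produced under this prompt still has to be parsed before block mapping. The sketch below shows one way to do that; it assumes the reply is a JSON object with the `name_variable` and `pseudocode` keys named in the prompt, possibly surrounded by extra text. `parse_llm_reply` is a hypothetical helper, not a function from the repository.

```python
import json
import re


def parse_llm_reply(reply: str) -> dict:
    """Extract the JSON object from an LLM reply that may contain extra text."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    # json.JSONDecodeError is a ValueError subclass, so callers can catch
    # ValueError for both failure modes and hand the text to a corrector.
    data = json.loads(match.group(0))
    if not isinstance(data.get("pseudocode"), str):
        raise ValueError("reply has no string 'pseudocode' field")
    return data


demo_reply = 'Sure:\n{"name_variable": "Stage", "pseudocode": "when green flag clicked\\nsay [Hello!]"}'
parsed = parse_llm_reply(demo_reply)
script_lines = parsed["pseudocode"].split("\n")
```

Because rule 3 of the prompt asks for a single JSON string with `\n` separators, splitting on newlines after decoding recovers one pseudocode line per block.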
| ### 4. Project Generation | |
| ``` | |
| Pseudocode → Block Definitions → Relationship Building → .sb3 Assembly | |
| ``` | |
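The last step, `.sb3` assembly, relies on the fact that an `.sb3` file is a ZIP archive with `project.json` and the asset files at its root. A minimal sketch, assuming all assets sit directly in the project directory (`pack_sb3` is a hypothetical name):

```python
import json
import tempfile
import zipfile
from pathlib import Path


def pack_sb3(project_dir: Path, sb3_path: Path) -> Path:
    """Zip project.json and its sibling asset files into an .sb3 archive."""
    project_json = project_dir / "project.json"
    json.loads(project_json.read_text(encoding="utf-8"))  # fail early on bad JSON
    with zipfile.ZipFile(sb3_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(project_dir.iterdir()):
            # Skip any archive that lives in the same directory.
            if path.is_file() and path.suffix != ".sb3":
                archive.write(path, arcname=path.name)
    return sb3_path


demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "project.json").write_text(
    json.dumps({"targets": [], "monitors": [], "extensions": [], "meta": {"semver": "3.0.0"}}),
    encoding="utf-8",
)
(demo_dir / "costume1.png").write_bytes(b"placeholder image bytes")
demo_sb3 = pack_sb3(demo_dir, demo_dir / "demo.sb3")
```

Writing entries with `arcname=path.name` keeps them at the archive root.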
| ## Libraries and Dependencies | |
| ### Core Libraries | |
| #### Computer Vision & Image Processing | |
| - **OpenCV** (`cv2`): Image enhancement, filtering, and preprocessing | |
| - **PIL/Pillow**: Image manipulation and format conversion | |
| - **imagededup**: Perceptual hashing for duplicate detection | |
| - **image-match**: Visual similarity using Goldberg signatures | |
| #### Machine Learning & AI | |
| - **transformers**: Hugging Face models (DINOv2, SmolVLM) | |
| - **torch**: PyTorch for deep learning inference | |
| - **sentence-transformers**: Text and image embeddings | |
| - **faiss-cpu**: Fast similarity search and clustering | |
| - **open_clip_torch**: OpenCLIP, an open-source implementation of OpenAI's CLIP, used for image embeddings | |
| #### Language Models | |
| - **langchain**: LLM orchestration and chaining | |
| - **langchain-groq**: Groq API integration | |
| - **langgraph**: Graph-based agent workflows | |
| #### Document Processing | |
| - **unstructured**: PDF parsing and content extraction | |
| - **pdf2image**: PDF to image conversion | |
| - **pytesseract**: OCR text extraction | |
| - **PyPDF2**: PDF manipulation | |
| #### Web Framework | |
| - **Flask**: Web application framework | |
| - **Flask-SocketIO**: Real-time communication | |
| - **gunicorn**: WSGI HTTP server | |
| ### Model Specifications | |
| #### Vision Models | |
| ```python | |
| from transformers import AutoImageProcessor, AutoModel, AutoModelForVision2Seq, AutoProcessor | |
| # DINOv2 for semantic image understanding | |
| DINOV2_MODEL = "facebook/dinov2-small" | |
| dinov2_processor = AutoImageProcessor.from_pretrained(DINOV2_MODEL) | |
| dinov2_model = AutoModel.from_pretrained(DINOV2_MODEL) | |
| # SmolVLM for image captioning | |
| smolvlm256m_processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct") | |
| smolvlm256m_model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct") | |
| ``` | |
| #### Language Model | |
| ```python | |
| from langchain_groq import ChatGroq | |
| # Groq LLaMA for code interpretation | |
| llm = ChatGroq( | |
|     model="meta-llama/llama-4-scout-17b-16e-instruct", | |
|     temperature=0, | |
|     max_tokens=None, | |
| ) | |
| ``` | |
| ## Technical Approaches | |
| ### 1. Multi-Modal Image Enhancement | |
| **OpenCV Pipeline:** | |
| ```python | |
| def process_image_cv2_from_pil(pil_img, scale=2): | |
|     bgr = pil_to_bgr_np(pil_img) | |
|     bgr = upscale_image_cv(bgr, scale=scale)  # Cubic interpolation | |
|     bgr = reduce_noise_cv(bgr)                # Non-local means denoising | |
|     bgr = sharpen_cv(bgr)                     # Kernel-based sharpening | |
|     bgr = enhance_contrast_cv(bgr)            # Contrast enhancement | |
|     return bgr_np_to_pil(bgr) | |
| ``` | |
| ### 2. Hybrid Similarity Scoring | |
| **Multi-Algorithm Consensus:** | |
| ```python | |
| def choose_top_candidates(embedding_results, phash_results, imgmatch_results): | |
|     # Condensed excerpt: p indexes one candidate; a, b, c are its three scores. | |
|     # Method A: Normalized weighted average | |
|     weighted_scores[p] = (w_emb * emb_norm[p] + w_ph * ph_norm[p] + w_im * im_norm[p]) | |
|     # Method B: Rank-sum (Borda count) | |
|     rank_sum[p] = rank_emb[p] + rank_ph[p] + rank_im[p] | |
|     # Method C: Harmonic mean (penalizes low or missing scores; a, b, c must be non-zero) | |
|     harm = 3.0 / ((1.0/a) + (1.0/b) + (1.0/c)) | |
| ``` | |
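The three consensus methods can be sketched end to end as below. This is an illustrative, dependency-free version and not the repository's `choose_top_candidates`: the helper names, the equal default weights, and the `eps` floor that keeps the harmonic mean defined for zero scores are all assumptions.

```python
def min_max(scores: dict) -> dict:
    """Rescale scores to [0, 1]; a constant score set maps to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def rank_positions(scores: dict) -> dict:
    """Rank candidates by descending score; 1 is the best rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {key: position for position, key in enumerate(ordered, start=1)}


def harmonic_mean(values: list, eps: float = 1e-6) -> float:
    """Harmonic mean with a small floor so zero scores do not divide by zero."""
    return len(values) / sum(1.0 / max(value, eps) for value in values)


def fuse_scores(emb: dict, ph: dict, im: dict, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Return (weighted average, rank sum, harmonic mean) per candidate."""
    w_emb, w_ph, w_im = weights
    emb_norm, ph_norm, im_norm = min_max(emb), min_max(ph), min_max(im)
    rank_emb, rank_ph, rank_im = rank_positions(emb), rank_positions(ph), rank_positions(im)
    weighted, rank_sum, harmonic = {}, {}, {}
    for p in emb:
        weighted[p] = w_emb * emb_norm[p] + w_ph * ph_norm[p] + w_im * im_norm[p]
        rank_sum[p] = rank_emb[p] + rank_ph[p] + rank_im[p]
        harmonic[p] = harmonic_mean([emb[p], ph[p], im[p]])
    return weighted, rank_sum, harmonic


demo_emb = {"cat": 0.9, "dog": 0.6, "bat": 0.1}
demo_ph = {"cat": 0.8, "dog": 0.7, "bat": 0.2}
demo_im = {"cat": 0.7, "dog": 0.9, "bat": 0.3}
weighted, rank_sum, harmonic = fuse_scores(demo_emb, demo_ph, demo_im)
```

For the weighted average and harmonic mean the best candidate has the highest value; for the rank sum it has the lowest.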
| ### 3. Block Relationship Building | |
| **Scratch Block Catalog System:** | |
| ```python | |
| def generate_blocks_from_opcodes(opcode_counts, all_block_definitions): | |
|     """ | |
|     Generates Scratch blocks with proper parent-child relationships | |
|     - Hat blocks: topLevel=True, parent=None | |
|     - Stack blocks: Linked via 'next' field | |
|     - C-blocks: Contains SUBSTACK inputs | |
|     - Shadow blocks: Linked as input values | |
|     """ | |
| ``` | |
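The relationships listed in the docstring map onto the `blocks` dictionary of a Scratch 3.0 target roughly as shown below. This is a simplified sketch rather than the logic in `block_relation_builder.py`: `link_script` and `attach_substack` are hypothetical helpers, block IDs are generated from a prefix, and real top-level blocks also carry `x`/`y` positions and populated `inputs`/`fields`, which are left out here.

```python
def link_script(opcodes: list, prefix: str = "block") -> dict:
    """Chain opcodes into a linear script using parent/next links."""
    ids = [f"{prefix}_{i}" for i in range(len(opcodes))]
    blocks = {}
    for i, opcode in enumerate(opcodes):
        blocks[ids[i]] = {
            "opcode": opcode,
            "next": ids[i + 1] if i + 1 < len(ids) else None,
            "parent": ids[i - 1] if i > 0 else None,
            "inputs": {},
            "fields": {},
            "shadow": False,
            "topLevel": i == 0,  # only the first (hat) block is top-level
        }
    return blocks


def attach_substack(blocks: dict, c_block_id: str, body: dict) -> dict:
    """Nest a linked body inside a C-block through its SUBSTACK input."""
    first_id = next(iter(body))
    body[first_id]["parent"] = c_block_id
    body[first_id]["topLevel"] = False
    blocks.update(body)
    # 2 marks an input that holds a block and has no shadow.
    blocks[c_block_id]["inputs"]["SUBSTACK"] = [2, first_id]
    return blocks


script = link_script(["event_whenflagclicked", "control_forever"])
loop_body = link_script(["motion_movesteps", "motion_ifonedgebounce"], prefix="body")
script = attach_substack(script, "block_1", loop_body)
```

The result describes a green-flag script whose `forever` loop contains two motion blocks.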
| ### 4. Project Assembly Pipeline | |
| **JSON Structure Generation:** | |
| ```python | |
| final_project = { | |
|     "targets": [],      # Sprites and Stage | |
|     "monitors": [],     # Variable/list monitors | |
|     "extensions": [],   # Scratch extensions | |
|     "meta": { | |
|         "semver": "3.0.0", | |
|         "vm": "11.3.0", | |
|         "agent": "OpenAI ScratchVision Agent" | |
|     } | |
| } | |
| ``` | |
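Filling `targets` follows the sprite and Stage integration steps from the user journey tree: non-stage sprites are appended, and backdrop costumes are merged into a single Stage that takes its sounds from the first backdrop. The sketch below is an assumption-laden illustration (`assemble_project` is a hypothetical name, and putting the Stage first in `targets` is a choice made here, not something the source states):

```python
import copy


def assemble_project(project_data: list, backdrop_data: list) -> dict:
    """Merge sprite targets and backdrop Stage targets into one project dict."""
    final_project = {
        "targets": [],
        "monitors": [],
        "extensions": [],
        "meta": {"semver": "3.0.0", "vm": "11.3.0", "agent": "OpenAI ScratchVision Agent"},
    }
    if backdrop_data:
        all_costumes = []
        for stage in backdrop_data:
            all_costumes.extend(stage.get("costumes", []))
        # Start from the first backdrop's Stage and override the merged fields.
        stage_target = copy.deepcopy(backdrop_data[0])
        stage_target.update({
            "isStage": True,
            "name": "Stage",
            "costumes": all_costumes,
            "sounds": list(backdrop_data[0].get("sounds", [])),
        })
        final_project["targets"].append(stage_target)
    for sprite in project_data:
        if not sprite.get("isStage"):
            final_project["targets"].append(sprite)
    return final_project


demo_sprites = [{"isStage": False, "name": "Cat"}, {"isStage": True, "name": "Stage"}]
demo_backdrops = [
    {"isStage": True, "name": "Stage", "costumes": [{"name": "beach"}], "sounds": [{"name": "pop"}]},
    {"isStage": True, "name": "Stage", "costumes": [{"name": "city"}], "sounds": [{"name": "ignored"}]},
]
demo_project = assemble_project(demo_sprites, demo_backdrops)
```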
| ## File System Architecture | |
| ### Project Directory Structure | |
| ``` | |
| scratch-vision-game/ | |
| ├── app.py                              # Main Flask application (PRIMARY) | |
| ├── requirements.txt                    # Python dependencies | |
| ├── Dockerfile                          # Container configuration | |
| ├── README.md                           # Basic project info | |
| ├── README2.md                          # Technical documentation | |
| │ | |
| ├── utils/                              # Core processing utilities | |
| │   └── block_relation_builder.py       # Scratch block logic & JSON generation | |
| │ | |
| ├── blocks/                             # Scratch block definitions & assets | |
| │   ├── blocks.json                     # Main block catalog | |
| │   ├── boolean_blocks.json             # Boolean/condition blocks | |
| │   ├── cap_blocks.json                 # Terminal blocks (stop, delete clone) | |
| │   ├── c_blocks.json                   # Control flow blocks (if, repeat, forever) | |
| │   ├── control_blocks.json             # Control category blocks | |
| │   ├── data_blocks.json                # Variables and lists blocks | |
| │   ├── event_blocks.json               # Event/trigger blocks | |
| │   ├── hat_blocks.json                 # Script starter blocks | |
| │   ├── looks_blocks.json               # Appearance blocks | |
| │   ├── motion_blocks.json              # Movement blocks | |
| │   ├── operator_blocks.json            # Math and logic operators | |
| │   ├── reporter_blocks.json            # Value reporter blocks | |
| │   ├── sensing_blocks.json             # Sensor blocks | |
| │   ├── sound_blocks.json               # Audio blocks | |
| │   ├── stack_blocks.json               # Sequential action blocks | |
| │   │ | |
| │   ├── sprites/                        # Reference sprite assets | |
| │   │   ├── {sprite_name}/ | |
| │   │   │   ├── {sprite_image}.png | |
| │   │   │   ├── sprite.json             # Sprite definition | |
| │   │   │   └── {sounds}.wav | |
| │   │   └── ... | |
| │   │ | |
| │   ├── Backdrops/                      # Reference backdrop assets | |
| │   │   ├── {backdrop_name}/ | |
| │   │   │   ├── {backdrop_image}.png | |
| │   │   │   ├── project.json            # Stage definition | |
| │   │   │   └── {sounds}.wav | |
| │   │   └── ... | |
| │   │ | |
| │   └── sound/                          # Audio assets library | |
| │       └── *.wav | |
| │ | |
| ├── templates/                          # Flask HTML templates | |
| │   └── *.html | |
| │ | |
| ├── static/                             # Web static assets | |
| │   ├── css/ | |
| │   ├── js/ | |
| │   └── images/ | |
| │ | |
| ├── game_samples/                       # Pre-built .sb3 files | |
| │   └── *.sb3 | |
| │ | |
| ├── generated_projects/                 # Runtime generated projects | |
| │   └── project_{uuid}/ | |
| │       ├── project.json | |
| │       ├── *.png | |
| │       └── *.wav | |
| │ | |
| └── outputs/                            # Processing outputs (Runtime) | |
|     ├── DETECTED_IMAGE/                 # Extracted & processed images | |
|     │   └── {pdf_name}/ | |
|     │       └── Sprite_*.png | |
|     │ | |
|     ├── SCANNED_IMAGE/                  # Original scanned images | |
|     │ | |
|     ├── EXTRACTED_JSON/                 # Intermediate JSON data | |
|     │   └── {pdf_name}/ | |
|     │       ├── extracted.json          # Raw PDF extraction | |
|     │       └── extracted_sprites.json  # AI-processed sprites | |
|     │ | |
|     └── embeddings.json                 # Pre-computed embeddings cache | |
| ``` | |
| ### Runtime Directory Creation Flow | |
| ``` | |
| DYNAMIC DIRECTORY CREATION: | |
| User Upload        →  PDF Processing     →  Directory Structure | |
| │                     │                     │ | |
| ├─ temp_dir ──────────┼─ pdf_filename ──────┼─ /outputs/DETECTED_IMAGE/{pdf_name}/ | |
| │                     │                     ├─ /outputs/EXTRACTED_JSON/{pdf_name}/ | |
| │                     │                     └─ /generated_projects/project_{uuid}/ | |
| │                     │ | |
| └─ secure_filename() ─┴───────────────────────→ Sanitized paths | |
| ``` | |
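The directory flow above can be sketched with `pathlib`. The example is illustrative only: `sanitize_stem` is a simple standard-library stand-in for `secure_filename()`, and `create_run_directories` and the `base_dir` parameter are hypothetical names.

```python
import re
import tempfile
import uuid
from pathlib import Path


def sanitize_stem(filename: str) -> str:
    """Reduce an uploaded filename to a safe directory name (no path parts)."""
    stem = Path(filename.replace("\\", "/")).stem
    return re.sub(r"[^A-Za-z0-9_.-]", "_", stem) or "upload"


def create_run_directories(base_dir: Path, pdf_filename: str) -> dict:
    """Create the per-upload output directories and return their paths."""
    pdf_name = sanitize_stem(pdf_filename)
    project_id = uuid.uuid4().hex
    dirs = {
        "detected": base_dir / "outputs" / "DETECTED_IMAGE" / pdf_name,
        "extracted": base_dir / "outputs" / "EXTRACTED_JSON" / pdf_name,
        "project": base_dir / "generated_projects" / f"project_{project_id}",
    }
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


demo_base = Path(tempfile.mkdtemp())
demo_dirs = create_run_directories(demo_base, "worksheet.pdf")
```

Using `exist_ok=True` lets the same PDF name be processed twice without failing, while the UUID keeps each generated project separate.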
| ### Data Persistence Locations | |
| ``` | |
| PERSISTENT DATA STORAGE: | |
| ├── Input Processing | |
| │   ├── /tmp/{random}/ - Temporary PDF storage | |
| │   ├── /outputs/DETECTED_IMAGE/ - Extracted sprite images | |
| │   ├── /outputs/EXTRACTED_JSON/ - Processing metadata | |
| │   └── /outputs/embeddings.json - Similarity search cache | |
| │ | |
| ├── Asset Matching | |
| │   ├── /blocks/sprites/ - Reference sprite library | |
| │   ├── /blocks/Backdrops/ - Reference backdrop library | |
| │   └── /blocks/*.json - Block definition catalogs | |
| │ | |
| └── Final Output | |
|     ├── /generated_projects/project_{uuid}/ - Assembled project | |
|     ├── /game_samples/{project_id}.sb3 - Downloadable Scratch file | |
|     └── /logs/app.log - Application logs | |
| ``` | |
| ## API Endpoints | |
| ### `/process_pdf` (POST) | |
| Processes uploaded PDF files containing Scratch code blocks. | |
| **Request:** | |
| ``` | |
| Content-Type: multipart/form-data | |
| pdf_file: <PDF file> | |
| ``` | |
| **Response:** | |
| ```json | |
| { | |
|   "message": "PDF processed successfully", | |
|   "output_json": "path/to/extracted.json", | |
|   "sprites": {...}, | |
|   "project_output_json": "path/to/project.json", | |
|   "test_url": "<download link>" | |
| } | |
| ``` | |
| ### `/download_sb3/<project_id>` (GET) | |
| Downloads generated Scratch 3.0 project files. | |
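Before serving a file, the endpoint has to locate `{project_id}.sb3` and confirm that it exists. A minimal sketch of that lookup follows; `resolve_sb3` and the ID pattern are assumptions for illustration. Flask's `send_from_directory()` already rejects paths that escape the directory, so the pattern check here is only an extra guard.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Assumption: project IDs contain only letters, digits, "_" and "-".
PROJECT_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def resolve_sb3(samples_dir: Path, project_id: str) -> Optional[Path]:
    """Return the .sb3 path for a project ID, or None if invalid or missing."""
    if not PROJECT_ID_PATTERN.fullmatch(project_id):
        return None
    candidate = samples_dir / f"{project_id}.sb3"
    return candidate if candidate.is_file() else None


demo_samples = Path(tempfile.mkdtemp())
(demo_samples / "abc123.sb3").write_bytes(b"placeholder archive")
found = resolve_sb3(demo_samples, "abc123")
```

A route handler could return a 404 response whenever this helper yields `None`.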
| ## Processing Timeline & Performance | |
| ### Execution Timeline Tree | |
| ``` | |
| PROCESSING TIMELINE (Typical PDF with 5 images): | |
| User Upload (0.0s) | |
| │ | |
| ├── PDF Validation (0.1s) | |
| │   └── File security & temp storage | |
| │ | |
| ├── PDF Extraction (2-5s) | |
| │   ├── partition_pdf() - Unstructured processing | |
| │   ├── Image extraction & base64 encoding | |
| │   └── extracted.json creation | |
| │ | |
| ├── AI Processing (10-15s per image) | |
| │   ├── Description Generation (5-7s) | |
| │   │   ├── LangGraph agent initialization | |
| │   │   ├── Groq API call | |
| │   │   └── Response processing | |
| │   │ | |
| │   ├── Name Generation (5-7s) | |
| │   │   ├── Second LangGraph agent call | |
| │   │   ├── Groq API call | |
| │   │   └── Response processing | |
| │   │ | |
| │   └── Metadata Assembly (0.1s) | |
| │       └── JSON structure creation | |
| │ | |
| ├── Similarity Matching (3-8s) | |
| │   ├── Image Decoding (0.5s) | |
| │   ├── CLIP Embeddings (2-3s) | |
| │   ├── Similarity Computation (0.5s) | |
| │   └── Asset Matching (2-4s) | |
| │ | |
| ├── Project Assembly (1-2s) | |
| │   ├── JSON merging | |
| │   ├── Asset copying | |
| │   └── Final project creation | |
| │ | |
| └── Response Generation (0.1s) | |
|     └── JSON response formatting | |
| TOTAL: ~60-90 seconds for 5-image PDF | |
| ``` | |
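The total can be checked by summing the stage ranges, with AI processing scaled by the number of images. The sketch below assumes the stages run sequentially and takes the figures from the timeline as given; `estimate_total` is a hypothetical helper.

```python
# (low, high) estimates in seconds, copied from the timeline above.
FIXED_STAGES = {
    "pdf_validation": (0.1, 0.1),
    "pdf_extraction": (2.0, 5.0),
    "similarity_matching": (3.0, 8.0),
    "project_assembly": (1.0, 2.0),
    "response_generation": (0.1, 0.1),
}
AI_PER_IMAGE = (10.0, 15.0)


def estimate_total(num_images: int) -> tuple:
    """Sum fixed stages plus per-image AI time, assuming sequential execution."""
    low = sum(lo for lo, _ in FIXED_STAGES.values()) + num_images * AI_PER_IMAGE[0]
    high = sum(hi for _, hi in FIXED_STAGES.values()) + num_images * AI_PER_IMAGE[1]
    return low, high


low_s, high_s = estimate_total(5)
```

For five images this gives roughly 56 to 90 seconds, in line with the stated total, and shows that the per-image AI calls dominate the runtime.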
| ### Performance Bottlenecks & Optimizations | |
| ``` | |
| PERFORMANCE OPTIMIZATION STRATEGIES: | |
| ├── Model Loading (Startup Cost) | |
| │   ├── Pre-loaded global models | |
| │   │   ├── DINOv2: ~2GB VRAM | |
| │   │   ├── SmolVLM: ~1GB VRAM | |
| │   │   └── CLIP: ~500MB VRAM | |
| │   │ | |
| │   ├── GPU Acceleration (when available) | |
| │   │   └── torch.device("cuda" if torch.cuda.is_available() else "cpu") | |
| │   │ | |
| │   └── CPU Optimization | |
| │       └── torch.set_num_threads(4) | |
| │ | |
| ├── Image Processing Pipeline | |
| │   ├── Efficient NumPy Operations | |
| │   │   ├── Vectorized computations | |
| │   │   ├── In-place operations where possible | |
| │   │   └── Memory-mapped file access | |
| │   │ | |
| │   ├── OpenCV Optimizations | |
| │   │   ├── Multi-threaded operations | |
| │   │   ├── SIMD instructions | |
| │   │   └── Optimized algorithms | |
| │   │ | |
| │   └── Memory Management | |
| │       ├── Garbage collection hints | |
| │       ├── Temporary file cleanup | |
| │       └── Buffer reuse | |
| │ | |
| ├── Similarity Search Acceleration | |
| │   ├── Pre-computed Embeddings Cache | |
| │   │   └── /outputs/embeddings.json (persistent) | |
| │   │ | |
| │   ├── Normalized Embeddings | |
| │   │   ├── Cosine similarity via dot product | |
| │   │   └── L2 normalization preprocessing | |
| │   │ | |
| │   └── Parallel Algorithm Execution | |
| │       ├── DINOv2, PHash, ImageMatch concurrent | |
| │       └── Multi-threaded similarity computation | |
| │ | |
| └── API & I/O Optimizations | |
|     ├── Async File Operations | |
|     ├── Streaming Responses | |
|     ├── Connection Pooling | |
|     └── Compression (gzip) | |
| ``` | |
| ### Memory Usage Profile | |
| ``` | |
| MEMORY CONSUMPTION BREAKDOWN: | |
| ├── AI Models (Peak: ~4GB) | |
| │   ├── DINOv2 Model: ~2GB | |
| │   ├── SmolVLM Model: ~1GB | |
| │   ├── CLIP Embeddings: ~500MB | |
| │   └── Groq API Client: ~100MB | |
| │ | |
| ├── Image Processing (Peak: ~500MB per image) | |
| │   ├── Original PIL Images: ~50MB each | |
| │   ├── Enhanced Images: ~100MB each | |
| │   ├── OpenCV Buffers: ~200MB each | |
| │   └── Embedding Vectors: ~2KB each | |
| │ | |
| ├── Data Structures (Peak: ~200MB) | |
| │   ├── Block Definitions: ~50MB | |
| │   ├── Asset Metadata: ~100MB | |
| │   ├── Similarity Matrices: ~50MB | |
| │   └── JSON Structures: ~10MB | |
| │ | |
| └── Web Framework (Baseline: ~100MB) | |
|     ├── Flask Application: ~50MB | |
|     ├── Request Buffers: ~30MB | |
|     └── Response Caching: ~20MB | |
| TOTAL PEAK: ~5GB (with GPU models loaded) | |
| TOTAL BASELINE: ~1GB (CPU-only, no active processing) | |
| ``` | |
| ### Performance Optimizations | |
| ### 1. Model Caching | |
| - Pre-loaded models with global variables | |
| - GPU acceleration when available | |
| - Batch processing for multiple images | |
| ### 2. Image Processing | |
| - Efficient numpy operations | |
| - OpenCV optimizations | |
| - Memory management for large images | |
| ### 3. Similarity Search | |
| - FAISS indexing for fast nearest neighbor search | |
| - Normalized embeddings for cosine similarity | |
| - Parallel processing of multiple algorithms | |
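
Why normalization helps: once every vector has unit length, cosine similarity reduces to a plain dot product, so matching each query to its best reference is a matrix product followed by a row-wise argmax. The sketch below shows the idea in pure Python to stay dependency-free; the function names are illustrative, and a NumPy version would use `np.matmul` and `np.argmax` instead.

```python
import math


def l2_normalize(vector: list) -> list:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def best_matches(queries: list, references: list) -> list:
    """Return, for each query, the index of the most similar reference."""
    unit_queries = [l2_normalize(v) for v in queries]
    unit_refs = [l2_normalize(v) for v in references]
    matches = []
    for q in unit_queries:
        # Dot product of unit vectors equals cosine similarity.
        sims = [sum(a * b for a, b in zip(q, r)) for r in unit_refs]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches


demo_matches = best_matches([[1.0, 0.0], [0.0, 2.0]], [[0.0, 5.0], [3.0, 0.1]])
```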
| ## Error Handling | |
| ### 1. Graceful Degradation | |
| ```python | |
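| def process_image_cv2_from_pil(pil_img, scale=2): | |
|     try: | |
|         # OpenCV enhancement pipeline | |
|         bgr = pil_to_bgr_np(pil_img) | |
|         bgr = upscale_image_cv(bgr, scale=scale) | |
|         bgr = reduce_noise_cv(bgr) | |
|         bgr = sharpen_cv(bgr) | |
|         bgr = enhance_contrast_cv(bgr) | |
|         return bgr_np_to_pil(bgr) | |
|     except Exception as e: | |
|         print(f"Enhancement failed: {e}") | |
|         return pil_img  # Fallback to the original input image | |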
| ``` | |
| ### 2. JSON Validation | |
| ```python | |
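| from langgraph.prebuilt import create_react_agent | |
| agent_json_resolver = create_react_agent( | |
|     model=llm, | |
|     tools=[],  # assumed: the corrector agent needs no tools | |
|     prompt=SYSTEM_PROMPT_JSON_CORRECTOR | |
| ) | |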
| ``` | |
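One way such a corrector could be wired in is a parse-and-retry loop: try `json.loads`, and on failure hand the text and the error message to the corrector before trying again. The sketch below is an assumption about the control flow, not the project's code; `loads_with_correction` is a hypothetical helper and `strip_trailing_commas` is a toy stand-in for a call to `agent_json_resolver`.

```python
import json
from typing import Callable


def loads_with_correction(
    raw: str,
    corrector: Callable[[str, str], str],
    max_attempts: int = 2,
):
    """Parse JSON, asking a corrector to repair the text after each failure."""
    text = raw
    for attempt in range(max_attempts + 1):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            if attempt == max_attempts:
                raise
            # In the real system this would be an LLM call (assumption).
            text = corrector(text, str(err))


def strip_trailing_commas(text: str, error: str) -> str:
    # Toy corrector used only to demonstrate the loop.
    return text.replace(",}", "}").replace(",]", "]")


repaired = loads_with_correction('{"opcode": "motion_movesteps",}', strip_trailing_commas)
```

Capping the number of attempts keeps a persistently malformed reply from looping forever.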
| ## Deployment | |
| ### Docker Configuration | |
| ```dockerfile | |
| FROM python:3.11-slim | |
| # System dependencies: tesseract-ocr, poppler-utils, libgl1 | |
| # Python dependencies: requirements.txt | |
| # Environment: Flask production mode | |
| EXPOSE 7860 | |
| CMD ["python", "app.py"] | |
| ``` | |
| ### Environment Variables | |
| - `GROQ_API_KEY`: API key for Groq language model | |
| - `TRANSFORMERS_CACHE`: Model cache directory (deprecated in recent `transformers` releases in favor of `HF_HOME`) | |
| - `HF_HOME`: Hugging Face cache directory | |
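
A startup check can fail fast when the Groq key is missing instead of failing on the first API call. The sketch below is one possible shape: `Settings` and `load_settings` are hypothetical names, and the fallback path mirrors the default Hugging Face cache location (`~/.cache/huggingface`).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    groq_api_key: str
    hf_home: str


def load_settings(env=None) -> Settings:
    """Read required and optional settings from the environment."""
    env = os.environ if env is None else env
    api_key = env.get("GROQ_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GROQ_API_KEY is not set")
    default_cache = os.path.join(os.path.expanduser("~"), ".cache", "huggingface")
    return Settings(groq_api_key=api_key, hf_home=env.get("HF_HOME", default_cache))


demo_settings = load_settings({"GROQ_API_KEY": "example-key", "HF_HOME": "/data/hf"})
```

Accepting a mapping makes the loader easy to test without touching the real process environment.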
| ## Future Enhancements | |
| 1. **Real-time Processing**: WebSocket integration for live feedback | |
| 2. **Advanced OCR**: Custom trained models for Scratch block recognition | |
| 3. **Multi-language Support**: International Scratch block recognition | |
| 4. **Collaborative Features**: Multi-user project editing | |
| 5. **Performance Monitoring**: Detailed analytics and optimization metrics | |
| ## Contributing | |
| The system is designed with modularity in mind: | |
| - Add new block definitions in `blocks/` directory | |
| - Extend similarity algorithms in the matching pipeline | |
| - Enhance OCR accuracy with custom preprocessing | |
| - Improve LLM prompts for better code interpretation | |
| ## License | |
| Apache 2.0 License - See project repository for full details. | |