| # scraperl-documentation | |
| Welcome to ScrapeRL - an advanced Reinforcement Learning-powered web scraping environment. This documentation covers all aspects of using and configuring ScrapeRL. | |
| --- | |
| ## table-of-contents | |
| 1. [Getting Started](#getting-started) | |
| 2. [Dashboard Overview](#dashboard-overview) | |
| 3. [Agents](#agents) | |
| 4. [Plugins](#plugins) | |
| 5. [Memory System](#memory-system) | |
| 6. [Models & Providers](#models-and-providers) | |
| 7. [Settings](#settings) | |
| 8. [API Reference](#api-reference) | |
| 9. [Troubleshooting](#troubleshooting) | |
| --- | |
| ## getting-started | |
| ### what-is-scraperl | |
| ScrapeRL is an intelligent web scraping system that uses Reinforcement Learning (RL) to learn and adapt scraping strategies. Unlike traditional scrapers, ScrapeRL can: | |
| - **Learn from experience** - Improve scraping strategies over time | |
| - **Adapt to changes** - Handle website structure changes automatically | |
| - **Multi-agent coordination** - Use specialized agents for different tasks | |
| - **Memory-enhanced** - Remember patterns and optimize future runs | |
| ### quick-start | |
| 1. **Enter a Target URL** - Provide the webpage you want to scrape | |
| 2. **Write an Instruction** - Describe what data you want to extract | |
| 3. **Configure Options** - Select model, agents, and plugins | |
| 4. **Start Episode** - Click Start and follow the run in the dashboard | |
| ### example-task | |
| ``` | |
| URL: https://example.com/products | |
| Instruction: Extract all product names, prices, and descriptions | |
| Task Type: Medium | |
| ``` | |
| --- | |
| ## dashboard-overview | |
| The dashboard is your command center for monitoring and controlling scraping operations. | |
| ### layout-structure | |
| | Section | Description | | |
| |---------|-------------| | |
| | **Input Bar** | Enter URL, instruction, and configure task | | |
| | **Left Sidebar** | View active agents, MCPs, skills, and tools | | |
| | **Center Area** | Main visualization and current observation | | |
| | **Right Sidebar** | Memory stats, extracted data, recent actions | | |
| | **Bottom Logs** | Real-time terminal-style log output | | |
| ### stats-header | |
| The header shows key metrics with expandable details: | |
| - **Episodes** - Total scraping sessions completed | |
| - **Steps** - Actions taken in the current session and across all sessions | |
| - **Reward** - Performance score (higher is better) | |
| - **Time** - Current time and session duration | |
| Click the **⋯** icon on any stat to see detailed statistics (min, max, average). | |
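The detailed statistics can be reproduced from raw per-episode values. This is a minimal sketch, not ScrapeRL's own code: the `summarize` helper and the sample rewards are hypothetical.

```python
from statistics import mean


def summarize(values):
    """Return min, max, and average for a list of per-episode values."""
    if not values:
        # No episodes yet, so there is nothing to aggregate.
        return {"min": None, "max": None, "average": None}
    return {"min": min(values), "max": max(values), "average": mean(values)}


# Hypothetical rewards from three episodes.
summary = summarize([0.5, 1.5, 1.0])
```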
| ### task-configuration | |
| #### task-types | |
| | Type | Description | Use Case | | |
| |------|-------------|----------| | |
| | **Low** | Simple single-page scraping | Product page, article text | | |
| | **Medium** | Multi-page with navigation | Search results, listings | | |
| | **High** | Complex interactive tasks | Login-required, forms | | |
| --- | |
| ## agents | |
| ScrapeRL uses a multi-agent architecture where specialized agents handle different aspects of scraping. | |
| ### available-agents | |
| | Agent | Role | Description | | |
| |-------|------|-------------| | |
| | **Coordinator** | Orchestrator | Manages all other agents, decides strategy | | |
| | **Scraper** | Extractor | Extracts data from page content | | |
| | **Navigator** | Navigation | Handles page navigation, clicking, scrolling | | |
| | **Analyzer** | Analysis | Analyzes extracted data for patterns | | |
| | **Validator** | Validation | Validates data quality and completeness | | |
| ### agent-selection | |
| 1. Click the **Agents** button in the input bar | |
| 2. Select agents you want to enable | |
| 3. Active agents appear in the left sidebar accordion | |
| 4. Monitor agent activity in real-time | |
| ### agent-status-indicators | |
| - **Active** - Currently processing | |
| - **Ready** - Waiting for task | |
| - **Idle** - Not currently in use | |
| - **Error** - Encountered an issue | |
| --- | |
| ## plugins | |
| Extend ScrapeRL's capabilities with plugins organized by category. | |
| ### plugin-categories | |
| #### mcps-model-context-protocols | |
| Tools that provide browser automation and page interaction: | |
| | Plugin | Description | | |
| |--------|-------------| | |
| | Browser Use | AI-powered browser automation | | |
| | Puppeteer MCP | Headless Chrome control | | |
| | Playwright MCP | Cross-browser automation | | |
| #### skills | |
| Specialized capabilities for specific tasks: | |
| | Plugin | Description | | |
| |--------|-------------| | |
| | Web Scraping | Core extraction algorithms | | |
| | Data Extraction | Structured data parsing | | |
| | Form Filling | Automated form completion | | |
| #### apis | |
| External service integrations: | |
| | Plugin | Description | | |
| |--------|-------------| | |
| | Firecrawl | High-performance web crawler | | |
| | Jina Reader | Content reader API | | |
| | Serper | Search engine results API | | |
| #### vision | |
| Visual understanding capabilities: | |
| | Plugin | Description | | |
| |--------|-------------| | |
| | GPT-4 Vision | OpenAI visual analysis | | |
| | Gemini Vision | Google visual AI | | |
| | Claude Vision | Anthropic visual models | | |
| ### managing-plugins | |
| 1. Go to the **Plugins** tab | |
| 2. Browse by category | |
| 3. Click **Install** to add a plugin | |
| 4. Enable installed plugins from the Dashboard via the Plugins popup | |
| --- | |
| ## memory-system | |
| ScrapeRL uses a hierarchical memory system for context retention. | |
| ### memory-layers | |
| | Layer | Purpose | Retention | | |
| |-------|---------|-----------| | |
| | **Working** | Current task context | Session | | |
| | **Episodic** | Experience records | Persistent | | |
| | **Semantic** | Learned patterns | Persistent | | |
| | **Procedural** | Action sequences | Persistent | | |
| ### memory-features | |
| - **Auto-consolidation** - Promotes important data between layers | |
| - **Similarity search** - Find related memories quickly | |
| - **Pattern recognition** - Learn from past experiences | |
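To illustrate what similarity search over stored memories can look like, here is a small sketch using bag-of-words cosine similarity. How ScrapeRL actually ranks memories is not documented here; the functions `cosine_similarity` and `search` and the sample memories are assumptions for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts using word counts as vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, memories: list[str], limit: int = 10) -> list[str]:
    """Return up to `limit` memories, most similar to the query first."""
    ranked = sorted(memories, key=lambda m: cosine_similarity(query, m), reverse=True)
    return ranked[:limit]


# Hypothetical stored memories.
memories = [
    "login form selectors",
    "product prices on listing page",
    "pagination next button",
]
top = search("product prices", memories, limit=1)
```

A production system would more likely use learned embeddings than raw word counts; the ranking step stays the same.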
| --- | |
| ## models-and-providers | |
| ### supported-providers | |
| | Provider | Models | Best For | | |
| |----------|--------|----------| | |
| | **Groq** | GPT-OSS 120B | Fast inference, default | | |
| | **Google** | Gemini 2.5 Flash | Balanced performance | | |
| | **OpenAI** | GPT-4 Turbo | High accuracy | | |
| | **Anthropic** | Claude 3 Opus | Complex reasoning | | |
| ### model-selection | |
| 1. Click the **Model** button in the input bar | |
| 2. Select from the available models | |
| 3. Make sure an API key for the model's provider is configured | |
| ### api-keys | |
| Configure API keys in **Settings > API Keys**: | |
| 1. Select provider | |
| 2. Enter your API key | |
| 3. Click Save | |
| 4. Key status shows as "Active" when configured | |
| --- | |
| ## settings | |
| ### general-settings | |
| | Setting | Description | | |
| |---------|-------------| | |
| | WebSocket Updates | Enable real-time updates | | |
| | Memory Persistence | Save memory across sessions | | |
| | Auto-save Episodes | Automatically save completed episodes | | |
| | Debug Mode | Enable verbose logging | | |
| ### budget-and-limits | |
| Control API usage costs: | |
| - **Daily Limit** - Maximum spend per day | |
| - **Monthly Limit** - Maximum spend per month | |
| - **Max Tokens** - Token limit per request | |
| - **Alert Threshold** - Warning at 80% usage | |
| > Budget limits are disabled by default. Enable in Settings to control spending. | |
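The alert threshold is a simple ratio of spend to limit. The sketch below shows the arithmetic under stated assumptions: the function name `budget_status`, the status labels, and the choice to treat spend at or above the limit as exceeded are hypothetical, not ScrapeRL's documented behavior.

```python
def budget_status(spent: float, limit: float, alert_threshold: float = 0.8) -> str:
    """Classify spend against a daily or monthly limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    usage = spent / limit
    if usage >= 1.0:
        return "exceeded"  # at or over the limit
    if usage >= alert_threshold:
        return "warning"   # at or over 80% by default
    return "ok"


# With a limit of 10.00, the warning starts at 8.00 of spend.
status = budget_status(spent=9.0, limit=10.0)
```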
| ### appearance | |
| - **Theme** - Dark (default), Light, Auto | |
| - **Compact Mode** - Reduce UI spacing | |
| - **Animations** - Enable/disable transitions | |
| --- | |
| ## api-reference | |
| ### health-check | |
| ```bash | |
| GET /api/health | |
| ``` | |
| Response: | |
| ```json | |
| { | |
| "status": "healthy", | |
| "version": "0.1.0", | |
| "timestamp": "2026-03-28T00:00:00Z" | |
| } | |
| ``` | |
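A client can poll this endpoint before starting work. The following is a minimal sketch using only the Python standard library; the base URL `http://localhost:8000` and the helper names are assumptions, so adjust them to your deployment.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: replace with your server address


def parse_health(body: str) -> bool:
    """Return True when the response body reports a healthy status."""
    data = json.loads(body)
    return data.get("status") == "healthy"


def check_health(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """Call GET /api/health and interpret the JSON response."""
    with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
        return parse_health(resp.read().decode("utf-8"))
```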
| ### episode-management | |
| ```bash | |
| # Start new episode | |
| POST /api/episode/reset | |
| { | |
| "task_id": "scrape-products", | |
| "config": { ... } | |
| } | |
| # Take action | |
| POST /api/episode/step | |
| { | |
| "action": "navigate", | |
| "params": { "url": "..." } | |
| } | |
| # Get current state | |
| GET /api/episode/state | |
| ``` | |
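The same calls can be assembled from Python. This is a sketch under assumptions: the base URL and the helper names (`build_post`, `reset_request`, `step_request`, `send`) are hypothetical, the `config` contents are not specified above so an empty dict stands in, and the navigate URL is taken from the example task.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: replace with your server address


def build_post(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_request(task_id: str, config: dict) -> urllib.request.Request:
    """Request that starts a new episode."""
    return build_post("/api/episode/reset", {"task_id": task_id, "config": config})


def step_request(action: str, params: dict) -> urllib.request.Request:
    """Request that takes one action in the current episode."""
    return build_post("/api/episode/step", {"action": action, "params": params})


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


reset = reset_request("scrape-products", config={})  # placeholder config
step = step_request("navigate", {"url": "https://example.com/products"})
```

Keeping request construction separate from `send` makes the payloads easy to inspect or log before anything reaches the server.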
| ### memory-api | |
| ```bash | |
| # Store entry | |
| POST /api/memory/store | |
| { | |
| "content": "...", | |
| "memory_type": "working", | |
| "metadata": { ... } | |
| } | |
| # Query memories | |
| POST /api/memory/query | |
| { | |
| "query": "product prices", | |
| "memory_type": "semantic", | |
| "limit": 10 | |
| } | |
| ``` | |
| ### plugins-api | |
| ```bash | |
| # List plugins | |
| GET /api/plugins/ | |
| # Install plugin | |
| POST /api/plugins/install | |
| { "plugin_id": "firecrawl" } | |
| # Uninstall plugin | |
| POST /api/plugins/uninstall | |
| { "plugin_id": "firecrawl" } | |
| ``` | |
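Install and uninstall share the same body shape, so one helper can cover both. This is a sketch only: the base URL and the `plugin_request` helper are hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: replace with your server address


def plugin_request(action: str, plugin_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /api/plugins/install or /api/plugins/uninstall."""
    if action not in ("install", "uninstall"):
        raise ValueError(f"unsupported action: {action!r}")
    return urllib.request.Request(
        url=f"{base_url}/api/plugins/{action}",
        data=json.dumps({"plugin_id": plugin_id}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


install = plugin_request("install", "firecrawl")
```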
| --- | |
| ## troubleshooting | |
| ### common-issues | |
| #### api-key-required-error | |
| **Solution:** Configure at least one API key in Settings > API Keys | |
| #### episode-not-starting | |
| **Checklist:** | |
| - [ ] Valid URL entered | |
| - [ ] At least one agent selected | |
| - [ ] API key configured | |
| - [ ] System status shows "Online" | |
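The checklist above can be expressed as a pre-flight check that reports which items fail. This is a hypothetical helper, not part of ScrapeRL; the function name, its parameters, and the rule that a valid URL needs an `http` or `https` scheme are assumptions.

```python
from urllib.parse import urlparse


def preflight(url: str, agents: list[str], has_api_key: bool, system_status: str) -> list[str]:
    """Return the checklist items that are not satisfied (empty list means ready)."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("Valid URL entered")
    if not agents:
        problems.append("At least one agent selected")
    if not has_api_key:
        problems.append("API key configured")
    if system_status != "Online":
        problems.append('System status shows "Online"')
    return problems


ready = preflight("https://example.com/products", ["Scraper"], True, "Online")
```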
| #### slow-performance | |
| **Tips:** | |
| - Use Groq for faster inference | |
| - Reduce enabled plugins | |
| - Lower task complexity if possible | |
| #### memory-full | |
| **Solution:** Clear memory layers in Settings > Advanced > Clear Cache | |
| ### getting-help | |
| - Check the logs panel for error details | |
| - View episode history for past issues | |
| - Report bugs on GitHub | |
| --- | |
| ## keyboard-shortcuts | |
| | Shortcut | Action | | |
| |----------|--------| | |
| | `Ctrl + Enter` | Start/Stop episode | | |
| | `Ctrl + L` | Clear logs | | |
| | `Ctrl + ,` | Open settings | | |
| | `Escape` | Close popups | | |
| --- | |
| ## version-history | |
| ### v0-1-0-current | |
| - Initial release | |
| - Multi-agent architecture | |
| - Plugin system | |
| - Memory layers | |
| - Dashboard with real-time monitoring | |
| --- | |
| *Documentation last updated: March 2026* | |
| *Built by NeerajCodz* | |
| ## document-flow | |
| ```mermaid | |
| flowchart TD | |
| A[document] --> B[key-sections] | |
| B --> C[implementation] | |
| B --> D[operations] | |
| B --> E[validation] | |
| ``` | |
| ## related-api-reference | |
| | item | value | | |
| | --- | --- | | |
| | api-reference | `api-reference.md` | | |