# scraperl-documentation

Welcome to ScrapeRL, a web scraping environment driven by Reinforcement Learning (RL). This documentation covers the dashboard, agents, plugins, memory system, model providers, settings, and HTTP API.

---

## table-of-contents

1. [Getting Started](#getting-started)
2. [Dashboard Overview](#dashboard-overview)
3. [Agents](#agents)
4. [Plugins](#plugins)
5. [Memory System](#memory-system)
6. [Models & Providers](#models-and-providers)
7. [Settings](#settings)
8. [API Reference](#api-reference)
9. [Troubleshooting](#troubleshooting)

---

## getting-started

### what-is-scraperl

ScrapeRL is an intelligent web scraping system that uses Reinforcement Learning (RL) to learn and adapt scraping strategies. Unlike traditional scrapers, ScrapeRL can:

- **Learn from experience** - Improve scraping strategies over time
- **Adapt to changes** - Handle website structure changes automatically
- **Coordinate multiple agents** - Use specialized agents for different tasks
- **Use memory** - Remember patterns and optimize future runs

### quick-start

1. **Enter a Target URL** - Provide the webpage you want to scrape
2. **Write an Instruction** - Describe what data you want to extract
3. **Configure Options** - Select model, agents, and plugins
4. **Start Episode** - Click **Start** and follow the run in the dashboard

### example-task

```
URL: https://example.com/products
Instruction: Extract all product names, prices, and descriptions
Task Type: Medium
```
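
When scripting runs, the same three fields can be written down as a small validated record. This is only a sketch: `ScrapeTask`, its field names, and the lowercase task type values are hypothetical and not part of ScrapeRL.

```python
from dataclasses import dataclass, asdict
from urllib.parse import urlparse

# Hypothetical container for the fields shown above; not a ScrapeRL class.
TASK_TYPES = ("low", "medium", "high")


@dataclass(frozen=True)
class ScrapeTask:
    url: str
    instruction: str
    task_type: str = "medium"

    def __post_init__(self):
        # Reject anything that is not an absolute http(s) URL.
        parsed = urlparse(self.url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {self.url!r}")
        if not self.instruction.strip():
            raise ValueError("instruction must not be empty")
        if self.task_type not in TASK_TYPES:
            raise ValueError(f"task_type must be one of {TASK_TYPES}")


task = ScrapeTask(
    url="https://example.com/products",
    instruction="Extract all product names, prices, and descriptions",
    task_type="medium",
)
print(asdict(task))
```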

---

## dashboard-overview

The dashboard is your command center for monitoring and controlling scraping operations.

### layout-structure

| Section | Description |
|---------|-------------|
| **Input Bar** | Enter URL, instruction, and configure task |
| **Left Sidebar** | View active agents, MCPs, skills, and tools |
| **Center Area** | Main visualization and current observation |
| **Right Sidebar** | Memory stats, extracted data, recent actions |
| **Bottom Logs** | Real-time terminal-style log output |

### stats-header

The header shows key metrics with expandable details:

- **Episodes** - Total scraping sessions completed
- **Steps** - Actions taken in the current session and across all sessions
- **Reward** - Performance score (higher is better)
- **Time** - Current time and session duration

Click the **⋯** icon on any stat to see detailed statistics (min, max, average).
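
The detailed view boils down to three summary numbers per stat. As an illustration only, with made-up reward values and a hypothetical `summarize` helper, they could be computed like this:

```python
from statistics import mean


def summarize(values):
    """Return min, max, and average for a non-empty list of numbers."""
    if not values:
        raise ValueError("need at least one value")
    return {"min": min(values), "max": max(values), "average": mean(values)}


episode_rewards = [0.5, 1.5, 1.0]  # made-up per-episode rewards
summary = summarize(episode_rewards)
print(summary)
```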

### task-configuration

#### task-types

| Type | Description | Use Case |
|------|-------------|----------|
| **Low** | Simple single-page scraping | Product page, article text |
| **Medium** | Multi-page with navigation | Search results, listings |
| **High** | Complex interactive tasks | Login-required, forms |

---

## agents

ScrapeRL uses a multi-agent architecture where specialized agents handle different aspects of scraping.

### available-agents

| Agent | Role | Description |
|-------|------|-------------|
| **Coordinator** | Orchestrator | Manages all other agents, decides strategy |
| **Scraper** | Extractor | Extracts data from page content |
| **Navigator** | Navigation | Handles page navigation, clicking, scrolling |
| **Analyzer** | Analysis | Analyzes extracted data for patterns |
| **Validator** | Validation | Validates data quality and completeness |
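
To picture the division of labour, the toy sketch below chains three stand-in functions under a coordinator. It is not ScrapeRL's implementation: the real agents are model-driven, and every function name, the shared `state` dictionary, and the hard-coded records are assumptions made for illustration.

```python
# Toy stand-ins for the agents in the table; all names and data are hypothetical.
def navigator(state):
    # Pretend to load the page for the target URL.
    state["page"] = f"<html for {state['url']}>"
    return state


def scraper(state):
    # Pretend to extract two records, one of them incomplete.
    state["records"] = [
        {"name": "Widget", "price": "9.99"},
        {"name": "", "price": "1.00"},
    ]
    return state


def validator(state):
    # Keep only records where every field is filled in.
    state["records"] = [r for r in state["records"] if r["name"] and r["price"]]
    return state


def coordinator(url, agents):
    """Run the enabled agents in order, passing shared state along."""
    state = {"url": url}
    for agent in agents:
        state = agent(state)
    return state


result = coordinator("https://example.com/products", [navigator, scraper, validator])
print(result["records"])
```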

### agent-selection

1. Click the **Agents** button in the input bar
2. Select the agents you want to enable
3. Check that the active agents appear in the left sidebar accordion
4. Monitor agent activity in real time

### agent-status-indicators

- **Active** - Currently processing
- **Ready** - Waiting for task
- **Idle** - Not currently in use
- **Error** - Encountered an issue

---

## plugins

Extend ScrapeRL's capabilities with plugins organized by category.

### plugin-categories

#### mcps-model-context-protocols

Model Context Protocol (MCP) tools that provide browser automation and page interaction:

| Plugin | Description |
|--------|-------------|
| Browser Use | AI-powered browser automation |
| Puppeteer MCP | Headless Chrome control |
| Playwright MCP | Cross-browser automation |

#### skills

Specialized capabilities for specific tasks:

| Plugin | Description |
|--------|-------------|
| Web Scraping | Core extraction algorithms |
| Data Extraction | Structured data parsing |
| Form Filling | Automated form completion |

#### apis

External service integrations:

| Plugin | Description |
|--------|-------------|
| Firecrawl | High-performance web crawler |
| Jina Reader | Content reader API |
| Serper | Search engine results API |

#### vision

Visual understanding capabilities:

| Plugin | Description |
|--------|-------------|
| GPT-4 Vision | OpenAI visual analysis |
| Gemini Vision | Google visual AI |
| Claude Vision | Anthropic visual models |

### managing-plugins

1. Go to the **Plugins** tab
2. Browse by category
3. Click **Install** to add a plugin
4. Enable installed plugins in the Dashboard via the Plugins popup

---

## memory-system

ScrapeRL uses a hierarchical memory system for context retention.

### memory-layers

| Layer | Purpose | Retention |
|-------|---------|-----------|
| **Working** | Current task context | Session |
| **Episodic** | Experience records | Persistent |
| **Semantic** | Learned patterns | Persistent |
| **Procedural** | Action sequences | Persistent |

### memory-features

- **Auto-consolidation** - Promotes important data between layers
- **Similarity search** - Find related memories quickly
- **Pattern recognition** - Learn from past experiences
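
The sketch below illustrates the layer and feature ideas with a toy in-memory store. It makes several assumptions that the documentation does not confirm: token-overlap (Jaccard) similarity stands in for whatever similarity measure ScrapeRL uses, consolidation promotes from the working layer to the episodic layer, and `ToyMemory` with all its method names is hypothetical.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a simple stand-in metric."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class ToyMemory:
    def __init__(self):
        self.layers = {"working": [], "episodic": [], "semantic": [], "procedural": []}

    def store(self, content, layer="working"):
        self.layers[layer].append(content)

    def search(self, query, layer, limit=10):
        """Return up to `limit` entries of a layer, most similar first."""
        ranked = sorted(self.layers[layer], key=lambda c: jaccard(query, c), reverse=True)
        return ranked[:limit]

    def consolidate(self, keep):
        """Move working entries accepted by `keep` into the episodic layer."""
        kept = [c for c in self.layers["working"] if keep(c)]
        self.layers["episodic"].extend(kept)
        self.layers["working"] = [c for c in self.layers["working"] if not keep(c)]

    def end_session(self):
        # Working memory only lasts for the session; other layers persist.
        self.layers["working"].clear()


memory = ToyMemory()
memory.store("product prices are in span.price")
memory.store("cookie banner dismissed")
memory.consolidate(keep=lambda c: "price" in c)
memory.end_session()
top = memory.search("product prices", "episodic", limit=1)
print(top)
```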

---

## models-and-providers

### supported-providers

| Provider | Models | Best For |
|----------|--------|----------|
| **Groq** | GPT-OSS 120B | Fast inference, default |
| **Google** | Gemini 2.5 Flash | Balanced performance |
| **OpenAI** | GPT-4 Turbo | High accuracy |
| **Anthropic** | Claude 3 Opus | Complex reasoning |

### model-selection

1. Click the **Model** button in the input bar
2. Select from the available models
3. Make sure an API key is configured for the model's provider; a model cannot run without one

### api-keys

Configure API keys in **Settings > API Keys**:

1. Select provider
2. Enter your API key
3. Click Save
4. Key status shows as "Active" when configured
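
Because a model is only usable once its provider has a key, choosing a provider amounts to filtering by configured keys. The following sketch shows that logic under stated assumptions: the lowercase provider identifiers, the preference order (Groq first, since it is listed as the default), and both function names are hypothetical.

```python
# Order follows the provider table, with Groq first because it is the default.
PROVIDER_ORDER = ["groq", "google", "openai", "anthropic"]


def usable_providers(api_keys):
    """Return providers that have a non-empty key, in preference order."""
    return [p for p in PROVIDER_ORDER if api_keys.get(p, "").strip()]


def pick_provider(api_keys, preferred=None):
    """Use the preferred provider if it has a key, else the first usable one."""
    available = usable_providers(api_keys)
    if not available:
        raise RuntimeError("no API key configured")
    if preferred in available:
        return preferred
    return available[0]


print(pick_provider({"groq": "", "openai": "sk-test"}))
```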

---

## settings

### general-settings

| Setting | Description |
|---------|-------------|
| WebSocket Updates | Enable real-time updates |
| Memory Persistence | Save memory across sessions |
| Auto-save Episodes | Automatically save completed episodes |
| Debug Mode | Enable verbose logging |

### budget-and-limits

Control API usage costs:

- **Daily Limit** - Maximum spend per day
- **Monthly Limit** - Maximum spend per month
- **Max Tokens** - Token limit per request
- **Alert Threshold** - Warning at 80% usage

> Budget limits are disabled by default. Enable them in Settings to control spending.
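
The relationship between a limit and the alert threshold can be sketched as a simple classification. This is an illustration, not ScrapeRL's code: `budget_status` is a hypothetical helper, and treating exactly 80% as already triggering the alert is an assumption.

```python
def budget_status(spent, limit, alert_threshold=0.8):
    """Classify spend against a limit: 'ok', 'alert', or 'exceeded'."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    used = spent / limit
    if used >= 1.0:
        return "exceeded"
    if used >= alert_threshold:
        return "alert"
    return "ok"


print(budget_status(8, 10))
```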

### appearance

- **Theme** - Dark (default), Light, Auto
- **Compact Mode** - Reduce UI spacing
- **Animations** - Enable/disable transitions

---

## api-reference

### health-check

```bash
GET /api/health
```

Response:
```json
{
  "status": "healthy",
  "version": "0.1.0",
  "timestamp": "2026-03-28T00:00:00Z"
}
```
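
A script can poll this endpoint before starting work. The sketch below uses only the Python standard library; `check_health` and `is_healthy` are hypothetical helpers, and `base_url` stands for wherever your ScrapeRL instance is served, which this documentation does not specify.

```python
import json
from urllib.request import urlopen


def is_healthy(payload):
    """True when a decoded /api/health response reports status 'healthy'."""
    return payload.get("status") == "healthy"


def check_health(base_url, timeout=5):
    # base_url is an assumption: the address of your own ScrapeRL instance.
    with urlopen(f"{base_url.rstrip('/')}/api/health", timeout=timeout) as resp:
        return is_healthy(json.load(resp))


# Offline check against the sample response shown above.
sample = json.loads(
    '{"status": "healthy", "version": "0.1.0", "timestamp": "2026-03-28T00:00:00Z"}'
)
print(is_healthy(sample))
```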

### episode-management

```bash
# Start new episode
POST /api/episode/reset
{
  "task_id": "scrape-products",
  "config": { ... }
}

# Take action
POST /api/episode/step
{
  "action": "navigate",
  "params": { "url": "..." }
}

# Get current state
GET /api/episode/state
```
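
From Python, the reset and step calls might be wrapped as follows. Treat this as a sketch under assumptions: `EpisodeClient`, `http_post`, and `fake_post` are hypothetical names, `http://localhost:8000` is a placeholder address, the `config` argument is left empty because its contents are not documented here, and the response format is not assumed (the fake transport returns an empty dictionary).

```python
import json
from urllib.request import Request, urlopen


def http_post(url, body):
    """POST a JSON body and decode the JSON reply (standard library only)."""
    req = Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return json.load(resp)


class EpisodeClient:
    def __init__(self, base_url, post=http_post):
        self.base_url = base_url.rstrip("/")
        self.post = post  # injectable so the client can be exercised offline

    def reset(self, task_id, config):
        return self.post(
            f"{self.base_url}/api/episode/reset",
            {"task_id": task_id, "config": config},
        )

    def step(self, action, params):
        return self.post(
            f"{self.base_url}/api/episode/step",
            {"action": action, "params": params},
        )


# Dry run with a fake transport that records requests instead of sending them.
sent = []


def fake_post(url, body):
    sent.append((url, body))
    return {}


client = EpisodeClient("http://localhost:8000/", post=fake_post)
client.reset("scrape-products", {})
client.step("navigate", {"url": "https://example.com/products"})
print(sent)
```

Passing the transport in as a parameter keeps the request-building logic testable without a running server.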

### memory-api

```bash
# Store entry
POST /api/memory/store
{
  "content": "...",
  "memory_type": "working",
  "metadata": { ... }
}

# Query memories
POST /api/memory/query
{
  "query": "product prices",
  "memory_type": "semantic",
  "limit": 10
}
```
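
Request bodies for these two endpoints can be built and checked before sending. In this sketch the helper names are hypothetical, and the set of accepted `memory_type` values is an assumption derived from the four memory layers; only `working` and `semantic` appear in the examples above.

```python
# Assumed to mirror the four memory layers, in lowercase.
MEMORY_TYPES = {"working", "episodic", "semantic", "procedural"}


def store_payload(content, memory_type="working", metadata=None):
    """Build the JSON body for POST /api/memory/store."""
    if memory_type not in MEMORY_TYPES:
        raise ValueError(f"unknown memory_type: {memory_type!r}")
    return {"content": content, "memory_type": memory_type, "metadata": metadata or {}}


def query_payload(query, memory_type, limit=10):
    """Build the JSON body for POST /api/memory/query."""
    if memory_type not in MEMORY_TYPES:
        raise ValueError(f"unknown memory_type: {memory_type!r}")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return {"query": query, "memory_type": memory_type, "limit": limit}


print(query_payload("product prices", "semantic"))
```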

### plugins-api

```bash
# List plugins
GET /api/plugins/

# Install plugin
POST /api/plugins/install
{ "plugin_id": "firecrawl" }

# Uninstall plugin
POST /api/plugins/uninstall
{ "plugin_id": "firecrawl" }
```

---

## troubleshooting

### common-issues

#### api-key-required-error

**Solution:** Configure at least one API key in Settings > API Keys

#### episode-not-starting

**Checklist:**
- [ ] Valid URL entered
- [ ] At least one agent selected
- [ ] API key configured
- [ ] System status shows "Online"
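
The checklist above can also be run as a scripted pre-flight check. The function below is a hypothetical helper, not part of ScrapeRL; its argument names and problem messages are assumptions.

```python
from urllib.parse import urlparse


def preflight(url, agents, api_keys, system_status):
    """Return a list of problems; an empty list means the checklist passes."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("invalid URL")
    if not agents:
        problems.append("no agent selected")
    if not any(api_keys.values()):
        problems.append("no API key configured")
    if system_status != "Online":
        problems.append("system not online")
    return problems


print(preflight("https://example.com/products", ["Scraper"], {"groq": "key"}, "Online"))
```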

#### slow-performance

**Tips:**
- Use Groq for faster inference
- Reduce enabled plugins
- Lower task complexity if possible

#### memory-full

**Solution:** Clear memory layers in Settings > Advanced > Clear Cache

### getting-help

- Check the logs panel for error details
- View episode history for past issues
- Report bugs on GitHub

---

## keyboard-shortcuts

| Shortcut | Action |
|----------|--------|
| `Ctrl + Enter` | Start/Stop episode |
| `Ctrl + L` | Clear logs |
| `Ctrl + ,` | Open settings |
| `Escape` | Close popups |

---

## version-history

### v0-1-0-current

- Initial release
- Multi-agent architecture
- Plugin system
- Memory layers
- Dashboard with real-time monitoring

---

*Documentation last updated: March 2026*

*Built by NeerajCodz*

## document-flow

```mermaid
flowchart TD
    A[document] --> B[key-sections]
    B --> C[implementation]
    B --> D[operations]
    B --> E[validation]
```

## related-api-reference

| item | value |
| --- | --- |
| api-reference | `api-reference.md` |