# ScrapeRL Documentation
Welcome to ScrapeRL - an advanced Reinforcement Learning-powered web scraping environment. This documentation covers all aspects of using and configuring ScrapeRL.
---
## Table of Contents
1. [Getting Started](#getting-started)
2. [Dashboard Overview](#dashboard-overview)
3. [Agents](#agents)
4. [Plugins](#plugins)
5. [Memory System](#memory-system)
6. [Models & Providers](#models--providers)
7. [Settings](#settings)
8. [API Reference](#api-reference)
9. [Troubleshooting](#troubleshooting)
---
## Getting Started
### What is ScrapeRL?
ScrapeRL is an intelligent web scraping system that uses Reinforcement Learning (RL) to learn and adapt scraping strategies. Unlike traditional scrapers, ScrapeRL can:
- **Learn from experience** - improve scraping strategies over time
- **Adapt to changes** - handle website structure changes automatically
- **Coordinate multiple agents** - use specialized agents for different tasks
- **Use memory** - remember patterns and optimize future runs
### Quick Start
1. **Enter a Target URL** - Provide the webpage you want to scrape
2. **Write an Instruction** - Describe what data you want to extract
3. **Configure Options** - Select model, agents, and plugins
4. **Start Episode** - Click Start and watch the magic happen!
### Example Task
```
URL: https://example.com/products
Instruction: Extract all product names, prices, and descriptions
Task Type: Medium
```
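The same task can be submitted programmatically via the `/api/episode/reset` endpoint described in the API Reference below. A minimal sketch of building the request body — the exact field names inside `config` are assumptions, not a documented schema:

```python
import json

# Hypothetical episode-start payload for POST /api/episode/reset.
# "task_id" and "config" come from the API Reference; the keys inside
# "config" (url, instruction, task_type) are illustrative guesses.
payload = {
    "task_id": "scrape-products",
    "config": {
        "url": "https://example.com/products",
        "instruction": "Extract all product names, prices, and descriptions",
        "task_type": "medium",
    },
}

# Serialized as it would be sent with Content-Type: application/json.
body = json.dumps(payload)
print(body)
```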
---
## Dashboard Overview
The dashboard is your command center for monitoring and controlling scraping operations.
### Layout Structure
| Section | Description |
|---------|-------------|
| **Input Bar** | Enter URL, instruction, and configure task |
| **Left Sidebar** | View active agents, MCPs, skills, and tools |
| **Center Area** | Main visualization and current observation |
| **Right Sidebar** | Memory stats, extracted data, recent actions |
| **Bottom Logs** | Real-time terminal-style log output |
### Stats Header
The header shows key metrics with expandable details:
- **Episodes** - Total scraping sessions completed
- **Steps** - Actions taken in current/total sessions
- **Reward** - Performance score (higher is better)
- **Time** - Current time and session duration
Click the **⋯** icon on any stat to see detailed statistics (min, max, average).
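The expanded detail view aggregates per-episode values into min, max, and average. A tiny sketch of that computation over the Reward stat (the sample values are made up):

```python
# Per-episode rewards for a session; the numbers are illustrative only.
rewards = [0.42, 0.77, 0.58, 0.91]

# The detailed statistics shown when a stat is expanded.
stats = {
    "min": min(rewards),
    "max": max(rewards),
    "avg": sum(rewards) / len(rewards),
}
print(stats)
```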
### Task Configuration
#### Task Types
| Type | Description | Use Case |
|------|-------------|----------|
| **Low** | Simple single-page scraping | Product page, article text |
| **Medium** | Multi-page with navigation | Search results, listings |
| **High** | Complex interactive tasks | Login-required, forms |
---
## Agents
ScrapeRL uses a multi-agent architecture where specialized agents handle different aspects of scraping.
### Available Agents
| Agent | Role | Description |
|-------|------|-------------|
| **Coordinator** | Orchestrator | Manages all other agents, decides strategy |
| **Scraper** | Extractor | Extracts data from page content |
| **Navigator** | Navigation | Handles page navigation, clicking, scrolling |
| **Analyzer** | Analysis | Analyzes extracted data for patterns |
| **Validator** | Validation | Validates data quality and completeness |
### Agent Selection
1. Click the **Agents** button in the input bar
2. Select agents you want to enable
3. Active agents appear in the left sidebar accordion
4. Monitor agent activity in real-time
### Agent Status Indicators
- **Active** - Currently processing
- **Ready** - Waiting for task
- **Idle** - Not currently in use
- **Error** - Encountered an issue
---
## Plugins
Extend ScrapeRL's capabilities with plugins organized by category.
### Plugin Categories
#### MCPs (Model Context Protocols)
Tools that provide browser automation and page interaction:
| Plugin | Description |
|--------|-------------|
| Browser Use | AI-powered browser automation |
| Puppeteer MCP | Headless Chrome control |
| Playwright MCP | Cross-browser automation |
#### Skills
Specialized capabilities for specific tasks:
| Plugin | Description |
|--------|-------------|
| Web Scraping | Core extraction algorithms |
| Data Extraction | Structured data parsing |
| Form Filling | Automated form completion |
#### APIs
External service integrations:
| Plugin | Description |
|--------|-------------|
| Firecrawl | High-performance web crawler |
| Jina Reader | Content reader API |
| Serper | Search engine results API |
#### Vision
Visual understanding capabilities:
| Plugin | Description |
|--------|-------------|
| GPT-4 Vision | OpenAI visual analysis |
| Gemini Vision | Google visual AI |
| Claude Vision | Anthropic visual models |
### Managing Plugins
1. Go to **Plugins** tab
2. Browse by category
3. Click **Install** to add a plugin
4. Enable plugins in Dashboard via the Plugins popup
---
## Memory System
ScrapeRL uses a hierarchical memory system for context retention.
### Memory Layers
| Layer | Purpose | Retention |
|-------|---------|-----------|
| **Working** | Current task context | Session |
| **Episodic** | Experience records | Persistent |
| **Semantic** | Learned patterns | Persistent |
| **Procedural** | Action sequences | Persistent |
### Memory Features
- **Auto-consolidation** - Promotes important data between layers
- **Similarity search** - Find related memories quickly
- **Pattern recognition** - Learn from past experiences
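Similarity search over stored memories is typically done by comparing embedding vectors. The sketch below uses cosine similarity over toy three-dimensional vectors; the memory contents, embeddings, and dimensionality are all illustrative, not ScrapeRL's actual representation:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy "memories" with hand-made 3-dimensional embeddings.
memories = {
    "product prices on example.com": [0.9, 0.1, 0.0],
    "login form selectors": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # embedding of a price-related query

# Return the stored memory whose embedding is closest to the query.
best = max(memories, key=lambda m: cosine_similarity(query, memories[m]))
print(best)
```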
---
## Models & Providers
### Supported Providers
| Provider | Models | Best For |
|----------|--------|----------|
| **Groq** | GPT-OSS 120B | Fast inference, default |
| **Google** | Gemini 2.5 Flash | Balanced performance |
| **OpenAI** | GPT-4 Turbo | High accuracy |
| **Anthropic** | Claude 3 Opus | Complex reasoning |
### Model Selection
1. Click **Model** button in input bar
2. Select from available models
3. Models require appropriate API keys
### API Keys
Configure API keys in **Settings > API Keys**:
1. Select provider
2. Enter your API key
3. Click Save
4. Key status shows as "Active" when configured
---
## Settings
### General Settings
| Setting | Description |
|---------|-------------|
| WebSocket Updates | Enable real-time updates |
| Memory Persistence | Save memory across sessions |
| Auto-save Episodes | Automatically save completed episodes |
| Debug Mode | Enable verbose logging |
### Budget & Limits
Control API usage costs:
- **Daily Limit** - Maximum spend per day
- **Monthly Limit** - Maximum spend per month
- **Max Tokens** - Token limit per request
- **Alert Threshold** - Warning at 80% usage
> Budget limits are disabled by default. Enable in Settings to control spending.
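The alert-threshold check reduces to a simple comparison against 80% of the configured limit. A sketch, with an illustrative limit value (the real limits are whatever you set in Settings):

```python
# Illustrative budget settings; configure your own in Settings > Budget & Limits.
DAILY_LIMIT = 10.00      # dollars per day
ALERT_THRESHOLD = 0.80   # warn at 80% usage, per the docs above

def should_alert(spent_today: float) -> bool:
    """Return True once spending reaches 80% of the daily limit."""
    return spent_today >= DAILY_LIMIT * ALERT_THRESHOLD

print(should_alert(7.50))  # False: below 80% of $10
print(should_alert(8.00))  # True: exactly at the threshold
```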
### Appearance
- **Theme** - Dark (default), Light, Auto
- **Compact Mode** - Reduce UI spacing
- **Animations** - Enable/disable transitions
---
## API Reference
### Health Check
```bash
GET /api/health
```
Response:
```json
{
  "status": "healthy",
  "version": "0.1.0",
  "timestamp": "2026-03-28T00:00:00Z"
}
```
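A client can treat any `"status"` other than `"healthy"` as a failed check. Parsing the response shown above (the JSON is taken verbatim from this page):

```python
import json

# The health-check response body from the example above.
response_body = """{
  "status": "healthy",
  "version": "0.1.0",
  "timestamp": "2026-03-28T00:00:00Z"
}"""

health = json.loads(response_body)
assert health["status"] == "healthy"  # treat anything else as down
print(health["version"])
```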
### Episode Management
```bash
# Start new episode
POST /api/episode/reset
{
  "task_id": "scrape-products",
  "config": { ... }
}

# Take action
POST /api/episode/step
{
  "action": "navigate",
  "params": { "url": "..." }
}

# Get current state
GET /api/episode/state
```
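These three endpoints map naturally onto a small client wrapper. The sketch below only *constructs* the requests (method, URL, JSON body) rather than sending them; the base URL and class name are assumptions, while the paths and body shapes come from the examples above:

```python
import json

class EpisodeClient:
    """Minimal sketch of a client for the episode endpoints (not sent anywhere)."""

    def __init__(self, base_url="http://localhost:8000"):  # base URL is an assumption
        self.base_url = base_url

    def _request(self, method, path, body=None):
        # A real client would pass these to urllib.request or requests.
        data = json.dumps(body) if body is not None else None
        return method, self.base_url + path, data

    def reset(self, task_id, config):
        return self._request("POST", "/api/episode/reset",
                             {"task_id": task_id, "config": config})

    def step(self, action, params):
        return self._request("POST", "/api/episode/step",
                             {"action": action, "params": params})

    def state(self):
        return self._request("GET", "/api/episode/state")

method, url, data = EpisodeClient().step("navigate", {"url": "https://example.com"})
print(method, url)
```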
### Memory API
```bash
# Store entry
POST /api/memory/store
{
  "content": "...",
  "memory_type": "working",
  "metadata": { ... }
}

# Query memories
POST /api/memory/query
{
  "query": "product prices",
  "memory_type": "semantic",
  "limit": 10
}
```
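The `memory_type` and `limit` parameters act as a filter and a cap on the result set. A toy sketch of that behavior over an in-memory list — the entries and filtering logic are illustrative, not the server implementation:

```python
# Query payload matching the /api/memory/query example above.
query = {"query": "product prices", "memory_type": "semantic", "limit": 10}

# Toy stored entries; a real query would also rank by similarity to "query".
entries = [
    {"content": "prices seen on example.com", "memory_type": "semantic"},
    {"content": "current page HTML", "memory_type": "working"},
]

# Filter by memory_type, then cap the result count at "limit".
matches = [e for e in entries
           if e["memory_type"] == query["memory_type"]][: query["limit"]]
print(len(matches))
```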
### Plugins API
```bash
# List plugins
GET /api/plugins/

# Install plugin
POST /api/plugins/install
{ "plugin_id": "firecrawl" }

# Uninstall plugin
POST /api/plugins/uninstall
{ "plugin_id": "firecrawl" }
```
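Install and uninstall are effectively set operations on the plugin registry. A toy model of that effect (the `plugin_id` comes from the examples above; the in-memory registry is illustrative):

```python
# Toy stand-in for the server-side plugin registry.
installed = set()

def install(plugin_id):
    """Model of POST /api/plugins/install: adds the plugin id."""
    installed.add(plugin_id)

def uninstall(plugin_id):
    """Model of POST /api/plugins/uninstall: removes it if present."""
    installed.discard(plugin_id)

install("firecrawl")
print("firecrawl" in installed)   # True after install
uninstall("firecrawl")
print("firecrawl" in installed)   # False after uninstall
```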
---
## Troubleshooting
### Common Issues
#### "API Key Required" Error
**Solution:** Configure at least one API key in Settings > API Keys
#### Episode Not Starting
**Checklist:**
- [ ] Valid URL entered
- [ ] At least one agent selected
- [ ] API key configured
- [ ] System status shows "Online"
#### Slow Performance
**Tips:**
- Use Groq for faster inference
- Reduce enabled plugins
- Lower task complexity if possible
#### Memory Full
**Solution:** Clear memory layers in Settings > Advanced > Clear Cache
### Getting Help
- Check the logs panel for error details
- View episode history for past issues
- Report bugs on GitHub
---
## Keyboard Shortcuts
| Shortcut | Action |
|----------|--------|
| `Ctrl + Enter` | Start/Stop episode |
| `Ctrl + L` | Clear logs |
| `Ctrl + ,` | Open settings |
| `Escape` | Close popups |
---
## Version History
### v0.1.0 (Current)
- Initial release
- Multi-agent architecture
- Plugin system
- Memory layers
- Dashboard with real-time monitoring
---
*Documentation last updated: March 2026*
*Built by NeerajCodz*
## Related Documentation
| Document | File |
| --- | --- |
| API Reference | `api-reference.md` |