# Search Engine Layer

## Table of Contents

1. [Overview](#overview)
2. [Supported Search Engines](#supported-search-engines)
3. [Query Optimization](#query-optimization)
4. [Multi-Hop Search](#multi-hop-search)
5. [Source Credibility Scoring](#source-credibility-scoring)
6. [Result Ranking](#result-ranking)
7. [Caching & Deduplication](#caching--deduplication)
8. [Configuration](#configuration)

---

## Overview

The **Search Engine Layer** enables agents to search the web intelligently, optimize queries, perform multi-hop searches, and evaluate source credibility.
### Capabilities

- Multiple search engine APIs (Google, Bing, Brave, DuckDuckGo, Perplexity)
- Query optimization and rewriting
- Multi-hop search (search → refine → search again)
- Source credibility scoring
- Result ranking and filtering
- Caching and deduplication
- Cost tracking

---
## Supported Search Engines

### 1. Google Search API

**Pros:**
- Most comprehensive results
- High quality
- Advanced operator support

**Cons:**
- Requires an API key plus a Custom Search Engine ID
- Costs $5 per 1,000 queries after the free tier

**Configuration:**

```python
{
    "google": {
        "api_key": "YOUR_GOOGLE_API_KEY",
        "search_engine_id": "YOUR_CSE_ID",
        "region": "us",
        "safe_search": True,
        "num_results": 10
    }
}
```

**Usage:**

```python
results = search_engine.search(
    query="product reviews for Widget Pro",
    engine="google",
    num_results=10
)
```
### 2. Bing Search API

**Pros:**
- Good-quality results
- Competitive pricing ($7 per 1,000 queries)
- News search included

**Cons:**
- Smaller index than Google
- Fewer advanced operators

**Configuration:**

```python
{
    "bing": {
        "api_key": "YOUR_BING_API_KEY",
        "market": "en-US",
        "safe_search": "Moderate",
        "freshness": None  # "Day", "Week", "Month"
    }
}
```
### 3. Brave Search API

**Pros:**
- Privacy-focused
- Independent index
- Good pricing ($5 per 1,000 queries)
- No tracking

**Cons:**
- Smaller index
- Newer service

**Configuration:**

```python
{
    "brave": {
        "api_key": "YOUR_BRAVE_API_KEY",
        "country": "US",
        "safe_search": "moderate",
        "freshness": None
    }
}
```
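For reference, a minimal direct call against Brave's REST endpoint (endpoint and header names per Brave's public API docs; verify the response shape against the current reference):

```python
# A sketch of a raw Brave Search API call; response parsing assumes the
# documented {"web": {"results": [...]}} shape.
import requests

resp = requests.get(
    "https://api.search.brave.com/res/v1/web/search",
    headers={"X-Subscription-Token": "YOUR_BRAVE_API_KEY"},
    params={"q": "web scraping tools", "country": "US"},
    timeout=10,
)
for item in resp.json().get("web", {}).get("results", []):
    print(item["title"], item["url"])
```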
### 4. DuckDuckGo (Free, No API Key)

**Pros:**
- Completely free
- No API key required
- Privacy-focused
- Good for testing

**Cons:**
- Rate limited
- Less control over results
- Smaller result set

**Usage:**

```python
from duckduckgo_search import DDGS

results = DDGS().text(
    keywords="web scraping tools",
    max_results=10
)
```
### 5. Perplexity AI (AI-Powered Search)

**Pros:**
- Returns AI-summarized answers with citations
- Real-time web access
- Conversational queries

**Cons:**
- More expensive
- Designed for Q&A, not traditional search

**Configuration:**

```python
{
    "perplexity": {
        "api_key": "YOUR_PERPLEXITY_API_KEY",
        "model": "pplx-70b-online",
        "include_citations": True
    }
}
```
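Perplexity's API is OpenAI-compatible; a minimal sketch against the chat-completions endpoint (the model name is taken from the configuration above and may be outdated; check current Perplexity docs):

```python
# A sketch; assumes the model name from the config above is still valid.
import requests

resp = requests.post(
    "https://api.perplexity.ai/chat/completions",
    headers={"Authorization": "Bearer YOUR_PERPLEXITY_API_KEY"},
    json={
        "model": "pplx-70b-online",
        "messages": [{"role": "user", "content": "Best Python web scraping libraries?"}],
    },
    timeout=30,
)
print(resp.json()["choices"][0]["message"]["content"])
```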
---

## Query Optimization

### Query Rewriter

```python
import re
from typing import Dict


class QueryOptimizer:
    """Optimize search queries for better results."""

    def optimize(self, query: str, context: Dict = None) -> str:
        """Optimize a search query."""
        optimized = query

        # 1. Expand abbreviations
        optimized = self.expand_abbreviations(optimized)

        # 2. Add context keywords
        if context:
            optimized = self.add_context(optimized, context)

        # 3. Remove stop words (optional)
        # optimized = self.remove_stop_words(optimized)

        # 4. Add search operators
        optimized = self.add_operators(optimized)

        return optimized

    def expand_abbreviations(self, query: str) -> str:
        """Expand common abbreviations."""
        expansions = {
            "AI": "artificial intelligence",
            "ML": "machine learning",
            "API": "application programming interface",
            "UI": "user interface",
            "UX": "user experience",
        }
        for abbr, full in expansions.items():
            # Only expand if the abbreviation stands alone
            query = re.sub(rf'\b{abbr}\b', full, query)
        return query

    def add_context(self, query: str, context: Dict) -> str:
        """Add contextual keywords."""
        if context.get('domain'):
            query = f"{query} site:{context['domain']}"
        if context.get('year'):
            query = f"{query} {context['year']}"
        if context.get('location'):
            query = f"{query} {context['location']}"
        return query

    def add_operators(self, query: str) -> str:
        """Add search operators for precision."""
        # `extract_important_terms` is a helper assumed to exist elsewhere.
        important_terms = self.extract_important_terms(query)
        if len(important_terms) > 1:
            # Exact-phrase search for multi-word key terms
            for term in important_terms:
                if len(term.split()) > 1:
                    query = query.replace(term, f'"{term}"')
        return query
```
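For illustration, a call might look like this, assuming the undefined helpers (`extract_important_terms`, `remove_stop_words`) are implemented:

```python
optimizer = QueryOptimizer()
print(optimizer.optimize("ML deployment tools", context={"year": 2026}))
# -> "machine learning deployment tools 2026" (plus any added operators)
```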
### Query Expansion

```python
from typing import List


class QueryExpander:
    """Expand queries with synonyms and related terms."""

    def expand(self, query: str) -> List[str]:
        """Generate query variations."""
        variations = [query]

        # 1. Synonym replacement (`get_synonyms` is a helper assumed to
        #    return sets of (term, synonym) pairs)
        synonyms = self.get_synonyms(query)
        for synonym_set in synonyms:
            for term, synonym in synonym_set:
                varied = query.replace(term, synonym)
                variations.append(varied)

        # 2. Add modifiers
        modifiers = ["best", "top", "review", "comparison", "guide"]
        for modifier in modifiers:
            variations.append(f"{modifier} {query}")

        # 3. Question forms
        variations.extend([
            f"what is {query}",
            f"how to {query}",
            f"why {query}"
        ])

        return variations[:5]  # Limit to the top 5
```
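Usage, for illustration; the exact variations depend on what the assumed `get_synonyms` helper returns:

```python
expander = QueryExpander()
for variation in expander.expand("python web scraping"):
    print(variation)
# e.g. "python web scraping", "best python web scraping", ...
```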
### Bad Query Detection

```python
import re


def is_bad_query(query: str) -> bool:
    """Detect poorly formed queries."""
    # Too short
    if len(query.split()) < 2:
        return True

    # All stop words
    stop_words = {'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be'}
    words = set(query.lower().split())
    if words.issubset(stop_words):
        return True

    # No meaningful content
    if not re.search(r'[a-zA-Z]{3,}', query):
        return True

    return False
```
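A few illustrative checks:

```python
assert is_bad_query("the")                        # too short
assert is_bad_query("is the a")                   # only stop words
assert not is_bad_query("python scraping libraries")
```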
---

## Multi-Hop Search

### Multi-Hop Strategy

(`ResultAnalysis` and `MultiHopResult` are sketched after the code.)

```python
from typing import List


class MultiHopSearch:
    """Perform multi-hop search with refinement."""

    async def search_multi_hop(
        self,
        initial_query: str,
        max_hops: int = 3
    ) -> MultiHopResult:
        """Perform a multi-hop search."""
        results_by_hop = []
        current_query = initial_query

        for hop in range(max_hops):
            # Execute search
            results = await self.search(current_query)
            results_by_hop.append(results)

            # Analyze results
            analysis = self.analyze_results(results)

            # Check if we found what we need
            if analysis.is_satisfactory:
                break

            # Refine the query for the next hop
            current_query = self.refine_query(
                current_query,
                results,
                analysis
            )

        return MultiHopResult(
            hops=results_by_hop,
            final_query=current_query,
            best_results=self.rank_all_results(results_by_hop)
        )

    def refine_query(
        self,
        original_query: str,
        results: List[SearchResult],
        analysis: ResultAnalysis
    ) -> str:
        """Refine the query based on previous results."""
        # Extract new keywords from the top results
        new_keywords = self.extract_keywords_from_results(results[:3])

        # If results were too broad, add specificity
        if analysis.too_broad:
            specific_terms = [kw for kw in new_keywords if len(kw.split()) > 1]
            if specific_terms:
                return f"{original_query} {specific_terms[0]}"

        # If results were off-topic, add negative keywords
        if analysis.off_topic_terms:
            negative = ' '.join(f"-{term}" for term in analysis.off_topic_terms)
            return f"{original_query} {negative}"

        # If no results, try synonyms (index 0 of expand() is the
        # original query, so take the first actual variation)
        if analysis.no_results:
            variations = self.query_expander.expand(original_query)
            if len(variations) > 1:
                return variations[1]

        return original_query
```
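`ResultAnalysis` and `MultiHopResult` are referenced above but not defined in this document; a minimal sketch of the shapes the code relies on:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResultAnalysis:
    # Flags the refinement logic above inspects.
    is_satisfactory: bool = False
    too_broad: bool = False
    no_results: bool = False
    off_topic_terms: List[str] = field(default_factory=list)


@dataclass
class MultiHopResult:
    hops: list = field(default_factory=list)        # results per hop
    final_query: str = ""
    best_results: list = field(default_factory=list)
```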
### Example Multi-Hop Flow

```python
# Hop 1: Initial broad search
query_1 = "best web scraping tools"
results_1 = search(query_1)
# Results: general articles about scraping tools

# Hop 2: Refine to a specific use case
query_2 = "best web scraping tools for e-commerce Python"
results_2 = search(query_2)
# Results: more specific, Python-focused

# Hop 3: Add a recency constraint
query_3 = "best web scraping tools for e-commerce Python 2026"
results_3 = search(query_3)
# Results: latest tools with recent reviews
```

---
## Source Credibility Scoring

### Credibility Scorer

```python
import re
from datetime import datetime
from typing import Optional


class SourceCredibilityScorer:
    """Score the credibility of search result sources."""

    def score(self, url: str, domain: str, result: SearchResult) -> float:
        """Calculate a credibility score (0.0 to 1.0)."""
        # (`domain_age_score` and `backlink_score` are helpers not shown here.)
        score = 0.5  # Base score

        # 1. Domain reputation
        score += self.domain_reputation_score(domain) * 0.3
        # 2. Domain age
        score += self.domain_age_score(domain) * 0.1
        # 3. HTTPS
        if url.startswith('https://'):
            score += 0.05
        # 4. TLD credibility
        score += self.tld_score(domain) * 0.1
        # 5. Snippet quality
        score += self.snippet_quality_score(result.snippet) * 0.15
        # 6. Backlinks (if available)
        score += self.backlink_score(domain) * 0.2
        # 7. Freshness
        score += self.freshness_score(result.date_published) * 0.1

        return min(max(score, 0.0), 1.0)

    def domain_reputation_score(self, domain: str) -> float:
        """Score based on known domain reputation."""
        # Trusted domains
        trusted = {
            'wikipedia.org': 1.0,
            'github.com': 0.95,
            'stackoverflow.com': 0.95,
            'nytimes.com': 0.9,
            'bbc.com': 0.9,
            'reuters.com': 0.9,
            'arxiv.org': 0.95,
            'nature.com': 0.95,
            'sciencedirect.com': 0.9,
        }
        # Known spammy/low-quality domains
        untrusted = {
            'contentvilla.com': 0.1,
            'ehow.com': 0.3,
        }
        if domain in trusted:
            return trusted[domain]
        if domain in untrusted:
            return untrusted[domain]
        # Medium trust for unknown domains
        return 0.5

    def tld_score(self, domain: str) -> float:
        """Score based on top-level domain."""
        tld = domain.split('.')[-1]
        tld_scores = {
            'edu': 0.9,    # Educational institutions
            'gov': 0.95,   # Government
            'org': 0.8,    # Organizations
            'com': 0.6,    # Commercial (neutral)
            'net': 0.6,
            'io': 0.6,
            'info': 0.4,   # Often spammy
            'xyz': 0.3,    # Cheap, often spam
        }
        return tld_scores.get(tld, 0.5)

    def snippet_quality_score(self, snippet: str) -> float:
        """Score snippet quality."""
        score = 0.5

        # Penalize clickbait patterns
        clickbait_patterns = [
            r'you won\'t believe',
            r'shocking',
            r'one weird trick',
            r'\d+ reasons why',
        ]
        for pattern in clickbait_patterns:
            if re.search(pattern, snippet, re.I):
                score -= 0.2

        # Reward factual language
        if re.search(r'according to|research|study|data|analysis', snippet, re.I):
            score += 0.2

        return max(0.0, score)

    def freshness_score(self, date_published: Optional[datetime]) -> float:
        """Score based on content freshness."""
        if not date_published:
            return 0.3  # Unknown date

        age_days = (datetime.now() - date_published).days

        # Step-decay function: fresh content scores higher
        if age_days < 30:
            return 1.0
        elif age_days < 90:
            return 0.8
        elif age_days < 365:
            return 0.6
        elif age_days < 730:
            return 0.4
        else:
            return 0.2
```
### Domain Blacklist

```python
from urllib.parse import urlparse

DOMAIN_BLACKLIST = [
    'contentvilla.com',
    'pastebin.com',       # Often scraped/duplicated content
    'scam-detector.com',
    'pinterest.com',      # Image aggregator, not original content
    # Add more as needed
]


def is_blacklisted(url: str) -> bool:
    """Check if a URL's domain is blacklisted."""
    domain = urlparse(url).netloc
    return any(blocked in domain for blocked in DOMAIN_BLACKLIST)
```
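Putting the blacklist and the scorer together, a sketch of result filtering (the `filter_results` name is illustrative; the 0.4 threshold is the default from the Configuration section):

```python
def filter_results(results, scorer, min_score=0.4):
    """Drop blacklisted domains and low-credibility results."""
    kept = []
    for r in results:
        if is_blacklisted(r.url):
            continue
        if scorer.score(r.url, r.domain, r) >= min_score:
            kept.append(r)
    return kept
```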
---

## Result Ranking

### Ranking Algorithm

```python
from typing import Dict, List


class ResultRanker:
    """Rank search results by relevance and quality."""

    def rank(
        self,
        results: List[SearchResult],
        query: str,
        context: Dict = None
    ) -> List[RankedResult]:
        """Rank results by multiple factors."""
        ranked = []
        for result in results:
            score = self.calculate_score(result, query, context)
            ranked.append(RankedResult(
                result=result,
                score=score
            ))

        # Sort by score (highest first)
        ranked.sort(key=lambda x: x.score, reverse=True)
        return ranked

    def calculate_score(
        self,
        result: SearchResult,
        query: str,
        context: Dict
    ) -> float:
        """Calculate a ranking score."""
        score = 0.0

        # 1. Credibility (40%)
        credibility = self.credibility_scorer.score(
            result.url,
            result.domain,
            result
        )
        score += credibility * 0.4

        # 2. Relevance (35%)
        relevance = self.calculate_relevance(result, query)
        score += relevance * 0.35

        # 3. Freshness (10%)
        freshness = self.credibility_scorer.freshness_score(result.date_published)
        score += freshness * 0.1

        # 4. Engagement signals (10%)
        # (If available: click-through rate, dwell time, etc.)
        score += result.engagement_score * 0.1

        # 5. Diversity bonus (5%): prefer results from unseen domains
        if context and context.get('seen_domains'):
            if result.domain not in context['seen_domains']:
                score += 0.05

        return score

    def calculate_relevance(self, result: SearchResult, query: str) -> float:
        """Calculate query-result relevance."""
        # Simple keyword matching (can be enhanced with embeddings);
        # assumes a non-empty query.
        query_terms = set(query.lower().split())

        # Title overlap
        title_terms = set(result.title.lower().split())
        title_overlap = len(query_terms & title_terms) / len(query_terms)

        # Snippet overlap
        snippet_terms = set(result.snippet.lower().split())
        snippet_overlap = len(query_terms & snippet_terms) / len(query_terms)

        # Weighted average
        return 0.6 * title_overlap + 0.4 * snippet_overlap
```
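`SearchResult` and `RankedResult` are used throughout but never defined here; a minimal sketch of the fields this section assumes:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SearchResult:
    url: str
    domain: str
    title: str
    snippet: str
    date_published: Optional[datetime] = None
    engagement_score: float = 0.0  # CTR/dwell-time signal, if available


@dataclass
class RankedResult:
    result: SearchResult
    score: float
```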
---

## Caching & Deduplication

### Search Result Cache

```python
from datetime import datetime
from typing import List, Optional


class SearchCache:
    """Cache search results to reduce API calls."""

    def __init__(self, ttl_seconds: int = 3600):
        self.cache = {}
        self.ttl = ttl_seconds

    def get(self, query: str, engine: str) -> Optional[List[SearchResult]]:
        """Get cached results, if still fresh."""
        key = self.make_key(query, engine)
        if key in self.cache:
            cached, timestamp = self.cache[key]
            # Check if still valid
            age = (datetime.now() - timestamp).total_seconds()
            if age < self.ttl:
                return cached
            else:
                # Expired; remove
                del self.cache[key]
        return None

    def set(self, query: str, engine: str, results: List[SearchResult]):
        """Cache results."""
        key = self.make_key(query, engine)
        self.cache[key] = (results, datetime.now())

    def make_key(self, query: str, engine: str) -> str:
        """Generate a cache key."""
        normalized = query.lower().strip()
        return f"{engine}:{normalized}"
```
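A sketch of wiring the cache in front of a provider call (`run_search` is a hypothetical stand-in for whatever actually hits the API):

```python
cache = SearchCache(ttl_seconds=3600)

def cached_search(query: str, engine: str):
    hit = cache.get(query, engine)
    if hit is not None:
        return hit                        # served from cache
    results = run_search(query, engine)   # hypothetical provider call
    cache.set(query, engine, results)
    return results
```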
### Result Deduplication

```python
from difflib import SequenceMatcher
from typing import List, Set
from urllib.parse import urlparse


class ResultDeduplicator:
    """Remove duplicate results across multiple searches."""

    def deduplicate(self, results: List[SearchResult]) -> List[SearchResult]:
        """Remove duplicates."""
        seen_urls = set()
        seen_titles = set()
        unique = []

        for result in results:
            # Normalize URL (remove query params, fragments)
            normalized_url = self.normalize_url(result.url)
            # Normalize title
            normalized_title = result.title.lower().strip()

            # Skip exact URL duplicates
            if normalized_url in seen_urls:
                continue
            # Skip near-duplicate titles
            if self.is_near_duplicate_title(normalized_title, seen_titles):
                continue

            unique.append(result)
            seen_urls.add(normalized_url)
            seen_titles.add(normalized_title)

        return unique

    def normalize_url(self, url: str) -> str:
        """Normalize a URL for comparison."""
        parsed = urlparse(url)
        # Drop query params and fragment, plus any trailing slash
        normalized = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
        return normalized.rstrip('/')

    def is_near_duplicate_title(self, title: str, seen_titles: Set[str]) -> bool:
        """Check if a title is a near-duplicate of one already seen."""
        for seen in seen_titles:
            similarity = SequenceMatcher(None, title, seen).ratio()
            if similarity > 0.85:  # 85% similar
                return True
        return False
```
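Usage, for illustration (`google_results` and `brave_results` are hypothetical result lists):

```python
deduper = ResultDeduplicator()
merged = google_results + brave_results
unique = deduper.deduplicate(merged)
print(f"{len(merged)} results, {len(unique)} after deduplication")
```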
---

## Configuration

### Search Engine Settings

```typescript
interface SearchEngineConfig {
  default: 'google' | 'bing' | 'brave' | 'duckduckgo' | 'perplexity';

  providers: {
    google?: GoogleConfig;
    bing?: BingConfig;
    brave?: BraveConfig;
    duckduckgo?: DuckDuckGoConfig;
    perplexity?: PerplexityConfig;
  };

  // Global settings
  maxResults: number;             // Default: 10
  timeout: number;                // Seconds
  cacheResults: boolean;          // Default: true
  cacheTTL: number;               // Seconds

  // Query optimization
  optimizeQueries: boolean;       // Default: true
  expandQueries: boolean;         // Default: false

  // Multi-hop
  enableMultiHop: boolean;        // Default: false
  maxHops: number;                // Default: 3

  // Filtering
  filterByCredibility: boolean;   // Default: true
  minCredibilityScore: number;    // Default: 0.4
  blacklistedDomains: string[];

  // Cost tracking
  trackCosts: boolean;            // Default: true
  dailyQueryLimit: number;        // Default: 1000
}
```
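For illustration, a Python-side configuration of the same shape; the document does not state the exact Python field names, so the one-to-one mapping is an assumption:

```python
# A sketch of a config dict mirroring SearchEngineConfig above.
config = {
    "default": "google",
    "providers": {
        "google": {
            "api_key": "YOUR_GOOGLE_API_KEY",
            "search_engine_id": "YOUR_CSE_ID",
        },
    },
    "maxResults": 10,
    "timeout": 10,
    "cacheResults": True,
    "cacheTTL": 3600,
    "optimizeQueries": True,
    "expandQueries": False,
    "enableMultiHop": False,
    "maxHops": 3,
    "filterByCredibility": True,
    "minCredibilityScore": 0.4,
    "blacklistedDomains": ["contentvilla.com"],
    "trackCosts": True,
    "dailyQueryLimit": 1000,
}
```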
### Usage Example

```python
# Initialize the search engine
search = SearchEngine(config)

# Simple search
results = await search.search(
    query="best Python web scraping libraries",
    engine="google",
    num_results=10
)

# Optimized search
results = await search.search_optimized(
    query="web scraping",
    context={"domain": "python.org", "year": 2026},
    optimize=True,
    filter_credibility=True
)

# Multi-hop search
multi_hop_results = await search.search_multi_hop(
    initial_query="web scraping tools",
    max_hops=3
)

# Get ranked results
ranked = search.rank_results(
    results,
    query="web scraping tools",
    context={"seen_domains": ["github.com"]}
)
```

---

**Next:** See [agents.md](./agents.md) for agent architecture.
## Related API Reference

| Item | Value |
| --- | --- |
| API reference | `api-reference.md` |

## Document Metadata

| Key | Value |
| --- | --- |
| Document | `search-engine.md` |
| Status | Active |