	# Async Web Crawler

	An asynchronous web crawler, built on aiohttp, that fetches pages concurrently and extracts their text for dataset collection.

	## Install
	```bash
	pip install aiohttp
	```

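	## Usage
	Pass a file of seed URLs (one per line), an output directory, and the number of concurrent workers: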
	```bash
	python crawler.py seeds.txt output_dir/ --workers 100
	```
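
The `--workers` option suggests a fixed pool of concurrent fetchers. The following is a minimal sketch of that pattern, not the code in `crawler.py`: the names `crawl` and `Fetch` are hypothetical, and the fetch function is passed in so the pool logic can run without a network connection.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable

# Hypothetical signature: takes a URL and returns the page body as text.
Fetch = Callable[[str], Awaitable[str]]


async def crawl(urls: Iterable[str], fetch: Fetch, workers: int = 100) -> Dict[str, str]:
    """Fetch every URL with at most `workers` requests in flight at once."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: Dict[str, str] = {}

    async def worker() -> None:
        # Each worker pulls URLs until the queue is drained.
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            try:
                results[url] = await fetch(url)
            except Exception:
                # Assumption: a failed page is skipped rather than retried.
                pass

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results
```

With aiohttp, `fetch` could be a small coroutine that wraps `session.get(url)` on a shared `aiohttp.ClientSession` and returns `await response.text()`. Whether the actual crawler is structured this way is an assumption.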

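	## Get Seeds
	The Tranco top-1M domain list works as a seed set. Download it and turn each domain into an `https://` URL: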
	```bash
	curl -sL https://tranco-list.eu/top-1m.csv.zip -o tranco.zip && unzip tranco.zip
	awk -F, '{print "https://"$2"/"}' top-1m.csv > seeds.txt
	```
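
If `awk` is not available, or only the first N domains are wanted, the same conversion can be sketched in Python. This assumes the CSV has `rank,domain` rows, which is what the `awk` command above relies on. The function name `seed_urls` and its `limit` parameter are hypothetical.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, Optional


def seed_urls(lines: Iterable[str], limit: Optional[int] = None) -> Iterator[str]:
    """Yield one https:// seed URL per `rank,domain` row.

    `limit` caps the number of rows read; None reads them all.
    """
    for row in islice(csv.reader(lines), limit):
        # Skip blank or malformed rows instead of emitting "https:///".
        if len(row) >= 2 and row[1].strip():
            yield "https://" + row[1].strip() + "/"
```

To produce `seeds.txt`, open `top-1m.csv` with `newline=""`, pass the file object to `seed_urls`, and write each yielded URL on its own line.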

	## Output
	Each output file contains the URL of the crawled page and the text extracted from it.
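
This README does not say how the text is extracted. As an illustration only, the sketch below strips tags and drops `<script>` and `<style>` content using the standard library's `html.parser`. The names `TextExtractor` and `extract_text` are hypothetical and are not taken from `crawler.py`.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Return the text fragments of an HTML document, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```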

	OpenTransformers Ltd