# Async Web Crawler
High-performance async web scraper for dataset collection.
## Install
```bash
pip install aiohttp
```
## Usage
Pass a seed file, an output directory, and the number of concurrent workers:
```bash
python crawler.py seeds.txt output_dir/ --workers 100
```
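To illustrate what a `--workers` limit means in practice, here is a minimal sketch of a bounded worker pool built on `asyncio`. It is not the code in `crawler.py`: the names `crawl` and `fake_fetch` are hypothetical, and the fetch step is a stand-in so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    workers: int = 100,
) -> Dict[str, str]:
    """Fetch every URL with at most `workers` requests in flight."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: Dict[str, str] = {}

    async def worker() -> None:
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # no work left, so this worker exits
            try:
                results[url] = await fetch(url)
            except Exception as exc:  # one failing site should not stop the crawl
                results[url] = f"ERROR: {exc!r}"

    # Each worker handles one URL at a time, so concurrency never exceeds `workers`.
    await asyncio.gather(*(worker() for _ in range(workers)))
    return results


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP request (no network involved)."""
    await asyncio.sleep(0.01)
    return f"body of {url}"


if __name__ == "__main__":
    pages = asyncio.run(crawl(["https://example.com/"], fake_fetch, workers=4))
    print(pages)
```

With aiohttp, the `fetch` callable would typically wrap `session.get(url)` on a shared `aiohttp.ClientSession`. Raising `workers` increases throughput until bandwidth, DNS, or remote rate limits become the bottleneck.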
## Get Seeds
Build a seed list from the Tranco top-1M ranking (a CSV of `rank,domain` rows):
```bash
curl -sL https://tranco-list.eu/top-1m.csv.zip -o tranco.zip && unzip tranco.zip
awk -F, '{print "https://"$2"/"}' top-1m.csv > seeds.txt
```
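The same conversion can be sketched in Python, which makes it easy to cap the list at the top N domains for a smaller crawl. The function name `tranco_to_seeds` and the `limit` parameter are hypothetical, and the sample rows are illustrative rather than real rankings.

```python
import csv
from typing import Iterable, Iterator, Optional


def tranco_to_seeds(lines: Iterable[str], limit: Optional[int] = None) -> Iterator[str]:
    """Yield https:// seed URLs from 'rank,domain' CSV lines."""
    count = 0
    for row in csv.reader(lines):
        if len(row) < 2 or not row[1].strip():
            continue  # skip blank or malformed rows
        if limit is not None and count >= limit:
            break
        yield f"https://{row[1].strip()}/"
        count += 1


# Illustrative rows only; a real run would iterate over the lines of top-1m.csv.
sample = ["1,example.com", "2,example.org", "3,example.net"]
print(list(tranco_to_seeds(sample, limit=2)))
# prints ['https://example.com/', 'https://example.org/']
```

To produce `seeds.txt`, pass an open file handle for `top-1m.csv` instead of `sample` and write each yielded URL on its own line.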
## Output
Each file in the output directory contains a page's URL and the text extracted from it.
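The exact file layout is not specified here. As a sketch only, assuming the URL sits on the first line, the extracted text follows it, and the file name is a hash of the URL, a writer and matching reader could look like this. The helpers `write_page` and `read_page` and the naming scheme are hypothetical, so check the real output before relying on them.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def write_page(output_dir: Path, url: str, text: str) -> Path:
    """Save one page; layout and file naming are assumptions, not crawler.py's."""
    output_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".txt"
    path = output_dir / name
    # Assumed layout: URL on the first line, extracted text after it.
    path.write_text(f"{url}\n{text}", encoding="utf-8")
    return path


def read_page(path: Path) -> Tuple[str, str]:
    """Return (url, text) from a file written by write_page."""
    url, _, text = path.read_text(encoding="utf-8").partition("\n")
    return url, text
```

Hashing the URL keeps file names short and filesystem-safe, which matters when a crawl produces a very large number of files.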

*OpenTransformers Ltd*