
Running the bot nowadays is hard, because a lot of sites will block you - not just by asking nicely via robots.txt, but by checking your actual source IP. Once they see it's not Google, they send you a 403.
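
For illustration, here is roughly what that kind of server-side check could look like. This is only a sketch: the function name `status_for`, the `looks_like_bot` flag, and the allowlisted ranges are all made up (the ranges are reserved documentation networks, not any search engine's real crawler ranges), and real sites may use reverse DNS or a CDN's bot scoring instead.

```python
import ipaddress

# Hypothetical allowlist standing in for a search engine's published
# crawler ranges. These are reserved documentation networks, not real ones.
ALLOWED_CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def status_for(source_ip: str, looks_like_bot: bool) -> int:
    """Return the HTTP status a site with this (assumed) policy would send."""
    if not looks_like_bot:
        # Ordinary traffic is served normally.
        return 200
    addr = ipaddress.ip_address(source_ip)
    # Membership tests across IPv4/IPv6 simply evaluate to False.
    if any(addr in net for net in ALLOWED_CRAWLER_NETWORKS):
        return 200
    # A bot coming from an address outside the allowlist is refused.
    return 403


print(status_for("192.0.2.10", True))    # allowlisted crawler address
print(status_for("198.51.100.7", True))  # unknown crawler address
```

The point is that the decision is made on the connection's source address, so nothing the crawler puts in its User-Agent header changes the outcome.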


Cloudflare’s ubiquity makes bootstrapping a search index via crawler virtually impossible, but what about data sources like Common Crawl?



