Hacker News
Nextgrid 4 months ago | on: Waiting for dawn in search: Search index, Google r...
Running the bot nowadays is hard, because a lot of sites will now block you - not just by asking nicely via robots.txt, but by checking your actual source IP. Once they see it's not Google, they send you a 403.
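For illustration, here is a rough sketch of how a site could do that source-IP check. It assumes the site uses the forward-confirmed reverse DNS method Google documents for verifying Googlebot; real sites may rely on published IP ranges or a CDN's bot rules instead. The function names and the injectable lookup parameters are hypothetical, added only so the logic can be tried without touching the network.

```python
import socket

# Hostname suffixes used here to recognise Google's crawlers (assumption for this sketch).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a client IP.

    reverse_lookup and forward_lookup are hypothetical hooks so the DNS
    calls can be swapped out; by default they use the socket module.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex only returns IPv4 addresses.
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record at all

    # Step 1: the reverse DNS name must sit under a Google domain.
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    # Step 2: that name must resolve back to the same IP; otherwise
    # anyone controlling their own PTR record could claim to be Google.
    try:
        return ip in forward_lookup(host)
    except OSError:
        return False


def status_for(ip, **lookups):
    """Serve verified Google crawlers, send everyone else a 403."""
    return 200 if is_verified_googlebot(ip, **lookups) else 403
```

With a check like this, setting a Googlebot User-Agent header does not help a new crawler, because the decision depends only on where the connection comes from.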
eloisius 4 months ago
Cloudflare’s ubiquity makes bootstrapping a search index via crawler virtually impossible, but what about data sources like Common Crawl?