New bookmark: Why the internet needs crawl neutrality

As the maintainer of a long list of crawling search engines, I consider this a really important issue. Perhaps I should add information about each engine's crawler and user-agent string to that list. “The web is hostile to upstart search engine crawlers, and most websites only allow Google’s crawler.”


@Seirdy if this is the workaround that Neeva had to come up with that took them an inordinate amount of time and resources, they might need to hire all new engineers 😆

> This forces startups to spend inordinate amounts of time and resources coming up with workarounds. For example, Neeva implements a policy of “crawling a site so long as the robots.txt allows GoogleBot and does not specifically disallow Neevabot.”
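One reading of the quoted policy can be sketched in Python with the standard-library `urllib.robotparser`. This is only an approximation of what the article describes, not Neeva's actual code; the function name and the "explicitly named" heuristic (scanning for a `User-agent: Neevabot` line) are my own assumptions.

```python
from urllib import robotparser

NEEVABOT = "Neevabot"
GOOGLEBOT = "Googlebot"

def may_crawl(robots_txt: str, url: str) -> bool:
    """Approximate the quoted policy: crawl a URL if robots.txt permits
    Googlebot, unless a rule group explicitly names Neevabot, in which
    case honor whatever that group says for Neevabot."""
    lines = robots_txt.splitlines()
    # Does any User-agent line name Neevabot explicitly?
    explicitly_named = any(
        line.split(":", 1)[0].strip().lower() == "user-agent"
        and line.split(":", 1)[1].strip().lower() == NEEVABOT.lower()
        for line in lines
        if ":" in line
    )
    parser = robotparser.RobotFileParser()
    parser.parse(lines)
    if explicitly_named:
        return parser.can_fetch(NEEVABOT, url)
    return parser.can_fetch(GOOGLEBOT, url)
```

For example, a robots.txt that disallows everyone but carves out an exception for Googlebot would still let this crawler through, while adding an explicit `User-agent: Neevabot` / `Disallow: /` group would shut it out.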

@Seirdy with that being said, one of the real blockers to these workarounds appears when a site uses a service like Cloudflare to block crawlers outright, whether by detecting a spoofed user agent or by verifying the crawler’s IP via reverse DNS. So there is some credence to that post, but I place most of the blame on shortsighted webmasters, SEOs, and developers.
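The reverse-DNS check mentioned above is the forward-confirmed reverse DNS procedure Google documents for verifying Googlebot: look up the PTR record for the requesting IP, confirm the hostname falls under googlebot.com or google.com, then resolve that hostname forward and confirm it matches the original IP. A minimal sketch, with the function name mine and the resolver functions injectable so it can be exercised without live DNS:

```python
import socket

def is_verified_googlebot(ip: str,
                          reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname) -> bool:
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP."""
    try:
        # PTR lookup: gethostbyaddr returns (hostname, aliases, addresses).
        hostname = reverse(ip)[0]
    except OSError:
        return False
    # The PTR hostname must belong to Google's crawler domains.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward-confirm: the hostname must resolve back to the same IP.
        return forward(hostname) == ip
    except OSError:
        return False
```

A crawler that merely spoofs Googlebot’s user-agent string fails this check, because the spoofer’s IP will not reverse-resolve into Google’s crawler domains.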

Coywolf Social

The federated Mastodon instance for Coywolf