user-agent matching #37
Comments
The intention is as you describe. This regression was introduced in the switch to the C++ extension version of reppy (0.4.0), and a bug fix will be pushed shortly to rep-cpp and subsequently reppy.
OK, well, I appreciate the quick response and bug fix. However, this change misses the main point of my feature request, which is substring matching. Here is a test I would like to see pass:
Ah, I apparently completely glossed over that part. I agree with the two case-insensitive matching cases, but am more on the fence about the case of substring matches. That said, that appears to be from the resource we trust the most. Sounds like there are two approaches here -- substring checks as described, or the ability to provide multiple agents and see if any match before using the default. I don't believe it's the intention of someone crafting a
This is the first time I've encountered the suggestion / interpretation of wanting to do substring matches. On its face it seems like an unnecessary burden on clients to do a This would involve a bit of refactoring, but is not conceptually hard.
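The second approach mentioned above (trying several agent names in turn before falling back to the default record) could look roughly like the sketch below. It is only an illustration: `select_rules` and the `groups` mapping are hypothetical names, not part of reppy, and the rules are represented as plain lists of disallowed path prefixes.

```python
from typing import Dict, List, Optional, Sequence


def select_rules(
    groups: Dict[str, List[str]], candidates: Sequence[str]
) -> Optional[List[str]]:
    """Return the rules for the first candidate agent that has a record.

    `groups` maps a robots.txt User-agent token to its rules (hypothetical
    representation). Matching is exact but case-insensitive; if no candidate
    has a record, the "*" record is used when present.
    """
    lowered = {name.lower(): rules for name, rules in groups.items()}
    for candidate in candidates:
        rules = lowered.get(candidate.lower())
        if rules is not None:
            return rules
    return lowered.get("*")


# Hypothetical parsed robots.txt, for illustration only.
groups = {"ExampleBot": ["/private"], "*": ["/"]}
print(select_rules(groups, ["examplebot-news", "examplebot"]))
print(select_rules(groups, ["otherbot"]))
```

With this shape, a crawler would pass its names from most to least specific, and lookups stay constant-time per candidate.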
I agree the intention of the person writing robots.txt might not be for "my-agent" etc. to match here. Maybe some kind of whole-word matching would be theoretically preferable. However, that's not what the spec says, and it would be more difficult to implement. It's true that substring matching is O(n). But even for the craziest edge-case robots.txt, n will be in the hundreds, maybe the thousands. String searches are fast. I would be very surprised if this became a performance issue in any real-world use case.
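For concreteness, the substring rule under discussion could be sketched as below. This is a minimal illustration, not reppy's implementation: `agent_matches` is a hypothetical helper, and it assumes the robots.txt token is searched for inside the crawler's user-agent string, case-insensitively, with `*` matching everything.

```python
def agent_matches(robots_token: str, crawler_agent: str) -> bool:
    """Return True if a robots.txt User-agent token applies to crawler_agent.

    Assumption: the token from robots.txt is looked up as a
    case-insensitive substring of the crawler's user-agent string.
    """
    token = robots_token.strip().lower()
    if token == "*":
        return True  # the wildcard record applies to every crawler
    if not token:
        return False  # an empty token should not match everything
    return token in crawler_agent.lower()


# Hypothetical crawler user-agent, used only for illustration.
ua = "Mozilla/5.0 (compatible; ExampleBot/1.0; +http://example.com/bot)"
print(agent_matches("examplebot", ua))
print(agent_matches("otherbot", ua))
print(agent_matches("bot", ua))
```

Note that a short token such as `bot` also matches here, which is the unintended-match concern raised above.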
I agree that in many practical cases it's probably unimportant performance-wise. For us, we're only interested in a single set of rules, and so we use
It turned out to be pretty easy to monkey-patch reppy.
Note the reference as quoted before says the following:
I think it is really only the version portion of the agent that is supposed to be dropped before matching, rather than arbitrary substring matching being intended. Also, with RFCs, actual practice generally takes precedence over the document itself; when implementations differ from the RFC, the RFC is typically updated. There's sufficient evidence that robots in practice don't do sub-token-level substring matching for robots declarations.
I wouldn't read it that way. If anything it probably means to remove the version from the token in robots.txt (i.e. "Googlebot/2.1" => "Googlebot"). If this is your user-agent (Google Smartphone https://support.google.com/webmasters/answer/1061943)
then if you remove version information what would it be?
And then look for an exact match on that in robots.txt?? Seems crazy to me
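To make the two readings concrete, here is a small sketch of "removing version information". `strip_version` is a hypothetical helper, it assumes the version is whatever follows the first `/`, and the long user-agent string is made up for illustration.

```python
def strip_version(token: str) -> str:
    """Drop everything from the first "/" onward, e.g. a trailing version."""
    return token.split("/", 1)[0].strip()


# Applied to a robots.txt token, the result is the bare robot name.
print(strip_version("Googlebot/2.1"))

# Applied to a full browser-style user-agent (hypothetical), the result
# is just the leading product name, which says nothing about the robot.
ua = "Mozilla/5.0 (compatible; ExampleBot/1.0; +http://example.com/bot)"
print(strip_version(ua))
```

The first call yields `Googlebot`, while the second yields `Mozilla`, which is why an exact match on a version-stripped request user-agent is of little use.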
There doesn't seem to be a rock-solid convention on how to match user-agent strings. However, the various standards agree on recommending a case-insensitive substring match. See below.
Reppy does something different (exact match?). That means someone writing a web crawler likely has to manage two different kinds of user-agent string: one for robots.txt and a different one that's actually sent in the requests. (That seems to be what Google's crawlers do, but ugh, why. https://support.google.com/webmasters/answer/1061943)
References:
http://stackoverflow.com/questions/18026551/is-the-user-agent-line-in-robots-txt-an-exact-match-or-a-substring-match
http://www.robotstxt.org/norobots-rfc.txt
https://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.1