Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

user-agent matching #37

Open
nlevitt opened this issue Nov 16, 2016 · 8 comments
Open

user-agent matching #37

nlevitt opened this issue Nov 16, 2016 · 8 comments

Comments

@nlevitt
Copy link

nlevitt commented Nov 16, 2016

There doesn't seem to be a rock solid convention on how to match user-agent strings. However the various standards agree on recommending case-insensitive substring match. See below.

Reppy does something different (exact match?). Which means someone writing a web crawler likely has to manage two different kinds of user-agent string, one for robots.txt and a different that's actually sent in the requests. (That seems to be what google's crawlers do, but ugh, why. https://support.google.com/webmasters/answer/1061943)

References:

http://stackoverflow.com/questions/18026551/is-the-user-agent-line-in-robots-txt-an-exact-match-or-a-substring-match

In the original robots.txt specification (from 1994), it says:

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

http://www.robotstxt.org/norobots-rfc.txt

The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring. The name comparisons are case-insensitive.

https://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.1

There must be exactly one "User-agent" field per record. The robot should be liberal in interpreting this field. A case-insensitive substring match of the name without version information is recommended.

@dlecocq
Copy link

dlecocq commented Nov 16, 2016

The intention is as you describe. This regression was introduced in the switch to the C++ extension version of reppy (0.4.0), and a bug fix will be pushed shortly to rep-cpp and subsequently reppy.

This was referenced Nov 16, 2016
@nlevitt
Copy link
Author

nlevitt commented Nov 16, 2016

Ok, well, I appreciate the quick response and bug fix. However this change misses the main point of my feature request, which is substring matching.

Here is a test I would like to see pass:

diff --git a/tests/test_robots.py b/tests/test_robots.py
index fd0543329b..c6e8f5b3cc 100644
--- a/tests/test_robots.py
+++ b/tests/test_robots.py
@@ -142,6 +142,18 @@ class RobotsTest(unittest.TestCase):
         self.assertFalse(robot.allowed('/path', 'agent'))
         self.assertFalse(robot.allowed('/path', 'aGeNt'))

+    def test_substring(self):
+        '''
+        "The robot must obey the first record in /robots.txt that contains a
+        User-Agent line whose value contains the name token of the robot as a
+        substring." -- http://www.robotstxt.org/norobots-rfc.txt
+        '''
+        robot = robots.Robots.parse('http://example.com/robots.txt', '''
+            User-agent: Agent
+            Disallow: /path
+        ''')
+        self.assertFalse(robot.allowed('/path', 'foo; agent; bar'))
+
     def test_empty(self):
         '''Makes sure we can parse an empty robots.txt'''
         robot = robots.Robots.parse('http://example.com/robots.txt', '')

@dlecocq
Copy link

dlecocq commented Nov 16, 2016

Ah, I apparently completely glossed over that part.

I agree with the two case-insensitive matching cases, but am more on the fence about the case of substring matches. That said, that appears to be from the resource we trust the most.

Sounds like there are two approaches here -- substring checks as described, or the ability to provide multiple agents and see if any match before using the default. I don't believe it's the intention of someone crafting a robots.txt:

# Should not match my-agent, foo-agent, etc.
User-agent: agent
Disallow: /

This is the first time I've encountered the suggestion / interpretation of wanting to do substring matches. On its face it seems like an unnecessary burden on clients to do a O(agents) check in order to do the matching.

This would involve a bit of refactoring, but is not conceptually hard.

@nlevitt
Copy link
Author

nlevitt commented Nov 16, 2016

# Should not match my-agent, foo-agent, etc.
User-agent: agent
Disallow: /

I agree the intention of the person writing robots.txt might not be for "my-agent" etc to match here. Maybe some kind of whole-word matching would be theoretically preferable. However that's not what the spec says, and it would be more difficult to implement.

It's true that substring matching is O(n). But even for the craziest edge case robots.txt, n will be in the hundreds, maybe the thousands. String searches are fast. I would be very surprised if this becomes a performance issue in any real world use cases.

@dlecocq
Copy link

dlecocq commented Nov 16, 2016

I agree that in many practical cases it's probably unimportant performance-wise. For us, we're only interested in a single set of rules, and so we use AgentCache to cache only the relevant stanza, so we pay any O(n) lookup time just once.

@nlevitt
Copy link
Author

nlevitt commented Nov 16, 2016

It turned out to be pretty easy to monkey-patch reppy.
internetarchive/brozzler@3aead6de93
I'm ok with this.

@b4hand
Copy link
Contributor

b4hand commented Nov 21, 2016

Note the reference as quoted before says the following:

A case insensitive substring match of the name without version information is recommended. [emphasis added]

I think this is truly only the version portion of the agent that is supposed to be dropped for substring matching rather than arbitrary substring matching. Also, in general, with RFCs it's generally the case that actual practice takes precedence over the document itself, and when implementations differ from the RFC, typically, the RFC is updated. There's sufficient evidence that robots in practice don't do sub-token level substring matching for robots declarations.

@nlevitt
Copy link
Author

nlevitt commented Nov 21, 2016

I wouldn't read it that way. If anything it probably means to remove the version from the token in robots.txt (i.e. "Googlebot/2.1" => "Googlebot").

If this is your user-agent (Google Smartphone https://support.google.com/webmasters/answer/1061943)

Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

then if you remove version information what would it be?

Mozilla (Linux; Android; Nexus 5X Build) AppleWebKit (KHTML, like Gecko) Chrome Mobile Safari (compatible; Googlebot; +http://www.google.com/bot.html)

And then look for an exact match on that in robots.txt?? Seems crazy to me

JonoYang added a commit to aboutcode-org/reppy2 that referenced this issue Oct 4, 2021
    * Replace all references to `tmp` with `venv`

Signed-off-by: Jono Yang <jyang@nexb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants