Commit 397b5d0

Merge pull request #75 from ropensci/fix/vignette
Fix/vignette
2 parents c63877c + 855bbe6 commit 397b5d0

5 files changed: +247 / -367 lines


.Rbuildignore

Lines changed: 0 additions & 1 deletion
@@ -20,4 +20,3 @@ revdep/*.*
^doc$
^Meta$
^\.github$
-vignettes/*.Rmd

DESCRIPTION

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
Package: robotstxt
-Date: 2020-09-03
+Date: 2024-08-24
Type: Package
Title: A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker
-Version: 0.7.13
+Version: 0.7.14
Authors@R: c(
    person(
      "Pedro", "Baltazar", role = c("aut", "cre"),

NEWS.md

Lines changed: 5 additions & 0 deletions
@@ -1,6 +1,11 @@
NEWS robotstxt
==========================================================================

+0.7.14 | 2024-08-24
+--------------------------------------------------------------------------
+
+- CRAN compliance - Packages which use Internet resources should fail gracefully
+

0.7.13 | 2020-09-03
--------------------------------------------------------------------------
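The policy referenced in the new NEWS entry asks that functions relying on internet resources degrade to an informative message rather than an error when the resource cannot be reached. A minimal sketch of that pattern in base R is shown below; the helper name `fetch_gracefully()` and the use of `readLines()` are illustrative assumptions, not the package's actual implementation.

```r
# Illustrative sketch of the "fail gracefully" pattern -- not robotstxt's
# actual code. `fetch_gracefully()` is a hypothetical helper.
fetch_gracefully <- function(url, timeout = 5) {
  op <- options(timeout = timeout)  # cap the wait for the remote resource
  on.exit(options(op), add = TRUE)  # restore the previous option on exit
  tryCatch(
    readLines(url, warn = FALSE),   # try to read the remote text file
    error = function(e) {
      # Report the problem as a message and return NULL instead of erroring
      message("Could not reach ", url, ": ", conditionMessage(e))
      invisible(NULL)
    }
  )
}

# Returns the file's lines, or NULL (with a message) if the host is unreachable
txt <- fetch_gracefully("https://example.com/robots.txt")
```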

vignettes/using_robotstxt.Rmd

Lines changed: 240 additions & 102 deletions
@@ -11,127 +11,265 @@ vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
+# Description

-# Description
+The package provides a simple ‘robotstxt’ class and accompanying methods
+to parse and check ‘robots.txt’ files. Data fields are provided as data
+frames and vectors. Permissions can be checked by providing path
+character vectors and optional bot names.

-The package provides a simple 'robotstxt' class and accompanying methods to parse and check 'robots.txt' files. Data fields are provided as data frames and vectors. Permissions can be checked by providing path character vectors and optional bot names.
-
-
# Robots.txt files

-Robots.txt files are a way to kindly ask webbots, spiders, crawlers, wanderers and the like to access or not access certain parts of a webpage. The de facto 'standard' never made it beyond a informal ["Network Working Group INTERNET DRAFT"](http://www.robotstxt.org/norobots-rfc.txt). Nonetheless, the use of robots.txt files is widespread (e.g. https://en.wikipedia.org/robots.txt, https://www.google.com/robots.txt) and bots from Google, Yahoo and the like will adhere to the rules defined in robots.txt files - although, their interpretation of those rules might differ (e.g. [rules for googlebot ](https://developers.google.com/search/reference/robots_txt)).
-
-As the name of the files already suggests robots.txt files are plain text and always found at the root of a domain. The syntax of the files in essence follows a `fieldname: value` scheme with optional preceding `user-agent: ...` lines to indicate the scope of the following rule block. Blocks are separated by blank lines and the omission of a user-agent field (which directly corresponds to the HTTP user-agent field) is seen as referring to all bots. `#` serves to comment lines and parts of lines. Everything after `#` until the end of line is regarded a comment. Possible field names are: user-agent, disallow, allow, crawl-delay, sitemap, and host.
-
-
-Let us have an example file to get an idea how a robots.txt file might look like. The file below starts with a comment line followed by a line disallowing access to any content -- everything that is contained in root ("`/`") -- for all bots. The next block concerns GoodBot and NiceBot. Those two get the previous permissions lifted by being disallowed nothing. The third block is for PrettyBot. PrettyBot likes shiny stuff and therefor gets a special permission for everything contained in the "`/shinystuff/`" folder while all other restrictions still hold. In the last block all bots are asked to pause at least 5 seconds between two visits.
-
-
-```robots.txt
-# this is a comment
-# a made up example of an robots.txt file
-
-Disallow: /
-
-User-agent: GoodBot # another comment
-User-agent: NiceBot
-Disallow:
-
-User-agent: PrettyBot
-Allow: /shinystuff/
-
-Crawl-Delay: 5
-```
-
-For more information have a look at: http://www.robotstxt.org/norobots-rfc.txt, where the robots.txt file 'standard' is described formally. Valuable introductions can be found at http://www.robotstxt.org/robotstxt.html as well as at https://en.wikipedia.org/wiki/Robots_exclusion_standard - of cause.
+Robots.txt files are a way to kindly ask webbots, spiders, crawlers,
+wanderers and the like to access or not access certain parts of a
+webpage. The de facto ‘standard’ never made it beyond an informal
+[“Network Working Group INTERNET
+DRAFT”](http://www.robotstxt.org/norobots-rfc.txt). Nonetheless, the use
+of robots.txt files is widespread
+(e.g. <https://en.wikipedia.org/robots.txt>,
+<https://www.google.com/robots.txt>) and bots from Google, Yahoo and the
+like will adhere to the rules defined in robots.txt files - although
+their interpretation of those rules might differ (e.g. [rules for
+googlebot](https://developers.google.com/search/reference/robots_txt)).
+
+As the name of the files already suggests, robots.txt files are plain
+text and always found at the root of a domain. The syntax of the files
+in essence follows a `fieldname: value` scheme with optional preceding
+`user-agent: ...` lines to indicate the scope of the following rule
+block. Blocks are separated by blank lines and the omission of a
+user-agent field (which directly corresponds to the HTTP user-agent
+field) is seen as referring to all bots. `#` serves to comment lines and
+parts of lines. Everything after `#` until the end of line is regarded as
+a comment. Possible field names are: user-agent, disallow, allow,
+crawl-delay, sitemap, and host.
+
+Let us have an example file to get an idea of what a robots.txt file
+might look like. The file below starts with a comment line followed by a
+line disallowing access to any content – everything that is contained in
+root (“`/`”) – for all bots. The next block concerns GoodBot and NiceBot.
+Those two get the previous restriction lifted by being disallowed
+nothing. The third block is for PrettyBot. PrettyBot likes shiny stuff
+and therefore gets a special permission for everything contained in the
+“`/shinystuff/`” folder while all other restrictions still hold. In the
+last block all bots are asked to pause at least 5 seconds between two
+visits.
+
+# this is a comment
+# a made up example of a robots.txt file
+
+Disallow: /
+
+User-agent: GoodBot # another comment
+User-agent: NiceBot
+Disallow:
+
+User-agent: PrettyBot
+Allow: /shinystuff/
+
+Crawl-Delay: 5
+
+For more information have a look at:
+<http://www.robotstxt.org/norobots-rfc.txt>, where the robots.txt file
+‘standard’ is described formally. Valuable introductions can be found at
+<http://www.robotstxt.org/robotstxt.html> as well as at
+<https://en.wikipedia.org/wiki/Robots_exclusion_standard> - of course.

# Fast food usage for the uninterested

-```{r, message=FALSE}
-library(robotstxt)
-paths_allowed("http://google.com/")
-paths_allowed("http://google.com/search")
-```
+library(robotstxt)
+paths_allowed("http://google.com/")

+## [1] TRUE

+paths_allowed("http://google.com/search")

-# Example Usage
+## [1] FALSE

-First, let us load the package. In addition we load the dplyr package to be able to use the magrittr pipe operator `%>%` and some easy to read and remember data manipulation functions.
+# Example Usage

-```{r, message=FALSE}
-library(robotstxt)
-library(dplyr)
-```
-
-## object oriented style
+First, let us load the package. In addition, we load the dplyr package to
+be able to use the magrittr pipe operator `%>%` and some easy to read
+and remember data manipulation functions.

-The first step is to create an instance of the robotstxt class provided by the package. The instance has to be initiated via providing either domain or the actual text of the robots.txt file. If only the domain is provided, the robots.txt file will be downloaded automatically. Have a look at `?robotstxt` for descriptions of all data fields and methods as well as their parameters.
+library(robotstxt)
+library(dplyr)

+## object oriented style

-```{r, include=FALSE}
-rtxt <-
-  robotstxt(
-    domain = "wikipedia.org",
-    text = robotstxt:::rt_get_rtxt("robots_wikipedia.txt")
-  )
-```
+The first step is to create an instance of the robotstxt class provided
+by the package. The instance has to be initiated by providing either the
+domain or the actual text of the robots.txt file. If only the domain is
+provided, the robots.txt file will be downloaded automatically. Have a
+look at `?robotstxt` for descriptions of all data fields and methods as
+well as their parameters.

-```{r, eval=FALSE}
-rtxt <- robotstxt(domain="wikipedia.org")
-```
+rtxt <- robotstxt(domain="wikipedia.org")

`rtxt` is of class `robotstxt`.

-```{r}
-class(rtxt)
-```
-
-Printing the object lets us glance at all data fields and methods in `rtxt` - we have access to the text as well as all common fields. Non-standard fields are collected in `other`.
-
-```{r}
-rtxt
-```
-
-Checking permissions works via `rtxt`'s `check` method by providing one or more paths. If no bot name is provided `"*"` - meaning any bot - is assumed.
-
-
-```{r}
-# checking for access permissions
-rtxt$check(paths = c("/","api/"), bot = "*")
-rtxt$check(paths = c("/","api/"), bot = "Orthogaffe")
-rtxt$check(paths = c("/","api/"), bot = "Mediapartners-Google* ")
-```
-
-
+class(rtxt)
+
+## [1] "robotstxt"
+
+Printing the object lets us glance at all data fields and methods in
+`rtxt` - we have access to the text as well as all common fields.
+Non-standard fields are collected in `other`.
+
+rtxt
+
+## $text
+## [1] "#\n# robots.txt for http://www.wikipedia.org/ and friends\n#\n# Please note: There are a lot of pages on this site, and there are\n# some misbehaved spiders out there that go _way_ too fast. If you're\n# irresponsible, your access to the site may be blocked.\n#\n\n# advertising-related bots:\nUser-agent: Mediapartners-Google*\n\n[... 653 lines omitted ...]"
+##
+## $domain
+## [1] "wikipedia.org"
+##
+## $robexclobj
+## <Robots Exclusion Protocol Object>
+## $bots
+## [1] "Mediapartners-Google*" "IsraBot" "Orthogaffe" "UbiCrawler"
+## [5] "DOC" "Zao" "" "[... 28 items omitted ...]"
+##
+## $comments
+## line comment
+## 1 1 #
+## 2 2 # robots.txt for http://www.wikipedia.org/ and friends
+## 3 3 #
+## 4 4 # Please note: There are a lot of pages on this site, and there are
+## 5 5 # some misbehaved spiders out there that go _way_ too fast. If you're
+## 6 6 # irresponsible, your access to the site may be blocked.
+## 7
+## 8 [... 173 items omitted ...]
+##
+## $permissions
+## field useragent value
+## 1 Disallow Mediapartners-Google* /
+## 2 Disallow IsraBot
+## 3 Disallow Orthogaffe
+## 4 Disallow UbiCrawler /
+## 5 Disallow DOC /
+## 6 Disallow Zao /
+## 7
+## 8 [... 370 items omitted ...]
+##
+## $crawl_delay
+## [1] field useragent value
+## <0 rows> (or 0-length row.names)
+##
+## $host
+## [1] field useragent value
+## <0 rows> (or 0-length row.names)
+##
+## $sitemap
+## [1] field useragent value
+## <0 rows> (or 0-length row.names)
+##
+## $other
+## [1] field useragent value
+## <0 rows> (or 0-length row.names)
+##
+## $check
+## function (paths = "/", bot = "*")
+## {
+## spiderbar::can_fetch(obj = self$robexclobj, path = paths,
+## user_agent = bot)
+## }
+## <bytecode: 0x12f9629b0>
+## <environment: 0x12f965c10>
+##
+## attr(,"class")
+## [1] "robotstxt"
+
+Checking permissions works via `rtxt`’s `check` method by providing one
+or more paths. If no bot name is provided, `"*"` - meaning any bot - is
+assumed.
+
+# checking for access permissions
+rtxt$check(paths = c("/","api/"), bot = "*")
+
+## [1] TRUE FALSE
+
+rtxt$check(paths = c("/","api/"), bot = "Orthogaffe")
+
+## [1] TRUE TRUE
+
+rtxt$check(paths = c("/","api/"), bot = "Mediapartners-Google* ")
+
+## [1] TRUE FALSE

## functional style

-While working with the robotstxt class is recommended the checking can be done with functions only as well. In the following we (1) download the robots.txt file; (2) parse it and (3) check permissions.
-
-```{r, include=FALSE}
-r_text <- robotstxt:::rt_get_rtxt("robots_new_york_times.txt")
-```
-
-```{r, eval=FALSE}
-r_text <- get_robotstxt("nytimes.com")
-```
-
-```{r}
-r_parsed <- parse_robotstxt(r_text)
-r_parsed
-```
-
-```{r}
-paths_allowed(
-  paths = c("images/","/search"),
-  domain = c("wikipedia.org", "google.com"),
-  bot = "Orthogaffe"
-)
-```
-
-
-
-
-
-
+While working with the robotstxt class is recommended, the checking can
+also be done with functions only. In the following we (1) download the
+robots.txt file, (2) parse it, and (3) check permissions.
+
+r_text <- get_robotstxt("nytimes.com")
+
+r_parsed <- parse_robotstxt(r_text)
+r_parsed
+
+## $useragents
+## [1] "*" "Mediapartners-Google" "AdsBot-Google" "adidxbot"
+##
+## $comments
+## [1] line comment
+## <0 rows> (or 0-length row.names)
+##
+## $permissions
+## field useragent value
+## 1 Allow * /ads/public/
+## 2 Allow * /svc/news/v3/all/pshb.rss
+## 3 Disallow * /ads/
+## 4 Disallow * /adx/bin/
+## 5 Disallow * /archives/
+## 6 Disallow * /auth/
+## 7 Disallow * /cnet/
+## 8 Disallow * /college/
+## 9 Disallow * /external/
+## 10 Disallow * /financialtimes/
+## 11 Disallow * /idg/
+## 12 Disallow * /indexes/
+## 13 Disallow * /library/
+## 14 Disallow * /nytimes-partners/
+## 15 Disallow * /packages/flash/multimedia/TEMPLATES/
+## 16 Disallow * /pages/college/
+## 17 Disallow * /paidcontent/
+## 18 Disallow * /partners/
+## 19 Disallow * /restaurants/search*
+## 20 Disallow * /reuters/
+## 21 Disallow * /register
+## 22 Disallow * /thestreet/
+## 23 Disallow * /svc
+## 24 Disallow * /video/embedded/*
+## 25 Disallow * /web-services/
+## 26 Disallow * /gst/travel/travsearch*
+## 27 Disallow Mediapartners-Google /restaurants/search*
+## 28 Disallow AdsBot-Google /restaurants/search*
+## 29 Disallow adidxbot /restaurants/search*
+##
+## $crawl_delay
+## [1] field useragent value
+## <0 rows> (or 0-length row.names)
+##
+## $sitemap
+## field useragent value
+## 1 Sitemap * http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/sitemap.xml.gz
+## 2 Sitemap * http://www.nytimes.com/sitemaps/sitemap_news/sitemap.xml.gz
+## 3 Sitemap * http://spiderbites.nytimes.com/sitemaps/sitemap_video/sitemap.xml.gz
+##
+## $host
+## [1] field useragent value
+## <0 rows> (or 0-length row.names)
+##
+## $other
+## [1] field useragent value
+## <0 rows> (or 0-length row.names)
+
+paths_allowed(
+  paths = c("images/","/search"),
+  domain = c("wikipedia.org", "google.com"),
+  bot = "Orthogaffe"
+)
+
+## wikipedia.org google.com
+
+## [1] TRUE FALSE
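Since the vignette now ships pre-rendered, its examples no longer need network access at build time. The same kind of check can be reproduced fully offline by passing the vignette's made-up robots.txt to the `robotstxt()` constructor as literal text, much as the removed `include=FALSE` chunks did with `robotstxt:::rt_get_rtxt()`. The sketch below assumes this usage; the `example.com` domain label is illustrative only.

```r
library(robotstxt)

# The made-up robots.txt from the vignette, supplied as a literal string so
# that no download is required (the domain is only used as a label here).
txt <- '# this is a comment
# a made up example of a robots.txt file

Disallow: /

User-agent: GoodBot # another comment
User-agent: NiceBot
Disallow:

User-agent: PrettyBot
Allow: /shinystuff/

Crawl-Delay: 5
'

rtxt <- robotstxt(domain = "example.com", text = txt)

# Per the vignette's description: generic bots are locked out, GoodBot and
# NiceBot are unrestricted, and PrettyBot may enter /shinystuff/.
rtxt$check(paths = c("/", "/shinystuff/"), bot = "*")
rtxt$check(paths = c("/", "/shinystuff/"), bot = "GoodBot")
rtxt$check(paths = c("/", "/shinystuff/"), bot = "PrettyBot")
```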
