A Method for Web Data Discovery
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are
working documents of the Internet Engineering Task Force
(IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
``work in progress.''
To learn the current status of any Internet-Draft, please
check the ``1id-abstracts.txt'' listing contained in the
Internet-Drafts Shadow Directories on ftp.is.co.za (Africa),
nic.nordu.net (Europe), munnari.oz.au (Pacific Rim),
ds.internic.net (US East Coast), or ftp.isi.edu (US West
Coast).
Table of Contents
1. Abstract
2. Introduction
3. The Specification
3.1 Access method
3.2 File Format Description
3.2.1 The User-agent line
3.2.2 The Allow and Disallow lines
3.3 Formal Syntax
3.4 Expiration
4. Examples
5. Notes for Implementors
5.1 Backwards Compatibility
5.2 Interoperability
6. Security Considerations
7. Acknowledgements
8. References
9. Author's Address
1. Abstract
This memo defines a method for administrators of sites on the World-
Wide Web to allow automated tools to easily discover paths to
datasets and describe their contents.
2. Introduction
Data Catalogs have been embraced by government agencies, nonprofit
organizations, and for-profit organizations that aim to provide
open access to data and transparency. A data catalog often
consists of a web interface for searching and downloading datasets;
however, it is hard to automate the discovery and extraction of
datasets since each catalog implements a different set of endpoints
and protocols.
The technique specified in this memo allows Web site administrators
to indicate to visiting humans and robots where datasets are
located within their site.
It is solely up to the visitor to consult this information and act
accordingly: by searching the index, a visitor can render the
metadata associated with each dataset and download the data without
having to rely on complex methods to infer whether a given page
within the site contains datasets.
3. The Specification
This memo specifies a format for exposing datasets to Web site
visitors, and specifies an access method to retrieve these
instructions. Visitors can then choose to provide custom data
catalogs that support one or many Web sites, through a user
interface or by automated methods.
3.1 Access method
The instructions must be accessible via HTTP [2] from the site that
the instructions are to be applied to, as a resource of Internet
Media Type [3] "text/plain" under a standard relative path on the
server: "/data.txt".
For convenience we will refer to this resource as the "/data.txt
file", though the resource need in fact not originate from a file-
system.
Some examples of URLs [4] for sites and of the URLs for the
corresponding "/data.txt" resources:

https://datatxt.org/              https://datatxt.org/data.txt
http://www.bar.com:8001/          http://www.bar.com:8001/data.txt
If the server response indicates Success (an HTTP 2xx Status Code),
the visitor can read the content, parse it, and present an
appropriate user interface or automated method to process the
Web site as a source for a given data catalog.
If the server response indicates the resource does not exist (HTTP
Status Code 404), the visitor should assume the Web site may still
contain datasets, which must then be discovered by other methods.
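For illustration only, the exchange above can be sketched in a few
lines of Python; the fetch_data_txt helper and the use of urllib are
illustrative choices and not part of this specification.

   # Illustrative sketch only: retrieve /data.txt from a site and dispatch
   # on the HTTP status code as described in this section.
   import urllib.error
   import urllib.request

   def fetch_data_txt(site_url):
       """Return the /data.txt body as text, or None if the resource is absent."""
       url = site_url.rstrip("/") + "/data.txt"
       try:
           with urllib.request.urlopen(url) as resp:
               # Success (2xx): read and decode the plain-text body.
               return resp.read().decode("utf-8", errors="replace")
       except urllib.error.HTTPError as err:
           if err.code == 404:
               # Absent: datasets may still exist, discovered by other methods.
               return None
           raise

   # body = fetch_data_txt("https://datatxt.org/")
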
3.2 File Format Description
The instructions are encoded as a formatted plain text object,
described here. A complete BNF-like description of the syntax of this
format is given in section 3.3.
The format logically consists of a non-empty set of records,
separated by blank lines. The records consist of a set of lines of
the form:
<Field> ":" <value>
In this memo we refer to lines with a Field "foo" as "foo lines".
The record starts with one or more User-agent lines, specifying
which robots the record applies to, followed by "Disallow" and
"Allow" instructions to that robot. For example:
User-agent: webcrawler
User-agent: infoseek
Allow: /tmp/ok.html
Disallow: /tmp
Disallow: /user/foo
These lines are discussed separately below.
Lines with Fields not explicitly specified by this specification
may occur in the /robots.txt, allowing for future extension of the
format. Consult the BNF for restrictions on the syntax of such
extensions. Note specifically that for backwards compatibility
with robots implementing earlier versions of this specification,
breaking of lines is not allowed.
Comments are allowed anywhere in the file, and consist of optional
whitespace, followed by a comment character '#' followed by the
comment, terminated by the end-of-line.
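As a minimal, non-normative sketch of a parser for this format, the
following Python function splits the object into blank-line-separated
records of (field, value) pairs; the parse_records name and the
chosen data structure are illustrative assumptions.

   # Illustrative sketch: split the plain-text object into records separated
   # by blank lines; each record is a list of (field, value) pairs. A '#'
   # starts a comment that runs to end-of-line.
   def parse_records(text):
       records, current = [], []
       for raw in text.splitlines():           # accepts CR, LF and CRLF endings
           if not raw.strip():                 # a truly blank line ends the record
               if current:
                   records.append(current)
                   current = []
               continue
           line = raw.split("#", 1)[0].strip() # strip a trailing '#' comment
           if not line:                        # comment-only line: ignore
               continue
           if ":" in line:
               field, value = line.split(":", 1)
               current.append((field.strip().lower(), value.strip()))
       if current:
           records.append(current)
       return records
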
3.2.1 The User-agent line
Name tokens are used to allow robots to identify themselves via a
simple product token. Name tokens should be short and to the
point. The name token a robot chooses for itself should be sent
as part of the HTTP User-agent header, and must be well documented.
These name tokens are used in User-agent lines in /robots.txt to
identify to which specific robots the record applies. The robot
must obey the first record in /robots.txt that contains a User-
Agent line whose value contains the name token of the robot as a
substring. The name comparisons are case-insensitive. If no such
record exists, it should obey the first record with a User-agent
line with a "*" value, if present. If no record satisfies either
condition, or no records are present at all, access is unlimited.
For example, a fictional company FigTree Search Services, which
names its robot "Fig Tree", might send HTTP requests like:
GET / HTTP/1.0
User-agent: FigTree/0.1 Robot libwww-perl/5.04
might scan the "/robots.txt" file for records with:
User-agent: figtree
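The selection rules above can be sketched as follows; the
select_record helper is hypothetical and assumes records in the
(field, value) form produced by the parsing sketch in section 3.2.

   # Illustrative sketch: choose the record that applies to a robot, following
   # the rules above: the first record whose User-agent value contains the
   # robot's name token as a case-insensitive substring wins; otherwise the
   # first record with a "*" User-agent; otherwise None (access is unlimited).
   def select_record(records, name_token):
       token = name_token.lower()
       star_record = None
       for record in records:
           agents = [value for field, value in record if field == "user-agent"]
           if any(token in agent.lower() for agent in agents):
               return record
           if star_record is None and any(agent == "*" for agent in agents):
               star_record = record
       return star_record
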
3.2.2 The Allow and Disallow lines
These lines indicate whether accessing a URL that matches the
corresponding path is allowed or disallowed. Note that these
instructions apply to any HTTP method on a URL.
To evaluate if access to a URL is allowed, a robot must attempt to
match the paths in Allow and Disallow lines against the URL, in the
order they occur in the record. The first match found is used. If no
match is found, the default assumption is that the URL is allowed.
The /robots.txt URL is always allowed, and must not appear in the
Allow/Disallow rules.
The matching process compares every octet in the path portion of
the URL and the path from the record. If a %xx encoded octet is
encountered it is unencoded prior to comparison, unless it is the
"/" character, which has special meaning in a path. The match
evaluates positively if and only if the end of the path from the
record is reached before a difference in octets is encountered.
This table illustrates some examples:
Record Path           URL path              Matches
/tmp                  /tmp                  yes
/tmp                  /tmp.html             yes
/tmp                  /tmp/a.html           yes
/tmp/                 /tmp                  no
/tmp/                 /tmp/                 yes
/tmp/                 /tmp/a.html           yes
/a%3cd.html           /a%3cd.html           yes
/a%3Cd.html           /a%3cd.html           yes
/a%3cd.html           /a%3Cd.html           yes
/a%3Cd.html           /a%3Cd.html           yes
/a%2fb.html           /a%2fb.html           yes
/a%2fb.html           /a/b.html             no
/a/b.html             /a%2fb.html           no
/a/b.html             /a/b.html             yes
/%7ejoe/index.html    /~joe/index.html      yes
/~joe/index.html      /%7Ejoe/index.html    yes
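The matching process can be sketched as follows; the helper names are
hypothetical, and the sketch simply reproduces the behaviour shown in
the table, with "%2F" the only octet left encoded.

   # Illustrative sketch: octet-wise prefix match of a record path against a
   # URL path, decoding %xx escapes on both sides except "%2F", since "/"
   # has special meaning in a path.
   def _normalize(path):
       out, i = [], 0
       while i < len(path):
           if path[i] == "%" and i + 2 < len(path):
               hexpair = path[i + 1:i + 3]
               if hexpair.lower() == "2f":
                   out.append("%2f")              # keep encoded "/" distinct from "/"
               else:
                   try:
                       out.append(chr(int(hexpair, 16)))
                   except ValueError:
                       out.append(path[i:i + 3])  # malformed escape: leave as-is
               i += 3
           else:
               out.append(path[i])
               i += 1
       return "".join(out)

   def path_matches(record_path, url_path):
       """True if the record path is an octet-wise prefix of the URL path."""
       return _normalize(url_path).startswith(_normalize(record_path))

   # path_matches("/tmp", "/tmp/a.html")  -> True, as in the table
   # path_matches("/tmp/", "/tmp")        -> False
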
3.3 Formal Syntax
This is a BNF-like description, using the conventions of RFC 822 [5],
except that "|" is used to designate alternatives. Briefly, literals
are quoted with "", parentheses "(" and ")" are used to group
elements, optional elements are enclosed in [brackets], and elements
may be preceded with <n>* to designate n or more repetitions of the
following element; n defaults to 0.
robotstxt = *blankcomment
          | *blankcomment record *( 1*commentblank 1*record )
            *blankcomment
blankcomment = 1*(blank | commentline)
commentblank = *commentline blank *(blankcomment)
blank = *space CRLF
CRLF = CR LF
record = *commentline agentline *(commentline | agentline)
         1*ruleline *(commentline | ruleline)
agentline = "User-agent:" *space agent [comment] CRLF
ruleline = (disallowline | allowline | extension)
disallowline = "Disallow" ":" *space path [comment] CRLF
allowline = "Allow" ":" *space rpath [comment] CRLF
extension = token ":" *space value [comment] CRLF
value = <any CHAR except CR or LF or "#">
commentline = comment CRLF
comment = *blank "#" anychar
space = 1*(SP | HT)
rpath = "/" path
agent = token
anychar = <any CHAR except CR or LF>
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character
      (octets 0 - 31) and DEL (127)>
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
The syntax for "token" is taken from RFC 1945 [2], reproduced here for
convenience:
token = 1*<any CHAR except CTLs or tspecials>
tspecials = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
The syntax for "path" is defined in RFC 1808 [6], reproduced here for
convenience:
path = fsegment *( "/" segment )
fsegment = 1*pchar
segment = *pchar
pchar = uchar | ":" | "@" | "&" | "="
uchar = unreserved | escape
unreserved = alpha | digit | safe | extra
escape = "%" hex hex
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
"a" | "b" | "c" | "d" | "e" | "f"
alpha = lowalpha | hialpha
lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
"j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
"s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
"J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
"S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
"8" | "9"
safe = "$" | "-" | "_" | "." | "+"
extra = "!" | "*" | "'" | "(" | ")" | ","
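As a rough, non-normative companion to the grammar, the following
sketch classifies individual lines; the regular expressions only
approximate the BNF and are deliberately a little liberal about case,
in the spirit of section 5.2.

   # Illustrative sketch: classify one line against the grammar above. Not a
   # full validator; field names are matched case-insensitively on purpose.
   import re

   _AGENTLINE = re.compile(r'^User-agent:[ \t]*\S+', re.IGNORECASE)
   _RULELINE  = re.compile(r'^(Allow|Disallow):[ \t]*\S*', re.IGNORECASE)
   # token = any CHAR except CTLs or tspecials (RFC 1945, reproduced above)
   _EXTENSION = re.compile(r'^[^\x00-\x1f\x7f()<>@,;:\\"/\[\]?={} \t]+:[ \t]*[^\r\n#]*')
   _COMMENT   = re.compile(r'^[ \t]*#')

   def classify(line):
       line = line.rstrip('\r\n')
       if not line.strip():
           return 'blank'
       if _COMMENT.match(line):
           return 'commentline'
       if _AGENTLINE.match(line):
           return 'agentline'
       if _RULELINE.match(line):
           return 'ruleline'
       if _EXTENSION.match(line):
           return 'extension'
       return 'unrecognised'
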
3.4 Expiration
Robots should cache /robots.txt files, but if they do they must
periodically verify the cached copy is fresh before using its
contents.
Standard HTTP cache-control mechanisms can be used by both origin
server and robots to influence the caching of the /robots.txt file.
Specifically, robots should take note of the Expires header set by
the origin server.
If no cache-control directives are present, robots should default to
an expiry of 7 days.
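A non-normative caching sketch under these rules might look as
follows; the cache_expiry name and the assumption that the response
headers are available as a mapping are illustrative only.

   # Illustrative sketch: compute when a cached copy must be revalidated,
   # honouring an Expires header and falling back to the 7-day default
   # recommended above.
   from datetime import datetime, timedelta, timezone
   from email.utils import parsedate_to_datetime

   DEFAULT_TTL = timedelta(days=7)

   def cache_expiry(response_headers, fetched_at=None):
       fetched_at = fetched_at or datetime.now(timezone.utc)
       expires = response_headers.get("Expires")
       if expires:
           try:
               return parsedate_to_datetime(expires)
           except (TypeError, ValueError):
               pass                      # unparsable date: fall back to default
       return fetched_at + DEFAULT_TTL
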
4. Examples
This section contains an example of how a /robots.txt may be used.
A fictional site may have the following URLs:
http://www.fict.org/
http://www.fict.org/index.html
http://www.fict.org/robots.txt
http://www.fict.org/server.html
http://www.fict.org/services/fast.html
http://www.fict.org/services/slow.html
http://www.fict.org/orgo.gif
http://www.fict.org/org/about.html
http://www.fict.org/org/plans.html
http://www.fict.org/%7Ejim/jim.html
http://www.fict.org/%7Emak/mak.html
The site's /robots.txt may have specific rules for robots that
send an HTTP User-agent of "UnhipBot/0.1", "WebCrawler/3.0", and
"Excite/1.0", and a set of default rules:
# /robots.txt for http://www.fict.org/
# comments to webmaster@fict.org

User-agent: unhipbot
Disallow: /

User-agent: webcrawler
User-agent: excite
Disallow:

User-agent: *
Disallow: /org/plans.html
Allow: /org/
Allow: /serv
Allow: /~mak
Disallow: /
The following matrix shows which robots are allowed to access URLs:
                                          unhipbot   webcrawler   other
                                                     & excite
http://www.fict.org/                      No         Yes          No
http://www.fict.org/index.html            No         Yes          No
http://www.fict.org/robots.txt            Yes        Yes          Yes
http://www.fict.org/server.html           No         Yes          Yes
http://www.fict.org/services/fast.html    No         Yes          Yes
http://www.fict.org/services/slow.html    No         Yes          Yes
http://www.fict.org/orgo.gif              No         Yes          No
http://www.fict.org/org/about.html        No         Yes          Yes
http://www.fict.org/org/plans.html        No         Yes          No
http://www.fict.org/%7Ejim/jim.html       No         Yes          No
http://www.fict.org/%7Emak/mak.html       No         Yes          Yes
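For illustration, the matrix can be reproduced with the hypothetical
helpers sketched in sections 3.2 through 3.2.2; the is_allowed
function below is not normative, and it treats an empty Disallow
value as matching nothing, as the all-Yes column for webcrawler and
excite implies.

   # Illustrative sketch, reusing the hypothetical parse_records,
   # select_record and path_matches helpers from the earlier sketches.
   def is_allowed(records, name_token, url_path):
       if url_path == "/robots.txt":
           return True                      # the /robots.txt URL is always allowed
       record = select_record(records, name_token)
       if record is None:
           return True                      # no applicable record: access unlimited
       for field, value in record:
           if field not in ("allow", "disallow"):
               continue
           if not value:
               continue                     # empty path: treated as matching nothing
           if path_matches(value, url_path):
               return field == "allow"      # first matching rule wins
       return True                          # no rule matched: allowed by default

   # records = parse_records(text_of_robots_txt)
   # is_allowed(records, "webcrawler", "/org/plans.html")  -> True, as in the matrix
   # is_allowed(records, "unhipbot", "/index.html")        -> False
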
5. Notes for Implementors
5.1 Backwards Compatibility
Previous versions of this specification did not provide the Allow
line. The
introduction of the Allow line causes robots to behave slightly
differently under either specification:
If a /robots.txt contains an Allow which overrides a later occurring
Disallow, a robot ignoring Allow lines will not retrieve those
parts. This is considered acceptable because there is no requirement
for a robot to access URLs it is allowed to retrieve, and it is safe,
in that no URLs a Web site administrator wants to Disallow will be
allowed. It is expected this may in fact encourage robots to upgrade
compliance to the specification in this memo.
5.2 Interoperability
Implementors should pay particular attention to robustness when
parsing the /robots.txt file. Web site administrators who are not
aware of the /robots.txt mechanisms often notice repeated failing
requests for it in their log files, and react by putting up pages
asking "What are you looking for?".
As the majority of /robots.txt files are created with platform-
specific text editors, robots should be liberal in accepting files
with different end-of-line conventions, specifically CR and LF in
addition to CRLF.
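In Python, for instance, such liberal handling comes essentially for
free; the snippet below is only a sketch of that observation, not a
requirement of this memo.

   # str.splitlines() recognises CR, LF and CRLF, so the same parser accepts
   # files written with any of the common platform conventions.
   for sample in ("a: 1\nb: 2\n", "a: 1\r\nb: 2\r\n", "a: 1\rb: 2\r"):
       assert sample.splitlines() == ["a: 1", "b: 2"]
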
6. Security Considerations
There are a few risks in the method described here, which may affect
either origin server or robot.
Web site administrators must realise this method is voluntary, and
is not sufficient to guarantee some robots will not visit restricted
parts of the URL space. Failure to use proper authentication or other
restriction may result in exposure of restricted information. It is
even possible that the occurrence of paths in the /robots.txt file
may expose the existence of resources not otherwise linked to on the
site, which may aid people guessing URLs.
Robots need to be aware that the amount of resources spent on dealing
with the /robots.txt is a function of the file contents, which is not
under the control of the robot. For example, the contents may be
larger in size than the robot can deal with. To prevent denial-of-
service attacks, robots are therefore encouraged to place limits on
the resources spent on processing of /robots.txt.
The /robots.txt directives are retrieved and applied in separate,
possibly unauthenticated HTTP transactions, and it is possible that
one server can impersonate another or otherwise intercept a
/robots.txt, and provide a robot with false information. This
specification does not preclude authentication and encryption
from being employed to increase security.
7. Acknowledgements
The author would like to thank the subscribers to the robots mailing
list for their contributions to this specification.
8. References
[1] Koster, M., "A Standard for Robot Exclusion",
http://info.webcrawler.com/mak/projects/robots/norobots.html,
June 1994.
[2] Berners-Lee, T., Fielding, R., and Frystyk, H., "Hypertext
Transfer Protocol -- HTTP/1.0." RFC 1945, MIT/LCS, May 1996.
[3] Postel, J., "Media Type Registration Procedure." RFC 1590,
USC/ISI, March 1994.
[4] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
Resource Locators (URL)", RFC 1738, CERN, Xerox PARC,
University of Minnesota, December 1994.
[5] Crocker, D., "Standard for the Format of ARPA Internet Text
Messages", STD 11, RFC 822, UDEL, August 1982.
[6] Fielding, R., "Relative Uniform Resource Locators", RFC 1808,
UC Irvine, June 1995.
9. Author's Address
Martijn Koster
WebCrawler
America Online
690 Fifth Street
San Francisco
CA 94107
Phone: 415-3565431
EMail: m.koster@webcrawler.com