A Method for Web Data Discovery
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are
working documents of the Internet Engineering Task Force
(IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
``work in progress.''
To learn the current status of any Internet-Draft, please
check the ``1id-abstracts.txt'' listing contained in the
Internet-Drafts Shadow Directories on ftp.is.co.za (Africa),
nic.nordu.net (Europe), munnari.oz.au (Pacific Rim),
ds.internic.net (US East Coast), or ftp.isi.edu (US West
Coast).
Table of Contents
1. Abstract
2. Introduction
3. The Specification
3.1 Access method
3.2 File Format Description
3.2.1 The User-agent line
3.2.2 The Allow and Disallow lines
3.3 Formal Syntax
3.4 Expiration
4. Examples
5. Notes for Implementors
5.1 Backwards Compatibility
5.2 Interoperability
6. Security Considerations
7. Acknowledgements
8. References
9. Author's Address
1. Abstract
This memo defines a method for administrators of sites on the World-
Wide Web to allow automated tools to easily discover paths to
datasets and describe their contents.
2. Introduction
Data Catalogs have been embraced by government agencies, nonprofit
organizations, and for-profit organizations that aim to provide
open access to data and transparency. A data catalog often
consists of a web interface for searching and downloading datasets;
however, it is hard to automate the discovery and extraction of
datasets since each catalog implements a different set of endpoints
and protocols.
The technique specified in this memo allows Web site administrators
to indicate to visiting humans and robots where datasets are
located within their site.
It is solely up to the visitor to consult this information and act
accordingly: by searching the index, a visitor can render the
metadata associated with each dataset and download the data without
having to rely on complex methods to infer whether a given page
within the site contains datasets.
3. The Specification
This memo specifies a format for exposing datasets to Web site
visitors, and specifies an access method to retrieve these
instructions. Visitors can then choose to provide custom data
catalogs that support one or many Web sites, through a user
interface or by automated methods.
3.1 Access method
The instructions must be accessible via HTTP [2] from the site that
the instructions are to be applied to, as a resource of Internet
Media Type [3] "text/plain" under a standard relative path on the
server: "/data.txt".
For convenience we will refer to this resource as the "/data.txt
file", though the resource need in fact not originate from a file-
system.
Some examples of URLs [4] for sites and of the URLs for the
corresponding "/data.txt" resources:

https://datatxt.org/              https://datatxt.org/data.txt
http://www.bar.com:8001/          http://www.bar.com:8001/data.txt
If the server response indicates Success (an HTTP 2xx Status Code),
the visitor can read the content, parse it, and present an
appropriate user interface or automated method to process the
Web site as a source for a given data catalog.
If the server response indicates the resource does not exist (HTTP
Status Code 404), the visitor should assume the Web site may still
contain datasets, which must then be discovered by other methods.
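For illustration only, the exchange above can be sketched in a few
lines of Python; the fetch_data_txt helper and the use of urllib are
illustrative choices and not part of this specification.

   # Illustrative sketch only: retrieve /data.txt from a site and dispatch
   # on the HTTP status code as described in this section.
   import urllib.error
   import urllib.request

   def fetch_data_txt(site_url):
       """Return the /data.txt body as text, or None if the resource is absent."""
       url = site_url.rstrip("/") + "/data.txt"
       try:
           with urllib.request.urlopen(url) as resp:
               # Success (2xx): read and decode the plain-text body.
               return resp.read().decode("utf-8", errors="replace")
       except urllib.error.HTTPError as err:
           if err.code == 404:
               # Absent: datasets may still exist, discovered by other methods.
               return None
           raise

   # body = fetch_data_txt("https://datatxt.org/")
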
3.2 File Format Description
The instructions are encoded as a formatted plain text object,
described here. A complete BNF-like description of the syntax of this
format is given in section 3.3.
The format logically consists of a non-empty set of records,
separated by blank lines. The records consist of a set of lines of
the form:
<Field> ":" <value>
In this memo we refer to lines with a Field "foo" as "foo lines".
The record starts with one or more User-agent lines, specifying
which robots the record applies to, followed by "Disallow" and
"Allow" instructions to that robot. For example:
User-agent: webcrawler
User-agent: infoseek
Allow: /tmp/ok.html
Disallow: /tmp
Disallow: /user/foo
These lines are discussed separately below.
Lines with Fields not explicitly specified by this specification
may occur in the /robots.txt, allowing for future extension of the
format. Consult the BNF for restrictions on the syntax of such
extensions. Note specifically that for backwards compatibility
with robots implementing earlier versions of this specification,
breaking of lines is not allowed.
Comments are allowed anywhere in the file, and consist of optional
whitespace, followed by a comment character '#' followed by the
comment, terminated by the end-of-line.
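As a minimal, non-normative sketch of a parser for this format, the
following Python function splits the object into blank-line-separated
records of (field, value) pairs; the parse_records name and the
chosen data structure are illustrative assumptions.

   # Illustrative sketch: split the plain-text object into records separated
   # by blank lines; each record is a list of (field, value) pairs. A '#'
   # starts a comment that runs to end-of-line.
   def parse_records(text):
       records, current = [], []
       for raw in text.splitlines():           # accepts CR, LF and CRLF endings
           if not raw.strip():                 # a truly blank line ends the record
               if current:
                   records.append(current)
                   current = []
               continue
           line = raw.split("#", 1)[0].strip() # strip a trailing '#' comment
           if not line:                        # comment-only line: ignore
               continue
           if ":" in line:
               field, value = line.split(":", 1)
               current.append((field.strip().lower(), value.strip()))
       if current:
           records.append(current)
       return records
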
3.2.1 The User-agent line
Name tokens are used to allow robots to identify themselves via a
simple product token. Name tokens should be short and to the
point. The name token a robot chooses for itself should be sent
as part of the HTTP User-agent header, and must be well documented.
These name tokens are used in User-agent lines in /robots.txt to
identify to which specific robots the record applies. The robot
must obey the first record in /robots.txt that contains a User-
Agent line whose value contains the name token of the robot as a
substring. The name comparisons are case-insensitive. If no such
record exists, it should obey the first record with a User-agent
line with a "*" value, if present. If no record satisfies either
condition, or no records are present at all, access is unlimited.
For example, a fictional company FigTree Search Services, which
names its robot "Fig Tree", might send HTTP requests like:
GET / HTTP/1.0
User-agent: FigTree/0.1 Robot libwww-perl/5.04
might scan the "/robots.txt" file for records with:
User-agent: figtree
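The selection rules above can be sketched as follows; the
select_record helper is hypothetical and assumes records in the
(field, value) form produced by the parsing sketch in section 3.2.

   # Illustrative sketch: choose the record that applies to a robot, following
   # the rules above: the first record whose User-agent value contains the
   # robot's name token as a case-insensitive substring wins; otherwise the
   # first record with a "*" User-agent; otherwise None (access is unlimited).
   def select_record(records, name_token):
       token = name_token.lower()
       star_record = None
       for record in records:
           agents = [value for field, value in record if field == "user-agent"]
           if any(token in agent.lower() for agent in agents):
               return record
           if star_record is None and any(agent == "*" for agent in agents):
               star_record = record
       return star_record
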
3.2.2 The Allow and Disallow lines
These lines indicate whether accessing a URL that matches the
corresponding path is allowed or disallowed. Note that these
instructions apply to any HTTP method on a URL.
To evaluate if access to a URL is allowed, a robot must attempt to
match the paths in Allow and Disallow lines against the URL, in the
order they occur in the record. The first match found is used. If no
match is found, the default assumption is that the URL is allowed.
The /robots.txt URL is always allowed, and must not appear in the
Allow/Disallow rules.
The matching process compares every octet in the path portion of
the URL and the path from the record. If a %xx encoded octet is
encountered it is unencoded prior to comparison, unless it is the
"/" character, which has special meaning in a path. The match
evaluates positively if and only if the end of the path from the
record is reached before a difference in octets is encountered.
This table illustrates some examples:
Record Path           URL path              Matches
/tmp                  /tmp                  yes
/tmp                  /tmp.html             yes
/tmp                  /tmp/a.html           yes
/tmp/                 /tmp                  no
/tmp/                 /tmp/                 yes
/tmp/                 /tmp/a.html           yes
/a%3cd.html           /a%3cd.html           yes
/a%3Cd.html           /a%3cd.html           yes
/a%3cd.html           /a%3Cd.html           yes
/a%3Cd.html           /a%3Cd.html           yes
/a%2fb.html           /a%2fb.html           yes
/a%2fb.html           /a/b.html             no
/a/b.html             /a%2fb.html           no
/a/b.html             /a/b.html             yes
/%7ejoe/index.html    /~joe/index.html      yes
/~joe/index.html      /%7Ejoe/index.html    yes
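The matching process can be sketched as follows; the helper names are
hypothetical, and the sketch simply reproduces the behaviour shown in
the table, with "%2F" the only octet left encoded.

   # Illustrative sketch: octet-wise prefix match of a record path against a
   # URL path, decoding %xx escapes on both sides except "%2F", since "/"
   # has special meaning in a path.
   def _normalize(path):
       out, i = [], 0
       while i < len(path):
           if path[i] == "%" and i + 2 < len(path):
               hexpair = path[i + 1:i + 3]
               if hexpair.lower() == "2f":
                   out.append("%2f")              # keep encoded "/" distinct from "/"
               else:
                   try:
                       out.append(chr(int(hexpair, 16)))
                   except ValueError:
                       out.append(path[i:i + 3])  # malformed escape: leave as-is
               i += 3
           else:
               out.append(path[i])
               i += 1
       return "".join(out)

   def path_matches(record_path, url_path):
       """True if the record path is an octet-wise prefix of the URL path."""
       return _normalize(url_path).startswith(_normalize(record_path))

   # path_matches("/tmp", "/tmp/a.html")  -> True, as in the table
   # path_matches("/tmp/", "/tmp")        -> False
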
3.3 Formal Syntax
This is a BNF-like description, using the conventions of RFC 822 [5],
except that "|" is used to designate alternatives. Briefly, literals
are quoted with "", parentheses "(" and ")" are used to group
elements, optional elements are enclosed in [brackets], and elements
may be preceded with <n>* to designate n or more repetitions of the
following element; n defaults to 0.
robotstxt = *blankcomment
          | *blankcomment record *( 1*commentblank 1*record )
            *blankcomment
blankcomment = 1*(blank | commentline)
commentblank = *commentline blank *(blankcomment)
blank = *space CRLF
CRLF = CR LF
record = *commentline agentline *(commentline | agentline)
         1*ruleline *(commentline | ruleline)
agentline = "User-agent:" *space agent [comment] CRLF
ruleline = (disallowline | allowline | extension)
disallowline = "Disallow" ":" *space path [comment] CRLF
allowline = "Allow" ":" *space rpath [comment] CRLF
extension = token ":" *space value [comment] CRLF
value = <any CHAR except CR or LF or "#">
commentline = comment CRLF
comment = *blank "#" anychar
space = 1*(SP | HT)
rpath = "/" path
agent = token
anychar = <any CHAR except CR or LF>
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character
      (octets 0 - 31) and DEL (127)>
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
The syntax for "token" is taken from RFC 1945 [2], reproduced here for
convenience:
token = 1*<any CHAR except CTLs or tspecials>
tspecials = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
The syntax for "path" is defined in RFC 1808 [6], reproduced here for
convenience:
path = fsegment *( "/" segment )
fsegment = 1*pchar
segment = *pchar
pchar = uchar | ":" | "@" | "&" | "="
uchar = unreserved | escape
unreserved = alpha | digit | safe | extra
escape = "%" hex hex
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
"a" | "b" | "c" | "d" | "e" | "f"
alpha = lowalpha | hialpha
lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
"j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
"s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
"J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
"S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
"8" | "9"
safe = "$" | "-" | "_" | "." | "+"
extra = "!" | "*" | "'" | "(" | ")" | ","
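As a rough, non-normative companion to the grammar, the following
sketch classifies individual lines; the regular expressions only
approximate the BNF and are deliberately a little liberal about case,
in the spirit of section 5.2.

   # Illustrative sketch: classify one line against the grammar above. Not a
   # full validator; field names are matched case-insensitively on purpose.
   import re

   _AGENTLINE = re.compile(r'^User-agent:[ \t]*\S+', re.IGNORECASE)
   _RULELINE  = re.compile(r'^(Allow|Disallow):[ \t]*\S*', re.IGNORECASE)
   # token = any CHAR except CTLs or tspecials (RFC 1945, reproduced above)
   _EXTENSION = re.compile(r'^[^\x00-\x1f\x7f()<>@,;:\\"/\[\]?={} \t]+:[ \t]*[^\r\n#]*')
   _COMMENT   = re.compile(r'^[ \t]*#')

   def classify(line):
       line = line.rstrip('\r\n')
       if not line.strip():
           return 'blank'
       if _COMMENT.match(line):
           return 'commentline'
       if _AGENTLINE.match(line):
           return 'agentline'
       if _RULELINE.match(line):
           return 'ruleline'
       if _EXTENSION.match(line):
           return 'extension'
       return 'unrecognised'
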
3.4 Expiration
Robots should cache /robots.txt files, but if they do they must
periodically verify the cached copy is fresh before using its
contents.
Standard HTTP cache-control mechanisms can be used by both origin
server and robots to influence the caching of the /robots.txt file.
Specifically, robots should take note of the Expires header set by
the origin server.
If no cache-control directives are present, robots should default to
an expiry of 7 days.
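A non-normative caching sketch under these rules might look as
follows; the cache_expiry name and the assumption that the response
headers are available as a mapping are illustrative only.

   # Illustrative sketch: compute when a cached copy must be revalidated,
   # honouring an Expires header and falling back to the 7-day default
   # recommended above.
   from datetime import datetime, timedelta, timezone
   from email.utils import parsedate_to_datetime

   DEFAULT_TTL = timedelta(days=7)

   def cache_expiry(response_headers, fetched_at=None):
       fetched_at = fetched_at or datetime.now(timezone.utc)
       expires = response_headers.get("Expires")
       if expires:
           try:
               return parsedate_to_datetime(expires)
           except (TypeError, ValueError):
               pass                      # unparsable date: fall back to default
       return fetched_at + DEFAULT_TTL
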
4. Examples
This section contains an example of how a /robots.txt may be used.
A fictional site may have the following URLs:
http://www.fict.org/
http://www.fict.org/index.html
http://www.fict.org/robots.txt
http://www.fict.org/server.html
http://www.fict.org/services/fast.html
http://www.fict.org/services/slow.html
http://www.fict.org/orgo.gif
http://www.fict.org/org/about.html
http://www.fict.org/org/plans.html
http://www.fict.org/%7Ejim/jim.html
http://www.fict.org/%7Emak/mak.html
The site's /robots.txt may have specific rules for robots that
send an HTTP User-agent of "UnhipBot/0.1", "WebCrawler/3.0", and
"Excite/1.0", and a set of default rules:
# /robots.txt for http://www.fict.org/
# comments to webmaster@fict.org

User-agent: unhipbot
Disallow: /

User-agent: webcrawler
User-agent: excite
Disallow:

User-agent: *
Disallow: /org/plans.html
Allow: /org/
Allow: /serv
Allow: /~mak
Disallow: /
The following matrix shows which robots are allowed to access URLs:
                                          unhipbot   webcrawler   other
                                                     & excite
http://www.fict.org/                      No         Yes          No
http://www.fict.org/index.html            No         Yes          No
http://www.fict.org/robots.txt            Yes        Yes          Yes
http://www.fict.org/server.html           No         Yes          Yes
http://www.fict.org/services/fast.html    No         Yes          Yes
http://www.fict.org/services/slow.html    No         Yes          Yes
http://www.fict.org/orgo.gif              No         Yes          No
http://www.fict.org/org/about.html        No         Yes          Yes
http://www.fict.org/org/plans.html        No         Yes          No
http://www.fict.org/%7Ejim/jim.html       No         Yes          No
http://www.fict.org/%7Emak/mak.html       No         Yes          Yes
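For illustration, the matrix can be reproduced with the hypothetical
helpers sketched in sections 3.2 through 3.2.2; the is_allowed
function below is not normative, and it treats an empty Disallow
value as matching nothing, as the all-Yes column for webcrawler and
excite implies.

   # Illustrative sketch, reusing the hypothetical parse_records,
   # select_record and path_matches helpers from the earlier sketches.
   def is_allowed(records, name_token, url_path):
       if url_path == "/robots.txt":
           return True                      # the /robots.txt URL is always allowed
       record = select_record(records, name_token)
       if record is None:
           return True                      # no applicable record: access unlimited
       for field, value in record:
           if field not in ("allow", "disallow"):
               continue
           if not value:
               continue                     # empty path: treated as matching nothing
           if path_matches(value, url_path):
               return field == "allow"      # first matching rule wins
       return True                          # no rule matched: allowed by default

   # records = parse_records(text_of_robots_txt)
   # is_allowed(records, "webcrawler", "/org/plans.html")  -> True, as in the matrix
   # is_allowed(records, "unhipbot", "/index.html")        -> False
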
5. Notes for Implementors
5.1 Backwards Compatibility
Previous versions of this specification did not provide the Allow
line. The
introduction of the Allow line causes robots to behave slightly
differently under either specification:
If a /robots.txt contains an Allow which overrides a later occurring
Disallow, a robot ignoring Allow lines will not retrieve those
parts. This is considered acceptable because there is no requirement
for a robot to access URLs it is allowed to retrieve, and it is safe,
in that no URLs a Web site administrator wants to Disallow will be
allowed. It is expected this may in fact encourage robots to upgrade
compliance to the specification in this memo.
5.2 Interoperability
Implementors should pay particular attention to robustness when
parsing the /robots.txt file. Web site administrators who are not
aware of the /robots.txt mechanisms often notice repeated failing
requests for it in their log files, and react by putting up pages
asking "What are you looking for?".
As the majority of /robots.txt files are created with platform-
specific text editors, robots should be liberal in accepting files
with different end-of-line conventions, specifically CR and LF in
addition to CRLF.
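In Python, for instance, such liberal handling comes essentially for
free; the snippet below is only a sketch of that observation, not a
requirement of this memo.

   # str.splitlines() recognises CR, LF and CRLF, so the same parser accepts
   # files written with any of the common platform conventions.
   for sample in ("a: 1\nb: 2\n", "a: 1\r\nb: 2\r\n", "a: 1\rb: 2\r"):
       assert sample.splitlines() == ["a: 1", "b: 2"]
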
6. Security Considerations
There are a few risks in the method described here, which may affect
either origin server or robot.
Web site administrators must realise this method is voluntary, and
is not sufficient to guarantee some robots will not visit restricted
parts of the URL space. Failure to use proper authentication or other
restriction may result in exposure of restricted information. It is
even possible that the occurrence of paths in the /robots.txt file
may expose the existence of resources not otherwise linked to on the
site, which may aid people guessing URLs.
Robots need to be aware that the amount of resources spent on dealing
with the /robots.txt is a function of the file contents, which is not
under the control of the robot. For example, the contents may be
larger in size than the robot can deal with. To prevent denial-of-
service attacks, robots are therefore encouraged to place limits on
the resources spent on processing of /robots.txt.
The /robots.txt directives are retrieved and applied in separate,
possibly unauthenticated HTTP transactions, and it is possible that
one server can impersonate another or otherwise intercept a
/robots.txt, and provide a robot with false information. This
specification does not preclude authentication and encryption
from being employed to increase security.
7. Acknowledgements
The author would like to thank the subscribers to the robots mailing
list for their contributions to this specification.
8. References
[1] Koster, M., "A Standard for Robot Exclusion",
http://info.webcrawler.com/mak/projects/robots/norobots.html,
June 1994.
[2] Berners-Lee, T., Fielding, R., and Frystyk, H., "Hypertext
Transfer Protocol -- HTTP/1.0." RFC 1945, MIT/LCS, May 1996.
[3] Postel, J., "Media Type Registration Procedure." RFC 1590,
USC/ISI, March 1994.
[4] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
Resource Locators (URL)", RFC 1738, CERN, Xerox PARC,
University of Minnesota, December 1994.
[5] Crocker, D., "Standard for the Format of ARPA Internet Text
Messages", STD 11, RFC 822, UDEL, August 1982.
[6] Fielding, R., "Relative Uniform Resource Locators", RFC 1808,
UC Irvine, June 1995.
9. Author's Address
Martijn Koster
WebCrawler
America Online
690 Fifth Street
San Francisco
CA 94107
Phone: 415-3565431
EMail: m.koster@webcrawler.com