
robotstxt.js

robotstxt.js is a lightweight JavaScript library for parsing robots.txt files. It provides a standards-compliant parser that works in both browser and Node.js environments.

Directives

  • Clean-param
  • Host
  • Sitemap
  • User-agent
    • Allow
    • Disallow
    • Crawl-delay
    • Cache-delay
    • Comment
    • NoIndex
    • Request-rate
    • Robot-version
    • Visit-time
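
The snippet below is a minimal sketch of a robots.txt file that exercises these directives, parsed with the robotstxt() entry point shown in the Usage section below; the specific values (crawler name, request rate, time window) are illustrative only.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

// Hypothetical robots.txt content exercising the directives listed above.
const content = `
User-agent: ExampleBot
Allow: /public
Disallow: /private
Crawl-delay: 10
Cache-delay: 20
Comment: Policy for ExampleBot
NoIndex: /drafts
Request-rate: 1/5
Robot-version: 2.0
Visit-time: 0600-0845

Host: example.com
Sitemap: https://example.com/sitemap.xml
Clean-param: sessionid /catalog
`;

const parser = robotstxt(content);
console.log(parser.isAllowed("/public/page", "ExampleBot")); // expected: true
console.log(parser.getSitemaps()); // ["https://example.com/sitemap.xml"]
```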

Benefits

  • Accurately parse and interpret robots.txt rules.
  • Ensure compliance with robots.txt standards to avoid accidental blocking of legitimate bots.
  • Easily check URL permissions for different user agents programmatically.
  • Simplify the process of working with robots.txt in JavaScript applications.

Usage

Here's how to use robotstxt.js to analyze robots.txt content and check crawler permissions.

Node.js

const { robotstxt } = require("@playfulsparkle/robotstxt-js")
...

JavaScript

```javascript
// Parse robots.txt content
const robotsTxtContent = `
User-Agent: GoogleBot
Allow: /public
Disallow: /private
Crawl-Delay: 5
Sitemap: https://example.com/sitemap.xml
`;

const parser = robotstxt(robotsTxtContent);

// Check URL permissions
console.log(parser.isAllowed("/public/data", "GoogleBot"));   // true
console.log(parser.isDisallowed("/private/admin", "GoogleBot")); // true

// Get specific user agent group
const googleBotGroup = parser.getGroup("googlebot"); // Case-insensitive
if (googleBotGroup) {
    console.log("Crawl Delay:", googleBotGroup.getCrawlDelay()); // 5
    console.log("Rules:", googleBotGroup.getRules().map(rule =>
        `${rule.type}: ${rule.path}`
    )); // ["allow: /public", "disallow: /private"]
}

// Get all sitemaps
console.log("Sitemaps:", parser.getSitemaps()); // ["https://example.com/sitemap.xml"]

// Check default rules (wildcard *)
console.log(parser.isAllowed("/protected", "*")); // true (if no wildcard rules exist)

Installation

NPM

npm i @playfulsparkle/robotstxt-js

Bower

bower install playfulsparkle/robotstxt.js

API Documentation

Core Methods

  • robotstxt(content: string): RobotsTxtParser - Creates a new parser instance with the provided robots.txt content.
  • getReports(): string[] - Get an array of parsing reports (errors, warnings, etc.).
  • isAllowed(url: string, userAgent: string): boolean - Check if a URL is allowed for the specified user agent (throws if parameters are missing).
  • isDisallowed(url: string, userAgent: string): boolean - Check if a URL is disallowed for the specified user agent (throws if parameters are missing).
  • getGroup(userAgent: string): Group | undefined - Get the rules group for a specific user agent (case-insensitive match).
  • getSitemaps(): string[] - Get an array of discovered sitemap URLs from Sitemap directives.
  • getCleanParams(): string[] - Retrieve Clean-param directives for URL parameter sanitization.
  • getHost(): string | undefined - Get canonical host declaration for domain normalization.
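
As a quick example, the core methods can be combined as in the sketch below; the exact shape of the getCleanParams() and getReports() return values is not documented here, so those comments are indicative only.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: *
Disallow: /admin
Host: example.com
Sitemap: https://example.com/sitemap.xml
Clean-param: ref /articles
`);

console.log(parser.isAllowed("/blog/post", "MyBot"));      // true
console.log(parser.isDisallowed("/admin/login", "MyBot")); // true
console.log(parser.getSitemaps());    // ["https://example.com/sitemap.xml"]
console.log(parser.getHost());        // "example.com"
console.log(parser.getCleanParams()); // Clean-param entries (format depends on the parser)
console.log(parser.getReports());     // parse errors/warnings collected during parsing, if any
```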

Group Methods (via getGroup() result)

User Agent Info

  • getName(): string - User agent name for this group.
  • getComment(): string[] - Associated comments from Comment directives.
  • getRobotVersion(): string | undefined - Robots.txt specification version.
  • getVisitTime(): string | undefined - Recommended crawl time window.

Crawl Management

  • getCacheDelay(): number | undefined - Cache delay in seconds.
  • getCrawlDelay(): number | undefined - Crawl delay in seconds.
  • getRequestRates(): string[] - Request rate limitations.
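
A minimal sketch of reading a group's user agent info and crawl management values, assuming a robots.txt that sets each directive; the comments show the values as written in the file, which is an assumption about how the parser returns them.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: ExampleBot
Comment: Example crawler policy
Robot-version: 2.0
Visit-time: 0600-0845
Crawl-delay: 5
Cache-delay: 10
Request-rate: 1/5
Disallow: /private
`);

const group = parser.getGroup("examplebot"); // case-insensitive lookup
if (group) {
    console.log(group.getName());         // "ExampleBot"
    console.log(group.getComment());      // ["Example crawler policy"]
    console.log(group.getRobotVersion()); // "2.0"
    console.log(group.getVisitTime());    // "0600-0845"
    console.log(group.getCrawlDelay());   // 5
    console.log(group.getCacheDelay());   // 10
    console.log(group.getRequestRates()); // ["1/5"]
}
```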

Rule Access

  • getRules(): Rule[] - All rules (allow/disallow/noindex) for this group.
  • addRule(type: string, path: string): void - Add a rule to the group (throws if type or path is missing).
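
The rule accessors can be combined as sketched below; whether a rule added with addRule() immediately affects isDisallowed() is an assumption, not something documented above.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: ExampleBot
Allow: /public
Disallow: /private
`);

const group = parser.getGroup("ExampleBot");
if (group) {
    // Inspect the parsed rules for this group.
    group.getRules().forEach(rule => {
        console.log(`${rule.type}: ${rule.path}`); // "allow: /public", "disallow: /private"
    });

    // Add a rule programmatically; throws if type or path is missing.
    group.addRule("disallow", "/tmp");
    console.log(parser.isDisallowed("/tmp/cache", "ExampleBot")); // assumed: true after adding the rule
}
```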

Specification Support

Full Support

  • User-agent groups and inheritance
  • Allow/Disallow directives
  • Wildcard pattern matching (*)
  • End-of-path matching ($)
  • Crawl-delay directives
  • Sitemap discovery
  • Case-insensitive matching
  • Default user-agent (*) handling
  • Multiple user-agent declarations
  • Rule precedence by specificity
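
The wildcard, end-of-path, and precedence behaviour can be exercised with isAllowed()/isDisallowed(); the expected results in the comments below follow standard Robots Exclusion Protocol pattern semantics.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: *
Disallow: /search*
Allow: /search/help
Disallow: /*.pdf$
`);

// '*' matches any sequence of characters within the path.
console.log(parser.isDisallowed("/search/results", "AnyBot")); // expected: true

// The longer, more specific Allow rule wins over the shorter Disallow.
console.log(parser.isAllowed("/search/help", "AnyBot")); // expected: true

// '$' anchors the pattern to the end of the path.
console.log(parser.isDisallowed("/files/report.pdf", "AnyBot"));   // expected: true
console.log(parser.isAllowed("/files/report.pdf.html", "AnyBot")); // expected: true (".pdf" is not at the end)
```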

Support

Node.js

robotstxt.js runs on Node.js 6.x and later.

Browser Support

This library is written using modern JavaScript ES2015 (ES6) features. It is expected to work in the following browser versions and later:

Desktop Browsers

  • Chrome: 49
  • Edge: 13
  • Firefox: 45
  • Opera: 36
  • Safari: 14.1

Mobile Browsers

  • Chrome Android: 49
  • Firefox for Android: 45
  • Opera Android: 36
  • Safari on iOS: 14.5
  • Samsung Internet: 5.0
  • WebView Android: 49
  • WebView on iOS: 14.5

Other

  • Node.js: 6.13.0

Specifications

License

robotstxt.js is licensed under the terms of the BSD 3-Clause License.
