
robotstxt.js

robotstxt.js is a lightweight JavaScript library for parsing robots.txt files. It provides a standards-compliant parser that works in both browser and Node.js environments.

Directives

  • Clean-param
  • Host
  • Sitemap
  • User-agent
    • Allow
    • Disallow
    • Crawl-delay
    • Cache-delay
    • Comment
    • NoIndex
    • Request-rate
    • Robot-version
    • Visit-time
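
The snippet below is a minimal sketch of a robots.txt file that exercises these directives, parsed with the robotstxt() entry point shown in the Usage section below; the specific values (crawler name, request rate, time window) are illustrative only.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

// Hypothetical robots.txt content exercising the directives listed above.
const content = `
User-agent: ExampleBot
Allow: /public
Disallow: /private
Crawl-delay: 10
Cache-delay: 20
Comment: Policy for ExampleBot
NoIndex: /drafts
Request-rate: 1/5
Robot-version: 2.0
Visit-time: 0600-0845

Host: example.com
Sitemap: https://example.com/sitemap.xml
Clean-param: sessionid /catalog
`;

const parser = robotstxt(content);
console.log(parser.isAllowed("/public/page", "ExampleBot")); // expected: true
console.log(parser.getSitemaps()); // ["https://example.com/sitemap.xml"]
```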

Benefits

  • Accurately parse and interpret robots.txt rules.
  • Ensure compliance with robots.txt standards to avoid accidental blocking of legitimate bots.
  • Easily check URL permissions for different user agents programmatically.
  • Simplify the process of working with robots.txt in JavaScript applications.

Usage

Here's how to use robotstxt.js to analyze robots.txt content and check crawler permissions.

Node.js

const { robotstxt } = require("@playfulsparkle/robotstxt-js")
...

JavaScript

```javascript
// Parse robots.txt content
const robotsTxtContent = `
User-Agent: GoogleBot
Allow: /public
Disallow: /private
Crawl-Delay: 5
Sitemap: https://example.com/sitemap.xml
`;

const parser = robotstxt(robotsTxtContent);

// Check URL permissions
console.log(parser.isAllowed("/public/data", "GoogleBot"));   // true
console.log(parser.isDisallowed("/private/admin", "GoogleBot")); // true

// Get specific user agent group
const googleBotGroup = parser.getGroup("googlebot"); // Case-insensitive
if (googleBotGroup) {
    console.log("Crawl Delay:", googleBotGroup.getCrawlDelay()); // 5
    console.log("Rules:", googleBotGroup.getRules().map(rule =>
        `${rule.type}: ${rule.path}`
    )); // ["allow: /public", "disallow: /private"]
}

// Get all sitemaps
console.log("Sitemaps:", parser.getSitemaps()); // ["https://example.com/sitemap.xml"]

// Check default rules (wildcard *)
console.log(parser.isAllowed("/protected", "*")); // true (if no wildcard rules exist)

Installation

NPM

npm i @playfulsparkle/robotstxt-js

Bower

bower install playfulsparkle/robotstxt.js

API Documentation

Core Methods

  • robotstxt(content: string): RobotsTxtParser - Creates a new parser instance with the provided robots.txt content.
  • getReports(): string[] - Get an array of parsing reports (errors, warnings, etc.).
  • isAllowed(url: string, userAgent: string): boolean - Check if a URL is allowed for the specified user agent (throws if parameters are missing).
  • isDisallowed(url: string, userAgent: string): boolean - Check if a URL is disallowed for the specified user agent (throws if parameters are missing).
  • getGroup(userAgent: string): Group | undefined - Get the rules group for a specific user agent (case-insensitive match).
  • getSitemaps(): string[] - Get an array of discovered sitemap URLs from Sitemap directives.
  • getCleanParams(): string[] - Retrieve Clean-param directives for URL parameter sanitization.
  • getHost(): string | undefined - Get canonical host declaration for domain normalization.
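
As a quick example, the core methods can be combined as in the sketch below; the exact shape of the getCleanParams() and getReports() return values is not documented here, so those comments are indicative only.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: *
Disallow: /admin
Host: example.com
Sitemap: https://example.com/sitemap.xml
Clean-param: ref /articles
`);

console.log(parser.isAllowed("/blog/post", "MyBot"));      // true
console.log(parser.isDisallowed("/admin/login", "MyBot")); // true
console.log(parser.getSitemaps());    // ["https://example.com/sitemap.xml"]
console.log(parser.getHost());        // "example.com"
console.log(parser.getCleanParams()); // Clean-param entries (format depends on the parser)
console.log(parser.getReports());     // parse errors/warnings collected during parsing, if any
```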

Group Methods (via getGroup() result)

User Agent Info

  • getName(): string - User agent name for this group.
  • getComment(): string[] - Associated comments from Comment directives.
  • getRobotVersion(): string | undefined - Robots.txt specification version.
  • getVisitTime(): string | undefined - Recommended crawl time window.

Crawl Management

  • getCacheDelay(): number | undefined - Cache delay in seconds.
  • getCrawlDelay(): number | undefined - Crawl delay in seconds.
  • getRequestRates(): string[] - Request rate limitations.
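
A minimal sketch of reading a group's user agent info and crawl management values, assuming a robots.txt that sets each directive; the comments show the values as written in the file, which is an assumption about how the parser returns them.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: ExampleBot
Comment: Example crawler policy
Robot-version: 2.0
Visit-time: 0600-0845
Crawl-delay: 5
Cache-delay: 10
Request-rate: 1/5
Disallow: /private
`);

const group = parser.getGroup("examplebot"); // case-insensitive lookup
if (group) {
    console.log(group.getName());         // "ExampleBot"
    console.log(group.getComment());      // ["Example crawler policy"]
    console.log(group.getRobotVersion()); // "2.0"
    console.log(group.getVisitTime());    // "0600-0845"
    console.log(group.getCrawlDelay());   // 5
    console.log(group.getCacheDelay());   // 10
    console.log(group.getRequestRates()); // ["1/5"]
}
```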

Rule Access

  • getRules(): Rule[] - All rules (allow/disallow/noindex) for this group.
  • addRule(type: string, path: string): void - Add a rule to the group (throws if type or path is missing).
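
The rule accessors can be combined as sketched below; whether a rule added with addRule() immediately affects isDisallowed() is an assumption, not something documented above.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: ExampleBot
Allow: /public
Disallow: /private
`);

const group = parser.getGroup("ExampleBot");
if (group) {
    // Inspect the parsed rules for this group.
    group.getRules().forEach(rule => {
        console.log(`${rule.type}: ${rule.path}`); // "allow: /public", "disallow: /private"
    });

    // Add a rule programmatically; throws if type or path is missing.
    group.addRule("disallow", "/tmp");
    console.log(parser.isDisallowed("/tmp/cache", "ExampleBot")); // assumed: true after adding the rule
}
```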

Specification Support

Full Support

  • User-agent groups and inheritance
  • Allow/Disallow directives
  • Wildcard pattern matching (*)
  • End-of-path matching ($)
  • Crawl-delay directives
  • Sitemap discovery
  • Case-insensitive matching
  • Default user-agent (*) handling
  • Multiple user-agent declarations
  • Rule precedence by specificity
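
The wildcard, end-of-path, and precedence behaviour can be exercised with isAllowed()/isDisallowed(); the expected results in the comments below follow standard Robots Exclusion Protocol pattern semantics.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: *
Disallow: /search*
Allow: /search/help
Disallow: /*.pdf$
`);

// '*' matches any sequence of characters within the path.
console.log(parser.isDisallowed("/search/results", "AnyBot")); // expected: true

// The longer, more specific Allow rule wins over the shorter Disallow.
console.log(parser.isAllowed("/search/help", "AnyBot")); // expected: true

// '$' anchors the pattern to the end of the path.
console.log(parser.isDisallowed("/files/report.pdf", "AnyBot"));   // expected: true
console.log(parser.isAllowed("/files/report.pdf.html", "AnyBot")); // expected: true (".pdf" is not at the end)
```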

Support

Node.js

robotstxt.js runs on Node.js 6.x and later.

Browser Support

This library is written using modern JavaScript ES2015 (ES6) features. It is expected to work in the following browser versions and later:

Desktop Browsers

  • Chrome: 49
  • Edge: 13
  • Firefox: 45
  • Opera: 36
  • Safari: 14.1

Mobile Browsers

  • Chrome Android: 49
  • Firefox for Android: 45
  • Opera Android: 36
  • Safari on iOS: 14.5
  • Samsung Internet: 5.0
  • WebView Android: 49
  • WebView on iOS: 14.5

Other

  • Node.js: 6.13.0

Specifications

License

robotstxt.js is licensed under the terms of the BSD 3-Clause License.
