robotstxt.js is a lightweight JavaScript library for parsing robots.txt files. It provides a standards-compliant parser for both browser and Node.js environments.

Supported directives:
- Clean-param
- Host
- Sitemap
- User-agent
- Allow
- Disallow
- Crawl-delay
- Cache-delay
- Comment
- NoIndex
- Request-rate
- Robot-version
- Visit-time
- Accurately parse and interpret `robots.txt` rules.
- Ensure compliance with robots.txt standards to avoid accidental blocking of legitimate bots.
- Easily check URL permissions for different user agents programmatically.
- Simplify the process of working with `robots.txt` in JavaScript applications.
Here's how to use robotstxt.js to analyze robots.txt content and check crawler permissions.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js")
...
```
### JavaScript
```javascript
// Parse robots.txt content
const robotsTxtContent = `
User-Agent: GoogleBot
Allow: /public
Disallow: /private
Crawl-Delay: 5
Sitemap: https://example.com/sitemap.xml
`;
const parser = robotstxt(robotsTxtContent);
// Check URL permissions
console.log(parser.isAllowed("/public/data", "GoogleBot")); // true
console.log(parser.isDisallowed("/private/admin", "GoogleBot")); // true
// Get specific user agent group
const googleBotGroup = parser.getGroup("googlebot"); // Case-insensitive
if (googleBotGroup) {
console.log("Crawl Delay:", googleBotGroup.getCrawlDelay()); // 5
console.log("Rules:", googleBotGroup.getRules().map(rule =>
`${rule.type}: ${rule.path}`
)); // ["allow: /public", "disallow: /private"]
}
// Get all sitemaps
console.log("Sitemaps:", parser.getSitemaps()); // ["https://example.com/sitemap.xml"]
// Check default rules (wildcard *)
console.log(parser.isAllowed("/protected", "*")); // true (if no wildcard rules exist)
```

Install via npm:

```bash
npm i @playfulsparkle/robotstxt-js
```

or via Bower:

```bash
bower install playfulsparkle/robotstxt.js
```
Parser methods:

- `robotstxt(content: string): RobotsTxtParser` - Creates a new parser instance from the provided robots.txt content.
- `getReports(): string[]` - Get an array of parsing errors, warnings, and other reports.
- `isAllowed(url: string, userAgent: string): boolean` - Check if a URL is allowed for the specified user agent (throws if parameters are missing).
- `isDisallowed(url: string, userAgent: string): boolean` - Check if a URL is disallowed for the specified user agent (throws if parameters are missing).
- `getGroup(userAgent: string): Group | undefined` - Get the rules group for a specific user agent (case-insensitive match).
- `getSitemaps(): string[]` - Get an array of sitemap URLs discovered from Sitemap directives.
- `getCleanParams(): string[]` - Retrieve Clean-param directives for URL parameter sanitization.
- `getHost(): string | undefined` - Get the canonical host declaration for domain normalization.
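As a quick, hedged illustration of the parser-level methods above, the sketch below inspects a hypothetical robots.txt string containing Host, Clean-param, and Sitemap directives. The exact shape of the strings returned by `getReports()` and `getCleanParams()` is assumed here, not confirmed by the documentation.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

// Hypothetical robots.txt content exercising the parser-level getters
const content = `
User-Agent: *
Disallow: /tmp
Host: example.com
Clean-param: sessionid /forum/
Sitemap: https://example.com/sitemap.xml
`;

const parser = robotstxt(content);

// Surface any errors or warnings collected while parsing
parser.getReports().forEach(report => console.warn(report));

console.log(parser.getHost());        // e.g. "example.com"
console.log(parser.getCleanParams()); // e.g. ["sessionid /forum/"]
console.log(parser.getSitemaps());    // ["https://example.com/sitemap.xml"]
```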
Group methods:

- `getName(): string` - User agent name for this group.
- `getComment(): string[]` - Comments associated with the group via the Comment directive.
- `getRobotVersion(): string | undefined` - Robots.txt specification version from the Robot-version directive.
- `getVisitTime(): string | undefined` - Recommended crawl time window from the Visit-time directive.
- `getCacheDelay(): number | undefined` - Cache delay in seconds.
- `getCrawlDelay(): number | undefined` - Crawl delay in seconds.
- `getRequestRates(): string[]` - Request-rate limitations.
- `getRules(): Rule[]` - All rules (allow/disallow/noindex) for this group.
- `addRule(type: string, path: string): void` - Add a rule to the group (throws if the type or path is missing).
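A sketch of the group-level accessors follows. It assumes Visit-time and Request-rate values come back as the raw directive strings, and that `addRule` accepts the same lowercase rule types that appear in `getRules()` output; adjust for the library's actual behavior.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-Agent: ExampleBot
Allow: /docs
Crawl-Delay: 2
Visit-time: 0600-0845
Request-rate: 1/10
`);

const group = parser.getGroup("examplebot"); // case-insensitive lookup
if (group) {
  console.log(group.getName());         // e.g. "ExampleBot"
  console.log(group.getCrawlDelay());   // 2
  console.log(group.getVisitTime());    // e.g. "0600-0845"
  console.log(group.getRequestRates()); // e.g. ["1/10"]

  // Programmatically add another rule to this group
  group.addRule("disallow", "/tmp");
  console.log(group.getRules().map(rule => `${rule.type}: ${rule.path}`));
  // e.g. ["allow: /docs", "disallow: /tmp"]
}
```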
- User-agent groups and inheritance
- Allow/Disallow directives
- Wildcard pattern matching (`*`)
- End-of-path matching (`$`)
- Crawl-delay directives
- Sitemap discovery
- Case-insensitive matching
- Default user-agent (`*`) handling
- Multiple user-agent declarations
- Rule precedence by specificity
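To make the wildcard (`*`) and end-of-path (`$`) matching listed above concrete, here is a small sketch; the results in the comments follow the usual robots.txt matching semantics and are noted as expectations rather than guaranteed output.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-Agent: *
Disallow: /private/*.pdf$
`);

// "*" matches any sequence of characters; "$" anchors the pattern to the end of the path
console.log(parser.isAllowed("/private/report.pdf", "SomeBot"));  // expected false: "*" covers "report" and the path ends in ".pdf"
console.log(parser.isAllowed("/private/report.pdfx", "SomeBot")); // expected true: the "$" anchor prevents a match
```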
robotstxt.js runs in all active Node.js versions (6.x+).
This library is written using modern JavaScript ES2015 (ES6) features. It is expected to work in the following browser versions and later:
| Browser | Minimum Supported Version |
| --- | --- |
| **Desktop Browsers** | |
| Chrome | 49 |
| Edge | 13 |
| Firefox | 45 |
| Opera | 36 |
| Safari | 14.1 |
| **Mobile Browsers** | |
| Chrome Android | 49 |
| Firefox for Android | 45 |
| Opera Android | 36 |
| Safari on iOS | 14.5 |
| Samsung Internet | 5.0 |
| WebView Android | 49 |
| WebView on iOS | 14.5 |
| **Other** | |
| Node.js | 6.13.0 |
- Google robots.txt specifications
- Yandex robots.txt specifications
- Sean Conner: "An Extended Standard for Robot Exclusion"
- Martijn Koster: "A Method for Web Robots Control"
- Martijn Koster: "A Standard for Robot Exclusion"
- RFC 7231, 2616
- RFC 7230, 2616
- RFC 5322, 2822, 822
- RFC 3986, 1808
- RFC 1945
- RFC 1738
- RFC 952
robotstxt.js is licensed under the terms of the BSD 3-Clause License.