Conversation

@MAVRICK-1

This pull request introduces a new robots.txt file to control how search engines crawl and index certain pages of the application. The main change is to restrict search engine access to specific query-based URLs while allowing access to the main resource pages.

Search engine crawling restrictions:

  • Added a new robots.txt file to disallow crawling of query-parameter URLs for datasets, tasks, flows, and runs (e.g., `/datasets?*`), helping prevent indexing of filtered or search-result pages.
  • Allowed crawling of the main resource listing pages without query parameters (e.g., `/datasets`), ensuring these pages remain indexable; a sketch of such a file follows below.
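
The exact contents of the file are not shown in this conversation, but a minimal sketch consistent with the description above might look like the following (the rules simply mirror the example paths given here; the real file may differ). In Next.js, a file placed at `public/robots.txt` is served from the site root as `/robots.txt`, so no extra routing is needed.

```
# Hypothetical sketch of public/robots.txt based on the PR description;
# the actual file added by this PR may differ.
User-agent: *
# Block the query-parameter (filtered / search-result) variants of the listing pages.
Disallow: /datasets?*
Disallow: /tasks?*
Disallow: /flows?*
Disallow: /runs?*
# Keep the plain listing pages crawlable.
Allow: /datasets
Allow: /tasks
Allow: /flows
Allow: /runs
```

The `*` wildcard is not part of the original robots exclusion standard, but it is honored by Google and most major crawlers, and under Google's longest-match rule the longer `Disallow` entries take precedence over the shorter `Allow` entries for query URLs.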

Fixes #335

This commit adds a robots.txt file to the public directory of the Next.js application.
The robots.txt file disallows crawling of search pages with query parameters to prevent web crawlers from getting stuck in crawler traps.

Fixes openml#335

@PGijsbers (Contributor) left a comment

Thanks for taking the time to contribute!

There are two issues with this update:

  • As noted in the original issue, the current dataset pages include queries (though they are under /search, so they wouldn't be blocked by this).
  • All pages are currently under the /search path, so this robots file configures paths which do not exist on the server and are not being crawled.

It's probably easier to hold off on this update until the new frontend, which has fixed URLs for datasets, is live, so that we can block all queries for crawlers, unless you have a suggestion.


Development

Successfully merging this pull request may close these issues.

Add additional filters to robots.txt to avoid crawler traps
