Conversation

@MAVRICK-1

This pull request introduces a new robots.txt file to control how search engines crawl and index certain pages of the application. The main change is to restrict search engine access to specific query-based URLs while allowing access to the main resource pages.

Search engine crawling restrictions:

  • Added a new robots.txt file to disallow crawling of query-parameter URLs for datasets, tasks, flows, and runs (e.g., `/datasets?*`), helping prevent indexing of filtered or search-result pages.
  • Allowed crawling of the main resource listing pages without query parameters (e.g., `/datasets`), ensuring these pages remain indexable; a sketch of such a file follows below.
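
The exact contents of the file are not shown in this conversation, but a minimal sketch consistent with the description above might look like the following (the rules simply mirror the example paths given here; the real file may differ). In Next.js, a file placed at `public/robots.txt` is served from the site root as `/robots.txt`, so no extra routing is needed.

```
# Hypothetical sketch of public/robots.txt based on the PR description;
# the actual file added by this PR may differ.
User-agent: *
# Block the query-parameter (filtered / search-result) variants of the listing pages.
Disallow: /datasets?*
Disallow: /tasks?*
Disallow: /flows?*
Disallow: /runs?*
# Keep the plain listing pages crawlable.
Allow: /datasets
Allow: /tasks
Allow: /flows
Allow: /runs
```

The `*` wildcard is not part of the original robots exclusion standard, but it is honored by Google and most major crawlers, and under Google's longest-match rule the longer `Disallow` entries take precedence over the shorter `Allow` entries for query URLs.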

Fixes #335

This commit adds a robots.txt file to the public directory of the Next.js application.
The robots.txt file disallows crawling of search pages with query parameters to prevent web crawlers from getting stuck in crawler traps.

Fixes openml#335

@PGijsbers (Contributor) left a comment

Thanks for taking the time to contribute!

There are two issues with this update:

  • As noted in the original issue, the current dataset pages include queries (though they are under /search, so they wouldn't be blocked by this).
  • All pages are currently under the /search path, so this robots file configures paths which do not exist on the server and are not being crawled.

It's probably easier to hold off on this update until the new frontend, which has fixed URLs for datasets, is live, so that we can block all queries for crawlers, unless you have a suggestion.


Development

Successfully merging this pull request may close these issues.

Add additional filters to robots.txt to avoid crawler traps
