@ANAMASGARD
Contributor

@ANAMASGARD ANAMASGARD commented Dec 13, 2025

Fixes #243

Add RF100 Dataset Catalog

Problem Solved

Users had no easy way to:

  • Browse all 34 RF100 datasets at once
  • Search for datasets by topic or keyword
  • Understand dataset characteristics before downloading

Example question that was impossible before:

"Is there a photovoltaic dataset in torchvision?"

Now it's easy:
search_rf100("photovoltaic") # Instant answer!

What's Included

Searchable Catalog

  • Complete metadata for all 34 RF100 datasets
  • Descriptions, sizes, splits, collections
  • Available as R data (rf100_catalog) and CSV

Search Functions

Search by keyword

search_rf100("cell")
search_rf100("solar")
search_rf100("medical")

Filter by collection

search_rf100(collection = "biology")
search_rf100(collection = "medical")
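
For readers curious how such keyword and collection filters can work, here is a minimal sketch in base R. It is not the code in R/rf100-catalog.R: the function name `search_catalog_sketch()`, the `toy_catalog` data frame, its three rows, and its column names are all hypothetical stand-ins chosen for illustration. It assumes only that the catalog is a data frame with a name, a collection, and a description per dataset.

```r
# Hypothetical toy catalog; the real rf100_catalog has more rows and columns.
toy_catalog <- data.frame(
  name = c("cells", "solar_panels", "bone_fracture"),
  collection = c("biology", "infrared", "medical"),
  description = c(
    "Cell detection in microscopy images",
    "Photovoltaic solar panel detection in infrared images",
    "Bone fracture detection in X-ray images"
  ),
  stringsAsFactors = FALSE
)

# Illustrative search helper (an assumption, not the package's implementation).
# `keyword` is matched case-insensitively against name and description;
# `collection` is matched exactly, ignoring case. Both filters are optional.
search_catalog_sketch <- function(catalog, keyword = NULL, collection = NULL) {
  keep <- rep(TRUE, nrow(catalog))
  if (!is.null(keyword)) {
    haystack <- paste(catalog$name, catalog$description)
    keep <- keep & grepl(keyword, haystack, ignore.case = TRUE)
  }
  if (!is.null(collection)) {
    keep <- keep & tolower(catalog$collection) == tolower(collection)
  }
  catalog[keep, , drop = FALSE]
}

search_catalog_sketch(toy_catalog, "photovoltaic")
search_catalog_sketch(toy_catalog, collection = "biology")
```

Building a single logical mask and subsetting once keeps the two filters composable, so a keyword and a collection can be combined in one call.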

Get complete catalog

catalog <- get_rf100_catalog()
View(catalog)

List datasets in a collection

list_rf100_datasets("biology")

Comprehensive Documentation

  • Vignette: vignette("rf100-datasets") - Complete catalog with examples
  • README: Updated with RF100 catalog section and quick start
  • Man pages: Full documentation for all functions

New Files

  • R/rf100-catalog.R - Search and catalog functions
  • data/rf100_catalog.rda - R data object
  • inst/extdata/rf100_catalog.csv - CSV export
  • data-raw/create_rf100_catalog.R - Generation script
  • vignettes/rf100-datasets.Rmd - Complete documentation
  • tests/testthat/test-rf100-catalog.R - Test suite (48 tests)
  • man/*.Rd - Function documentation (auto-generated)

Modified Files

  • README.md - Added RF100 catalog section
  • NAMESPACE - Added function exports (auto-generated)
  • DESCRIPTION - Added LazyData (auto-generated)

Collaborator

@cregouby cregouby left a comment

praise: Thanks for this well-documented dataset.
missing: Since you made a "task" column (which is currently constant), why not add the other collection (emnist_collection) to it?
suggestion: Rename the dataset and the new functions from rf100_* to collection_*.

@ANAMASGARD
Contributor Author

Sir @cregouby Thank you for the feedback! I've addressed both suggestions:

  • Removed hardcoded dataset count from documentation
  • Removed estimated_images column entirely (calculation, data, and documentation)

The catalog now has 13 columns (down from 14), and tests have been updated accordingly.

@ANAMASGARD ANAMASGARD requested a review from cregouby December 21, 2025 13:18
@cregouby
Collaborator

cregouby commented Dec 24, 2025

Hello @ANAMASGARD,

Thanks a lot for those improvements.
I feel sad that we lose the number of images in each dataset in your last commit. It is, I think, valuable information for choosing a dataset.
And if we think about it, one of the challenges of object detection is image size, so the typical / median image size would also be valuable information.
Also, could you add a NEWS entry for this content?

Tell me if you have the bandwidth to add both. If not, that will be a later improvement.

Best regards

@cregouby cregouby merged commit 36c9d14 into mlverse:main Dec 26, 2025
2 of 3 checks passed

Development

Successfully merging this pull request may close these issues.

[Dataset Information] Document the RF100 dataset collection

3 participants