shuffle as boolean while converting dataset to dataclass #1213

mrastgoo · 2025-01-12T20:19:35Z

Adding shuffling as a boolean in _fetch_dataset_as_dataclass and setting it to True in fetch_employee_salaries.

Vincent-Maladiere

Hey @mrastgoo, thank you for this PR!

Your tests fail because of network issues with OpenML. This is unrelated to your PR, so I reran them.

Vincent-Maladiere · 2025-01-13T08:42:31Z

skrub/datasets/_fetching.py

@@ -702,6 +703,8 @@ def _fetch_dataset_as_dataclass(
            df = pd.read_parquet(info["path"])
        else:
            df = pd.read_csv(info["path"], **read_csv_kwargs)
+        if shuffling:
+            df = shuffle(df, random_state=42).reset_index(drop=True)


I think the random state should be a function parameter: that would let us test with multiple seeds and give finer-grained control over reproducibility.

I'm not 100% sure we should reset indices, because it could be surprising for users. WDYT?
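
To make the two points above concrete, here is a minimal sketch, assuming a hypothetical helper `shuffle_rows` that is not part of skrub. It uses pandas' `DataFrame.sample` in place of the `shuffle` call in the diff, exposes the seed as a parameter, and makes the index reset opt-in so both behaviours can be compared.

```python
import pandas as pd


def shuffle_rows(df, random_state=None, reset_index=False):
    """Hypothetical helper: return the rows of ``df`` in random order."""
    # sample(frac=1) draws every row exactly once, in random order;
    # the original index labels travel with their rows.
    shuffled = df.sample(frac=1, random_state=random_state)
    if reset_index:
        # Replace the original labels with a fresh 0..n-1 index.
        shuffled = shuffled.reset_index(drop=True)
    return shuffled


df = pd.DataFrame({"salary": [10, 20, 30, 40]})
kept = shuffle_rows(df, random_state=0)
reset = shuffle_rows(df, random_state=0, reset_index=True)
```

With the index kept, `kept.loc[i]` still refers to the same row as `df.loc[i]`; with the index reset, positions and labels coincide again but the link to the original row labels is lost.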

jeromedockes · 2025-01-13T12:08:04Z

Thanks @mrastgoo! I'm not sure we need to add a parameter to the fetcher; at least, it is not required for addressing #1178. We could shuffle in the example code (which by default is folded for the tabular_learner example, and which is not shown for the tablereport) rather than in the fetcher itself. WDYT?

GaelVaroquaux · 2025-01-15T17:58:59Z

> I'm not sure we need to add a parameter to the fetcher; at least, it is not required for addressing #1178. We could shuffle in the example code (which by default is folded for the tabular_learner example, and which is not shown for the tablereport) rather than in the fetcher itself

I'd rather keep the code examples as simple as possible (every character counts), and shuffle in the fetcher, by default, in a reproducible way. For this, an easy approach would be to cut and reorder the dataset at a fixed index, like "cutting" a deck of cards.

mrastgoo · 2025-01-15T21:29:50Z

> Thanks @mrastgoo! I'm not sure we need to add a parameter to the fetcher; at least, it is not required for addressing #1178. We could shuffle in the example code (which by default is folded for the tabular_learner example, and which is not shown for the tablereport) rather than in the fetcher itself. WDYT?

I can see why you prefer that: this way the fetcher returns the original data as is, with the same index. However, it will make the example longer.

mrastgoo · 2025-01-15T21:31:41Z

> > I'm not sure we need to add a parameter to the fetcher; at least, it is not required for addressing #1178. We could shuffle in the example code (which by default is folded for the tabular_learner example, and which is not shown for the tablereport) rather than in the fetcher itself
>
> I'd rather keep the code examples as simple as possible (every character counts), and shuffle in the fetcher, by default, in a reproducible way. For this, an easy approach would be to cut and reorder the dataset at a fixed index, like "cutting" a deck of cards.

I'm not sure I understand the purpose of "cutting", @GaelVaroquaux. How many cuts do we make? Do we want to shuffle as well? And should we keep the original index or reorder it?

GaelVaroquaux · 2025-01-15T21:37:27Z

> I'm not sure I understand the purpose of "cutting", @GaelVaroquaux.

The whole goal of the modification is to have the first few lines not as nasty.

> How many cuts do we make?

Let's try one.

> Do we want to shuffle as well?

No, that way it's stable and reproducible (random number generators are not always reproducible across hardware)

> And should we keep the original index or reorder it?

Not reorder, but reset to avoid something looking strange
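
Taken together, these answers describe a single deterministic "cut" followed by an index reset. A minimal sketch of that idea, assuming a hypothetical helper `cut_rows` and an arbitrary `cut_index` (neither is part of skrub, and the thread does not fix a cut position):

```python
import pandas as pd


def cut_rows(df, cut_index):
    """Hypothetical helper: move the first ``cut_index`` rows to the end."""
    # Like cutting a deck of cards: the bottom part goes on top,
    # and the order inside each part is unchanged.
    reordered = pd.concat([df.iloc[cut_index:], df.iloc[:cut_index]])
    # Reset the index so the result does not start at an odd-looking label.
    return reordered.reset_index(drop=True)


df = pd.DataFrame({"row": [0, 1, 2, 3, 4]})
cut = cut_rows(df, cut_index=2)
```

No random number generator is involved, so the result is identical on every platform, and `df` itself is left untouched.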

jeromedockes · 2025-01-16T08:56:27Z

and IIUC the reordering is not optional and no new parameter is exposed to the user right?

GaelVaroquaux · 2025-01-16T09:00:12Z

> and IIUC the reordering is not optional and no new parameter is exposed to the user right?

As you wish

jeromedockes · 2025-01-16T09:11:05Z

> > and IIUC the reordering is not optional and no new parameter is exposed to the user right?
>
> As you wish

Ok in that case I'd rather not add a parameter, just do the transformation every time

Vincent-Maladiere · 2025-01-16T16:56:30Z

Then what about editing the dataset, putting the reordered version on Figshare and fetching that directly?

GaelVaroquaux · 2025-01-16T17:20:29Z

> Then what about editing the dataset, putting the reordered version on Figshare and fetching that directly?

I always like having a form of traceability of the data, so I think that I would prefer not changing the upstream data

mrastgoo · 2025-01-25T18:15:58Z

> > > and IIUC the reordering is not optional and no new parameter is exposed to the user right?
> >
> > As you wish
>
> Ok in that case I'd rather not add a parameter, just do the transformation every time

Hi @jeromedockes, just to clarify: do you prefer modifying the example code? I can update the PR accordingly.

jeromedockes · 2025-02-11T10:14:56Z

Hey @mrastgoo !

Really sorry about the delay in getting back to you. The reason is that there has been a major revamp of the dataset fetchers, both to fix some problems we ran into with how datasets were downloaded before and to simplify and improve the datasets module.

What I suggest is that we keep the plan of not modifying the downloaded file for now, but as discussed change the order of rows after loading the dataframe. The function you need to modify is now here:

https://github.com/skrub-data/skrub/blob/main/skrub/datasets/_fetching.py#L32

(There have been so many changes that it might be easier to start a new PR to avoid git conflicts, sorry about that)

The new function might look something like:

diff --git a/skrub/datasets/_fetching.py b/skrub/datasets/_fetching.py
index 0e6a5724..a2537776 100644
--- a/skrub/datasets/_fetching.py
+++ b/skrub/datasets/_fetching.py
@@ -29,7 +29,13 @@ def fetch_employee_salaries(data_home=None):
         - y : pd.DataFrame, target labels
         - metadata : a dictionary containing the name, description, source and target
     """
-    return load_simple_dataset("employee_salaries", data_home)
+    data = load_simple_dataset("employee_salaries", data_home)
+    df = data['employee_salaries']
+    new_df = ... # re-order the dataframe
+    data['employee_salaries'] = new_df
+    data['X'] = new_df.drop(columns='current_annual_salary')
+    data['y'] = new_df['current_annual_salary']
+    return data
 
 
 def fetch_medical_charge(data_home=None):
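
For illustration only, the dictionary update in the diff above can be exercised on a toy frame. This is a sketch under stated assumptions: a plain `dict` stands in for whatever `load_simple_dataset` returns, `reorder_dataset` and `cut_index` are hypothetical names, and the re-ordering step is filled with a single cut as discussed earlier in the thread, which may not be what the final PR does.

```python
import pandas as pd

# Target column name taken from the diff above.
TARGET = "current_annual_salary"


def reorder_dataset(data, cut_index):
    """Hypothetical helper: re-order the full frame, then rebuild X and y."""
    df = data["employee_salaries"]
    # Single cut at a fixed position, then a fresh 0..n-1 index (assumption).
    new_df = pd.concat([df.iloc[cut_index:], df.iloc[:cut_index]])
    new_df = new_df.reset_index(drop=True)
    data["employee_salaries"] = new_df
    # Derive X and y from the same re-ordered frame so their rows stay aligned.
    data["X"] = new_df.drop(columns=TARGET)
    data["y"] = new_df[TARGET]
    return data


toy = pd.DataFrame(
    {"department": ["A", "B", "C", "D"], TARGET: [1.0, 2.0, 3.0, 4.0]}
)
data = reorder_dataset({"employee_salaries": toy}, cut_index=1)
```

Building `X` and `y` from `new_df`, rather than re-ordering each separately, is what keeps features and targets row-aligned.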

mrastgoo · 2025-02-12T08:05:14Z

Thanks @jeromedockes for the feedback, I'll try to finish the PR this weekend.

shuffle as boolean while converting dataset to dataclass

9be7245
