
setting random seeds #163

Closed · lheagy opened this issue Mar 19, 2023 · 7 comments

Labels
invalid This doesn't seem right

Comments

@lheagy (Collaborator) commented Mar 19, 2023

In the worksheets and tutorials, we use a mix of setting random seeds via numpy (np.random.seed) and passing a random_state to scikit-learn (https://github.ubc.ca/UBC-DSCI/dsci-100-instructor/issues/760).

In chapter 6, the textbook only discusses using numpy. We should add a discussion of setting the random state with scikit-learn (and possibly trim the numpy text if we can rely entirely on scikit-learn for setting the random state).
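For readers skimming the thread, here is a minimal sketch of the two styles being mixed (stand-in data and seed values, not the actual worksheet code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # stand-in feature matrix
y = np.arange(10)                 # stand-in labels

# Style 1: seed numpy's global PRNG once; scikit-learn draws from it
# whenever random_state is not given.
np.random.seed(1)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Style 2: pass random_state directly to the scikit-learn function.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
```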

@joelostblom (Collaborator)

Trevor and I discussed this a bit in #114, and from that discussion it seems like setting the numpy random seed is suitable and probably not worth complicating with multiple approaches. That said, since setting random_state is commonly seen online, it might be worth mentioning in a pop-up box or similar that it is done for much the same reason as setting the numpy random seed.

@joelostblom (Collaborator)

@trevorcampbell I came across this again since it is one of the open issues in the dsci100 repo (https://github.ubc.ca/UBC-DSCI/dsci-100-instructor/issues/760). One scenario I don't think we discussed in detail in #114 is that students might be confused when they rerun the same cell but receive different outputs. Since the numpy random seed is set at the beginning of the notebook, possibly far away from the cell they are working on, it might not be immediately obvious to them why the same code gives different output. For example, this cell comes a fair bit down in the worksheet (it currently has random_state, but we would remove that if we go with the numpy seed approach):

[screenshot of the worksheet cell in question]

Another inconvenience is that if they make changes to their code, they might not easily know whether the new output is due to the changes they made or to the randomness from progressing the PRNG. To rule that out and be sure that differences in the output come from their change, they would need to rerun the entire notebook each time. Likewise, if they add a function call somewhere that progresses the PRNG, they might get cascading errors downstream of it.
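To make these scenarios concrete (hypothetical cell contents, not the actual worksheet code): after a single np.random.seed near the top of the notebook, every rerun of a later cell advances the global PRNG, whereas an explicit random_state makes the cell stable under reruns:

```python
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(1)  # imagine this in the first cell of the notebook

X = np.arange(20).reshape(10, 2)  # stand-in data
y = np.arange(10)

# A later "cell": each execution advances the global PRNG, so the
# split changes every time the student reruns it.
print(train_test_split(X, y)[3])
print(train_test_split(X, y)[3])  # different y_test than the line above

# With an explicit random_state, rerunning is reproducible.
print(train_test_split(X, y, random_state=2)[3])
print(train_test_split(X, y, random_state=2)[3])  # identical y_test
```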

I don't know how big of an issue and time sink these scenarios would be in practice, and #114 brought up compelling reasons in favor of numpy seeds as well, but from the students' perspective, setting the random state right in front of their eyes, in the very function they are working with, would bring some conveniences.

I'll hold off on modifying this part of the worksheets and tutorials for now.

@trevorcampbell (Contributor)

@joelostblom in the R version, we resolve this by setting the seed in every cell that involves randomness, which is essentially the same as setting the random state everywhere. I want to emphasize that this is terrible practice in real data analysis, as discussed in #114. But we don't really have a choice: in autograded Jupyter notebooks, we have to control the random stream so that we can check answers against our solutions.
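Translated to the Python worksheets, that per-cell convention might look like the sketch below (stand-in data and seed; the real worksheets use their own datasets):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(10), "y": range(10)})  # stand-in data

# --- cell 1: reseed immediately before the randomness it controls ---
np.random.seed(2023)
train_df, test_df = train_test_split(df, test_size=0.25)

# --- cell 2: reseed again so this cell is reproducible on its own ---
np.random.seed(2023)
shuffled = df.sample(frac=1)
```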

In theory we could somehow incorporate the random state into the test code too, but that would be incredibly difficult / not worth the effort.

Here you can find what we write in the Classification II worksheet in the R version:

[screenshot of the explanatory paragraph from the R Classification II worksheet]

I'm perfectly fine with setting the numpy seed, or setting the random_state everywhere, as long as we include a paragraph like that before the first instance so students know what to expect.

@trevorcampbell (Contributor)

R version: in question cells, students see something like this:

[screenshot of a question cell from the R worksheets]

@trevorcampbell (Contributor)

Another approach that we didn't consider at the time: we could restart-and-run the notebook many times in advance to learn the distribution of each test variable, and then formulate statistical tests (with a very low probability of a false result) for checking each answer at grading time.

That would certainly be a lot more elegant, but very difficult given the way things are currently done.

(But I'll open an issue in Autotest to add that functionality, which could be quite nice to have in general...)
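As a rough sketch of what that distributional check could look like (entirely hypothetical; neither Autotest nor the course currently does this): run the solution notebook many times offline, record each graded variable, and accept a student's value if it falls inside a wide empirical interval:

```python
import numpy as np

def build_reference_interval(samples, alpha=1e-3):
    """Empirical interval that a correct answer should land in with
    probability roughly (1 - alpha), estimated from repeated
    restart-and-run executions of the solution notebook."""
    return tuple(np.quantile(samples, [alpha / 2, 1 - alpha / 2]))

def check_answer(student_value, lo, hi):
    # Accept if the student's value is plausible under the reference runs.
    return lo <= student_value <= hi

# Toy usage: simulated stand-ins for accuracies from 10,000 solution runs.
rng = np.random.default_rng(0)
reference_accuracies = rng.normal(loc=0.85, scale=0.01, size=10_000)
lo, hi = build_reference_interval(reference_accuracies)
print(check_answer(0.86, lo, hi))  # True: within the reference interval
```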

@joelostblom (Collaborator) commented Aug 7, 2023

TL;DR: I suggest that we wait to update until scikit-learn has updated their recommendations, and then do one PR with all the randomness updates. Things are currently reproducible in our worksheets, and we have a note in worksheet 6:

[screenshot of the randomness note in worksheet 6]


Based on our previous discussion, I opened scikit-learn/scikit-learn#25395, which gave rise to scikit-learn/scikit-learn#26148, which in turn led to a proposal to improve both the code and the docs for randomness in sklearn: scikit-learn/enhancement_proposals#88. It is still open for discussion, in case you have the time and desire to contribute some of your expertise. One of the more amusing things mentioned there is that randomness in sklearn is hard to understand:

[screenshot of a quote from the scikit-learn discussion]

In terms of changes on our end, I think the best long-term solution would be to create a numpy random generator with rng = np.random.default_rng(...) in each cell of the worksheets/tutorials and pass it to random_state, and then include a note saying that in "real-life" analyses, students should create the generator only once, but still pass it to random_state each time. We can also update the textbook to reflect this; then we don't need the large note we currently have about the details of global random states, and we would also be following best practice by default.
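A sketch of that pattern (stand-in data and seed; note that passing a np.random.Generator to random_state depends on the scikit-learn changes discussed above actually landing, so the working fallback today is to pass the integer seed itself):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # stand-in data
y = np.arange(10)

seed = 2023

# Proposed pattern for each worksheet cell: create a new-style generator...
rng = np.random.default_rng(seed)

# ...and pass it to random_state. Until scikit-learn accepts Generator
# objects there, pass the integer seed instead, which works today:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
```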

I think in practice this won't change much for our students, so it is a low-priority item, but it is the closest we can get to best practice while retaining reproducibility. It would also be in line with the numpy recommendation to use the new generators, with what Trevor mentioned about the limitations of the old numpy PRNG algorithm, and hopefully with the updated sklearn recommendations as well. But let's wait until the scikit-learn updates are finalized so that we only need to update this once on our end. For now I'm fine with keeping a mix in the worksheets, unless someone has additional time to take this one (I think it should be one PR for the assignments plus one for the textbook).

@trevorcampbell added the invalid label and removed the needs-investigation, low-priority, and waiting labels on Nov 11, 2023
@trevorcampbell (Contributor)

This is no longer a textbook issue; it is an issue about how we run the worksheets in the course. I'm marking it as invalid here (and will link to it from the course repo) since I can't transfer issues across orgs.
