Proposal for a replication or documentation mode #53

Open
christofs opened this issue Sep 2, 2022 · 2 comments

Comments

@christofs

This is just an idea or a suggestion. With the way stylo works at the moment, it is not easy to create results that can be replicated or precisely documented. If my probably superficial analysis is correct, the reasons are the following:

  1. A typical way of using stylo is to call it from the R prompt and then set parameters in the GUI. Even when the parameters are set at the prompt, the work is still done interactively rather than in an R script.
  2. Some files, notably the plots, are named according to some of the GUI options (e.g. "PCA" and "1000mfw"), but not all parameters are (or can be) included in the file name. In any case, "stylo_config.txt" is a much better place for documenting these parameters.
  3. However, "stylo_config.txt" is overwritten at every fresh run, including when plots with new filenames are created. The parameters for the earlier analysis are then lost.
  4. The frequency table can be saved, of course, but again it is not easy to figure out later on exactly which table was used to produce one of the earlier plots.

In practice, this means people need to copy these files to a new folder whenever they think an analysis is good and should be kept. More often than not, by the time you realize this, "stylo_config.txt" has already been overwritten, and the table of frequencies either wasn't saved or has been overwritten in the meantime as well.

A simple solution for this could be a "replication mode" or "documentation mode" that can be activated when calling stylo. One could simply say: "documentation=TRUE". Then, the following things would happen:

  1. At every run, as long as that parameter is TRUE, a subfolder is created whose name is a simple timestamp (yyyy-mm-dd-hh-mm).
  2. Optionally, if the parameter documentation.label = "arbitrary-label-string" is also given, that label is appended to the subfolder name.
  3. All data necessary to replicate the analysis is copied into this folder: the plot, the stylo_config, the frequency table, and possibly other files like a "metadata.csv" if it was used for labeling.
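The three steps above could be sketched in base R roughly as follows. This is only an illustration of the idea, not existing stylo functionality: the helper name `archive_run()` and its arguments (`files`, `label`, `root`, `now`) are hypothetical, and the caller is assumed to know which output files the run produced.

```r
# Hypothetical helper, not part of stylo: archive the outputs of one run
# in a time-stamped subfolder.
archive_run <- function(files, label = NULL, root = ".", now = Sys.time()) {
  # Folder name: timestamp in yyyy-mm-dd-hh-mm form, optionally plus a label
  stamp <- format(now, "%Y-%m-%d-%H-%M")
  folder_name <- if (is.null(label)) stamp else paste(stamp, label, sep = "_")
  target <- file.path(root, folder_name)
  dir.create(target, recursive = TRUE, showWarnings = FALSE)

  # Copy only the files that actually exist; silently skip the rest
  existing <- files[file.exists(files)]
  file.copy(existing, target, overwrite = TRUE)

  # Return the folder path so it can be reused later
  invisible(target)
}
```

After a run, a call such as `archive_run(c("stylo_config.txt", "metadata.csv"), label = "arbitrary-label-string")` would copy whichever of those files exist into a new subfolder of the working directory.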

Of course, this creates a lot of data. Folders that turn out not to be useful need to be deleted at some point. But at least no data is lost.

To replicate an analysis, one simply needs to set the working directory to the right time-stamped folder and run stylo again to repeat (and then possibly vary) the analysis. Maybe a parameter like "replication=TRUE" could be used to activate all parameters necessary, for instance to make sure stylo uses the frequency table from the documentation.

Maybe not quite thought out to the end, but something along these lines might be useful.

@jmclawson
Contributor

jmclawson commented Jan 27, 2023

I took a stab at this today.
https://gist.github.com/jmclawson/52252349dd100e426c2267b5de48aade

Does the code make sense as you imagine it, @christofs ? There are mainly two functions it makes available.

stylo_log

The first, stylo_log(), accepts a stylo() object that has just been created, and it logs the date and time, the stylo call, and the config file. It's used like this:

# option 1: pipe from stylo() into stylo_log()
stylo() |> stylo_log()

# option 2: enclose stylo() in stylo_log()
stylo_log(stylo())

# option 3: call stylo_log() on a stylo object immediately after creating it:
my_object <- stylo()
stylo_log(my_object)

Options exist, including log_label to redefine the label of the folder and log file, add_dir_date to add a date to the folder name (by default it doesn't do this), and log_date for appending a date to the end of the text file (with a default value of Sys.Date()). At its simplest, stylo_log() will create a folder called "stylo_log" containing a text log file for each day analyses are run.

At the same time that it appends the call and configuration to a log file, it also copies any files made at the same time as stylo_config.txt into the directory used for logging, prepending each of their file names with the date and time they were originally created.
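The copying step described above might look something like the following sketch. It is not the code from the gist: the function name `copy_run_files()`, the `window_secs` tolerance, and the file name pattern are assumptions made for illustration, and modification time is used as a portable stand-in for creation time.

```r
# Hypothetical sketch, not the gist's implementation: copy every file that was
# written at about the same time as stylo_config.txt into a log folder.
copy_run_files <- function(work_dir = ".",
                           log_dir = file.path(work_dir, "stylo_log"),
                           window_secs = 60) {
  config <- file.path(work_dir, "stylo_config.txt")
  stopifnot(file.exists(config))
  config_time <- file.info(config)$mtime

  # Candidates: regular files (not directories) in the working directory
  candidates <- list.files(work_dir, full.names = TRUE)
  candidates <- candidates[!dir.exists(candidates)]

  # Keep files modified within `window_secs` of stylo_config.txt
  mtimes <- file.info(candidates)$mtime
  gap <- abs(as.numeric(difftime(mtimes, config_time, units = "secs")))
  run_files <- candidates[gap <= window_secs]

  dir.create(log_dir, recursive = TRUE, showWarnings = FALSE)

  # Prefix each copy with the date and time of its source file
  prefix <- format(file.info(run_files)$mtime, "%Y-%m-%d_%H-%M-%S")
  targets <- file.path(log_dir, paste(prefix, basename(run_files), sep = "_"))
  file.copy(run_files, targets, overwrite = TRUE)
  invisible(targets)
}
```

A time window rather than an exact match is used here because the outputs of a single run are rarely written in the same second.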

stylo_replicate

The second function, stylo_replicate(), is a little more complex. It will do two things:

  1. If it is not passed a date_time argument, it will run both stylo() and stylo_log(), passing along the log_label, add_dir_date, and log_date arguments to stylo_log(), while passing along ... to stylo(). It's used like this: stylo_replicate() (with the parentheses accepting anything that will work with stylo())
  2. If date_time is passed as an argument, it will parse the appropriate log created by stylo_log() (with "appropriate" defined by defaults or by the log_label, add_dir_date, and log_date arguments) to find the settings used for a previous analysis from that date and time. It will then re-run the analysis using the same settings and add an item to the log. It is used like this: stylo_replicate("2023-01-27 13:46:26")
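To make the second case more concrete, here is a minimal sketch of looking up a logged call by its date and time. The log layout is an assumption made purely for illustration (one entry per line, in the form `<date time> | <stylo call>`), and `find_logged_call()` is a hypothetical name; the log written by stylo_log() may well be structured differently.

```r
# Hypothetical log format (an assumption, not the gist's actual layout):
# one entry per line, "<date time> | <stylo call>"
find_logged_call <- function(log_lines, date_time) {
  # Split each line into its timestamp and the stored call
  parts <- strsplit(log_lines, " | ", fixed = TRUE)
  stamps <- vapply(parts, `[`, character(1), 1)
  calls <- vapply(parts, function(p) paste(p[-1], collapse = " | "),
                  character(1))

  hit <- which(stamps == date_time)
  if (length(hit) == 0) stop("No log entry for ", date_time)

  # Return the stored call as an unevaluated expression
  parse(text = calls[hit[1]])[[1]]
}

# Example with made-up log lines
log_lines <- c(
  "2023-01-27 13:46:26 | stylo(gui = FALSE, mfw.min = 1000)",
  "2023-01-27 14:02:10 | stylo(gui = FALSE, mfw.min = 500)"
)
previous_call <- find_logged_call(log_lines, "2023-01-27 13:46:26")
```

With stylo loaded, `eval(previous_call)` would then re-run the analysis with the logged settings.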

@christofs
Author

Sounds cool! Will report back as soon as I have been able to do a test run.
