Create RStudio_user.rst #950

Open
wants to merge 47 commits into
base: main
Commits
e3d58b6
Create Studio_user.rst
jcolomb Apr 26, 2023
4542a93
Update Studio_user.rst
jcolomb Apr 26, 2023
52a2a2f
Update Studio_user.rst
jcolomb Apr 26, 2023
78eee42
Update Studio_user.rst
jcolomb Apr 26, 2023
24ae8d9
Update Studio_user.rst
jcolomb Apr 26, 2023
c2f7dad
Update Studio_user.rst
jcolomb Apr 26, 2023
6d5db5e
Update Studio_user.rst
jcolomb Apr 26, 2023
1aff71a
Update Studio_user.rst
jcolomb Apr 27, 2023
fa08b07
Merge branch 'datalad-handbook:main' into master
jcolomb Jun 1, 2023
ca3cc1c
rename file + change beginning (for review before applying to all cha…
jcolomb Jun 1, 2023
71c55f3
Apply suggestions from code review
jcolomb Jul 6, 2023
bc87dfc
ignoremore
jcolomb Jul 6, 2023
f0f8466
Rewriting second part, following the Max and Bobby interactions
jcolomb Jul 11, 2023
fecb534
add images
jcolomb Jul 11, 2023
2f97f94
ref typo ?
jcolomb Jul 16, 2023
ad710fd
Update docs/usecases/RStudio_user.rst
jcolomb Jul 16, 2023
474b6d9
Update docs/usecases/RStudio_user.rst
jcolomb Jul 16, 2023
73ede19
Update docs/usecases/RStudio_user.rst
jcolomb Aug 4, 2023
2ab1360
adding correct path for images
jcolomb Aug 7, 2023
ac69b49
look at code and commands synthax
jcolomb Aug 7, 2023
2f0d118
Merge pull request #2 from datalad-handbook/main
jcolomb Aug 10, 2023
9e928bb
typos
jcolomb Aug 14, 2023
f8c870d
Update docs/usecases/RStudio_user.rst
jcolomb Aug 14, 2023
53e6b62
typo
jcolomb Aug 14, 2023
7dc7feb
Update docs/usecases/RStudio_user.rst
jcolomb Aug 14, 2023
68d6f63
Update intro.rst
jcolomb Aug 14, 2023
435bdde
original .gitignore
jcolomb Oct 6, 2023
e29d9c7
Apply suggestions from code review: mostly typos
jcolomb Oct 6, 2023
a65103f
speel check
jcolomb Oct 6, 2023
b7577eb
DataLad spelling
jcolomb Oct 10, 2023
99e9c58
moving gintonic info into a box
jcolomb Oct 10, 2023
0ef0935
add notes on push
jcolomb Oct 10, 2023
c9c8697
Merge pull request #3 from datalad-handbook/main
jcolomb Oct 10, 2023
746d3ed
trying to correct new image address
jcolomb Oct 10, 2023
72e0dbe
correct copypaste error
jcolomb Oct 10, 2023
c8776da
adding some precisions
jcolomb Oct 18, 2023
299983f
adding comments on datalad run use in practice
jcolomb Oct 18, 2023
1422d0a
add reference to git-annex intro
jcolomb Oct 18, 2023
c64774f
debug links
jcolomb Oct 18, 2023
4460734
trying to clean handbook links
jcolomb Oct 18, 2023
3c6bc95
Fix anchor format
adswa Dec 18, 2023
85e8e5e
fix heading
adswa Dec 18, 2023
2ed4587
Merge branch 'datalad-handbook:main' into master
jcolomb Dec 18, 2023
7074015
add tab for gitusernote
jcolomb Dec 18, 2023
a4a5155
fix references
adswa Dec 18, 2023
d870dde
formatting fix
adswa Dec 19, 2023
ab18dc1
add space
adswa Dec 19, 2023
11 changes: 3 additions & 8 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,25 +1,20 @@
docs/_build
.ignored
.idea

# Ignore irrelevant files from the Sublime Text editor
*.sublime-workspace
*.sublime-project

# Ignore irrelevant files from the VS code editor
.vscode/

# Ignore .hgignore for contributors using Mercurial.
.hgignore

# Ignore files generated during build
build/

venv/

*.egg-info

__pycache__

*.swp
venvs
.Rproj.user
.Rhistory
02_dataladhandbook-myfork.Rproj
Binary file added docs/_static/img/Rstudio-create.jpg
Binary file added docs/_static/img/Rstudio-dataladrun.jpg
Binary file added docs/_static/img/Rstudio-terminal.jpg
200 changes: 200 additions & 0 deletions docs/usecases/RStudio_user.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,200 @@
.. _usecase_Rstat:

DataLad and RStudio: First steps
--------------------------------

.. index:: ! Usecase; R users quickstart

This use case sketches typical entry points for R and `RStudio <https://en.wikipedia.org/wiki/RStudio>`_ users.

#. A repository with submodules for data and code is cloned.
#. R scripts are developed in RStudio.
#. R scripts are run using ``datalad run``.

(This is a `hello world` type of analysis, used only for demonstration purposes.)

The Challenge
^^^^^^^^^^^^^

Max has been using RStudio together with :term:`GitHub` for a long time and knows how :term:`Git` works.
Max has learned that Git alone will not work for their new project,
because there will be too many files and some dataset files will be too large.
Max read the DataLad :ref:`Handbook Basics <basics-intro>` and has decided to use DataLad:
they want to version control larger files and to split files across several repositories, linked as "DataLad dataset hierarchies".
Max still wants to use RStudio and a combination of R and Python scripts for the
data analysis.

Bobby is a data manager who has already learned (the hard way) how to handle DataLad
from within RStudio. They have also created a :term:`GIN` repository with :term:`submodule`\s
for data and for code, using the `Tonic tool and templates <https://gin-tonic.netlify.app>`_.
Collaborator

This statement is primarily emotion (which is OK). However, instead of merely stating that there is also a "GIN" and a "Tonic" part of the scenario, it would be better to say "why".

GIN likely is here because it can host large files. If that is the main reason, the text should make that explicit. If so, it would be useful to say that GIN is one convenient solution, but not the only one.

Likewise the role of Tonic should be clarified. Right now it seems that it is needed to set up the GIN repos.

Given that the main content of this use case is about executing code locally (and not about pushing the results somewhere), it is worth considering moving this information into a box, to avoid the impression that it is necessary to have these for running RStudio with DataLad.

Contributor Author

very good point, moved to a box, and moved the box at the end of this section (around l.79)

He is happy to help Max, as he knows this will allow them to record analysis provenance.


Setting up
^^^^^^^^^^

Max follows the handbook and installs DataLad on their computer.
Max first wants to clone the repository to their computer. To do so, they use the RStudio
``create a new project`` function with the SSH address of the parent repository.

.. image:: /_static/img/Rstudio-create.jpg
   :alt: Several screenshots demonstrating the creation of a new project in RStudio

Max cannot see the content of the submodules and asks Bobby for help.

Bobby comes over and runs ``datalad get . -n -r`` in the terminal window of RStudio.

.. image:: /_static/img/Rstudio-terminal.jpg
   :alt: Using the RStudio terminal window to give DataLad commands


They then explain:

- RStudio itself only issues basic Git commands, which do not fetch the content of submodules.
- DataLad commands are run in the terminal window. DataLad has no R package and does not run in the R console.
- ``datalad get .`` on its own would download all files. Here it is called with two options:

  - ``-n`` (short for ``--no-data``) means that annexed file content is not downloaded; only the datasets themselves are installed.
  - ``-r`` (short for ``--recursive``) means that the command is run in all submodules, recursively.

Max thanks Bobby for the insights.
Before leaving, Bobby gives one more piece of advice: "Our template uses 'pure Git' repositories. DataLad commands will work, but they will not use git-annex."
Seeing Max's incredulous face, they explain further: "You can now use DataLad to manage the submodules and save them all at once, but big files will be added to Git, and this will make the repository unusable very fast.
So you need to turn this pure Git repository into a proper :term:`DataLad dataset` (meaning a Git repository with additional features from :term:`git-annex` and DataLad)."

Max is a bit puzzled and reads the Basics chapter of the handbook again.
There, they see that :command:`datalad create --force` is the command to create a DataLad dataset
in a folder that already exists, so they run
``datalad create --force -r`` in the parent repository.
Because they used the ``-r`` option, they are now confident that DataLad is set up
in the repository and in all submodules.

Contributor

I feel like this can conclude with a statement that with this setup, everything is good to go for DataLad commands from the console, for example for saving changes, pushing modifications, pulling updates, or adding siblings.

Contributor Author

Added some infos in the box, indeed pushing would need additional setups in the scenario


Working on the code
^^^^^^^^^^^^^^^^^^^

Max starts to write a script, saves it in the analysis submodule, and runs ``datalad save -r -m "this is a first draft of the script"`` in the terminal (in the parent repository).
The commit history of the parent and the analysis repositories shows the message, and Max thinks everything works fine.
Max then changes the script, but RStudio refuses to save the changes.
Max saves a copy of the script file and calls Bobby for help.

Contributor

Maybe for this section it makes sense to focus it on the topic of reproducible execution with datalad run.
I think there is no need to spend too much work on rewriting content about the difference between files kept in Git versus in git-annex (instead, add references to existing parts in the handbook).

Contributor Author

I kept this in the following version because:

  • beginners may not care about reproducible execution
  • Infos about difference between Git and Git-annex is necessary to explain Rstudio behavior. I was personally very surprised of that behavior and needed testing and thinking to understand what happens. I try to add some reference for more info.

Bobby starts to explain what happened:
DataLad saved the script using git-annex.
This means that the file content was moved elsewhere, and the file itself was replaced by a small pointer to the new location.
The pointer, which is tiny, is tracked in Git, while the actual content is stored by git-annex outside of Git's history.
Because the pointer is a :term:`symlink`, RStudio still shows the original content when the file is opened, but it cannot overwrite it: the content is read-only.
This is explained in detail in the :ref:`Handbook chapters on git-annex <basics-annex>`.
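
The situation Bobby describes can be imitated with plain shell commands. The following is only a simulation under simplifying assumptions, not what git-annex literally runs: the ``store`` folder and the file names are made up for illustration, whereas git-annex keeps the content under ``.git/annex/objects``.

```shell
# Simulation only: plain shell commands that imitate the layout git-annex creates.
set -eu
demo_dir=$(mktemp -d)
cd "$demo_dir"

# 1. A stand-in for the annex: it holds the real content, made read-only.
mkdir -p store
printf 'print("hello")\n' > store/content-of-script
chmod a-w store/content-of-script

# 2. The file the user sees is only a symlink pointing into the store.
ln -s store/content-of-script script.R

# Reading through the symlink works ...
cat script.R

# ... but writing is refused for a regular (non-root) user.
if { printf '# edit\n' >> script.R; } 2>/dev/null; then
    echo "write succeeded (are you running as root?)"
else
    echo "write refused: the content is read-only"
fi
```

An editor such as RStudio ends up in the same position as the last command: it can open the file, but saving fails.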

One could overwrite the file by first unlocking it (using ``datalad unlock .``), but that would not be very practical: the script would still be handled like a binary file, which makes version control of code very inefficient.

You do not want to use git-annex for scripts: they are text files whose versions should be handled by Git.
Bobby then shows how to tell DataLad to use Git for text files by running ``datalad create -c text2git --force``.

Max can now work on the script as usual, and commits changes with ``datalad save -r``.




.. gitusernote:: Dangers of text2git

   Note that with this option all text files are added to Git.
   If you have large text files (for example ``.csv`` or ``.json`` files)
   that should be handled by git-annex,
   you will need to specify more precisely which text files
   should not be annexed.
   See the :ref:`Handbook chapter on procedures <101-124-procedures>` (http://handbook.datalad.org/en/inm7/basics/101-124-procedures.html)
   for details on how text2git changes ``.gitattributes`` to achieve that.
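
As a rough sketch of the kind of adjustment meant here: git-annex decides per file via the ``annex.largefiles`` attribute in ``.gitattributes``, and when several patterns match a file, a later line overrides an earlier one. The commands below append more specific rules for ``.csv`` and ``.json`` files. The patterns are only examples, and ``DATASET_DIR`` is a hypothetical variable that defaults to a scratch directory so the snippet can be tried without touching a real dataset.

```shell
set -eu
dataset_dir="${DATASET_DIR:-$(mktemp -d)}"
attributes_file="$dataset_dir/.gitattributes"

# Rule of the kind text2git writes: only binary files go to the annex.
# (Shown for context; in a real dataset this line is already present.)
[ -f "$attributes_file" ] || \
    echo '* annex.largefiles=((mimeencoding=binary)and(largerthan=0))' > "$attributes_file"

# More specific rules, appended afterwards so that they take precedence.
echo '*.csv annex.largefiles=anything'  >> "$attributes_file"
echo '*.json annex.largefiles=anything' >> "$attributes_file"

cat "$attributes_file"
```

In a real dataset, the modified ``.gitattributes`` would then be saved like any other change, for example with ``datalad save``.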

Running code
^^^^^^^^^^^^

The code uses paths relative to the parent repository, as Max is used to in ordinary projects, since the code is run from there in RStudio.
(Later on, Max realises that Git commands also work from inside the analysis submodule, and creates a second RStudio project in that submodule just to use the familiar Git functions. Code is still run from the parent RStudio project.)

Max is now happy and starts working on the code.
In order to test everything, Max puts a text file in the data submodule and writes a script that reads the file and produces a PDF showing the text as an image.
He runs the code and it works!
He now saves it with ``datalad save -r``.
He runs the code again and... oops, it fails.

Max thinks a bit about it and remembers what he learned before: the PDF file has been annexed and cannot be overwritten.
Max therefore runs ``datalad unlock . -r`` and then runs the code again, and it works.
Max also realises that ``datalad save . -r`` locks the files again,
even if nothing changed in the repository (and therefore no commit is made).

At the coffee break, Max meets Bobby and complains about the process.
Bobby takes the opportunity to mention another problem that can arise: if you drop the input files (that is, remove the git-annex content from your computer once it is on the server), you also need to download them again before running the code (using the ``datalad get`` command).

Bobby tells Max it is time to learn about ``datalad run``.

Datalad run with Rscripts
jcolomb marked this conversation as resolved.
Show resolved Hide resolved
^^^^^^^^^^^^^^^^^^^^^^^^^^

Bobby starts with the basics of running R code via ``datalad run``:

Because DataLad runs in the terminal, it needs a terminal command to run the script.
For R, that command is ``Rscript``: ``datalad run Rscript "<path-to-script.r>"``.
The path is relative to the terminal's current directory, which by default is the working directory of the project. If your code is in one submodule and the data is in another, you should run this command from the parent repository.

(For this to work, Bobby first has to make sure that ``Rscript`` is a recognised command, setting the PATH variable accordingly.)
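
One way to check this from the RStudio terminal is sketched below. The helper ``prepend_to_path`` and the directory ``/usr/local/lib/R/bin`` are assumptions made for illustration; the folder that actually contains ``Rscript`` depends on the operating system and on how R was installed.

```shell
# Report whether Rscript is reachable from this terminal.
if command -v Rscript >/dev/null 2>&1; then
    echo "Rscript found at: $(command -v Rscript)"
else
    echo "Rscript is not on PATH"
fi

# Hypothetical helper: prepend a directory to PATH unless it is already listed.
prepend_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Hypothetical location; replace with the folder that holds Rscript on your machine.
prepend_to_path "/usr/local/lib/R/bin"
```

The change only lasts for the current terminal session; making it permanent would mean adding the same line to the shell's start-up file.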

"What are the advantages of using this command", asks Max.

"There are at least two," answers Bobby.
First, this command takes care of obtaining input files and unlocking output files for you.
Second, and most importantly, the command automatically records what has been done in the commit message: which inputs, which script, and which outputs were used.
The command therefore records **provenance**: you will always be able to find out which workflow and which data version were used to create your figures.

Since Bobby looks very enthusiastic about provenance, Max reads a little more about it in the handbook: usecases/provenance_tracking, https://handbook.datalad.org/en/latest/basics/101-108-run.html#run

Then, Max creates a bash script in RStudio and runs it using the usual button (this runs the bash script in the terminal).



.. code-block:: bash

   $ datalad run \
     --input "file1.csv" \
     --input "data/file2.json" \
     --output "figures/*.png" \
     --explicit \
     Rscript "<path-to-script.r>" {inputs} {outputs}

.. image:: /_static/img/Rstudio-dataladrun.jpg
   :alt: Bash code running the datalad run command

One can give as many input and output files as needed, and use ``*`` to match several files with a similar ending (in the example, all ``.png`` figures will be unlocked). It is good practice to list files as input and output even if they do not need to be handled by DataLad, in order to give more information in the commit message.
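
When the list of files grows, a bash script like Max's can assemble the call from lists. The following is a minimal sketch, not a recommended standard: the script path ``code/analysis.R``, the commit message, and the ``DRY_RUN`` switch are hypothetical, and ``--explicit`` is left out for brevity. By default the script only prints the command it would run.

```shell
#!/bin/sh
set -eu

script="code/analysis.R"   # hypothetical path to the R script

# Build the argument list with `set --` so that quoting survives.
set -- datalad run -m "Render figures"
for f in "file1.csv" "data/file2.json"; do
    set -- "$@" --input "$f"
done
for f in "figures/*.png"; do            # quoted: DataLad expands the pattern, not the shell
    set -- "$@" --output "$f"
done
set -- "$@" Rscript "$script" "{inputs}" "{outputs}"

rendered_command="$*"

# Print the command by default; set DRY_RUN=0 to execute it inside a dataset.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$rendered_command"
else
    "$@"
fi
```

Keeping the inputs and outputs in one place makes it easier to keep the provenance record in the commit message in step with what the script actually reads and writes.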

.. gitusernote:: Behavior explained

   - ``--input``: files are downloaded if they are not present, so that they can be read. They are not unlocked (reading does not require it), and they are not dropped again afterwards.
   - ``--output``: files are unlocked so that they can be overwritten. Files that are not present (dropped) are not downloaded. This may make your code fail: if it does, either get the files manually before calling ``datalad run``, or remove them in the R code (``file.remove()``). Otherwise it works, and DataLad even detects when a file has not been modified and makes no commit.
   - ``--explicit``: by default, ``datalad run`` only works in clean repositories, and this includes all submodules. With ``--explicit``, DataLad only checks that the output files are clean, and only the output files are saved. Use it with care: the script and data you use are not checked, and provenance information can be lost.
   - ``{inputs}`` and ``{outputs}``: if you add these placeholders, the input and output paths are passed as arguments to ``Rscript``. You can access them in the R script with ``args <- commandArgs(trailingOnly = TRUE)`` and then ``args[i]``, where ``i`` starts at 1.
   - At the end, ``datalad run`` saves recursively (like ``datalad save -r``), so that modifications made by the code anywhere in the repository, including submodules, are recorded (except when ``--explicit`` is given, see above). This includes any intermediate file created when the code runs through ``Rscript "path-to-code.R"`` in the terminal, which can create more files than running the code directly in RStudio.





.. gitusernote:: Advanced tips for datalad run

   Unlocking files manually makes the state of the dataset "unclean". When you use ``datalad run``, declare the files with the ``--output`` option instead of unlocking them beforehand.

   The commit message only records the options you give; whether the code actually uses these input and output files is not checked.

   Using ``datalad run`` correctly is sometimes tricky, and since it saves each time, it can make the repository history quite messy. Make sure to give good commit messages.






.. importantnote:: Take home messages

   DataLad commands run in the terminal, not in the R console.

   The simplest way to tell DataLad not to use git-annex for your code files is the ``datalad create -r -c text2git --force`` command.

   The ``datalad run Rscript "path-to-script.r"`` command will run your script.

   Use additional options to read or write annexed files (and to give more information in commit messages).

   In your R script, use paths relative to the project, not relative to the location of the code.