From 2efa8b0182d493685111d311cc48f2d1b4c70daf Mon Sep 17 00:00:00 2001
From: Adina Wagner
Date: Thu, 9 Nov 2023 12:43:50 +0100
Subject: [PATCH] Move git-annex whereis content, remove dysfunctional demo

---
 docs/basics/101-116-sharelocal.rst  | 106 +-------------------
 docs/basics/101-117-sharelocal2.rst | 150 +++++++++++++++++-----------
 2 files changed, 95 insertions(+), 161 deletions(-)

diff --git a/docs/basics/101-116-sharelocal.rst b/docs/basics/101-116-sharelocal.rst
index 2e51059ce..69fab3280 100644
--- a/docs/basics/101-116-sharelocal.rst
+++ b/docs/basics/101-116-sharelocal.rst
@@ -167,111 +167,7 @@ To demonstrate this, you decide to examine the PDFs further.
 "Opening this file will work, because the content was retrieved from the original dataset.", you explain, proud that this worked just as you
-thought it would. Your room mate is excited by this magical
-command. You however begin to wonder: how does DataLad know where to look for
-that original content?
-
-This information comes from git-annex. Before getting the next PDF,
-let's query git-annex where its content is stored:
-
-.. index::
-   pair: whereis; git-annex command
-   pair: show file content availability; with git-annex
-.. runrecord:: _examples/DL-101-116-105
-   :language: console
-   :workdir: dl-101/mock_user/DataLad-101
-   :notes: git-annex whereis to find out where content is stored
-   :cast: 04_collaboration
-
-   $ git annex whereis books/TLCL.pdf
-
-Oh, another :term:`shasum` - or, more specifically, a :term:`annex UUID`. This time however not in a symlink...
-"That's hard to read -- what is it?" your room mate asks. You can
-recognize a path to the dataset on your computer, prefixed with the user
-and hostname of your computer. "This", you exclaim, excited about your own realization,
-"is my dataset's location I'm sharing it from!"
-
-.. index::
-   pair: set description for dataset location; with DataLad
-.. find-out-more:: What is this location, and what if I provided a description?
-
-   Back in the very first section of the Basics, :ref:`createDS`, a :ref:`Find-out-more mentioned the '--description' option ` of :dlcmd:`create`.
-   With this option, you can provide a description about the dataset *location*.
-
-   The :gitannexcmd:`whereis` command, finally, is where such a description
-   can become handy: If you had created the dataset with
-
-   .. code-block:: bash
-
-      $ datalad create --description "course on DataLad-101 on my private laptop" -c text2git DataLad-101
-
-   the command would show ``course on DataLad-101 on my private laptop`` after
-   the :term:`shasum` -- and thus a more human-readable description of *where*
-   file content is stored.
-   This becomes especially useful when the number of repository copies
-   increases. If you have only one other dataset it may be easy to
-   remember what and where it is. But once you have one back-up
-   of your dataset on a USB stick, one dataset shared with
-   Dropbox, and a third one on your institutions
-   :term:`GitLab` instance you will be grateful for the descriptions
-   you provided these locations with.
-
-   The current report of the location of the dataset is in the format
-   ``user@host:path``.
-
-   If the physical location of a dataset is not relevant, ambiguous, or volatile,
-   or if it has an :term:`annex` that could move within the foreseeable lifetime of a
-   dataset, a custom description with the relevant information on the dataset is
-   superior. If this is not the case, decide for yourself whether you want to use
-   the ``--description`` option for future datasets or not depending on what you
-   find more readable -- a self-made location description, or an automatic
-   ``user@host:path`` information.
-
-
-The message further informs you that there is only "``(1 copy)``"
-of this file content. This makes sense: There
-is only your own, original ``DataLad-101`` dataset in which
-this book is saved.
-
-To retrieve file content of an annexed file such as one of
-these PDFs, git-annex will try
-to obtain it from the locations it knows to contain this content.
-It uses the checksums to identify these locations. Every copy
-of a dataset will get a unique ID with such a checksum.
-Note however that just because git-annex knows a certain location
-where content was once it does not guarantee that retrieval will
-work. If one location is a USB stick that is in your bag pack instead
-of your USB port,
-a second location is a hard drive that you deleted all of its
-previous contents (including dataset content) from,
-and another location is a web server, but you are not connected
-to the internet, git-annex will not succeed in retrieving
-contents from these locations.
-As long as there is at least one location that contains
-the file and is accessible, though, git-annex will get the content.
-Therefore, for the books in your dataset, retrieving contents works because you
-and your room mate share the same file system. If you'd share the dataset
-with anyone without access to your file system, ``datalad get`` would not
-work, because it cannot access your files.
-
-But there is one book that does not suffer from this restriction:
-The ``bash_guide.pdf``.
-This book was not manually downloaded and saved to the dataset with ``wget``
-(thus keeping DataLad in the dark about where it came from), but it was
-obtained with the :dlcmd:`download-url` command. This registered
-the books original source in the dataset, and here is why that is useful:
-
-.. runrecord:: _examples/DL-101-116-106
-   :language: console
-   :workdir: dl-101/mock_user/DataLad-101
-
-   $ git annex whereis books/bash_guide.pdf
-
-Unlike the ``TLCL.pdf`` book, this book has two sources, and one of them is
-``web``. The second to last line specifies the precise URL you downloaded the
-file from. Thus, for this book, your room mate is always able to obtain it
-(as long as the URL remains valid), even if you would delete your ``DataLad-101``
-dataset. Quite useful, this provenance, right?
+thought it would.
 Let's now turn to the fact that the subdataset ``longnow`` contains neither file content nor file metadata information to explore the contents of the
diff --git a/docs/basics/101-117-sharelocal2.rst b/docs/basics/101-117-sharelocal2.rst
index f97335cb9..860bb5f99 100644
--- a/docs/basics/101-117-sharelocal2.rst
+++ b/docs/basics/101-117-sharelocal2.rst
@@ -17,88 +17,126 @@ exactly the specified registered subdataset.
 And you have mesmerized your room mate by showing him how :term:`git-annex` retrieved large file contents from the original dataset.
+Your room mate is excited by this magical command.
+You however begin to wonder: how does DataLad know where to look for that original content?

-Let's now see the :gitannexcmd:`whereis` command in more detail,
-and find out how git-annex knows *where* file content can be obtained from.
-Within the original ``DataLad-101`` dataset, you retrieved some of the ``.mp3``
-files via :dlcmd:`get`, but not others. How will this influence the
-output of :gitannexcmd:`whereis`, you wonder?
-
-Together with your room mate, you decide to find out. You navigate
-back into the installed dataset, and run :gitannexcmd:`whereis` on a
-file that you once retrieved file content for, and on a file
-that you did not yet retrieve file content for.
-Here is the output for the retrieved file:
+This information comes from git-annex.
+Before getting another PDF, let's query git-annex where its content is stored:

+.. index::
+   pair: whereis; git-annex command
+   pair: show file content availability; with git-annex
 .. runrecord:: _examples/DL-101-117-101
    :language: console
    :workdir: dl-101/DataLad-101
-   :notes: More on how git-annex whereis behaves
+   :notes: git-annex whereis to find out where content is stored
    :cast: 04_collaboration

    # navigate back into the clone of DataLad-101
    $ cd ../mock_user/DataLad-101
-   # navigate into the subdirectory
-   $ cd recordings/longnow
-   # file content exists in original DataLad-101 for this file
-   $ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
-
-And here is the output for a file that you did not yet retrieve
-content for in your original ``DataLad-101`` dataset.
+   $ git annex whereis books/TLCL.pdf
+
+Oh, another :term:`shasum` - or, more specifically, a :term:`annex UUID`.
+This time however not in a symlink...
+"That's hard to read -- what is it?" your room mate asks.
+You can recognize a path to the dataset on your computer, prefixed with the user and hostname of your computer.
+"This", you exclaim, excited about your own realization, "is my dataset's location I'm sharing it from!"
+
+.. index::
+   pair: set description for dataset location; with DataLad
+.. find-out-more:: What is this location, and what if I provided a description?
+
+   Back in the very first section of the Basics, :ref:`createDS`, a :ref:`Find-out-more mentioned the '--description' option ` of :dlcmd:`create`.
+   With this option, you can provide a description about the dataset *location*.
+
+   The :gitannexcmd:`whereis` command, finally, is where such a description
+   can become handy: If you had created the dataset with
+
+   .. code-block:: bash
+
+      $ datalad create --description "course on DataLad-101 on my private laptop" -c text2git DataLad-101
+
+   the command would show ``course on DataLad-101 on my private laptop`` after
+   the :term:`shasum` -- and thus a more human-readable description of *where*
+   file content is stored.
+   This becomes especially useful when the number of repository copies
+   increases. If you have only one other dataset it may be easy to
+   remember what and where it is. But once you have one back-up
+   of your dataset on a USB stick, one dataset shared with
+   Dropbox, and a third one on your institutions
+   :term:`GitLab` instance you will be grateful for the descriptions
+   you provided these locations with.
+
+   The current report of the location of the dataset is in the format
+   ``user@host:path``.
+
+   If the physical location of a dataset is not relevant, ambiguous, or volatile,
+   or if it has an :term:`annex` that could move within the foreseeable lifetime of a
+   dataset, a custom description with the relevant information on the dataset is
+   superior. If this is not the case, decide for yourself whether you want to use
+   the ``--description`` option for future datasets or not depending on what you
+   find more readable -- a self-made location description, or an automatic
+   ``user@host:path`` information.
+
+
+The message further informs you that there is only "``(1 copy)``" of this file content.
+This makes sense: There is only your own, original ``DataLad-101`` dataset in which this book is saved.
+
+To retrieve file content of an annexed file such as one of these PDFs, git-annex will try to obtain it from the locations it knows to contain this content.
+It uses the checksums to identify these locations.
+Every copy of a dataset will get a unique ID with such a checksum.
+Note however that just because git-annex knows a certain location where content was once it does not guarantee that retrieval will work.
+If one location is a USB stick that is in your bag pack instead of your USB port, a second location is a hard drive that you deleted all of its previous contents (including dataset content) from,
+and another location is a web server, but you are not connected to the internet, git-annex will not succeed in retrieving contents from these locations.
+As long as there is at least one location that contains the file and is accessible, though, git-annex will get the content.
+Therefore, for the books in your dataset, retrieving contents works because you and your room mate share the same file system.
+If you'd share the dataset with anyone without access to your file system, ``datalad get`` would not work, because it cannot access your files.
+
+But there is one book that does not suffer from this restriction:
+The ``bash_guide.pdf``.
+This book was not manually downloaded and saved to the dataset with ``wget`` (thus keeping DataLad in the dark about where it came from), but it was obtained with the :dlcmd:`download-url` command.
+This registered the books original source in the dataset, and here is why that is useful:

 .. runrecord:: _examples/DL-101-117-102
    :language: console
-   :workdir: dl-101/mock_user/DataLad-101/recordings/longnow
-   :cast: 04_collaboration
+   :workdir: dl-101/mock_user/DataLad-101

-   # but not for this:
-   $ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3
+   $ git annex whereis books/bash_guide.pdf

-As you can see, the file content previously downloaded with a
-:dlcmd:`get` has a third source, your original dataset on your computer.
-The file we did not yet retrieve in the original dataset
-only has only two sources.
+Unlike the ``TLCL.pdf`` book, this book has two sources, and one of them is ``web``.
+The second to last line specifies the precise URL you downloaded the file from.
+Thus, for this book, your room mate is always able to obtain it (as long as the URL remains valid), even if you would delete your ``DataLad-101`` dataset.

-Let's see how this affects a :dlcmd:`get`:
+We can also see a report of the source that git-annex uses to retrieve the content from if we look at the very end of the ``get`` summary.

 .. runrecord:: _examples/DL-101-117-103
    :language: console
-   :workdir: dl-101/mock_user/DataLad-101/recordings/longnow
-   :notes: Get a file that is present in original and one that is not
-   :cast: 04_collaboration
+   :workdir: dl-101/mock_user/DataLad-101

-   # get the first file
-   $ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
+   $ datalad get books/TLCL.pdf
+   $ datalad get books/bash_guide.pdf
+
+Both of these files were retrieved "``from origin...``".
+``Origin`` is Git terminology for "from where the dataset was copied from" -- ``origin`` therefore is the original ``DataLad-101`` dataset.
+If your roommate did not have access to the same file system or you'd deleted your ``DataLad-101`` dataset, the second file would be retrieved "``from web...``" - its registered second source, its original download URL.
+Let's see this in action for another file.
+The ``.mp3`` files in the ``longnow`` seminar series are registered in a similar fashion.

 .. runrecord:: _examples/DL-101-117-104
    :language: console
-   :workdir: dl-101/mock_user/DataLad-101/recordings/longnow
+   :workdir: dl-101/DataLad-101
+   :notes: More on how git-annex whereis behaves
    :cast: 04_collaboration

-   # get the second file
-   $ datalad get Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3
-
-
-The most important thing to note is: It worked in both cases, regardless of whether the original
-``DataLad-101`` dataset contained the file content or not.
-
-We can see that git-annex used two different sources to retrieve the content from,
-though, if we look at the very end of the ``get`` summary.
-The first file was retrieved "``from origin...``". ``Origin`` is Git terminology
-for "from where the dataset was copied from" -- ``origin`` therefore is the
-original ``DataLad-101`` dataset.
-
-The second file was retrieved "``from web...``", and thus from a different source.
-This source is called ``web`` because it actually is a URL through which this particular
-podcast-episode is made available in the first place. You might also have noticed that the
-download from web took longer than the retrieval from the directory on the same
-file system. But we will get into the details
-of this type of content source
-once we cover the ``importfeed`` and ``add-url`` functions [#f1]_.
+   # navigate into the subdirectory
+   $ cd recordings/longnow
+   $ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
+   $ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3

-Let's for now add a note on the :gitannexcmd:`whereis` command. Again, do
-this in the original ``DataLad-101`` directory, and do not forget to save it.
+Quite useful, this provenance, right?
+Let's add a note on the :gitannexcmd:`whereis` command.
+Again, do this in the original ``DataLad-101`` directory, and do not forget to save it.

 .. runrecord:: _examples/DL-101-117-105
    :language: console