Move git-annex whereis content, remove dysfunctional demo
adswa committed Nov 9, 2023
1 parent cd0a100 commit 2efa8b0
Showing 2 changed files with 95 additions and 161 deletions.
106 changes: 1 addition & 105 deletions docs/basics/101-116-sharelocal.rst
@@ -167,111 +167,7 @@ To demonstrate this, you decide to examine the PDFs further.

"Opening this file will work, because the content was retrieved from
the original dataset.", you explain, proud that this worked just as you
thought it would. Your room mate is excited by this magical
command. You however begin to wonder: how does DataLad know where to look for
that original content?

This information comes from git-annex. Before getting the next PDF,
let's query git-annex where its content is stored:

.. index::
pair: whereis; git-annex command
pair: show file content availability; with git-annex
.. runrecord:: _examples/DL-101-116-105
:language: console
:workdir: dl-101/mock_user/DataLad-101
:notes: git-annex whereis to find out where content is stored
:cast: 04_collaboration

$ git annex whereis books/TLCL.pdf

Oh, another :term:`shasum` - or, more specifically, an :term:`annex UUID`. This time however not in a symlink...
"That's hard to read -- what is it?" your room mate asks. You can
recognize a path to the dataset on your computer, prefixed with the user
and hostname of your computer. "This", you exclaim, excited about your own realization,
"is the location I'm sharing my dataset from!"

.. index::
pair: set description for dataset location; with DataLad
.. find-out-more:: What is this location, and what if I provided a description?

Back in the very first section of the Basics, :ref:`createDS`, a :ref:`Find-out-more mentioned the '--description' option <createdescription>` of :dlcmd:`create`.
With this option, you can provide a description about the dataset *location*.

The :gitannexcmd:`whereis` command, finally, is where such a description
comes in handy: If you had created the dataset with

.. code-block:: bash

   $ datalad create --description "course on DataLad-101 on my private laptop" -c text2git DataLad-101

the command would show ``course on DataLad-101 on my private laptop`` after
the :term:`shasum` -- and thus a more human-readable description of *where*
file content is stored.
This becomes especially useful when the number of repository copies
increases. If you have only one other dataset it may be easy to
remember what and where it is. But once you have one back-up
of your dataset on a USB stick, one dataset shared with
Dropbox, and a third one on your institution's
:term:`GitLab` instance, you will be grateful for the descriptions
you provided these locations with.

The current report of the location of the dataset is in the format
``user@host:path``.

If the physical location of a dataset is not relevant, ambiguous, or volatile,
or if it has an :term:`annex` that could move within the foreseeable lifetime of a
dataset, a custom description with the relevant information on the dataset is
superior. If this is not the case, decide for yourself whether you want to use
the ``--description`` option for future datasets or not, depending on what you
find more readable -- a self-made location description, or the automatic
``user@host:path`` information.


The message further informs you that there is only "``(1 copy)``"
of this file content. This makes sense: There
is only your own, original ``DataLad-101`` dataset in which
this book is saved.

To retrieve file content of an annexed file such as one of
these PDFs, git-annex will try
to obtain it from the locations it knows to contain this content.
It uses the IDs shown in the ``whereis`` output to identify these locations:
every copy of a dataset gets its own unique ID.
Note however that the fact that git-annex knows a location
where content once was does not guarantee that retrieval will
work. If one location is a USB stick that is in your backpack instead
of your USB port,
a second location is a hard drive from which you deleted all
previous contents (including dataset content),
and another location is a web server, but you are not connected
to the internet, git-annex will not succeed in retrieving
contents from these locations.
As long as there is at least one location that contains
the file and is accessible, though, git-annex will get the content.
Therefore, for the books in your dataset, retrieving contents works because you
and your room mate share the same file system. If you shared the dataset
with someone who has no access to your file system, ``datalad get`` would not
work, because it could not access your files.

But there is one book that does not suffer from this restriction:
The ``bash_guide.pdf``.
This book was not manually downloaded and saved to the dataset with ``wget``
(thus keeping DataLad in the dark about where it came from), but it was
obtained with the :dlcmd:`download-url` command. This registered
the book's original source in the dataset, and here is why that is useful:

.. runrecord:: _examples/DL-101-116-106
:language: console
:workdir: dl-101/mock_user/DataLad-101

$ git annex whereis books/bash_guide.pdf

Unlike the ``TLCL.pdf`` book, this book has two sources, and one of them is
``web``. The second to last line specifies the precise URL you downloaded the
file from. Thus, for this book, your room mate is always able to obtain it
(as long as the URL remains valid), even if you deleted your ``DataLad-101``
dataset. Quite useful, this provenance, right?
thought it would.

Let's now turn to the fact that the subdataset ``longnow`` contains neither
file content nor file metadata information to explore the contents of the
150 changes: 94 additions & 56 deletions docs/basics/101-117-sharelocal2.rst
@@ -17,88 +17,126 @@ exactly the specified registered subdataset.

And you have mesmerized your room mate by showing him how :term:`git-annex`
retrieved large file contents from the original dataset.
Your room mate is excited by this magical command.
You however begin to wonder: how does DataLad know where to look for that original content?

Let's now see the :gitannexcmd:`whereis` command in more detail,
and find out how git-annex knows *where* file content can be obtained from.
Within the original ``DataLad-101`` dataset, you retrieved some of the ``.mp3``
files via :dlcmd:`get`, but not others. How will this influence the
output of :gitannexcmd:`whereis`, you wonder?

Together with your room mate, you decide to find out. You navigate
back into the installed dataset, and run :gitannexcmd:`whereis` on a
file that you once retrieved file content for, and on a file
that you did not yet retrieve file content for.
Here is the output for the retrieved file:
This information comes from git-annex.
Before getting another PDF, let's query git-annex where its content is stored:

.. index::
pair: whereis; git-annex command
pair: show file content availability; with git-annex
.. runrecord:: _examples/DL-101-117-101
:language: console
:workdir: dl-101/DataLad-101
:notes: More on how git-annex whereis behaves
:notes: git-annex whereis to find out where content is stored
:cast: 04_collaboration

# navigate back into the clone of DataLad-101
$ cd ../mock_user/DataLad-101
# navigate into the subdirectory
$ cd recordings/longnow
# file content exists in original DataLad-101 for this file
$ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3

And here is the output for a file that you did not yet retrieve
content for in your original ``DataLad-101`` dataset.
$ git annex whereis books/TLCL.pdf

Oh, another :term:`shasum` - or, more specifically, an :term:`annex UUID`.
This time however not in a symlink...
"That's hard to read -- what is it?" your room mate asks.
You can recognize a path to the dataset on your computer, prefixed with the user and hostname of your computer.
"This", you exclaim, excited about your own realization, "is the location I'm sharing my dataset from!"

.. index::
pair: set description for dataset location; with DataLad
.. find-out-more:: What is this location, and what if I provided a description?

Back in the very first section of the Basics, :ref:`createDS`, a :ref:`Find-out-more mentioned the '--description' option <createdescription>` of :dlcmd:`create`.
With this option, you can provide a description about the dataset *location*.

The :gitannexcmd:`whereis` command, finally, is where such a description
comes in handy: If you had created the dataset with

.. code-block:: bash

   $ datalad create --description "course on DataLad-101 on my private laptop" -c text2git DataLad-101

the command would show ``course on DataLad-101 on my private laptop`` after
the :term:`shasum` -- and thus a more human-readable description of *where*
file content is stored.
This becomes especially useful when the number of repository copies
increases. If you have only one other dataset it may be easy to
remember what and where it is. But once you have one back-up
of your dataset on a USB stick, one dataset shared with
Dropbox, and a third one on your institution's
:term:`GitLab` instance, you will be grateful for the descriptions
you provided these locations with.

The current report of the location of the dataset is in the format
``user@host:path``.

If the physical location of a dataset is not relevant, ambiguous, or volatile,
or if it has an :term:`annex` that could move within the foreseeable lifetime of a
dataset, a custom description with the relevant information on the dataset is
superior. If this is not the case, decide for yourself whether you want to use
the ``--description`` option for future datasets or not, depending on what you
find more readable -- a self-made location description, or the automatic
``user@host:path`` information.
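
A description can also be set after a dataset was created, because git-annex
provides a ``git annex describe`` command for this purpose. The following is
only a minimal sketch of how one might call it from Python: the helper names
``describe_here_command`` and ``describe_here`` are made up for illustration
and are not part of DataLad or git-annex, and running ``describe_here``
assumes that git-annex is installed and that ``dataset_path`` points to an
existing dataset.

.. code-block:: python

   import subprocess
   from typing import List


   def describe_here_command(description: str) -> List[str]:
       """Build the arguments for ``git annex describe here <description>``.

       ``here`` is git-annex's name for the repository the command runs in.
       """
       return ["git", "annex", "describe", "here", description]


   def describe_here(description: str, dataset_path: str) -> None:
       """Run the command inside the dataset (hypothetical helper)."""
       subprocess.run(
           describe_here_command(description),
           cwd=dataset_path,  # execute inside the dataset directory
           check=True,        # raise an error if git-annex reports a failure
       )

Calling ``describe_here("course on DataLad-101 on my private laptop", "DataLad-101")``
would be one way to attach the description from the example above to an
already existing dataset.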


The message further informs you that there is only "``(1 copy)``" of this file content.
This makes sense: There is only your own, original ``DataLad-101`` dataset in which this book is saved.

To retrieve file content of an annexed file such as one of these PDFs, git-annex will try to obtain it from the locations it knows to contain this content.
It uses the IDs shown in the ``whereis`` output to identify these locations:
every copy of a dataset gets its own unique ID.
Note however that the fact that git-annex knows a location where content once was does not guarantee that retrieval will work.
If one location is a USB stick that is in your backpack instead of your USB port, a second location is a hard drive from which you deleted all previous contents (including dataset content),
and another location is a web server, but you are not connected to the internet, git-annex will not succeed in retrieving contents from these locations.
As long as there is at least one location that contains the file and is accessible, though, git-annex will get the content.
Therefore, for the books in your dataset, retrieving contents works because you and your room mate share the same file system.
If you shared the dataset with someone who has no access to your file system, ``datalad get`` would not work, because it could not access your files.
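
The reasoning in this paragraph can be sketched as a toy model in Python.
This is not how git-annex is implemented: ``Location`` and ``pick_source``
are hypothetical names, and simply taking the first usable location in the
list is an assumption made for illustration only.

.. code-block:: python

   from dataclasses import dataclass
   from typing import List, Optional


   @dataclass
   class Location:
       """Hypothetical record of one place known to have held a file's content."""

       name: str
       can_provide: bool  # reachable right now and still holding the content?


   def pick_source(known_locations: List[Location]) -> Optional[str]:
       """Return the name of the first usable location, or None if there is none."""
       for location in known_locations:
           if location.can_provide:
               return location.name
       return None


   # The three locations from the text above, none of which can provide content.
   unusable = [
       Location("usb-stick", can_provide=False),      # still in the backpack
       Location("old-harddrive", can_provide=False),  # content was deleted
       Location("web", can_provide=False),            # no internet connection
   ]

   # Knowing about locations is not enough if none of them is usable.
   print(pick_source(unusable))

   # A single usable location, such as a dataset on a shared file system,
   # is sufficient for retrieval to succeed.
   print(pick_source(unusable + [Location("origin", can_provide=True)]))

The first call yields ``None`` and the second yields ``origin``, mirroring
the statement that one accessible location with the content is all that is
needed.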

But there is one book that does not suffer from this restriction:
The ``bash_guide.pdf``.
This book was not manually downloaded and saved to the dataset with ``wget`` (thus keeping DataLad in the dark about where it came from), but it was obtained with the :dlcmd:`download-url` command.
This registered the book's original source in the dataset, and here is why that is useful:

.. runrecord:: _examples/DL-101-117-102
:language: console
:workdir: dl-101/mock_user/DataLad-101/recordings/longnow
:cast: 04_collaboration
:workdir: dl-101/mock_user/DataLad-101

# but not for this:
$ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3
$ git annex whereis books/bash_guide.pdf

As you can see, the file content previously downloaded with a
:dlcmd:`get` has a third source, your original dataset on your computer.
The file we did not yet retrieve in the original dataset
has only two sources.
Unlike the ``TLCL.pdf`` book, this book has two sources, and one of them is ``web``.
The second to last line specifies the precise URL you downloaded the file from.
Thus, your room mate is always able to obtain this book (as long as the URL remains valid), even if you deleted your ``DataLad-101`` dataset.

Let's see how this affects a :dlcmd:`get`:
At the very end of the ``get`` summary, we can also see which source git-annex retrieved the content from.

.. runrecord:: _examples/DL-101-117-103
:language: console
:workdir: dl-101/mock_user/DataLad-101/recordings/longnow
:notes: Get a file that is present in original and one that is not
:cast: 04_collaboration
:workdir: dl-101/mock_user/DataLad-101

# get the first file
$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
$ datalad get books/TLCL.pdf
$ datalad get books/bash_guide.pdf

Both of these files were retrieved "``from origin...``".
``Origin`` is Git terminology for "where the dataset was copied from" -- ``origin`` therefore is the original ``DataLad-101`` dataset.
If your room mate did not have access to the same file system, or if you had deleted your ``DataLad-101`` dataset, the second file would be retrieved "``from web...``" instead, from its registered second source: its original download URL.

Let's see this in action for another file.
The ``.mp3`` files in the ``longnow`` seminar series are registered in a similar fashion.

.. runrecord:: _examples/DL-101-117-104
:language: console
:workdir: dl-101/mock_user/DataLad-101/recordings/longnow
:workdir: dl-101/DataLad-101
:notes: More on how git-annex whereis behaves
:cast: 04_collaboration

# get the second file
$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2005_01_15__James_Carse__Religious_War_In_Light_of_the_Infinite_Game.mp3


The most important thing to note is: It worked in both cases, regardless of whether the original
``DataLad-101`` dataset contained the file content or not.

We can see that git-annex used two different sources to retrieve the content from,
though, if we look at the very end of the ``get`` summary.
The first file was retrieved "``from origin...``". ``Origin`` is Git terminology
for "where the dataset was copied from" -- ``origin`` therefore is the
original ``DataLad-101`` dataset.

The second file was retrieved "``from web...``", and thus from a different source.
This source is called ``web`` because it actually is a URL through which this particular
podcast episode is made available in the first place. You might also have noticed that the
download from web took longer than the retrieval from the directory on the same
file system. But we will get into the details
of this type of content source
once we cover the ``importfeed`` and ``add-url`` functions [#f1]_.
# navigate into the subdirectory
$ cd recordings/longnow
$ git annex whereis Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
$ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3

Let's for now add a note on the :gitannexcmd:`whereis` command. Again, do
this in the original ``DataLad-101`` directory, and do not forget to save it.
Quite useful, this provenance, right?
Let's add a note on the :gitannexcmd:`whereis` command.
Again, do this in the original ``DataLad-101`` directory, and do not forget to save it.

.. runrecord:: _examples/DL-101-117-105
:language: console