add new data set

Oliver Beckstein edited this page Jun 16, 2021 · 11 revisions

Outline

MDAnalysisData does not store files and trajectories. Instead, it provides accessor code to seamlessly download (and cache) files from archives.

When you contribute data, you have to do two things:

  1. deposit data in an archive under an Open Data compatible license (CC0 or CC-BY preferred)

    We currently have code to work with figshare, but it should be straightforward to add code to work with other archive-grade repositories such as Zenodo or DataDryad.

  2. write accessor code in MDAnalysisData

    The accessor code needs the stable archive URL(s) for your files and their SHA256 checksums to check the integrity of any downloaded files. You will also add a description of your data set.

Process

Adding a new data set requires:

  1. Put the data on figshare (or another archive-grade repository such as Zenodo or DataDryad; some universities also provide digital repositories that are suitable). The site must provide stable download links and must not change the content during download because we store a SHA256 checksum. Make sure to choose an Open Data compatible license (CC0 or CC-BY preferred).
  2. Add a Python module such as MDAnalysisData/adk_equilibrium.py; in many cases you can copy the module and adapt the following:
    • text
    • NAME: name of the data set; it will be used as a file name, so do not use spaces or other special characters
    • DESCRIPTION: filename of the description file (reStructuredText format, so it has the suffix .rst)
    • ARCHIVE: dictionary containing RemoteFileMetadata instances. Keys should describe the file type. Typically
      • topology: topology file (PSF, TPR, ...)
      • trajectory: trajectory coordinate file (DCD, XTC, ...)
      • structure (optional): system with single frame of coordinates (typically PDB, GRO, CRD, ...)
    • name of the fetch_NAME function
    • docs of the fetch_NAME function
  3. Add a description file such as MDAnalysisData/descr/adk_equilibrium.rst; copy this file and adapt. Make sure to add license information.
  4. Import your fetch_NAME function in MDAnalysisData/datasets.py.
  5. Add docs in reStructuredText format under docs/ (take existing files as examples).
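
The module-level pieces from step 2 can be sketched as follows. This is only an illustration, not the contents of adk_equilibrium.py: the data set name, file names, URLs, and checksum strings are hypothetical placeholders, and the namedtuple is a stand-in that assumes RemoteFileMetadata has the fields filename, url, and checksum (as the class of the same name in scikit-learn does). In a real module you would import RemoteFileMetadata from MDAnalysisData.base instead.

```python
from collections import namedtuple

# Stand-in so that this sketch runs on its own. ASSUMPTION: the real
# MDAnalysisData.base.RemoteFileMetadata has the same three fields.
RemoteFileMetadata = namedtuple("RemoteFileMetadata",
                                ["filename", "url", "checksum"])

# Hypothetical data set name; it is used as a file name, so no spaces.
NAME = "my_dataset"
# Description file that would live in MDAnalysisData/descr/.
DESCRIPTION = "my_dataset.rst"

# Keys describe the file type. All file names, URLs, and checksums
# below are placeholders, not real archive entries.
ARCHIVE = {
    "topology": RemoteFileMetadata(
        filename="my_system.psf",
        url="https://example.org/files/my_system.psf",
        checksum="<SHA256 hex digest of my_system.psf>",
    ),
    "trajectory": RemoteFileMetadata(
        filename="my_trajectory.dcd",
        url="https://example.org/files/my_trajectory.dcd",
        checksum="<SHA256 hex digest of my_trajectory.dcd>",
    ),
}
```

A fetch_NAME function (here it would be called fetch_my_dataset) would then go through the entries of ARCHIVE and download each file that is not yet cached; the existing modules are the best reference for the details.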

If your data set does not follow the same pattern as the example above (where each file is downloaded separately) then you have to write your own fetch_NAME() function. E.g., you might download a tar file and then unpack the file yourself. Use scikit-learn's sklearn/datasets as examples, make sure that your function sets appropriate attributes in the returned Bunch of records, and fully document what is returned.
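
As a rough sketch of the unpacking part of such a custom function, the following uses only the Python standard library. The names Bunch (a minimal stand-in for the scikit-learn style container), unpack_dataset, and the attributes files and DESCR are assumptions chosen for illustration; the download step and the checksum test are left out.

```python
import os
import tarfile


class Bunch(dict):
    """Minimal stand-in for a Bunch: a dict with attribute access."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


def unpack_dataset(tar_path, target_dir, description=""):
    """Unpack an already downloaded tar archive into target_dir.

    Returns a Bunch with the sorted paths of the extracted regular
    files (``files``) and a description string (``DESCR``).
    """
    os.makedirs(target_dir, exist_ok=True)
    root = os.path.realpath(target_dir)
    with tarfile.open(tar_path) as tar:
        members = tar.getmembers()
        # Reject members whose names would place them outside target_dir.
        for member in members:
            dest = os.path.realpath(os.path.join(root, member.name))
            if os.path.commonpath([root, dest]) != root:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(root)
    files = sorted(os.path.join(root, m.name) for m in members if m.isfile())
    return Bunch(files=files, DESCR=description)
```

A real fetch_NAME() would typically expose more specific attributes (for example topology and trajectory) rather than a flat list of files, and its docstring should list every attribute that it sets.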

RemoteFileMetadata

The RemoteFileMetadata is used by base._fetch_remote(). Typically you will have a local copy of the files during testing. You can compute the SHA256 with the following code:

import MDAnalysisData.base
MDAnalysisData.base._sha256(FILENAME)

or from the command line:

 python -c 'import MDAnalysisData; print(MDAnalysisData.base._sha256("FILENAME"))'
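
If MDAnalysisData is not installed where the files live, the same kind of checksum can be computed with the standard library alone. The helper name sha256_of_file and the chunk size are arbitrary choices for this sketch; it is not the package's own _sha256 implementation, although any correct SHA256 of the same file yields the same hex digest.

```python
import hashlib


def sha256_of_file(path, chunk_size=8192):
    """Return the SHA256 hex digest of the file at path.

    The file is read in chunks so that large trajectories do not
    have to fit into memory.
    """
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()
```

Calling sha256_of_file("FILENAME") returns a 64-character hexadecimal string that can be pasted into the checksum field of the corresponding ARCHIVE entry.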