add new data set

Oliver Beckstein edited this page Oct 4, 2018 · 11 revisions

Outline

Adding a new data set requires:

  1. Put the data on figshare (or another archive-grade repository such as Zenodo or DataDryad; some universities also provide suitable digital repositories). The site must provide stable download links and must not change the content on download because we store a SHA256 checksum. Make sure to choose an Open Data compatible license (CC0 or CC-BY preferred).
  2. Add a Python module such as MDAnalysisData/adk_equilibrium.py; in many cases you can copy that module and adapt the following:
    • text
    • NAME: name of the data set; it is also used as a file name, so do not use spaces or other characters that are problematic in file names
    • DESCRIPTION: filename of the description file (reStructuredText format, so it has the suffix .rst)
    • ARCHIVE: dictionary containing RemoteFileMetadata instances. Keys should describe the file type. Typical keys are:
      • topology: topology file (PSF, TPR, ...)
      • trajectory: trajectory coordinate file (DCD, XTC, ...)
      • structure (optional): system with single frame of coordinates (typically PDB, GRO, CRD, ...)
    • name of the fetch_NAME function
    • docs of the fetch_NAME function
  3. Add a description file such as MDAnalysisData/descr/adk_equilibrium.rst; copy this file and adapt it. Make sure to add license information.
  4. Import your fetch_NAME function in MDAnalysisData/datasets.py.
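
The module-level names from step 2 can be sketched as follows. This is only an illustration: the RemoteFileMetadata namedtuple below is a stand-in, assumed to carry filename, url and checksum fields as in scikit-learn's dataset fetchers; a real module would import the class from MDAnalysisData.base instead. The data set name, file names, URLs, checksums and the check_archive helper are hypothetical placeholders.

```python
from collections import namedtuple

# Stand-in for the real class; assumed to carry filename, url and checksum
# (as in scikit-learn's dataset fetchers). In an actual module, import it
# from MDAnalysisData.base instead of defining it here.
RemoteFileMetadata = namedtuple("RemoteFileMetadata",
                                ["filename", "url", "checksum"])

NAME = "my_dataset"              # hypothetical; used as a file name, so no spaces
DESCRIPTION = "my_dataset.rst"   # description file in MDAnalysisData/descr/

# Keys describe the file type; all values below are placeholders.
ARCHIVE = {
    "topology": RemoteFileMetadata(
        filename="system.psf",
        url="https://example.org/files/system.psf",
        checksum="0" * 64,  # replace with the real SHA256 hex digest
    ),
    "trajectory": RemoteFileMetadata(
        filename="trajectory.dcd",
        url="https://example.org/files/trajectory.dcd",
        checksum="0" * 64,  # replace with the real SHA256 hex digest
    ),
}


def check_archive(archive):
    """Sanity-check an ARCHIVE dict: SHA256 hex digests have 64 hex characters."""
    for key, meta in archive.items():
        if len(meta.checksum) != 64:
            raise ValueError("bad checksum length for {!r}".format(key))
        int(meta.checksum, 16)  # raises ValueError if not hexadecimal
    return True
```

A quick check such as check_archive(ARCHIVE) catches truncated or mistyped checksums before the first download is attempted.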

If your data set does not follow the same pattern as the example above (where each file is downloaded separately), you have to write your own fetch_NAME() function; for example, you might download a tar file and then unpack it yourself. Use the fetchers in scikit-learn's sklearn/datasets as examples. Make sure that your function sets appropriate attributes in the returned Bunch of records, and fully document what is returned.
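
The unpacking part of such a custom function might look like the sketch below. It uses only the standard library; the Bunch class is a minimal stand-in for the real one, and unpack_archive, its arguments and the file names are hypothetical. Downloading and checksum verification are deliberately left out.

```python
import os
import tarfile


class Bunch(dict):
    """Minimal dict with attribute access (stand-in for the real Bunch)."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


def unpack_archive(tar_path, target_dir, members):
    """Unpack selected files from `tar_path` and return a Bunch of their paths.

    `members` maps record names (e.g. "topology") to the file names
    expected inside the archive.
    """
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        names = set(tar.getnames())
        missing = [f for f in members.values() if f not in names]
        if missing:
            raise IOError("archive lacks expected files: {}".format(missing))
        # Extract only the files we expect, not everything in the archive.
        for filename in members.values():
            tar.extract(filename, path=target_dir)
    return Bunch({key: os.path.join(target_dir, filename)
                  for key, filename in members.items()})
```

Returning the paths under the same keys that would otherwise be used in ARCHIVE (topology, trajectory, ...) keeps the records consistent with data sets whose files are downloaded separately.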

RemoteFileMetadata

RemoteFileMetadata is used by base._fetch_remote(). Typically you will have a local copy of the files during testing, so you can compute the SHA256 checksum of each file with the following code:

import MDAnalysisData.base
MDAnalysisData.base._sha256(filename)
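
If MDAnalysisData cannot be imported where the files are stored, a checksum can also be computed with the standard library alone. This is a minimal sketch, assuming the checksum is the plain SHA256 hex digest of the file contents; sha256_of_file is a hypothetical name and not part of MDAnalysisData.

```python
import hashlib


def sha256_of_file(path, chunk_size=8192):
    """Return the SHA256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so that large trajectory files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string is what would go into the checksum field of the corresponding RemoteFileMetadata entry.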