add new data set

Adding a new data set requires:

Put the data on figshare (or another archive-grade repository such as zenodo or DataDryad; some universities also provide suitable digital repositories). The site must provide stable download links and must not alter the content during download, because we store a SHA256 checksum for each file. Make sure to choose an Open Data compatible license (CC0 or CC-BY preferred).
Add a Python module such as MDAnalysisData/adk_equilibrium.py; in many cases you can copy that module and adapt the following:
- text
- NAME: name of the data set; it is used as a file name, so do not use spaces or other special characters
- DESCRIPTION: filename of the description file (reStructuredText format, so it has the suffix .rst)
- ARCHIVE: dictionary containing RemoteFileMetadata instances. Keys should describe the file type, typically:
  - topology: topology file (PSF, TPR, ...)
  - trajectory: trajectory coordinate file (DCD, XTC, ...)
  - structure (optional): system with single frame of coordinates (typically PDB, GRO, CRD, ...)
- name of the fetch_NAME function
- docs of the fetch_NAME function
Add a description file such as MDAnalysisData/descr/adk_equilibrium.rst; copy this file and adapt it. Make sure to add license information.
Import your fetch_NAME function in MDAnalysisData/datasets.py.
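
The module-level definitions described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual contents of `adk_equilibrium.py`: the `RemoteFileMetadata` namedtuple is a local stand-in whose fields (`filename`, `url`, `checksum`) are an assumption modelled on scikit-learn, the `sha256sum` helper is hypothetical, and the data set name, file names, URLs, and checksum strings are placeholders.

```python
import hashlib
from collections import namedtuple

# Local stand-in for the package's RemoteFileMetadata; the field names
# are an assumption, not copied from MDAnalysisData.
RemoteFileMetadata = namedtuple(
    "RemoteFileMetadata", ["filename", "url", "checksum"])


def sha256sum(path, chunk_size=8192):
    """Return the SHA256 hex digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as infile:
        # Read in chunks so that large trajectory files fit in memory.
        for chunk in iter(lambda: infile.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Placeholder values: replace them with your own data set's details.
NAME = "my_dataset"              # used as a file name: no spaces
DESCRIPTION = "my_dataset.rst"   # description file in reStructuredText
ARCHIVE = {
    "topology": RemoteFileMetadata(
        filename="system.psf",
        url="https://example.org/files/system.psf",
        checksum="<output of sha256sum('system.psf')>",
    ),
    "trajectory": RemoteFileMetadata(
        filename="trajectory.dcd",
        url="https://example.org/files/trajectory.dcd",
        checksum="<output of sha256sum('trajectory.dcd')>",
    ),
}
```

Running `sha256sum` on your local copy of each file before uploading gives the value to record in the corresponding `checksum` field.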

If your data set does not follow the same pattern as the example above (where each file is downloaded separately), then you have to write your own fetch_NAME() function. For example, you might download a tar file and then unpack it yourself. Use the modules in scikit-learn's sklearn/datasets as examples. Make sure that your function sets appropriate attributes in the returned Bunch of records, and fully document what is returned.
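
As a rough sketch of the tar-file case, the unpacking step of a custom fetch function might look like the code below. Everything here is an assumption for illustration: the `Bunch` class is a minimal local stand-in rather than the package's own, `unpack_and_describe` is a hypothetical helper name, and the attribute names `files` and `DESCR` are only examples. The download and checksum verification that a real `fetch_NAME()` function would perform first are left out.

```python
import os
import tarfile


class Bunch(dict):
    """Minimal stand-in for a Bunch: a dict with attribute access."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


def unpack_and_describe(tar_path, data_dir, description=""):
    """Unpack `tar_path` into `data_dir` and return a Bunch of records.

    Returns a Bunch with the attributes
    - files: sorted list of paths to the unpacked regular files
    - DESCR: the description text passed in
    """
    os.makedirs(data_dir, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        # Only unpack archives you trust: extractall() does not
        # sanitize member paths on all Python versions.
        tar.extractall(path=data_dir)
        members = [m.name for m in tar.getmembers() if m.isfile()]
    files = sorted(os.path.join(data_dir, name) for name in members)
    return Bunch(files=files, DESCR=description)
```

Documenting the returned attributes in the docstring, as done here for `files` and `DESCR`, tells users of the data set what they can expect from the returned object.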
