GitHub - LAAC-LSCP/datalad-procedures: Procedures for creating new datasets

Installation instructions

This assumes you've already installed ChildProject. If you haven't, start by following the ChildProject installation instructions.

If needed, activate the ChildProjectVenv virtual environment

conda activate childproject

If the above line doesn't work, you may have installed ChildProject generally, rather than in a virtual environment.

Warning: If none of this rings a bell, you may not have installed ChildProject at all. To do so, follow the ChildProject installation instructions.

Download the package

git clone git@github.com:LAAC-LSCP/datalad-procedures.git
cd datalad-procedures

Install the dependencies

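# Debian/Ubuntu (may require sudo); falls back to Homebrew on macOS
apt-get install git-annex || brew install git-annex
# Python dependencies of the procedures
pip3 install -r requirements.txt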
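
Before moving on, it may help to confirm that the tools are reachable from the current shell. The sketch below is not part of this repository: `missing_tools` is a hypothetical helper that relies only on the standard `command -v` builtin.

```shell
# Hypothetical helper (not part of datalad-procedures):
# prints each command from its arguments that is not found on PATH,
# and returns non-zero if at least one is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Tools used in the steps of this README.
missing_tools git git-annex datalad || echo "install the missing tools before continuing"
```

If nothing is printed, all three commands were found.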

Install the procedures

python3 install.py

At this point, a message may ask you if you want to establish a fingerprint; say yes.

Check the installation

datalad run-procedure --discover

Expected output:

cfg_laac1 (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_laac1.py) [python_script]
cfg_yoda (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_yoda.py) [python_script]
cfg_el1000 (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_el1000.py) [python_script]
cfg_text2git (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_text2git.py) [python_script]
cfg_metadatatypes (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_metadatatypes.py) [python_script]
cfg_laac2 (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_laac2.py) [python_script]
cfg_laac (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_laac.py) [python_script]
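
The listing can be long, so a small filter may make the check easier. This is a minimal sketch, assuming the listing is printed on standard output in the format shown above; `missing_procedures` is a hypothetical helper, not something installed by this package.

```shell
# Hypothetical helper: reads the `datalad run-procedure --discover` listing on
# stdin and reports which of the procedures from this repository are absent.
missing_procedures() {
    listing=$(cat)
    rc=0
    for name in cfg_laac cfg_laac1 cfg_laac2 cfg_el1000; do
        # Each entry in the listing starts with "<name> (", so "cfg_laac ("
        # does not match the entries for cfg_laac1 or cfg_laac2.
        case "$listing" in
            *"$name ("*) ;;
            *) echo "not installed: $name"; rc=1 ;;
        esac
    done
    return "$rc"
}

# Run against the real command only when datalad is available.
if command -v datalad >/dev/null 2>&1; then
    datalad run-procedure --discover | missing_procedures \
        || echo "re-run: python3 install.py"
fi
```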

Usage

The LAAC template

This is the default template you should use most of the time. This will use GIN for hosting the different siblings of your dataset.

This template can create up to 3 remotes:

origin : the default remote; it holds all the annexed data except the data stored under confidential folder(s)
confidential : this remote only stores the content of confidential files (those under a confidential folder, anywhere in the dataset)
public : this remote holds the data that we consider non-sensitive. Converted automated annotations can be shared without risk, so this includes:
- vtc converted annotations : under annotations/vtc/converted
- vcm converted annotations : under annotations/vcm/converted
- alice converted annotations : under annotations/alice/output/converted, which is usually the conversion of the alice output, and under annotations/alice/converted, which is usually the merge of alice and vtc (to get the speaker_type information)
- its converted annotations : under annotations/its/converted

Keep in mind that these are fixed paths, so when adding annotations, use exactly these set names to make sure they get correctly published to your public remote.
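
To double-check a path before adding annotations, something like the following could be used. It is only a sketch that mirrors the list above: `is_public_path` is a hypothetical helper, the actual routing is done by the template's own configuration, and the sketch ignores confidential folders.

```shell
# Hypothetical helper: succeeds when a dataset-relative path falls under one of
# the fixed locations listed above for the public remote.
is_public_path() {
    case "$1" in
        annotations/vtc/converted/*|\
        annotations/vcm/converted/*|\
        annotations/alice/output/converted/*|\
        annotations/alice/converted/*|\
        annotations/its/converted/*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# The match is case-sensitive, so a set named "VTC" is not recognised.
for path in annotations/vtc/converted/rec1.csv annotations/VTC/converted/rec1.csv; do
    if is_public_path "$path"; then
        echo "$path: public"
    else
        echo "$path: not public"
    fi
done
```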

Because we use GIN for hosting your remote repositories, you need to make sure to set up your ssh key:

go to https://gin.g-node.org/
log in
click on your avatar at the top right and choose parameters
click on SSH keys on the left, then click on add a key
run cat ~/.ssh/id_rsa.pub. If you get an error, your computer does not yet have an ssh key, so follow these instructions to create one; otherwise, copy the output and paste it into the key area
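
The last step can be wrapped in a small check. This is a sketch under the assumption that the key lives at the default path used above; `show_public_key` is a hypothetical helper, and the `ssh-keygen` call in its message is one common way to create an RSA key, not necessarily the one recommended by the linked instructions.

```shell
# Hypothetical helper: prints the public key if it exists, otherwise explains
# how one could be created. Defaults to the path used in the step above.
show_public_key() {
    key_file="${1:-$HOME/.ssh/id_rsa.pub}"
    if [ -f "$key_file" ]; then
        cat "$key_file"
    else
        echo "no key at $key_file; create one, e.g. with: ssh-keygen -t rsa -b 4096" >&2
        return 1
    fi
}

# Print the key to copy into the GIN key area, or a hint if there is none.
show_public_key || true
```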

Then you can create datasets as follows:

Using the browser capabilities on GIN, create up to three empty repositories in your GIN organization (only one or two if you do not need the confidential and/or public repos): <dataset-name>, <dataset-name>-public and <dataset-name>-confidential, e.g. dataset1, dataset1-public and dataset1-confidential. When creating the first one (i.e. the non-confidential, non-public repository), notice that (a) you need to create the repo from the organization (and not your personal account) and (b) you need to uncheck the box at the bottom during actual creation.

Run the following script (edit the environment variables to suit your configuration):

export GIN_ORGANIZATION='LAAC-LSCP' # name of your GIN organization
export CONFIDENTIAL_DATASET=0 # set to 1 if there should be a confidential sibling
export PUBLIC_DATASET=0 # set to 1 if there should be a public sibling
datalad create -c laac dataset-name

The output you'll see looks something like this:

[INFO ] Creating a new annex repo at /Users/acristia/Documents/git-data/rague
[INFO ] Scanning for unlocked files (this may take some time)
[INFO ] Running procedure cfg_laac
[INFO ] == Command start (output follows) =====
[INFO ] Could not enable annex remote origin. This is expected if origin is a pure Git remote, or happens if it is not accessible.
[WARNING] Could not detect whether origin carries an annex. If origin is a pure Git remote, this is expected.
.: origin(-) [git@gin.g-node.org:/LAAC-LSCP/rague.git (git)]
.: origin(+) [git@gin.g-node.org:/LAAC-LSCP/rague.git (git)]
[INFO ] Could not enable annex remote confidential. This is expected if confidential is a pure Git remote, or happens if it is not accessible.
[WARNING] Could not detect whether confidential carries an annex. If confidential is a pure Git remote, this is expected.
.: confidential(-) [git@gin.g-node.org:/LAAC-LSCP/rague-confidential.git (git)]
.: confidential(+) [git@gin.g-node.org:/LAAC-LSCP/rague-confidential.git (git)]
[INFO ] Configure additional publication dependency on "confidential"
.: origin(+) [git@gin.g-node.org:/LAAC-LSCP/rague.git (git)]
[INFO ] == Command exit (modification check follows) =====
create(ok): /Users/acristia/Documents/git-data/rague (dataset)
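
Since the GIN repositories must exist before the script is run, it can be handy to list them from the same environment variables. This is a minimal sketch based on the naming convention described above; `repos_to_create` is a hypothetical helper and does not contact GIN.

```shell
# Hypothetical helper: prints the GIN repository names that should exist before
# running `datalad create -c laac <dataset-name>`, based on the
# CONFIDENTIAL_DATASET and PUBLIC_DATASET flags (both default to 0 here).
repos_to_create() {
    dataset="$1"
    echo "${dataset}"
    if [ "${CONFIDENTIAL_DATASET:-0}" = "1" ]; then
        echo "${dataset}-confidential"
    fi
    if [ "${PUBLIC_DATASET:-0}" = "1" ]; then
        echo "${dataset}-public"
    fi
}

# Example: a dataset named rague with a confidential sibling and no public one.
# The subshell keeps the assignments from leaking into the current shell.
(CONFIDENTIAL_DATASET=1; PUBLIC_DATASET=0; repos_to_create rague)
```

For this example the helper prints rague and rague-confidential, which matches the two remotes that appear in the output above.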

The LAAC1 template

The LAAC1 template creates a dataset with two siblings:

One on GitHub
One on a specified SSH location

export GITHUB_ORGANIZATION="LAAC-LSCP" # name of your GitHub organization
export DATASET_PATH="/location/of/your/datasets/" # remote location of the dataset on the server
export SSH_HOSTNAME="your.cluster.com" # hostname/alias of your ssh server

datalad create -c laac1 dataset-name
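
Because this template reads three environment variables, a quick check that none of them is empty may save a failed run. The following is a sketch only; `require_vars` is a hypothetical helper and says nothing about how the template itself validates its input.

```shell
# Hypothetical helper: reports every named variable that is unset or empty
# and returns non-zero if there is at least one.
require_vars() {
    rc=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name" >&2
            rc=1
        fi
    done
    return "$rc"
}

if require_vars GITHUB_ORGANIZATION DATASET_PATH SSH_HOSTNAME; then
    echo "ready to run: datalad create -c laac1 dataset-name"
fi
```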

The LAAC2 template

The LAAC2 template creates a dataset with three GIN siblings:

One containing all the data
One containing confidential data, but not the recordings
One containing all non-confidential data

Since this relies on GIN, you need to make sure to set up your ssh key:

go to https://gin.g-node.org/
log in
click on your avatar at the top right and choose parameters
click on SSH keys on the left, then click on add a key
run cat ~/.ssh/id_rsa.pub. If you get an error, your computer does not yet have an ssh key, so follow these instructions to create one; otherwise, copy the output and paste it into the key area

Then you can create datasets as follows:

export GIN_ORGANIZATION="LAAC-LSCP" # name of your GIN organization
datalad create -c laac2 dataset-name

The EL1000 template

You should not use this template anymore, as the LAAC template can do the same (and more).

Since this template relies on GIN, you need to make sure to set up your ssh key:

go to https://gin.g-node.org/
log in
click on your avatar at the top right and choose parameters
click on SSH keys on the left, then click on add a key
run cat ~/.ssh/id_rsa.pub. If you get an error, your computer does not yet have an ssh key, so follow these instructions to create one; otherwise, copy the output and paste it into the key area

Then you can create datasets as follows:

Using the browser capabilities on GIN, create two empty repositories in your GIN organization: <dataset-name> and <dataset-name>-confidential, e.g. dataset1 and dataset1-confidential. When creating the first one (i.e. the non-confidential repository), notice that (a) you need to create the repo from the organization (and not your personal account) and (b) you need to uncheck the box at the bottom during actual creation.

Run the following script (edit the environment variables to suit your configuration):

export GIN_ORGANIZATION='EL1000' # name of your GIN organization
export CONFIDENTIAL_DATASET=0 # set to 1 if there should be a confidential sibling
datalad create -c el1000 dataset-name

For instance, for a dataset named rague that has some confidential content, we'd do the following:

export GIN_ORGANIZATION='EL1000' # name of your GIN organization
export CONFIDENTIAL_DATASET=1 # set to 1 if there should be a confidential sibling
datalad create -c el1000 rague

And here is an example of a dataset that has no confidential content:

export GIN_ORGANIZATION='EL1000' # name of your GIN organization
export CONFIDENTIAL_DATASET=0 # set to 1 if there should be a confidential sibling
datalad create -c el1000 lyon

The output you'll see looks like this:

[INFO ] Creating a new annex repo at /Users/acristia/Documents/git-data/rague
[INFO ] Scanning for unlocked files (this may take some time)
[INFO ] Running procedure cfg_el1000
[INFO ] == Command start (output follows) =====
[INFO ] Could not enable annex remote origin. This is expected if origin is a pure Git remote, or happens if it is not accessible.
[WARNING] Could not detect whether origin carries an annex. If origin is a pure Git remote, this is expected.
.: origin(-) [git@gin.g-node.org:/EL1000/rague.git (git)]
.: origin(+) [git@gin.g-node.org:/EL1000/rague.git (git)]
[INFO ] Could not enable annex remote confidential. This is expected if confidential is a pure Git remote, or happens if it is not accessible.
[WARNING] Could not detect whether confidential carries an annex. If confidential is a pure Git remote, this is expected.
.: confidential(-) [git@gin.g-node.org:/EL1000/rague-confidential.git (git)]
.: confidential(+) [git@gin.g-node.org:/EL1000/rague-confidential.git (git)]
[INFO ] Configure additional publication dependency on "confidential"
.: origin(+) [git@gin.g-node.org:/EL1000/rague.git (git)]
[INFO ] == Command exit (modification check follows) =====
create(ok): /Users/acristia/Documents/git-data/rague (dataset)
