This assumes you've already installed ChildProject. If you haven't, start by following these instructions.
conda activate childproject
If the above line doesn't work, you may have installed ChildProject globally, rather than in a virtual environment.
Warning: if none of this rings a bell, you may not have installed ChildProject at all. To do so, follow these instructions.
git clone git@github.com:LAAC-LSCP/datalad-procedures.git
cd datalad-procedures
apt-get install git-annex || brew install git-annex
pip3 install -r requirements.txt
python3 install.py
At this point, a message may ask you if you want to establish a fingerprint; say yes.
datalad run-procedure --discover
Expected output:
cfg_laac1 (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_laac1.py) [python_script]
cfg_yoda (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_yoda.py) [python_script]
cfg_el1000 (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_el1000.py) [python_script]
cfg_text2git (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_text2git.py) [python_script]
cfg_metadatatypes (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_metadatatypes.py) [python_script]
cfg_laac2 (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_laac2.py) [python_script]
cfg_laac (/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/datalad/resources/procedures/cfg_laac.py) [python_script]
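To confirm that the installation registered the templates, one could scan the discovery output for their names. The sketch below is only an illustration: `missing_procedures` is a hypothetical helper, not part of datalad or of this repository, and it assumes the listing format shown above, where each procedure name is followed by its path in parentheses.

```shell
# Hypothetical helper: reads `datalad run-procedure --discover` output on stdin
# and prints the name of each expected procedure that is absent from it.
missing_procedures() {
    discovered=$(cat)
    for name in cfg_laac cfg_laac1 cfg_laac2 cfg_el1000; do
        # Entries look like "<name> (<path>) [python_script]", so search for
        # the name immediately followed by " (" to avoid matching cfg_laac1
        # when looking for cfg_laac.
        case "$discovered" in
            *"$name ("*) ;;
            *) printf '%s\n' "$name" ;;
        esac
    done
}

# Usage (requires datalad on PATH):
#   datalad run-procedure --discover | missing_procedures
```

If the helper prints nothing, all four templates were found; any name it prints is absent from the listing.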
The LAAC template is the default one and the one you should use most of the time. It uses GIN to host the different siblings of your dataset.
This template can create up to 3 remotes:
- origin: this is the default remote; it holds all the annexed data except for the data stored under confidential folder(s)
- confidential: this remote only stores the content of confidential files (under a confidential folder, anywhere in the dataset)
- public: this remote holds the data that we consider non-sensitive. Converted automated annotations can be shared without risk, so this includes:
  - vtc converted annotations: under annotations/vtc/converted
  - vcm converted annotations: under annotations/vcm/converted
  - alice converted annotations: under annotations/alice/output/converted, which is usually the conversion of the alice output, and annotations/alice/converted, which is usually the merge of alice and vtc (to get the speaker_type information)
  - its converted annotations: under annotations/its/converted
Keep in mind that these paths are fixed, so when adding annotations, use those exact set names to make sure they get correctly published to your public remote.
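Because the public paths are fixed, it can help to check a file path against them before importing annotations. The following is a minimal sketch under assumptions: `is_public_path` is a hypothetical helper, paths are taken to be relative to the dataset root, and the check only mirrors the list above rather than querying the template's actual configuration.

```shell
# Hypothetical helper: succeeds if the given path lies under one of the
# annotation folders listed above for the public remote.
is_public_path() {
    for prefix in \
        annotations/vtc/converted \
        annotations/vcm/converted \
        annotations/alice/output/converted \
        annotations/alice/converted \
        annotations/its/converted
    do
        case "$1" in
            "$prefix"/*) return 0 ;;
        esac
    done
    return 1
}

# Usage:
#   is_public_path annotations/vtc/converted/rec1.csv && echo "goes to public"
```

A set stored under a different name, for example a hypothetical `annotations/vtc2/converted`, would not match and so would not be expected on the public remote.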
Because we use GIN for hosting your remote repositories, you need to make sure to set up your ssh key:
- go to https://gin.g-node.org/
- log in
- click on top right on your avatar, choose parameters
- click on SSH keys at the left, then click on add a key
- run cat ~/.ssh/id_rsa.pub; if you get an error, your computer does not yet have an ssh key, so follow these instructions to create one; otherwise, copy the output and paste it into the key area
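The last step above can be wrapped in a small check that either prints the key or explains what is missing. This is a sketch, not an official procedure: `show_public_key` is a hypothetical helper, and the `ssh-keygen` invocation in the message is one common way to create an RSA key, which may differ from the linked instructions.

```shell
# Hypothetical helper: prints the public key if the file is readable,
# otherwise reports the problem on stderr and fails.
# The default location matches the one used in the steps above.
show_public_key() {
    key_file=${1:-"$HOME/.ssh/id_rsa.pub"}
    if [ -r "$key_file" ]; then
        cat "$key_file"
    else
        echo "No public key at $key_file; create one first, e.g. with: ssh-keygen -t rsa -b 4096" >&2
        return 1
    fi
}

# Usage:
#   show_public_key            # checks ~/.ssh/id_rsa.pub
#   show_public_key other.pub  # checks another key file
```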
Then you can create datasets as follows:
- Using the browser capabilities on GIN, create up to three empty repositories in your GIN organization (only one or two if you do not need the confidential and/or public repositories): <dataset-name>, <dataset-name>-public and <dataset-name>-confidential, e.g. dataset1, dataset1-public and dataset1-confidential. When creating the first one (i.e. neither confidential nor public), notice that (a) you need to create the repo from the organization (and not your personal account) and (b) you need to uncheck the box at the bottom during actual creation.
- Run the following script (edit the environment variables to suit your configuration):
export GIN_ORGANIZATION='LAAC-LSCP' # name of your GIN organization
export CONFIDENTIAL_DATASET=0 # set to 1 if there should be a confidential sibling
export PUBLIC_DATASET=0 # set to 1 if there should be a public sibling
datalad create -c laac dataset-name
The output you'll see looks something like this:
[INFO ] Creating a new annex repo at /Users/acristia/Documents/git-data/rague
[INFO ] Scanning for unlocked files (this may take some time)
[INFO ] Running procedure cfg_laac
[INFO ] == Command start (output follows) =====
[INFO ] Could not enable annex remote origin. This is expected if origin is a pure Git remote, or happens if it is not accessible.
[WARNING] Could not detect whether origin carries an annex. If origin is a pure Git remote, this is expected.
.: origin(-) [git@gin.g-node.org:/LAAC-LSCP/rague.git (git)]
.: origin(+) [git@gin.g-node.org:/LAAC-LSCP/rague.git (git)]
[INFO ] Could not enable annex remote confidential. This is expected if confidential is a pure Git remote, or happens if it is not accessible.
[WARNING] Could not detect whether confidential carries an annex. If confidential is a pure Git remote, this is expected.
.: confidential(-) [git@gin.g-node.org:/LAAC-LSCP/rague-confidential.git (git)]
.: confidential(+) [git@gin.g-node.org:/LAAC-LSCP/rague-confidential.git (git)]
[INFO ] Configure additional publication dependency on "confidential"
.: origin(+) [git@gin.g-node.org:/LAAC-LSCP/rague.git (git)]
[INFO ] == Command exit (modification check follows) =====
create(ok): /Users/acristia/Documents/git-data/rague (dataset)
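The names of the repositories to create on GIN follow from the dataset name and the two flags set in the script above. As a rough sketch, the hypothetical helper below (`gin_repositories` is not part of datalad or of the template) lists them, assuming the `-confidential` and `-public` suffixes described earlier and the same 0/1 convention as CONFIDENTIAL_DATASET and PUBLIC_DATASET.

```shell
# Hypothetical helper: prints the GIN repositories to create by hand.
#   $1: dataset name
#   $2: 1 if there should be a confidential sibling (default 0)
#   $3: 1 if there should be a public sibling (default 0)
gin_repositories() {
    name=$1
    confidential=${2:-0}
    public=${3:-0}
    echo "$name"
    if [ "$confidential" = 1 ]; then echo "${name}-confidential"; fi
    if [ "$public" = 1 ]; then echo "${name}-public"; fi
}

# Usage:
#   gin_repositories dataset1 "$CONFIDENTIAL_DATASET" "$PUBLIC_DATASET"
```

Comparing this list with what exists in the GIN organization before running datalad create is one way to spot a missing repository early.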
The LAAC1 template creates a dataset with two siblings:
- One on GitHub
- One on a specified SSH location
export GITHUB_ORGANIZATION="LAAC-LSCP" # name of your GitHub organization
export DATASET_PATH="/location/of/your/datasets/" # remote location of the dataset in the server
export SSH_HOSTNAME="your.cluster.com" # hostname/alias of your ssh server
datalad create -c laac1 dataset-name
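Since the LAAC1 template reads its settings from the three environment variables above, one way to catch an unset variable before running datalad create is a small guard. This is a sketch only: `require_env` is a hypothetical helper, and it checks merely that each variable is non-empty, not that its value is valid.

```shell
# Hypothetical helper: reports on stderr every listed variable that is unset
# or empty, and returns a non-zero status if any is missing.
require_env() {
    status=0
    for var in "$@"; do
        # Indirect lookup of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing: $var" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   require_env GITHUB_ORGANIZATION DATASET_PATH SSH_HOSTNAME \
#       && datalad create -c laac1 dataset-name
```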
The LAAC2 template creates a dataset with three GIN siblings:
- One containing all the data
- One containing confidential data, but not the recordings
- One containing all non-confidential data
Since this relies on GIN, you need to make sure to set up your ssh key:
- go to https://gin.g-node.org/
- log in
- click on top right on your avatar, choose parameters
- click on SSH keys at the left, then click on add a key
- run cat ~/.ssh/id_rsa.pub; if you get an error, your computer does not yet have an ssh key, so follow these instructions to create one; otherwise, copy the output and paste it into the key area
Then you can create datasets as follows:
export GIN_ORGANIZATION="LAAC-LSCP" # name of your GIN organization
datalad create -c laac2 dataset-name
You should no longer use the EL1000 template, as the LAAC template can do the same (and more).
Since this template relies on GIN, you need to make sure to set up your ssh key:
- go to https://gin.g-node.org/
- log in
- click on top right on your avatar, choose parameters
- click on SSH keys at the left, then click on add a key
- run cat ~/.ssh/id_rsa.pub; if you get an error, your computer does not yet have an ssh key, so follow these instructions to create one; otherwise, copy the output and paste it into the key area
Then you can create datasets as follows:
- Using the browser capabilities on GIN, create two empty repositories in your GIN organization: <dataset-name> and <dataset-name>-confidential, e.g. dataset1 and dataset1-confidential. When creating the first one (i.e. the non-confidential one), notice that (a) you need to create the repo from the organization (and not your personal account) and (b) you need to uncheck the box at the bottom during actual creation.
- Run the following script (edit the environment variables to suit your configuration):
export GIN_ORGANIZATION='EL1000' # name of your GIN organization
export CONFIDENTIAL_DATASET=0 # set to 1 if there should be a confidential sibling
datalad create -c el1000 dataset-name
For instance, in the example above, we'd do the following, because this is a dataset that has some confidential content:
export GIN_ORGANIZATION='EL1000' # name of your GIN organization
export CONFIDENTIAL_DATASET=1 # set to 1 if there should be a confidential sibling
datalad create -c el1000 rague
And here is an example of a dataset that has no confidential content:
export GIN_ORGANIZATION='EL1000' # name of your GIN organization
export CONFIDENTIAL_DATASET=0 # set to 1 if there should be a confidential sibling
datalad create -c el1000 lyon
The output you'll see looks like this:
[INFO ] Creating a new annex repo at /Users/acristia/Documents/git-data/rague
[INFO ] Scanning for unlocked files (this may take some time)
[INFO ] Running procedure cfg_el1000
[INFO ] == Command start (output follows) =====
[INFO ] Could not enable annex remote origin. This is expected if origin is a pure Git remote, or happens if it is not accessible.
[WARNING] Could not detect whether origin carries an annex. If origin is a pure Git remote, this is expected.
.: origin(-) [git@gin.g-node.org:/EL1000/rague.git (git)]
.: origin(+) [git@gin.g-node.org:/EL1000/rague.git (git)]
[INFO ] Could not enable annex remote confidential. This is expected if confidential is a pure Git remote, or happens if it is not accessible.
[WARNING] Could not detect whether confidential carries an annex. If confidential is a pure Git remote, this is expected.
.: confidential(-) [git@gin.g-node.org:/EL1000/rague-confidential.git (git)]
.: confidential(+) [git@gin.g-node.org:/EL1000/rague-confidential.git (git)]
[INFO ] Configure additional publication dependency on "confidential"
.: origin(+) [git@gin.g-node.org:/EL1000/rague.git (git)]
[INFO ] == Command exit (modification check follows) =====
create(ok): /Users/acristia/Documents/git-data/rague (dataset)
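The final create(ok) entry is the part of this output that signals success. As an illustration, the hypothetical helper below (`created_dataset_path` is not a datalad command) extracts the dataset path from that entry; it assumes the "create(ok): <path> (dataset)" format shown above and fails when no such entry is present.

```shell
# Hypothetical helper: reads the output of `datalad create` on stdin and
# prints the path reported by the "create(ok): <path> (dataset)" entry.
# Returns a non-zero status if that entry does not appear.
created_dataset_path() {
    dataset_dir=$(sed -n 's/.*create(ok): \(.*\) (dataset).*/\1/p')
    [ -n "$dataset_dir" ] && printf '%s\n' "$dataset_dir"
}

# Usage (stderr is merged in so that log and result lines are both read):
#   datalad create -c el1000 rague 2>&1 | created_dataset_path
```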