We're hosting primap datasets using github for metadata storage and cloudflare R2 for data storage. The setup essentially has three components:
- The github storage holds a git repository with the usual branches containing un-annexed files like code files and other small text files and symlinks for data files. It also carries the git-annex branch with metadata (content hashes) for the data files, but crucially not the content of the actual data files. The github repo can be public or private, as configured on github.
- Cloudflare R2 contains all data files (using their hashed content as the filename) in object storage. For each repository there is one bucket in R2. To access a bucket (both for reading and writing), you need an access key.
- Cloudflare R2 also offers the option to publish a bucket so that it is readable without an access key (this is not the default).
Taken together, this enables us to use github to manage the metadata (e.g. using branches and pull requests for collaboration) while the data lives in relatively cheap and efficient object storage, all publicly available if desired.
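To illustrate: once a dataset is public, anyone can clone the metadata from github and fetch data from R2 without any credentials. A minimal sketch, using a hypothetical public dataset `primap-community/some-dataset` (the name and file path are illustrative):

```bash
# clone the metadata from github; the R2 special remote auto-enables on clone
datalad clone https://github.com/primap-community/some-dataset.git
cd some-dataset
# fetch a data file; the content is downloaded from the public R2 URL
datalad get path/to/data.csv
```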
Internal developers get the rights to create new datasets and read from and write to every public and private repository. For that, you'll need a cloudflare account for web access, as well as an API key to access repositories.
First, get an account on cloudflare. Speak to Mika or Jared.
Once you have an account on cloudflare, you need to generate the API keys:
- Go to https://dash.cloudflare.com/
- Log in.
- If it asks, choose the account "Climate Resource".
- In the left menu, choose "R2 Object Storage".
- On the right, select "Manage R2 API tokens".
- Press "Create API token".
- Use the name `{initials}-rw`, where `{initials}` are initials you identify with, like `mp` for Mika Pflüger.
- As Permissions, choose "Object Read & Write".
- For "Specify bucket(s)" select "Apply to all buckets in this account".
- As TTL specify "Forever" for ease of use, or specify "1 year" for higher security (but you will need to regenerate the token every year).
- Do not specify anything in "Client IP Filtering".
- Press "Create API Token".
Now, this will display (once only) a page with the API keys. In particular, you need the "Access Key ID" and the "Secret Access Key". Save them somewhere safe.
To work with the API keys easily without having to type them out all the time, you can export them using a function defined in your `.bashrc`. In your `~/.bashrc` file, add this to the end:

```bash
primap_datalad_creds () {
    export AWS_ACCESS_KEY_ID="access-key-id"
    export AWS_SECRET_ACCESS_KEY="secret-access-key"
}
```
After reopening your terminal, you can execute `primap_datalad_creds` once and will then be able to run any datalad commands which need access to R2 until you close the terminal. Alternatively, if you do not use any other AWS API keys and don't want to be bothered to execute a function each time you open a new shell before you can use datalad properly, you can also just add the two `export` lines to your `.bashrc` without the surrounding function definition, so that they get run every time and the environment variables are always available.
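A minimal sketch of that always-on variant (the values are placeholders for your own keys):

```bash
# at the end of ~/.bashrc, without the function wrapper:
export AWS_ACCESS_KEY_ID="access-key-id"
export AWS_SECRET_ACCESS_KEY="secret-access-key"
```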
External collaborators get read or read-and-write rights to specific repositories, but not the right to create new repositories. As such, they only need an API key, not a cloudflare account.
To set an API key up for an external collaborator:
- Go to https://dash.cloudflare.com/
- Log in.
- If it asks, choose the account "Climate Resource".
- In the left menu, choose "R2 Object Storage".
- On the right, select "Manage R2 API tokens".
- Press "Create API token".
- Use the name `external-{initials}-rw`, where `{initials}` are initials for the external collaborator.
- As Permissions, choose "Object Read & Write" if you want to give write access, otherwise choose "Object Read only" (only useful for private repos, otherwise they should just use public access).
- For "Specify bucket(s)" select "Apply to specific buckets only" and select buckets that the collaborator should have access to.
- As TTL specify "Forever" for ease of use, or specify "1 year" for higher security (but you will need to regenerate the token every year).
- Do not specify anything in "Client IP Filtering".
- Press "Create API Token".
Now, this will display (once only) a page with the API keys. In particular, you need the "Access Key ID" and the "Secret Access Key". Send them to the collaborator.
To work with these API keys, the collaborator will likely want to set up their `~/.bashrc` as explained above in the section for internal developers.
To create a new repository, first create the cloudflare R2 bucket and optionally enable public access:
- Go to https://dash.cloudflare.com/
- Log in.
- If it asks, choose the account "Climate Resource".
- In the left menu, choose "R2 Object Storage".
- Press the blue "Create bucket" button.
- Decide on a dataset name, then use `primap-datalad-{name}` as the bucket name, where you replace `{name}` by the dataset name like `unfccc-di`, so that the resulting bucket name is something like `primap-datalad-unfccc-di`.
- For "Location", choose "Automatic" and provide a "location hint" for "Western Europe (WEUR)".
- For the "Default storage class", choose "Standard" (should be the default).
- Hit "Create bucket" at the bottom.
- The next three steps are optional, and only necessary if you want public access to your repository.
- The bucket overview loads. Navigate to the "Settings" tab; under the "Public Access" heading, in the "R2.dev subdomain" section, press the "Allow access" button.
- Type "allow" to confirm that you want public access, then press "Allow" again.
- Copy the "Public R2.dev Bucket URL", you'll need it later.
Now, create your datalad repository. In the terminal, navigate to the location where you want your local clone of the datalad repository (without creating the new directory yet). Run:
```bash
datalad create -c text2git $name
cd $name
```

where you replace `$name` by the dataset name like `unfccc-di`.
Next, we'll add R2 as a sibling (i.e. publication target) to the new datalad repository. We have to use git-annex directly to do this.
If you add a bucket with public access, use:

```bash
primap_datalad_creds # skip if you set up your .bashrc to always inject the secrets
git annex initremote public-r2 type=S3 encryption=none signature=v4 region=auto protocol=https \
    autoenable=true \
    bucket=primap-datalad-$name \
    host=2aa5172b2bba093c516027d6fa13cdc8.r2.cloudflarestorage.com \
    publicurl=$publicurl
```
If you add a bucket with private access, use:

```bash
primap_datalad_creds # skip if you set up your .bashrc to always inject the secrets
git annex initremote r2 type=S3 encryption=none signature=v4 region=auto protocol=https \
    autoenable=true \
    bucket=primap-datalad-$name \
    host=2aa5172b2bba093c516027d6fa13cdc8.r2.cloudflarestorage.com
```
where you replace `$name` by the dataset name like `unfccc-di` and `$publicurl` by the public URL you copied when creating the cloudflare R2 bucket. If you didn't copy it, you can find it on the bucket's page in the settings under the heading "Public Access" in the section "R2.dev subdomain" at the entry "Public R2.dev Bucket URL". Copy it fully, it should look something like `https://pub-lotsofcharacters.r2.dev`.
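If you set the placeholders as shell variables first, you can run the commands above verbatim. A minimal sketch (the values are illustrative):

```bash
name=unfccc-di
publicurl=https://pub-lotsofcharacters.r2.dev  # only needed for public buckets
```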
Now, the output of `datalad siblings` should look like this:

```
.: here(+) [git]
.: public-r2(+) [git]
```
Now, we'll add github as a sibling to the repo. This can be done using datalad directly:

```bash
datalad create-sibling-github primap-community/$name \
    --publish-depends public-r2 \
    --access-protocol ssh
```

where you replace `$name` again by the dataset name like `unfccc-di`.
Now, the output of `datalad siblings` should look like this:

```
.: here(+) [git]
.: public-r2(+) [git]
.: github(-) [git@github.com:primap-community/unfccc-di.git (git)]
```
Finally, push the dataset to github, which will automatically push to R2 as well because we configured it as a publication dependency.
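A minimal sketch, assuming the github sibling got the default name `github`:

```bash
datalad push --to github
```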
This section refers to repositories that have already been initialised by datalad. If you are using a repository that was previously only managed via git, take a look at Move an existing git repository to R2.
To move an existing datalad repository, first create the cloudflare R2 bucket and optionally enable public access:
- Go to https://dash.cloudflare.com/
- Log in.
- If it asks, choose the account "Climate Resource".
- In the left menu, choose "R2 Object Storage".
- Press the blue "Create bucket" button.
- Use `primap-datalad-{name}` as the bucket name, where you replace `{name}` by the dataset name like `unfccc-di`, so that the resulting bucket name is something like `primap-datalad-unfccc-di`.
- For "Location", choose "Automatic" and provide a "location hint" for "Western Europe (WEUR)".
- For the "Default storage class", choose "Standard" (should be the default).
- Hit "Create bucket" at the bottom.
- The next three steps are optional, and only necessary if you want to maintain public access to your repository.
- The bucket overview loads. Navigate to the "Settings" tab; under the "Public Access" heading, in the "R2.dev subdomain" section, press the "Allow access" button.
- Type "allow" to confirm that you want public access, then press "Allow" again.
- Copy the "Public R2.dev Bucket URL", you'll need it later.
Now, we'll add R2 as a sibling (i.e. publication target) to the existing datalad repository. We have to use git-annex directly to do this.
If you add a bucket with public access, use:

```bash
primap_datalad_creds # skip if you set up your .bashrc to always inject the secrets
git annex initremote public-r2 type=S3 encryption=none signature=v4 region=auto protocol=https \
    autoenable=true \
    bucket=primap-datalad-$name \
    host=2aa5172b2bba093c516027d6fa13cdc8.r2.cloudflarestorage.com \
    publicurl=$publicurl
```
If you add a bucket with private access, use:

```bash
primap_datalad_creds # skip if you set up your .bashrc to always inject the secrets
git annex initremote r2 type=S3 encryption=none signature=v4 region=auto protocol=https \
    autoenable=true \
    bucket=primap-datalad-$name \
    host=2aa5172b2bba093c516027d6fa13cdc8.r2.cloudflarestorage.com
```
where you replace `$name` by the dataset name like `unfccc-di` and `$publicurl` by the public URL you copied when creating the cloudflare R2 bucket. If you didn't copy it, you can find it on the bucket's page in the settings under the heading "Public Access" in the section "R2.dev subdomain" at the entry "Public R2.dev Bucket URL". Copy it fully, it should look something like `https://pub-lotsofcharacters.r2.dev`.
Now, the output of `datalad siblings` should look like this:

```
.: here(+) [git]
.: public-r2(+) [git]
.: origin(-) [https://github.com/mikapfl/unfccc_di_data.git (git)]
.: datalad-archives(+) [datalad-archives]
.: ginhemio-storage(+) [https://gin.hemio.de/CR/unfcc_di_data (git)]
.: ginhemio(+) [https://gin.hemio.de/CR/unfcc_di_data (git)]
```
Your r2 remote will be called "public-r2" or "r2". Note that your github sibling might not be named "origin", you might not have the "datalad-archives" sibling at all, and you might only have one of "ginhemio" and "ginhemio-storage". This all depends on the prior hosting history of the dataset and might therefore differ between datasets. The only thing that matters at this stage is that you still have at least one ginhemio sibling and the github sibling.
Now, we'll add a publication dependency on public-r2 to the github remote and remove the publication dependency on ginhemio:
```bash
datalad siblings configure -s $github_sibling_name --publish-depends $r2_sibling_name
```

Replace `$github_sibling_name` with the name of your github sibling (usually `github` or `origin`). Replace `$r2_sibling_name` with the name of the R2 sibling (`public-r2` or `r2`).
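For example, if the github sibling is called `origin` and the R2 sibling is called `public-r2`:

```bash
datalad siblings configure -s origin --publish-depends public-r2
```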
Option A: If we are confident that all the files are stored on disc, we can push the dataset to github, which will automatically push to R2 as well, because we configured it as a publication dependency. This might take a while because it transfers all data:

```bash
datalad push --to $github_sibling_name
```
Option B: If some files are only on a remote (they are broken symlinks on our local machine), we could either download and then upload again (`datalad get .` and `datalad push --to $github_sibling_name`), or we can use git-annex directly to copy the files (faster for large data sets):

```bash
git annex copy --to $r2_sibling_name --from-anywhere --all
```
Finally, remove the obsolete ginhemio siblings:

```bash
datalad siblings remove -s ginhemio
datalad siblings remove -s ginhemio-storage
```
If you only have one ginhemio sibling, only remove that one. Also, we have to tell git-annex to never auto-enable the storage sibling and not to try to fetch data from ginhemio any more:

```bash
git annex configremote ginhemio-storage autoenable=false
git annex dead ginhemio-storage
```
and push the results again:

```bash
datalad push --to $github_sibling_name
```
This section refers to repositories that have not been initialised by datalad. The process is very similar to the one for moving an existing datalad repository, but we have to initialise the dataset as well.
To move an existing git repository, first create the cloudflare R2 bucket and optionally enable public access:
- Go to https://dash.cloudflare.com/
- Log in.
- If it asks, choose the account "Climate Resource".
- In the left menu, choose "R2 Object Storage".
- Press the blue "Create bucket" button.
- Use `primap-datalad-{name}` as the bucket name, where you replace `{name}` by the dataset name like `unfccc-di`, so that the resulting bucket name is something like `primap-datalad-unfccc-di`.
- For "Location", choose "Automatic" and provide a "location hint" for "Western Europe (WEUR)".
- For the "Default storage class", choose "Standard" (should be the default).
- Hit "Create bucket" at the bottom.
- The next three steps are optional, and only necessary if you want public access to your repository.
- The bucket overview loads. Navigate to the "Settings" tab; under the "Public Access" heading, in the "R2.dev subdomain" section, press the "Allow access" button.
- Type "allow" to confirm that you want public access, then press "Allow" again.
- Copy the "Public R2.dev Bucket URL", you'll need it later.
First, we need to initialise our dataset with:

```bash
datalad create -c text2git $name
```
If we expect to have large CSV files, we need to ensure that the CSV files are stored in the git-annex by adding this line to the `.gitattributes` file, which should be in the root directory of the repository:

```
*.csv annex.largefiles=anything
```
If the files we want to push to the R2 bucket are excluded by the `.gitignore` file, we need to remove the lines that exclude them.
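To check whether a specific file is currently excluded, a quick sketch (the path is illustrative):

```bash
# prints the matching .gitignore rule, or nothing if the file is not ignored
git check-ignore -v data/output.csv
```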
Now, we'll add R2 as a sibling (i.e. publication target) to the existing datalad repository. We have to use git-annex directly to do this.
If you add a bucket with public access, use:

```bash
primap_datalad_creds # skip if you set up your .bashrc to always inject the secrets
git annex initremote public-r2 type=S3 encryption=none signature=v4 region=auto protocol=https \
    autoenable=true \
    bucket=primap-datalad-$name \
    host=2aa5172b2bba093c516027d6fa13cdc8.r2.cloudflarestorage.com \
    publicurl=$publicurl
```
If you add a bucket with private access, use:

```bash
primap_datalad_creds # skip if you set up your .bashrc to always inject the secrets
git annex initremote r2 type=S3 encryption=none signature=v4 region=auto protocol=https \
    autoenable=true \
    bucket=primap-datalad-$name \
    host=2aa5172b2bba093c516027d6fa13cdc8.r2.cloudflarestorage.com
```
where you replace `$name` by the dataset name like `unfccc-di` and `$publicurl` by the public URL you copied when creating the cloudflare R2 bucket. If you didn't copy it, you can find it on the bucket's page in the settings under the heading "Public Access" in the section "R2.dev subdomain" at the entry "Public R2.dev Bucket URL". Copy it fully, it should look something like `https://pub-lotsofcharacters.r2.dev`.
Now, the output of `datalad siblings` should look like this:

```
.: here(+) [git]
.: public-r2(+) [git]
.: origin(-) [https://github.com/mikapfl/unfccc_di_data.git (git)]
```
Note that your github sibling might not be named "origin" and your r2 remote will be called "public-r2" or "r2".
Now, we'll add a publication dependency on r2 to the github remote:
```bash
datalad siblings configure -s $github_sibling_name --publish-depends $r2_sibling_name
```

Replace `$github_sibling_name` with the name of your github sibling (usually `github` or `origin`). Replace `$r2_sibling_name` with the name of the R2 sibling (`public-r2` or `r2`).
We can push the dataset to github, which will automatically push to R2 as well, because we configured it as a publication dependency. This might take a while because it transfers all data:

```bash
datalad push --to $github_sibling_name
```
If someone already set up the R2 remote, we still need to enable the sibling in our local repository. Our local repository should already be aware of the sibling, but we might not be able to see it yet.
First we check if the R2 sibling has already been added with:

```bash
git annex whereis path/to/some/file
```

Choose a file that is stored in the datalad remote (e.g. `.nc` or `.csv` files). We should now see a list of remotes where the file is stored. One of them will have the name `public-r2` and show the URL of the R2 bucket. Note that it could also have another name, if the remote wasn't set up as described above.
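Illustrative output, assuming two copies exist (UUIDs and descriptions will differ):

```
whereis path/to/some/file (2 copies)
  	00000000-0000-0000-0000-000000000000 -- [public-r2]
  	11111111-1111-1111-1111-111111111111 -- mika@laptop:~/unfccc-di [here]
ok
```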
If the command does nothing, we can try the same for all files and interrupt the process after a few files:

```bash
git annex whereis
```
If `public-r2` is in the list of available remotes, we can activate the sibling with:

```bash
datalad siblings enable -s public-r2
```
We can now list all the activated siblings with:

```bash
datalad siblings
```

The result should show the `public-r2` sibling.
If we want to see the name of the bucket, we can use:

```bash
git show git-annex:remote.log
```
If the `datalad push` command doesn't work as expected, we can push with additional info on the individual steps:

```bash
datalad -l debug push --to r2
```

If you see something like `err: 'fatal: 'your-remote' does not appear to be a git repository`, you can ignore that.
To force pushing all annexed data regardless of preferred-content settings, and to inspect what content the r2 remote wants:

```bash
datalad push --to r2 --data anything
git-annex wanted r2
```
To check whether your file is in the git-annex branch:

```bash
git-annex whereis path/to/file
```

If nothing happens, your file is not in the annex branch. If you expect it to be in the git-annex, check `.gitattributes` and `.gitignore` to see which files are excluded / included.
You can check the integrity of the files in the annex with:

```bash
git-annex fsck --from $r2_sibling_name --fast --all --quiet
```

- `--fast` performs a quick check without downloading the files
- `--quiet` shows only the errors (no errors, no output)
- `--all` runs the check on all files in the repository
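For example, to verify a single file against the `public-r2` sibling (the path is illustrative):

```bash
git-annex fsck --from public-r2 --fast path/to/file
```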