Scripts used by the Lattice data coordination team for single cell data wrangling
-
Create a virtual environment. This example uses Anaconda; other options, such as venv or pyenv, would also work.
conda create --name lattice python=3.11
You will need to be in this environment for the following instructions.
conda activate lattice
-
Install the following packages
conda install -c conda-forge pint jsonschema boto3 jupyter bs4 squidpy scanpy python-magic
pip install cellxgene-schema requests openpyxl Pillow gspread gspread_formatting oauth2client crcmod lxml pyometiff
-
Define variables in your environment for each server you might submit to, using an alias for each server (ALIAS_KEY, ALIAS_SECRET, ALIAS_SERVER). For example, when submitting to the production instance of Lattice, you might call this alias prod, so you'd define the following three variables.
$ conda env config vars set PROD_KEY=<key>
$ conda env config vars set PROD_SECRET=<secret>
$ conda env config vars set PROD_SERVER=https://www.lattice-data.org/
Your demo key and secret stay the same across demos, but the demo server changes with each new demo, so only the key and secret are defined here.
$ conda env config vars set DEMO_KEY=<key>
$ conda env config vars set DEMO_SECRET=<secret>
-
After defining those, you'll need to reactivate your environment
conda activate lattice
You can then confirm that they are defined
conda env config vars list
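As a rough illustration of how the ALIAS_KEY, ALIAS_SECRET, ALIAS_SERVER naming convention can be consumed from Python, the sketch below looks up the variables for a given alias. The function name load_credentials and the returned dictionary layout are hypothetical and are not taken from the scripts in this repo. It assumes the key and secret are required, and treats the server as optional because the demo server is not set as a variable above.

```python
import os


def load_credentials(alias, environ=None):
    """Return the key, secret, and server defined for an alias such as 'prod'.

    Hypothetical helper: reads <ALIAS>_KEY, <ALIAS>_SECRET, and <ALIAS>_SERVER.
    """
    # Allow a plain dict to be passed in so the lookup can be tried without
    # touching the real environment.
    env = os.environ if environ is None else environ
    prefix = alias.upper()

    # Assumption: key and secret must always be present.
    required = {"key": f"{prefix}_KEY", "secret": f"{prefix}_SECRET"}
    missing = [name for name in required.values() if not env.get(name)]
    if missing:
        raise KeyError("Undefined environment variables: " + ", ".join(missing))

    creds = {field: env[name] for field, name in required.items()}
    # Assumption: the server may be unset (for example, for a changing demo).
    creds["server"] = env.get(f"{prefix}_SERVER")
    return creds
```

Calling load_credentials("prod") after reactivating the environment would read PROD_KEY, PROD_SECRET, and PROD_SERVER; a missing key or secret raises a KeyError naming the undefined variables.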
cellxgene_resources/
for curating towards CZ CELLxGENE Discover
-
curation_qa.ipynb Quality assurance checks on an AnnData object
-
curation_sample_code.ipynb Various samples of how to manipulate an AnnData object during curation
-
HCA_data_table.ipynb Compiles studies from CELLxGENE, HCA Data Portal, HCA Publications, and Bionetwork atlas lists
-
upload_local.ipynb Submitting local files to CELLxGENE
Please note:
This script uses the single-cell-curation repo, which should be cloned to the directory ~/GitClones/CZI/
CXG API keys should be stored in ~/Documents/keys/cxg-api-key.txt
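A minimal sketch of reading an API key from the location noted above, in case it helps when checking the setup before running the notebook. The helper name read_api_key is hypothetical, and this is not necessarily how upload_local.ipynb loads the key; it only assumes the file holds the key as plain text.

```python
from pathlib import Path


def read_api_key(path="~/Documents/keys/cxg-api-key.txt"):
    """Read an API key from a text file, stripping surrounding whitespace.

    Hypothetical helper; the default path is the one named in this README.
    """
    key_file = Path(path).expanduser()  # resolve the leading ~
    if not key_file.is_file():
        raise FileNotFoundError(f"No API key file found at {key_file}")
    key = key_file.read_text().strip()
    if not key:
        raise ValueError(f"API key file {key_file} is empty")
    return key
```

Stripping whitespace guards against a trailing newline in the key file, which is easy to introduce when saving the key from an editor.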
scripts/
for curating towards or out of Lattice DB
-
checkfiles.py Gathers data file content information and compares it with submitted metadata (run instructions).
If running locally, you may need to install Homebrew and run
brew install md5sha1sum
so that md5sum can be run from checkfiles.
-
DCP_mapper.py Transforms a Lattice Dataset into the HCA DCP-approved schema and stages it at the DCP for submission to the HCA Portal (run instructions).
Requires additional steps:
pip install google-api-python-client google-cloud-storage
$ conda env config vars set GOOGLE_APPLICATION_CREDENTIALS=<creds.json>
-
DCP_project_ready.ipynb Validates a project staged for submission to the HCA Data Portal.
-
flattener.py Transforms a contributor matrix, raw count data, and Lattice metadata into a cellxgene-approved matrix file (run instructions).
-
geo_metadata.py Transforms a Lattice Dataset into GEO submission format
-
make_template.py Produces a tabular representation of Lattice schema submittable properties, for ease of wrangling
Requires additional steps:
Follow instructions here to enable API & generate credentials
$ conda env config vars set CLIENT_SECRET_FILE=<creds.json>
-
qcmetrics_reader.py Transforms quality metrics and other processing information from various files of a standard CellRanger outs/ directory into the Lattice schema
-
query_by_dataset_lab.ipynb Returns Donor, Sample, or Suspension objects from the Lattice DB for a given Dataset or Lab
-
s3_recent_uploads.ipynb Returns files recently uploaded to the submitter S3 buckets
-
submit_metadata.py Transforms tabulated metadata into JSON objects and posts/patches them to the Lattice DB (use instructions).
-
validate_demo.ipynb Compares various aspects of the production DB and a specified demo DB to identify potential bugs.
-
validate_checksums.py Identifies any duplicated files in the Lattice DB. To be executed after each checkfiles run.
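To illustrate the general idea behind checksum-based duplicate detection, the sketch below groups local files by their MD5 digest and reports any digest shared by more than one file. This is only an assumption-laden illustration: validate_checksums.py works against the Lattice DB rather than local files, and the function names md5_of_file and find_duplicates are hypothetical.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Map each MD5 digest shared by two or more files to those file paths."""
    by_checksum = defaultdict(list)
    for path in paths:
        by_checksum[md5_of_file(path)].append(str(path))
    # Keep only the checksums that occur more than once.
    return {md5: files for md5, files in by_checksum.items() if len(files) > 1}
```

An empty result from find_duplicates means no two of the given files share a checksum.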