Internal tooling for the Nextstrain team to ingest and curate SARS-CoV-2 genome sequences. This repository is open source, but we do not intend to support its use by outside groups.
If you're using Pipenv (see below), then run commands from `./bin/…` inside a `pipenv shell` or wrapped with `pipenv run ./bin/…`.
- Run `./bin/fetch-from-gisaid > data/gisaid.ndjson`
- Run `./bin/transform-gisaid data/gisaid.ndjson`
- Look at `data/gisaid/sequences.fasta` and `data/gisaid/metadata.tsv`
The ingest pipelines are triggered from the GitHub workflows `.github/workflows/ingest-master-*.yml` and `…/ingest-branch-*.yml`, but run on AWS Batch via the `nextstrain build --aws-batch` infrastructure. They're run on pushes to `master` that modify `source-data/*-annotations.tsv` and on pushes to other branches. Pushes to branches other than `master` upload files to branch-specific paths in the S3 bucket, don't send notifications, and don't trigger Nextstrain rebuilds, so that they don't interfere with the production data.
AWS credentials are stored in this repository's secrets and are associated with the `nextstrain-ncov-ingest-uploader` IAM user in the Bedford Lab AWS account, which is locked down to reading and publishing only the `gisaid.ndjson`, `metadata.tsv`, and `sequences.fasta` files and their zipped equivalents in the `nextstrain-ncov-private` S3 bucket.
A full run is now done in three steps via manual triggers:

1. Fetch new sequences and ingest them by running `./bin/trigger fetch-and-ingest --user <your-github-username>`.
2. Add manual annotations, update the location hierarchy as needed, and run ingest without fetching new sequences.
   - Pushes of `source-data/*-annotations.tsv` to the `master` branch will automatically trigger a run of ingest.
   - You can also run ingest manually by running `./bin/trigger ingest --user <your-github-username>`.
3. Once all manual fixes are complete, trigger a rebuild of nextstrain/ncov by running `./bin/trigger rebuild --user <your-github-username>`.
See the output of `./bin/trigger fetch-and-ingest --user <your-github-username>`, `./bin/trigger ingest`, or `./bin/trigger rebuild` for more information about authentication with GitHub.

Note: running `./bin/trigger` posts a GitHub `repository_dispatch`. Regardless of which branch you are on, it will trigger the specified action on the master branch.
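For illustration only, the sketch below shows roughly what posting a `repository_dispatch` looks like against the GitHub REST API. The exact payload and authentication that `./bin/trigger` uses are assumptions here; consult the script itself for the real details.

```shell
# Illustrative sketch: a repository_dispatch is a POST to the GitHub API
# with an event_type naming the action to run. The payload shape below
# follows the public GitHub REST API, not necessarily ./bin/trigger.
EVENT_TYPE="fetch-and-ingest"              # or "ingest", "rebuild"
PAYLOAD="{\"event_type\": \"$EVENT_TYPE\"}"
echo "$PAYLOAD"

# The actual POST requires a GitHub token with repo scope:
#   curl -X POST \
#     -H "Authorization: Bearer $GITHUB_TOKEN" \
#     -H "Accept: application/vnd.github+json" \
#     -d "$PAYLOAD" \
#     https://api.github.com/repos/nextstrain/ncov-ingest/dispatches
```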
Manual annotations should be added to `source-data/gisaid_annotations.tsv`. A common pattern is expected to be:

1. Run https://github.com/nextstrain/ncov.
2. Discover metadata that needs fixing.
3. Update `source-data/gisaid_annotations.tsv`.
4. Push changes to `master` and re-download `gisaid/metadata.tsv`.
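The exact column layout of `gisaid_annotations.tsv` is not documented here; the hypothetical row below only illustrates the general idea of a tab-separated annotation that keys on a sequence identifier and supplies a field name and corrected value. Mirror the existing rows in the real file rather than this sketch.

```
# Hypothetical layout (tab-separated) -- check existing rows for the real format
EPI_ISL_XXXXXX	Example/strain/2020	division	Example Division
```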
New location hierarchies should be manually added to `source-data/location_hierarchy.tsv`. A common pattern is expected to be:

1. Run the ingest.
2. Discover new location hierarchies via Slack that need review.
3. Update `source-data/location_hierarchy.tsv`.
4. Push changes to `master` so the next ingest will have an updated "source of truth" to draw from.
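As a rough illustration, assuming `location_hierarchy.tsv` is a tab-separated table running from region down to location (an assumption; confirm against the file's actual header before editing), a new entry might look like:

```
# Hypothetical row (tab-separated): region, country, division, location
North America	USA	Louisiana	New Orleans
```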
Some Nextstrain team members may be interested in receiving alerts when new GISAID strains are added from specific locations, e.g. Portugal or Louisiana. To add a custom alert configuration, create a new entry in `new-sequence-alerts-config.json`. Each resolution (region, division, country, location) accepts a list of strings of areas of interest. Note that these strings must match the area name exactly.

To set up custom alerts, you'll need to retrieve your Slack member ID. Note that the `user` field in each alert configuration is for human use only; it need not match your Slack display name or username. To view your Slack member ID, open the Slack menu by clicking your name at the top, and click on 'View profile'. Then, click on 'More'. You can then copy your Slack member ID from the menu that appears. Enter this into the `slack_member_id` field of your alert configuration.
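A hypothetical entry in `new-sequence-alerts-config.json` might look like the following, based on the fields described above (`user`, `slack_member_id`, and one list per resolution). The exact key names and nesting are assumptions; mirror an existing entry in the file.

```json
{
  "user": "Jane Doe",
  "slack_member_id": "U01ABCDEFGH",
  "region": [],
  "country": ["Portugal"],
  "division": ["Louisiana"],
  "location": []
}
```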
Clades assigned with Nextclade are currently cached in `nextclade.tsv` in the S3 bucket, and only incremental updates for new sequences are performed during the daily ingests. This clade cache goes stale over time, so it is necessary to periodically perform a full update of the `nextclade.tsv` file, recomputing clades for all of the GISAID sequences from scratch, to account for changes in the data. The same goes for updating Nextclade versions, as they may lead to changes in clade assignment logic. A massive amount of compute is required, and it is not currently feasible to do this computation on the existing infrastructure, so it should be done elsewhere. As of November 2020, for 200k sequences, it takes approximately 2-3 hours on an on-prem Xeon machine with 16 cores/32 threads.
Use `./bin/gisaid-get-all-clades` to perform this update.
Python >= 3.6, Node.js >= 12 (14 recommended), and yarn v1 are required.

```
git clone https://github.com/nextstrain/ncov-ingest
cd ncov-ingest
yarn install
pipenv sync
BATCH_SIZE=1000 pipenv run ./bin/gisaid-get-all-clades
```
The resulting `data/gisaid/nextclade.tsv` should be placed on the S3 bucket, replacing the one produced by the last daily ingest:

```
./bin/upload-to-s3 data/gisaid/nextclade.tsv s3://nextstrain-ncov-private/nextclade.tsv.gz
```

It will be picked up by the next ingest. The best time for the update is between daily builds. There is usually no rush: even if the globally recomputed `nextclade.tsv` lags behind by a day or two, it will be incrementally updated by the next daily ingest.
Run `pipenv sync` to set up an isolated Python 3.7 environment using the pinned dependencies. If you don't have Pipenv, install it first with `brew install pipenv` or `python3 -m pip install pipenv`.

Node.js >= 12 and yarn v1 are required for the Nextclade part. Make sure you run `yarn install` to install Nextclade. A global installation should also work, but a specific version is required (see `package.json`). Check the Nextclade CLI readme for more details.
Required environment variables:

- `GISAID_API_ENDPOINT`
- `GISAID_USERNAME_AND_PASSWORD`
- `AWS_DEFAULT_REGION`
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `SLACK_TOKEN`
- `SLACK_CHANNELS`
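A minimal sketch for sanity-checking that the variables above are set before running the ingest scripts locally. The `check_env` helper is illustrative and not part of this repo.

```shell
# Illustrative helper: report which of the required variables are unset.
check_env() {
    for var in GISAID_API_ENDPOINT GISAID_USERNAME_AND_PASSWORD \
               AWS_DEFAULT_REGION AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
               SLACK_TOKEN SLACK_CHANNELS; do
        if [ -z "$(printenv "$var")" ]; then
            echo "Missing: $var"
        fi
    done
}

check_env
```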