This is the ingest pipeline for Zika virus sequences.
Follow the standard installation instructions for Nextstrain's suite of software tools.
NOTE: All command examples assume you are within the ingest directory. If running commands from the outer zika directory, please replace the . with ingest
Fetch sequences with
nextstrain build . data/sequences.ndjson
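To take a quick look at the fetched records, a minimal Python sketch along these lines may help. It assumes only that data/sequences.ndjson follows the NDJSON convention of one JSON object per line. The helper name iter_ndjson is hypothetical and not part of the pipeline, and the record field names are printed rather than assumed.

```python
import json
from pathlib import Path


def iter_ndjson(path):
    """Yield one parsed JSON value per non-empty line of an NDJSON file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Path relative to the ingest directory, as in the command above.
    ndjson_path = Path("data/sequences.ndjson")
    if ndjson_path.exists():
        records = list(iter_ndjson(ndjson_path))
        print(f"{len(records)} records")
        if records:
            # Field names depend on the pipeline; list them instead of guessing.
            print(sorted(records[0].keys()))
```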
Run the complete ingest pipeline with
nextstrain build .
This will produce two files (within the ingest directory):
results/metadata.tsv
results/sequences.fasta
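As a rough sanity check on the two outputs, the sketch below counts metadata rows and FASTA records so they can be compared. It assumes that results/metadata.tsv has a single header line and that each sequence has one metadata row, neither of which is stated above. The function names are hypothetical.

```python
from pathlib import Path


def count_tsv_rows(path):
    """Count non-empty data rows in a TSV file, excluding the first (header) line."""
    with Path(path).open(encoding="utf-8") as handle:
        next(handle, None)  # skip the assumed header line
        return sum(1 for line in handle if line.strip())


def count_fasta_records(path):
    """Count FASTA records by counting lines that start with '>'."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    # Paths relative to the ingest directory.
    metadata = Path("results/metadata.tsv")
    sequences = Path("results/sequences.fasta")
    if metadata.exists() and sequences.exists():
        print(f"metadata rows: {count_tsv_rows(metadata)}")
        print(f"FASTA records: {count_fasta_records(sequences)}")
```

If the two counts differ, that is a hint worth investigating and not necessarily an error, since the one-row-per-sequence relationship is an assumption of this sketch.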
Run the complete ingest pipeline and upload results to AWS S3 with
nextstrain build \
--env AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY \
. \
upload_all \
--configfile build-configs/nextstrain-automation/config.yaml
Do the following to include sequences from static FASTA files.
- Convert the FASTA files to NDJSON files with:
./ingest/bin/fasta-to-ndjson \
--fasta {path-to-fasta-file} \
--fields {fasta-header-field-names} \
--separator {field-separator-in-header} \
--exclude {fields-to-exclude-in-output} \
> ingest/data/{file-name}.ndjson
- Add the following to the .gitignore to allow the file to be included in the repo:
!ingest/data/{file-name}.ndjson
- Add the file-name (without the .ndjson extension) as a source to ingest/defaults/config.yaml. This will tell the ingest pipeline to concatenate the records to the GenBank sequences and run them through the same transform pipeline.
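To make the conversion step more concrete, here is a minimal Python sketch of the general idea: split each FASTA header on a separator, map the pieces to field names, drop any excluded fields, and emit one JSON object per record. This is not the repository's fasta-to-ndjson script, whose behavior may differ. The function names, the "sequence" key, and the example header are all assumptions made for illustration.

```python
import json
import sys


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_records(lines, fields, separator="|", exclude=()):
    """Turn FASTA records into dicts keyed by the given header field names."""
    for header, sequence in parse_fasta(lines):
        values = header.split(separator)
        if len(values) != len(fields):
            raise ValueError(
                f"expected {len(fields)} header fields, got {len(values)}: {header!r}"
            )
        record = dict(zip(fields, values))
        record["sequence"] = sequence  # assumed key name, not confirmed by the source
        for name in exclude:
            record.pop(name, None)
        yield record


if __name__ == "__main__":
    # Made-up example input; real headers and field names will differ.
    example = [">example_strain|example_country|2016-01-01", "ACGT", "TTGA"]
    fields = ["strain", "country", "date"]
    for record in fasta_to_records(example, fields, exclude=["country"]):
        json.dump(record, sys.stdout)
        sys.stdout.write("\n")
```

Raising an error on a field-count mismatch is a design choice of this sketch: it surfaces malformed headers early instead of silently misaligning metadata.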
Configuration takes place in defaults/config.yaml by default.
Optional configs for uploading files and Slack notifications are in defaults/optional.yaml.
The complete ingest pipeline with AWS S3 uploads and Slack notifications uses the following environment variables:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
SLACK_TOKEN
SLACK_CHANNELS
These are optional environment variables used in our automated pipeline for providing detailed Slack notifications.
GITHUB_RUN_ID - provided via github.run_id in a GitHub Action workflow
AWS_BATCH_JOB_ID - provided via AWS Batch Job environment variables
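Before starting a run with uploads and notifications, it can be useful to confirm that the variables listed above are set. The following is a minimal sketch of such a check. The helper name missing_vars is hypothetical, and the pipeline itself may validate its environment differently.

```python
import os

# Variable names taken from the lists above.
REQUIRED = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "SLACK_TOKEN",
    "SLACK_CHANNELS",
)
OPTIONAL = ("GITHUB_RUN_ID", "AWS_BATCH_JOB_ID")


def missing_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    for name in missing_vars(REQUIRED):
        print(f"Required variable not set: {name}")
    for name in missing_vars(OPTIONAL):
        print(f"Optional variable not set: {name}")
```

The check only reports names and never prints values, so secrets such as AWS_SECRET_ACCESS_KEY do not end up in logs.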
GenBank sequences and metadata are fetched via NCBI datasets.
This repository uses git subrepo to manage copies of ingest scripts in ingest/vendored, from nextstrain/ingest.
See vendored/README.md for instructions on how to update the vendored scripts.