nextstrain.org/zika/ingest

This is the ingest pipeline for Zika virus sequences.

Software requirements

Follow the standard installation instructions for Nextstrain's suite of software tools.

Usage

NOTE: All command examples assume you are within the ingest directory. If running commands from the outer zika directory, replace the . with ingest.

Fetch sequences with

nextstrain build . data/sequences.ndjson

Run the complete ingest pipeline with

nextstrain build .

This will produce two files (within the ingest directory):

results/metadata.tsv
results/sequences.fasta
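
After a run, a quick sanity check is to compare record counts between the two outputs. The helper below is a minimal sketch, not part of the pipeline: the function name `check_ingest_outputs` is hypothetical, and it assumes the metadata file has one header line and one row per sequence, which this README does not state.

```shell
# Hypothetical helper (not part of the pipeline): compare record counts
# in the two ingest outputs.
check_ingest_outputs() {
    metadata="${1:-results/metadata.tsv}"
    sequences="${2:-results/sequences.fasta}"

    # Metadata rows, skipping the header line (assumed to be line 1).
    n_metadata=$(awk 'NR > 1 { n++ } END { print n + 0 }' "$metadata")
    # FASTA records, counted by header lines starting with ">".
    n_sequences=$(awk '/^>/ { n++ } END { print n + 0 }' "$sequences")

    echo "metadata rows: $n_metadata"
    echo "sequence records: $n_sequences"

    # Succeed only when the two counts agree.
    [ "$n_metadata" -eq "$n_sequences" ]
}
```

Called with no arguments from within the ingest directory, it reads the two default paths listed above and returns a non-zero status when the counts differ.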

Run the complete ingest pipeline and upload results to AWS S3 with

nextstrain build \
    --env AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY \
    . \
        upload_all \
        --configfile build-configs/nextstrain-automation/config.yaml

Adding new sequences not from GenBank

Static Files

Do the following to include sequences from static FASTA files.

Convert the FASTA files to NDJSON files with:

./ingest/bin/fasta-to-ndjson \
    --fasta {path-to-fasta-file} \
    --fields {fasta-header-field-names} \
    --separator {field-separator-in-header} \
    --exclude {fields-to-exclude-in-output} \
    > ingest/data/{file-name}.ndjson
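
Before choosing values for `--fields` and `--separator`, it can help to see how a header in your FASTA file actually splits. The helper below is a sketch for inspection only: `preview_header_fields` is a hypothetical name, and it does not reproduce how `fasta-to-ndjson` parses headers.

```shell
# Hypothetical helper: print the fields of the first FASTA header,
# one per line, numbered, after splitting on the given separator.
preview_header_fields() {
    fasta="$1"
    separator="$2"
    awk -v sep="$separator" '
        /^>/ {
            header = substr($0, 2)         # drop the leading ">"
            # A single-character separator is matched literally;
            # awk treats longer separators as regular expressions.
            n = split(header, parts, sep)
            for (i = 1; i <= n; i++) printf "%d\t%s\n", i, parts[i]
            exit                           # only inspect the first record
        }
    ' "$fasta"
}
```

For example, `preview_header_fields my-sequences.fasta '|'` (file name hypothetical) lists each header field with its position, which indicates how many names to supply to `--fields` and in what order.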

Add the following to the .gitignore to allow the file to be included in the repo:
```
!ingest/data/{file-name}.ndjson
```
Add the file-name (without the .ndjson extension) as a source to ingest/defaults/config.yaml. This will tell the ingest pipeline to concatenate the records to the GenBank sequences and run them through the same transform pipeline.

Configuration

Configuration takes place in defaults/config.yaml by default. Optional configs for uploading files and Slack notifications are in defaults/optional.yaml.

Environment Variables

The complete ingest pipeline with AWS S3 uploads and Slack notifications uses the following environment variables:

Required

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
SLACK_TOKEN
SLACK_CHANNELS
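
A missing variable is easier to catch before the pipeline starts than partway through an upload. The following is a minimal sketch of such a pre-flight check: the function name `check_required_env` is hypothetical and the pipeline itself does not provide it.

```shell
# Hypothetical pre-flight check: report any required variable that is
# unset or empty in the environment.
check_required_env() {
    missing=0
    for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY SLACK_TOKEN SLACK_CHANNELS; do
        # printenv only sees exported variables, which are the ones
        # available to child processes such as the build.
        if [ -z "$(printenv "$name")" ]; then
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_required_env && nextstrain build ...` would then skip the build when any of the four variables is absent.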

Optional

These are optional environment variables used in our automated pipeline for providing detailed Slack notifications.

GITHUB_RUN_ID - provided via github.run_id in a GitHub Action workflow
AWS_BATCH_JOB_ID - provided via AWS Batch Job environment variables

Input data

GenBank data

GenBank sequences and metadata are fetched via NCBI datasets.

`ingest/vendored`

This repository uses git subrepo to manage copies of ingest scripts in ingest/vendored, from nextstrain/ingest.

See vendored/README.md for instructions on how to update the vendored scripts.
