Merge pull request #1075 from nextstrain/100k-open
Add 100k open samples
Showing 5 changed files with 96 additions and 16 deletions.
@@ -1,21 +1,27 @@
 ## Aim
 
 To build a representative 100k dataset which is available for testing / developing builds locally.
-This is intended to run weekly via a GitHub action (which triggers a job to be run on AWS).
-It will make two files available:
+This is intended to run weekly via a GitHub action (which triggers jobs to be run on AWS).
+It will upload these files:
 
+* `s3://nextstrain-data/files/ncov/open/100k/metadata.tsv.xz`
+* `s3://nextstrain-data/files/ncov/open/100k/sequences.fasta.xz`
 * `s3://nextstrain-ncov-private/100k/metadata.tsv.xz`
 * `s3://nextstrain-ncov-private/100k/sequences.fasta.xz`
 
 While this profile is not recommended to be run locally, you can see what rules would be run via:
 
 ```
-snakemake --cores 1 --configfile nextstrain_profiles/100k/config.yaml -npf upload --dag | dot -Tpdf > dag.pdf
+snakemake --cores 1 --configfile nextstrain_profiles/100k/config-gisaid.yaml -npf upload --dag | dot -Tpdf > dag-100k-gisaid.pdf
+snakemake --cores 1 --configfile nextstrain_profiles/100k/config-open.yaml -npf upload --dag | dot -Tpdf > dag-100k-open.pdf
 ```
 
-To run manually you can trigger the GitHub action or run the job locally via:
+To run manually you can trigger the GitHub action (recommended) or run the jobs locally via:
 ```
 nextstrain build --aws-batch --cpus 16 --memory 31GiB --detach . \
---configfile nextstrain_profiles/100k/config.yaml \
+--configfile nextstrain_profiles/100k/config-gisaid.yaml \
 -f upload
+nextstrain build --aws-batch --cpus 16 --memory 31GiB --detach . \
+--configfile nextstrain_profiles/100k/config-open.yaml \
+-f upload
 ```
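
The four upload destinations in the README appear to be each profile's destination bucket joined with the two uploaded file names. A minimal Python sketch of that mapping follows, assuming a plain slash join. `DESTINATIONS`, `UPLOADED_FILES` and `upload_urls` are hypothetical names used only for illustration, and the GISAID prefix is inferred from the README URLs rather than read from a config file shown in this commit.

```python
# Hypothetical illustration; not part of the ncov workflow.
# Bucket prefixes are read off the S3 URLs in the README. The GISAID one
# is an assumption, since that config file is not shown in this diff.
DESTINATIONS = {
    "open": "nextstrain-data/files/ncov/open/100k",
    "gisaid": "nextstrain-ncov-private/100k",
}

# File names uploaded by each 100k build.
UPLOADED_FILES = ["metadata.tsv.xz", "sequences.fasta.xz"]


def upload_urls(source: str) -> list[str]:
    """Return the S3 URLs that one data source's build would upload to."""
    prefix = DESTINATIONS[source]
    return [f"s3://{prefix}/{name}" for name in UPLOADED_FILES]


if __name__ == "__main__":
    for source in DESTINATIONS:
        for url in upload_urls(source):
            print(url)
```

Keeping the prefix per data source in one table makes it easy to see that the open and GISAID builds never write to the same location.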
@@ -0,0 +1,30 @@
+# This file is largely duplicated from `config-gisaid.yaml` - please
+# see that file for comments
+S3_DST_BUCKET: "nextstrain-data/files/ncov/open/100k" # TODO XXX
+S3_DST_ORIGINS: [needed-for-workflow-but-unused]
+deploy_url: needed_for_workflow_but_unused
+custom_rules:
+  - workflow/snakemake_rules/export_for_nextstrain.smk
+inputs:
+  - name: open
+    metadata: "s3://nextstrain-data/files/ncov/open/metadata.tsv.zst"
+    aligned: "s3://nextstrain-data/files/ncov/open/sequences.fasta.zst"
+    skip_sanitize_metadata: true
+builds:
+  100k:
+    subsampling_scheme: 100k_scheme
+upload:
+  metadata.tsv.xz: results/100k/100k_subsampled_metadata.tsv.xz
+  sequences.fasta.xz: results/100k/100k_subsampled_sequences.fasta.xz
+filter:
+  exclude_where: "division='USA'"
+subsampling:
+  100k_scheme:
+    50k_early:
+      group_by: "year month country"
+      max_sequences: 50000
+      max_date: "--max-date 1Y"
+    50k_late:
+      group_by: "year month country"
+      max_sequences: 50000
+      min_date: "--min-date 1Y"
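
The `100k_scheme` splits the data at a cutoff one year before the run date and takes up to 50,000 sequences from each side, giving a ceiling of 100,000. The rough Python sketch below shows that split under two assumptions: that `1Y` means one calendar year before today, and that both date bounds are inclusive, so a sample dated exactly on the cutoff is eligible for both tiers. `cutoff_date` and `tiers_for` are hypothetical helpers, not augur functions, and the 29 February fallback is a choice made for this sketch only.

```python
from datetime import date

# Per-tier caps copied from the `max_sequences` values in the config.
TIER_LIMITS = {"50k_early": 50_000, "50k_late": 50_000}


def cutoff_date(today: date, years: int = 1) -> date:
    """Return the date `years` calendar years before `today`."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; this sketch
        # falls back to 28 February (an assumption, not augur's behavior).
        return today.replace(year=today.year - years, day=28)


def tiers_for(sample: date, today: date) -> list[str]:
    """List the subsampling tiers whose date window contains `sample`."""
    cutoff = cutoff_date(today)
    tiers = []
    if sample <= cutoff:  # corresponds to --max-date 1Y
        tiers.append("50k_early")
    if sample >= cutoff:  # corresponds to --min-date 1Y
        tiers.append("50k_late")
    return tiers


if __name__ == "__main__":
    today = date.today()
    print("cutoff:", cutoff_date(today))
    print("ceiling:", sum(TIER_LIMITS.values()))
```

Because both tiers group by `year month country`, the actual number of sequences kept can fall below the ceiling when some groups contain few samples.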