-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GitHub Action for Nextclade annotations #158
Conversation
Adds rules, config, and GitHub Action file to support running Nextclade on all available sequences. Not yet tested.
Remove unnecessary configuration parameters from the Nextclade build config and update the workflow to allow these parameters to be missing. Since Snakemake evaluates the Python code in each rule's inputs, outputs, and params, rules that we don't plan to run in the workflow can produce key errors when their config parameters are not defined.
Simplifies the logic to get Nextclade datasets by following the same pattern as the flu_frequencies workflow [1] where we grab the default dataset for a given lineage and segment instead of specifying a reference name. The "broad" and more recent references for H3N2 HA, for example, are not too different from each other, but the Nextclade annotations for the former are far more verbose than for the latter. We also want the files produced by this workflow to plug directly into the flu_frequencies workflow logic, so it is best to use the same approach here. [1] https://github.com/nextstrain/flu_frequencies/blob/6e4298fac3361f4a6751d85bcb963064dbb9eee1/Snakefile#L95
Adds NA to list of segments, since we want to know the subclade annotations for NA as well as HA and use these data to estimate frequencies.
Not doing an in-depth review, just reading over this PR because I'm interested in how you handled the different segments.
It'd be good to discuss what's the easiest for downstream users here (granted that may currently only be the flu-frequencies workflow?).
It's possible to test a new GitHub Action with either Tom's or Victor's work-around |
Add temporary trigger for Nextclade workflow on PR event. This should trigger the workflow when I push the update to the PR. If it works, I should drop this commit again.
Totally agree, @joverlee521! @plsteinberg is one such downstream user who has to manually join the metadata and HA Nextclade annotations for their project, so we might have an idea of how that could be improved based on their feedback. 😄
Gah, I knew this existed but forgot the mechanics. Thank you for the reminder! Trying with Victor's approach now. |
The workflow ran successfully, so removing this trigger.
Reduce threads requested for Nextclade runs from 16 to 12 so we can run 3 Nextclade jobs at once (one per lineage) on a 36-core instance of AWS Batch.
I'm going to merge this now, so I can start running the GitHub Action after our weekly ingests. We should continue to discuss this implementation in the future, though, since there is likely still remove to improve the user experience. |
Description of proposed changes
Adds rules, config, and GitHub Action file to support running Nextclade on all available HA and NA sequences for H1N1pdm, H3N2, and Vic.
The new Snakemake logic lives in a custom build config named
nextclade
which skips most of the standard workflow build logic, using only the "download from S3" rules and its own custom rules to run Nextclade for the lineages and segments defined in the build config. This custom logic runs Nextclade with the default dataset per lineage and segment just like the flu_frequencies workflow and uploads the results to S3. Once we have merged this PR, we should be able to automatically run Nextclade with each ingest of new data and run flu_frequencies from the resulting files on S3.This PR includes a couple of minor changes to other parts of the standard workflow to allow the Nextclade build config YAML to be as simple as possible and also to allow all workflows to download the parsed sequences and metadata from S3 instead of downloading the raw sequences and parsing them again locally.
To run the Nextclade workflow, use the following command:
This uploads Nextclade annotations and alignment files to S3 per lineage and segment like
seasonal-flu/vic/ha/nextclade.tsv.xz
andseasonal-flu/vic/ha/aligned.fasta.xz
.This PR does not produce a merged metadata and Nextclade annotations file like the one used in the flu_frequencies workflow or the analogous ncov metadata files with Nextclade annotations included. I stopped short of creating this merged file in S3, too, because we use a single metadata TSV per lineage (for all segments) but we produce Nextclade annotations per segment. We could upload the metadata TSVs per lineage and segment with Nextclade annotations merged into a single file, but this would duplicate a lot of information across segments. Maybe that duplication is acceptable, but it's worth discussing more internally first.
Note: since this PR adds a new GitHub Action, I can't test the action until we've merged the PR into master.Related issue(s)
Related to #144
Checklist