a smaller test dataset #33

Open
5 of 9 tasks
aryarm opened this issue Jul 4, 2021 · 0 comments
aryarm commented Jul 4, 2021

Our current test dataset comprises all of chr1 from two samples: the Jurkat and MOLT4 cell lines. Running the entire pipeline on this dataset takes about an hour.

Ideally, we would have a dataset that runs in roughly 10 minutes or less. That dataset could then be incorporated into a GitHub CI pipeline that runs automatically upon each major and minor version release, so that we know when a change we've made to the code leads to a change in the results.

  • find SNVs and indels supported by all of the callers
  • choose just one or two peaks from each of the two samples that overlap those variants
  • subset the example dataset to only the reads that overlap those peaks
  • also try to subset the reference genome packaged with the example data, since the reference genome currently appears to be the largest file (see the sketch after this list for how these subsetting steps might look)
  • rerun the pipeline with the smaller dataset and tweak the dataset as needed to make it run quickly
  • use `snakemake --generate-unit-tests` to create a suite of tests that can be executed with `pytest` (see the example commands after this list)
    • I'm running into issues with this: it doesn't work for outputs marked as pipe, and there are some problems with other directories (see "edge cases fail with --generate-unit-tests", snakemake/snakemake#1104)
    • fix those issues and ensure the test coverage is appropriate
    • remove any unnecessary tests to keep the test directory small enough to be properly included in version history (edit: this won't be possible after all, because the test directory has to include the outputs of each rule)
  • (optionally) create a GitHub Action like this one to run `pytest` on each major or minor version increment and confirm the tests pass
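
A rough sketch of the subsetting steps above, assuming standard tools (bcftools, bedtools, samtools) are available. The file names, number of callers, and coordinates are all hypothetical placeholders; the actual inputs in the example data will differ.

```bash
# 1) variants supported by all callers (assuming three callers here, for illustration)
bcftools isec -n=3 -p consensus/ caller1.vcf.gz caller2.vcf.gz caller3.vcf.gz

# 2) keep just one or two peaks that overlap those consensus variants
bedtools intersect -u -a jurkat.peaks.bed -b consensus/0000.vcf | head -n 2 > jurkat.subset.bed

# 3) subset the reads to only those peaks
samtools view -b -L jurkat.subset.bed -o jurkat.subset.bam jurkat.bam
samtools index jurkat.subset.bam

# 4) subset the packaged reference to the region spanned by the peaks
#    (coordinates are placeholders)
samtools faidx hg38.fa chr1:1000000-1050000 > ref.subset.fa
```

The same subsetting would be repeated for the second sample (MOLT4), and the peak coordinates would determine how much of the reference needs to be kept.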
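For the unit-test item, the basic sequence of commands would look something like the following. The `--cores` value is a placeholder; the pipeline has to be run on the small dataset first so that every rule's inputs and outputs exist, and Snakemake writes the generated tests to `.tests/unit/` by default.

```bash
# run the pipeline on the small dataset so each rule's files are present
snakemake --cores 4

# generate one pytest test case per rule, then run them
snakemake --generate-unit-tests
pytest .tests/unit/ -v
```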
aryarm added the enhancement label Jul 4, 2021
aryarm self-assigned this Jul 4, 2021
aryarm added this to the VarCA v2.0.0 milestone Jul 4, 2021
aryarm added a commit that referenced this issue Jul 16, 2021
A limit on the size of our dataset is that manta requires at least 100 high-quality reads, so we can't go any smaller than that.
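
As a rough sanity check while shrinking the data, something like the following could confirm a subset BAM still clears that threshold. What counts as "high-quality" is manta's own criterion; the MAPQ cutoff and flag filter below are only an approximation, and the file name is hypothetical.

```bash
# approximate count of usable reads: mapped, primary, non-supplementary, MAPQ >= 20
samtools view -c -q 20 -F 0x904 jurkat.subset.bam
```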