a smaller test dataset #33

Open
5 of 9 tasks
aryarm opened this issue Jul 4, 2021 · 0 comments
aryarm commented Jul 4, 2021

Our current test dataset comprises all of chr1 from two samples: the Jurkat and MOLT4 cell lines. Running the entire pipeline on this dataset takes about an hour.

Ideally, we would have a dataset that runs in roughly 10 minutes or less. That dataset could then be incorporated into a GitHub CI pipeline that runs automatically upon each major and minor version release, so that we know when a change we've made to the code leads to a change in the results.

  • find SNVs and indels supported by all of the callers
  • choose just one or two peaks from each of the two samples that overlap those variants
  • subset the example dataset to only the reads that overlap those peaks
  • also try to subset the reference genome packaged with the example data, since the reference genome currently appears to be the largest file (see the sketch after this list for how these subsetting steps might look)
  • rerun the pipeline with the smaller dataset and tweak the dataset as needed to make it run quickly
  • use `snakemake --generate-unit-tests` to create a suite of tests that can be executed with `pytest` (see the example commands after this list)
    • I'm running into issues with this: it doesn't work for outputs marked as pipe, and there are some problems with other directories (see "edge cases fail with --generate-unit-tests", snakemake/snakemake#1104)
    • fix those issues and ensure the test coverage is appropriate
    • remove any unnecessary tests to keep the test directory small enough to be properly included in version history (edit: this won't be possible after all, because the test directory has to include the outputs of each rule)
  • (optionally) create a GitHub Action like this one to run `pytest` on each major or minor version increment and confirm the tests pass
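
A rough sketch of the subsetting steps above, assuming standard tools (bcftools, bedtools, samtools) are available. The file names, number of callers, and coordinates are all hypothetical placeholders; the actual inputs in the example data will differ.

```bash
# 1) variants supported by all callers (assuming three callers here, for illustration)
bcftools isec -n=3 -p consensus/ caller1.vcf.gz caller2.vcf.gz caller3.vcf.gz

# 2) keep just one or two peaks that overlap those consensus variants
bedtools intersect -u -a jurkat.peaks.bed -b consensus/0000.vcf | head -n 2 > jurkat.subset.bed

# 3) subset the reads to only those peaks
samtools view -b -L jurkat.subset.bed -o jurkat.subset.bam jurkat.bam
samtools index jurkat.subset.bam

# 4) subset the packaged reference to the region spanned by the peaks
#    (coordinates are placeholders)
samtools faidx hg38.fa chr1:1000000-1050000 > ref.subset.fa
```

The same subsetting would be repeated for the second sample (MOLT4), and the peak coordinates would determine how much of the reference needs to be kept.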
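For the unit-test item, the basic sequence of commands would look something like the following. The `--cores` value is a placeholder; the pipeline has to be run on the small dataset first so that every rule's inputs and outputs exist, and Snakemake writes the generated tests to `.tests/unit/` by default.

```bash
# run the pipeline on the small dataset so each rule's files are present
snakemake --cores 4

# generate one pytest test case per rule, then run them
snakemake --generate-unit-tests
pytest .tests/unit/ -v
```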
aryarm added the enhancement label Jul 4, 2021
aryarm self-assigned this Jul 4, 2021
aryarm added this to the VarCA v2.0.0 milestone Jul 4, 2021
aryarm added a commit that referenced this issue Jul 16, 2021
A limit on the size of our dataset is that manta requires at least 100 high-quality reads, so we can't go any smaller than that.
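
As a rough sanity check while shrinking the data, something like the following could confirm a subset BAM still clears that threshold. What counts as "high-quality" is manta's own criterion; the MAPQ cutoff and flag filter below are only an approximation, and the file name is hypothetical.

```bash
# approximate count of usable reads: mapped, primary, non-supplementary, MAPQ >= 20
samtools view -c -q 20 -F 0x904 jurkat.subset.bam
```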