Warn / fail when we encounter massive datasets? #37

Open
fedarko opened this issue Aug 19, 2022 · 0 comments
Labels
documentation Improvements or additions to documentation notes Not necessarily a single issue, just thoughts on something testing Plans for future tests, issues with current tests, etc.

Comments

@fedarko
Owner

fedarko commented Aug 19, 2022

Not "massive" in the sense of "a large HiFi dataset", but massive in the sense of "this dataset is unrealistically massive and will start to cause weird overflow problems".

Fast-failing (e.g. "This contig is too long") is fine, IMO -- the main thing I want to avoid is producing silently incorrect results.

I imagine most of the code should either work as expected or fail loudly for arbitrarily large datasets. Python is good for this sort of stuff (in my experience, at least): it supports arbitrarily large integers, for example, and it'll throw an OverflowError (or a MemoryError) if you try to make a ridiculously long string.
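To make the Python behavior above concrete, here is a minimal sketch. It assumes CPython; the helper name `try_huge_string` is made up for illustration, and which of the two exceptions gets raised depends on the requested size and the platform.

```python
# Python integers do not overflow: they grow as needed.
big = 2**31 * 2**31
assert big == 2**62
assert 2**64 + 1 == 18446744073709551617


def try_huge_string():
    """Try to build an absurdly long string; return the name of the error raised.

    Hypothetical helper, for illustration only. On CPython the repeat count
    2**64 does not fit in an index-sized integer, so this fails loudly
    before any memory is allocated.
    """
    try:
        "a" * (2**64)
    except (OverflowError, MemoryError) as err:
        return type(err).__name__
    return None


print(try_huge_string())
```

Either way the failure is loud, which is the property that matters here.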

The main thing worth worrying about, I think, is our use of external tools and libraries: samtools, minimap2, bcftools, prodigal, pysam, pysamstats, LJA.

I guess this issue can track the problems these libraries have with massive datasets; we can then add our own checks into strainFlye that fail fast and warn users if any of these problems come up.

  • bcftools index, using the default CSI format, "...supports indexing of chromosomes up to length 2^31."

    • For reference, 2^31 = 2,147,483,648 (about 2.15 billion). It's unlikely we'd see prokaryotic genomes this long, I think, but I could imagine this happening eventually.
    • It isn't clear to me what bcftools index's behavior is when it encounters a chromosome longer than this -- does it fail silently or loudly?
  • BCF files: see sections 1.3 and 6.3.3 of the spec for info on supported datatypes for each field.

... There are more issues besides these; this is just the start of the list.
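A fail-fast check for the bcftools index limit could look something like the sketch below. This is not existing strainFlye code: `CSI_MAX_CONTIG_LENGTH`, `ContigTooLongError`, and `check_contig_lengths` are hypothetical names, and the choice to reject only lengths strictly greater than 2^31 is an assumption based on the "up to length 2^31" wording quoted above (the exact boundary behavior would need to be confirmed).

```python
# Assumed limit, taken from the bcftools index documentation quote
# ("supports indexing of chromosomes up to length 2^31").
CSI_MAX_CONTIG_LENGTH = 2**31


class ContigTooLongError(ValueError):
    """Raised when at least one contig exceeds the assumed index limit."""


def check_contig_lengths(contig_lengths, max_length=CSI_MAX_CONTIG_LENGTH):
    """Fail loudly if any contig is longer than max_length.

    contig_lengths: dict mapping contig name -> length in bp.
    Assumption: lengths equal to max_length are still accepted.
    """
    too_long = {
        name: length
        for name, length in contig_lengths.items()
        if length > max_length
    }
    if too_long:
        details = ", ".join(
            f"{name} ({length:,} bp)" for name, length in sorted(too_long.items())
        )
        raise ContigTooLongError(
            f"Contig(s) longer than {max_length:,} bp: {details}. "
            "Downstream indexing may not support contigs this long."
        )


# Typical prokaryotic contigs pass without complaint.
check_contig_lengths({"edge_1": 5_000_000, "edge_2": 1_200_000})
```

Running a check like this before calling out to external tools means an oversized contig produces an error up front, rather than relying on each tool to fail loudly on its own.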

@fedarko fedarko added documentation Improvements or additions to documentation testing Plans for future tests, issues with current tests, etc. notes Not necessarily a single issue, just thoughts on something labels Aug 19, 2022