Output file of "unreasonable position" information during naive calling #48

Open
5 tasks
fedarko opened this issue Sep 11, 2022 · 0 comments
Labels
enhancement (New feature or request), performance (gotta go fast)

fedarko commented Sep 11, 2022

Figuring out which positions in a contig are unreasonable requires going through every position of that contig in the alignment. For edge 6104 (CAMP) in the SheepGut dataset, this takes about 13 minutes on the cluster, which is tolerable but slow. It is by far the longest step in FDR estimation (everything else takes roughly 2 minutes for the ≥ 1 Mbp contigs in SheepGut), so it'd be great if we could cut this time down.

Since `fdr estimate` (as of writing) assumes that the input BCF file comes from `call`, it would make sense to have `call` produce a file listing the unreasonable positions in all contigs, which the user can pass directly into `fdr estimate`. This way, `fdr estimate` wouldn't need to look at the alignment at all, so we could remove its `--bam` parameter. Having this file of unreasonable positions around could also be useful for other purposes (maybe for the mutation matrices?).

I'm not sure what the ideal format of this file would be, since the number of unreasonable positions in a given contig can range from 0 to length(contig). Relevant Stack Overflow thread here, although in that example the max number of columns has a hard limit across the entire file.

Maybe we could use a TSV file with two columns ("contig name" and "unreasonable positions"), where the second column contains a comma-separated list of (1-indexed, probably?) positions. This way, the file can be loaded easily using pandas or other TSV parsers (one row per contig, with a single column besides the name), and each contig's list of positions can be read as a plain string and split on commas.
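A rough sketch of what writing and reading that two-column TSV could look like, just to check the format round-trips (including contigs with zero unreasonable positions). The function names, the header text, and the example contig names are all made up for illustration; nothing here is existing code from `call` or `fdr estimate`.

```python
import io

# Hypothetical header; the real column names are still undecided.
HEADER = "contig\tunreasonable_positions"


def write_unreasonable_tsv(positions_by_contig, out):
    """Write one row per contig: name, then comma-separated positions."""
    out.write(HEADER + "\n")
    for contig, positions in positions_by_contig.items():
        # Sort so the output is deterministic; an empty collection
        # produces an empty second field.
        joined = ",".join(str(p) for p in sorted(positions))
        out.write(f"{contig}\t{joined}\n")


def read_unreasonable_tsv(inp):
    """Parse the TSV back into {contig name: list of int positions}."""
    header = inp.readline().rstrip("\n")
    if header != HEADER:
        raise ValueError(f"Unexpected header: {header!r}")
    result = {}
    for line in inp:
        contig, pos_str = line.rstrip("\n").split("\t")
        # An empty field means this contig has no unreasonable positions.
        result[contig] = [int(p) for p in pos_str.split(",")] if pos_str else []
    return result


# Round trip with made-up positions (assumed 1-indexed).
buf = io.StringIO()
write_unreasonable_tsv({"edge_6104": {12, 3, 500}, "edge_1": set()}, buf)
buf.seek(0)
loaded = read_unreasonable_tsv(buf)
```

This splits lines by hand rather than using Python's `csv` module, since `csv` enforces a default field size limit that a contig with very many positions could exceed. If the file is loaded with pandas instead, note that `read_csv` treats empty fields as NaN by default, so contigs with zero positions would need something like `keep_default_na=False`.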

  • Move unreasonable position detection over to `call`
  • Have `call` output a TSV of unreasonable positions for each contig
  • Modify `fdr estimate` to replace BAM input with the unreasonable-position TSV input
  • Update tests
  • Modify tutorials accordingly

maaaybe too much work for the short term

fedarko added the enhancement and performance labels on Sep 11, 2022