You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Figuring out unreasonable positions in a contig requires going through each position in this contig in the alignment; for edge 6104 (CAMP) in the SheepGut dataset, this takes about 13 minutes on the cluster. Which is reasonable, but slow. This is by far the longest step in FDR estimation (which, aside from this, takes maybe ~2 minutes or so for the ≥ 1 Mbp contigs in SheepGut), so if we could cut this time down that'd be great.
Since fdr estimate -- as of writing -- assumes that the input BCF file comes from call, it would make sense to have call produce a file listing unreasonable positions in all contigs that the user can directly pass into fdr estimate. This way, we avoid needing to look at the alignment at all (so, we can remove the --bam parameter from fdr estimate). Having this file of unreasonable positions around could be useful for other purposes, also (maybe for the mutation matrices)?
I'm not sure what the ideal format of this file would be, since the number of unreasonable positions in a given contig can range from 0 to length(contig). Relevant Stack Overflow thread here, although in that example the max number of columns has a hard limit across the entire file.
Maybe we could use a TSV file with two columns ("contig name" and "unreasonable positions"), where the second column contains a comma-separated list of (1-indexed, probably?) positions. This way, the file can be loaded easily using pandas or other TSV parsers (one "column" for each contig, besides the name), and each contig's list of positions can be treated as just a string and parsed easily.
Move unreasonable position detection over to call
Have call output a TSV of unreasonable positions for each contig
Modify fdr estimate to replace BAM input with the unreasonable-position TSV input
Update tests
Modify tutorials accordingly
maaaybe too much work for the short term
The text was updated successfully, but these errors were encountered:
Figuring out unreasonable positions in a contig requires going through each position in this contig in the alignment; for edge 6104 (CAMP) in the SheepGut dataset, this takes about 13 minutes on the cluster. Which is reasonable, but slow. This is by far the longest step in FDR estimation (which, aside from this, takes maybe ~2 minutes or so for the ≥ 1 Mbp contigs in SheepGut), so if we could cut this time down that'd be great.
Since
fdr estimate
-- as of writing -- assumes that the input BCF file comes fromcall
, it would make sense to havecall
produce a file listing unreasonable positions in all contigs that the user can directly pass intofdr estimate
. This way, we avoid needing to look at the alignment at all (so, we can remove the--bam
parameter fromfdr estimate
). Having this file of unreasonable positions around could be useful for other purposes, also (maybe for the mutation matrices)?I'm not sure what the ideal format of this file would be, since the number of unreasonable positions in a given contig can range from
0
tolength(contig)
. Relevant Stack Overflow thread here, although in that example the max number of columns has a hard limit across the entire file.Maybe we could use a TSV file with two columns ("contig name" and "unreasonable positions"), where the second column contains a comma-separated list of (1-indexed, probably?) positions. This way, the file can be loaded easily using pandas or other TSV parsers (one "column" for each contig, besides the name), and each contig's list of positions can be treated as just a string and parsed easily.
call
call
output a TSV of unreasonable positions for each contigfdr estimate
to replace BAM input with the unreasonable-position TSV inputmaaaybe too much work for the short term
The text was updated successfully, but these errors were encountered: