-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hi-C workflow #139
base: main
Are you sure you want to change the base?
Hi-C workflow #139
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another partial review. Still chugging through this 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is a doozy. I skimmed some parts. Need to do another closer review of some parts, but this is enough feedback for now
docker/bedtools/2.31.1/Dockerfile
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this version dir should be 2.31.1-0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is named after bedtools
, why does it have hic
scripts in it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally, it was just plain bedtools
, but with the decision to remove the embedded scripts, those needed to get built in to some image and this one depends on bedtools
. I can go somewhere else, but then we'll need to install bedtools
in to another container.
docker/bowtie2/2.5.4/Dockerfile
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2.5.4-0
docker/hilow/1.0.0/Dockerfile
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we maybe consolidate some of these new Docker images? It looks like we can maybe merge bedtools
, hilow
and juicertools
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we were trying to avoid these types of monolithic images? We can do that, but is that the direction we want to go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. There's a balance to be found, but I'm not sure what it is yet... Responding to #139 (comment) in this thread to consolidate conversation.
Maybe we can merge hilow
and juicertools
(is my understanding correct that they are usually used together?), building that image with bedtools
and moving the hic
script which depends on bedtools
into there? That's a bit "monolithic", but keeps the bedtools
image "clean" and hopefully isn't too sprawling.
Thoughts on that?
scripts/hic/filter_hic.py
Outdated
args = get_args() | ||
|
||
f=open(args.filter_pairs) | ||
blackID=defaultdict(int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this was just a copy+paste, but can we use proper casing in this file? And blackID
should probably be updated to exclude_list
or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably wise. These are just copy+paste.
scripts/hic/qc_hic.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we incorporate some Python tooling? A linter of some sort, formatter, etc? Don't really want to ask you to rewrite all these py scripts which appear to be mostly copy+paste, but also they aren't really conformant to any Python standards...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran both of these scripts through black
. Do you think we should to something additional? Now that we're maintaining a scripts
directory, we should probably do this at a repo-level. I didn't do any formatting of the methylation
scripts in that PR. That also contains R code, which I have no experience formatting/linting (I suspect R is always ugly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running through black
is a good start, and these do look better now.
My major concern is that with the maintenance of a scripts
directory, these files are quite visible now and as such they should be up to our code quality standards. However our code quality standards aren't exactly defined anywhere, so I suppose that gives us some latitude?
There's a million and one Python lint frameworks out there, and I don't have strong feelings about which to use, but I do think we should be using something.
My local VScode set-up uses the pylance
extension from Microsoft (https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) and according to those docs, under the hood pylance is using pyright
(https://github.com/microsoft/pyright).
Maybe we can add a black
check and a pyright
check to our CI? That doesn't sound too labor intensive on the surface.
} | ||
} | ||
|
||
task fastq_to_sam { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
task fastq_to_sam { | |
task fastq_to_ubam { |
???
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I do think your suggestion makes sense, the underlying Picard tool is FastqToSam
and I don't want to break that connection.
} | ||
|
||
Float fastq_size = size(read_one_fastq_gz, "GiB") + size(read_two_fastq_gz, "GiB") | ||
Int disk_size_gb = ceil(fastq_size * 2) + 10 + modify_disk_size_gb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this calculation solid? I'm not really sure what to expect for the size of a uBAM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect them to be around the same size as the FASTQ files. Since they're name sorted (since they're unaligned), you're not going to gain compression from shared sequence (most likely). I'm not sure if we should have additional buffer.
@@ -1327,7 +1327,90 @@ task faidx { | |||
cpu: 1 | |||
memory: "4 GB" | |||
disks: "~{disk_size_gb} GB" | |||
container: "quay.io/biocontainers/samtools:1.17--h00cdaf9_0" | |||
container: "quay.io/biocontainers/samtools:1.19.2--h50ea8bc_0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be updated for the rest of the file? I thought you added a lint or a CI or something to catch these cases...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was just one stray task that didn't get updated. Nothing else uses 1.17--h00cdaf9_0
at the moment.
external_help: "https://www.htslib.org/doc/samtools-sort.html", | ||
} | ||
prefix: "Prefix for the output file. The extension `.bam` will be added." | ||
uncompressed: "Output uncompressed BAM?" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think until streaming comes to WDL, we should probably avoid allowing uncompressed output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some engines (e.g. dxCompiler) support streaming at the moment.
call bowtie2.build { input: | ||
reference = reference_download.downloaded_file, | ||
prefix = basename(reference_fa_name, ".fa.gz"), | ||
ncpu = 10, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably shouldn't be hardcoded here.
Implements a workflow to generate a bowtie2-aligned BAM and a
.hic
file for analysis. This utilizes the commonly used HiC-Pro workflow.