Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hi-C workflow #139

Open
wants to merge 126 commits into
base: main
Choose a base branch
from
Open

Hi-C workflow #139

wants to merge 126 commits into from

Conversation

adthrasher
Copy link
Member

@adthrasher adthrasher commented Mar 22, 2024

Implements a workflow to generate a bowtie2-aligned BAM and a .hic file for analysis. This utilizes the commonly used HiC-Pro workflow.

@adthrasher adthrasher self-assigned this Mar 22, 2024
@adthrasher adthrasher mentioned this pull request Apr 29, 2024
tools/bowtie2.wdl Outdated Show resolved Hide resolved
Copy link
Member

@a-frantz a-frantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another partial review. Still chugging through this 😅

tools/hilow.wdl Outdated Show resolved Hide resolved
tools/hilow.wdl Outdated Show resolved Hide resolved
tools/hilow.wdl Outdated Show resolved Hide resolved
tools/hilow.wdl Outdated Show resolved Hide resolved
tools/hilow.wdl Show resolved Hide resolved
tools/hilow.wdl Outdated Show resolved Hide resolved
@adthrasher adthrasher requested a review from a-frantz January 15, 2025 20:35
Copy link
Member

@a-frantz a-frantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is a doozy. I skimmed some parts. Need to do another closer review of some parts, but this is enough feedback for now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this version dir should be 2.31.1-0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is named after bedtools, why does it have hic scripts in it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, it was just plain bedtools, but with the decision to remove the embedded scripts, those needed to get built in to some image and this one depends on bedtools. I can go somewhere else, but then we'll need to install bedtools in to another container.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2.5.4-0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we maybe consolidate some of these new Docker images? It looks like we can maybe merge bedtools, hilow and juicertools?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were trying to avoid these types of monolithic images? We can do that, but is that the direction we want to go?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. There's a balance to be found, but I'm not sure what it is yet... Responding to #139 (comment) in this thread to consolidate conversation.

Maybe we can merge hilow and juicertools (is my understanding correct that they are usually used together?), building that image with bedtools and moving the hic script which depends on bedtools into there? That's a bit "monolithic", but keeps the bedtools image "clean" and hopefully isn't too sprawling.

Thoughts on that?

args = get_args()

f=open(args.filter_pairs)
blackID=defaultdict(int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this was just a copy+paste, but can we use proper casing in this file? And blackID should probably be updated to exclude_list or something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably wise. These are just copy+paste.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we incorporate some Python tooling? A linter of some sort, formatter, etc? Don't really want to ask you to rewrite all these py scripts which appear to be mostly copy+paste, but also they aren't really conformant to any Python standards...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran both of these scripts through black. Do you think we should to something additional? Now that we're maintaining a scripts directory, we should probably do this at a repo-level. I didn't do any formatting of the methylation scripts in that PR. That also contains R code, which I have no experience formatting/linting (I suspect R is always ugly).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running through black is a good start, and these do look better now.

My major concern is that with the maintenance of a scripts directory, these files are quite visible now and as such they should be up to our code quality standards. However our code quality standards aren't exactly defined anywhere, so I suppose that gives us some latitude?

There's a million and one Python lint frameworks out there, and I don't have strong feelings about which to use, but I do think we should be using something.

My local VScode set-up uses the pylance extension from Microsoft (https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) and according to those docs, under the hood pylance is using pyright (https://github.com/microsoft/pyright).

Maybe we can add a black check and a pyright check to our CI? That doesn't sound too labor intensive on the surface.

}
}

task fastq_to_sam {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
task fastq_to_sam {
task fastq_to_ubam {

???

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I do think your suggestion makes sense, the underlying Picard tool is FastqToSam and I don't want to break that connection.

}

Float fastq_size = size(read_one_fastq_gz, "GiB") + size(read_two_fastq_gz, "GiB")
Int disk_size_gb = ceil(fastq_size * 2) + 10 + modify_disk_size_gb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this calculation solid? I'm not really sure what to expect for the size of a uBAM.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect them to be around the same size as the FASTQ files. Since they're name sorted (since they're unaligned), you're not going to gain compression from shared sequence (most likely). I'm not sure if we should have additional buffer.

@@ -1327,7 +1327,90 @@ task faidx {
cpu: 1
memory: "4 GB"
disks: "~{disk_size_gb} GB"
container: "quay.io/biocontainers/samtools:1.17--h00cdaf9_0"
container: "quay.io/biocontainers/samtools:1.19.2--h50ea8bc_0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be updated for the rest of the file? I thought you added a lint or a CI or something to catch these cases...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was just one stray task that didn't get updated. Nothing else uses 1.17--h00cdaf9_0 at the moment.

external_help: "https://www.htslib.org/doc/samtools-sort.html",
}
prefix: "Prefix for the output file. The extension `.bam` will be added."
uncompressed: "Output uncompressed BAM?"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think until streaming comes to WDL, we should probably avoid allowing uncompressed output

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some engines (e.g. dxCompiler) support streaming at the moment.

call bowtie2.build { input:
reference = reference_download.downloaded_file,
prefix = basename(reference_fa_name, ".fa.gz"),
ncpu = 10,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably shouldn't be hardcoded here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants