Skip to content

Added differential analysis function #28

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 7, 2025
Merged

Conversation

nbpeterson3
Copy link
Contributor

#2 - Differential analysis function that takes ECLIPSE output (GRanges object) and BAM files and utilizes csaw and edgeR to perform sliding window-based read counting of classified super enhancer regions, normalization, filtering based on global background, statistical testing, and merging of windows into consolidated regions. It returns a GRanges object with the merged intervals and combined test statistics.

Copy link
Owner

@j-andrews7 j-andrews7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am testing on a real dataset, so may have more comments soon.

#' Default is `150`.
#' @param window.spacing An integer specifying the distance (bp) between windows.
#' Default is `50`.
#' @param paired A boolean specifying whether reads are paired-end.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this must be specified.


# ------------ WINDOWS ------------
# concatenate super enhancer regions (super == T)
all.SEs <- c(se.1[se.1$super], se.2[se.2$super])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be nice to allow this filtering to be turned off and the analysis run for all stitched regions if wanted. The default can be to limit to SEs as you have it here.

#' Default is `50`.
#' @param paired A boolean specifying whether reads are paired-end.
#' Default is `NULL`.
#' @param read.length An integer specifying the read length. If paired-end, this argument is ignored.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be fragment.length instead of read.length.

window.size = 150, window.spacing = 50,
# optional args to readParam for restricting chromosomes or excluding intervals
# restrict = NULL, discard = NULL,
paired = NULL, read.length = NULL, quality = 20,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go ahead and set read.length (soon to be fragment.length) to 200 bp by default, as that step is indeed pretty slow, especially if you have a lot of bams.

#' differential analysis results including log-fold changes, p-values, FDR, and other statistics.
#'
#' @importFrom GenomicRanges slidingWindows reduce mcols
#' @importFrom csaw readParam correlateReads maximizeCcf regionCounts windowCounts filterWindowsGlobal normOffsets asDGEList
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing mergeResults here.

@j-andrews7
Copy link
Owner

j-andrews7 commented Apr 1, 2025

Solid, this runs and the results look relatively sensible.

It'd be nice to have a way to prioritize which SEs to look at more closely based on how dramatic the differences are. If you wanted to write a summary function that spits out counts of differential regions on a per-SE basis, that'd be helpful.

Something that returns a GRanges with SE region, ID, num_windows_diff, num_windows_up, num_windows_down.

And then the last 3 columns but expressed as ratios of total windows for region.

Could also just do all of that in this function and return two ranges objects as a named list - one with the windows, one with the consensus SEs with this info appended. That might be cleaner actually. Thoughts?

@j-andrews7
Copy link
Owner

One other note is there is no way to control the comparison direction, e.g. group1 vs group2 or vice versa in terms of the results returned.

It might be cleaner to be more explicit with how the BAMs and groups are defined. bam.files could be split to group1.bams and group2.bams and then conditions replaced with group1.name and group2.name so that the group1 vs group2 comparison is explicit and the directionality of the output is clear. Right now, you have to look at the underlying data to determine what "up" and "down" mean in the results.

@j-andrews7
Copy link
Owner

Also, let's stick with underscore delimiters for function names rather than camelCase. Maybe call this one find_differential.

@j-andrews7
Copy link
Owner

After a bit more use, I think the bg.fc should probably be less aggressive by default, as I am seeing loss of "real" signal regions. I think being somewhat less aggressive with this setting makes sense since we're already feeding in enriched regions.

I will test a few other settings and see what value might be more generally appropriate. This will likely vary from dataset to dataset to some degree.

As an example, with default settings I see instances like this:
image

Where the bottom are the regions being retained for testing and those clearly "real" signal regions on the right are being filtered.

@j-andrews7
Copy link
Owner

I also think we don't limit the max width of the combined regions by default.

@j-andrews7
Copy link
Owner

I am gonna merge this, as I need the functionality to test with other new changes I'm working on.

Some of the easier stuff here (like default params), I will probably just swap, but I'll open issues for other stuff that could use their own PR.

@j-andrews7 j-andrews7 merged commit 2cf1bcd into j-andrews7:dev Apr 7, 2025
@nbpeterson3
Copy link
Contributor Author

After a bit more use, I think the bg.fc should probably be less aggressive by default, as I am seeing loss of "real" signal regions. I think being somewhat less aggressive with this setting makes sense since we're already feeding in enriched regions.

I will test a few other settings and see what value might be more generally appropriate. This will likely vary from dataset to dataset to some degree.

As an example, with default settings I see instances like this: image

Where the bottom are the regions being retained for testing and those clearly "real" signal regions on the right are being filtered.

From your testing, have you found a more reasonable threshold?

@j-andrews7
Copy link
Owner

j-andrews7 commented Apr 10, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants