Skip to content

Commit

Permalink
Add bigwig value ascii histogram and extreme value counts to optional…
Browse files Browse the repository at this point in the history
… bigwig outlier report (#6179)

* adding new tool bigwig_outlier_bed

* remove redundant script

* replace test.bw with test-data/1.bigwig to decrease test outputs below 1MB maximum for linter.

* added bigtools to bio.tools for this new entry
added edam

* Add a proper version command by importing pybedtools

* fix doi for pybigtools

* Incorporate Bjoern's ideas.

Single select to configure 3 possible outputs.

Upper quantile now required. Chide if no low quantile and no output because low not in the choice.

Help vastly expanded.

* Added the bigwig metadata name as the label for bed feature name so no need for the user to supply one.

* Clearing out old test data. Passes tests here but diffs in CI. Something odd.

* readding fixed test outputs again again.

* Ah. was overwriting a bed with the two bigwig test.

* remove print leftovers

* Separated python script to enable access for linting in CI

* fix flake8 complaints with black on the new separate python script

* fix imports

* Do not make the output contig statistics table if there's no table output needed.

* make the table calculations optional and mostly as a separate function since they may not be needed.

* Clean up some comments.
makeTableRow renamed

* Update tools/bigwig_outlier_bed/.shed.yml

Co-authored-by: Björn Grüning <bjoern@gruenings.eu>

* Update tools/bigwig_outlier_bed/.shed.yml

Co-authored-by: Björn Grüning <bjoern@gruenings.eu>

* Clean up all field prompt text and consolidate the help text into chunks.

* remove test for both qhi/qlo because qhi is not optional

* fix broken qlo parameter passing and tests - add test for no qlo.

* Flake fix

* update to 0.2.0 pybigtools
test outputs are different :(

* add ascii histogram and table of descriptive stats to tableout

* fixes to tableout

* clean up table output

* update readme

* add test and option to only produce table - no bed so faster.
extend tool version string

* typo

* update tests to add bigwig label to histogram rows

* fix flake

* fix test outputs

* remove non-standard version token to avoid update bot failure

---------

Co-authored-by: Björn Grüning <bjoern@gruenings.eu>
  • Loading branch information
fubar2 and bgruening authored Jul 25, 2024
1 parent 4f8728c commit 3cce4c7
Show file tree
Hide file tree
Showing 14 changed files with 5,160 additions and 4,656 deletions.
70 changes: 68 additions & 2 deletions tools/bigwig_outlier_bed/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,9 @@

### July 30 2024 for the VGP

This code will soon become a Galaxy tool, for building some of the [NIH MARBL T2T assembly polishing](https://github.com/marbl/training) tools as Galaxy workflows.
This is a Galaxy tool, for building some of the [NIH MARBL T2T assembly polishing](https://github.com/marbl/training) tools as Galaxy workflows.

JBrowse2 2.12.3 update will include a plugin for optional colours to distinguish bed features, shown being tested in the screenshots below.
JBrowse2 now includes a plugin for optional colours to distinguish bed features, shown being tested in the screenshots below.

### Find and mark BigWig peaks to a bed file for display

Expand All @@ -29,3 +29,69 @@ It is just not feasible to hold all contigs in the entire decoded bigwig in RAM
better to sample across all chromosomes so as not to lose any systematic differences between them - the current method will hide those
differences unfortunately. Sampling might be possible. Looking at the actual quantile values across a couple of test bigwigs suggests that
there is not much variation between chromosomes but there's now a tabular report to check them for each input bigwig.

### Table reports

The optional table output report gives a crude histogram and the top/bottom 10 values to help
understand what is likely to be informative. In this example, there are 26700 zero values so
using a lower cutoff quantile is likely to have a lot of them, although a large window requirement
will decease the overload...

Descriptive measures
bigwig test
contig chr10_PATERNAL
n 135711693
mean 12.178164
std 7.997467
min 0.000000
max 365.000000
qtop 364.00
qbot noqlo
First/Last 10 value counts
Value Count
0.00 26700
1.00 82900
2.00 261400
3.00 676993
4.00 1665500
5.00 3125700
6.00 5078000
7.00 7469000
8.00 10191700
9.00 12544600
355.00 100
356.00 100
357.00 300
358.00 100
360.00 500
361.00 300
362.00 200
363.00 600
364.00 900
365.00 700
Histogram of bigwig values
chr10_PATERNAL 18.25 | 127,047,593 | **************************************************************************
chr10_PATERNAL 36.50 | 7,510,000 | ****
chr10_PATERNAL 54.75 | 818,900 |
chr10_PATERNAL 73.00 | 117,200 |
chr10_PATERNAL 91.25 | 51,900 |
chr10_PATERNAL 109.50 | 44,200 |
chr10_PATERNAL 127.75 | 21,600 |
chr10_PATERNAL 146.00 | 17,900 |
chr10_PATERNAL 164.25 | 16,400 |
chr10_PATERNAL 182.50 | 18,600 |
chr10_PATERNAL 200.75 | 5,400 |
chr10_PATERNAL 219.00 | 6,600 |
chr10_PATERNAL 237.25 | 6,200 |
chr10_PATERNAL 255.50 | 3,900 |
chr10_PATERNAL 273.75 | 4,500 |
chr10_PATERNAL 292.00 | 7,100 |
chr10_PATERNAL 310.25 | 3,000 |
chr10_PATERNAL 328.50 | 2,700 |
chr10_PATERNAL 346.75 | 3,500 |
chr10_PATERNAL 365.00 | 4,500 |
chr10_PATERNAL ------------ |------------ |
chr10_PATERNAL N= | 135,711,693 |
chr10_PATERNAL ------------ |------------ |


148 changes: 132 additions & 16 deletions tools/bigwig_outlier_bed/bigwig_outlier_bed.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,6 @@
"""

import argparse
import copy
import os
import sys
from pathlib import Path
Expand All @@ -22,6 +21,90 @@
import pybigtools


class asciihist:

def __init__(
self,
data,
bins=10,
minmax=None,
str_tag="",
scale_output=80,
generate_only=True,
):
"""
https://gist.github.com/bgbg/608d9ef4fd75032731651257fe67fc81
Create an ASCII histogram from an interable of numbers.
Author: Boris Gorelik boris@gorelik.net. based on http://econpy.googlecode.com/svn/trunk/pytrix/pytrix.py
License: MIT
"""
self.data = data
self.minmax = minmax
self.str_tag = str_tag
self.bins = bins
self.generate_only = generate_only
self.scale_output = scale_output
self.itarray = np.asanyarray(self.data)
if self.minmax == "auto":
self.minmax = np.percentile(data, [5, 95])
if self.minmax[0] == self.minmax[1]:
# for very ugly distributions
self.minmax = None
if self.minmax is not None:
# discard values that are outside minmax range
mn = self.minmax[0]
mx = self.minmax[1]
self.itarray = self.itarray[self.itarray >= mn]
self.itarray = self.itarray[self.itarray <= mx]

def draw(self):
values, counts = np.unique(self.data, return_counts=True)
if len(values) <= 20:
self.bins = len(values)
ret = []
if self.itarray.size:
total = len(self.itarray)
counts, cutoffs = np.histogram(self.itarray, bins=self.bins)
cutoffs = cutoffs[1:]
if self.str_tag:
self.str_tag = "%s " % self.str_tag
else:
self.str_tag = ""
if self.scale_output is not None:
scaled_counts = counts.astype(float) / counts.sum() * self.scale_output
else:
scaled_counts = counts
footerbar = "{:s}{:s} |{:s} |".format(
self.str_tag,
"-" * 12,
"-" * 12,
)
if self.minmax is not None:
ret.append(
"Trimmed to range (%s - %s)"
% (str(self.minmax[0]), str(self.minmax[1]))
)
for cutoff, original_count, scaled_count in zip(
cutoffs, counts, scaled_counts
):
ret.append(
"{:s}{:>12.2f} |{:>12,d} | {:s}".format(
self.str_tag, cutoff, original_count, "*" * int(scaled_count)
)
)
ret.append(footerbar)
ret.append("{:s}{:>12s} |{:>12,d} |".format(self.str_tag, "N=", total))
ret.append(footerbar)
ret.append("")
else:
ret = []
if not self.generate_only:
for line in ret:
print(line)
ret = "\n".join(ret)
return ret


class findOut:

def __init__(self, args):
Expand All @@ -34,14 +117,16 @@ def __init__(self, args):
self.bedouthilo = args.bedouthilo
self.tableoutfile = args.tableoutfile
self.bedwin = args.minwin
self.qhi = args.qhi
self.qlo = None
try:
f = float(args.qlo)
self.qlo = f
except Exception as e:
s = str(e)
print(s, ' qlo=', args.qlo)
self.qhi = None
if args.outbeds != "outtab":
self.qhi = args.qhi
if args.qlo:
try:
f = float(args.qlo)
self.qlo = f
except Exception:
print("qlo not provided")
nbw = len(args.bigwig)
nlab = len(args.bigwiglabels)
if nlab < nbw:
Expand Down Expand Up @@ -106,10 +191,11 @@ def makeTableRow(self, bw, bwlabel, chr):
def makeBed(self):
bedhi = []
bedlo = []
restab = []
bwlabels = self.bwlabels
bwnames = self.bwnames
if self.tableoutfile:
restab = ["bigwig\tcontig\tn\tmean\tstd\tmin\tmax\tqtop\tqbot"]
bwnames.sort()
reshead = "bigwig\tcontig\tn\tmean\tstd\tmin\tmax\tqtop\tqbot"
for i, bwname in enumerate(bwnames):
bwlabel = bwlabels[i].replace(" ", "")
fakepath = "in%d.bw" % i
Expand All @@ -120,16 +206,38 @@ def makeBed(self):
bwf = pybigtools.open(fakepath)
chrlist = bwf.chroms()
chrs = list(chrlist.keys())
chrs.sort()
for chr in chrs:
first_few = None
bw = bwf.values(chr)
values, counts = np.unique(bw, return_counts=True)
nvalues = len(values)
if nvalues <= 20:
histo = "\n".join(
[
"%s: %f occurs %d times" % (chr, values[x], counts[x])
for x in range(len(values))
]
)
else:
last10 = range(nvalues - 10, nvalues)
first_few = ["%.2f\t%d" % (values[x], counts[x]) for x in range(10)]
first_few += ["%.2f\t%d" % (values[x], counts[x]) for x in last10]
first_few.insert(0, "First/Last 10 value counts\nValue\tCount")
ha = asciihist(data=bw, bins=20, str_tag="%s_%s" % (bwlabel, chr))
histo = ha.draw()
histo = (
"\n".join(first_few)
+ "\nHistogram of %s bigwig values\n" % bwlabel
+ histo
)
bw = bw[~np.isnan(bw)] # some have NaN if parts of a contig not covered
if self.qhi is not None:
self.bwtop = np.quantile(bw, self.qhi)
bwhi = self.processVals(bw, isTop=True)
for j, seg in enumerate(bwhi):
if seg[1] - seg[0] >= self.bedwin:
score = np.sum(bw[seg[0]:seg[1]])
seglen = seg[1] - seg[0]
if seglen >= self.bedwin:
score = np.sum(bw[seg[0]:seg[1]]) / float(seglen)
bedhi.append(
(
chr,
Expand All @@ -144,7 +252,7 @@ def makeBed(self):
bwlo = self.processVals(bw, isTop=False)
for j, seg in enumerate(bwlo):
if seg[1] - seg[0] >= self.bedwin:
score = -1 * np.sum(bw[seg[0]:seg[1]])
score = -1 * np.sum(bw[seg[0]:seg[1]]) / float(seglen)
bedlo.append(
(
chr,
Expand All @@ -156,7 +264,15 @@ def makeBed(self):
)
if self.tableoutfile:
row = self.makeTableRow(bw, bwlabel, chr)
restab.append(copy.copy(row))
resheadl = reshead.split("\t")
rowl = row.split()
desc = ["%s\t%s" % (resheadl[x], rowl[x]) for x in range(len(rowl))]
desc.insert(0, "Descriptive measures")
descn = "\n".join(desc)
restab.append(descn)
restab.append(histo)
if os.path.isfile(fakepath):
os.remove(fakepath)
if self.tableoutfile:
stable = "\n".join(restab)
with open(self.tableoutfile, "w") as t:
Expand All @@ -175,7 +291,7 @@ def makeBed(self):
allbed = bedlo + bedhi
self.writeBed(allbed, self.bedouthilo)
some = True
if not some:
if not ((self.outbeds == "outtab") or some):
sys.stderr.write(
"Invalid configuration - no output could be created. Was qlo missing and only low output requested for example?"
)
Expand Down
38 changes: 25 additions & 13 deletions tools/bigwig_outlier_bed/bigwig_outlier_bed.xml
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
<tool name="Bigwig extremes to bed features" id="bigwig_outlier_bed" version="@TOOL_VERSION@" profile="22.05">
<tool name="Bigwig extremes to bed features" id="bigwig_outlier_bed" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="22.05">
<description>Writes high and low bigwig runs as features in a bed file</description>
<macros>
<token name="@TOOL_VERSION@">0.2.0</token>
<token name="@NUMPY_VERSION@">2.0.0</token>
<token name="@PYTHON_VERSION@">3.12.3</token>
<token name="@VERSION_SUFFIX@">0</token>
</macros>
<edam_topics>
<edam_topic>topic_0157</edam_topic>
Expand All @@ -17,7 +17,7 @@
</xrefs>
<requirements>
<requirement type="package" version="@PYTHON_VERSION@">python</requirement>
<requirement type="package" version="@NUMPY_VERSION@">numpy</requirement>
<requirement type="package" version="2.0.0">numpy</requirement>
<requirement type="package" version="@TOOL_VERSION@">pybigtools</requirement>
</requirements>
<required_files>
Expand Down Expand Up @@ -50,41 +50,44 @@
#if $qlo:
--qlo '$qlo'
#end if
#if $tableout == "create":
#if $tableout == "create" or $outbeds == "outtab":
--tableoutfile '$tableoutfile'
#end if
]]></command>
<inputs>
<param name="bigwig" type="data" optional="false" label="Choose one or more bigwig file(s) to return outlier regions as a bed file"
help="If more than one, MUST all use the same reference sequence to be displayable. Feature names will include the bigwig label." format="bigwig" multiple="true"/>
<param name="minwin" type="integer" value="10" label="Minimum continuous bases to count as a high or low bed feature"
help="Continuous features as long or longer than this window size will appear as bed features"/>
<param name="qhi" type="float" value="0.99" label="Quantile cutoff for a high region - 0.99 will cut off at or above the 99th percentile" help="Required" optional="false"/>
<param name="qlo" type="float" value="0.01" label="Quantile cutoff for a low region - 0.01 will cut off at or below the 1st percentile." help="Optional" optional="true"/>
<param name="outbeds" type="select" label="Select the required bed file outputs" help="Any combination of the 3 different kinds of bed file output can be made">
help="Minimum continuous length to count as a bed feature. If windowed bigwig, must be bigger than window size to have any effect"/>
<param name="qhi" type="float" value="0.99999" label="Quantile cutoff for a high region - 0.99999 will cut off at about 1 in 100,000"
help="1 per 100k might be a few thousand features in a 200M chromosome - depends on the distribution - see the table output" optional="false"/>
<param name="qlo" type="float" value="0.00001" label="Quantile cutoff for a low region - 0.01 will cut off at or below the 1st percentile." help="Optional" optional="true"/>
<param name="outbeds" type="select" label="Select the required bed file outputs or none for a bigwig value distribution report"
help="Any combination of the 3 different kinds of bed file output can be made">
<option value="outhilo" selected="true">Make 1 bed output with both low and high regions</option>
<option value="outhi">Make 1 bed output with high regions only</option>
<option value="outlo">Make 1 bed output with low regions only</option>
<option value="outall">Make 3 bed outputs with low and high together in one, high in one and low in the other</option>
<option value="outlohi">Make 2 bed outputs with high in one and low in the other</option>
<option value="outtab">NO bed outputs. Report bigwig value distribution only</option>
</param>
<param name="tableout" type="select" label="Write a table showing contig statistics for each bigwig input" help="">
<option value="donotmake">Do not create this report</option>
<option value="create" selected="true">Create this report</option>
</param>
</inputs>
<outputs>
<data name="bedouthilo" format="bed" label="High_and_low_bed" hidden="false">
<data name="bedouthilo" format="bed" label="High_and_low_bed">
<filter>outbeds in ["outall", "outhilo"]</filter>
</data>
<data name="bedouthi" format="bed" label="High bed" hidden="false">
<data name="bedouthi" format="bed" label="High bed">
<filter>outbeds in ["outall", "outlohi", "outhi"]</filter>
</data>
<data name="bedoutlo" format="bed" label="Low bed" hidden="false">
<data name="bedoutlo" format="bed" label="Low bed">
<filter>outbeds in ["outall", "outlohi", "outlo"]</filter>
</data>
<data name="tableoutfile" format="tabular" label="Contig statistics" hidden="false">
<filter>tableout == "create"</filter>
<data name="tableoutfile" format="txt" label="Contig statistics">
<filter>tableout == "create" or outbeds == "outtab"</filter>
</data>
</outputs>
<tests>
Expand All @@ -97,6 +100,15 @@
<param name="qlo" value="0.01"/>
<param name="tableout" value="donotmake"/>
</test>
<test expect_num_outputs="1">
<output name="tableoutfile" value="table_only_sample" compare="diff" lines_diff="0"/>
<param name="outbeds" value="outtab"/>
<param name="bigwig" value="bigwig_sample,1.bigwig"/>
<param name="minwin" value="10"/>
<param name="qhi" value="0.99"/>
<param name="qlo" value="0.01"/>
<param name="tableout" value="create"/>
</test>
<test expect_num_outputs="2">
<output name="bedouthilo" value="bedouthilo_sample" compare="diff" lines_diff="0"/>
<output name="tableoutfile" value="table_sample" compare="diff" lines_diff="0"/>
Expand Down
Loading

0 comments on commit 3cce4c7

Please sign in to comment.