Add bigwig value ascii histogram and extreme value counts to optional…

… bigwig outlier report (#6179) * adding new tool bigwig_outlier_bed * remove redundant script * replace test.bw with test-data/1.bigwig to decrease test outputs below 1MB maximum for linter. * added bigtools to bio.tools for this new entry added edam * Add a proper version command by importing pybedtools * fix doi for pybigtools * Incorporate Bjoern's ideas. Single select to configure 3 possible outputs. Upper quantile now required. Chide if no low quantile and no output because low not in the choice. Help vastly expanded. * Added the bigwig metadata name as the label for bed feature name so no need for the user to supply one. * Clearing out old test data. Passes tests here but diffs in CI. Something odd. * readding fixed test outputs again again. * Ah. was overwriting a bed with the two bigwig test. * remove print leftovers * Separated python script to enable access for linting in CI * fix flake8 complaints with black on the new separate python script * fix imports * Do not make the output contig statistics table if there's no table output needed. * make the table calculations optional and mostly as a separate function since they may not be needed. * Clean up some comments. makeTableRow renamed * Update tools/bigwig_outlier_bed/.shed.yml Co-authored-by: Björn Grüning <bjoern@gruenings.eu> * Update tools/bigwig_outlier_bed/.shed.yml Co-authored-by: Björn Grüning <bjoern@gruenings.eu> * Clean up all field prompt text and consolidate the help text into chunks. * remove test for both qhi/qlo because qhi is not optional * fix broken qlo parameter passing and tests - add test for no qlo. * Flake fix * update to 0.2.0 pybigtools test outputs are different :( * add ascii histogram and table of descriptive stats to tableout * fixes to tableout * clean up table output * update readme * add test and option to only produce table - no bed so faster. extend tool version string * typo * update tests to add bigwig label to histogram rows * fix flake * fix test outputs * remove non-standard version token to avoid update bot failure --------- Co-authored-by: Björn Grüning <bjoern@gruenings.eu>
galaxyproject · Jul 25, 2024 · 3cce4c7 · 3cce4c7
1 parent 4f8728c
commit 3cce4c7
Show file tree

Hide file tree

Showing 14 changed files with 5,160 additions and 4,656 deletions.
diff --git a/tools/bigwig_outlier_bed/README.md b/tools/bigwig_outlier_bed/README.md
@@ -2,9 +2,9 @@
 
 ### July 30 2024 for the VGP
 
-This code will soon become a Galaxy tool, for building some of the [NIH MARBL T2T assembly polishing](https://github.com/marbl/training) tools as Galaxy workflows.
+This is a Galaxy tool, for building some of the [NIH MARBL T2T assembly polishing](https://github.com/marbl/training) tools as Galaxy workflows.
 
-JBrowse2 2.12.3 update will include a plugin for optional colours to distinguish bed features, shown being tested in the screenshots below.
+JBrowse2 now includes a plugin for optional colours to distinguish bed features, shown being tested in the screenshots below.
 
 ### Find and mark BigWig peaks to a bed file for display
 
@@ -29,3 +29,69 @@ It is just not feasible to hold all contigs in the entire decoded bigwig in RAM
 better to sample across all chromosomes so as not to lose any systematic differences between them - the current method will hide those
 differences unfortunately. Sampling might be possible. Looking at the actual quantile values across a couple of test bigwigs suggests that
 there is not much variation between chromosomes but there's now a tabular report to check them for each input bigwig.
+
+### Table reports
+
+The optional table output report gives a crude histogram and the top/bottom 10 values to help 
+understand what is likely to be informative. In this example, there are 26700 zero values so
+using a lower cutoff quantile is likely to have a lot of them, although a large window requirement
+will decease the overload...
+
+Descriptive measures
+bigwig  test
+contig  chr10_PATERNAL
+n       135711693
+mean    12.178164
+std     7.997467
+min     0.000000
+max     365.000000
+qtop    364.00
+qbot    noqlo
+First/Last 10 value counts
+Value   Count
+0.00    26700
+1.00    82900
+2.00    261400
+3.00    676993
+4.00    1665500
+5.00    3125700
+6.00    5078000
+7.00    7469000
+8.00    10191700
+9.00    12544600
+355.00  100
+356.00  100
+357.00  300
+358.00  100
+360.00  500
+361.00  300
+362.00  200
+363.00  600
+364.00  900
+365.00  700
+Histogram of bigwig values
+chr10_PATERNAL        18.25 | 127,047,593 | **************************************************************************
+chr10_PATERNAL        36.50 |   7,510,000 | ****
+chr10_PATERNAL        54.75 |     818,900 |
+chr10_PATERNAL        73.00 |     117,200 |
+chr10_PATERNAL        91.25 |      51,900 |
+chr10_PATERNAL       109.50 |      44,200 |
+chr10_PATERNAL       127.75 |      21,600 |
+chr10_PATERNAL       146.00 |      17,900 |
+chr10_PATERNAL       164.25 |      16,400 |
+chr10_PATERNAL       182.50 |      18,600 |
+chr10_PATERNAL       200.75 |       5,400 |
+chr10_PATERNAL       219.00 |       6,600 |
+chr10_PATERNAL       237.25 |       6,200 |
+chr10_PATERNAL       255.50 |       3,900 |
+chr10_PATERNAL       273.75 |       4,500 |
+chr10_PATERNAL       292.00 |       7,100 |
+chr10_PATERNAL       310.25 |       3,000 |
+chr10_PATERNAL       328.50 |       2,700 |
+chr10_PATERNAL       346.75 |       3,500 |
+chr10_PATERNAL       365.00 |       4,500 |
+chr10_PATERNAL ------------ |------------ |
+chr10_PATERNAL           N= | 135,711,693 |
+chr10_PATERNAL ------------ |------------ |
+
+
diff --git a/tools/bigwig_outlier_bed/bigwig_outlier_bed.py b/tools/bigwig_outlier_bed/bigwig_outlier_bed.py
@@ -13,7 +13,6 @@
 """
 
 import argparse
-import copy
 import os
 import sys
 from pathlib import Path
@@ -22,6 +21,90 @@
 import pybigtools
 
 
+class asciihist:
+
+    def __init__(
+        self,
+        data,
+        bins=10,
+        minmax=None,
+        str_tag="",
+        scale_output=80,
+        generate_only=True,
+    ):
+        """
+        https://gist.github.com/bgbg/608d9ef4fd75032731651257fe67fc81
+        Create an ASCII histogram from an interable of numbers.
+        Author: Boris Gorelik boris@gorelik.net. based on  http://econpy.googlecode.com/svn/trunk/pytrix/pytrix.py
+        License: MIT
+        """
+        self.data = data
+        self.minmax = minmax
+        self.str_tag = str_tag
+        self.bins = bins
+        self.generate_only = generate_only
+        self.scale_output = scale_output
+        self.itarray = np.asanyarray(self.data)
+        if self.minmax == "auto":
+            self.minmax = np.percentile(data, [5, 95])
+            if self.minmax[0] == self.minmax[1]:
+                # for very ugly distributions
+                self.minmax = None
+        if self.minmax is not None:
+            # discard values that are outside minmax range
+            mn = self.minmax[0]
+            mx = self.minmax[1]
+            self.itarray = self.itarray[self.itarray >= mn]
+            self.itarray = self.itarray[self.itarray <= mx]
+
+    def draw(self):
+        values, counts = np.unique(self.data, return_counts=True)
+        if len(values) <= 20:
+            self.bins = len(values)
+        ret = []
+        if self.itarray.size:
+            total = len(self.itarray)
+            counts, cutoffs = np.histogram(self.itarray, bins=self.bins)
+            cutoffs = cutoffs[1:]
+            if self.str_tag:
+                self.str_tag = "%s " % self.str_tag
+            else:
+                self.str_tag = ""
+            if self.scale_output is not None:
+                scaled_counts = counts.astype(float) / counts.sum() * self.scale_output
+            else:
+                scaled_counts = counts
+            footerbar = "{:s}{:s} |{:s} |".format(
+                self.str_tag,
+                "-" * 12,
+                "-" * 12,
+            )
+            if self.minmax is not None:
+                ret.append(
+                    "Trimmed to range (%s - %s)"
+                    % (str(self.minmax[0]), str(self.minmax[1]))
+                )
+            for cutoff, original_count, scaled_count in zip(
+                cutoffs, counts, scaled_counts
+            ):
+                ret.append(
+                    "{:s}{:>12.2f} |{:>12,d} | {:s}".format(
+                        self.str_tag, cutoff, original_count, "*" * int(scaled_count)
+                    )
+                )
+            ret.append(footerbar)
+            ret.append("{:s}{:>12s} |{:>12,d} |".format(self.str_tag, "N=", total))
+            ret.append(footerbar)
+            ret.append("")
+        else:
+            ret = []
+        if not self.generate_only:
+            for line in ret:
+                print(line)
+        ret = "\n".join(ret)
+        return ret
+
+
 class findOut:
 
     def __init__(self, args):
@@ -34,14 +117,16 @@ def __init__(self, args):
         self.bedouthilo = args.bedouthilo
         self.tableoutfile = args.tableoutfile
         self.bedwin = args.minwin
-        self.qhi = args.qhi
         self.qlo = None
-        try:
-            f = float(args.qlo)
-            self.qlo = f
-        except Exception as e:
-            s = str(e)
-            print(s, ' qlo=', args.qlo)
+        self.qhi = None
+        if args.outbeds != "outtab":
+            self.qhi = args.qhi
+            if args.qlo:
+                try:
+                    f = float(args.qlo)
+                    self.qlo = f
+                except Exception:
+                    print("qlo not provided")
         nbw = len(args.bigwig)
         nlab = len(args.bigwiglabels)
         if nlab < nbw:
@@ -106,10 +191,11 @@ def makeTableRow(self, bw, bwlabel, chr):
     def makeBed(self):
         bedhi = []
         bedlo = []
+        restab = []
         bwlabels = self.bwlabels
         bwnames = self.bwnames
-        if self.tableoutfile:
-            restab = ["bigwig\tcontig\tn\tmean\tstd\tmin\tmax\tqtop\tqbot"]
+        bwnames.sort()
+        reshead = "bigwig\tcontig\tn\tmean\tstd\tmin\tmax\tqtop\tqbot"
         for i, bwname in enumerate(bwnames):
             bwlabel = bwlabels[i].replace(" ", "")
             fakepath = "in%d.bw" % i
@@ -120,16 +206,38 @@ def makeBed(self):
             bwf = pybigtools.open(fakepath)
             chrlist = bwf.chroms()
             chrs = list(chrlist.keys())
-            chrs.sort()
             for chr in chrs:
+                first_few = None
                 bw = bwf.values(chr)
+                values, counts = np.unique(bw, return_counts=True)
+                nvalues = len(values)
+                if nvalues <= 20:
+                    histo = "\n".join(
+                        [
+                            "%s: %f occurs %d times" % (chr, values[x], counts[x])
+                            for x in range(len(values))
+                        ]
+                    )
+                else:
+                    last10 = range(nvalues - 10, nvalues)
+                    first_few = ["%.2f\t%d" % (values[x], counts[x]) for x in range(10)]
+                    first_few += ["%.2f\t%d" % (values[x], counts[x]) for x in last10]
+                    first_few.insert(0, "First/Last 10 value counts\nValue\tCount")
+                    ha = asciihist(data=bw, bins=20, str_tag="%s_%s" % (bwlabel, chr))
+                    histo = ha.draw()
+                    histo = (
+                        "\n".join(first_few)
+                        + "\nHistogram of %s bigwig values\n" % bwlabel
+                        + histo
+                    )
                 bw = bw[~np.isnan(bw)]  # some have NaN if parts of a contig not covered
                 if self.qhi is not None:
                     self.bwtop = np.quantile(bw, self.qhi)
                     bwhi = self.processVals(bw, isTop=True)
                     for j, seg in enumerate(bwhi):
-                        if seg[1] - seg[0] >= self.bedwin:
-                            score = np.sum(bw[seg[0]:seg[1]])
+                        seglen = seg[1] - seg[0]
+                        if seglen >= self.bedwin:
+                            score = np.sum(bw[seg[0]:seg[1]]) / float(seglen)
                             bedhi.append(
                                 (
                                     chr,
@@ -144,7 +252,7 @@ def makeBed(self):
                     bwlo = self.processVals(bw, isTop=False)
                     for j, seg in enumerate(bwlo):
                         if seg[1] - seg[0] >= self.bedwin:
-                            score = -1 * np.sum(bw[seg[0]:seg[1]])
+                            score = -1 * np.sum(bw[seg[0]:seg[1]]) / float(seglen)
                             bedlo.append(
                                 (
                                     chr,
@@ -156,7 +264,15 @@ def makeBed(self):
                             )
                 if self.tableoutfile:
                     row = self.makeTableRow(bw, bwlabel, chr)
-                    restab.append(copy.copy(row))
+                    resheadl = reshead.split("\t")
+                    rowl = row.split()
+                    desc = ["%s\t%s" % (resheadl[x], rowl[x]) for x in range(len(rowl))]
+                    desc.insert(0, "Descriptive measures")
+                    descn = "\n".join(desc)
+                    restab.append(descn)
+                    restab.append(histo)
+        if os.path.isfile(fakepath):
+            os.remove(fakepath)
         if self.tableoutfile:
             stable = "\n".join(restab)
             with open(self.tableoutfile, "w") as t:
@@ -175,7 +291,7 @@ def makeBed(self):
             allbed = bedlo + bedhi
             self.writeBed(allbed, self.bedouthilo)
             some = True
-        if not some:
+        if not ((self.outbeds == "outtab") or some):
             sys.stderr.write(
                 "Invalid configuration - no output could be created. Was qlo missing and only low output requested for example?"
             )

diff --git a/tools/bigwig_outlier_bed/bigwig_outlier_bed.xml b/tools/bigwig_outlier_bed/bigwig_outlier_bed.xml
@@ -1,9 +1,9 @@
-<tool name="Bigwig extremes to bed features" id="bigwig_outlier_bed" version="@TOOL_VERSION@" profile="22.05">
+<tool name="Bigwig extremes to bed features" id="bigwig_outlier_bed" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="22.05">
   <description>Writes high and low bigwig runs as features in a bed file</description>
   <macros>
   <token name="@TOOL_VERSION@">0.2.0</token>
-  <token name="@NUMPY_VERSION@">2.0.0</token>
   <token name="@PYTHON_VERSION@">3.12.3</token>
+  <token name="@VERSION_SUFFIX@">0</token>
   </macros>
   <edam_topics>
       <edam_topic>topic_0157</edam_topic>
@@ -17,7 +17,7 @@
   </xrefs>
   <requirements>
     <requirement type="package" version="@PYTHON_VERSION@">python</requirement>
-    <requirement type="package" version="@NUMPY_VERSION@">numpy</requirement>
+    <requirement type="package" version="2.0.0">numpy</requirement>
     <requirement type="package" version="@TOOL_VERSION@">pybigtools</requirement>
   </requirements>
   <required_files>
@@ -50,41 +50,44 @@
 #if $qlo:
 --qlo '$qlo'
 #end if
-#if $tableout == "create":
+#if $tableout == "create" or $outbeds == "outtab":
  --tableoutfile '$tableoutfile'
 #end if
 ]]></command>
   <inputs>
     <param name="bigwig" type="data" optional="false" label="Choose one or more bigwig file(s) to return outlier regions as a bed file" 
       help="If more than one, MUST all use the same reference sequence to be displayable. Feature names will include the bigwig label." format="bigwig" multiple="true"/>
     <param name="minwin" type="integer" value="10" label="Minimum continuous bases to count as a high or low bed feature" 
-      help="Continuous features as long or longer than this window size will appear as bed features"/>
-    <param name="qhi" type="float" value="0.99" label="Quantile cutoff for a high region - 0.99 will cut off at or above the 99th percentile" help="Required" optional="false"/>
-    <param name="qlo" type="float" value="0.01" label="Quantile cutoff for a low region - 0.01 will cut off at or below the 1st percentile." help="Optional" optional="true"/>
-    <param name="outbeds" type="select" label="Select the required bed file outputs" help="Any combination of the 3 different kinds of bed file output can be made">
+      help="Minimum continuous length to count as a bed feature. If windowed bigwig, must be bigger than window size to have any effect"/>
+    <param name="qhi" type="float" value="0.99999" label="Quantile cutoff for a high region - 0.99999 will cut off at about 1 in 100,000" 
+       help="1 per 100k might be a few thousand features in a 200M chromosome - depends on the distribution - see the table output" optional="false"/>
+    <param name="qlo" type="float" value="0.00001" label="Quantile cutoff for a low region - 0.01 will cut off at or below the 1st percentile." help="Optional" optional="true"/>
+    <param name="outbeds" type="select" label="Select the required bed file outputs or none for a bigwig value distribution report" 
+      help="Any combination of the 3 different kinds of bed file output can be made">
       <option value="outhilo" selected="true">Make 1 bed output with both low and high regions</option>
       <option value="outhi">Make 1 bed output with high regions only</option>
       <option value="outlo">Make 1 bed output with low regions only</option>
       <option value="outall">Make 3 bed outputs with low and high together in one, high in one and low in the other</option>
       <option value="outlohi">Make 2 bed outputs with high in one and low in the other</option>
+      <option value="outtab">NO bed outputs. Report bigwig value distribution only</option>
     </param>
     <param name="tableout" type="select" label="Write a table showing contig statistics for each bigwig input" help="">
       <option value="donotmake">Do not create this report</option>
       <option value="create" selected="true">Create this report</option>
     </param>
   </inputs>
   <outputs>
-    <data name="bedouthilo" format="bed" label="High_and_low_bed" hidden="false">
+    <data name="bedouthilo" format="bed" label="High_and_low_bed">
       <filter>outbeds in ["outall", "outhilo"]</filter>
     </data>
-    <data name="bedouthi" format="bed" label="High bed" hidden="false">
+    <data name="bedouthi" format="bed" label="High bed">
       <filter>outbeds in ["outall", "outlohi", "outhi"]</filter>
     </data>
-    <data name="bedoutlo" format="bed" label="Low bed" hidden="false">
+    <data name="bedoutlo" format="bed" label="Low bed">
       <filter>outbeds in ["outall", "outlohi", "outlo"]</filter>
     </data>
-    <data name="tableoutfile" format="tabular" label="Contig statistics" hidden="false">
-      <filter>tableout == "create"</filter>
+    <data name="tableoutfile" format="txt" label="Contig statistics">
+      <filter>tableout == "create" or outbeds == "outtab"</filter>
     </data>
   </outputs>
   <tests>
@@ -97,6 +100,15 @@
       <param name="qlo" value="0.01"/>
       <param name="tableout" value="donotmake"/>
     </test>
+    <test expect_num_outputs="1">
+      <output name="tableoutfile" value="table_only_sample" compare="diff" lines_diff="0"/>
+      <param name="outbeds" value="outtab"/>
+      <param name="bigwig" value="bigwig_sample,1.bigwig"/>
+      <param name="minwin" value="10"/>
+      <param name="qhi" value="0.99"/>
+      <param name="qlo" value="0.01"/>
+      <param name="tableout" value="create"/>
+    </test>
     <test expect_num_outputs="2">
       <output name="bedouthilo" value="bedouthilo_sample" compare="diff" lines_diff="0"/>
       <output name="tableoutfile" value="table_sample" compare="diff" lines_diff="0"/>