Merge pull request #209 from qiyunzhu/vector
Formalized the NumPy + Numba solution for ordinal mapper
qiyunzhu authored Oct 4, 2024
2 parents ddea40a + 0edbaff commit da32abf
Showing 14 changed files with 720 additions and 654 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/main.yml
@@ -43,7 +43,9 @@ jobs:
run: pycodestyle .

- name: Run unit tests
run: coverage run -m unittest && coverage lcov
run: |
  export NUMBA_DISABLE_JIT=1
  coverage run -m unittest && coverage lcov
- name: Coveralls
uses: coverallsapp/github-action@v2
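As a side note on this workflow change: setting `NUMBA_DISABLE_JIT=1` makes Numba's `@njit`-decorated functions run as plain Python, so `coverage` can trace lines that would otherwise be hidden inside compiled code. A minimal sketch of the effect (the `total` function is hypothetical, not part of Woltka):

```python
import os

# The variable must be set before Numba compiles anything, ideally
# before importing any module that defines @njit functions.
os.environ['NUMBA_DISABLE_JIT'] = '1'

import numpy as np
from numba import njit


@njit
def total(arr):
    # With JIT disabled, this body executes as ordinary Python,
    # so line-by-line coverage can be collected.
    s = 0
    for x in arr:
        s += x
    return s


print(total(np.arange(5)))  # 10
```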
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -1,9 +1,13 @@
# Change Log

## Version 0.1.6-dev
## Version 0.1.7-dev

### Added
- Formally adopted the NumPy + Numba solution in the ordinal mapper. This significantly accelerated the algorithm ([#209](https://github.com/qiyunzhu/woltka/pull/209)).

### Changed
- Changed the default output subject coverage (`--outcov`) coordinates to BED-like (0-based, exclusive end), so the output can be directly parsed by programs like bedtools. Also added support for GFF-like and other custom formats, as controlled by the parameter `--outcov-fmt`; the two coordinate conventions are compared in the sketch after this list ([#204](https://github.com/qiyunzhu/woltka/pull/204) and [#205](https://github.com/qiyunzhu/woltka/pull/205)).
- Default chunk size is now 1,024 for the plain and range mappers, and 2 ** 20 = 1,048,576 for the ordinal mapper. The latter represents the number of valid query-subject pairs ([#209](https://github.com/qiyunzhu/woltka/pull/209)).
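For reference, the two coordinate conventions mentioned above describe the same region as follows; a small sketch with illustrative values only:

```python
# The same subject region, bases 5 through 10 (1-based, inclusive),
# written in the two conventions:
bed_like = (4, 10)   # BED-like: 0-based start, exclusive end (new default)
gff_like = (5, 10)   # GFF-like: 1-based start, inclusive end

# Converting BED-like coordinates to GFF-like ones:
start0, end0 = bed_like
assert (start0 + 1, end0) == gff_like
```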


## Version 0.1.6 (2/22/2024)
3 changes: 2 additions & 1 deletion ci/conda_requirements.txt
@@ -1,2 +1,3 @@
cython
numpy
numba
biom-format
4 changes: 2 additions & 2 deletions doc/cli.md
@@ -109,8 +109,8 @@ Option | Description

Option | Description
--- | ---
`--chunk` | Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,000 for plain or range mapping, or 1,000,000 for ordinal mapping.
`--cache` | Number of recent classification results to cache for faster subsequent classifications. Default: 1024.
`--chunk` | Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,024 for plain or range mapping, or 2 ** 20 = 1,048,576 for ordinal mapping. The latter cannot exceed 2 ** 22.
`--cache` | Number of recent classification results to cache for faster subsequent classifications. Default: 1,024.
`--no-exe` | Disable calling external programs (`gzip`, `bzip2` and `xz`) for decompression. Otherwise, Woltka will use them if available for faster processing, or switch back to Python if not.
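To relate `--chunk` to the code, the chunked readers in `woltka/align.py` (such as `plain_mapper`, which appears later in this diff) take the chunk size as their `n` argument. A rough usage sketch under assumptions — the file name and `fmt` value are made up, and the exact structure of each yielded chunk is defined in `align.py` rather than spelled out here:

```python
from woltka.align import plain_mapper

# Read an alignment in chunks of 1,024 unique queries (the new default
# for the plain mapper), handling one chunk at a time so that memory
# use stays bounded regardless of the alignment file's size.
with open('alignment.sam') as fh:          # hypothetical input file
    for chunk in plain_mapper(fh, fmt='sam', n=1024):
        pass  # classify the queries in this chunk
```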


2 changes: 1 addition & 1 deletion doc/perform.md
@@ -62,7 +62,7 @@ Simple read-gene matching, with Numba [acceleration](install.md#acceleration) |

Two parameters visibly impact Woltka's speed:

- `--chunk` | Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,000 for plain or range mapping, or 1,000,000 for ordinal mapping.
- `--chunk` | Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,024 for plain or range mapping, or 2 ** 20 = 1,048,576 for ordinal mapping. The latter cannot exceed 2 ** 22.
- `--cache` | Number of recent classification results to cache for faster subsequent classifications. Default: 1024.

Their default values were set based on our experience. However, alternative values could improve (or reduce) performance depending on the computer hardware, input file type, and database capacity. If you plan to routinely process large volumes of data with the same settings, we recommend doing a few test runs on a small dataset to find the values that work best for you.
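The ordinal mapper's speedup in [#209](https://github.com/qiyunzhu/woltka/pull/209) comes from keeping coordinates in NumPy arrays and compiling the hot loop with Numba. The sketch below only illustrates that general pattern and is not Woltka's actual ordinal algorithm; all names and values are invented:

```python
import numpy as np
from numba import njit


@njit(cache=True)
def count_overlaps(starts, ends, points):
    # For each [start, end) interval, count how many sorted query
    # coordinates fall inside it. np.searchsorted is supported in
    # Numba-compiled code, so the loop runs at native speed.
    out = np.zeros(starts.size, dtype=np.int64)
    for i in range(starts.size):
        lo = np.searchsorted(points, starts[i])
        hi = np.searchsorted(points, ends[i])
        out[i] = hi - lo
    return out


gene_starts = np.array([100, 400, 900])
gene_ends = np.array([300, 800, 1200])
read_coords = np.array([150, 250, 450, 950])  # must be sorted
print(count_overlaps(gene_starts, gene_ends, read_coords))  # [2 1 1]
```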
56 changes: 1 addition & 55 deletions woltka/align.py
@@ -8,7 +8,6 @@
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------


"""Functions for parsing alignment / mapping files.
Notes
@@ -45,7 +44,7 @@
from functools import lru_cache


def plain_mapper(fh, fmt=None, excl=None, n=1000):
def plain_mapper(fh, fmt=None, excl=None, n=1024):
"""Read an alignment file in chunks and yield query-to-subject(s) maps.
Parameters
Expand Down Expand Up @@ -619,59 +618,6 @@ def cigar_to_lens_ord(cigar):
return align, align + offset


def parse_sam_file_pd(fh, n=65536):
"""Parse a SAM file (sam) using Pandas.
Parameters
----------
fh : file handle
SAM file to parse.
n : int, optional
Chunk size.
Yields
------
None
Notes
-----
This is a SAM file parser using Pandas. It is slower than the current
parser. The `read_csv` is fast, but the data frame manipulation slows
down the process. It is here for reference only.
"""
return
# with pd.read_csv(fp, sep='\t',
# header=None,
# comment='@',
# na_values='*',
# usecols=[0, 1, 2, 3, 5],
# names=['qname', 'flag', 'rname', 'pos', 'cigar'],
# dtype={'qname': str,
# 'flag': np.uint16,
# 'rname': str,
# 'pos': int,
# 'cigar': str},
# chunksize=n) as reader:
# for chunk in reader:
# chunk.dropna(subset=['rname'], inplace=True)
# # this is slow, because of the function call
# chunk['length'], offset = zip(*chunk['cigar'].apply(
# cigar_to_lens))
# chunk['right'] = chunk['pos'] + offset
# # this is slow, because of the function call
# # chunk['qname'] = chunk[['qname', 'flag']].apply(
# # qname_by_flag, axis=1)
# # this is a faster method
# chunk['qname'] += np.where(
# chunk['qname'].str[-2:].isin(('/1', '/2')), '',
# np.where(np.bitwise_and(chunk['flag'], (1 << 6)), '/1',
# np.where(np.bitwise_and(chunk['flag'], (1 << 7)),
# '/2', '')))
# chunk['score'] = 0
# yield from chunk[['qname', 'rname', 'score', 'length',
# 'pos', 'right']].values


def parse_map_file(fh, *args):
"""Parse a simple mapping file.
4 changes: 2 additions & 2 deletions woltka/biom.py
@@ -164,8 +164,8 @@ def round_biom(table: biom.Table, digits=0):
digits : int, optional
Digits after the decimal point.
Notes
-----
Examples
--------
There is a fully vectorized, much faster alternative:
>>> arr = table.matrix_data.data
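The fully vectorized alternative that the docstring above alludes to could look roughly like the following sketch. It is an assumption about how `table.matrix_data` might be used, not Woltka's implementation, and `np.around` rounds halves to even, which may differ from `round_biom`'s exact behavior:

```python
import numpy as np
import biom


def round_inplace(table: biom.Table, digits: int = 0) -> biom.Table:
    # Operate directly on the nonzero values of the table's sparse matrix.
    arr = table.matrix_data.data
    np.around(arr, digits, out=arr)       # vectorized in-place rounding
    table.matrix_data.eliminate_zeros()   # drop entries that became zero
    return table
```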