Merge pull request #209 from qiyunzhu/vector
Formalized the NumPy + Numba solution for ordinal mapper
qiyunzhu authored Oct 4, 2024
2 parents ddea40a + 0edbaff commit da32abf
Showing 14 changed files with 720 additions and 654 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/main.yml
@@ -43,7 +43,9 @@ jobs:
run: pycodestyle .

- name: Run unit tests
run: coverage run -m unittest && coverage lcov
run: |
  export NUMBA_DISABLE_JIT=1
  coverage run -m unittest && coverage lcov
- name: Coveralls
uses: coverallsapp/github-action@v2
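As a side note on this workflow change: setting `NUMBA_DISABLE_JIT=1` makes Numba's `@njit`-decorated functions run as plain Python, so `coverage` can trace lines that would otherwise be hidden inside compiled code. A minimal sketch of the effect (the `total` function is hypothetical, not part of Woltka):

```python
import os

# The variable must be set before Numba compiles anything, ideally
# before importing any module that defines @njit functions.
os.environ['NUMBA_DISABLE_JIT'] = '1'

import numpy as np
from numba import njit


@njit
def total(arr):
    # With JIT disabled, this body executes as ordinary Python,
    # so line-by-line coverage can be collected.
    s = 0
    for x in arr:
        s += x
    return s


print(total(np.arange(5)))  # 10
```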
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -1,9 +1,13 @@
# Change Log

## Version 0.1.6-dev
## Version 0.1.7-dev

### Added
- Formally adopted the NumPy + Numba solution in the ordinal mapper. This significantly accelerated the algorithm ([#209](https://github.com/qiyunzhu/woltka/pull/209)).

### Changed
- Changed the default output subject coverage (`--outcov`) coordinates to BED-like (0-based, exclusive end), so the output can be directly parsed by programs like bedtools. Also added support for GFF-like and other custom formats, as controlled by the parameter `--outcov-fmt`; the two coordinate conventions are compared in the sketch after this list ([#204](https://github.com/qiyunzhu/woltka/pull/204) and [#205](https://github.com/qiyunzhu/woltka/pull/205)).
- Default chunk size is now 1,024 for the plain and range mappers, and 2 ** 20 = 1,048,576 for the ordinal mapper. The latter represents the number of valid query-subject pairs ([#209](https://github.com/qiyunzhu/woltka/pull/209)).
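For reference, the two coordinate conventions mentioned above describe the same region as follows; a small sketch with illustrative values only:

```python
# The same subject region, bases 5 through 10 (1-based, inclusive),
# written in the two conventions:
bed_like = (4, 10)   # BED-like: 0-based start, exclusive end (new default)
gff_like = (5, 10)   # GFF-like: 1-based start, inclusive end

# Converting BED-like coordinates to GFF-like ones:
start0, end0 = bed_like
assert (start0 + 1, end0) == gff_like
```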


## Version 0.1.6 (2/22/2024)
3 changes: 2 additions & 1 deletion ci/conda_requirements.txt
@@ -1,2 +1,3 @@
cython
numpy
numba
biom-format
4 changes: 2 additions & 2 deletions doc/cli.md
@@ -109,8 +109,8 @@ Option | Description

Option | Description
--- | ---
`--chunk` | Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,000 for plain or range mapping, or 1,000,000 for ordinal mapping.
`--cache` | Number of recent classification results to cache for faster subsequent classifications. Default: 1024.
`--chunk` | Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,024 for plain or range mapping, or 2 ** 20 = 1,048,576 for ordinal mapping. The latter cannot exceed 2 ** 22.
`--cache` | Number of recent classification results to cache for faster subsequent classifications. Default: 1,024.
`--no-exe` | Disable calling external programs (`gzip`, `bzip2` and `xz`) for decompression. Otherwise, Woltka will use them if available for faster processing, or switch back to Python if not.
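To relate `--chunk` to the code, the chunked readers in `woltka/align.py` (such as `plain_mapper`, which appears later in this diff) take the chunk size as their `n` argument. A rough usage sketch under assumptions — the file name and `fmt` value are made up, and the exact structure of each yielded chunk is defined in `align.py` rather than spelled out here:

```python
from woltka.align import plain_mapper

# Read an alignment in chunks of 1,024 unique queries (the new default
# for the plain mapper), handling one chunk at a time so that memory
# use stays bounded regardless of the alignment file's size.
with open('alignment.sam') as fh:          # hypothetical input file
    for chunk in plain_mapper(fh, fmt='sam', n=1024):
        pass  # classify the queries in this chunk
```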


2 changes: 1 addition & 1 deletion doc/perform.md
@@ -62,7 +62,7 @@ Simple read-gene matching, with Numba [acceleration](install.md#acceleration) |

Two parameters visibly impact Woltka's speed:

- `--chunk` | Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,000 for plain or range mapping, or 1,000,000 for ordinal mapping.
- `--chunk` | Number of unique queries to read and parse in each chunk of an alignment file. Default: 1,024 for plain or range mapping, or 2 ** 20 = 1,048,576 for ordinal mapping. The latter cannot exceed 2 ** 22.
- `--cache` | Number of recent classification results to cache for faster subsequent classifications. Default: 1024.

Their default values were set based on our experience. However, alternative values could improve (or reduce) performance depending on the computer hardware, input file type, and database capacity. If you plan to routinely process large volumes of data with the same settings, we recommend doing a few test runs on a small dataset to find the values that work best for you.
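The ordinal mapper's speedup in [#209](https://github.com/qiyunzhu/woltka/pull/209) comes from keeping coordinates in NumPy arrays and compiling the hot loop with Numba. The sketch below only illustrates that general pattern and is not Woltka's actual ordinal algorithm; all names and values are invented:

```python
import numpy as np
from numba import njit


@njit(cache=True)
def count_overlaps(starts, ends, points):
    # For each [start, end) interval, count how many sorted query
    # coordinates fall inside it. np.searchsorted is supported in
    # Numba-compiled code, so the loop runs at native speed.
    out = np.zeros(starts.size, dtype=np.int64)
    for i in range(starts.size):
        lo = np.searchsorted(points, starts[i])
        hi = np.searchsorted(points, ends[i])
        out[i] = hi - lo
    return out


gene_starts = np.array([100, 400, 900])
gene_ends = np.array([300, 800, 1200])
read_coords = np.array([150, 250, 450, 950])  # must be sorted
print(count_overlaps(gene_starts, gene_ends, read_coords))  # [2 1 1]
```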
56 changes: 1 addition & 55 deletions woltka/align.py
@@ -8,7 +8,6 @@
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------


"""Functions for parsing alignment / mapping files.
Notes
@@ -45,7 +44,7 @@
from functools import lru_cache


def plain_mapper(fh, fmt=None, excl=None, n=1000):
def plain_mapper(fh, fmt=None, excl=None, n=1024):
"""Read an alignment file in chunks and yield query-to-subject(s) maps.
Parameters
Expand Down Expand Up @@ -619,59 +618,6 @@ def cigar_to_lens_ord(cigar):
return align, align + offset


def parse_sam_file_pd(fh, n=65536):
"""Parse a SAM file (sam) using Pandas.
Parameters
----------
fh : file handle
SAM file to parse.
n : int, optional
Chunk size.
Yields
------
None
Notes
-----
This is a SAM file parser using Pandas. It is slower than the current
parser. The `read_csv` is fast, but the data frame manipulation slows
down the process. It is here for reference only.
"""
return
# with pd.read_csv(fp, sep='\t',
# header=None,
# comment='@',
# na_values='*',
# usecols=[0, 1, 2, 3, 5],
# names=['qname', 'flag', 'rname', 'pos', 'cigar'],
# dtype={'qname': str,
# 'flag': np.uint16,
# 'rname': str,
# 'pos': int,
# 'cigar': str},
# chunksize=n) as reader:
# for chunk in reader:
# chunk.dropna(subset=['rname'], inplace=True)
# # this is slow, because of the function call
# chunk['length'], offset = zip(*chunk['cigar'].apply(
# cigar_to_lens))
# chunk['right'] = chunk['pos'] + offset
# # this is slow, because of the function call
# # chunk['qname'] = chunk[['qname', 'flag']].apply(
# # qname_by_flag, axis=1)
# # this is a faster method
# chunk['qname'] += np.where(
# chunk['qname'].str[-2:].isin(('/1', '/2')), '',
# np.where(np.bitwise_and(chunk['flag'], (1 << 6)), '/1',
# np.where(np.bitwise_and(chunk['flag'], (1 << 7)),
# '/2', '')))
# chunk['score'] = 0
# yield from chunk[['qname', 'rname', 'score', 'length',
# 'pos', 'right']].values


def parse_map_file(fh, *args):
"""Parse a simple mapping file.
4 changes: 2 additions & 2 deletions woltka/biom.py
@@ -164,8 +164,8 @@ def round_biom(table: biom.Table, digits=0):
digits : int, optional
Digits after the decimal point.
Notes
-----
Examples
--------
There is a fully vectorized, much faster alternative:
>>> arr = table.matrix_data.data
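The fully vectorized alternative that the docstring above alludes to could look roughly like the following sketch. It is an assumption about how `table.matrix_data` might be used, not Woltka's implementation, and `np.around` rounds halves to even, which may differ from `round_biom`'s exact behavior:

```python
import numpy as np
import biom


def round_inplace(table: biom.Table, digits: int = 0) -> biom.Table:
    # Operate directly on the nonzero values of the table's sparse matrix.
    arr = table.matrix_data.data
    np.around(arr, digits, out=arr)       # vectorized in-place rounding
    table.matrix_data.eliminate_zeros()   # drop entries that became zero
    return table
```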