Merge pull request #16 from complextissue/dev
Performance improvements, .csv transcript gene maps and better piscem/oarfish support
maltekuehl authored Sep 30, 2024
2 parents d386a3a + 92cb5ce commit a483354
Showing 55 changed files with 280,299 additions and 1,070 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -15,7 +15,7 @@ jobs:
strategy:
fail-fast: true
matrix:
-python-version: [3.11.4]
+python-version: [3.9.16]

steps:
- uses: actions/checkout@v3
5 changes: 5 additions & 0 deletions .gitignore
@@ -34,6 +34,7 @@ ipython_config.py
# Environments
.env
.venv
.python-version
env/
venv/
ENV/
@@ -43,6 +44,9 @@ ENV/
.dmypy.json
dmypy.json

# Ruff
.ruff_cache/

# Pyre type checker
.pyre/

@@ -66,6 +70,7 @@ requirements.dev.txt
manuscript.pdf
/data/rpgn_example/
/rpgn/
/benchmark/

# All .DS_Store files
**/.DS_Store
1 change: 0 additions & 1 deletion .python-version

This file was deleted.

2 changes: 1 addition & 1 deletion CITATION.cff
@@ -6,6 +6,6 @@ authors:
- family-names: "Puelles"
given-names: "Victor"
title: "pytximport: Gene count estimation from transcript quantification files in Python"
-version: 0.8.0
+version: 0.9.0
date-released: 2024-07-11
url: "https://github.com/complextissue/pytximport"
2 changes: 1 addition & 1 deletion INSTALL.md
@@ -1,7 +1,7 @@
# Dependencies

To fulfill all dependencies for this project, **all** of the following steps are required.
-`pytximport` only targets support for `python` versions greater than or equal `3.8`.
+`pytximport` only targets support for `python` versions greater than or equal `3.9`.

## Installation for `pytximport`

105 changes: 81 additions & 24 deletions README.md
@@ -1,5 +1,7 @@
# pytximport

<hr />

[![Version](https://img.shields.io/pypi/v/pytximport)](https://pypi.org/project/pytximport/)
[![License](https://img.shields.io/pypi/l/pytximport)](https://github.com/complextissue/pytximport)
![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/complextissue/pytximport/ci.yml)
@@ -11,24 +13,48 @@
[![Code Style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)

`pytximport` is a Python package for efficient gene count estimation based on transcript quantification files produced by pseudoalignment/quasi-mapping tools such as `kallisto` or `salmon`. `pytximport` is a port of the popular [tximport Bioconductor R package](https://bioconductor.org/packages/release/bioc/html/tximport.html).
`pytximport` is a Python package for efficient (gene-)count estimation from transcript quantification files produced by pseudoalignment/quasi-mapping tools such as `salmon`, `kallisto`, `rsem` and others. `pytximport` is a port of the popular [tximport Bioconductor R package](https://bioconductor.org/packages/release/bioc/html/tximport.html).

## Installation

The recommended way to install `pytximport` is through Bioconda:

```bash
mamba install -c bioconda pytximport
```

`pytximport` can also be installed via pip:

```bash
pip install pytximport
```

While not required, we recommend users also install `pyarrow` for faster import of tab-separated value-based quantification files:

```bash
mamba install -c conda-forge pyarrow-core
```

or:

```bash
pip install pyarrow
```

## Quick Start

You can either import the `tximport` function in your Python files:

```python
from pytximport import tximport
from pytximport.utils import create_transcript_gene_map

transcript_gene_map = create_transcript_gene_map(species="human")

results = tximport(
file_paths,
"salmon",
transcript_gene_mapping,
data_type="salmon",
transcript_gene_map=transcript_gene_map,
)
```

@@ -40,17 +66,16 @@ pytximport -i ./sample_1.sf -i ./sample_2.sf -t salmon -m ./tx2gene_map.tsv -o .

Common options are:

- `-i`: The input files.
- `-t`: The input type, e.g., `salmon`, `kallisto` or `tsv`.
- `-m`: The map to match transcript ids to their gene ids. Expected column names are `transcript_id` and `gene_id`.
- `-o`: The output path.
- `-c`: The count transform to apply. Leave out for none, other options include `scaled_tpm`, `length_scaled_tpm` and `dtu_scaled_tpm`.
- `-gl`: Whether the input is already gene-level counts. Provide this flag when importing gene counts from RSEM.
- `-tx`: Whether to return transcript-level counts without gene summarization.
- `-id`: The column name containing the transcript ids, in case it differs from the typical naming standards for the configured input file type.
- `-counts`: The column name containing the transcript counts, in case it differs from the typical naming standards for the configured input file type.
- `-length`: The column name containing the transcript lengths, in case it differs from the typical naming standards for the configured input file type.
- `-tpm`: The column name containing the transcript abundance, in case it differs from the typical naming standards for the configured input file type.
- `-i`: The path to a quantification file. To provide multiple input files, use `-i input1.sf -i input2.sf ...`.
- `-t`: The type of quantification file, e.g. `salmon`, `kallisto` and others.
- `-m`: The path to the transcript to gene map. Either a tab-separated (.tsv) or comma-separated (.csv) file. Expected column names are `transcript_id` and `gene_id`.
- `-o`: The output path to save the resulting counts to.
- `-of`: The format of the output file. Either `csv` or `h5ad`.
- `-ow`: Provide this flag to overwrite an existing file at the output path.
- `-c`: The method to calculate the counts from the abundance. Leave empty to use counts. For differential gene expression analysis, we recommend using `length_scaled_tpm`. For differential transcript expression analysis, we recommend using `scaled_tpm`. For differential isoform usage analysis, we recommend using `dtu_scaled_tpm`.
- `-ir`: Provide this flag to make use of inferential replicates. Will use the median of the inferential replicates.
- `-gl`: Provide this flag when importing gene-level counts from RSEM files.
- `-tx`: Provide this flag to return transcript-level instead of gene-summarized data. Incompatible with gene-level input and `counts_from_abundance=length_scaled_tpm`.
- `--help`: Display all configuration options.
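
The options above can be combined into a single call. The following is a minimal sketch, not part of `pytximport` itself: it writes a small comma-separated transcript-to-gene map with the expected `transcript_id` and `gene_id` columns and assembles the matching command as an argument list. The helper names, the transcript and gene identifiers, and the file names are hypothetical; only flags listed above are used.

```python
import csv
from pathlib import Path


def write_transcript_gene_map(path, pairs):
    """Write (transcript_id, gene_id) pairs to a .csv map with the expected header."""
    with Path(path).open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["transcript_id", "gene_id"])
        writer.writerows(pairs)


def build_command(inputs, data_type, map_path, output_path, counts_from_abundance=None):
    """Assemble a pytximport call from the flags documented above."""
    command = ["pytximport"]
    for input_path in inputs:
        command += ["-i", str(input_path)]  # one -i per quantification file
    command += ["-t", data_type, "-m", str(map_path), "-o", str(output_path)]
    if counts_from_abundance is not None:
        command += ["-c", counts_from_abundance]
    return command


# Hypothetical identifiers and file names, for illustration only.
write_transcript_gene_map("tx2gene_map.csv", [("tx_1", "gene_a"), ("tx_2", "gene_a")])
command = build_command(
    ["sample_1.sf", "sample_2.sf"],
    "salmon",
    "tx2gene_map.csv",
    "counts.csv",
    counts_from_abundance="length_scaled_tpm",
)
print(" ".join(command))
```

Once `pytximport` is installed, the resulting list could be handed to `subprocess.run(command, check=True)`; building it as a list avoids shell quoting issues with unusual file names.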

## Documentation
@@ -59,7 +84,7 @@ Detailed documentation is made available at: [https://pytximport.readthedocs.io

## Development status

`pytximport` is still in development and has not yet reached version 1.0.0 in the [SemVer](https://semver.org/) versioning scheme. While it should work for most use cases and we regularly compare outputs against the R implementation, expect breaking changes. If you encounter any problems, please open a GitHub issue. If you are a Python developer, we welcome pull requests implementing missing features, adding more extensive unit tests and bug fixes.
`pytximport` is still in development and has not yet reached version 1.0.0 in the [SemVer](https://semver.org/) versioning scheme. While it should work for almost all use cases and we regularly compare outputs against the R implementation, breaking changes between minor versions may occur. If you encounter any problems, please open a GitHub issue. If you are a Python developer, we welcome pull requests implementing missing features, adding more extensive unit tests and bug fixes.

## Motivation

@@ -70,8 +95,8 @@ The `tximport` package has become a mainstay in the bulk RNA sequencing communi

Please cite both the original publication as well as this Python implementation:

- Kuehl, M., & Puelles, V. (2024). pytximport: Gene count estimation from transcript quantification files in Python (Version 0.9.0) [Computer software]. https://github.com/complextissue/pytximport
- Charlotte Soneson, Michael I. Love, Mark D. Robinson. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences, F1000Research, 4:1521, December 2015. doi: 10.12688/f1000research.7563.1
- Kuehl, M., & Puelles, V. (2024). pytximport: Gene count estimation from transcript quantification files in Python (Version 0.8.0) [Computer software]. https://github.com/complextissue/pytximport

## License

@@ -81,15 +106,47 @@ The software is provided under the GNU General Public License version 3. Please

Generally, outputs from `pytximport` correspond to the outputs from `tximport` within the accuracy allowed by multiple floating point operations and small implementation differences in its dependencies when using the same configuration. If you observe larger discrepancies, please open an issue.
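
To check that two outputs agree within floating point tolerance, one might compare the count matrices element by element. This is a minimal sketch, assuming both results have been exported as nested lists of floats in the same gene and sample order; the function name, the tolerances and the numbers are illustrative and are not taken from either package.

```python
import math


def matrices_close(left, right, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if two equally shaped count matrices agree within tolerance."""
    if len(left) != len(right):
        return False
    for left_row, right_row in zip(left, right):
        if len(left_row) != len(right_row):
            return False
        for a, b in zip(left_row, right_row):
            if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True


# Made-up values standing in for gene counts from pytximport and tximport.
python_counts = [[10.0, 250.5], [0.0, 33.333333333]]
r_counts = [[10.0000000001, 250.5], [0.0, 33.333333334]]
print(matrices_close(python_counts, r_counts))
```

The absolute tolerance matters for zero counts, where a purely relative comparison would reject any nonzero difference.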

While the outputs are roughly identical for the same configuration, there remain some differences between the packages:
While the outputs are identical within floating point tolerance for the same configuration, there remain some differences between the packages:

- `pytximport` can be used from the command line.
- `pytximport` supports `AnnData` format outputs (set `output_type` to `anndata`), enabling seamless integration with the `scverse`.
- Argument order and argument defaults may differ between the implementations.
- Additional features:
- When `ignore_transcript_version` is set, the transcript version will not only be stripped from the quantification file but also from the provided transcript to gene mapping.
- When `biotype_filter` is set, all transcripts that do not contain any of the provided biotypes will be removed prior to all other steps.
- When `save_path` is configured, a count matrix will be saved as a .csv file.
Features unique to `pytximport`:

- Generating transcript-to-gene maps, either from a BioMart server or an `annotation.gtf` file. Use `create_transcript_gene_map` or `create_transcript_gene_map_from_annotation` from `pytximport.utils`.
- Command line interface. Type `pytximport --help` into your terminal to explore all options.
- `AnnData`-support, enabling seamless integration with the `scverse`.
- Saving outputs directly to file (use the `output_path` argument).
- Removing transcript versions from **both** the quantification files and the transcript-to-gene map when `ignore_transcript_version` is provided.
- Post-hoc biotype-filtering. Set `biotype_filter` to a whitelist of possible biotypes contained within the bar-separated values of your transcript ids.
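
The last two items can be illustrated with a short sketch. It is not `pytximport`'s implementation, which may differ in detail; it only shows the idea of keeping transcripts whose bar-separated id contains an allowed biotype and of dropping the trailing version suffix. The helper names and the ids are hypothetical.

```python
def strip_transcript_version(transcript_id):
    """Drop a trailing '.<version>' suffix, e.g. 'TX0001.2' -> 'TX0001'."""
    base, dot, version = transcript_id.rpartition(".")
    return base if dot and version.isdigit() else transcript_id


def has_allowed_biotype(transcript_id, biotype_filter):
    """Check whether any bar-separated field of the id is an allowed biotype."""
    return any(field in biotype_filter for field in transcript_id.split("|"))


# Hypothetical ids for illustration only.
transcript_ids = [
    "TX0001.2|GENE0001.5|protein_coding|",
    "TX0002.1|GENE0002.3|lncRNA|",
]
kept = [tid for tid in transcript_ids if has_allowed_biotype(tid, ["protein_coding"])]
versionless = [strip_transcript_version(tid.split("|")[0]) for tid in kept]
print(versionless)
```

Applying the same version stripping to the ids in the transcript-to-gene map keeps both sides of the lookup consistent, which is the point of the `ignore_transcript_version` behavior described above.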

Features unique to `tximport`

Argument order and argument defaults may differ between the implementations.

## Contributing

Contributions are welcome. Contributors are asked to follow the Contributor Covenant Code of Conduct.

To set up `pytximport` for development on your machine, we recommend cloning the `dev` branch:

```bash
git clone --depth 1 -b dev https://github.com/complextissue/pytximport.git
cd pytximport
pyenv local 3.9
make create-venv
source .venv/bin/activate
make install-dev
```

Since `pytximport` is linted and formatted, the repository contains a list of recommended VS Code extensions in `.vscode/extensions.json`. If you are using a different editor, please make sure to set up your environment to use the same linters and formatters.

For new features and non-obvious bug fixes, we kindly ask that you create a GitHub issue before submitting a PR.

## Running the tests locally

Please follow the steps described in the "Contributing" section. Once you have set up your development environment, you can run the unit tests locally:

```bash
make coverage-report
```

## Building the documentation locally

19 changes: 8 additions & 11 deletions docs/source/conf.py
@@ -18,7 +18,7 @@
author = "Malte Kuehl"

# The full version, including alpha/beta/rc tags
-release = "0.8.0"
+release = "0.9.0"


# -- General configuration ---------------------------------------------------
@@ -91,24 +91,21 @@
napoleon_use_param = True

intersphinx_mapping = dict( # noqa: C408
cycler=("https://matplotlib.org/cycler/", None),
matplotlib=("https://matplotlib.org/", None),
numpy=("https://docs.scipy.org/doc/numpy/", None),
numpy=("https://numpy.org/doc/stable/", None),
pandas=("https://pandas.pydata.org/pandas-docs/stable/", None),
pytest=("https://docs.pytest.org/en/latest/", None),
xarray=("https://docs.xarray.dev/en/stable/", None),
python=("https://docs.python.org/3", None),
scipy=("https://docs.scipy.org/doc/scipy/reference/", None),
seaborn=("https://seaborn.pydata.org/", None),
sklearn=("https://scikit-learn.org/stable/", None),
)

# -- Options for HTML output -------------------------------------------------

html_theme = "furo"

html_theme_options = {
"announcement": "<a href='https://pypi.org/project/pytximport/'><strong>pytximport</strong></a> has been released!",
}
# html_theme_options = {
# "announcement": (
# "<a href='https://pypi.org/project/pytximport/'><strong>pytximport</strong></a> has been released!",
# )
# }

html_title = "pytximport"
