Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bam tag histogram mrfifo #103

Open
wants to merge 63 commits into
base: fast-cmdline
Choose a base branch
from
Open

Conversation

marvin-jens
Copy link
Member

Replacing dropseq-tools BamTagHistogram

This tool has become a pain-point as we now routinely have hundreds of millions of legitimate, spatial barcodes in open-st data.
This PR features a complete, drop-in replacement re-written in python using mrfifo. It is about 10x faster and uses less RAM. I've included unit test code and also run it on some real-world data, observing identical output compared to dropseq-tools.

Unless you run into issues, I'd like to merge this into fast-cmdline (and possibly master) asap.

Best,
-Marvin

@danilexn
Copy link
Member

Thanks so much for the amazing code, Marvin! I started testing on the tiny spatial data and Open-ST mouse hippocampus, and ran into some minor issues:

  • Several dependencies missing from the environment.yaml, I will push some commits fixing this
  • A file BamTagHistogram.log is created at the root of the spacemake project (because of the make_minimal_parser). Should this be created at the specific project/sample folders?
  • I've been running into issues with the rerun_triggers flag -- not all snakemake versions seem to support it (e.g., the newest we support in our environment.yaml).

As soon as the tests finish on the Open-ST data, I will report back and we can merge this into fast-cmdline and master. Also, I will run on very large Open-ST data (>10B reads) to explore the limits of the pipeline in terms of mem usage.

@danilexn
Copy link
Member

danilexn commented Apr 22, 2024

It worked with the tiny data, but then the out_readcounts_prealigned.txt.gz file and others are empty with larger (real) datasets. Will investigate...

Edit: the default --min-count argument in BamTagHistogram was just too low for the data I was using (a subset of a large sample, so no 0.6 micron spots had > 10 counts)

@danilexn
Copy link
Member

Tested, works as advertised :)

Maybe @nukappa can run another sample, but LGTM as soon as we address the minor points above (making sure installation works fine, and the .log file). Also, we should consider putting --min-count 1 in both calls to BamTagHistogram at main.smk, otherwise it might not work well with Open-ST data

@danilexn danilexn mentioned this pull request Apr 22, 2024
@danilexn
Copy link
Member

danilexn commented Apr 24, 2024

Thanks so much for the amazing code, Marvin! I started testing on the tiny spatial data and Open-ST mouse hippocampus, and ran into some minor issues:

* Several dependencies missing from the environment.yaml, I will push some commits fixing this

@marvin-jens will address in mrfifo

* A file `BamTagHistogram.log` is created at the root of the `spacemake` project (because of the `make_minimal_parser`). Should this be created at the specific project/sample folders?

This was fixed

* I've been running into issues with the `rerun_triggers` flag -- not all snakemake versions seem to support it (e.g., the newest we support in our environment.yaml).

This is not necessary yet

As soon as the tests finish on the Open-ST data, I will report back and we can merge this into fast-cmdline and master. Also, I will run on very large Open-ST data (>10B reads) to explore the limits of the pipeline in terms of mem usage.

Ready to test by @nukappa and merge into fast-cmdline

@danilexn danilexn added the enhancement New feature or request label Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request RAM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants