Cell-type-level PolyASite Atlas - Analysis and Pipelines

This repository contains the computational workflows and downstream analysis notebooks for the update of the PolyASite Atlas to its next version (v4.0), which is dedicated to cell-type-level quantification of poly(A) sites (PASs).

The repository is set up so that both the workflows and the analysis can be driven from a Jupyter notebook on an HPC cluster.

On the sciCORE HPC, the OnDemand service makes it easy to run a Jupyter notebook on a compute node.

We use a hybrid approach: Snakemake for robust, scalable data processing on HPC clusters (sciCORE), and Jupyter notebooks for interactive downstream analysis and visualization.

Current state

Currently, we plan to use the newly developed scQPAS tool to assign every UMI in every cell of a 10X 3' scRNA-seq sample to its most likely PAS, taking advantage of the specific cDNA fragment size distribution generated during library preparation. We build a UCSC track hub to visualize these data interactively in the UCSC Genome Browser.

In parallel, we plan to use Sanity to obtain normalized UMI counts per gene, and then use Bonsai to build a tree of cells for every tissue in the collected Human Cell Atlas dataset. Next, we will use marker gene sets from CellMarker 2.0 to annotate the tree branches and group cells into their respective cell types.

Lastly, scQPAS output will be aggregated within the resulting cell types to produce a characteristic PAS usage quantification for every cell type.
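The aggregation step described above can be sketched roughly as follows. This is only an illustration: the scQPAS output format is not specified here, so the `(cell_barcode, gene, pas_id, umi_count)` tuples, the `cell_types` mapping, and the function name `aggregate_pas_usage` are all hypothetical.

```python
from collections import defaultdict


def aggregate_pas_usage(umi_assignments, cell_types):
    """Sum per-cell PAS UMI counts within each cell type, then normalize per gene.

    umi_assignments: iterable of (cell_barcode, gene, pas_id, umi_count) tuples
                     (hypothetical stand-in for per-cell scQPAS output)
    cell_types:      dict mapping cell_barcode -> cell type label
    Returns {cell_type: {gene: {pas_id: fraction of the gene's UMIs}}}.
    """
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for barcode, gene, pas_id, n_umis in umi_assignments:
        cell_type = cell_types.get(barcode)
        if cell_type is None:
            continue  # cell was not assigned to any cell type group
        counts[cell_type][gene][pas_id] += n_umis

    usage = {}
    for cell_type, genes in counts.items():
        usage[cell_type] = {}
        for gene, pas_counts in genes.items():
            total = sum(pas_counts.values())
            if total == 0:
                continue  # no UMIs for this gene in this cell type
            usage[cell_type][gene] = {
                pas_id: n / total for pas_id, n in pas_counts.items()
            }
    return usage


# Toy data: three annotated cells and one cell without a cell type label.
assignments = [
    ("AAAC", "GAPDH", "pas_1", 3),
    ("AAAG", "GAPDH", "pas_1", 1),
    ("AAAG", "GAPDH", "pas_2", 4),
    ("TTTG", "GAPDH", "pas_2", 2),
    ("CCCC", "GAPDH", "pas_1", 5),
]
cell_types = {"AAAC": "hepatocyte", "AAAG": "hepatocyte", "TTTG": "Kupffer cell"}
usage = aggregate_pas_usage(assignments, cell_types)
```

Relative usage per gene is just one possible summary; raw summed counts per cell type could be kept alongside it if downstream analysis needs them.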

Example - a healthy human liver 10X scRNA-seq sample from Human Cell Atlas project

We used the IGV genome browser to visualize the .bam file with alignments together with annotation tracks.

GAPDH gene: alignments were grouped by cell barcode tag (CB) and colored by UMI tag (UB).

Zoom-in into the PolyA site:

Repository Structure

.
├── CellTypePASatlas-current.ipynb       # Main master Jupyter notebook for downstream analysis
├── CellTypePASatlas.template.env        # Template for required environment variables/paths
├── install/
│   └── environment.yaml                # Conda environment specification for the Jupyter notebook
└── WF/                                 # Snakemake Workflow Engine
    ├── Snakefile-prepare               # Pipeline Step 1: scRNA-seq data processing (alignment, etc.)
    ├── Snakefile-quantification-faster # Pipeline Step 2: Quantification (gene counts, scQPAS quantification)
    ├── config.template.yaml            # Template configuration for Snakemake parameters
    ├── envs/                           # Conda environments isolated for specific Snakemake rules
    ├── profile/                        # SLURM execution profile for the HPC
    └── scripts/                        # Python and R scripts utilized by both Snakemake and Jupyter

Quick Start & Setup

To ensure strict reproducibility and security, this project uses .env files to manage all absolute paths (data directories, genome annotations, etc.). Do not hardcode paths into the Python or Snakemake files.

1. Clone the Repository

Clone this repository into your local user space ($HOME):

git clone https://github.com/zavolanlab/CellType_PolyASite_Atlas.git
cd CellType_PolyASite_Atlas

2. Configure Environment Paths

You must map the project to your local HPC paths. Copy the template under a new name and fill in your absolute paths, for example:

cp CellTypePASatlas.template.env CellTypePASatlas.scicore.env
# Open CellTypePASatlas.scicore.env and edit the "Base Directories" section to match your system

Recommended if you are a group member on sciCORE: move CellTypePASatlas.scicore.env to the project group folder and symlink it into your local repository directory:
```
ln -s <a file with specified sciCORE paths> CellTypePASatlas.scicore.env
```

This way, CellTypePASatlas.scicore.env is accessible to all group members but is not tracked by git. (Note: *.env files are ignored by git to protect private cluster paths; the only exception is CellTypePASatlas.template.env.)
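Before running anything, it can help to check that your copy defines every variable listed in the template. The following is a minimal sketch, assuming the .env files use plain `KEY=VALUE` lines; the helper names `read_env_keys` and `missing_keys` are hypothetical and not part of this repository.

```python
from pathlib import Path


def read_env_keys(path):
    """Return the set of variable names defined in a simple KEY=VALUE .env file."""
    keys = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        keys.add(key)
    return keys


def missing_keys(template_path, env_path):
    """List variables present in the template but absent from the user's .env file."""
    return sorted(read_env_keys(template_path) - read_env_keys(env_path))
```

For example, `missing_keys("CellTypePASatlas.template.env", "CellTypePASatlas.scicore.env")` would return an empty list when nothing is missing.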

3. Install the Conda Environment

Create and activate the master environment required to run the Jupyter notebook and standard data science libraries (Pandas, UMAP, SciPy, BioPython, etc.):

conda env create -f install/environment.yaml
conda activate cell_type_pas_atlas

Executing the Workflows

The heavy lifting is divided into two separate Snakemake workflows located in the WF/ directory.

The workflows are configured inside the Jupyter notebook, i.e. the input .tsv with the sample specification and the .yaml config are created there.

The Bash commands are also prepared inside the Jupyter notebook. Copy them to the command line and execute them there.

On an HPC cluster like sciCORE, run the workflows from a login node; Snakemake then automatically submits jobs to the compute nodes.
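As an illustration of how such a command could be assembled in a notebook cell, here is a hedged sketch. The actual commands are generated by the notebook and may differ; the config path `WF/config.yaml` is an assumption (a filled-in copy of `config.template.yaml`), and `snakemake_command` is a hypothetical helper. Only the standard Snakemake options `--snakefile`, `--configfile`, `--profile`, and `--dry-run` are used.

```python
import shlex


def snakemake_command(snakefile, configfile, profile, dry_run=False):
    """Build a shell-safe Snakemake command line as a single string."""
    cmd = [
        "snakemake",
        "--snakefile", snakefile,
        "--configfile", configfile,
        "--profile", profile,
    ]
    if dry_run:
        cmd.append("--dry-run")  # only list the jobs, do not submit them
    return shlex.join(cmd)


# Hypothetical config path; the Snakefile and profile paths follow the repository tree.
command = snakemake_command(
    "WF/Snakefile-prepare", "WF/config.yaml", "WF/profile", dry_run=True
)
print(command)
```

Running a dry run first is a cheap way to confirm that the sample table and config resolve to the expected jobs before anything is submitted to SLURM.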

Downstream Analysis

Once the Snakemake workflows are complete, all results are routed to the shared group directories defined in your .env file.

Use the respective sections of CellTypePASatlas-current.ipynb to analyze the outputs.

The notebook automatically loads your .env paths using python-dotenv, allowing it to dynamically locate all workflow results, figures, and metadata regardless of where you cloned this repository.
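Once the variables are loaded, paths can be built from the environment instead of being hardcoded. A minimal sketch of that pattern follows; the variable name `RESULTS_DIR` and the helper `require_path` are hypothetical, so check the template .env file for the names actually used.

```python
import os
from pathlib import Path


def require_path(var_name, *parts):
    """Build a path from an environment variable, failing loudly if it is unset."""
    base = os.environ.get(var_name)
    if not base:
        raise KeyError(f"{var_name} is not set; check your .env file")
    return Path(base).joinpath(*parts)


# In the notebook the variables would be populated first with python-dotenv, e.g.:
#   from dotenv import load_dotenv
#   load_dotenv("CellTypePASatlas.scicore.env")
# RESULTS_DIR is a hypothetical variable, set here only so the demo runs standalone.
os.environ.setdefault("RESULTS_DIR", "/path/to/results")
figures_dir = require_path("RESULTS_DIR", "figures")
```

Failing with an explicit message when a variable is missing tends to be easier to debug than a path that silently resolves relative to the current directory.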
