This task is based on publicly available sequencing data from a study of Alzheimer’s Disease and Down Syndrome (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA975472). The dataset includes multiple samples under different conditions (AD vs Control) and was originally sequenced as Illumina paired-end 2×150 bp. Running the pipeline stores the subsampled FASTQs in sc_ad_boilerplate/data/, and these serve as the inputs for the workflow.
• BioProject: PRJNA975472
• GEO SuperSeries: GSE233208
• Paper: Miyoshi et al., Nature Genetics (2024)
• Assay: single-nucleus RNA-seq (10x)
| SRR | Group |
|---|---|
| SRR24710554 | Control |
| SRR24710556 | AD |
| SRR24710558 | Control |
| SRR24710560 | AD |
sc-alzheimer-analysis/
├── README.md
├── LICENSE
└── sc_ad_boilerplate/
    ├── metadata.yaml
    ├── samples.tsv
    └── workflow/
        ├── 01_fetch_and_fastq.sh
        ├── 02_build_ref.sh
        ├── 03_kb_count.sh
        ├── 04_scanpy_analysis.py
        ├── 05_question_answers.py
        └── environment.yml
The fetch step (01_fetch_and_fastq.sh) supports two download modes:
- Prefetch: download the .sra locally, then convert to FASTQ
- Stream: download and process directly without storing the .sra locally

Create and activate the environment:

conda env create -f sc_ad_boilerplate/workflow/environment.yml
conda activate ad_snrna
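Optionally, sanity-check that the key tools resolved. The exact package set is defined in environment.yml; the commands below assume it ships sra-tools, kb-python, and Scanpy, so adjust the names if they differ:

# quick environment check (tool set is an assumption; see environment.yml)
prefetch --version
fasterq-dump --version
kb ref --help | head -n 3
python -c "import scanpy as sc; print(sc.__version__)"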
# Download and convert to paired, subsampled FASTQ.gz
# defaults: RATE=0.10, THREADS=4
bash sc_ad_boilerplate/workflow/01_fetch_and_fastq.sh
# stream directly from SRA without saving .sra locally
STREAM=1 bash sc_ad_boilerplate/workflow/01_fetch_and_fastq.sh
# change subsample rate (e.g., 20%)
RATE=0.20 bash sc_ad_boilerplate/workflow/01_fetch_and_fastq.sh
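In prefetch mode the script roughly downloads the .sra first and then converts it, while STREAM=1 lets the converter pull the accession directly so no local .sra is kept. A simplified sketch with sra-tools (the script's exact flags and paths may differ):

# prefetch mode: store the .sra locally, then convert to paired FASTQs
prefetch SRR24710554
fasterq-dump --split-files --threads 4 SRR24710554

# stream-style mode: convert straight from the accession, no prefetched .sra kept
fasterq-dump --split-files --threads 4 -O data/fastq SRR24710554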
The script:
- Converts .sra to paired FASTQ files
- Subsamples reads in a paired-safe manner (default: 10%; see the sketch after this list)
- Stores the results in sc_ad_boilerplate/data/fastq_sub/
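Paired-safe here means both mates are subsampled with the same random seed so R1/R2 stay synchronized. A minimal sketch of that idea, assuming seqtk is the subsampler (the script may use a different tool), shown for one sample:

# same seed (-s) for both mates keeps read pairs matched
RATE=0.10
SEED=42
seqtk sample -s "$SEED" SRR24710554_1.fastq.gz "$RATE" | gzip > SRR24710554_1.sub.fastq.gz
seqtk sample -s "$SEED" SRR24710554_2.fastq.gz "$RATE" | gzip > SRR24710554_2.sub.fastq.gz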
Purpose: Prepare the kallisto|bustools reference for mouse (Mus musculus, GRCm38, Ensembl 98)
Tools: kb (kallisto|bustools)
Inputs: Ensembl FASTA + GTF (downloaded automatically)
Outputs: workflow/ref/index.idx, t2g.txt, and FASTA files
bash sc_ad_boilerplate/workflow/02_build_ref.sh
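The heavy lifting is a standard kb ref invocation; a sketch of what that looks like (file names below are illustrative, 02_build_ref.sh handles the actual Ensembl 98 download and paths):

# build the kallisto index and transcript-to-gene mapping from genome FASTA + GTF
kb ref \
  -i workflow/ref/index.idx \
  -g workflow/ref/t2g.txt \
  -f1 workflow/ref/cdna.fa \
  Mus_musculus.GRCm38.dna.primary_assembly.fa.gz \
  Mus_musculus.GRCm38.98.gtf.gz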
Purpose: Quantify transcripts from the paired, subsampled FASTQs
Tools: kb (kallisto|bustools)
Inputs: subsampled FASTQs from data/fastq_sub/
Outputs: matrices under workflow/kb_out/<SRR>/counts_unfiltered/
bash sc_ad_boilerplate/workflow/03_kb_count.sh
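Per sample this corresponds to a kb count run against the index from the previous step; a sketch for one SRR (the -x chemistry flag is an assumption, check 03_kb_count.sh for the exact arguments):

# quantify one sample; -x 10xv3 is an assumption about the library chemistry
kb count \
  -i workflow/ref/index.idx \
  -g workflow/ref/t2g.txt \
  -x 10xv3 \
  -t 4 \
  -o workflow/kb_out/SRR24710554 \
  sc_ad_boilerplate/data/fastq_sub/SRR24710554_1.sub.fastq.gz \
  sc_ad_boilerplate/data/fastq_sub/SRR24710554_2.sub.fastq.gz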
Purpose: Merge samples, perform QC, UMAP, and clustering, and compare AD vs Control
Tools: Scanpy (v1.10.2)
Inputs: matrices from workflow/kb_out/*/counts_unfiltered/
Outputs: figures and tables in workflow/scanpy_out/
python sc_ad_boilerplate/workflow/04_scanpy_analysis.py
- sc_ad_boilerplate/workflow/scanpy_out/umap_overview.png – UMAP of all cells
- sc_ad_boilerplate/workflow/scanpy_out/cell_counts_by_sample_group.csv – cell counts per sample & group
- sc_ad_boilerplate/workflow/scanpy_out/markers_per_cluster_wilcoxon.csv – cluster markers
- sc_ad_boilerplate/workflow/scanpy_out/DE_AD_vs_Control_per_cluster.csv – DE results per cluster
- sc_ad_boilerplate/workflow/scanpy_out/alz_snrna_merged.h5ad – merged AnnData object
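A quick way to eyeball the tabular outputs from the command line (illustrative one-liners; any CSV viewer works):

# per-sample / per-group cell counts
column -s ',' -t < sc_ad_boilerplate/workflow/scanpy_out/cell_counts_by_sample_group.csv

# top of the per-cluster AD vs Control DE table
head -n 5 sc_ad_boilerplate/workflow/scanpy_out/DE_AD_vs_Control_per_cluster.csv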
Read counts check:
zcat sc_ad_boilerplate/data/fastq_sub/<SRR>_1.sub.fastq.gz | wc -l
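Since each FASTQ record spans four lines, divide the line count by four to get the number of reads, for example:

# number of reads in one subsampled mate (lines / 4)
echo $(( $(zcat sc_ad_boilerplate/data/fastq_sub/<SRR>_1.sub.fastq.gz | wc -l) / 4 ))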
Notes: This pipeline was run on Google Colab with a high-RAM CPU runtime (51 GB).