EAC-multiome

This repository accompanies the work "Cell states and neighborhoods in distinct clinical stages of primary and metastatic esophageal adenocarcinoma" (ref). It contains all the code necessary to reproduce the analyses. Each folder contains a README that describes the path placeholders used in its scripts.

How to reproduce the analyses

To reproduce the analyses, one first needs to download the data as described in the sections "Links to data used in the study" and "Links to other data to download" further down; instructions and descriptions of the files are provided in these sections.

Additionally, one needs an environment with all required packages correctly installed. The environment used to run the analyses is provided as the YAML file eac_env.yml. For the spatial analysis specifically, a separate environment for running Cell2Location can be installed from cell2loc_env.yml (the interplay between package versions can be tricky to get right).

Then, one needs to run the scripts in order, as some intermediate files generated by the scripts will be re-used in subsequent scripts.

The order in which to run the scripts is depicted in the following illustration:

[Illustration: order in which to run the scripts]

Instructions to replace placeholders are given in the README of each folder. Information on where to download all the data needed to reproduce the analysis is found in the following sections.

If you have any questions about the code or run into issues reproducing the analysis, please contact josephine.yates@inf.ethz.ch.

Patient ID to sample ID mapping

In the original paper, patients are referred to as P1 through P10 for simplicity. In the scripts and notebooks, patients are referred to by their sample ID. The mapping is provided below.

| Patient ID | Sample ID |
| --- | --- |
| P1 | CCG1153_4496262 |
| P2 | CCG1153_6640539 |
| P3 | CCG1153_4411 |
| P4 | Aguirre_EGSFR0074 |
| P5 | Aguirre_EGSFR0148 |
| P6 | Aguirre_EGSFR1732 |
| P7 | Aguirre_EGSFR0128 |
| P8 | Aguirre_EGSFR1938 |
| P9 | Aguirre_EGSFR1982 |
| P10 | Aguirre_EGSFR2218 |
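
When moving between the paper's figures and the notebooks, it can help to have the mapping above available programmatically. The following is a minimal sketch; the names `PATIENT_TO_SAMPLE`, `SAMPLE_TO_PATIENT`, and `to_patient_id` are illustrative and are not taken from the repository's code.

```python
# Patient ID -> sample ID, copied from the mapping above.
# These names are illustrative; the repository's scripts may store this differently.
PATIENT_TO_SAMPLE = {
    "P1": "CCG1153_4496262",
    "P2": "CCG1153_6640539",
    "P3": "CCG1153_4411",
    "P4": "Aguirre_EGSFR0074",
    "P5": "Aguirre_EGSFR0148",
    "P6": "Aguirre_EGSFR1732",
    "P7": "Aguirre_EGSFR0128",
    "P8": "Aguirre_EGSFR1938",
    "P9": "Aguirre_EGSFR1982",
    "P10": "Aguirre_EGSFR2218",
}

# Inverse lookup: sample ID (as used in scripts/notebooks) -> patient ID (as used in the paper).
SAMPLE_TO_PATIENT = {sample: patient for patient, sample in PATIENT_TO_SAMPLE.items()}


def to_patient_id(sample_id: str) -> str:
    """Return the paper's patient label (e.g. "P4") for a given sample ID."""
    try:
        return SAMPLE_TO_PATIENT[sample_id]
    except KeyError:
        raise KeyError(f"Unknown sample ID: {sample_id!r}") from None


if __name__ == "__main__":
    print(to_patient_id("Aguirre_EGSFR0074"))  # P4
```

The inverse dictionary is built once, so relabeling the sample IDs in a plot or table with the paper's P1 to P10 labels is a plain dictionary lookup.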

Links to data used in the study

| Dataset | Link to paper | Link to download | Remarks |
| --- | --- | --- | --- |
| Discovery dataset, sn 10X multiome | Yates et al., ??? | Download | |
| Discovery dataset, ST 10X Visium | Yates et al., ??? | Download | |
| Single-cell, Carroll et al. | Carroll et al., 2023 | Download | Need to request access to data through EGA / contact author (thomas.carroll@alumni.rice.edu) |
| Bulk, Carroll et al., RNA | Carroll et al., 2023 | Download | Need to request access to data through EGA / contact author (thomas.carroll@alumni.rice.edu) |
| Bulk, Carroll et al., Clinical | Carroll et al., 2023 | Download | Inoperable cohort info is located here |
| Single-cell, Croft et al. | Croft et al., 2022 | Download | Need to request single-cell annotations from author (w.d.croft@bham.ac.uk) |
| Bulk, Hoefnagel et al., RNA | Hoefnagel et al., 2022 | Download | |
| Bulk, Hoefnagel et al., Clinical | Hoefnagel et al., 2022 | NA | Need to request from the author (sanne_hoefnagel@live.nl) |
| Bulk, TCGA, RNA (FPKM) | The Cancer Genome Atlas Research Network, 2017 | Download | Used in the general TCGA analysis script, file named "TCGA-ESCA.htseq_fpkm-uq.tsv.gz" |
| Bulk, TCGA, RNA (Raw counts) | The Cancer Genome Atlas Research Network, 2017 | Download | Used as a basis to deconvolve for BayesPrism |
| Bulk, TCGA, Clinical #1 | The Cancer Genome Atlas Research Network, 2017 | Download | This is the general clinical and phenotypical info, file named "TCGA.ESCA.sampleMap_ESCA_clinicalMatrix" |
| Bulk, TCGA, Clinical #2 | The Cancer Genome Atlas Research Network, 2017 | Download | This is the clinical info provided in the original paper, need to save as "ESCA_Nature_clinicalinfo.csv" |
| Bulk, TCGA, Clinical #3 | The Cancer Genome Atlas Research Network, 2017 | Download | This is the survival information, file named "Survival_SupplementalTable_S1_20171025_xena_sp" |
| Bulk, TCGA, Clinical #4 | The Cancer Genome Atlas Research Network, 2017 | Download | This is the HRD information, file named "TCGA.HRD_withSampleID.txt" |
| Bulk, TCGA, ABSOLUTE purity | The Cancer Genome Atlas Research Network, 2017 | Download | This is the ABSOLUTE-estimated purity used to assess the BayesPrism deconvolution, file named "TCGA_absolute_purity.txt" in the script |
| Single-cell, Luo et al. | Luo et al., 2022 | Download | Need to download counts and metadata at the same time from this link |
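
Since several TCGA files must be saved under specific names, a quick check before running the scripts can catch a misnamed or missing download. The sketch below assumes, purely for illustration, that all TCGA files sit in one flat directory; the actual locations are set through the path placeholders described in each folder's README. The function name `missing_tcga_files` is hypothetical.

```python
from pathlib import Path

# File names quoted in the Remarks column above.
EXPECTED_TCGA_FILES = [
    "TCGA-ESCA.htseq_fpkm-uq.tsv.gz",
    "TCGA.ESCA.sampleMap_ESCA_clinicalMatrix",
    "ESCA_Nature_clinicalinfo.csv",
    "Survival_SupplementalTable_S1_20171025_xena_sp",
    "TCGA.HRD_withSampleID.txt",
    "TCGA_absolute_purity.txt",
]


def missing_tcga_files(data_dir):
    """Return the expected TCGA file names that are not present in data_dir.

    Assumption: all files live directly in data_dir (a flat layout chosen for
    this sketch, not necessarily the layout the scripts expect).
    """
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_TCGA_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_tcga_files(".")
    if missing:
        print("Missing TCGA files:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All expected TCGA files found.")
```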

Links to other data to download

| Data | Needed for what script? | Description | Link to download |
| --- | --- | --- | --- |
| Gene Mapping | R/scripts/BayesPrism/runBPrism.R | Gene probe map from the UCSC Xena browser that maps ENCODE to official gene ID | Download |
| GENCODE annotations | python/notebooks/preprocessing-snRNA/XXXX.ipynb (where XXXX is any sample name) | Subset of GENCODE annotations v41 | Download or link to original GTF file |
| Gene Programs from Gavish et al. | python/notebooks/analysis/5. cNMFCancerCells-perPatient.ipynb | Signature genes derived in the Gavish et al. paper | Download or link to original Excel file; the .csv corresponds to the first sheet only |
| MSigDB Hallmarks of cancer GMT | python/notebooks/analysis/5. cNMFCancerCells-perPatient.ipynb | This is the file to run GSEA on the hallmarks of cancer | Download |
| List of human transcription factors | python/notebooks/analysis/9. SCENICplus-analyze-cNMF.ipynb | This file contains all known human transcription factors as defined in the Lambert et al. paper | Download |
| Cell cycle genes | python/notebooks/validation/3. Carroll-validation-set.ipynb | This file contains cell cycle genes used by Scanpy | Download |
| Marker genes of Barrett's esophagus cell types | python/notebooks/validation/4. compare-Nowicki-BE.ipynb | Set of files, each containing marker genes of the Barrett's esophagus non-immune or stromal cell types | Download or original paper tables; marker genes are derived from Suppl Table 7 |
| Blacklisted regions of hg38 | python/scripts/scenicplus/1. run-pre-scenicplus-script.py | List of blacklisted regions to remove for analysis | Download |
| Screen v10 region-based databases, SCENIC+, #1 | python/scripts/scenicplus/1. run-pre-scenicplus-script.py | Ranking database of motifs | Download |
| Screen v10 region-based databases, SCENIC+, #2 | python/scripts/scenicplus/1. run-pre-scenicplus-script.py | Scores database of motifs | Download |
| Motif v10 annotation, SCENIC+ | python/scripts/scenicplus/1. run-pre-scenicplus-script.py | Motif annotation | Download |
| Screen v10 hg38 database, SCENIC, #1 | python/scripts/pyscenic/README.md | Ranking of motifs, big search space | Download |
| Screen v10 hg38 database, SCENIC, #2 | python/scripts/pyscenic/README.md | Ranking of motifs, small search space | Download |
| Annotation for local pycisTarget run | python/scripts/scenicplus/1. run-pre-scenicplus-script.py | This is required if the HPC used to run pycisTarget does not have access to the internet | Download |
| Annotation for local SCENIC+ search space run | python/scripts/scenicplus/2. run-scenicplus-script.py | This is required if the HPC used to run SCENIC+ does not have access to the internet | Download; this contains two files, 'annot_ensembl.csv' and 'chromsizes_ensembl.csv'. More info on why this is needed can be found in this issue |
| List of Lambert et al. TF names | python/scripts/scenicplus/2. run-scenicplus-script.py | List of all human TFs used for the search space | Download can be done using `!wget -O utoronto_human_tfs_v_1.01.txt http://humantfs.ccbr.utoronto.ca/download/v_1.01/TF_names_v_1.01.txt`, as recommended in the SCENIC+ tutorial. FYI, this is the same list as above, simply formatted for the SCENIC+ run |
| Omnipath database of intercellular communication | python/notebooks/spatial-transcriptomics/2. SpatialData_analysis.ipynb | List of ligand-receptor interactions aggregated with omnipath, used to run LIGREC, a Squidpy implementation of CellPhoneDB | The user should run `omnipath.interactions.import_intercell_network` with default parameters and save the resulting csv. NOTE: we use the presaved .csv because the HPC used doesn't have access to the internet; otherwise this is equivalent to running Squidpy's LIGREC with default parameters, since the same database is automatically downloaded |
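
The last two entries can be prepared on a machine with internet access and then copied to the HPC. The following is a minimal sketch of both steps in Python. The function names, the `fetch` parameter, and the output file name `omnipath_intercell_network.csv` are assumptions made for illustration; the notebook may expect a different file name or CSV layout, so check its path placeholders before using the result.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the wget command above.
TF_NAMES_URL = "http://humantfs.ccbr.utoronto.ca/download/v_1.01/TF_names_v_1.01.txt"


def download_tf_names(dest="utoronto_human_tfs_v_1.01.txt", url=TF_NAMES_URL):
    """Download the Lambert et al. TF name list (Python equivalent of the wget call)."""
    path, _headers = urlretrieve(url, dest)
    return Path(path)


def save_intercell_network(dest="omnipath_intercell_network.csv", fetch=None):
    """Fetch the omnipath intercellular network with default parameters and save it as CSV.

    `dest` is a hypothetical file name. `fetch` exists only so the function can be
    exercised without network access; by default the omnipath client is used.
    """
    if fetch is None:
        # Imported lazily so that the TF download works without omnipath installed.
        from omnipath.interactions import import_intercell_network
        fetch = import_intercell_network
    interactions = fetch()  # default parameters, as described above
    # Default to_csv options; adjust if the notebook reads the file differently.
    interactions.to_csv(dest)
    return Path(dest)


if __name__ == "__main__":
    print(download_tf_names())
    print(save_intercell_network())
```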