This repository contains R/Python source code for the data analysis in the context of a study in Nature focused on investigation of oncogenic aberrations in medulloblastoma Group 3/4 brain tumors: "Oncogene aberrations drive medulloblastoma progression, not initiation"
The folders are constructed based on the analysis of specific data types. Check the wiki for some tutorials on the certain anlaysis steps. If case of any questions, please create an issue in github repo or contact the authors directly.
Analysis of single nuclei gene expression data counts matrix computed by STARsolo and futher filtered from quality control inspections. More details about these procedures are provided in the corresponding study.
General processing (normalization, clustering, UMAP, output) per sample: processSnRNAseq.R
Merged samples processing (UMAPs, markers visualization): mergeSnRNAseq.R
Copy number profiling per sample: runInferCNV_perSample.R
Gene set varaince enrichment analysis per sample: runGsvaPerCell.R
R session details: sessionInfo.RNAseq.txt
Analysis of single nuclei ATAC chromatin data generated by CellRanger-Arc toolkit.
General processing (filtering, normalization, preparation for InferCNV) per sample: processAtacTumorSampleSV4.R
Merged samples processing (UMAPs): mergeAtac.R
Analysis of cis-regulatory elements (subclone-specific, gene assoications,etc): list of scripts
Analysis of mutations (SComatic script, filter somatic based on bulk WGS, plot results): list of scripts
R session details: sessionInfo.ATAC.txt
Analysis of single cell spatial RNA data obtained from the protocol Molecular Cartography of Resolve Bioscience.
General per sample processing (filtering, normalization, visualziation): resolveSeurat.R
Merged samples processing (UMAPs): resolveMerged.R
Transfer of the signals obtained from snRNA-seq data into spatial: transferAnchorsPerSample.R
Additional processing with Giotto R package (neigbourhood, etc): giotoAnalysis.R
R session details: sessionInfo.spatial.txt
DNA processing (prepare alignment/QC; server jobs; draw CNV results): list of scripts
RNA processing (prepare alignment; server jobs for alignment and compute counts; process counts, transfer from 10X snRNAseq): list of scripts
This folder contains scripts to reproduce the WGS analysis. To start with, download the data from Mendeley (doi: 10.17632/g4r22w9pp8.1), adjust the paths and directories in Settings.R and provide Extended Data Table 5 in the folder "Meta_data". Install the missing libraries as listed in Settings.R. In particular, install the R package NBevolution. Then source Settings.R. Take care that all input and output directories exist.
The script Cohort_characteristics.R plots the distribution of subtypes that went into the analysis as shown in Fig. 3a and Extended Fig. 3a.
SNV densities were quantified with the script Mutation_density_quantification.R. Run Adjust_purity.R first to collect all ploidies and purities along the cohort and adjust the purity in cases where it could not be estimated by ACEseq. The script Mutation_density_quantification.R generates the plots shown in Extended Data Fig. 6b-f, Fig. 3b and d, and the statistics shown in Fig. 3i,j. For comparison, the script MutationTimeR.R computes mutation densities using MutationTimeR (Gerstung et al., Nature, 2020); this script generates the output shown in Extended Data Fig. 6g and the input data for the shinyApp.
This was performed in 2 steps. First, we fit a population-genetics model of tumor initiation to the SNV densities at MRCA in G3/4 tumors. To re-run the analysis, first generate the summary data by sourcing the script Input_data_MB.R. In a second step, the model is fit to the data. Here, you need to install pyABC. The model fit is done by executing the script Expansion_decay_continuous_evol.py, which sources the script Expansion_decay_continuous_evol.R, which, in turn, sources the script ABC.R. Parameter estimation is performed according to the following pseudo-code:
FOR EACH optimization step:
- Sample a parameter set from the prior probabilities given in Table 1 below
- Sample for each tumor a time point of the MRCA,
$t_2$ , using the functionSample.timepoint()from NBevolution (the function accounts for an overall incidence of$10^{–5}$ ). - Sample for each tumor a time point of the ECA,
$t_1$ using the functionSample.timepoint.eca()from NBevolution. - Sample for each tumor a neutral mutation count at ECA and MRCA using the function
Sample.mutations()from NBevolution. Determine the mutation density by dividing with the haploid genome length of 3.3× 10^9, yielding$\tilde{m}_\mathrm{MRCA,sim}$ and $\tilde{m}_\mathrm{ECA,sim}$ . - Determine the simulated incidence of the ECA at the experimentally determined mutation loads,
$I_{\mathrm{ECA,sim},i} = \sum \tilde{m}_\mathrm{ECA,sim} < i; i \in \tilde{m}_\mathrm{ECA,exp}$ , and the simulated incidence of the MRCA at the experimentally determined mutation loads,$I_{\mathrm{MRCA,sim},i} = \sum \tilde{m}_\mathrm{MRCA,sim} < i; i \in \tilde{m}_\mathrm{MRCA,exp}$ (where "sim" and "exp" denote simulated and experimentally determined mutation densities). - Determine the simulated incidence of the MRCA at age 50 years,
$I_\mathrm{MRCA,sim,ten years}$ using the functionP.2nd.driver()from NBevolution. - Compute the cost function
$d=\sum_{i \in m_\mathrm{MRCA,exp}} w_i \frac{(I_{\mathrm{MRCA,sim},i} -I_{\mathrm{MRCA,exp},i})^2}{\Delta I_{\mathrm{MRCA,exp},i}^2}+\frac{(I_{\mathrm{ECA,sim},i} -I_{\mathrm{ECA,exp},i})^2}{\Delta I_{\mathrm{ECA,exp},i}^2 }+\frac{(I_\mathrm{MRCA,sim,ten years} -10^{-5})}{(10^{-4} )^2}$ , where the weights were chosen as
to enforce good fits for the initial expansion phase. The experimental incidences were computed as
and uncertainties were estimated to
where
DONE
Table 1. Prior probabilities to model medulloblastoma initiation
| Parameter | Description | Unit | Distribution | Min | Max |
|---|---|---|---|---|---|
| Relative loss rate during tissue expansion | 1 | uniform | 0 | 0.9 | |
| Number of cells in tissue at peak | 1 | log-uniform | |||
| Relative loss rate during tissue contraction | 1 | uniform | 1.0001 | 1.5 | |
| Reduction in cell loss due to first driver | 1 | uniform | 0 | 0.999 | |
| Survival probability attributed to 2nd driver | 1 | uniform | 0.01 | 0.49 | |
| Mutation rate 1st driver | 1 | log-uniform | |||
| Mutation rate 2nd driver | 1 | log-uniform | |||
| Neutral mutation rate | 1/division | uniform | 1 | 15 |
In the second step, we fit a model of tumor growth to the subclonal tail of 39 Group3/4 medulloblastomas with sufficient data quality. To reproduce this part, run Neutral_fit_pre_clonal_and_clonal.py in conjunction with Mutation_accumulation_fit_pre_clonal_and_clonal.R and Mutation_accumulation_exponential_growth.R. The fitting procedure encompasses the following steps:
- Compute summary statistics from measured data:
FOR each copy number state
IF
NEXT
ELSE:
- Compute the total number of clonal mutations as the sum of the different clonal VAF peaks constituted by amplified and non-amplified clonal mutations. Denoting with
$\hat{C_k}$ the average coverage from all mutations falling on segments with copy number$k$ , we classified mutations as amplified clonal if$\frac{Q_{l-1}^{0.95}}{\hat{C_k}} < \mathrm{VAF}_k \le \frac{Q_l^{0.95}}{\hat{C_k}}$ , where$Q_l^{0.95}$ is the 95% quantile of a binomial distribution with success probability \frac{\rho l}{\zeta}$, where$\rho$ is the purity,$\zeta$ is the average copy number$l$ is the B-allele count. - Merge these mutations with those of the non-amplified clonal peak by multiplying their frequencies with
$l/k$ and adding them$l$ times. - Compute the cumulative mutation counts of the measured data,
$F_{k,\mathrm{exp}}(f) = \sum \mathrm{VAF}_k>f $ , where$f$ runs from 0.05 to 1 in steps of size 0.05 and extrapolate them to the whole genome by multiplication with$\frac{\sum_k g_k}{g_k}$ .
DONE
- Fit the model to the data:
FOR EACH optimization step:
- Sample values for
$\mu, \frac{\delta_\mathrm{T}}{\lambda_\mathrm{T}}, n_\mathrm{clonal}$ and$\Delta_\rho$ from the prior distributions given in Table 2 below, where$\Delta_\rho$ is a correction factor to the purity estimate by ACEseq.
FOR EACH copy number state$1\le k \le 4$ :__ IF$g_k<10^8$ bp, where$g_k$ is the length of the genome at copy number$k$ :
NEXT
ELSE:
- Determine$n_{f,k}$ from equation (8), assuming a tumor size of$10^9$ cells at diagnosis, and evaluating equation (8) in bins of size 0.05 at the lower limit:
where
DONE
DONE
The model fits were integrated with the script Compute_evolutionary_parameters_from_growth_model.R and visualized with the script Plot_dynamic_model.R, generating Fig. 3e,f,g and Extended Data Fig. 6l. Individual model fits to the subclonal tail can be visualized with the script Plot_neutral_fit.R.
Table 2. Prior probabilities to model medulloblastoma growth
| Parameter | Description | Unit | Distribution | Min | Max |
|---|---|---|---|---|---|
| Number of SSNVs per division | 1 | uniform | 0.1 | 20 | |
| Relative loss during tumor growth | 1 | uniform | |||
| Adjustment of the estimated purity ( |
1 | uniform | -0.05 | 0.05 | |
| Number of clonal variants | 1 | uniform | 0 | 10000 |
Significantly enriched large-scale CNVs were identified with the script Gains_and_losses.R (which generates Extended Data. Fig. 7a). The regions were then integrated with the timing information in the script [CNV_timing_overview.R](WGS/Oncoprint/CNV_timing_overview.R, which also generates Extended Data Fig. 7b). Finally, the script Driver_mutations.R generates the oncoprint in Fig. 3h.