Analysing mutations with log-linear models

This repository contains the scripts used to generate the samples and perform the analyses reported in Statistical methods for identifying sequence motifs affecting point mutations by Zhu, Neeman, Yap and Huttley. Running all of these scripts requires the MutationMotif Python library and all of its dependencies, plus an install of Jupyter notebook. (Note: scripts ending in .ipy and .ipynb require Jupyter/IPython.)

As the repository also includes the counts data from which all analyses were conducted, running all scripts listed under the "Analysis of..." sections should reproduce exactly the tables and figures reported in the manuscript. As the production of those counts data involves pseudo-random sampling of "reference" locations, re-running the analysis (after executing the populate_data.ipy script to obtain the required precursor data) will produce slightly different results but ones consistent with the inferences drawn in the manuscript. If you wish to examine the robustness of the analyses, having run the script populate_data.ipy, you can re-start the analysis by generating counts data again.
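Why re-sampling changes the counts slightly can be illustrated with a minimal sketch. It does not reproduce the repository's sampling code: the function names, the window size and the seed values are hypothetical, and it assumes the "reference" location is simply drawn at random from a window around the SNP, excluding the SNP position itself.

```python
import random


def sample_reference_position(snp_index, window, rng):
    """Pick a pseudo-random position within `window` of the SNP, never the SNP itself."""
    candidates = [
        i for i in range(snp_index - window, snp_index + window + 1) if i != snp_index
    ]
    return rng.choice(candidates)


def sample_all(snp_indices, window, seed):
    """Sample one reference position per SNP using a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [sample_reference_position(i, window, rng) for i in snp_indices]


snps = [100, 250, 400]  # made-up SNP coordinates
run_a = sample_all(snps, window=10, seed=1)
run_b = sample_all(snps, window=10, seed=1)
print(run_a == run_b)  # the same seed reproduces the same sample
```

With a different seed (or no fixed seed) the sampled positions, and hence the reference counts, will generally differ between runs.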

Sampling the germline data

  1. A summary dump is produced of germline SNPs whose flanks match the reference genome. The output is a massive file in pickle format (Python's serialised data format), data/ensembl_snps_79/GERMLINE_flanks_match_ref.txt.gz.
  2. The generate_sampled_snps.ipy script reads pickled records of the correct consequence type (e.g. missense) to query the Ensembl MySQL database, producing a summary table containing alleles, ancestral base, chromosome location and reported allele frequencies for non-somatic mutations. The results are dumped into a raw directory.
  3. generate_germline_raw_w_flanks.ipy combines the above results with slices from the entire chromosome sequence to produce the completed SNP dump records, which are written into raw_with_flanks.
  4. generate_germline_alignments.ipy produces fasta files, one for each mutation direction, with the mutated location in the middle.
  5. generate_germline_counts.ipy produces separate counts tables for each mutation direction.
  6. generate_germline_all_counts.ipy produces combined counts tables (all mutation directions) for strand asymmetric and strand symmetric conditions.
  7. generate_germline_long_flanks_counts.ipy reads the germline alignments for the CtoT and AtoG mutations, producing counts from larger flanks. Data are written to ../data/ensembl_snps_79/counts/long_flanks.
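The step from alignments to counts can be sketched as follows. This is an illustrative sketch only, not the repository's or MutationMotif's code: the function name flank_counts and the toy sequences are hypothetical, and it assumes each aligned sequence has odd length with the mutated position exactly in the middle.

```python
from collections import Counter


def flank_counts(sequences, flank_size):
    """Count flanking motifs of `flank_size` bases either side of the central position."""
    counts = Counter()
    for seq in sequences:
        mid = len(seq) // 2  # index of the mutated (central) position
        if mid < flank_size:
            raise ValueError("sequence too short for requested flank size")
        left = seq[mid - flank_size:mid]
        right = seq[mid + 1:mid + 1 + flank_size]
        counts[left + right] += 1  # the central base itself is excluded
    return counts


# Toy sequences standing in for one mutation direction's fasta records.
seqs = ["ACGTA", "TCGTT", "AGATA"]
counts = flank_counts(seqs, flank_size=1)
long_counts = flank_counts(seqs, flank_size=2)
print(counts)
```

Increasing flank_size corresponds to the "larger flanks" case: the same alignments are read, but a wider neighbourhood contributes to each motif.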

Sampling the COSMIC cancer data

  1. Downloaded CosmicMutantExport.tsv.gz from COSMIC.
  2. Processed this file with a script that limited output to SNPs and relevant fields (like primary histology). This produces cosmic_derived/.
  3. Ran a script to obtain the human chromosome sequences, which were then used to generate data in a format consistent with that produced for the analysis of germline mutations.
  4. generate_cancer_alignments.ipy produces fasta files, one for each mutation direction, with the mutated location in the middle.
  5. generate_cancer_counts.ipy and generate_cancer_all_counts.ipy produce the counts tables, as for the germline data.
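The filtering step in this list (limiting a tab-delimited export to SNPs and a few fields) might look roughly like the sketch below. The column names sample, primary_histology and change are placeholders chosen for illustration and are not COSMIC's actual headers; the "C>T" notation for a substitution is likewise an assumption.

```python
import csv
import io

# Hypothetical column names; the real COSMIC export uses its own headers.
KEEP_FIELDS = ["sample", "primary_histology", "change"]


def is_snp(change):
    """True for a single-base substitution written like 'C>T'."""
    ref, sep, alt = change.partition(">")
    bases = set("ACGT")
    return sep == ">" and ref in bases and alt in bases and ref != alt


def filter_snps(handle):
    """Yield only SNP rows, reduced to the fields of interest."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if is_snp(row["change"]):
            yield {field: row[field] for field in KEEP_FIELDS}


# A tiny in-memory stand-in for the downloaded TSV file.
example = io.StringIO(
    "sample\tprimary_histology\tchange\textra\n"
    "s1\tmalignant_melanoma\tC>T\tx\n"
    "s2\tcarcinoma\tCT>C\ty\n"
    "s3\tmalignant_melanoma\tA>G\tz\n"
)
rows = list(filter_snps(example))
print(len(rows))
```

For a gzipped file, the handle could instead come from gzip.open(path, "rt") in the standard library.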

Analysis of neighbourhood effects

  1. do_nbr.ipy: Statistical analysis of a single group for contributions of neighbouring bases to point mutations. Results are written to ../results/ensembl_snps_79/<chrom class>/<seq class>/directions/<direction>. The malignant melanoma results are written to ../results/ensembl_snps_79/malignant_melanoma.
  2. do_nbr_long_flanks.ipy: Statistical analysis of a single group for contributions of neighbouring bases to point mutations, using the long flanks counts. Results are written to ../results/ensembl_snps_79/long_flanks/A/intergenic/directions/<direction>.
  3. do_nbr_compare.ipy: Statistical analysis comparing neighbourhood effects between groups (tissue types, chromosome classes, sequence classes, sequence strands). Results are written to <group_vs_group> directories within ../results/ensembl_snps_79/, depending on the level of comparison, e.g. ../results/ensembl_snps_79/A_vs_X.
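As a rough illustration of the kind of test involved, the sketch below computes the deviance (G statistic) of the independence model for base counts at one flanking position, cross-classified by mutated versus reference status. For a two-way table this is the deviance of the log-linear model without an interaction term. It is a simplified stand-in and not the MutationMotif implementation; the function name and the counts are made up for illustration.

```python
import math


def independence_deviance(table):
    """Return (G, df) for a two-way table given as {row: {col: count}}."""
    rows = list(table)
    cols = sorted({c for r in rows for c in table[r]})
    row_tot = {r: sum(table[r].get(c, 0) for c in cols) for r in rows}
    col_tot = {c: sum(table[r].get(c, 0) for r in rows) for c in cols}
    total = sum(row_tot.values())
    g = 0.0
    for r in rows:
        for c in cols:
            obs = table[r].get(c, 0)
            if obs == 0:
                continue  # a zero cell contributes nothing to the sum
            expected = row_tot[r] * col_tot[c] / total
            g += 2 * obs * math.log(obs / expected)
    df = (len(rows) - 1) * (len(cols) - 1)
    return g, df


# Made-up counts of the base at one neighbouring position.
counts = {
    "mutated": {"A": 40, "C": 10, "G": 30, "T": 20},
    "reference": {"A": 25, "C": 25, "G": 25, "T": 25},
}
g, df = independence_deviance(counts)
print(g, df)
```

A large G relative to a chi-square distribution with df degrees of freedom suggests the base at that position is associated with mutation status. Comparisons between groups add a further factor (the group) to the table.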

Analysis of mutation spectra

Specified in the do_spectra_compare.ipy script. Output files (spectra_analysis.json, spectra_analysis.log and spectra_analysis.txt) are written to the same directories as the corresponding neighbour analysis.

Generating grid plots

These are encoded in do_grid_draw.ipy. This includes the malignant melanoma and germline intergenic plots, plus the mutation spectra plots.

Generating the MI sequence logo plot

This is a one-off to demonstrate that the conventional sequence logo approach does not produce meaningful results. Run via do_mi_draw.ipy.

Repeating neighbour analyses using the genome as the reference

Count files of observed and reference motifs are generated, where the reference is drawn from the entire genome. The generating script requires considerable RAM. It was run only for intergenic region autosomal and X-linked SNPs.

The script checks that the counts for the observed motifs are identical between executing the script and the original scripts.

The do_nbr_genome_ref.ipy script performs the mutation neighbour effect analyses and compares the A with X results.

Producing manuscript tables and figures

This is all encoded in a Jupyter notebook file (manuscript_figs_tables.ipynb). The resulting tables for the main manuscript are written to ../results/ensembl_snps_79/manuscript_tables.tex. Tables and figures for supplementary material are written to ../results/ensembl_snps_79/supp_materials_tables.tex.

Figures are copied to the figures directory.