Using mismatches in XR-seq reads to pinpoint CPD positions
- Project Overview
- Dependencies and File System Naming Conventions
- Preparing Data for Analysis
- Analysis and Figure Generation
- Data Availability
The code in this repository was used to produce the findings in this publication. This paper contains additional insight into the analysis using this code.
Cytosine nucleotides deaminate much faster in cyclobutane pyrimidine dimers (CPDs) than in undamaged DNA. Because of this, XR-sequencing reads, which represent fragments from nucleotide excision repair of various helix-distorting lesions such as CPDs (more information here) may contain C>T mismatches where the CPD was present. Normally, the position of these lesions in the original repair fragment can only be estimated, but by identifying these C>T mismatches, the lesion positions can be pinpointed more accurately.
For identifying single-deaminated cytosines in CPDs, reads are aligned using default alignment parameters in bowtie2. The exact locations of the mismatches are extracted post-alignment from the resulting sam file, and reads with exactly one C>T mismatch are investigated further as proxies for cytosine deamination in CPDs.
For tandem mismatches (specifically, CC>TT) an alternative method must be used. Aligners such as bowtie2 do not allow multiple mismatches during seeding, so depending on the read length and the position of the tandem mismatch, the read may not align at all. However, given the specific nature of the tandem mismatch, the problem can be coereced into an exact matching problem. For every TT sequence in a given read, a new read is produced with the TT changed to a CC to represent a reversion of the potential mismatch. These reverted reads are check against a reference genome for exact matches, and if exactly one read matches (and the original read does NOT match), the tandem mismatch in that read is confirmed. Exact matching can be performed using bowtie2 with specific parameters outlined in the Aligning Reads section.
Admittedly, the software in this repository was largely not designed with other users in mind. Regardless, the core analysis is reproducible after taking a few steps to set up a specific environment and following a few conventions pertaining to the project's file system structure. In the event that the following steps are insufficient to reproduce the desired results, you are more than welcome to post an issue or email me directly at b.morledge-hampton@wsu.edu. If there is sufficient interest in the future, we will consider producing a more user-friendly software package with the same functionality.
Besides the code in this repository, the analysis requires that the following packages are installed (The most up-to-date version of each package is recommended):
- Python:
- R:
- Must be available through command line (Recommend installing through apt, where possible):
Although not required, the scripts in this repository may be helpful for trimming and aligning sequencing reads and parsing them to bed. If desired, these three operations can be chained together using the AlignXRSeqReads.py script.
Additionally, this script can be used to download the sequencing data more efficiently by providing the relevant SRA accession IDs in a newline-separated text format and optionally providing custom names (in the same format) for the resulting fastq files. Note that running this script requires that the SRA Toolkit is installed and accessible through your PATH environment variable.
After cloning the deamination_determination repository, you will need to create a directory at the top level called "data". Within this data directory, an additional directory will be required for each species or cell type being analyzed. The names supported by the R notebooks in this repository are:
- Arabidopsis
- CS-B
- Dm
- E_coli_WT
- GM12878
- NHF1
- XP-C
- yeast
The corresponding data sets for the above names can be found in the Data Availability section
Within each of these directories, up to two other directories should be created:
- "mismatches_by_read" for storing mismatch data by read (e.g. The results from SamMismatchesToBed.py or PotentialTandemMismatchesSamToBed.py)
- "general_read_sequences" for storing bed files containing general non-mismatch-associated sequencing results
When running the R notebooks contained in this repository, they will ask you to provide a "bioinformatics directory". This is merely the directory containing the cloned repository (i.e. one directory up from "deamination_determination/"). Once this directory is selected, the underlying file system is inferred using the naming conventions in this section.
When obtaining fastq sequencing files (or sam alignment files), a strict file naming system needs to be maintained in order for the rest of the pipeline to run without errors. The file names should be made up of the following identifying information, separated by underscores:
- Organism or cell type name
- Lesion type ("CPD", "6-4", or "cisplatin")
- Timepoint (in shorthand format, such as "10min" or "1h")
- Repitition number (In the format "rep#" where '#' is the repitition number or "all_reps" when combined)
Here is an example of a valid sam file name: "NHF1_CPD_1h_rep1.sam"
The NHF1 periodicity analysis is largely deprecated, but if for some reason you want to run it anyway, you will need to install mutperiod for the periodicity analysis and have this repository downloaded in the same directory as deamination determination. The analysis requires that you use mutperiod to generate the periodicity data and stratify it by sequence context. Seriously though, don't try to run this. The analysis turned up very little in terms of a clear periodicity.
For this analysis, reads were aligned using bowtie2 after trimming adaptor sequences with trimmomatic. Other aligners have not been tested, but if they alter the sam file format significantly, they may not function properly with the rest of the pipeline.
For finding single-nucleotide mismatches, default bowtie2.4.5 parameters were used. For finding exact matches with potential tandem mismatches, the following parameters were used: --reorder --no-1mm-upfront --no-unal --score-min C,0,0 -k 2.
In order to identify single nucleotide mismatches from an aligned sam file, use the SamMismatchesToBed.py script. When running the script directly (recommended), you should see a short dialog asking for a few input parameters. Sam files can be given by selecting individual files or by giving a directory which will be recursively searched for files ending in ".sam" or ".sam.gz". The "omit indels" box should be checked, and the "specify single output dir" box can be used to output the results directly to the relevant "mismatches_by_read" directory, if desired.
In order to identify tandem nucleotide mismatches, reads need to have their adaptor sequences trimmed, and then the FindPotentialTandemMismatches.py script can be used to split reads based on potential tandem mismatch sites. Fastq files can be given by selecting individual files or by giving a directory which will be recursively searched for files ending in "trimmed.fastq" or "trimmed.fastq.gz". Next, the resulting reads should be aligned with bowtie using the parameters outlined in the Aligning Reads section. Finally, true tandem mismatches should be parsed from the resulting sam file using the PotentialTandemMismatchesSamToBed.py script. The results from this final script should end up in the relevant "mismatches_by_read" directory.
Some steps of the analysis require data in bed format from all reads, independent of mismatches. Aligned reads reads can be coerced to bed format using a combination of samtools and bedtools. A script for this conversion is available here. The resulting bed files should end up in the relevant "general_read_sequences" directory.
Once the data from the above steps has been generated and is present in the relevant directories, the sequence context of the data should be expanded by at least 3 base pairs using this script. For mismatch data files, the expanded sequence context should replace the 7th column (index 6), and for general read files, the new sequence should be appended. For general reads, the "output file suffix" should be in the format "_aligned_reads_#bp_expanded", whereas for mismatch data, the suffix should be in the format "_#bp_expanded". After being expanded, data can be optionally TGG filtered using the FilterOnExpandedContext.py script.
Lastly, the individual data repititions need to be combined. This can be achieved by running this script on the "mismatches_by_read" and "general_read_sequences" directories. This step IS REQUIRED to run the analyses within the R notebooks
If the data has been prepared correctly, running the analyses and generating figures is easy: Simply run/knit the R notebooks provided in the R/ directory of the repository for the desired organism and analysis. However, there are a few caveats...
Any "DataGeneration" notebooks must be run BEFORE their "FigureGeneration" counterparts. If a "FigureGeneration" notebook is run with specific parameters (i.e. TGG filtering), then its corresponding "DataGeneration" notebook must have also been run with those parameters at least once. The "DeaminationSequenceAndPositionAnalysis" notebooks must be run BEFORE their "DeaminationAnchoredExcisionSite" counterparts.
Each notebook requires a path to the "Bioinformatics_Projects" directory. (More information on this directory is available at the end of the directory structure naming conventions section). In brief, this is the parent directory to the cloned deamination_determination repository. If you are running the R notebooks in interactive mode, you will be able to select this directory through a file browser dialog. If you are knitting the notebook, you will select this directory from a list of choices in the Shiny UI for the notebook's parameters. If the desired directory is not listed, you will need to add it manually to the R notebook (generally on line 11).
When knitting figure generation notebooks, the figures are automatically output to the R/deamination_analysis_output directory, which will be created if it doesn't already exist.
The sequencing data used for the analyses supported by this repository can be found at the following locations:
- Arabidopsis: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108932
- CS-B: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67941
- Dm: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE138846
- E_coli_WT: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92734
- GM12878: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE82213
- NHF1: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76391
- XP-C: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67941
- yeast: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110621