Reference-free guide RNA design tool for CRISPR
This is a tool to generate personalized guide RNAs for CRISPR without using a reference genome.
Most genome-wide guideRNA design tools have to use the whole reference genome to populate their database. This limits their use for organisms with an incomplete reference genome. Instead of using a reference, kRISP-meR works from the sequenced reads and a genomic target location (the location where CRISPR cleavage is intended). Using the sequenced reads only, kRISP-meR is able to design variant-aware guideRNAs and predict those with minimized off-target activity.
This tool was initially designed to run on a Linux machine with Python 2.7. With Python 2 now deprecated, the tool has been revised to run on Python 3. If there are any issues, email mahmudhera93@gmail.com or open a GitHub issue.
The following need to be installed to successfully run kRISP-meR:
- samtools 1.0 or higher
- Jellyfish and the Python binding of Jellyfish: in a Python script, the statement import dna_jellyfish should work.
- Bowtie2
- Java 1.8 or higher
- Biopython
- numpy
- scipy
- sklearn version 0.16.1
- pickle
- pandas
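The dependency list above can be checked before installing. The following is a minimal sketch, not part of kRISP-meR: the function name missing_dependencies is hypothetical, and the executable names are assumptions about how the tools appear on your PATH. It checks only for presence, not for the required versions.

```python
import importlib.util
import shutil

# Import names for the Python packages listed above
# (Biopython is imported as "Bio", scikit-learn as "sklearn").
PYTHON_MODULES = ["dna_jellyfish", "Bio", "numpy", "scipy", "sklearn", "pandas"]
# Assumed executable names; adjust if your installation uses different ones.
EXECUTABLES = ["samtools", "jellyfish", "bowtie2", "java"]


def missing_dependencies(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return (missing_modules, missing_executables) as two lists."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_executables = [e for e in executables if shutil.which(e) is None]
    return missing_modules, missing_executables


if __name__ == "__main__":
    mods, exes = missing_dependencies()
    print("Missing Python modules:", mods or "none")
    print("Missing executables:", exes or "none")
```

An empty result for both lists suggests the environment is ready, although version requirements (for example samtools 1.0 or higher) still need to be checked by hand.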
Once you have the dependencies installed, installing kRISP-meR is easy. You need to:
- Download the github repository
- Go to the directory kRISP-meR source
- Install by running the command:
python setup.py install
kRISP-meR takes as input the following:
- The sequenced reads as a FASTQ file
- The target region as a FASTA file
Note that the file containing the target region should have a line >chromosome_name as its header.
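The header requirement can be verified before a long run. This is a hedged sketch, not kRISP-meR's own parser: the function name read_target_fasta is hypothetical, and it assumes the target file holds a single record.

```python
def read_target_fasta(path):
    """Return (header_name, sequence) from a single-record FASTA file."""
    with open(path) as handle:
        lines = [line.strip() for line in handle if line.strip()]
    # The first non-empty line must be the header, e.g. ">chromosome_name".
    if not lines or not lines[0].startswith(">"):
        raise ValueError("first non-empty line must be a header such as '>chromosome_name'")
    # Assumption of this sketch: exactly one record per target file.
    if any(line.startswith(">") for line in lines[1:]):
        raise ValueError("expected a single record in the target file")
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("no sequence found after the header")
    return lines[0][1:].strip(), sequence
```

Calling read_target_fasta("target.fasta") either returns the header name and the sequence or raises a ValueError that names the problem.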
To run kRISP-meR, enter the following command in shell after successful installation.
krispmer READS_FILENAME TARGET_FILENAME OUTPUT_FILENAME MAX_HD
Here, MAX_HD is the maximum Hamming distance that kRISP-meR will consider when scanning target sites for a particular guideRNA. The output of the program is saved in the OUTPUT_FILENAME file in csv format.
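To make the role of MAX_HD concrete, here is a small illustration of Hamming distance between a guide and a candidate site. It is a sketch of the general definition only, and the function names are hypothetical rather than taken from kRISP-meR.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def within_max_hd(guide, site, max_hd):
    """True if the site differs from the guide in at most max_hd positions."""
    return hamming_distance(guide, site) <= max_hd
```

With MAX_HD set to 1, a site that differs from the guide at one position counts as a potential target site, while a site with two mismatches does not. Insertions and deletions are not covered by this measure.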
After the program exits, the output file contains four columns. The first is the guideRNA in the + strand, and the second is the guideRNA in the - strand (the reverse complement of the first). The third column stores the estimated inverse-specificity score assigned to the guideRNA. Finally, the fourth column stores the strand in which kRISP-meR found the NGG PAM.
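The four columns can be read with the standard csv module. The sketch below rests on assumptions that are not confirmed here: that fields are comma-separated, that the score parses as a number, and that any row whose third field is not numeric (such as a possible header row) can be skipped. The names GuideRow and read_scores are hypothetical.

```python
import csv
from collections import namedtuple

# Hypothetical field names for the four columns described above.
GuideRow = namedtuple("GuideRow", ["plus_strand", "minus_strand", "score", "pam_strand"])

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_scores(lines):
    """Parse rows of the scores file from a file handle or a list of strings."""
    rows = []
    for fields in csv.reader(lines):
        if len(fields) != 4:
            continue
        try:
            score = float(fields[2])
        except ValueError:
            continue  # skips a header row, if the file has one (assumption)
        rows.append(GuideRow(fields[0], fields[1], score, fields[3]))
    return rows
```

For a real run, pass an open file handle, for example read_scores(open("out", newline="")). Checking that row.minus_strand equals reverse_complement(row.plus_strand) is a quick sanity test of the first two columns.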
Besides the scores output file, you will also see a directory named krispmer_temp and a file named krispmer.log. The krispmer_temp directory contains the temporary files created when executing the program. You can use the -r flag to delete them automatically (see detailed usage below). The krispmer.log file contains detailed steps of the whole pipeline.
Usage:
krispmer [-h] [-J JF_FILENAME] [-H JF_HISTO_FILENAME]
[-m MAX_COPY_NUMBER] [-w TARGET_SLIDING_WINDOW_SIZE]
[-f SAVGOL_FILTER_WINDOW] [-s] [-v] [-n] [-c CUTOFF_SCORE]
[-a ALTPAMS [ALTPAMS ...]] [-r] [-j JF_THREADS]
[-b BT2_THREADS] [-S SAMTOOLS_THREADS] [-B SORT_THREADS]
[-p PILON_THREADS]
reads_file target_file scores_file max_hd
- reads_file
- target_file
- scores_file
- max_hd
kRISP-meR allows you to design guide RNAs from WGS shotgun reads (in a FASTQ file) and a target region (a FASTA file). Besides these two, you also have to tell the program the number of mismatches to consider when scanning for target sites against a particular guideRNA. kRISP-meR allows up to 3 mismatches and, like other established gRNA design tools, does not consider indels. You also have to tell the program the name of the output csv file, where the guideRNAs are stored along with their inverse-specificity scores and strand information.
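When kRISP-meR is driven from a script, the four positional arguments can be assembled and checked first. This is a sketch only: build_krispmer_command is a hypothetical helper, and the lower bound of 0 for max_hd is an assumption (the text above states only the upper limit of 3).

```python
def build_krispmer_command(reads_file, target_file, scores_file, max_hd, flags=()):
    """Return the argument list for a krispmer call, validating max_hd."""
    # The upper limit of 3 mismatches comes from the description above;
    # accepting 0 is an assumption of this sketch.
    if not 0 <= max_hd <= 3:
        raise ValueError("max_hd must be between 0 and 3")
    return ["krispmer", *flags, reads_file, target_file, scores_file, str(max_hd)]
```

The returned list can be passed to subprocess.run(cmd, check=True) once kRISP-meR is installed. Passing a list rather than a single string avoids shell quoting problems with file names.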
- -J: kRISP-meR uses Jellyfish to count the k-mers in a set of sequenced reads (in a FASTQ file), which usually takes time. If you make multiple runs on the same FASTQ file, Jellyfish would have to run each time, which adds up to a lot of time. Instead, you can input the Jellyfish binary file using the -J flag.
- -H: kRISP-meR builds a k-spectrum histogram from the k-mer counts and uses that histogram to calculate prior and posterior probabilities (which are used to assign scores to the guideRNAs). If you have the histogram file ready, you can input the file with the -H flag.
- -h: You can see help with the -h flag.
- -m: You can set the maximum number of times a region may repeat in the genome using the -m flag. Default: 50.
- -w: kRISP-meR uses a savgol smoothing filter to smooth the k-spectrum data before applying Expectation-Maximization to estimate read coverage. You can set the window size of the filter using the -w flag.
- -s: You can choose to exclude the guideRNAs that contain a stop codon using the -s flag.
- -v: You can choose to polish the target region and personalize it for the individual whose sequenced reads are being used, using the -v flag. A long pipeline using bowtie2, samtools and pilon will start executing.
- -n: You can choose to scan the PAM sequences in the -ve strand using the -n flag.
- -c: You can set a cut-off score for the inverse-specificity using the -c flag. The guideRNAs with a score higher than that will be dropped.
- -a PAM1 PAM2 ...: You can provide kRISP-meR with a list of PAMs to consider with the -a flag. By default, NGG PAMs are considered.
- -r: You can choose to remove the temporary files automatically using the -r flag.
- -j: You can set the number of threads Jellyfish uses to count the k-mers in the sequenced reads using the -j flag.
- -b: You can set the number of bowtie2 threads with the -b flag.
- -S: You can set the number of samtools threads with the -S flag.
- -B: You can set the number of threads used to sort the intermediate BAM file using the -B flag.
- -p: You can set the number of threads to be used in Pilon using the -p flag.
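The k-mer counts and the k-spectrum histogram mentioned for the -J and -H options can be pictured with a toy example in pure Python. This is not how kRISP-meR computes them (it relies on Jellyfish); the function names are hypothetical, and the sketch counts k-mers on the forward strand only and skips k-mers containing N.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in a collection of read sequences."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers with ambiguous bases
                counts[kmer] += 1
    return counts


def k_spectrum(kmer_counts):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return dict(sorted(Counter(kmer_counts.values()).items()))
```

For the reads ACGTAC and ACGT with k = 3, the k-mers ACG and CGT occur twice while GTA and TAC occur once, so the k-spectrum has two distinct k-mers at multiplicity 1 and two at multiplicity 2.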
krispmer -nvr read.fastq target.fasta out 1
This will use the sequenced reads in the file read.fastq and the target sequence found in target.fasta, and write the scores as csv in the file out.
This specific command will:
- Determine the target sequence for the particular individual whose reads are being used
- Scan the -ve strand (reverse complement of the target) for PAM sequences to identify guideRNAs
- Consider a Hamming-distance of 1 when scanning for target-sites for a guideRNA
- Remove all temporary files
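The PAM scan on both strands can be illustrated as follows. This is a simplified sketch, not kRISP-meR's implementation: it assumes a 20-nt guide located immediately 5' of an NGG PAM, and the function names are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (N is left unchanged)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_guides(target, guide_length=20, scan_negative=False):
    """Return (guide, strand) for every NGG PAM preceded by a full-length guide."""
    strands = [("+", target.upper())]
    if scan_negative:
        # The -ve strand is scanned as the reverse complement of the target.
        strands.append(("-", reverse_complement(target)))
    found = []
    for strand, seq in strands:
        # A lookahead is used so that overlapping PAMs are all reported.
        for match in re.finditer(r"(?=[ACGT]GG)", seq):
            pam_start = match.start()
            if pam_start >= guide_length:
                found.append((seq[pam_start - guide_length:pam_start], strand))
    return found
```

Setting scan_negative=True mirrors the effect described for the -n flag: guides that are only visible on the reverse complement of the target are reported with strand "-".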
If you find any issues, please feel free to:
- either create an issue in github
- or email me at mahmudhera93@gmail.com with the log file (the file named krispmer.log)
I will try to get back to you as soon as possible.