nifH Update is a python pipeline that updates an existing databases of labeled nifH gene sequences. Eventually, this pipeline will be generalized such that any database can be updated.
nifHUpdate requires a configuration file with the file paths to relevant files. Dependencies and the configuration requirements are explained in detail below.
The edirect method had several shortcomings, namely the time it took to retrieve relevant sequences from NCBI Nucleotide database, and it's reliance on proper annotation of the queried term. Instead, this pipeline takes fasta files of whole genome sequences from some existing database. To extract relevant genomes sequences, we use minimap to perform a preliminary alignment with your existing database.
The following steps will guide you through how to set up and run this pipeline
- Activate your environment if desired. I used ddocent_env (http://www.ddocent.com/bioconda/), which contains many of the tools necessary.
source activate ddocent_env
- Download additional dependencies if necessary (requirements.txt).
pip install cd-hit-auxtools
- Make relevant file-of-file-names for your local database you are searching.
Here is an example of a fofn file Klebsiella_genomes.fofn
data/home/user/nuccore/570_1.fasta
data/home/user/nuccore/570_2.fasta
data/home/user/nuccore/570_3.fasta
- Create Configuration File. Below are each label, and what they are used for. Those with a * are mandatory. The others have default values.
- PREFIX * will be used to name all intermediate and output files and directories.
- DBFILE * file path to your existing database in fasta format. The ID of each sequence is in the form
Accession;cluster;organism_name
- NUCCORE * file path to a text file containing file paths to all relevant (unzipped) fasta files of your local database you are searching from, even if it is a single fasta file
- MIN_MINIMAP_ALIGNLEN Default = 200 bp. Minimap will drop any alignments that are below this length.
- MIN_BLASTN_ALIGNLEN Default = 200 bp. Blastn will drop any alignments that are below this length.
- PIDENT_CUTOFF Default = 75%, which is the cutoff used for degree of similarity among organisms from the same family. This is used to capture homologs.
- '#' Used to denote a comment.
These labels can be specified in any order. The program will halt if there are any formatting errors or unnecessary labels. Here is an example configuration file called my_config.txt
# May 4, 2019 klebsiella genome test run
PREFIX AAA
DBFILE /Users/nifH_extract/oldDatabase.fasta
NUCCORE /Users/nifH_extract/Klebsiella_genomes.fofn
MIN_MINIMAP_ALIGNLEN 300
- Run the following command in your desired directory.
$ path/to/nifHUpdate_Lib/nifHUpdate_caller.py my_config.txt my_log_file_name.txt
my_config.txt
is the configuration file you made. my_log_file_name.txt
does not have to already exist. It will be created in your current working directory.
You can restart the pipeline at any point adding the -s
option and specifying what stage to start on. For example:
$ path/to/nifHUpdate_Lib/nifHUpdate_caller.py my_config.txt my_log_file_name.txt -s blastn
Available stages are listed below. Running the command without the -s
option will default to rerunning the pipeline from the beginning. It is recommended that if you want to save the result of each run, instead of rerunning with the same prefix, make a new configuration file with a different prefix label.
- The resulting clusters are found in the directory
clusters_dedup
, and a simple summary of the number of sequences in each fasta file created can be found in your log file you named in the initial command.
The following stages are listed in order in which they are performed.
- minimap:
This stage does the preliminary mapping of your old database onto your subject database. It outputs
.paf
files, which provides information about the alignments - minimap_filter:
This stage parses the
.paf
and pulls out the relevant fasta records from your subject database. This reduced set is then passed to the next stage. - blastn: Blastn does a more thorough alignment to pull out the exact sequences with a high enough similarity (specified in your configuration file). This produces the results in output format 6, with the addition of other details used to extract the relevant sequences.
- filter_best_alignments: This extracts the aligned regions that are above predetermined thresholds and puts them into new fasta files. It also formats the fasta file ID and description to include which cluster that sequence aligned to.
- cluster: From the renamed records, this stage simply sorts all the sequences that aligned to the same cluster in its own fasta file.
- deduplicate: Removes copies of the same sequence in each of the cluster files
nifHUpdate requires the following packages, and is highly recommended to run in an environment
- Python
- Biopython
- cd-hit-auxtools
- blastn