🔗 https://www.primerbanks.com/
All scripts are located in the scripts/ directory and are numbered according to their execution order.
Each script handles a specific part of the MicroPD data processing workflow.
📂 Directory structure example:
scripts/
├── 10_filter_table_by_threshold.py
├── 11_run_prokka_annotation.sh
├── 12_fetch_cds_dna_seq_bacteria.R
├── 13_extract_long_genes.sh
├── 14_fetch_specifi_gene_fq.py
├── 15_fetch_specifi_gene_fq.sh
├── 16_merge_fa.sh
├── 17_cdhit_cluster_analysis.sh
├── 18_merge2fasta.py
└── 19_rebuild_fasta.sh
...
Filters out similar genomes based on the results of the MASH algorithm.
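The threshold filter can be illustrated with a minimal sketch. It assumes the MASH output has already been reduced to (genome_a, genome_b, distance) triples; the function name, the 0.01 default threshold, and the greedy keep-first strategy are illustrative assumptions and not necessarily what 10_filter_table_by_threshold.py does.

```python
from collections import defaultdict


def filter_similar_genomes(pairs, genomes, threshold=0.01):
    """Greedily keep genomes that are not within `threshold` of a kept genome.

    pairs:   iterable of (genome_a, genome_b, mash_distance) triples (assumed layout)
    genomes: genome IDs in priority order; earlier IDs win as representatives
    """
    # Record which genomes count as "similar" to each other.
    similar = defaultdict(set)
    for a, b, dist in pairs:
        if a != b and dist <= threshold:
            similar[a].add(b)
            similar[b].add(a)

    kept = []
    for genome in genomes:
        # Drop the genome if any already kept genome is similar to it.
        if not any(k in similar[genome] for k in kept):
            kept.append(genome)
    return kept
```

With this strategy the order of `genomes` decides which member of a similar group survives, so sorting by assembly quality beforehand may be a sensible choice.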
Runs Prokka in batch mode to annotate genome files (.fna).
Generates amino acid (.faa), nucleotide (.ffn), and annotation tables (.tsv).
Parallel processing is used to enhance efficiency.
Extracts CDS DNA sequences from Prokka annotation tables (.tsv) and outputs FASTA files.
Extracts gene sequences longer than 10,000 bp using seqkit.
Saves the IDs of long genes to a text file; supports parallel processing.
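The script itself relies on seqkit. As a rough pure-Python stand-in, the length filter can be sketched as below; the function names and the minimal FASTA parser are assumptions made for illustration.

```python
def read_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def long_gene_ids(lines, min_len=10_000):
    """Return IDs of sequences strictly longer than `min_len` bases."""
    return [name for name, seq in read_fasta(lines) if len(seq) > min_len]
```

The returned list can then be written one ID per line to the text file mentioned above.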
Extracts specific gene sequences from FASTA files using a gene ID list.
Outputs matching and non-matching sequences separately.
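A minimal sketch of the matching/non-matching split, assuming the FASTA records have already been parsed into (id, sequence) pairs. The helper names `partition_by_ids` and `to_fasta` are hypothetical and are not taken from fetch_specifi_gene_fq.py.

```python
def partition_by_ids(records, wanted_ids):
    """Split (id, sequence) records into those in `wanted_ids` and the rest."""
    wanted = set(wanted_ids)  # set lookup keeps the scan linear in the record count
    matched, unmatched = [], []
    for rec_id, seq in records:
        (matched if rec_id in wanted else unmatched).append((rec_id, seq))
    return matched, unmatched


def to_fasta(records, width=60):
    """Render records as FASTA text, wrapping sequence lines at `width` bases."""
    lines = []
    for rec_id, seq in records:
        lines.append(f">{rec_id}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n" if lines else ""
```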
Wraps fetch_specifi_gene_fq.py for batch processing of multiple FASTA files.
Categorizes sequences by matching status; supports parallelization.
Merges multiple FASTA files into larger batches (1000 per file).
Temporary files are used for batch control and deleted after processing.
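The batching logic can be sketched in Python as follows. The output naming scheme (`merged_0001.fasta`, ...) and the function names are assumptions; the actual shell script uses temporary files for batch control, which this sketch replaces with in-memory slicing.

```python
from pathlib import Path


def batched(items, size):
    """Yield consecutive slices of `items` containing at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_fasta_batches(fasta_paths, out_dir, batch_size=1000, prefix="merged"):
    """Concatenate FASTA files into merged files of `batch_size` inputs each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for index, batch in enumerate(batched(sorted(fasta_paths), batch_size), start=1):
        target = out_dir / f"{prefix}_{index:04d}.fasta"
        with target.open("w") as out:
            for path in batch:
                text = Path(path).read_text()
                # Make sure the next header starts on its own line.
                if text and not text.endswith("\n"):
                    text += "\n"
                out.write(text)
        outputs.append(target)
    return outputs
```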
Performs cd-hit-est clustering for batch sequence analysis.
Splits merged FASTA files into genome-specific FASTA outputs based on gene ID lists.
Supports batch processing with flexible input/output paths.
Executes merge2fasta.py with specified environment and file paths to rebuild genome-specific FASTA files.
Slices each CDS into 150-bp non-overlapping pseudo reads, outputs FASTQ and read count statistics.
Launches split_cds_to_pseudo_reads.py with required parameters.
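The slicing step can be sketched as below. Several details are assumptions rather than documented behavior of split_cds_to_pseudo_reads.py: the read naming (`{gene_id}_{n}`), the constant quality character, and dropping a trailing fragment shorter than 150 bp.

```python
def slice_into_pseudo_reads(gene_id, sequence, read_len=150, quality_char="I"):
    """Cut `sequence` into non-overlapping FASTQ records of exactly `read_len` bases."""
    reads = []
    # Window starts are 0, read_len, 2*read_len, ...; a trailing fragment
    # shorter than read_len is dropped (assumption).
    starts = range(0, len(sequence) - read_len + 1, read_len)
    for n, start in enumerate(starts, start=1):
        fragment = sequence[start:start + read_len]
        reads.append(f"@{gene_id}_{n}\n{fragment}\n+\n{quality_char * read_len}\n")
    return reads
```

The number of reads per gene, `len(reads)`, is the kind of figure that would go into the read count statistics.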
bowtie2-build -f --large-index --bmax 6635772616 --dcv 4096 --threads 28 bacteria.fna bacteria
Maps CDS pseudo-reads to Bowtie2 indices; keeps uniquely aligned reads only.
Generates SAM, aligned/unaligned FASTQ, and log files.
Extracts CDS genes whose 150-bp fragments are uniquely aligned; outputs gene ID lists.
Batch script to execute fetch_specific_gene_name.py across datasets.
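One common heuristic for "uniquely aligned" Bowtie2 reads is a mapped primary alignment without an `XS:i` tag (no second-best alignment reported). The sketch below applies that heuristic and counts unique reads per gene; the heuristic, the `{gene_id}_{n}` read naming, and the function name are assumptions, and fetch_specific_gene_name.py may use a different criterion.

```python
from collections import Counter


def unique_read_counts(sam_lines):
    """Count uniquely aligned reads per gene from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x4:  # read is unmapped
            continue
        if flag & 0x900:  # secondary (0x100) or supplementary (0x800) alignment
            continue
        tag_names = {field[:2] for field in fields[11:]}
        if "XS" in tag_names:  # a second-best alignment exists, so not unique
            continue
        gene_id = fields[0].rsplit("_", 1)[0]  # strip the assumed "_{n}" suffix
        counts[gene_id] += 1
    return counts
```

Whether a gene should be kept when only some of its fragments are unique is a policy decision; comparing these counts with the total reads per gene is one way to make it.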
# Generate the DIAMOND database (index) files
diamond makedb --in /s3/SHARE/woodman/Prokka2/data/uniref90.fasta --db /s3/SHARE/woodman/Prokka2/dmnd_db/uniref90.dmnd
diamond makedb --in /s3/SHARE/woodman/Prokka2/data/uniref50.fasta --db /s3/SHARE/woodman/Prokka2/dmnd_db/uniref50.dmnd
Runs DIAMOND searches against UniRef90 and UniRef50 databases for protein annotation.
Launches parallel DIAMOND searches (two jobs × seven CPUs each) for large-scale protein annotation.
Splits sequences from input FASTA into individual {gene_id}.fasta files.
Batch executes gene extraction and logs completion status.
conda install -c bioconda primer3-py=0.6.1
conda install bioconda::pysam
conda install cctbx202105::biopython
conda install pandas=1.5.3 numpy=1.21.2
Designs qPCR primers using Primer3 (product sizes 100–500 bp).
Skips sequences shorter than 100 bp and outputs CSV/log files.
Executes batch primer design with the above Python script.
Merges individual primer result CSVs, adds GENOME_ID/GENE_ID, and outputs primer_bank_virus.csv.
Calculates ΔG, GC%, Tm, hairpin/self-complementarity scores, and overall primer score.
Batch runs primer scoring across all results.
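Two of the simpler metrics can be approximated without external libraries. The sketch below computes GC% and a rough Tm using the Wallace rule, Tm = 2(A+T) + 4(G+C); this rule is a swapped-in approximation for short oligos and is an assumption here, since the scoring script most likely relies on a thermodynamic model instead.

```python
def gc_percent(primer):
    """Return the percentage of G and C bases in the primer."""
    seq = primer.upper()
    if not seq:
        raise ValueError("empty primer sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer):
    """Rough melting temperature in degrees Celsius via the Wallace rule."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc
```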
Combines all primer CSVs into a unified bacterial primer bank.
Improved version of primer merging; auto-completes unique indexes for KINGDOM and PRIMER_PAIR_X.
Converts GTF “gene” entries into JSONL format for database use.
Batch converts GFF files to JSONL format and merges them into gtf.jsonl.
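A minimal sketch of the GTF-to-JSONL conversion for "gene" rows. It assumes the standard nine-column GTF layout with `key "value";` attributes; the JSON field names are illustrative and do not necessarily match the database schema.

```python
import json
import re

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gtf_genes_to_jsonl(gtf_lines):
    """Return one JSON string per GTF row whose feature column is "gene"."""
    out = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        record = {
            "seqname": cols[0],
            "source": cols[1],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "attributes": dict(ATTR_RE.findall(cols[8])),
        }
        out.append(json.dumps(record, sort_keys=True))
    return out
```

Writing the returned strings one per line produces a JSONL file; concatenating the outputs of several inputs gives a merged file.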
Merges historical and current NCBI assembly summaries into unified JSONL master data.
Updates database form fields and data structure (v2 → v7).
Maps 40k CDS pseudo-sequences against NCBI nt index using Bowtie2; retains uniquely aligned reads.
Merges SAM files from nt partitions (A/B), removes duplicates and non-unique alignments.
Performs multi-threaded batch download of NCBI RefSeq genome reports; validates MD5 checksums.
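The checksum validation step can be sketched with the standard library. The function names are hypothetical, and the download and threading logic is omitted.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for large genome files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A file that fails verification would typically be deleted and downloaded again.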
Parses SAM alignment files to generate genome-specific region–primer indexes (JSON/TSV/PKL).
Enhanced version supporting comprehensive summary tables across all genomes.
Aggregates sequence report JSONLs, deduplicates by GenBank accession, and outputs summary indexes.
Packages all subdirectories under temp_fna_taxid/ into {taxid}.fna.tar.gz archives.
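The packaging step can be sketched as follows, assuming each subdirectory name is a taxid and that the output directory lies outside the directory being packaged. The function name is hypothetical.

```python
import tarfile
from pathlib import Path


def package_taxid_dirs(root, out_dir):
    """Create {taxid}.fna.tar.gz in `out_dir` for every subdirectory of `root`."""
    root, out_dir = Path(root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        archive = out_dir / f"{sub.name}.fna.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            # arcname keeps paths inside the archive relative to the taxid folder.
            tar.add(sub, arcname=sub.name)
        archives.append(archive)
    return archives
```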
Generates six downloadable formats (CSV, FAA, FA, FNA, GFF→BED, etc.) per TAXID and flags missing records.
