baitUtils is a toolkit for the analysis and visualization of bait sequences used in in-solution hybridization. It provides tools for computing bait quality statistics, plotting them, and mapping baits against a reference genome.
- Conda: Please ensure you have conda installed to manage dependencies.
- Install the required packages and dependencies with:
conda create -n baitutils_env numpy pandas matplotlib-base seaborn scikit-learn biopython viennarna pblat
conda activate baitutils_env
git clone https://github.com/FOI-Bioinformatics/baitUtils.git
cd baitUtils
pip install .
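After installation, it can be useful to confirm that the expected executables are reachable. The following is a minimal sketch, not part of baitUtils: the helper name `find_missing_tools` is hypothetical, and it assumes that the `baitUtils` entry point and the `pblat` binary end up on `PATH` inside the activated environment.

```python
import shutil

# Executables assumed to be on PATH after the installation steps above.
# "baitUtils" is the command used throughout this README; "pblat" is assumed
# to be the binary name installed by the pblat conda package.
REQUIRED_TOOLS = ("baitUtils", "pblat")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All expected tools were found.")
```

Run it with the `baitutils_env` environment activated; an empty result suggests the environment is set up as intended.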
baitUtils offers three main functionalities: stats, plot, and map, all accessible as subcommands of the primary script.
baitUtils [command] [options]
- stats: Calculate quality statistics of bait sequences.
- plot: Generate plots based on bait sequence statistics.
- map: Map bait sequences against a reference genome using pblat.
Calculates statistics on bait sequences and filters them based on user-defined criteria.
baitUtils stats -i probes.fasta.gz -o results --length 120 --mingc 40 --maxgc 60 --filter
- -i, --input: Path to the input FASTA or FASTA.GZ file.
- -o, --outdir: Output directory for results.
- --length: Requested bait length (default is 120).
- --mingc: Minimum GC content percentage.
- --maxgc: Maximum GC content percentage.
- --filter: Save filtered FASTA output.
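To illustrate the kind of check that the --length, --mingc, and --maxgc options describe, here is a minimal sketch in plain Python. It is not the baitUtils implementation: the function names are hypothetical, and it assumes that a bait must match the requested length exactly and that the GC bounds are inclusive, neither of which is confirmed by this README.

```python
def gc_percent(seq):
    """Return the GC content of `seq` as a percentage between 0 and 100."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def passes_filters(seq, length=120, min_gc=40.0, max_gc=60.0):
    """Return True if `seq` meets the length and GC criteria.

    Assumptions (hypothetical, for illustration only): the length must
    equal `length` exactly, and both GC bounds are inclusive.
    """
    return len(seq) == length and min_gc <= gc_percent(seq) <= max_gc


if __name__ == "__main__":
    bait = "ATGC" * 30  # 120 bases, half of them G or C
    print(gc_percent(bait), passes_filters(bait))
```

The defaults mirror the values used in the example command above (length 120, GC between 40 and 60).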
Generates plots based on the bait sequence statistics file.
baitUtils plot -i results/filtered-params.txt -o plots --columns GC% Tm MFE --plot_type histogram boxplot scatterplot
- -i, --input: Path to the parameters file.
- -o, --outdir: Output directory for plots.
- --columns: List of columns to include in plots.
- --plot_type: Types of plots to generate.
- --color: Column to use for coloring plots.
Maps bait sequences against a reference genome using pblat and filters mappings based on identity percentage.
baitUtils map -i baits.fa -q genome.fa -o mapping_results --outdir mappings --threads 4 --minIdentity 90 --filterIdentity 95 --fasta-output both
- -i, --input: Path to the input baits FASTA file.
- -q, --query: Path to the reference genome FASTA file to map against.
- -o, --outprefix: Prefix for the output files (default is out).
- -Z, --outdir: Output directory path (default is .).
- --mapper: Mapping tool to use (pblat is currently supported).
- -X, --threads: Number of threads to use for mapping (default is 1).
- --minMatch: Minimum number of tile matches (default is 2 for nucleotide sequences).
- --minScore: Minimum score for alignments (default is 30).
- --minIdentity: Minimum sequence identity percentage for mappings (default is 90).
- --filterIdentity: Filter mappings with identity percentage less than this value; must be ≥ minIdentity (default is 90).
- --fasta-output: Choose which probes to include in the FASTA output file: mapped, unmapped, both, or none (default is mapped).
- -l, --log: Enable detailed logging for execution insights.
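The relationship between --minIdentity and --filterIdentity can be sketched as follows. This is an illustrative example only, not the tool's code: the function `filter_by_identity` and the `(bait_name, identity_percent)` tuple layout are hypothetical, and the sketch assumes that hits below minIdentity are never reported by the mapper, so only the filterIdentity cutoff is applied afterwards.

```python
def filter_by_identity(mappings, min_identity=90.0, filter_identity=90.0):
    """Keep mappings whose identity is at least `filter_identity`.

    `mappings` is a list of (bait_name, identity_percent) tuples; this
    layout is an assumption made for the example.
    """
    # The option description requires filterIdentity >= minIdentity.
    if filter_identity < min_identity:
        raise ValueError("filter_identity must be >= min_identity")
    # Mappings with identity below the cutoff are removed.
    return [(name, ident) for name, ident in mappings if ident >= filter_identity]


if __name__ == "__main__":
    hits = [("bait_1", 97.5), ("bait_2", 92.0), ("bait_3", 95.0)]
    print(filter_by_identity(hits, min_identity=90.0, filter_identity=95.0))
```

With the values from the example command (minIdentity 90, filterIdentity 95), a hit at 92% identity would be reported by the mapper but dropped by the filter.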
To calculate quality statistics and filter baits by length and GC content:
baitUtils stats -i probes.fasta.gz -o stats_output --length 120 --mingc 40 --maxgc 60 --filter
To generate histograms and scatterplots of GC content and melting temperature, colored by the Kept column:
baitUtils plot -i stats_output/filtered-params.txt -o plots_output --columns GC% Tm --plot_type histogram scatterplot --color Kept
To map baits against a reference genome and output both mapped and unmapped probes:
baitUtils map -i baits.fa -q genome.fa -o mapping_results --outdir mappings --threads 4 --minIdentity 90 --filterIdentity 95 --fasta-output both
MIT License. See the LICENSE file for details.