This document outlines how to set up Autometa benchmarking jobs on CHTC or a lab server.
- Setup data directories and compute environment
- Configure an Autometa command for benchmarking
- Generate binning parameter sweep arguments file
- Submit jobs for configured command w/parameters file
- Commands used for parameter sweep against CAMI2 datasets
Resources:
HTCondor job submissions were implemented to take advantage of CHTC's parallelized compute resources. The parameter sweep job submission (`autometa_parameter_sweep.sub`) references a command template (`autometa_binning.sh`) as well as a parameters file (`sweep_parameters.txt`), which allows arguments (like `--cluster_method DBSCAN`) to be supplied to `autometa-binning` in `autometa_binning.sh`. Prior to job submission, a compute environment must be configured for use on the compute node.

There are two options: either the compute node can download and run the command in a provided docker image, or a compute environment may be packaged as a tarball and transferred to the compute node at runtime, where it is extracted and installed prior to running the respective preprocessing or binning command.
Autometa is available on Docker Hub with multiple supported versions. If docker is available at your compute facility or on your lab's server, you can simply specify the docker image tag you wish to use and proceed with configuring the command and the parameter sweep arguments file.
# lines in autometa_parameter_sweep.sub
universe = docker
docker_image = jasonkwan/autometa:2.2.0

Available image tags:

- jasonkwan/autometa:latest (latest commit from the main branch)
- jasonkwan/autometa:dev (up to date with the dev branch)
- jasonkwan/autometa:main (up to date with the main branch)
- jasonkwan/autometa:2.2.0
- jasonkwan/autometa:2.1.0
- jasonkwan/autometa:2.0.3
- jasonkwan/autometa:2.0.2
- jasonkwan/autometa:2.0.1
- jasonkwan/autometa:2.0.0
NOTE: This is NOT needed if the submit file uses the `docker` universe with a specified docker image.
Autometa's compute environment may be transferred and installed to CHTC's compute node for use while the respective job is running. This requires additional steps in the job's command template to set up and tear down the compute environment. An example of packaging your own compute environment, along with setup at runtime and teardown after termination of the job, is outlined below.
In the following example, I have created the compute env (`autometa.tar.gz`) and have specified to transfer this as an input file in `autometa_parameter_sweep.sub`:
# Install mamba (faster and same commands available)
conda install -n base -c conda-forge mamba -y
# Create autometa env
mamba create -n autometa -c conda-forge -c bioconda autometa -y
# Create conda-pack env
mamba create -n conda-pack conda-pack -y
# package autometa env to tarball for transfer to SQUID web proxy
mamba activate conda-pack
conda-pack -n autometa
After you have tarballed your compute environment, you may specify it in the submit file.
# lines in autometa_parameter_sweep.sub
universe = vanilla
transfer_input_files = http://proxy.chtc.wisc.edu/SQUID/erees/autometa.tar.gz
Next, you will need to add the following code blocks to the beginning and end of the command template.
## BEGIN conda env setup
# replace env-name on the right hand side of this line with the name of your conda environment
ENVNAME=autometa
# if you need the environment directory to be named something other than the environment name, change this line
ENVDIR=$ENVNAME
# these lines handle setting up the environment; you shouldn't have to modify them
export PATH
mkdir $ENVDIR
tar -xzf $ENVNAME.tar.gz -C $ENVDIR
. $ENVDIR/bin/activate
## END conda env setup
Now add the following to the end of the executable file defined in the submit file (for example `autometa_binning.sh`):
# BEGIN conda env teardown
rm -rf $ENVDIR
rm -rf $ENVNAME.tar.gz
# END conda env teardown
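The pack, extract, and teardown mechanics above can be exercised locally with a stand-in tarball. This is a minimal sketch; `envsrc` and the dummy `activate` script below are placeholders, not a real conda-pack environment:

```shell
# Build a stand-in "environment" tarball (placeholder for conda-pack output)
mkdir -p envsrc/bin
printf '#!/bin/sh\necho activated\n' > envsrc/bin/activate

ENVNAME=autometa
ENVDIR=$ENVNAME
tar -czf $ENVNAME.tar.gz -C envsrc .

# Setup: extract and "activate", as in the command template above
mkdir $ENVDIR
tar -xzf $ENVNAME.tar.gz -C $ENVDIR
sh $ENVDIR/bin/activate

# Teardown: remove the environment directory and tarball
rm -rf $ENVDIR
rm -rf $ENVNAME.tar.gz
```

On a real job the `tar -xzf` and activation lines are the ones in the setup block above, and the two `rm -rf` lines are the teardown block.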
NOTE: If you are unsure about your executable, look for the following lines in your submit file:
# line in autometa_parameter_sweep.sub
executable = ./autometa_binning.sh
Templates correspond to their process and process env; for example, `autometa_binning_conda_env.sh` is a template for the `autometa-binning` command using a conda environment.
NOTE: For more information on setting up the appropriate compute environments, see Compute Environment Setup.
templates/
├── autometa_binning_conda_env.sh
├── autometa_binning_docker_env.sh
├── autometa_binning_ldm_conda_env.sh
├── autometa_gc_content_docker_env.sh
├── autometa_kmers_docker_env.sh
└── autometa_taxonomy_docker_env.sh
The following parameters were combined to generate parameter sweep results.
Parameter | Values | Process | Entrypoints |
---|---|---|---|
Cluster method | DBSCAN, HDBSCAN | Genome-binning | autometa-binning, autometa-binning-ldm |
Completeness | 10, 20, 30, 40, 50, 60, 70, 80, 90 | Genome-binning | autometa-binning, autometa-binning-ldm |
Purity | 10, 20, 30, 40, 50, 60, 70, 80, 90 | Genome-binning | autometa-binning, autometa-binning-ldm |
GC content standard deviation | 2, 5, 10, 15 | Genome-binning | autometa-binning, autometa-binning-ldm |
Coverage standard deviation | 2, 5, 10, 15 | Genome-binning | autometa-binning, autometa-binning-ldm |
k-mer norm. method | ILR, CLR | Genome-binning | autometa-binning-ldm |
k-mer embed method | BH-tSNE, UMAP | Genome-binning | autometa-binning-ldm |
Taxonomy database | NCBI, GTDB | Taxon-binning | autometa-taxonomy, autometa-taxonomy-lca, autometa-taxonomy-majority-vote |
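For the genome-binning sweep, the rows above multiply out to the number of parameter combinations per community (2 cluster methods × 9 completeness × 9 purity × 4 GC stddev × 4 coverage stddev cutoffs):

```shell
# combinations per community for the genome-binning sweep
echo $((2 * 9 * 9 * 4 * 4))
# prints 2592
```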
The input parameters file should contain one set of job arguments per line. These job arguments are defined in `autometa_parameter_sweep.sub` to be passed to `autometa_binning.sh`:

`sweep_parameters.txt` ➡️ `autometa_parameter_sweep.sub` ➡️ `autometa_binning.sh`
`--input` is a path to your data directory containing one metagenome per sub-directory.
The directory structure should resemble something like this:
data
├── cami
│ ├── marmgCAMI2_short_read_pooled_gold_standard_assembly
│ │ ├── logs
│ │ └── preprocess
│ ├── marmgCAMI2_short_read_pooled_megahit_assembly
│ │ ├── logs
│ │ └── preprocess
│ ├── strmgCAMI2_short_read_pooled_gold_standard_assembly
│ │ ├── logs
│ │ └── preprocess
│ └── strmgCAMI2_short_read_pooled_megahit_assembly
│ ├── logs
│ └── preprocess
└── databases
└── ncbi
With this directory structure, you can pass the glob pattern `*assembly` to retrieve the metagenome sub-directories for the parameter sweep analysis:
NOTE: The `--glob` value is used to find sub-directories under the `--input` directory path.
Here is an example command for the CAMI2 datasets using the directory structure shown above...
python scripts/generate_param_sweep_list.py \
  --input $HOME/data/cami \
  --glob "*assembly" \
  --output cami_sweep_parameters.txt
... and here is the breakdown of the search path:
--input | --glob | code | search string | Example values for communityDir in submit file |
---|---|---|---|---|
$HOME/data/cami | *assembly | glob(os.path.join(args.input, args.glob), recursive=True) | /home/user/data/cami/*assembly | marmgCAMI2_short_read_pooled_gold_standard_assembly, marmgCAMI2_short_read_pooled_megahit_assembly, strmgCAMI2_short_read_pooled_gold_standard_assembly, strmgCAMI2_short_read_pooled_megahit_assembly |
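The glob behavior can be illustrated with a throwaway directory layout. This is a sketch; `demo/` below is a placeholder miniature of the data tree shown earlier:

```shell
# Recreate a miniature version of the data layout
mkdir -p demo/cami/marmgCAMI2_short_read_pooled_gold_standard_assembly
mkdir -p demo/cami/strmgCAMI2_short_read_pooled_megahit_assembly
mkdir -p demo/cami/databases

# The "*assembly" pattern matches only the community sub-directories,
# skipping databases/
ls -d demo/cami/*assembly
```

Only the two `*assembly` directories are listed; `databases` is excluded.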
This will allow use of `queue <var> from <arglist>` in the submit file, e.g.:

queue communityDir,community,cluster_method,completeness,purity,cov_stddev_limit,gc_stddev_limit from cami_sweep_parameters.txt
(autometa) [erees@submit-1 binning_param_sweep]$ python generate_param_sweep_list.py --input /home/erees/autometa_runs/binning_param_sweep/data/cami --glob "*assembly" --output cami2_sweep_parameters.txt
Found 4 communities
Wrote 10,368 (2,592 per community) parameter sweep jobs to cami2_sweep_parameters.txt
Wrote parameters in the format:
communityDir, community, cluster_method, completeness, purity, cov_stddev_limit, gc_stddev_limit
----------------------------------------------------------------------------------------------------
PLACE the following in your submit file using these parameter combinations:
queue communityDir,community,cluster_method,completeness,purity,cov_stddev_limit,gc_stddev_limit from cami2_sweep_parameters.txt
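As a minimal sketch of what the script writes, a shell loop over a hypothetical community (`communityA` is a placeholder) and a small subset of the sweep values emits one line of job arguments per combination, in the same order shown in the console output above:

```shell
# One line per combination, in the order:
# communityDir, community, cluster_method, completeness, purity,
# cov_stddev_limit, gc_stddev_limit
# (communityA and the fixed trailing values are placeholders; the real
# file covers every value in the parameter table)
for method in DBSCAN HDBSCAN; do
  for completeness in 10 50 90; do
    echo "data/cami/communityA, communityA, $method, $completeness, 90, 5, 5"
  done
done
```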
- Check inputs for the executable (`autometa_binning.sh`) match filenames transferred in the `*.sub` file (listed in `transfer_input_files`)
- Check directories exist where `stderr`, `stdout`, and `log` will be written (NOTE: These will be written relative to `initial_dir`, e.g. `communityDir`)
- Check annotation files are in their correct location, i.e. `communityDir/preprocess/<annotation_file>`
Each `preprocess` directory should contain the annotation files required as input to its respective command. Here is one example corresponding to the `autometa-binning` command template:
preprocess
├── 5mers.am_clr.bhsne.tsv
├── taxonomy.tsv
├── bacteria.markers.tsv
├── coverage.tsv
└── gc_content.tsv
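Before submitting, a quick shell loop can confirm the annotation files are in place. This is a sketch; `demo2/` is a placeholder layout, and the filenames checked are drawn from the example tree above:

```shell
# Placeholder community directory with an incomplete preprocess directory
mkdir -p demo2/communityA_assembly/preprocess
touch demo2/communityA_assembly/preprocess/coverage.tsv

# Report which expected annotation files are present or missing
for community in demo2/*assembly; do
  for f in coverage.tsv gc_content.tsv; do
    if [ -f "$community/preprocess/$f" ]; then
      echo "OK: $community/preprocess/$f"
    else
      echo "MISSING: $community/preprocess/$f"
    fi
  done
done
```

For a real check, point the outer glob at your data directory (e.g. `data/cami/*assembly`) and list all five annotation files.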
condor_submit autometa_parameter_sweep.sub
To test an interactive job:
condor_submit -i autometa_preprocess_taxonomy_test.sub
condor_submit autometa_preprocess_taxonomy.sub
condor_submit cami_genome_binning_parameter_sweep.sub
condor_submit cami_autometa_ldm_binning_parameter_sweep.sub
To test an interactive job:
condor_submit -i cami_autometa_binning_large_data_mode_parameter_sweep_w_kmer_args_test.sub
condor_submit cami_autometa_binning_large_data_mode_parameter_sweep_w_kmer_args.sub
- Preprocess CAMI2 data
- Generate parameters
- Submit jobs to HTCondor
- Convert binning results to biobox format
- Run AMBER on biobox-formatted binning results
- Get runtime and memory usage information
The CAMI2 assemblies were first pre-processed prior to performing genome-binning. This was performed on the lab server using nextflow and so the corresponding commands are listed below.
NOTE: Some metaBenchmarks workflows use Autometa modules for pre-processing.
To import the Autometa modules for use with the metaBenchmarks workflows, run:
nextflow clone kwanlab/Autometa
Pre-processing generates annotations for each CAMI2 dataset. The annotations are:
- contig lengths & GC content
- contig read (or k-mer) coverage
- kmers
- markers
- taxonomy
cd ${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/
nextflow run cami_preprocess.nf -resume -c cami.config -profile slurm -w cami_work
This will generate sub-directories corresponding to `${params.outdir}/${meta.id}/preprocess`.
The output directory specified in cami.config
from step 1:
params.outdir = "nf-autometa-genome-binning-parameter-sweep-benchmarking-results/cami"
OUTDIR="${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/nf-autometa-genome-binning-parameter-sweep-benchmarking-results/cami"
CHTC_DIR="/home/erees/autometa_runs/binning_param_sweep/data"
# Transfer directories and files to CHTC
rsync -azPL $OUTDIR chtc:"${CHTC_DIR}/."
# On CHTC
cd /home/erees/autometa_runs/binning_param_sweep
python generate_param_sweep_list.py \
--input data/cami/ \
--glob "*assembly" \
--output cami2_sweep_parameters.txt
# Navigate to directory
cd /home/erees/autometa_runs/binning_param_sweep
# Submit CAMI2 jobs
condor_submit cami_genome_binning_parameter_sweep.sub
OUTDIR="${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/nf-autometa-genome-binning-parameter-sweep-benchmarking-results/cami"
CHTC_DIR="/home/erees/autometa_runs/binning_param_sweep/data"
# Transfer autometa2 results from CHTC
rsync -azPL chtc:"${CHTC_DIR}/cami/" "${OUTDIR}/"
bash ${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/format_autometa_cami_binning_tables_to_biobox_format.sh
bash ${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/amber_autometa_genome_binning_marine_gsa_results.sh
sbatch ${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/amber_autometa_genome_binning_marine_gsa_results.sh
bash ${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/amber_autometa_genome_binning_marine_megahit_results.sh
sbatch ${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/amber_autometa_genome_binning_marine_megahit_results.sh
bash ${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/amber_autometa_genome_binning_strmgCAMI2_short_read_pooled_gsa_assembly.sh
sbatch ${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/amber_autometa_genome_binning_strmgCAMI2_short_read_pooled_gsa_assembly.sh
NOTE: Ground truth files were retrieved from the CAMI2 paper github repository: https://github.com/CAMI-challenge/second_challenge_evaluation/
- Clone CAMI2 evaluation repo to get ground truths
git clone https://github.com/CAMI-challenge/second_challenge_evaluation.git
- Untar megahit binning ground truth (only needs to be performed once)
cd $HOME/second_challenge_evaluation/binning/genome_binning/strain_madness_dataset/data/ground_truth
tar -xvzf strain_madness_megahit.binning.tar.gz
bash ${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/amber_autometa_genome_binning_strmgCAMI2_short_read_pooled_megahit_assembly.sh
sbatch ${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/amber_autometa_genome_binning_strmgCAMI2_short_read_pooled_megahit_assembly.sh
All AMBER outputs may be found here:

ls -d ${HOME}/metaBenchmarks/autometa_genome_binning_parameter_sweep/nf-autometa-genome-binning-parameter-sweep-benchmarking-results/cami/*assembly/genome_binning/amber-output
REPO="$HOME/metaBenchmarks"
script="${REPO}/autometa_genome_binning_parameter_sweep/scripts/parse_log_runtime_information.py"
indir="${REPO}/autometa_genome_binning_parameter_sweep/nf-autometa-genome-binning-parameter-sweep-benchmarking-results/cami"
python $script --input $indir --output cami_runtime_info.tsv.gz
bash /media/BRIANDATA4/metaBenchmarks/autometa_genome_binning_parameter_sweep/scripts/format_autometa_gtdb_genome_binning_to_biobox_format.sh
bash /media/BRIANDATA4/metaBenchmarks/autometa_genome_binning_parameter_sweep/scripts/amber_autometa_gtdb_genome_binning_results.sh