Skip to content

3. Config file (parameters)

sneuensc edited this page Nov 21, 2024 · 18 revisions

Introduction

The pipeline can be customised using a configuration file in the yaml format (default config/config.yaml).

The minimal input files to run mapache are

  • a plain text file (config/samples.tsv) with paths and metadata of the input FASTQ files (see Sample file)
  • a reference genome in FASTA format listed in the config file config/config.yaml (see Reference genome)

Additionally, if you plan to impute genomes, a reference panel (in VCF format) needs to be specified in the config file as well.

In the file config/config.yaml, you will see blocks corresponding to each of the steps that could be run. There, you can specify whether the step should be run, include additional arguments, and specify the number of threads and amount of memory to be used.

Here is an example for the subsampling step, which allows you to run mapache only on a few reads (number: 1000) of your data (to test whether the pipeline will succeed with your current configuration). Note that the additional parameter -s1 (random seed for the tool seqtk) was passed using the keyword params.

## subsampling (optional)
subsampling:
  run: False
  mem: 2
  params: '-s1'
  number: 1000

For some tools, the parameters have been set to be optimized for ancient human DNA. As a general rule, we try to add as few extra parameters as possible, and you can find them all (and change them as needed) on the config file. For example, to use the default parameters for a particular tool, the params has to be empty (params: '').

The following paragraphs list and describe all available parameters in mapache. We hope that the config file is self-explanatory, but in this page we extend the documentation for each of the sections in the configuration in case of doubts.

Samples

Sample file

Specifies the location of the sample file. The description on how to write a sample file is found on section 2 of this wiki (Sample file):

sample_file: config/samples.tsv                  ## path to the sample file
delim: "\s+"                                     ## deliminator used to read the sample 
                                                 ## file with pandas read_csv()

Samples may also be directly defined in the config file. Instead of passing a file name to the parameter sample_file a nested dictionary may be passed, e.g.:

sample_file: 
  ind1:                                          ## SM
    a_L2:                                        ## LB
      a_L2_R1_001:                               ## ID
        Data: reads/a_L2_R1_001.fastq.gz         ## Data and path to the file
      a_L2_R1_002:  
        Data: reads/a_L2_R1_002.fastq.gz
    b_L2:
      b_L2_R1_001:  
        Data: reads/b_L2_R1_001.fastq.gz
      b_L2_R1_002:  
        Data: reads/b_L2_R1_002.fastq.gz

External samples

Analyses on the final bam file can also be run on bam files generated by other means (e.g., downloaded from public repositories). These bam files can be defined either in a file analog to the sample file, e.g.:

External sample file ('config/external_samples.tsv')

SM           Bam                   Genome
external1    external/file1.bam    GRCh37
external2    external/file2.bam    GRCh37

The column Genome allows assigning the bam file to the correct reference genome. The external sample file has to be specified in the config file:

external_sample: config/external_samples.tsv

External samples may also be directly defined in the config file using a nested dictionary, e.g.:

external_sample: config/samples_stats.tsv
  GRCh37:                                        ## Genome
    external1: external/file1.bam                ## SM and path to the file
    external2: external/file2.bam

Output folder

The name of the output folder may be defined as follows:

result_dir: 'results'                            ## output folder name

Reference genome

Reference genome

Input FASTQ files may be mapped to one or multiple reference genomes.

The minimal specification of the human reference genome is for the GRCh37:

genome: 
  GRCh37: path_to_reference/GRCh37.fasta

and for the human reference genome GRCh38:

genome: 
  GRCh38:  path_to_reference/GRCh38.fasta

If you plan to map to multiple reference genomes (.e.g, organism1, organism2, organism3), you can do so by listing them all under the keyword genome:

genome: 
  ORG1: path_to_reference/organism1.fasta
  ORG2: path_to_reference/organism2.fasta
  ORG3: path_to_reference/organism3.fasta

Note that the names for each reference genome (in the example above, ORG1, ORG2 and ORG3) have to be unique. Your final output files will be named with the format {sample_name}.{genome_name}.bam

Indexing

Reference genome(s) and bam files have to be indexed for some tools. The indexing is done automatically if needed (for fasta and bam files). If indexes of the reference gnome (fasta) are already available beforehand, mapache may include them as inputs if they are located in the same directory as their matching reference genome (e.g., foo.fasta) and are named with the following formats:

## reference genome name specified in config file:
foo.fasta

## bwa index file names:
foo.fasta.sa
foo.fasta.amb
foo.fasta.ann
foo.fasta.bwt
foo.fasta.pac

## bowtie2 index file names:
foo.fasta.1.bt2
foo.fasta.2.bt2
foo.fasta.3.bt2
foo.fasta.4.bt2
foo.fasta.rev.1.bt2
foo.fasta.rev.2.bt2

## samtools faidx index file name:
foo.fasta.fai

## picard index file name:
foo.dict

Indexing is specified with the following parameters:

indexing:
  bwa_params: ''                                 ## parameters for 'bwa index'
  bowtie2_params: ''                             ## parameters for 'bowtie2-build'
  samtools_params: ''                            ## parameters for 'samtools faidx'
  mem: 16                                        ## memory allocation for queuing system in GB
  threads: 1                                     ## number of CPUs requested for each indexing job
  time: 2                                        ## runtime allocation for queuing system in hours

Mapping workflow

Subsampling (optional)

If you are not yet familiar with mapache, we recommend you to try and subsample a few reads of your data to see mapache's behavior and get to know its output files quickly.

This step is run with the tool seqtk.

seqtk sample ${params} ${input} ${number}`

Subsampling is specified with the following parameters:

subsampling:
  run: False                                     ## should the fastq files be randomly subsampled?
  number: 1000                                   ## number/ratio of reads to keep:
                                                 ##   <1: ratio of reads
                                                 ##   >=1: absolute number of reads 
  params: '-s1'                                  ## parameters for seqtk sample (if paired reads are 
                                                 ##   subsampled, the seed has to be fixed (e.g., -s1)

Cleaning (optional)

Ancient DNA is usually fragmented, with fragment lengths usually shorter than the number of sequenced cycles, resulting in reads including adapter sequences. Removing these remnants of adapters is crucial.

Mapache uses AdapterRemoval2 (default) or fastp to remove adapters, combine paired-reads, and to perform cleaning of the reads in general. Mapache uses by default AdapterRemoval2 with settings optimized for ancient DNA, namely:

  • remove default adapters
  • trim ambiguous bases (N) at 5'/3' termini (--trimns)
  • trim consecutive stretches of low-quality (<=2) bases at the termini (--trimqualities)
  • discard reads shorter than 30bp following trimming (--minlength 30)

Paired-end reads may be collapsed with AdapterRemoval2 using the arguments --collapse, --collapse-deterministic or --collapse-conservatively in the parameter cleaning:params_adapterremoval. Paired-end reads may also be merged with fastp using the arguments -m (has to stand alone) or --merge in the parameter cleaning:params_fastp. Mapache automatically detects these collapsing arguments and will retain just the collapsed pairs (usually the majority of the dataset) and will map them to the reference genome; that is, singletons and trashed reads that did not pass the aforementioned filters will not be mapped to the reference genomes. The parameter cleaning:collapse_opt allows to adjust what reads should be retained for mapping.

If you collapse paired-end reads, please have a look at DeDup to remove duplicates (see Removing duplicates)

AdapterRemoval is run for single-end reads as follows

AdapterRemoval ${params} --file1 ${input} --basename ${output%%.fastq.gz} --gzip --output1 ${output}

Fastp is run for single-end reads as follows

fastp --in1 ${input} --out1 ${output.R} --json ${output.json} --html ${output.html} ${params}

The FASTQ cleaning is specified with the following parameters:

cleaning: 
  run: 'adapterremoval'                              ## 'adapterremoval', 'fastp', or False
  params_adapterremoval: '--minlength 30 --trimns --trimqualities'  ## parameters for AdapterRemoval
  params_fastp: ''                                   ## parameters for fastp
  collapse_opt: 'only_collapse'                      ## what to retain when collapsing pairs:
                                                     ##   for AdapterRemoval:
                                                     ##     'only_collapse': collapsed
                                                     ##     'collapse_trunc': collapsed + collapsed_truncated 
                                                     ##     'all': collapsed + collapsed_truncated + R1 + R2 + singleton.truncated
                                                     ##   for fastp:
                                                     ##     'only_collapse': merged
                                                     ##     'all': merged + R1 + R2
  threads: 4                                         ## number of threads
  mem: 4                                             ## memory allocation for queuing system in GB
  time: 2                                            ## runtime allocation for queuing system in hours

Mapping

Mapping to a reference genome is a mandatory step. The default mapping is optimized for ancient DNA using bwa_aln with the argument -l 1024 to disable seeding (Schubert et al., 2012). However, other mappers are available. We show below the commands that are used to map single-end (or collapsed) and paired-end reads with the available mapping softwares (BWA and bowtie2).

  • bwa_aln (default) using the command
    • for single-end reads:
bwa aln ${params} ${ref} -f ${sai} ${fastq}
bwa samse ${bwa_samse_params} -r \"@RG\\tID:${ID}\\tLB:${LB}\\tSM:${SM}\\tPL:${PL}\" ${ref} ${sai} ${fastq}
    • for paired-end reads:
bwa aln ${params} ${ref} -f ${sai1} ${fastq1}
bwa aln ${params} ${ref} -f ${sai2} ${fastq2}
bwa sampe ${bwa_sampe_params} -r \"@RG\\tID:${ID}\\tLB:${LB}\\tSM:${SM}\\tPL:${PL}\" ${ref} ${sai1} ${sai2} ${fastq1} ${fastq2}
bwa mem ${bwa_mem_params} -R \"@RG\\tID:${ID}\\tLB:${LB}\\tSM:${SM}\\tPL:${PL}\" ${ref} ${fastq(s)}

The values in ID, LB, and SM are automatically extracted from the corresponding columns in the sample file. The PL is specified by the parameter pl. This is useful, for example, if in the future you need to split your BAM files by library type with samtools.

  • bowtie2 using the command
    • for single-end reads:
bowtie2 ${bowtie2_params} -x ${ref} -U ${fastq}
    • for paired-end reads:
bowtie2 ${bowtie2_params} -x ${ref} -1 ${fastq1} -2 ${fastq2}

Mapping is specified with the following parameters:

mapping:
  mapper: bwa_aln                                ## mapper (available: bwa_aln, bwa_mem, bowtie2)
  bwa_aln_params: '-l 1024'                      ## parameter for 'bwa aln'
  bwa_samse_params: '-n 3'                       ## parameter for 'bwa samse' (single-end libraries)
  bwa_sampe_params: '-n 3'                       ## parameter for 'bwa sampe' (paired-end libraries)
  bwa_mem_params: ''                             ## parameter for 'bwa mem'
  bowtie2_params: ''                             ## parameter for 'bowtie2'
  pl: 'ILLUMINA'                                 ## the platform to specify in the RG
  threads: 4                                     ## number of threads
  mem: 16                                        ## memory allocation for queuing system in GB
  time: 30                                       ## runtime allocation for queuing system in hour

Sorting

Mapped BAM files are always sorted using samtools with the command

samtools sort --threads ${threads} ${input} > ${output}

Sorting is specified with the following parameters:

sorting:
  threads: 4                                     ## number of threads
  mem: 16                                        ## memory allocation for queuing system in GB
  time: 10                                       ## runtime allocation for queuing system in hours

Filtering (optional)

It is common practice to keep only high-quality mapped reads in a final BAM file. By default, unmapped and low-quality reads are saved in a separate BAM file (save_low_qual: True) in case you want to use them for other type of analyses, such as investigating the metagenomic composition of the sample.

In brief, filtering is done with samtools using the command

## if filtering:mapq is specified:
samtools view -b -F 4 -q ${mapq} -U ${out_low_qual} ${input} > ${out_mapped}

## if filtering:params is specified:
samtools view -b ${params} -U ${out_low_qual} ${input} > ${out_mapped}

Just as mentioned above, this command will save reads with a mapping quality equal or higher than mapq (indicated in the sample file) to {out_mapped}, and use samtools to store (-U ${out_low_qual}) unmapped reads (-F 4) and reads with lower mapping qualities (-q ${mapq}) to a separate BAM file.

Filtering is specified with the following parameters:

filtering:
  run: True                                      ## perform filtering or not (if `False`, the final bam file 
                                                 ##   contains all alignments, including not mapping reads)
  mapq: 30                                       ## filtering threshold (parameter -q in samtools)
  params: ''                                     ## any parameters to samtools (only 'params' OR 'mapq' may be specified)
  save_low_qual: True                            ## if filtering, shall the not mapped reads and low quality  
                                                 ##   alignments be saved in a separate bam file? 
  threads: 4                                     ## number of threads
  mem: 16                                        ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Merging 'fastq' to library

Merging of all bam files of a library (same name in LB and SM).

Merging is done using samtools with the command

samtools merge -f ${output} ${inputs}

Merging is specified with the following parameters (merging libraries to sample is specified using the same parameters (see below)):

merging:
  threads: 4                                     ## number of threads
  mem: 16                                        ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Removing duplicates (optional)

A characteristic of ancient DNA is its scarcity. Consequently the chance to resequence the same molecule is quite large leading to unwanted duplicated sequences. This module allows detecting and removing such duplicated sequences. By default MarkDuplicates from Picard is used with the command

picard MarkDuplicates --INPUT ${input} --OUTPUT ${out_bam} --METRICS_FILE ${out_stats} \
            ${params} --ASSUME_SORT_ORDER coordinate --VALIDATION_STRINGENCY LENIENT

Alternatively, DeDup may be used which was especially developed for collapsed paired-end reads in ancient DNA using the command

dedup -i ${input} ${params}  -o $(dirname $bam)

Removing duplicates is specified with the following parameters:

remove_duplicates:
  run: markduplicates                            ## should duplicates be inferred/removed?
                                                 ##   (markduplicates, dedup, False)
  params_markduplicates: '--REMOVE_DUPLICATES true'   ## parameters for markDuplicates
  params_dedup: '-m'                             ## parameters for dedup ('-m/--merged' is 
                                                 ##   automatically removed for single-end reads)
  threads: 4                                     ## number of threads
  mem: 16                                        ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Rescaling damage (optional)

Reads originated from ancient samples present a high frequency of apparent C to T and G to A substitutions at the read termini, with specific patterns that vary depending on the protocol used to build the libraries.

An option to deal with these deamination patterns is to reduce the base qualities of the likely affected bases. You can do so by rescaling the qualities with mapDamage2, although this option is not run by default in mapache.

mapDamage -i ${input} -r ${ref} -d $(dirname ${output})_results ${params} --rescale \
      --merge-reference-sequences --rescale-out ${output};

Rescaling damage is specified with the following parameters:

damage_rescale:
  run: False                                     ## should the bam file be rescaled?
  params: ''                                     ## parameters for mapDamage
  threads: 1                                     ## number of threads (only single threaded)
  mem: 8                                         ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Trimming alignments (optional)

Alignments may be trimmed at the reads end in order for example to reduce the effect of the damage. Trimming is done with BamUtil with the command

bam trimBam ${input} ${output} ${params}

Following parameters are available:

bamutil:
  run: False
  mem: 4                                         ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Merging libraries to sample

Merging of libraries of the same sample (same name in SM).

Merging is done using samtools with the command

samtools merge -f ${output} ${inputs}

Merging is specified with the following parameters (merging 'fastq' to library is specified using the same parameters (see above)):

merging:
  threads: 4                                     ## number of threads
  mem: 16                                        ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Realigning indels (optional)

Taking the information across alignments allows correcting for mis-alignments (indels) at the edges of the reads. Realigning indels is done using IndelRealigner from GATK v3.8 using the command

GenomeAnalysisTK -T RealignerTargetCreator -I ${input} -R ${ref} -o ${intervals} ;
GenomeAnalysisTK -T IndelRealigner -I ${input} -R ${ref} -targetIntervals ${intervals} -o ${output}

Realigning indels is specified with the following parameters:

realign:
  run: True                                      ## should GATK IndelRealigner be run
  threads: 1                                     ## number of threads (only single threaded)
  mem: 8                                         ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Recomputing md flags (optional)

This step allows to recompute the md flag of each alignment to avoid downstream analyses problems. Incorrect md flags may occur by modifications to the bam file in previous steps (such as realignment around indels) without adjusting the md flag and may lead to issues when using this BAM file with certain programs. The md flag is recomputed with samtools using the command

samtools calmd ${input} ${ref} | samtools view -bS - > ${output}

Recomputing md flag is specified with the following parameters:

run_compute_md:
  run: True                                      ## should samtools calmd be run [True]
  threads: 4                                     ## number of threads (only single threaded)
  mem: 8                                         ## memory allocation for queuing system in GB
  time: 2                                        ## runtime allocation for queuing system in hours

Statistics

Mapping statstics

Mapache computes several summary statistics during the mapping process. The following tables in csv format are generated:

results/04_stats/03_summary/                     ## path to the files
├── FASTQ_stats.csv                              ## statistics for each fastq file and gneome
├── LB_stats.csv                                 ## statistics for each library and gneome
├── SM_stats.csv                                 ## statistics for each sample and gneome

In addition to the csv tables, some statistics at the sample level are plotted in svg format.

results/04_stats/04_plots/                       ## path to the files
├── 1_nb_reads.svg                               ## number of raw reads
├── 2_mapped.svg                                 ## number of mapped reads (unique and duplicated)
├── 3_endogenous.svg                             ## endogenous content (unique reads)
├── 4_duplication.svg                            ## % of duplicated reads
├── 5_AvgReadDepth.svg                           ## average read depth (unique reads)
├── 6_Sex.svg                                    ## genomic sex assignment

The frollowing parameters allow fine-tuning the plots. If you are mapping a large number of samples or to a large number of genomes, the default plotting settings might need some adjustments in order to improve the readability of the plots.

stats:
  plots:
    x_axis: auto                                 ## "auto" (default), "sample" or "genome"
                                                 ## Which value should be included as the variable 
                                                 ## on the x-axis? E.g.,
                                                 ## sample vs DoC (and bars colored by sample name)
                                                 ## genome vs DoC (and bars colored by genome name)
                                                 ## We recommend ("auto"):
                                                 ##    "sample" if n_samples > n_genomes 
                                                 ##    "genome" if n_genomes > n_samples
    n_col: 1                                     ## number of panels per row if multiple samples 
                                                 ##    and genomes are used
    width: 18                                    ## width of the plot
    height: 12                                   ## height of th plot
    color: "#f2ad78"                             ## color of the bars
    show_numbers: 15                             ## show the numbers in the plot if the number of items
                                                 ##    (samples/genomes) is below the given threshold
    sex_ribbons: 'c("XX"="#7e3075", "XY"="#003e83")'
                                                 ## ribbon coloring for plot 6_Sex.svg. Expected is a 
                                                 ## vector of 2 colors (R format). Names must much
                                                 ## 'sex_inference:threshold' names.

Reports

Mapache can produce several reports.

  • qualimap (per sample and genome): Qualimap reports statistics on single bam file. In mapache qualimap is run on all final bam files.
  • multiqc (per genome): MultiQC combines log files of several tools, such as fastqc, samtools stats, AdapterRemoval, qualimap. It is a good resource to get detailed information of indvidual steps if something went wrong. It also includes the qualimap reports.
  • snakemake (per run): This is the most complete report, including the the mutliqc report, the mapping statistic tables and plots (see Mapping statistcs).) and also runtime information. It has to be generated manually after a successful run with the command snakemake --report report.html.

The reports may be enabled as follows. Depending on the data valume the memory and/or runtime have to be adjusted when run on a cluster.

stats:
  fastqc_time: 4                                 ## runtime specification for fastqc
    
  qualimap: False                                ## should the qualimap report be generated?
  qualimap_mem: 8                                ## memory allocation for qualimap
    
  multiqc: False                                 ## should the multiqc report be generated?

Analyses on final bam file

Damage (optional)

The damage (fragment length and deamination pattern of the mapped reads) may be inferred for each library using mapDamage or an extended version of bamdamage. To re-scale the base qualities based on the deamination pattern please see Rescaling damage).

The damage inference is parametrized with the following parameters:

damage:
    ## damage statistics (read length, deamination rates)
  run: False                                     ## "False", "bamdamage", "mapDamage"
  bamdamage_fraction: 0                          ## fraction/number of alignments to consider
                                                 ##     0: use all alignments (default)
                                                 ##    <1: fraction of alignments to consider 
                                                 ##   >=1: absolute number of alignments to consider
  bamdamage_params: ''                           ## parameters to pass to bamdamage (see below)

Bamdamage was extended to be able to subsample the bam file and to define the plot area for the damage plot. Using all alignments is time consuming and often already a small fraction of the alignments is sufficient to get representative plots. The second modification allows narrowing down the number of bases from each end of the fragment to visualize in the damage plot. The parameters for the extended bamdamage are

  --mapquality|-m=i                        skip reads with mapping quality smaller than qual (30)
  --basequality|-b=i                       skip bases with base quality smaller than qual (20)
  --rlength|-r=i                           assumes reads are at most length (100)
  --sample|-s=s                            use name for the legend in plots (default is the file name)
  --outputfile|--output|-o=s               filename of damage plot (foo.dam.pdf)
  --outputfile_length|--output_length|-O=s filename of length plot (foo.length.pdf)
  --nth_read=i                             subsample: consider only every nth alignment (1; all)
  --plot_length=i                          damage plot: number of bases to plot from each end (25)
  --help|-h                                show this help
  --debug|-D                               write debug information
  --verbose|-v                             verbose output
  --version|-V                             show version

Please note, that in mapache the specification of the subsampling (see above) is different to the one using bamdamagedirectly. By default bamdamgeis returning the length distribution for reads up to 100bp (parameter --rlength) and longer reads are cropped. The longer the specified max read length is, the longer the execution of bamdamage takes. Thus, it is wise to set it as small as possible (e.g., initial read length of Illumina sequences), but also as long as needed (e.g., collapsed reads).

Depth of coverage for specific chromosomes (optional)

By default, the LB_stats.csv and SM_stats.csv files report the average genome depth of coverage (DoC). this section allows reporting the depth of coverage on a specific chromosome(s).

The csv files will then have extra columns with the name of the chromosomes selected, prefixed with "depth_".

depth:
  hg19:                                          ## genome name (has to defined under 'genome')                         
    run: False                                   ## compute DoC of specified chromosomes
    chromosomes: ["X", "Y", "MT"]                ## list of chromosome names

Please make sure to pass the correct names of the chromosomes as specified in the reference FASTA file, otherwise mapache will not run until you fix this error.

Sex inference (optional)

Any human reference genome, but also any other reference genome in FASTA format may be specified with or without sex-specific chromosomes. To account for this flexibility several parameters need to be specified in order to compute underlying summary statistics (e.g. determine the genomic sex). Mapache recognizes automatically the two human reference genomes GRCh37and GRCh38. For all other reference genomes, the assignments to autosome, sex and mt chromosomes have to be made manually if the inference of the genomic sex is desired.

To infer sex, we follow the method presented by Mittnik et al. (2016) for humans. This method computes the average ratio of the normalised depth of the X chromosome to the normalised depth on the autosomes. Thus, for females, a ratio of 1 is expected, and for males a ratio of 0.5 is expected. Assuming this ratio can be approximated by a normal distribution, a 95% confidence interval is generated by taking ~1.96 standard errors of the mean. The sex is then assigned depending on the values spanned by this confidence interval (parameter thresholds).

We reimplemented the code to adapt it to organisms with a different chromosomal sex-determination systems, including XX/XY, ZW/ZZ, XX/XO and ZZ/ZO.

As the method relies on the number of reads mapped to the homogametic sex chromosome and to the autosomes, and the naming of chromosomes varies across reference genomes (i.e., 1,2,..,22,X,Y vs chr1,chr2,...,chr22,chrX,chrY), the parameters specific to sex inference have to be declared.

Please make sure to pass the correct names of the autosomes and the sex chromosome as specified in the reference FASTA file, otherwise mapache will not run until you fix this error.

Thus, assuming that we are mapping to two different reference genome, the parameters should be specified as follows.

sex_inference: 
  GRCh37:                                        ## genome name (has to defined under 'genome')
    run: True                                    ## should the sex_inference be run for this genome?
    autosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
                                                 ## a list of autosome chromosome names 
    sex_chr: X                                   ## the homogametic sex
    system: XY                                   ## either of XY, ZW, XO or ZO
    signif: 0.95                                 ## confidence interval wide
    thresholds: 'list("XX" = c(0.8, 1), 
                      "XY" = c(0, 0.6), 
                      "consistent with XX but not XY" = c(0.6, 1), 
                      "consistent with XY but not XX" = c(0, 0.8))'    
                          ## These are the thresholds used to assign molecular sex, in R syntax.
                          ## For humans, we expect 
                          ##   females: DoC(X)/DoC(genome) = 1
                          ##   males:   DoC(X)/DoC(genome) = 0.5
                          ## Note that sex will be assigned by considering the confidence intervals
                          ## around the estimate for such ratio, which can be a bit lower or 
                          ## higher than 1 or 0.5.
                          ## This option can be commented as the sex script comes with default 
                          ## values for different sex systems.  

Autodetection of GRCh37 and GRCh38

Mapache recognizes the two human reference genomes GRCh37 and GRCh38 automatically based on the chromosome names, and assigns automatically the chromosomes to autosome, sex and mt chromosomes. It assumes the folling names:

  • GRCh37: 1, 2, ..., 22 ,X ,Y ,MT
  • GRCh38: chr1, chr2, ..., chr22, chrX, chrY, chrM

Therefore, for these two genomes the parameters autosomes and sex_chr most not be set manually. For other reference genomes, these parameters have to be set manually. If the specified chromosome names are not present in the reference genome, no sex inference is possilbe and an error is drawn.

Autosomes

The autosomes should be listed as a list, e.g.

  • autosomes: '["chr1", "chr2", "chr3"]'.

This can be tedious if you have many more chromosomes. We recommend using a python terminal to obtain the correct lit of chromosome names, by executing for example the code below:

  • list(range(1,23)) or
  • [f"chr{x}" for x in range(1,23)] if you need the chromosomes with the chr prefix

Sex thresholds

As mentioned at the beginning, we can use this method to assign sex to other sex systems. We have pre-defined thresholds for each systems, but you might change these thresholds according to your needs. Mapache uses the default thresholds proposed by Mittnik et al. (2016) for humans. You can edit them by passing a list with R syntax to the parameter thresholds. Note that each name contains the "sex" that will be reported. For example, if you wish to report Female instead of XX).

In this specific example (default values), if the 95% confidence interval falls within

  • 0.8 and 1, it returns XX
  • 0 and 0.6, it returns XY
  • 0.6 and 1, it returns "consistent with XX but not XY"
  • 0 and 0.8, it returns "consistent with XY but not XX"

Imputation (optional)

Low-coverage imputation has as input genotype likelihoods for variable sites in a reference panel.

In the first step, we generate genotype likelihoods at the sites of the reference panel. Then, we proceed to impute. Besides the input data, for this step we need:

  1. path_panel: the reference panel (vcf file; e.g., 1000 Genomes)
  2. path_map: the genetic map that contains the recombination rates across the human genome in the coordinates of the genome build used to map the data

We can assess how confidently each site was imputed using the genotype probabilities (GP), that are contained in the output vcf/bcf files (FORMAT/GP). For any biallelic site, a GP value for each three possible genotypes (0, 1 and 2) is reported with values ranging between 0 and 1. They sum to 1 across the three genotypes. Mapache outputs a histogram with the resulting maximum GP for all imputed sites as well as the average maximum GP. Mapache allows filtering on the GP value (parameter ‘GP_filter’), which is a trade-off between imputation accuracy and loss of sites. Stricter thresholds for this filtering result in a greater loss of sites containing the alternative allele, which is particularly noticeable at lower depths of coverage (e.g., 0.1x). As default, mapache applies a GP filter of 0.8. In any case, we recommend removing rare variants, at least with minor allele frequency (MAF) below 1%, as these are less accurately imputed.

imputation:
    hg19:                                        ## genome name (has to defined under 'genome')
        run: False                               ## should imputation be run
        gp_filter: [0.8]                         ## gp filter

        ## path_map and path_panel may specify a single file OR may contain the variable '{chr}' 
        ## to specify it per chromosome 
        path_map: "path_to_map/chr{chr}.b37.gmap.gz"
        path_panel: "path_to_panel/ALL.chr{chr}.genotypes.vcf.gz"

        chromosomes: [20,21]                     ## list of chromosomes to impute

        glimse_chunk_params: ""                  ## parameters for GLIMPSE chunk
        glimse_phase_params: ""                  ## parameters for GLIMPSE phase

Variable parameter

Mapache supports library specific settings. In general, defined settings are common across all fastq files and libraries, e.g.:

cleaning:
   run: 'adapterremoval'
   params_adapterremoval: '--minlength 30 --trimns --trimqualities'

To use fastq/library specific settings, you may allocate each fastq file to a key in the sample file (using a new column named Group) and refer to this key in the config file (using a nested dictionary). A key may be any text. For example, let's assume that library a_L2 is UDG treated and libraray b_L2 is not. To use UDG specific settings, we first have to define the key in the sample file (column Group; the keys 'UDG' and 'nonUDG' are freely chosen):

SM          LB        ID           Data                         Group
ind1        a_L2      a_L2_R1_001  reads/a_L2_R1_001.fastq.gz   UDG
ind1        a_L2      a_L2_R1_002  reads/a_L2_R1_002.fastq.gz   UDG
ind1        b_L2      b_L2_R1_001  reads/b_L2_R1_001.fastq.gz   nonUDG
ind1        b_L2      b_L2_R1_002  reads/b_L2_R1_002.fastq.gz   nonUDG

Now we can specfiy UDG specific settings in the config file using a nested dictionary, for example by using different trimming lengths at the 5'/3' termini (2 bases for 'UDG' and 6 bases for 'nonUDG' libraries):

cleaning:
   run: 'adapterremoval'
   params_adapterremoval: 
      UDG: '--trim5p 2 --trim3p 2 --minlength 30 --trimns --trimqualities'
      nonUDG: '--trim5p 6 --trim3p 6 --minlength 30 --trimns --trimqualities'

Above is a simple case. You may extend the variable setting as follows:

  • default: If a parameter contains variable setting (a nested dictionary), but the specified key in the sample file is not defined in the config file, then the default parameter setting is used. One may override the default parameter setting, using 'default' as key in the config file. Like this not all fastq files in the sample file have to contain a key (key 'strange' in the example below).

  • reuse of key: The same key may be used for different parameters.

  • multiple keys: The key allows grouping fastq files. It is possible to use multiple keys to define different groups. Multiple keys are seperated by a comma.

Below is a more complex scenario:

Sample file:

SM          LB        ID           Data                         Group
ind1        a_L2      a_L2_R1_001  reads/a_L2_R1_001.fastq.gz   index8,UDG
ind1        a_L2      a_L2_R1_002  reads/a_L2_R1_002.fastq.gz   index8,UDG
ind1        b_L2      b_L2_R1_001  reads/b_L2_R1_001.fastq.gz   index7,nonUDG,strange
ind1        b_L2      b_L2_R1_002  reads/b_L2_R1_002.fastq.gz   index8,nonUDG

Config file:

cleaning:
    run: ‘adapterremoval’
    params_adapterremoval: 
      index8: '–adapter1 ${adapter1_8b} –adapter2 ${adapter2_8b}'
      index7: '–adapter1 ${adapter1_7b} –adapter2 ${adapter2_7b}'

filtering:
    run: True
    mapq:  
      strange: '25'
      default: '30'

bamutil:
    run: True
    params:
      UDG: '2'
      noUDG: '6'

Please note, that the listed adapters above (e.g, '${adapter1_8b}') are used as placeholders for an adapter sequence to increase the readability. YAML does not allow such substitutions, the adapter sequences have to be writen. Keys defined for parameter settings of tools used at the library or sample level, have to be identical for each fastq file within a library and sample, respectively (othwise an error is drawn).

In the eaxample above, the fastq files will have the following parameter settings

  • fastq files a_L2_R1_001 and a_L2_R1_002:
    • cleaning: '–adapter1 ${adapter1_8b} –adapter2 ${adapter2_8b}'
    • filtering: '30'
    • bamutil: '2'
  • fastq file b_L2_R1_001:
    • cleaning: '–adapter1 ${adapter1_7b} –adapter2 ${adapter2_7b}'
    • filtering: '25'
    • bamutil: '6'
  • fastq file b_L2_R1_002:
    • cleaning: '–adapter1 ${adapter1_8b} –adapter2 ${adapter2_8b}'
    • filtering: '30'
    • bamutil: '6'

Variable parameters are not avaible for all parameters. The following parameters support variable settings (nested dictionary):

## FASTQ LEVEL
subsampling:
    run: False
    params: '-s1'
    number: 1000
 
cleaning:
    run: ‘adapterremoval’
    params_adapterremoval: '' 
    params_fastp: ''
    collapse_opt: only_collapse
 
mapping:
    bwa_aln_params: '-l 1024'   
    bwa_samse_params: '-n 3'
    bwa_sampe_params: ''
    bwa_mem_params: ''
    bowtie2_params: ''
    platform: 'ILLUMINA'
 
filtering:
    params: ''
    mapq: 30
    save_low_qual: True

## LIBRARY LEVEL
remove_duplicates:
    params_markduplicates: '--REMOVE_DUPLICATES true'
    params_dedup: '-m'
 
damage_rescale:
    run: False
    params: ''
 
bamutil:
    run: False
    params: '3'
 
## SAMPLE LEVEL
realign:
    run: True
 
compute_md:
    run: True

Job resubmission

On an HPC system, it may happen that a job is killed due to memory or runtime limits. Snakemake does not only clean up the space of killed/unfinished jobs, but also allows relaunching them automatically. At such a relaunch one may increase automatically the memory or runtime allocation. This is very handy if the input file sizes are highly variable (wath is the normal in aDNA). For example, if most libraries are of small size and only a few large, it would be a pitty to allocate huge resources to all libraries and then spend most of the time waiting in the queue. It may be more efficient to allocate less resources (runtime and memory) which allows for most of the libraries to map successfully. The larger libraries will then fail due to lack of resources and be relaunched automatically with more resources. The parameters below allow defining the increase of resources at a relaunch.

memory_increment_ratio: 1                        ## memory increment
runtime_increment_ratio: 0                       ## runtime increment

The number specifies the ratio in relation ot the initial allocation which is added at each relaunch:

  • 0: no change at relaunch
  • 1: adding the inital allocation to the previous one at each relaunch

Software

Java tools

Picard and GATK are Java programs. To launch them the .jar file has to be known. If you have used the conda installation, the following settings should be fine.

software:
    picard_jar: 'picard'
    gatk3_jar: 'GenomeAnalysisTK'

Environment modules

The underlying software used by mapache can be provided in different ways. The easiest way is to use global or rule specifc conda environment(s). However, it is also possible to use pre-installed software available via 'module', which is a widly used feature on HPC systems. Using the 'module' feature can be invoced by launching mapache with the command line parameter --use-envmodules. The parameters below allow defining how the different tools are loaded on your computer infrastucture. E.g., if you have to load samtools as module add samtoolsthen you shoud state it in the config file as

envmodules:
    samtools:       "samtools"

The following tools may be defined by enfironment modules:

envmodules:
    samtools:       "gcc samtools/1.12"
    bowtie2:        "gcc bowtie2/2.4.2"
    bwa:            "gcc bwa/0.7.17"
    picard:         "gcc picard/2.24.0"
    gatk3:          "gcc gatk/3.8-1"
    fastqc:         "gcc fastqc/0.11.9"
    r:              "gcc r/4.0.4"
    adapterremoval: "gcc adapterremoval/2.3.2"
    mapdamage:      ""
    bedtools:       "gcc bedtools2/2.29.2"
    seqtk:          ""
    qualimap:       ""
    multiqc:        ""
    glimpse:        ""
    bamutil:        ""