Ancient bacterial genomes reveal a high diversity of Treponema pallidum strains in early modern Europe
Kerttu Majander, Saskia Pfrengle, Judith Neukamm, Arthur Kocher, Louis du Plessis, Marta Pla-Díaz, Natasha Arora, Gülfirde Akgül, Kati Salo, Rachel Schats, Sarah Inskip, Markku Oinonen, Heiki Valk, Martin Malve, Aivar Kriiska, Päivi Onkamo, Fernando González-Candelas, Denise Kühnert, Johannes Krause, Verena J. Schuenemann
This repository contains the data files, configuration files, log files, tree files and scripts necessary to reproduce the molecular clock dating analyses and figures presented in Majander et al. https://doi.org/10.1101/2020.06.09.142547. Some of the scripts may need some adjustment depending on the local setup.
Syphilis is a globally re-emerging disease, which has marked European history with a devastating epidemic at the end of the 15th century. Together with non-venereal treponemal diseases, like bejel and yaws, which are found in subtropical and tropical regions, it currently poses a substantial health threat worldwide. The origins and spread of treponemal diseases remain unresolved, including syphilis’ potential introduction into Europe from the Americas. Here, we present the first genetic data from archaeological human remains reflecting a high diversity of Treponema pallidum in early modern Europe. Our study demonstrates that a variety of strains related to both venereal syphilis and yaws-causing T. pallidum subspecies were already present in Northern Europe in the early modern period. We also discovered a previously unknown T. pallidum lineage recovered as a sister group to yaws and bejel-causing lineages. These findings imply a more complex pattern of geographical distribution and etiology of early treponemal epidemics than previously understood.
Run the RMarkdown notebooks to generate the reports below:
- Temporal signal from root-to-tip regression:
reports/TemporalSignal.Rmd
- BEAST2 Bayesian DRT (Dataset D only):
reports/DateShuffling-D.Rmd
- BEAST2 molecular clock dating (Dataset D only):
reports/DatingAnalysis-D.Rmd
(this report also generates figures of more MCC trees that are not displayed in the report). - Manuscript figures:
reports/Figures-D.Rmd
- Reports 2-4 can be easily modified for other datasets.
- To generate PDF figures simply change the output to "pdf_document" and the device to "pdf".
- Biopython, numpy
- snp-sites
- RAxML
- BEAST v2.6
- ggplot2, ggtree, treeio, ggsci, ggpubr, coda, cowplot, phytools, phangorn, ape
- beastio: Commit #b18caa6
Input data for the molecular clock dating analyses are in data/
.
The workflow and results are only given for the alignment with genomes mapped to the TPA (Nichols) reference sequence, with recombining and hypervariable genes removed, as well as all sites with >25% missing data. To use a different dataset simply change the input dataset in the workflow (and to avoid confusion all output filenames and directories).
Define the following datasets for downstream analyses:
- Dataset A: Full dataset (33 genomes)
- Dataset B: Remove genomes without stable placement for finding recombining loci:
KM14-7
(Kampen) (32 genomes) - Dataset C: Remove genomes problematic for clock-dating due to passaging:
KM14-7,NIC2,Nichols
(30 genomes) - Dataset D: Remove genomes problematic for clock-dating due to unique SNPs:
KM14-7,NIC2,Nichols,94A,94B
(28 genomes) - Dataset E: Remove genomes problematic for clock-dating due to underrepresentation:
KKM14-7,NIC2,Nichols,94A,94B,BosniaA
(27 genomes) - Dataset F: Remove genomes that could also be problematic for clock-dating due to passaging:
KKM14-7,NIC2,Nichols,94A,94B,BosniaA,Fribourg
(26 genomes)
Only results for dataset D are provided in the manuscript.
- Add metadata to alignment files and subsample to create datasets A-F.
- Use
snp-sites
to get the constant site breakdown for each dataset.
mkdir results/alignments/
# Dataset A
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/A/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25
# Dataset B
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/B/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7
# Dataset C
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/C/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7,NIC2,Nichols
# Dataset D
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/D/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7,NIC2,Nichols,94A,94B
# Dataset E
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/E/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7,NIC2,Nichols,94A,94B,BosniaA
# Dataset F
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/F/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7,NIC2,Nichols,94A,94B,BosniaA,Fribourg
# Extract SNP alignments and constant sites
for DATASET in A B C D E F
do
snp-sites -o results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas results/alignments/${DATASET}/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas
snp-sites -C -o results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25_constantsites.csv results/alignments/${DATASET}/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas
tr ',' ' ' < results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25_constantsites.csv > results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25_constantsites.txt
done
- Only use trimmed SNP alignments for subsequent analyses.
- Create partition and constant sites files by hand and save in the directory where RAxML will be run, e.g.
results/RAxML/D/
for dataset D.
Partition file contents for Dataset D:
[asc~../../alignments/D/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25_constantsites.txt],ASC_DNA, p1=1-1500
Partition files for other datasets only differ in the path to the filename and the number of sites.
- Use RAxML to build genetic distance trees (run from the directory where RAxML output should be stored, e.g.
results/RAxML/D/
). - After RAxML finishes midpoint root trees in FigTree and export as displayed with NEXUS block.
# Dataset A
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/A/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree
# Dataset B
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/B/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree
# Dataset C
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/C/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree
# Dataset D
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/D/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree
# Dataset E
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/E/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree
# Dataset F
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/F/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree
- Create NEXUS files with SNP sequences, sampling dates and tip date priors.
for DATASET in B C D E F
do
# Narrow uniform priors
python scripts/fasta2nexus.py -i results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -o results/alignments/${DATASET}/narrow -d 5 -l 3 -u 4
# Normally distributed priors
python scripts/fasta2nexus.py -i results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -o results/alignments/${DATASET}/normal -d 5 -l 3 -u 4 -n
# Wide uniform priors
python scripts/fasta2nexus.py -i results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -o results/alignments/${DATASET}/wide -d 5 -l 3 -u 4 -L 1000 -U 2016 -w
done
- The NEXUS files above can be dragged into BEAUti v2.6 to immediately load the alignment and set tip date priors.
- Create the XML files in BEAUti v2.6 and adjust by hand (only the XML files for dataset D are provided on the repository).
- Run on remote server using commands below (from the directory where XML files are stored, e.g.
results/beast/D/
). - Build MCC trees.
# On remote server (from directory with XML files)
mkdir output
ls *.xml | parallel --delay 1 --jobs 30% --results outdir -I% --max-args 1 ~/BEASTv2.6.0/bin/beast -overwrite -seed 127 % &
cd output
mkdir mcctrees
for i in `ls *.trees | cut -f 1,2,3,4 -d "."`; do ~/BEASTv2.6.0/bin/treeannotator -burnin 30 ${i}.trees mcctrees/${i}.MCC.tree; done
- Create config files in
results/beast/shuffleddates/config/
by hand. - Run Python scripts to produce XML files (only the XML files for dataset D are provided on the repository).
- Run on remote server using commands below (from the directory where XML files are stored, e.g.
results/beast/shuffleddates/D/
).
# On local (from project root)
python scripts/MakeBEASTXML.py -i results/beast/shuffleddates/config/D/
# On remote server (from directory with XML files)
mkdir output
ls *R{0..50}.xml | parallel --delay 1 --jobs 75% --results outdir -I% --max-args 1 ~/BEASTv2.6.0/bin/beast -overwrite -seed 127 % &