Ancient bacterial genomes reveal a high diversity of Treponema pallidum strains in early modern Europe

Kerttu Majander, Saskia Pfrengle, Judith Neukamm, Arthur Kocher, Louis du Plessis, Marta Pla-Díaz, Natasha Arora, Gülfirde Akgül, Kati Salo, Rachel Schats, Sarah Inskip, Markku Oinonen, Heiki Valk, Martin Malve, Aivar Kriiska, Päivi Onkamo, Fernando González-Candelas, Denise Kühnert, Johannes Krause, Verena J. Schuenemann

This repository contains the data files, configuration files, log files, tree files and scripts necessary to reproduce the molecular clock dating analyses and figures presented in Majander et al. https://doi.org/10.1101/2020.06.09.142547. Some of the scripts may need some adjustment depending on the local setup.

Abstract

Syphilis is a globally re-emerging disease, which has marked European history with a devastating epidemic at the end of the 15th century. Together with non-venereal treponemal diseases, like bejel and yaws, which are found in subtropical and tropical regions, it currently poses a substantial health threat worldwide. The origins and spread of treponemal diseases remain unresolved, including syphilis’ potential introduction into Europe from the Americas. Here, we present the first genetic data from archaeological human remains reflecting a high diversity of Treponema pallidum in early modern Europe. Our study demonstrates that a variety of strains related to both venereal syphilis and yaws-causing T. pallidum subspecies were already present in Northern Europe in the early modern period. We also discovered a previously unknown T. pallidum lineage recovered as a sister group to yaws and bejel-causing lineages. These findings imply a more complex pattern of geographical distribution and etiology of early treponemal epidemics than previously understood.

Reports

Run the RMarkdown notebooks to generate the reports below:

Temporal signal from root-to-tip regression: reports/TemporalSignal.Rmd
BEAST2 Bayesian DRT (Dataset D only): reports/DateShuffling-D.Rmd
BEAST2 molecular clock dating (Dataset D only): reports/DatingAnalysis-D.Rmd (this report also generates figures of more MCC trees that are not displayed in the report).
Manuscript figures: reports/Figures-D.Rmd

Reports 2-4 can be easily modified for other datasets.
To generate PDF figures simply change the output to "pdf_document" and the device to "pdf".

Dependencies

Biopython, numpy
snp-sites
RAxML
BEAST v2.6
ggplot2, ggtree, treeio, ggsci, ggpubr, coda, cowplot, phytools, phangorn, ape
beastio: Commit #b18caa6

Data

Input data for the molecular clock dating analyses are in data/.

Workflow

The workflow and results are only given for the alignment with genomes mapped to the TPA (Nichols) reference sequence, with recombining and hypervariable genes removed, as well as all sites with >25% missing data. To use a different dataset simply change the input dataset in the workflow (and to avoid confusion all output filenames and directories).

Datasets

Define the following datasets for downstream analyses:

Dataset A: Full dataset (33 genomes)
Dataset B: Remove genomes without stable placement for finding recombining loci: KM14-7 (Kampen) (32 genomes)
Dataset C: Remove genomes problematic for clock-dating due to passaging: KM14-7,NIC2,Nichols (30 genomes)
Dataset D: Remove genomes problematic for clock-dating due to unique SNPs: KM14-7,NIC2,Nichols,94A,94B (28 genomes)
Dataset E: Remove genomes problematic for clock-dating due to underrepresentation: KKM14-7,NIC2,Nichols,94A,94B,BosniaA (27 genomes)
Dataset F: Remove genomes that could also be problematic for clock-dating due to passaging: KKM14-7,NIC2,Nichols,94A,94B,BosniaA,Fribourg (26 genomes)

Only results for dataset D are provided in the manuscript.

Create alignments

Add metadata to alignment files and subsample to create datasets A-F.
Use snp-sites to get the constant site breakdown for each dataset.

mkdir results/alignments/

# Dataset A
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/A/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25

# Dataset B
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/B/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7

# Dataset C
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/C/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7,NIC2,Nichols

# Dataset D
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/D/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7,NIC2,Nichols,94A,94B

# Dataset E
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/E/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7,NIC2,Nichols,94A,94B,BosniaA

# Dataset F
python scripts/processdata.py -i data/treponema_metadata.csv -a data/Nichols_reference/norecomb_nohypervariable/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -s "name,accession,clade,date_lower,date_upper,date" -o results/alignments/F/ -p fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25 -e KM14-7,NIC2,Nichols,94A,94B,BosniaA,Fribourg

# Extract SNP alignments and constant sites
for DATASET in A B C D E F
do 
	snp-sites -o results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas results/alignments/${DATASET}/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas
	snp-sites -C -o results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25_constantsites.csv results/alignments/${DATASET}/fullAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas
	tr ',' ' ' < results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25_constantsites.csv > results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25_constantsites.txt
done

Only use trimmed SNP alignments for subsequent analyses.

Partition files for RAxML

Create partition and constant sites files by hand and save in the directory where RAxML will be run, e.g. results/RAxML/D/ for dataset D.

Partition file contents for Dataset D:

[asc~../../alignments/D/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25_constantsites.txt],ASC_DNA, p1=1-1500

Partition files for other datasets only differ in the path to the filename and the number of sites.

Build trees in RAxML

Use RAxML to build genetic distance trees (run from the directory where RAxML output should be stored, e.g. results/RAxML/D/).
After RAxML finishes midpoint root trees in FigTree and export as displayed with NEXUS block.

# Dataset A
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/A/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree

# Dataset B
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/B/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree

# Dataset C
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/C/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree

# Dataset D
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/D/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree

# Dataset E 
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/E/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree

# Dataset F
raxmlHPC-PTHREADS-AVX -T 2 -m ASC_GTRGAMMA -f a -x 12345 -p 12345 -N autoMRE -s ../../alignments/F/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas --asc-corr=stamatakis -q partition_snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.txt -n snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.tree

BEAST2 molecular clock dating

Create NEXUS files with SNP sequences, sampling dates and tip date priors.

for DATASET in B C D E F
do 

	# Narrow uniform priors
	python scripts/fasta2nexus.py -i results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -o results/alignments/${DATASET}/narrow -d 5 -l 3 -u 4

	# Normally distributed priors
	python scripts/fasta2nexus.py -i results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -o results/alignments/${DATASET}/normal -d 5 -l 3 -u 4 -n

	# Wide uniform priors 
	python scripts/fasta2nexus.py -i results/alignments/${DATASET}/snpAlignment_cov2_cleaned_excluded_real_noRecomb_genesTrimmed_trimmed_0.25.fas -o results/alignments/${DATASET}/wide -d 5 -l 3 -u 4 -L 1000 -U 2016 -w

done

The NEXUS files above can be dragged into BEAUti v2.6 to immediately load the alignment and set tip date priors.
Create the XML files in BEAUti v2.6 and adjust by hand (only the XML files for dataset D are provided on the repository).
Run on remote server using commands below (from the directory where XML files are stored, e.g. results/beast/D/).
Build MCC trees.

# On remote server (from directory with XML files)
mkdir output 

ls *.xml | parallel --delay 1 --jobs 30% --results outdir -I% --max-args 1 ~/BEASTv2.6.0/bin/beast -overwrite -seed 127 % & 		

cd output
mkdir mcctrees

for i in `ls *.trees | cut -f 1,2,3,4 -d "."`; do ~/BEASTv2.6.0/bin/treeannotator -burnin 30 ${i}.trees mcctrees/${i}.MCC.tree; done

Bayesian Date Randomisation Test

Create config files in results/beast/shuffleddates/config/ by hand.
Run Python scripts to produce XML files (only the XML files for dataset D are provided on the repository).
Run on remote server using commands below (from the directory where XML files are stored, e.g. results/beast/shuffleddates/D/).

# On local (from project root)
python scripts/MakeBEASTXML.py -i results/beast/shuffleddates/config/D/

# On remote server (from directory with XML files)
mkdir output

ls *R{0..50}.xml | parallel --delay 1 --jobs 75% --results outdir -I% --max-args 1 ~/BEASTv2.6.0/bin/beast -overwrite -seed 127 % &

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
data		data
reports		reports
results		results
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Ancient bacterial genomes reveal a high diversity of Treponema pallidum strains in early modern Europe

Table of Contents

Abstract

Reports

Dependencies

Data

Workflow

Datasets

Create alignments

Partition files for RAxML

Build trees in RAxML

BEAST2 molecular clock dating

Bayesian Date Randomisation Test

About

Releases 1

Packages

Languages

License

laduplessis/Treponema_pallidum_in_early_modern_Europe

Folders and files

Latest commit

History

Repository files navigation

Ancient bacterial genomes reveal a high diversity of Treponema pallidum strains in early modern Europe

Table of Contents

Abstract

Reports

Dependencies

Data

Workflow

Datasets

Create alignments

Partition files for RAxML

Build trees in RAxML

BEAST2 molecular clock dating

Bayesian Date Randomisation Test

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages