-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
286 lines (239 loc) · 11.4 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
---
title: "Data procesing on PHASTER output dataset"
output: github_document
bibliography: packages.bib
nocite: '@*'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=FALSE, comment= " ", fig.path="Figs/", dev = c("png","pdf"))
# knitr::write_bib(c(.packages(), "knitr", "rmarkdown", "tidyverse"), "packages.bib")
knitr::write_bib(renv::dependencies()$Package, file = "packages.bib")
```
<a href="https://zenodo.org/badge/latestdoi/293570188"><img src="https://zenodo.org/badge/293570188.svg" alt="DOI"></a>
------------------------------------------------------------------------
This repo hosts the complete PHASTER dataset from the 59 strains' genomes analyzed on PHASTER and several scripts developed for its processing. Analysis of this processed data is provided in an on preparation manuscript. The primary packages used here are in the Tidyverse set of libraries, the rmarkdown and knitr packages were used to generate this report [@R-knitr; @R-rmarkdown; @tidyverse2019].
```{r libraries, include=FALSE, warning=FALSE, message=FALSE}
library(readr)
library(magrittr)
library(tibble)
library(dplyr)
library(stringr)
library(kableExtra)
library(tidyr)
library(ggplot2)
library(forcats)
library(datapasta)
library(citr)
```
# Data layers for the phylogenetic recosntruction (Fig 2.B-D) of **Bacillus** spp.
## Data importing and processing
The first step in this analysis is importing the summarized output that was created using the `populate_script()` after selecting each prophage prediction using de `datapasta` package [@R-datapasta].
<details open>
<summary>Predicted prophages retrieved on 2021-02-19</summary>
```{r dataset-importing, warning=FALSE, message=FALSE}
df_summaries <- read_tsv("Data/Raw/2021-02-19_PHASTER-summaries.tsv")
```
<p>...</p>
</details>
## Fig 1B. Generating the number of biosynthetic gene clusters
Data from the the Biosynthetic Gene Clusters (BGCs) were retrieved for each species from different servers including antiSMASH v5.1.2 and a manual curation.
## Fig 1C. Generating candidate prophages organized by completeness (Incomplete, Questionable, Intact)
<details open>
<summary> Prophage completeness code. </summary>
```{r prophage-completeness, message=FALSE}
by_completeness_raw <- df_summaries %>%
group_by(species,genome) %>%
summarise(
Incomplete = sum(str_count(completeness, "incomplete")),
Questionable = sum(str_count(completeness, "questionable")),
Intact = sum(str_count(completeness, "intact"))
)
write_tsv(x = by_completeness_raw, "Data/Raw/completeness-01.tsv")
```
<p>...</p>
</details>
## Fig 1D. Generating the total number of Restriciton modification systems per bacterial species
<details open>
<summary> RM system count import. </summary>
```{r Rm-proteins, message=FALSE}
rm_raw <- read_csv("Data/Raw/RMsystems-01.csv")
```
<p>...</p>
</details>
Finally all data was processed by adding extra information necessary for applying as an extra layer of phylogenetic information:
```{zsh data-processed, echo = FALSE}
tree Data/Processed
```
For instance the BGCs data layer has the following information:
```{zsh head-processed, echo = FALSE}
head -11 Data/Processed/BGCs-iTOL-04.txt
```
# Fig S1. Bubbleplot of the intact prophages in each bacterial species.
For the purpose of wrap viral species into families we created this table assigning the current virus on the data set to its associated family:
<details close>
<summary> Prophage families population. </summary>
```{r phage-families, message=FALSE}
phage_family <- tribble(
~most_common_phage, ~phage_family,
'Brevibacillus Jimmer1' , 'Myoviridae',
'Brevibacillus Jimmer2' , 'Myoviridae',
'Brevibacillus Abouo', 'Myoviridae',
'Brevibacillus Jenst' , 'Siphoviridae',
'Brevibacillus Osiris', 'Myoviridae',
'Paenibacillus HB10c2' , 'Siphoviridae',
'Paenibacillus Harrison' , 'Siphoviridae',
'Paenibacillus Tripp' , 'Siphoviridae',
'Deep s_D6E' , 'Myoviridae',
'Listeria 2389' , 'Siphoviridae',
'Listeria A006' , 'Siphoviridae',
'Listeria vB_LmoS_188', 'Siphoviridae',
'Listeria B054', 'Myoviridae',
'Thermus phi OH2' , 'Myoviridae',
'Geobacillus E3', 'Siphoviridae',
'Geobacillus GBSV1' , 'Myoviridae',
'Staphylococcus vB_SepS_SEP9' , 'Siphoviridae',
'Staphylococcus SPbeta_like' , 'Siphoviridae',
'Clostridium phiCT9441A' , 'Myoviridae',
'Clostridium phiCD27' , 'Myoviridae',
'Clostridium phiMMP02' , 'Myoviridae',
'Clostridium phiCD505', 'Myoviridae',
'Clostridium phiCT453A', 'Myoviridae',
'Clostridium phiSM101' , 'Siphoviridae',
'Clostridium c_st', 'Siphoviridae',
'Clostridium phiCD111', 'Siphoviridae',
'Clostridium phiCD211', 'Siphoviridae',
'Bacillus SPP1' , 'Siphoviridae',
'Bacillus phi105' , 'Siphoviridae',
'Bacillus virus 1' , 'Myoviridae',
'Bacillus Bam35c' , 'Tectiviridae',
'Bacillus phBC6A51' , 'Siphoviridae',
'Bacillus phBC6A52' , 'Siphoviridae',
'Bacillus PM1' , 'Siphoviridae',
'Bacillus WBeta' , 'Siphoviridae',
'Bacillus phIS3501' , 'Siphoviridae',
'Bacillus BtCS33' , 'Siphoviridae',
'Bacillus SPbeta' , 'Siphoviridae',
'Bacillus phi4J1' , 'Siphoviridae',
'Bacillus BalMu1' , 'Myoviridae',
'Bacillus AR9', 'Myoviridae',
'Bacillus Bam35c', 'Tectiviridae',
'Bacillus Bobb' , 'Herelleveridae',
'Bacillus BtCS33', 'Siphoviridae',
'Bacillus Eyuki' , 'Herelleveridae',
'Bacillus G', 'Myoviridae',
'Bacillus IEBH', 'Siphoviridae',
'Bacillus JBP901', 'Herelleveridae',
'Bacillus JL', 'Herelleveridae',
'Bacillus Palmer', 'Podoviridae',
'Bacillus PBS1', 'Myoviridae',
'Bacillus PfEFR_5', 'Siphoviridae',
'Bacillus phiCM3', 'Siphoviridae',
'Bacillus phiNIT1', 'Herelleveridae',
'Bacillus Pony', 'Herelleveridae',
'Bacillus SP_10', 'Herelleveridae',
'Bacillus vB_BhaS_171', 'Siphoviridae',
'Bacillus Waukesha92' , 'Siphoviridae',
'Bacillus Finn', 'Siphoviridae',
'Bacillus Fah', 'Siphoviridae',
'Enterobacteria phi92', 'Myoviridae',
'Enterococcus phiEf11', 'Siphoviridae',
'Enterobacteria vB_KleM_RaK2', 'Myoviridae',
'Lactobacillus JCL1032', 'Siphoviridae',
'Lactobacillus phiAT3', 'Siphoviridae',
'Planktothrix PaV_LD', 'Siphoviridae',
'Sphingomonas PAU', 'Myoviridae',
'Cellulophaga phiSM', 'Myoviridae',
'Staphylococcus SpaA1' , 'Siphoviridae',
'Synechococcus S_SKS1' , 'Myoviridae',
'Escherichia RCS47' , 'Myoviridae'
)
```
<p>...</p>
</details>
```{r phage-families-02, echo=FALSE}
kable(phage_family[1:10,])
```
------------------------------------------------------------------------
The previous table is then joined to the main dataset and complements it. To find the prophage candidates from PHASTER data we evaluated the summary information. In this approach the *most common phage* is directly detected from PHASTER and its criteria is based on adding unidentified prophage proteins to the PHAGE with most proteins already identified using sequence similarity of the non identified proteins.
<details close>
<summary> Intact prophages per bacteria </summary>
```{r intact-phages, fig.height=15, fig.width=12, warning=FALSE, message=FALSE}
intact_phages_summary_cleaned <- df_summaries %>%
separate(most_common_phage, c("most_common_phage", "phage_proteins"), sep = "\\(") %>%
relocate(species) %>%
arrange(species) %>%
mutate(
phage_proteins = str_replace(phage_proteins, "\\)", ""),
most_common_phage = str_remove(most_common_phage, "PHAGE_"),
species = str_replace(species, "Bacillus", "B."),
species = as.factor(species),
phage_proteins = as.numeric(phage_proteins),
most_common_phage = str_replace(most_common_phage, "_", " "),
most_common_phage = str_replace(most_common_phage, "Bacill", "Bacillus"),
most_common_phage = str_replace(most_common_phage, "Brevib", "Brevibacillus"),
most_common_phage = str_replace(most_common_phage, "Clostr", "Clostridium"),
most_common_phage = str_replace(most_common_phage, "Thermu", "Thermus phi"),
most_common_phage = str_replace(most_common_phage, "Paenib", "Paenibacillus"),
most_common_phage = str_replace(most_common_phage, "Bacillus 1", "Bacillus virus 1"),
most_common_phage = str_replace(most_common_phage, "Lister", "Listeria"),
most_common_phage = str_replace(most_common_phage, "Geobac", "Geobacillus"),
most_common_phage = str_replace(most_common_phage, "Staphy", "Staphylococcus"),
most_common_phage = str_replace(most_common_phage, "Entero phi92", "Enterobacteria phi92"),
most_common_phage = str_replace(most_common_phage, "Entero phiEf11", "Enterococcus phiEf11"),
most_common_phage = str_replace(most_common_phage, "Entero vB_KleM_RaK2", "Enterobacteria vB_KleM_RaK2"),
most_common_phage = str_replace(most_common_phage, "Lactob", "Lactobacillus"),
most_common_phage = str_replace(most_common_phage, "Sphing", "Sphingomonas"),
most_common_phage = str_replace(most_common_phage, "Cellul", "Cellulophaga"),
most_common_phage = str_replace(most_common_phage, "Plankt", "Planktothrix"),
most_common_phage = str_replace(most_common_phage, "Synech", "Synechococcus"),
most_common_phage = str_replace(most_common_phage, "Escher", "Escherichia")
) %>%
separate(most_common_phage, c("most_common_phage", "phage_nc_id"), sep = "_NC_") %>%
inner_join(phage_family)
```
<p>...</p>
</details>
------------------------------------------------------------------------
Using the organized data from intact prophages we then plot the best prediction per bacteria and their associated CDSs.
<details close>
<summary> Bubbleplot of intact prophages per bacteria </summary>
```{r bubbleplot-summary, fig.height=15, fig.width=12, warning=FALSE, message=FALSE}
bubbleplot_plot_summary <- intact_phages_summary_cleaned %>%
filter(completeness == "intact") %>%
select(species, most_common_phage, phage_proteins, phage_family) %>%
mutate(epithet = as_factor(word(species,2))) %>%
pivot_wider(names_from = "most_common_phage", values_from = "phage_proteins", values_fill = 0, values_fn = sum) %>%
pivot_longer(cols = -c(species, epithet, phage_family), names_to = "most_common_phage", values_to = "phage_proteins") %>%
filter(phage_proteins > 0) %>%
ggplot(aes(most_common_phage, species, size = phage_proteins, color = phage_family)) +
geom_point() +
facet_grid(rows = vars(epithet), scales = "free", space = "free") +
theme_bw() +
scale_size_area(max_size = 8) +
scale_color_brewer(palette = "Dark2") +
facet_grid(rows = vars(epithet), cols = vars(phage_family), scales = "free", space = "free") +
guides(color = guide_legend(override.aes = list(size=9))) +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 13, face = "italic"),
axis.text.y = element_text(size = 13, face = "italic"),
axis.title.x = element_text(size = 16, face = "bold"),
axis.title.y = element_text(size = 16, face = "bold"),
strip.background = element_blank(),
strip.text.y = element_blank(),
strip.text.x = element_blank(),
legend.position = "right"
) +
labs(
x = "Prophage candidates",
y = "Bacterial species",
size = "CDS hits",
color = "Phage family"
)
# bubbleplot_plot_summary
```
<p>...</p>
</details>
![](Figs/bubbleplot-summary-1.png)
<!-- ## License -->
<!-- <a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img src="https://i.creativecommons.org/l/by/4.0/88x31.png" alt="Creative Commons License" style="border-width:0"/></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>. -->
## References