Results

Written summary of the results obtained in this repo. This is the results write-up for the paper and includes figures.

02a_Methods.md

After testing bbmap and spades with 4 of the E. coli isolates (#1, #20, #94, #96), I decided to run assemblies at 100X coverage and skip isolate #20. Modified the slurm file from Jules and renamed it SRAassemblyPipeline.FS19C.SLURM_TEMPLATE.

02_Methods.md

BBmap + spades

Assembly and QC checks were successful for all isolates except #20 (skipped).

Files/directories generated (for each isolate if indicated with a '*'):

polishedfasta.txt
*_1.fastq.gz
*_2.fastq.gz
*_pol.fasta
*.slurm
*.names
*_covstats.txt
*.fasta
*_spades_out/

fastQC

Ran all samples through fastqc to use as input for multiqc. I also looked at the fastqc reports for re-sequenced isolates #95 and #96; they look good.

Files generated (for each isolate):

*fastqc.zip
*fastqc.html

MultiQC

Ran multiqc for isolates #1-94; the report shows the sequences are good. Looked at the fastqc reports for re-sequenced isolates #95 and #96 and confirmed they are good.

Files generated:

FS19all_multiqc_report.html
FS19all_multiqc_data directory
FS19_1-94_multiqc_report.html
FS19_1-94_multiqc_data directory
1-H12-96-441FEC_S2_L002_R2_001_fastqc.html
1-H12-96-441FEC_S2_L002_R1_001_fastqc.html
1-H11-95-440FED_S1_L002_R2_001_fastqc.html
1-H11-95-440FED_S1_L002_R1_001_fastqc.html

MDS from fastANI and mash

fastANI

Files generated:

FS19CfastANIoutput.xlsx
FS19CfastANIoutput2.xlsx
fs19cfastanioutput.out
fs19cfastanioutput2.out

mash

Files generated:

distances_thirdrun.tab
FS19Cmashdistances.xlsx

MDS

There weren't any obvious outliers, like what Jules had hinted when he ran the code on his end. At the microbe meeting (22Jan2021), Crystal pointed out that it's good we're seeing some differences between the commensal isolates and the STEC isolates in this plot (for example, isolates collected from the EDL933 group cluster separately from the EDL933 isolate). Regardless of whether we find any differences in metabolic genes, there are other differences we can explore too.

Also ran MDS on the mash-generated distances_thirdrun.tab for 95 isolates + 6 reference strains + 3 non-E. coli isolates. The 3 non-E. coli isolates were, respectively, from the same family, a different family, and outside the Proteobacteria: Salmonella enterica subsp. enterica serovar Typhimurium str. LT2, Campylobacter jejuni subsp. jejuni NCTC 11168, and Clostridium saccharoperbutylacetonicum N1-4(HMT). As expected, the non-E. coli isolates were very distant from the E. coli isolates and reference strains.
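An MDS routine needs a square distance matrix, while pairwise distance tools typically write one row per genome pair. The sketch below shows one way to reshape such rows into a symmetric matrix. It assumes distances_thirdrun.tab follows the usual `mash dist` layout (reference ID, query ID, distance as the first three tab-separated columns); the function name and the example values are made up for illustration and are not taken from qc_mds.R.

```python
import csv
import io


def mash_pairs_to_matrix(lines):
    """Build a symmetric distance matrix from mash-style pairwise rows.

    Assumption: each row is tab-separated, with reference ID, query ID
    and distance in the first three columns.
    """
    dist = {}
    names = []  # keeps first-seen order of genome names
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        ref, query, d = row[0], row[1], float(row[2])
        for name in (ref, query):
            if name not in dist:
                dist[name] = {}
                names.append(name)
        # store both directions so the matrix is symmetric
        dist[ref][query] = d
        dist[query][ref] = d
    # diagonal is 0; pairs never seen in the input become NaN
    matrix = [
        [0.0 if a == b else dist[a].get(b, float("nan")) for b in names]
        for a in names
    ]
    return names, matrix


# Made-up example rows (not real results from this repo)
example = (
    "A.fasta\tB.fasta\t0.01\t0\t800/1000\n"
    "A.fasta\tC.fasta\t0.30\t0\t5/1000\n"
    "B.fasta\tC.fasta\t0.29\t0\t6/1000\n"
)
names, matrix = mash_pairs_to_matrix(io.StringIO(example))
```

The resulting `names` and `matrix` could then be handed to whatever MDS implementation is in use.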

Files generated:

fastANImashMDSheatmaps.pptx
FS19C_fastaniMDS.tiff
FS19C_mashMDS_thirdrun_all.tiff
FS19C_mashMDS_thirdrun_onlyrefgenomes.tiff
qc_mds.R

gifrop

Generated pangenome files. Will need to translate the genes to GO terms to find pathways of interest and determine which STECs and commensals have these pathways and genes.
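Once genes of interest are identified, checking which group carries them is a presence/absence lookup. The sketch below is one possible way to do that, assuming gene_presence_absence.Rtab is tab-separated with a `Gene` column followed by one 0/1 column per isolate. The function name, isolate names and gene names are hypothetical.

```python
import csv
import io


def genes_specific_to(rtab_lines, group_a, group_b):
    """Return genes present in every isolate of group_a and absent
    from every isolate of group_b.

    Assumption: tab-separated input with a 'Gene' column and one
    0/1 column per isolate.
    """
    reader = csv.DictReader(rtab_lines, delimiter="\t")
    hits = []
    for row in reader:
        in_all_a = all(row[s] == "1" for s in group_a)
        in_no_b = all(row[s] == "0" for s in group_b)
        if in_all_a and in_no_b:
            hits.append(row["Gene"])
    return hits


# Made-up table (hypothetical isolate and gene names)
example = (
    "Gene\tstec_1\tstec_2\tcomm_1\tcomm_2\n"
    "group_1\t1\t1\t0\t0\n"
    "group_2\t1\t1\t1\t1\n"
    "group_3\t0\t1\t0\t0\n"
)
stec_only = genes_specific_to(
    io.StringIO(example), ["stec_1", "stec_2"], ["comm_1", "comm_2"]
)
```

Requiring presence in every isolate of a group is strict; a fraction-based threshold may be more useful on real data.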

Files generated:

**pol/ or Ecoli*/
- *_pol.err
- *_pol.faa
- *_pol.ffn
- *_pol.fna
- *_pol.fsa
- *_pol.gbk
- *_pol.gff
- *_pol.log
- *_pol.sqn
- *_pol.tbl
- *_pol.tsv
- *_pol.txt
- proteins.faa
- proteins.pdb
- proteins.pot
- proteins.ptf
- proteins.pto
pan/
- *.gff
- accessory_binary_genes.fa
- accessory_binary_genes.fa.newick
- _accessory_clusters
- _accessory_clusters.clstr
- accessory_graph.dot
- accessory.header.embl
- accessory.tab
- blast_identity_frequency.Rtab
- _blast_results
- _clustered
- _clustered.clstr
- clustered_proteins
- _combined_files
- _combined_files.groups
- core_accessory_graph.dot
- core_accessory.header.embl
- core_accessory.tab
- core_alignment_header.embl
- core_gene_alignment.aln
- core_gene_alignment.aln.reduced
- gene_presence_absence.csv
- gene_presence_absence.Rtab
- gifrop_out/
  - clustered_island_info.csv
  - figures/
    - island_length_histogram.png
    - islands_per_isolate_no_unknowns.png
    - islands_per_isolate.png
    - Number_of_occurances.png
    - Number_of_occurances_secondary.png
  - gifrop.log
  - islands_pangenome_gff.csv
  - my_islands/
    - abricate/
      - All_islands.megares2
      - All_islands.ncbi
      - All_islands.plasmidfinder
      - All_islands.vfdb
      - All_islands.viroseqs
    - island_info.csv
    - All_islands.fasta
  - pan_only_islands.csv
  - pan_with_island_info.csv
  - sequence_data/
    - *.fna
    - *_short.gff
- _inflated_mcl_groups
- _inflated_unsplit_mcl_groups
- _labeled_mcl_groups
- M7lUUryBzC/
  - *.gff.proteome.faa
- number_of_conserved_genes.Rtab
- number_of_genes_in_pan_genome.Rtab
- number_of_new_genes.Rtab
- number_of_unique_genes.Rtab
- pan_genome_reference.fa
- pan_genome_sequences/
- summary_statistics.txt
- _uninflated_mcl_groups

DRAM

Completed DRAM.py distill for both DRAM runs (all 231 genomes). See genome_summaries_annotation_v3 and genome_summaries_annotation_v4 for output files. I examined product.html, which shows which modules are present in the isolates. Need to examine this and metabolism_summary.xlsx more closely. Discussed with Jules and Crystal converting the presence/absence data from metabolism_summary.xlsx into an ordination to see which commensals fall closer to STEC and could be candidates for further study. Can also color code and include virulence genes.
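Before building an ordination, the same question (which commensals sit closest to STEC) can be approximated directly from pairwise distances. The sketch below uses Jaccard distance on sets of present features and ranks candidates by mean distance to a target group. This is a simplified stand-in for the ordination discussed above, not the method used in this repo; the genome names and module IDs are hypothetical, and extracting the presence/absence sets from metabolism_summary.xlsx is not shown.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of present features."""
    union = a | b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(a & b) / len(union)


def rank_by_distance_to_group(profiles, targets, candidates):
    """Rank candidates by mean Jaccard distance to the target genomes,
    closest first."""
    scores = []
    for c in candidates:
        mean_d = sum(
            jaccard_distance(profiles[c], profiles[t]) for t in targets
        ) / len(targets)
        scores.append((c, mean_d))
    return sorted(scores, key=lambda item: item[1])


# Made-up profiles: genome name -> set of present modules
profiles = {
    "STEC_ref": {"M1", "M2", "M3"},
    "comm_A": {"M1", "M2"},
    "comm_B": {"M4"},
}
ranking = rank_by_distance_to_group(
    profiles, ["STEC_ref"], ["comm_A", "comm_B"]
)
```

The full pairwise Jaccard matrix could also serve as input to an ordination such as MDS.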

Files generated:

genome_summaries_annotation_*/
- genome_stats.tsv
- metabolism_summary.xlsx
- product.html
- product.tsv
annotation_v3_dramfirstrun/working_dir/ or annotation_v4_dramsecondrun/
- */
  - *.gbk
  - genes.annotated.faa
  - genes.annotated.gff3
  - scaffolds.annotated.fa
  - annotations.tsv
  - genes.annotated.fna
  - rrnas.tsv
  - trnas.tsv
  - genbank/ # <= in annotation_v4_dramsecondrun only

PPanGGOLiN

Downloaded gene_families.tsv, which lists all the genes (non-descriptive IDs, second column) in each gene family (first column). Will still have to BLAST the gene family sequences to find out what they are. Can obtain a fasta file of all genes, gene families, or protein families in the pangenome:

ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes all
ppanggolin fasta -p pangenome.h5 --output MY_GENES --gene_families all
ppanggolin fasta -p pangenome.h5 --output MY_PROT --prot_families all
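The two-column layout of gene_families.tsv described above (family in the first column, gene in the second) can be loaded into a family-to-genes mapping, for example to pick out families to BLAST. This is a minimal sketch that assumes a tab-separated file with no header row and ignores any extra columns; the function name and the example IDs are hypothetical.

```python
import csv
import io
from collections import defaultdict


def genes_by_family(tsv_lines):
    """Map each gene family ID (column 1) to its gene IDs (column 2).

    Assumption: tab-separated rows, no header; columns past the
    second are ignored.
    """
    families = defaultdict(list)
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        families[row[0]].append(row[1])
    return dict(families)


# Made-up rows (hypothetical IDs)
example = "fam1\tgeneA\nfam1\tgeneB\nfam2\tgeneC\n"
families = genes_by_family(io.StringIO(example))
```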

Files generated:

gene_presence_absence.Rtab
organisms_statistics.tsv
pangenomeGraph_light.gexf
projection/
matrix.csv
pangenomeGraph.gexf
pangenome.h5
tile_plot.html
mean_persistent_duplication.tsv
pangenomeGraph.json
partitions/
Ushaped_plot.html
