PETANC (“Plasmid-Exploration Typing Assembly N’Contig-ordering”) is a pipeline for Illumina paired-end reads of Escherichia coli strains.
PETANC is used on the cdc of IAME. Our cdc uses Docker, and each piece of software is first installed in a container.
The file parameter.py lists the name, version and article to cite for the software in each image.
% petanc.py
Script usage :
-h : print this message and exit
--fasta : start the analysis from fasta sequences
--directory : directory with fastq or fasta files
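The options above could be declared with argparse roughly as follows. This is only a sketch: how petanc.py actually parses its arguments is not shown here, and treating --fasta as a boolean flag and --directory as required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the options listed in the usage message."""
    parser = argparse.ArgumentParser(prog="petanc.py")
    # -h / --help is added automatically by argparse.
    # Assumption: --fasta is a switch that takes no value.
    parser.add_argument("--fasta", action="store_true",
                        help="start the analysis from fasta sequences")
    # Assumption: the directory is mandatory.
    parser.add_argument("--directory", required=True,
                        help="directory with fastq or fasta files")
    return parser


if __name__ == "__main__":
    # Hypothetical call: analyse fasta files stored in "assemblies/".
    args = build_parser().parse_args(["--fasta", "--directory", "assemblies/"])
    print(args.fasta, args.directory)
```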
- Create a list of samples and clean names of fastq files
- Quality Control of the raw reads
- Trimming
- Quality Control of the clean reads
- Assembly
- Quality control of the assembly
- Serotype
- MLST
- fimH
- Phylogroup
- Virulence genes
- Resistance genes
- Plasmids
- Classification of contigs as plasmid or chromosome
- Annotation
- Capsule systems
- Shigella or EHEC
- Excel layout
A list of the samples is created from the directory (--directory option). If the samples are fastq files, their names are cleaned. Depending on the sequencing technology, the names of the read files can be shortened. For example, read files from MiSeq are named Name_S[0-9]*_L001_R1_001.fastq.gz and can be shortened to Name_R1.fastq.gz. The pipeline cleans names for HiSeq, MiSeq and NextSeq.
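The MiSeq renaming described above can be sketched with a regular expression. The function name clean_fastq_name is hypothetical, and only the MiSeq pattern quoted above is handled; the HiSeq and NextSeq patterns are not given here, so they are left out.

```python
import re

# Pattern taken from the MiSeq example: Name_S[0-9]*_L001_R1_001.fastq.gz
MISEQ_NAME = re.compile(
    r"^(?P<sample>.+)_S[0-9]*_L001_(?P<read>R[12])_001\.fastq\.gz$"
)


def clean_fastq_name(filename: str) -> str:
    """Return the shortened name, or the input unchanged if it does not match."""
    match = MISEQ_NAME.match(filename)
    if match is None:
        return filename
    return f"{match.group('sample')}_{match.group('read')}.fastq.gz"


if __name__ == "__main__":
    # Hypothetical file name.
    print(clean_fastq_name("Ec01_S12_L001_R1_001.fastq.gz"))
```

Names that do not match the pattern are returned untouched, so the function can be applied to every file in the directory without special cases.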
We use FastQC to generate a quality-control report for each read file (with default parameters) and MultiQC to view all the FastQC outputs at a glance.
https://github.com/s-andrews/FastQC
https://github.com/ewels/MultiQC
✔️ FastQC: 0.11.9 ✔️ QUAST: 5.0.2
Trim Galore is used to trim and filter the paired-end reads (--paired). In addition to adapter removal, we trim low-quality ends from the reads (-q 30). We discard reads that become too short after trimming (-t 50). We keep unpaired reads (--retain_unpaired).
https://github.com/FelixKrueger/TrimGalore
✔️ Trimgalore: 0.6.7
The quality control after trimming is exactly the same as before trimming.
The assembly is done with SPAdes with the --careful option.
https://github.com/ablab/spades
✔️ SPAdes : 3.15.4
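As an illustration, a SPAdes call in careful mode for one pair of trimmed read files could be assembled like this. The helper name and the file names are hypothetical, and any other options PETANC may pass to SPAdes are not covered.

```python
import shlex
from typing import List


def spades_command(r1: str, r2: str, outdir: str) -> List[str]:
    """Build a spades.py command for one paired-end sample in careful mode."""
    return ["spades.py", "--careful", "-1", r1, "-2", r2, "-o", outdir]


if __name__ == "__main__":
    # Hypothetical file names; the list can be handed to subprocess.run().
    cmd = spades_command("Ec01_R1.fastq.gz", "Ec01_R2.fastq.gz", "Ec01_spades")
    print(shlex.join(cmd))
```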
QUAST is used to check the quality of the assembly.
✔️ QUAST: 5.0.2
⌛ to use MultiQC again
We use abricate with the ecoh database (--db ecoh) and thresholds for the percentage of identity (--minid 80) and coverage (--mincov 90).
https://github.com/tseemann/abricate
✔️ abricate : 0.8.11
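The abricate call for this step, and a way to read the gene names back from its tab-separated report, might look like the sketch below. The coverage flag of abricate is spelled --mincov. The helper names are hypothetical, and the sample report is made up and reduced to a few columns for illustration.

```python
import csv
import io
from typing import List


def abricate_command(fasta: str, db: str, minid: int, mincov: int) -> List[str]:
    """Build an abricate command with identity and coverage thresholds."""
    return ["abricate", "--db", db,
            "--minid", str(minid), "--mincov", str(mincov), fasta]


def genes_from_report(report_text: str) -> List[str]:
    """Return the GENE column of an abricate tab-separated report."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return [row["GENE"] for row in reader]


if __name__ == "__main__":
    print(abricate_command("Ec01_contigs.fasta", "ecoh", 80, 90))
    # Made-up report with a subset of the real columns.
    report = (
        "#FILE\tSEQUENCE\tGENE\t%COVERAGE\t%IDENTITY\n"
        "Ec01_contigs.fasta\tNODE_1\tgeneA\t100.00\t99.50\n"
    )
    print(genes_from_report(report))
```

The same helper covers the other abricate steps of the pipeline by changing the database name and the thresholds.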
We use mlst to look for the Escherichia coli Warwick MLST (--scheme ecoli_achtman_4) and the Escherichia coli Pasteur MLST (--scheme ecoli).
"This publication made use of the PubMLST website (https://pubmlst.org/) developed by Keith Jolley (Jolley & Maiden 2010, BMC Bioinformatics, 11:595) and sited at the University of Oxford. The development of that website was funded by the Wellcome Trust".
https://github.com/tseemann/mlst
✔️ mlst : 2.16.2
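mlst prints one tab-separated line per input file: the file name, the scheme, the sequence type, then one locus(allele) field per gene. A minimal parser for such a line could look like this; the function name is hypothetical and the example line is illustrative, not taken from a real run.

```python
from typing import Dict, Union


def parse_mlst_line(line: str) -> Dict[str, Union[str, Dict[str, str]]]:
    """Split one line of mlst output into file, scheme, ST and alleles."""
    fields = line.rstrip("\n").split("\t")
    filename, scheme, sequence_type = fields[:3]
    alleles = {}
    for field in fields[3:]:
        # Each field looks like "adk(53)"; keep the allele as text because
        # mlst can also report inexact or missing alleles.
        locus, _, allele = field.partition("(")
        alleles[locus] = allele.rstrip(")")
    return {"file": filename, "scheme": scheme,
            "st": sequence_type, "alleles": alleles}


if __name__ == "__main__":
    # Illustrative line only.
    example = "Ec01_contigs.fasta\tecoli_achtman_4\t131\tadk(53)\tfumC(40)"
    print(parse_mlst_line(example))
```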
We use FimTyper to determine the fimH allele, with a threshold for the percentage of identity (-k 95.00) and a minimum length for the overlap (-l 0.60).
https://bitbucket.org/genomicepidemiology/fimtyper/src/master/
✔️ FimTyper : 1.1
We use ClermonTyping to determine the phylogroup of Escherichia coli from all the contigs (--threshold 0).
✔️ ClermonTyping : 21.03
With abricate, we look for virulence genes with thresholds for the percentage of identity (--minid 80) and coverage (--mincov 90). The database was built by Guilhem Royer in Kieffer et al., 2019.
https://github.com/tseemann/abricate
✔️ abricate : 0.8.11
Currently, we use ResFinder.
https://bitbucket.org/genomicepidemiology/resfinder/src/master/
✔️ Resfinder : 4.2.2
With abricate, we look for plasmid genes with thresholds for the percentage of identity (--minid 80) and coverage (--mincov 90). The database is PlasmidFinder_DB.
https://github.com/tseemann/abricate
✔️ abricate : 0.8.11
We classify contigs from the assembly according to their location (i.e. plasmid or chromosome) with PlaScope.
https://github.com/labgem/PlaScope
✔️ PlaScope : 1.3.1
We annotate the bacterial assembly (--gcode 11) with Prokka.
https://github.com/tseemann/prokka
✔️ Prokka : 1.14
Sometimes there is a problem with the first line of each contig in the GenBank file from Prokka: the locus name is overwritten by the length of the sequence. The script cleanGBK_locusProkka.py therefore renames the locus without the coverage from the SPAdes output.
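The kind of repair involved can be sketched as follows. This is not the code of cleanGBK_locusProkka.py: the exact form of the damaged line (a SPAdes contig name NODE_x_length_y_cov_z running straight into the sequence length) is an assumption, and the function name is hypothetical.

```python
import re

# Assumed damaged form: "LOCUS  NODE_1_length_378381_cov_18.1505378381 bp ..."
# The length stored in the contig name is reused to find where the name ends.
FUSED_LOCUS = re.compile(
    r"^LOCUS\s+(?P<name>NODE_\d+_length_(?P<length>\d+))"
    r"_cov_[\d.]+?(?P=length) bp(?P<rest>.*)$"
)


def clean_locus_line(line: str) -> str:
    """Drop the coverage from a fused LOCUS line; return other lines unchanged."""
    stripped = line.rstrip("\n")
    match = FUSED_LOCUS.match(stripped)
    if match is None:
        return line
    cleaned = (f"LOCUS       {match.group('name')} "
               f"{match.group('length')} bp{match.group('rest')}")
    # Put back the line ending that was stripped, if any.
    return cleaned + line[len(stripped):]


if __name__ == "__main__":
    # Made-up LOCUS line.
    damaged = "LOCUS       NODE_1_length_378381_cov_18.1505378381 bp   DNA linear\n"
    print(clean_locus_line(damaged), end="")
```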
We detect capsule systems with CapsuleFinder for the models "ABC", "GroupIV_e_stricte", "GroupIV_f", "GroupIV_s_stricte", "PGA", "Syn_cps3", "Syn_has" and "Wzy_stricte".
https://research.pasteur.fr/en/tool/capsulefinder/
✔️ CapsuleFinder : 02/02/2018
With abricate, we look for the ipaH3 gene of Shigella flexneri Y strain PE57 (CP042980.1, positions 1400701 to 1402416) with thresholds for the percentage of identity (--minid 95) and coverage (--mincov 95).
https://github.com/tseemann/abricate
✔️ abricate : 0.8.11
All the data are summarized by the script petanc_layout.py to produce an Excel file.