INNUca - Reads Control and Assembly
INNUENDO quality control of reads, de novo assembly and contigs quality assessment, and possible contamination detection
https://github.com/B-UMMI/INNUca
- Illumina paired-end reads (paired end information: sampleName_R1_001 / sampleName_R2_001 OR sampleName_1 / sampleName_2) (gzip compressed: .fastq.gz or .fq.gz)
- Expected species name
- Expected genome size in Mb
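The naming convention for paired-end files listed above can be illustrated with a short sketch. It groups file names into (forward, reverse) pairs by sample name. The function `pair_fastq_files` and its regular expression are hypothetical and written only for this example; INNUca's own file detection may behave differently.

```python
import re
from collections import defaultdict

# Accepts sampleName_R1_001 / sampleName_R2_001 or sampleName_1 / sampleName_2,
# followed by .fastq.gz or .fq.gz (illustrative pattern, not INNUca's own code).
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+?)_(?:R(?P<illumina>[12])_001|(?P<simple>[12]))"
    r"\.(?:fastq|fq)\.gz$"
)


def pair_fastq_files(filenames):
    """Group file names into {sample: (forward_file, reverse_file)}."""
    mates_by_sample = defaultdict(dict)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            continue  # not a recognised paired-end file name
        mate = match.group("illumina") or match.group("simple")
        mates_by_sample[match.group("sample")][mate] = name
    # Keep only samples for which both mates were found.
    return {
        sample: (mates["1"], mates["2"])
        for sample, mates in mates_by_sample.items()
        if "1" in mates and "2" in mates
    }
```

Files that do not follow either convention, or samples with only one mate, are left out of the result.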
Mandatory
- Java JDK
- mlst (https://github.com/tseemann/mlst) >= v2.4 (required only when the mlst module runs; an mlst version with updated databases is recommended)
- ReMatCh (https://github.com/B-UMMI/ReMatCh) >= v3.2 (required only when the true coverage module runs)
- gzip >= v1.6 (normally found in Linux OS)
Optional
(executables are provided, but the user's own executables can be used with the --doNotUseProvidedSoftware option)
- Bowtie2 >= v2.2.9
- Samtools = v1.3.1
- FastQC = v0.11.5
- Trimmomatic = v0.36 (make sure the .jar file is executable and it is in your PATH)
- Pear = v0.9.10
- SPAdes >= v3.9.0
- Pilon = v1.18
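Before running the pipeline, it can help to confirm that the dependencies above are reachable from PATH. The following is a minimal sketch using only the Python standard library. The executable names (`rematch.py`, `spades.py` and so on) are assumptions about how the tools are installed and may need adjusting. Trimmomatic and Pilon are left out because they are distributed as .jar files, and no version check is attempted.

```python
import shutil

# Executable names are assumptions for illustration; adjust them to match
# the way each tool is installed on your system.
MANDATORY = ["java", "mlst", "rematch.py", "gzip"]
# Only relevant when --doNotUseProvidedSoftware is used.
OPTIONAL = ["bowtie2", "samtools", "fastqc", "pear", "spades.py"]


def find_missing(executables):
    """Return the executables that cannot be found in PATH."""
    return [name for name in executables if shutil.which(name) is None]


if __name__ == "__main__":
    for label, names in (("mandatory", MANDATORY), ("optional", OPTIONAL)):
        missing = find_missing(names)
        status = ", ".join(missing) if missing else "none"
        print("Missing {} executables: {}".format(label, status))
```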
git clone https://github.com/B-UMMI/INNUca.git
usage: INNUca.py [-h] [--version]
-s "Streptococcus agalactiae" -g 2.1
(-i /path/to/input/directory/ | -f /path/to/input/file_1.fq.gz /path/to/input/file_2.fq.gz)
[-o /output/directory/] [-j N]
[--jarMaxMemory 10] [--doNotUseProvidedSoftware]
[--keepIntermediateAssemblies]
[--skipEstimatedCoverage] [--skipFastQC]
[--skipTrimmomatic] [--skipSPAdes] [--skipAssemblyMapping]
[--skipPilon] [--skipMLST] [--runPear] [--noLog] [--noGitInfo] [--json]
[--skipTrueCoverage | --trueConfigFile species.config]
[--adapters adaptersFile.fasta | --doNotSearchAdapters]
[--estimatedMinimumCoverage N]
[--fastQCkeepFiles] [--fastQCproceed]
[--doNotTrimCrops | [[--trimCrop N] [--trimHeadCrop N]]]
[--trimSlidingWindow window:meanQuality] [--trimLeading N]
[--trimTrailing N] [--trimMinLength N] [--trimKeepFiles]
[--spadesVersion 3.11.0] [--spadesNotUseCareful]
[--spadesMinContigsLength N] [--spadesMaxMemory N]
[--spadesMinCoverageAssembly 10] [--spadesMinKmerCovContigs N]
[--spadesKmers 55 77 [55 77 ...] | --spadesDefaultKmers]
[--assemblyMinCoverageContigs N]
[--maxNumberContigs N] [--saveExcludedContigs]
[--pilonKeepFiles]
[--pearKeepFiles] [--pearMinOverlap N]
INNUca - Reads Control and Assembly
optional arguments:
-h, --help show this help message and exit
--version Version information
Required options:
-s "Streptococcus agalactiae", --speciesExpected "Streptococcus agalactiae"
Expected species name (default: None)
-g 2.1, --genomeSizeExpectedMb 2.1
Expected genome size in Mb (default: None)
Required INPUT options (one of the following):
-i /path/to/input/directory/, --inputDirectory /path/to/input/directory/
Path to directory containing the fastq files. Can be
organized in separate directories by sample or all
together (default: None)
-f /path/to/input/file_1.fq.gz /path/to/input/file_2.fq.gz, --fastq /path/to/input/file_1.fq.gz /path/to/input/file_2.fq.gz
Path to Pair-End Fastq files (default: None)
General options:
-o /output/directory/, --outdir /output/directory/
Path for output directory (default: .)
-j N, --threads N Number of threads (default: 1)
--jarMaxMemory 10 Sets the maximum RAM Gb usage by jar files
(Trimmomatic and Pilon). Can also be auto or off. When
auto is set, 1 Gb per thread will be used up to the
free available memory (default: off)
--doNotUseProvidedSoftware
Tells the software to not use FastQC, Trimmomatic,
SPAdes, Bowtie2, Samtools and Pilon that are provided
with INNUca.py (default: False)
--keepIntermediateAssemblies
Tells INNUca to keep all the intermediate assemblies
(default: False)
--skipEstimatedCoverage
Tells the programme to not estimate coverage depth
based on number of sequenced nucleotides and expected
genome size (default: False)
--skipTrueCoverage Tells the programme to not run trueCoverage_ReMatCh
analysis (default: False)
--skipFastQC Tells the programme to not run FastQC analysis
(default: False)
--skipTrimmomatic Tells the programme to not run Trimmomatic (default:
False)
--skipSPAdes Tells the programme to not run SPAdes and consequently
Pilon correction, Assembly Mapping check and MLST
analysis (SPAdes contigs required) (default: False)
--skipAssemblyMapping
Tells the programme to not run Assembly Mapping check
(default: False)
--skipPilon Tells the programme to not run Pilon correction and
consequently Assembly Mapping check (bam files
required) (default: False)
--skipMLST Tells the programme to not run MLST analysis (default:
False)
--runPear Tells the programme to run Pear (default: False)
--noLog Do not create a log file (default: False)
--noGitInfo Do not retrieve GitHub repository information
(default: False)
--json Tells INNUca to save the results also in json format
(default: False)
Adapters options (one of the following):
--adapters adaptersFile.fasta
Fasta file containing adapters sequences to be used in
FastQC and Trimmomatic (default: None)
--doNotSearchAdapters
Tells INNUca.py to not search for adapters and clip
them during Trimmomatic step (default: False)
Estimated Coverage options:
--estimatedMinimumCoverage N
Minimum estimated coverage to continue INNUca pipeline
(default: 15)
trueCoverage_ReMatCh options:
--trueConfigFile species.config
File with trueCoverage_ReMatCh settings. Some species
specific config files can be found in
INNUca/modules/trueCoverage_rematch/ folder. Use those
files as example files. For species with config files
in INNUca/modules/trueCoverage_rematch/ folder (not
pre releases versions, marked with "pre."),
trueCoverage_ReMatCh will run by default, unless
--skipTrueCoverage is specified. Do not use together
with --skipTrueCoverage option (default: None)
FastQC options:
--fastQCkeepFiles Tells INNUca.py to not remove the output of
FastQC (default: False)
--fastQCproceed Do not stop INNUca.py if sample fails FastQC (default:
False)
Trimmomatic options:
--doNotTrimCrops Tells INNUca.py to not cut the beginning and end of
reads during Trimmomatic step (unless specified with
--trimCrop or --trimHeadCrop, INNUca.py will search
for nucleotide content bias at both ends and will cut
by there) (default: False)
--trimCrop N Cut the specified number of bases from the end of the
reads, counted from the maximum reads length (default: None)
--trimHeadCrop N Trimmomatic: cut the specified number of bases from
the start of the reads (default: None)
--trimSlidingWindow window:meanQuality
Trimmomatic: perform a sliding window trimming,
cutting once the average quality within the window
falls below a threshold (default: 5:20)
--trimLeading N Trimmomatic: cut bases off the start of a read, if
below a threshold quality (default: 3)
--trimTrailing N Trimmomatic: cut bases off the end of a read, if below
a threshold quality (default: 3)
--trimMinLength N Trimmomatic: drop the read if it is below a specified
length (default: 55)
--trimKeepFiles Tells INNUca.py to not remove the output of
Trimmomatic (default: False)
SPAdes options:
--spadesVersion 3.11.0
Tells INNUca.py which SPAdes version to use
(available options: 3.9.0, 3.10.1, 3.11.0) (default:
3.11.0)
--spadesNotUseCareful
Tells SPAdes to only perform the assembly without the
--careful option (default: False)
--spadesMinContigsLength N
Filter SPAdes contigs for length greater or equal than
this value (default: maximum reads size or 200 bp)
--spadesMaxMemory N The maximum amount of RAM Gb for SPAdes to use
(default: 2 Gb per thread will be used up to the free
available memory)
--spadesMinCoverageAssembly 10
The minimum number of reads to consider an edge in the
de Bruijn graph during the assembly. Can also be auto
or off (default: 2)
--spadesMinKmerCovContigs N
Minimum contigs K-mer coverage. After assembly only
keep contigs with reported k-mer coverage equal or
above this value (default: 2)
SPAdes k-mers options (one of the following):
--spadesKmers 55 77 [55 77 ...]
Manually sets SPAdes k-mers lengths (all values must
be odd, lower than 128) (default values: reads
length >= 175 [55, 77, 99, 113, 127]; reads
length < 175 [21, 33, 55, 67, 77])
--spadesDefaultKmers Tells INNUca to use SPAdes default k-mers (default:
False)
Assembly Mapping options:
--assemblyMinCoverageContigs N
Minimum contigs average coverage. After mapping reads
back to the contigs, only keep contigs with at least
this average coverage (default: 1/3 of the assembly
mean coverage or 10x)
Assembly options:
--maxNumberContigs N Maximum number of contigs per 1.5 Mb of expected
genome size (default: 100)
--saveExcludedContigs Tells INNUca.py to save excluded contigs (default: False)
Pilon options:
--pilonKeepFiles Tells INNUca.py to not remove the output of Pilon
(default: False)
Pear options:
--pearKeepFiles Tells INNUca.py to not remove the output of Pear
(default: False)
--pearMinOverlap N Minimum nucleotide overlap between read pairs for Pear
to assemble them into a single read (default: 2/3 of
maximum reads length, or 33 whenever it was not possible
to determine it with FastQC)
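Several of the defaults described in the help text above are simple functions of the expected genome size, the reads length and the number of threads. The sketch below restates them in Python so the arithmetic is explicit. All function names are hypothetical, and details such as rounding and whether thresholds are inclusive are assumptions; INNUca's actual implementation may differ.

```python
def estimated_coverage(sequenced_nucleotides, genome_size_mb):
    """Coverage depth = sequenced nucleotides / expected genome size in bp."""
    return sequenced_nucleotides / (genome_size_mb * 1e6)


def passes_estimated_coverage(sequenced_nucleotides, genome_size_mb, minimum=15):
    """Mirror --estimatedMinimumCoverage (default: 15); '>=' is an assumption."""
    return estimated_coverage(sequenced_nucleotides, genome_size_mb) >= minimum


def max_contigs_allowed(genome_size_mb, max_per_1_5_mb=100):
    """Scale --maxNumberContigs (contigs per 1.5 Mb) to the expected genome size."""
    return genome_size_mb / 1.5 * max_per_1_5_mb


def default_spades_kmers(reads_length):
    """Default k-mer sets listed under --spadesKmers."""
    if reads_length >= 175:
        return [55, 77, 99, 113, 127]
    return [21, 33, 55, 67, 77]


def memory_cap_gb(threads, gb_per_thread, free_gb):
    """Per-thread memory allowance, limited by the free available memory."""
    return min(threads * gb_per_thread, free_gb)
```

For example, with `-g 2.1` and 63,000,000 sequenced nucleotides the estimated coverage is about 30x, and the contig limit is about 140. `memory_cap_gb(8, 2, 12)` corresponds to the SPAdes default of 2 Gb per thread on a machine with 12 Gb free, and `memory_cap_gb(8, 1, 12)` to `--jarMaxMemory auto`.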
To combine INNUca reports (Estimated Coverage, True Coverage, Pear, SPAdes, Assembly Mapping, Pilon, MLST), use combine_reports.py, found in the INNUca modules folder.
usage: python combine_reports.py [-h] [--version] -i
/path/to/INNUca/output/directory/
[-o /path/to/output/directory/]
Combine INNUca reports (Estimated Coverage, True Coverage, Pear, SPAdes, Assembly
Mapping, Pilon, MLST)
optional arguments:
-h, --help show this help message and exit
--version Version information
Required options:
-i /path/to/INNUca/output/directory/, --innucaOut /path/to/INNUca/output/directory/
Path to INNUca output directory (default: None)
Facultative options:
-o /path/to/output/directory/, --outdir /path/to/output/directory/
Path to where to store the outputs (default: ['.'])
To manually combine INNUca trueCoverage_ReMatCh module reports with respect to gene information, use combine_trueCoverage_reports.py, found in the INNUca modules/trueCoverage_rematch folder.
usage: python combine_trueCoverage_reports.py [-h] [--version] -i
/path/to/INNUca/output/directory/
[-o /path/to/output/directory/]
[--minimum_gene_coverage 80]
Combine trueCoverage_ReMatCh module reports in respect to gene information.
optional arguments:
-h, --help show this help message and exit
--version Version information
Required options:
-i /path/to/INNUca/output/directory/, --innucaOut /path/to/INNUca/output/directory/
Path to INNUca output directory (default: None)
Facultative options:
-o /path/to/output/directory/, --outdir /path/to/output/directory/
Path to where to store the outputs (default: .)
--minimum_gene_coverage 80
Minimum percentage of sequence length (with a minimum
of read depth to consider a position to be present) to
determine whether a gene is present. (default: 80)
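The `--minimum_gene_coverage` rule described above can be sketched as follows: a position counts as present when its read depth reaches a minimum, and a gene counts as present when enough of its length is covered. The function names and the `min_depth` parameter are hypothetical (the help text does not state the depth value used), and the inclusive comparison is an assumption.

```python
def gene_coverage_percent(depths, min_depth):
    """Percentage of gene positions whose read depth is at least min_depth."""
    if not depths:
        return 0.0
    covered = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * covered / len(depths)


def gene_is_present(depths, min_depth, minimum_gene_coverage=80):
    """Apply the --minimum_gene_coverage threshold (default: 80)."""
    return gene_coverage_percent(depths, min_depth) >= minimum_gene_coverage
```

Here `depths` is a list with one read depth value per position of the gene sequence.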
MP Machado, J Halkilahti, A Jaakkonen, DN Silva, I Mendes, Y Nalbantoglu, V Borges, M Ramirez, M Rossi, JA Carriço. INNUca GitHub https://github.com/B-UMMI/INNUca
Miguel Machado mpmachado@medicina.ulisboa.pt