A workflow designed to clean SINGLE-END fastq files
for the SEACONNECT project
See https://www.sylabs.io/docs/ for instructions to install Singularity.
The following programs are installed in our container:
- fastq_illumina_filter keeps reads that were NOT filtered out by the Illumina sequencer.
- fastp provides fast all-in-one preprocessing for fastq files.
- bbduk filters or trims reads for adapters and contaminants using k-mers.
singularity pull --name cleanfastq.simg shub://Grelot/clean-fastq:cleanfastq
Alternatively, if you are an administrator on your machine, you can build a local image:
sudo singularity build cleanfastq.simg Singularity.cleanfastq
singularity run cleanfastq.simg
It should output:
Opening container...ubuntu xenial: fastq_illumina_filter, bbduck, fastp
Before running the pipeline with snakemake, you have to set up a config file with the following fields:
- threads_by_job: number of cores used to process a single fastq file
- fastqFolderPath: absolute path of the folder containing the fastq files
- fastqFiles: list of the PREFIX of each PREFIX.fastq.gz file inside the folder that you want to process
- container: name of the singularity image file
- fastp: custom parameters for the fastp command (see section The pipeline - Quality filtering for details)
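As a quick sanity check before launching snakemake, the fields described above can be verified with a few lines of Python. This is only a sketch: `check_config` is a hypothetical helper that is not part of the pipeline, and the example values (path, sample prefixes) are placeholders.

```python
# Hypothetical helper (not part of the pipeline): check that a config
# mapping contains the fields described above, with the expected types.
REQUIRED_KEYS = {
    "threads_by_job": int,
    "fastqFolderPath": str,
    "fastqFiles": list,
    "container": str,
    "fastp": dict,
}


def check_config(config):
    """Return a list of problems; an empty list means the config looks fine."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    return problems


# Placeholder values, not the project's real paths or sample names.
example_config = {
    "threads_by_job": 8,
    "fastqFolderPath": "/path/to/fastq",
    "fastqFiles": ["sample_A", "sample_B"],
    "container": "cleanfastq.simg",
    "fastp": {"n_base_limit": 0, "length_required": 76},
}

print(check_config(example_config))
```

In practice the mapping would be loaded from the YAML config file rather than written inline.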
Illumina sequencers perform an internal quality filtering procedure called chastity filter, and reads that pass this filter are called PF for pass-filter. According to Illumina, chastity is defined as the ratio of the brightest base intensity divided by the sum of the brightest and second brightest base intensities. Clusters of reads pass the filter if no more than 1 base call has a chastity value below 0.6 in the first 25 cycles. This filtration process removes the least reliable clusters from the image analysis results. We used fastq_illumina_filter to remove reads which failed to pass the chastity filter.
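The chastity rule described above can be written down in a few lines. The sketch below only illustrates the definition; it is not how fastq_illumina_filter is implemented, and the function names and toy intensity values are invented for the example. It assumes four strictly positive channel intensities per cycle.

```python
def chastity(intensities):
    """Chastity of one base call: brightest / (brightest + second brightest)."""
    brightest, second = sorted(intensities, reverse=True)[:2]
    return brightest / (brightest + second)


def passes_filter(cycles, threshold=0.6, first_cycles=25, max_failures=1):
    """Return True if at most `max_failures` of the first cycles fall below the threshold.

    `cycles` is a list of per-cycle channel intensities, e.g. [[A, C, G, T], ...].
    """
    failures = sum(1 for cycle in cycles[:first_cycles] if chastity(cycle) < threshold)
    return failures <= max_failures


# Toy data: a clean cluster, and one with two ambiguous cycles.
clean_cycle = [100, 10, 5, 5]      # chastity = 100 / 110
ambiguous_cycle = [50, 45, 5, 5]   # chastity = 50 / 95, below 0.6

clean_cluster = [clean_cycle] * 25
mixed_cluster = [ambiguous_cycle] * 2 + [clean_cycle] * 23

print(passes_filter(clean_cluster), passes_filter(mixed_cluster))
```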
The viral genome of phiX is used as a control in Illumina sequencing. While the viral libraries do not have MIDs on them, some phiX reads always creep through, possibly because the clusters “borrow” the signals from closely surrounding clusters that do. We removed these phiX reads.
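To give an idea of how k-mer based contaminant removal works, here is a toy sketch in Python. It is not bbduk's implementation: the reference string is a short made-up sequence standing in for the phiX genome, the value of k is arbitrary, and all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence made of A, C, G, T."""
    return sequence.translate(COMPLEMENT)[::-1]


def build_kmer_index(reference, k):
    """Collect every k-mer of the reference, on both strands."""
    index = set()
    for strand in (reference, reverse_complement(reference)):
        for i in range(len(strand) - k + 1):
            index.add(strand[i:i + k])
    return index


def is_contaminant(read, index, k):
    """A read is flagged as soon as one of its k-mers is found in the index."""
    return any(read[i:i + k] in index for i in range(len(read) - k + 1))


# Made-up reference standing in for the phiX genome (assumption for the example).
toy_reference = "ACGTTGCAAGGCTTAACCGGTTAGC"
K = 8
index = build_kmer_index(toy_reference, K)

reads = {
    "from_reference": "TTTT" + toy_reference[5:17] + "TTTT",
    "unrelated": "TTTTTTTTTTTTTTTTTTTT",
}
kept = [name for name, seq in reads.items() if not is_contaminant(seq, index, K)]
print(kept)
```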
Adapter sequences should be removed from reads because they interfere with downstream analyses, such as alignment of reads to a reference. The adapters contain the sequencing primer binding sites, the index sequences, and the sites that allow library fragments to attach to the flow cell lawn. We trimmed Illumina TruSeq and Nextera adapter sequences from the reads.
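The idea of 3' adapter trimming can be sketched as follows. This is a deliberately naive exact-match version, not the k-mer based trimming done by bbduk: `trim_adapter` and its `min_match` parameter are hypothetical, and the adapter string is the commonly cited start of the Illumina TruSeq adapter, used here only as an example.

```python
def trim_adapter(read, adapter, min_match=8):
    """Cut the read at the first exact occurrence of the adapter's first bases.

    Only the first `min_match` bases of the adapter are searched, so that an
    adapter truncated by the end of the read can still be found.
    """
    seed = adapter[:min_match]
    position = read.find(seed)
    return read if position == -1 else read[:position]


# Example adapter prefix (assumption for the example).
TRUSEQ_PREFIX = "AGATCGGAAGAGC"

insert = "ACGTACGTAC"
read_with_adapter = insert + TRUSEQ_PREFIX + "ACAC"
read_without_adapter = "ACGTACGTACGTTTGACC"

print(trim_adapter(read_with_adapter, TRUSEQ_PREFIX))
print(trim_adapter(read_without_adapter, TRUSEQ_PREFIX))
```

A real trimmer also tolerates mismatches and trims the quality string at the same position.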
We trim bases and discard reads based on the Phred quality scores provided in the fastq files, using the program fastp.
Modify the fastp section of the config file to change the default parameters:
fastp:
  n_base_limit: 0
  qualified_quality_phred: 18
  unqualified_percent_limit: 40
  length_required: 76
  cut_tail_window_size: 4
  cut_tail_mean_quality: 18
  poly_g_min_len: 10
We removed reads containing any N base.
n_base_limit: 0
In the context of sequencing, Phred-scaled quality scores represent how confident we are in each base call made by the sequencer. The Phred quality score (Q) is logarithmically related to the error probability (E).
Q = -10 \log_{10} E
Here is a table of how to interpret a range of Phred Quality Scores.
Phred Quality Score | Error | Accuracy (1 - Error) |
---|---|---|
10 | 10% | 90% |
20 | 1% | 99% |
30 | 0.1% | 99.9% |
40 | 0.01% | 99.99% |
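The conversion between Phred scores and error probabilities is easy to reproduce. The sketch below assumes the usual Phred+33 ASCII encoding of fastq quality strings; the function names are chosen for this example.

```python
import math


def phred_to_error(q):
    """Error probability for a Phred quality score: E = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(error):
    """Phred quality score for an error probability: Q = -10 * log10(E)."""
    return -10 * math.log10(error)


def decode_quality(quality_string, offset=33):
    """Decode a fastq quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


for q in (10, 18, 20, 30, 40):
    print(f"Q{q}: error = {phred_to_error(q):.4%}")

print(decode_quality("I5"))
```

With the threshold of 18 used in this pipeline, a base is accepted when its estimated error probability is about 1.6% or lower.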
- We treated bases with a Phred quality score below 18 as unqualified, and we discarded a read when more than 40% of its bases were unqualified.
- We trimmed the 3' tail of each read with a sliding window of 4 bases, dropping windows whose mean Phred quality score was below 18.
qualified_quality_phred: 18
unqualified_percent_limit: 40
cut_tail_window_size: 4
cut_tail_mean_quality: 18
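The two quality rules above can be sketched in Python. This is a simplified illustration, not fastp's actual code: it assumes the window slides one base at a time from the 3' end and that a read is discarded only when strictly more than the allowed percentage of bases is unqualified. The function names are hypothetical; the default values mirror the config above.

```python
def is_low_quality(qualities, qualified_quality_phred=18, unqualified_percent_limit=40):
    """True if more than the allowed percentage of bases is below the threshold."""
    unqualified = sum(1 for q in qualities if q < qualified_quality_phred)
    return unqualified * 100 > unqualified_percent_limit * len(qualities)


def cut_tail(qualities, window_size=4, mean_quality=18):
    """Return how many bases to keep after trimming the 3' tail.

    A window is moved from the 3' end towards the 5' end, one base at a time
    (assumption), until its mean quality reaches the threshold.
    """
    end = len(qualities)
    while end >= window_size:
        window = qualities[end - window_size:end]
        if sum(window) / window_size >= mean_quality:
            break
        end -= 1
    return end


# Toy read: ten good bases followed by four poor ones.
toy_qualities = [30] * 10 + [10] * 4
keep = cut_tail(toy_qualities)
print(keep, is_low_quality(toy_qualities[:keep]))
```

The sequence and its quality string would both be truncated to the returned length before the other filters are applied.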
We trimmed polyG sequences of 10 bases or more from the 3' tail of the reads.
poly_g_min_len: 10
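A minimal sketch of polyG tail trimming is shown below. It only handles a perfect run of G at the 3' end and assumes the minimum length is inclusive, so it is simpler than what fastp does; `trim_poly_g` is a name chosen for this example.

```python
def trim_poly_g(sequence, min_len=10):
    """Remove a trailing run of G if it is at least `min_len` bases long."""
    stripped = sequence.rstrip("G")
    run_length = len(sequence) - len(stripped)
    return stripped if run_length >= min_len else sequence


long_tail = "ACGTACGT" + "G" * 12
short_tail = "ACGTACGT" + "G" * 5

print(trim_poly_g(long_tail))
print(trim_poly_g(short_tail))
```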
We removed trimmed reads with a length below 76 bases.
length_required: 76
Run a test on tiny data:
snakemake -s Snakefile -j 8 --use-singularity --configfile 01-infos/tiny_config.yaml
Run the pipeline on Diplodus sargus fastq raw data:
snakemake -s Snakefile -j 8 --use-singularity --configfile 01-infos/diplodus_rawdata_config.yaml
Run the pipeline on Mullus surmuletus fastq raw data:
snakemake -s Snakefile -j 8 --use-singularity --configfile 01-infos/mullus_rawdata_config.yaml