This is a catalogue of diverse scripts used at the lab for smaller customised tasks.
Linux, Mac OS
Shell scripts (*.sh) in this collection were developed and tested with GNU bash (v4.4.20) on an Ubuntu 18.04 Linux system. R scripts were developed with the R console (v4.1.1) on macOS Monterey. Perl scripts were tested with Perl v5.32.1.
In paired-end (PE) fastq files, the sequence identifier in line 1 (of 4) of each record is identical in the read1 and read2 files. Sometimes one wants to concatenate (cat) both paired-end files into a single-end (SE) fastq file for downstream analysis. This creates a fastq file where each sequence identifier is present twice and is therefore not unique. This lack of uniqueness can cause bugs in downstream steps such as deduplication.
This script adds endings '/1' and '/2' to the identifier of each read in the pair to make them unique prior to concatenation.
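The renaming described above can be illustrated with a minimal Python sketch. This is not the lab's Perl script. The function name `tag_fastq_headers` and the example reads are hypothetical, and the sketch assumes strict 4-line fastq records and that the suffix goes on the first whitespace-delimited token of the header; the actual script may place it differently.

```python
def tag_fastq_headers(lines, mate):
    """Yield fastq lines with '/<mate>' appended to each record's identifier.

    Assumes strict 4-line records (header, sequence, '+', qualities).
    """
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        if index % 4 == 0:  # line 1 of each 4-line record is the header
            if not line.startswith("@"):
                raise ValueError(f"expected a fastq header at line {index + 1}")
            # Assumption: only the identifier (first token) gets the suffix;
            # any comment after the first space is kept unchanged.
            read_id, sep, comment = line.partition(" ")
            line = f"{read_id}/{mate}{sep}{comment}"
        yield line


# Hypothetical read pair sharing one identifier, as in paired-end files.
r1 = ["@read_001 1:N:0:1", "ACGT", "+", "IIII"]
r2 = ["@read_001 2:N:0:1", "TGCA", "+", "IIII"]

# Tag each mate, then concatenate: identifiers are now unique.
combined = list(tag_fastq_headers(r1, 1)) + list(tag_fastq_headers(r2, 2))
print(combined[0])
print(combined[4])
```

Quality lines that happen to start with '@' are left alone because only every fourth line, counted from the first, is treated as a header.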
perl sample_r1.fastq sample_r2.fastq
This creates two output files, sample_r1.fastq_headed.fastq and sample_r2.fastq_headed.fastq.
The first three fastq identifiers (headers) look like this:
Evaluates nuclear and chloroplast CHH methylation to estimate cytosine non-conversion and other errors in bisulfite-treated sequencing reads (BS-Seq) from Arabidopsis (TAIR10). The chloroplast genome is generally assumed to be unmethylated, so its apparent CHH methylation serves as an estimate of the non-conversion rate. Input is the cytosine methylation report (CX_report) produced by bismark_methylation_extractor from the BS-Seq mapping tool Bismark, which can look like:
Chr2 1001 - 0 0 CHH CNN
Chr2 1006 + 0 0 CG CGT
Chr2 1007 - 0 0 CG CGA
Chr2 1009 + 0 0 CG CGA
Chr2 1010 - 0 0 CG CGA
Chr2 1012 + 0 0 CHH CCA
Chr2 1013 + 0 0 CHG CAG
perl sample.deduplicated.CX_report.txt
Produces standard output like:
The mean chloroplast CHH methylation is :0.663193735143279%
The mean nuclear CHH methylation is :2.04898463286775%
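How such percentages could be derived from a CX_report can be sketched in Python. This is not the lab's Perl script, and several points are assumptions: the chloroplast is named `ChrC` and the nuclear chromosomes `Chr1` to `Chr5` (TAIR-style names, matching the `Chr2` rows above), mitochondrial sites are ignored, and the value is a pooled ratio of methylated counts to total counts over all CHH sites. The actual script may instead average per-site percentages. The function name `chh_methylation_percent` and the example rows are hypothetical.

```python
NUCLEAR = {"Chr1", "Chr2", "Chr3", "Chr4", "Chr5"}  # assumed TAIR-style names
CHLOROPLAST = "ChrC"                                 # assumed chloroplast name


def chh_methylation_percent(lines):
    """Return pooled CHH methylation (%) for nuclear and chloroplast sites.

    Expects CX_report columns: chromosome, position, strand,
    methylated count, unmethylated count, context, trinucleotide.
    """
    counts = {"nuclear": [0, 0], "chloroplast": [0, 0]}  # [methylated, total]
    for line in lines:
        fields = line.split()
        if len(fields) != 7 or fields[5] != "CHH":
            continue  # skip malformed rows and non-CHH contexts
        chrom, meth, unmeth = fields[0], int(fields[3]), int(fields[4])
        if chrom == CHLOROPLAST:
            key = "chloroplast"
        elif chrom in NUCLEAR:
            key = "nuclear"
        else:
            continue  # e.g. mitochondrial sites are ignored here
        counts[key][0] += meth
        counts[key][1] += meth + unmeth
    # None signals that no covered CHH site was seen for that compartment.
    return {k: (100.0 * m / t if t else None) for k, (m, t) in counts.items()}


# Hypothetical report rows (counts invented for illustration only).
report = [
    "Chr2\t1012\t+\t1\t9\tCHH\tCCA",
    "Chr2\t1013\t+\t5\t5\tCHG\tCAG",
    "ChrC\t50\t-\t0\t20\tCHH\tCTA",
    "ChrC\t51\t+\t1\t19\tCHH\tCAT",
]
result = chh_methylation_percent(report)
print(result)
```

Pooling the counts weights each site by its coverage, so sites with zero reads, like those in the example report above, do not contribute.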
juan.santos at