This repository exists for reproducibility purposes. The data generated by this workflow powers NMDtxDB. Raw data are available from the SRA under accession PRJNA1054031. RNA-seq reads need to be pre-processed and aligned before they are used as input.
The workflow comprises two parts. The first part is a Snakemake workflow (workflow). The second part covers CDS detection and integration.
This part generates the de novo transcriptome and computes differential gene expression (DGE) and differential transcript expression (DTE).
snakemake --jobs 10 --cores 10 --profile slurm --printshellcmds --reason --use-singularity --use-conda --use-envmodules
To produce the DAG:
snakemake --rulegraph | dot -Tsvg > rulegraph.svg
This part covers CDS detection. Here is an example using sequences trimmed at the Ensembl start codon:
awk '{ print $1 "\t" $7-1 "\t" $8 "\t" $4 "\t" 1 "\t" $6; }' GRCh38.102.gtf > ref_cds.bed
Rscript cds/StartATG_to_cDNA.R ref_cds.bed
perl longorf2_fwd_v2.pl --input GRCh38.102.fa --startcodon ref_cds_cDNA.bed > ensembl_longorf2.fa
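The first step above subtracts one from the start coordinate because BED intervals are 0-based and half-open, whereas annotation coordinates are usually 1-based. The following is a minimal sketch of that conversion on a single toy record. It assumes, as the awk command implies, that column 7 of the input holds a 1-based start and column 8 the end; the record itself and the name ENST_TOY are made up for illustration and do not come from the annotation file.

```shell
# Work in a temporary directory so no real files are touched.
tmpdir=$(mktemp -d)

# Toy input: one hypothetical 8-column record (not a real annotation line).
# Only columns 1, 4, 6, 7 and 8 are read by the awk command.
printf '1\tsrc\tfeat\tENST_TOY\t.\t+\t101\t200\n' > "$tmpdir/toy_input.tsv"

# Same field selection as the command above, written with an explicit
# tab output separator: chrom, start-1, end, name, score (fixed to 1), strand.
awk 'BEGIN { OFS = "\t" } { print $1, $7 - 1, $8, $4, 1, $6 }' \
  "$tmpdir/toy_input.tsv" > "$tmpdir/toy_cds.bed"

# The 1-based start 101 becomes the 0-based BED start 100; the end is unchanged.
cat "$tmpdir/toy_cds.bed"
```

The end coordinate stays as-is because a 1-based inclusive end and a 0-based exclusive end are the same number.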
See the longorf_integration_bed12 script, which details how the multiple sources are integrated.
To retrieve the other sources:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/riboseq_orfs/data/Ribo-seq_ORFs.bed
wget https://api.openprot.org/api/2.0/HS/downloads/human-openprot-2_0-refprots+altprots+isoforms-uniprot2017_03_07.bed.zip
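Before integrating the downloaded sources, it can help to check that each file has a consistent number of columns. The following is a minimal sketch, assuming tab-separated BED files; the helper name bed_column_counts and the toy records are hypothetical and not part of this repository.

```shell
# Hypothetical helper: print "<n_columns><TAB><n_lines>" for a tab-separated
# BED file, skipping comment, track and browser header lines.
bed_column_counts() {
  awk -F'\t' '
    /^#/ || /^track/ || /^browser/ { next }
    { n[NF]++ }
    END { for (k in n) print k "\t" n[k] }
  ' "$1"
}

# Toy BED12 file with two records (made-up coordinates, for illustration only).
toy_bed=$(mktemp)
printf 'chr1\t100\t200\torfA\t0\t+\t100\t200\t0\t1\t100,\t0,\n'  > "$toy_bed"
printf 'chr1\t300\t450\torfB\t0\t-\t300\t450\t0\t1\t150,\t0,\n' >> "$toy_bed"

# One output line per distinct column count found in the file.
bed_column_counts "$toy_bed"
```

On the real downloads, the OpenProt archive has to be unzipped first. A file with mixed column counts would show up as more than one output line.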
This project is licensed under the MIT License.
This work was supported by the DFG Research Infrastructure West German Genome Center, project 407493903, as part of the Next-Generation Sequencing Competence Network, project 423957469.