rDNA clusters in the NA12878 consortium human genome assembly

Version: 0.0.9000 (20 May 2020)
Author: Maurits Evers (maurits.evers@gmail.com)

Workflow

The current workflow performs the following steps:

Build a bowtie2 index from the NA12878 assembly.
Align 500 bp long rDNA fragments, and the sequences of rDNA regions of interest (ROI), to the NA12878 assembly; both sets of sequences are derived from the U13369.1 rDNA reference sequence. Store results as sorted and indexed BAM and BED files. ROIs include the transcribed regions 5'ETS, 18S, ITS1, 5.8S, ITS2, 28S and 3'ETS, as well as a 500 bp wide region upstream of the 5'ETS (which includes the core promoter and upstream control elements).
Plot the number of fragment hits across the assembly's contigs.
Determine the loci to which the first 500 bp rDNA fragment (or the 5'ETS sequence) maps; these loci define the 5' start of an rDNA unit in the assembly, and the 3' end is defined as the 5' start of the next rDNA unit. For the last rDNA unit in a cluster, the 3' end is given by the 5' start coordinate plus the average length of the preceding rDNA units. This information defines a list of putative rDNA loci in the assembly. We use this list to extract the corresponding full rDNA unit sequences from the assembly.

Results are stored in folder 04_rDNA_copies/complete_unit.
From the rDNA ROI alignment, for every ROI extract a list with identified matches and their corresponding sequences. Results are stored in folder 04_rDNA_copies/regions.
Perform a multiple sequence alignment of the putative full rDNA sequences and rDNA ROI sequences using Clustal Omega and visualise results using MView. Results are stored in 05_msa/complete_unit and 05_msa/regions, respectively.
Show hits of transcribed rDNA regions across the assembly's contigs using pygenometracks.
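The rule for delimiting rDNA units (step 4 above) can be illustrated with a minimal Python sketch. This is not the workflow's own code, which consists of shell and R scripts. The function name `define_units`, the toy coordinates, and the simplification that all hits on one contig form a single cluster are assumptions made for illustration; strand and contig length are ignored.

```python
from statistics import mean


def define_units(starts_by_contig):
    """Turn 5' start coordinates into (contig, start, end) unit intervals.

    Assumption: all starts on one contig belong to the same rDNA cluster.
    """
    units = []
    for contig, starts in starts_by_contig.items():
        starts = sorted(starts)
        # Every unit except the last ends where the next unit begins.
        lengths = [b - a for a, b in zip(starts, starts[1:])]
        for a, b in zip(starts, starts[1:]):
            units.append((contig, a, b))
        # The last unit has no successor: extend it by the mean length of
        # the preceding units. A lone hit has no preceding unit, so this
        # sketch skips it rather than guessing a length.
        if lengths:
            last = starts[-1]
            units.append((contig, last, last + round(mean(lengths))))
    return units


# Toy coordinates, not taken from the assembly.
hits = {"contig_A": [1000, 45000, 88000], "contig_B": [500]}
for unit in define_units(hits):
    print(unit)
```

The resulting intervals could then be written to a BED file and passed to a tool such as bedtools to extract the unit sequences; how the workflow handles clusters with a single hit is not specified here.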

The current workflow consists of separate shell and R scripts in the individual main folders; for example, 01_bowtie2_ref contains a shell script do_indexing.sh that builds the bowtie2 index. A future version of the workflow will automate these steps and make them more robust using snakemake.
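As an illustration of the kind of logic behind step 3 of the workflow (number of fragment hits per contig), the following Python sketch counts BED records per contig. It is not one of the workflow's R scripts; the function name `hits_per_contig` and the toy records are hypothetical.

```python
from collections import Counter


def hits_per_contig(bed_lines):
    """Count BED records per contig (first column), skipping blanks and comments."""
    counts = Counter()
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        counts[line.split("\t")[0]] += 1
    return counts


# Toy BED records; contig and fragment names are made up for illustration.
bed = [
    "contig_A\t1000\t1500\tfrag_1",
    "contig_A\t1500\t2000\tfrag_2",
    "contig_B\t300\t800\tfrag_1",
]
for contig, n in hits_per_contig(bed).most_common():
    print(contig, n)
```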

Requirements and dependencies

Data

The NA12878 reference assembly FASTA file, stored in 00_ref_sequences/assembly/albacore_canu_wtdbg_nanopolish2.fasta.
The U13369.1 human rDNA sequence as a FASTA file, stored in 00_ref_sequences/rDNA_GenBank/U13369.1.fa.
Gzip'ed FASTA file of the fragmented rDNA in 02_rDNA_frags/rDNA_frags_len500_step500.fa.gz (see below).
Annotation file of the transcribed regions in the U13369.1 rDNA sequence as a BED file in 02_rDNA_frags/U13369.1_transcribed.bed.

Software

bedtools; tested with bedtools_2.26.0
bowtie2; tested with bowtie2_2.3.5.1
Clustal Omega; tested with clustalo_1.2.4
deepTools; tested with deeptools_3.4.3
MView; tested with mview_1.67
pygenometracks; tested with pgt_3.3
R and R/Bioconductor libraries Biostrings, gridExtra, Rsamtools, tidyverse; tested with R_3.6.1, Biostrings_2.52.0, gridExtra_2.3, Rsamtools_2.0.3, tidyverse_1.3.0
samtools; tested with samtools_1.9

The NA12878 reference assembly

The reference genome is based on the de novo assembly of the NA12878 consortium data from the MinION (Oxford Nanopore Technologies) nanopore sequencer. Full details of the data analysis and assembly process can be found

in the original publication by Jain et al., Nanopore sequencing and assembly of a human genome with ultra-long reads, Nature Biotechnology 36, 338 (2018),
in Human genome assemblies with nanopore, an update from the Genome Informatics Section, and
on the Whole Human Genome Sequencing Project GitHub project page.

The FASTA file of the Canu 1.7 + WTDBG + Nanopolish reference assembly is available from the Center for Bioinformatics and Computational Biology at the University of Maryland.

Further details and references:

The assembly consists of 1172 contigs; contig sizes (in bp) are summarised by the following statistics:

#Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
#1006    31736   244367  2405237  1951710 51216920

Koren et al., Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation, Genome Research 27, 722 (2017); latest release: Canu v2.0
WTDBG: A fuzzy Bruijn graph approach to long noisy reads assembly; Jue Ruan and Heng Li, Fast and accurate long-read assembly with wtdbg2, Nature Methods 17, 155 (2020); latest release: wtdbg2.5; requires 1 TB+ RAM
Nanopolish by Jared Simpson et al. is a

[s]oftware package for signal-level analysis of Oxford Nanopore sequencing data. Nanopolish can calculate an improved consensus sequence for a draft genome assembly, detect base modifications, call SNPs and indels with respect to a reference genome and more.
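The contig size summary shown above has the layout of R's `summary()` output. A comparable summary can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the FASTA records are toy data, and the `method="inclusive"` setting is chosen because it should correspond to R's default quantile definition (type 7).

```python
from statistics import mean, median, quantiles


def fasta_lengths(lines):
    """Return {sequence name: length in bp} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    return lengths


def six_number_summary(values):
    """Min, quartiles, mean and max, in the order R's summary() prints them."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median(values),
        "Mean": mean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


# Toy FASTA records, not the NA12878 assembly.
toy_fasta = [">c1 example", "ACGT", "AC", ">c2", "ACGTACGTAC", ">c3", "A"]
lengths = fasta_lengths(toy_fasta)
summary = six_number_summary(list(lengths.values()))
print(summary)
```

For the real assembly, the lines would come from the open FASTA file rather than from a list.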

Human ribosomal DNA sequence

We use the human ribosomal DNA complete repeating unit U13369.1 and fragment the sequence into 500 bp long non-overlapping subsequences. The workflow does not generate these files, and instead requires these files to be already present; they are generated as part of the publicly available snakemake-based workflow frag_align_rDNA.
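The fragmentation itself is a fixed-width, fixed-step windowing of the reference sequence. The following Python sketch shows the idea on a toy sequence. It is not the frag_align_rDNA implementation; the function name `fragment`, the FASTA header format, and the decision to keep a shorter trailing fragment are assumptions.

```python
def fragment(seq, length=500, step=500):
    """Yield (start, end, subsequence) windows; 0-based, half-open coordinates.

    With step == length the windows do not overlap. A trailing window
    shorter than `length` is kept (an assumption of this sketch).
    """
    for start in range(0, len(seq), step):
        end = min(start + length, len(seq))
        yield start, end, seq[start:end]


# Toy sequence of 1200 bp standing in for the rDNA reference.
toy = "ACGT" * 300
frags = list(fragment(toy))
for start, end, sub in frags:
    # Hypothetical header format using 1-based, inclusive coordinates.
    print(f">frag_{start + 1}_{end}")
    print(sub)
```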

TODO

Move workflow to snakemake

The NA12878 consortium

The consortium includes

Heng Li (Harvard Medical School and Dana-Farber Cancer Institute, formerly Broad Institute; developer of samtools and bwa) [RNA]
George Church (Harvard Medical School) [RNA]
Matthew Loose (DeepSeq, University of Nottingham) [DNA]
Nick Loman (Institute of Microbiology and Infection, University of Birmingham) [DNA]
Jared Simpson (Ontario Institute for Cancer Research) [DNA]
Adam Phillippy (Head of the Genome Informatics Section, University of Maryland) [DNA]

Copyright and licence

The workflow is distributed under the GNU GPL licence version 3. The workflow contains the MView source code as a .tar.gz file, which is distributed under the GNU GPL licence version 2 or higher.

The NA12878 data is released under the Creative Commons CC-BY licence.
