Call as:
bash run_pipeline.sh $SCRIPT_DIR $alns.bam $chrs.txt $KY_ref.fa $out_dir $genome.bed $18S.fa $45S_on_KY.bed $TR_unit.fa
$SCRIPT_DIR
directory of this code
$alns.bam
ONT aligned to chm13 passed
$chrs.txt
list of newline separated chromosomes of interest (ie. the acros)
$KY_ref.fa
path to the KY_ROT reference
$out_dir
path to directory where you want results
$genome.bed
bed file for calculating genome coverage
$18S.fa
reference of the 18S rRNA gene for estimating copy number
$45S_on_KY.bed
bed file of the 45S gene on the KY_rot reference for getting methylation of just the gene
$TR_unit.fa
the reference sequence of a canonical TR unit for estimating copy number in rdna units
General pipeline:
- makes file structure for outputs
- maps ONT reads to 45S ref (or any ref you give it), filters for 90% alignment block and 90% identity, filters out suspected chimeric reads (identified as containing inverted 45S units)
- splits up the alignments to get one alignment per file
- calls modkit on these split alignments, getting per read methylation info (also calls on all reads per category)
- summarizes data at the per group level, and the per unit level
- ordering analysis: are neighboring units methylated more similarly than random units?
- generates a per category violin plot
The outputs:
Everything is contained within the folder specified as $out_dir
.
$out_dir/alignment
contains outputs from the alignment and splitting stages.
- In the top level contains bams,beds,pafs for each group
- In
read_breakdown
contains subdirectories for each group, then for each readname - Each readname folder contains a sam and bam per 45S unit
$out_dir/get_methylation
contains outputs from running modkit
to get methylation information from alignments.
modkit_beds
contains the group level beds generated bymodkit
, as well as alogs
folderread_breakdown
is organized as in thealignment
folder, with each readname folder containing one bed file per unit, as well as alogs
folder
$out_dir/methylation_analysis
contains final summary data generated by the pipeline.
group_summary.txt
contains the group-level summary of %methylationread_summary.txt
contains the per-unit level data of %methylation (each row represents one unit on one read)output.png
contains a violin plot of group-level %methylationordering_analysis.txt
contains the results of ordering analysis