Visualising RNA feature distributions with R2Dtool

pre-mRNA edited this page Nov 8, 2024 · 3 revisions

R2Dtool Plotting Functions

R2Dtool offers several flexible and modular plotting functions to visualise the distributions of isoform-resolved RNA features around genomic landmarks. Here, we provide more information on how these functions are implemented and how they can be used.

Overview

R2Dtool offers three main plotting functions:

  1. plotMetaTranscript: Visualises the distribution of RNA features across a normalised metatranscript model.
  2. plotMetaJunction: Shows the distribution of RNA features around splice junctions (only on spliced transcripts).
  3. plotMetaCodon: Displays the distribution of RNA features around start or stop codons (only on protein-coding transcripts).

Each of these functions is implemented as a separate R script, all of which operate on the output of r2d annotate.

Note: r2d annotate must be called with the -H flag, since column headers are required to specify which column is used to filter for significant RNA features (see below).

For general usage, the plots can be generated using the r2d plot... commands from the Rust-based command-line interface, as described in the README. The source Rscript for each plot is available at ./scripts/R2_plot[type].R and can be modified and used directly if greater customisation is required.

Common concepts for R2Dtool metaplots

Plotting the density of positive features

R2Dtool metaplots show the density of RNA features along metatranscript positions: that is, the proportion of 'positive' RNA features among all features tested at a given position. The position is defined either on a scaled metatranscript model (metatranscript plots) or as a fixed distance from a transcriptomic landmark (metacodon or metajunction plots).

Binning features

To define 'positive' features, the annotated BED file must contain a numeric column, for example site stoichiometry or site p-value, from which 'positive' and 'negative' sites can be identified. The input data provided to the plots should therefore contain both 'tested' and 'significant' RNA features. The R2Dtool plotting syntax relies on both types of data being present, and uses the filtering parameters described under Filtering and Significance.

Filtering and Significance

To account for differences in the number of sites tested at different transcriptomic positions, R2Dtool normalises the number of 'positive' sites at a given position to the total number of sites tested at that position. To perform this normalisation, R2Dtool requires a way to determine which RNA features are considered 'significant'. This is done using three parameters:

  • filter_field: The name of the column in the input file used to filter significant sites.
  • cutoff: A numeric value defining the threshold for significance.
  • cutoff_type: Specifies whether values above (upper) or below (lower) the cutoff are considered significant.

These parameters allow R2Dtool to work with various types of RNA feature data. For example:

  • For RNA modification data, you might use a stoichiometry measure as the filter_field, with a cutoff of 0.5 and cutoff_type of upper to consider sites with >50% modification as significant.

  • For statistical measures, you could use a p-value as the filter_field, with a cutoff of 0.05 and cutoff_type of lower to consider sites with p < 0.05 as significant.
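The two examples above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the R code used by R2Dtool; the function name is_significant and the dictionary keys are hypothetical, and the use of strict inequalities is an assumption based on the ">50%" and "p < 0.05" wording above.

```python
def is_significant(value, cutoff, cutoff_type):
    """Classify one site value against a cutoff.

    Assumption: comparisons are strict, matching the '>50%' and 'p < 0.05'
    wording of the examples; the R scripts may treat ties differently.
    """
    if cutoff_type == "upper":
        return value > cutoff
    if cutoff_type == "lower":
        return value < cutoff
    raise ValueError("cutoff_type must be 'upper' or 'lower'")


# Hypothetical sites; the column names are illustrative only.
sites = [
    {"stoichiometry": 0.80, "p_value": 0.01},
    {"stoichiometry": 0.20, "p_value": 0.30},
]

# filter_field = stoichiometry, cutoff = 0.5, cutoff_type = upper
by_stoichiometry = [is_significant(s["stoichiometry"], 0.5, "upper") for s in sites]

# filter_field = p_value, cutoff = 0.05, cutoff_type = lower
by_p_value = [is_significant(s["p_value"], 0.05, "lower") for s in sites]
```

In both cases the first site is classed as significant and the second is not, while both sites still count as 'tested'.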

Density and Proportion

In R2Dtool plots, "density" refers to the relative abundance of a feature on a transcript, normalized for the length of the region and the number of tested sites at a given distance from a feature of interest.

The y-axis in R2Dtool plots typically shows the "proportion of significant sites", which is calculated as:

proportion = (number of significant sites) / (total number of tested sites)

This proportion is calculated for each bin or position along the x-axis.
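As a rough sketch of this per-bin calculation, assuming each site has already been assigned a bin label and a significance call (the function name proportion_by_bin and the input layout are hypothetical, not part of R2Dtool):

```python
from collections import defaultdict


def proportion_by_bin(sites):
    """Return {bin: significant / tested} for (bin_label, significant) pairs."""
    tested = defaultdict(int)
    positive = defaultdict(int)
    for bin_label, significant in sites:
        tested[bin_label] += 1          # every site counts as tested
        if significant:
            positive[bin_label] += 1    # only significant sites count as positive
    return {b: positive[b] / tested[b] for b in sorted(tested)}


# Three hypothetical sites: two in bin 0 (one significant), one in bin 1.
example = proportion_by_bin([(0, True), (0, False), (1, False)])
```

Bins with no tested sites do not appear in the result, so the division is always defined.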

Confidence Intervals

R2Dtool offers two methods for calculating and displaying confidence intervals:

  1. LOESS (Local Regression): This method uses local weighted regression to smooth the data and calculate confidence intervals. It's the default method and generally produces smoother curves.

  2. Binomial: This method calculates exact binomial confidence intervals for each point. It may be more appropriate for datasets with high variability or where precise point estimates are needed.

Users can specify the confidence interval method using the -c flag in the command-line interface.
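To illustrate what a per-position binomial interval looks like, the sketch below computes a Wilson score interval using only the Python standard library. This is a stand-in chosen for simplicity: it is an approximate interval, not the exact binomial interval described above, and it is not taken from the R2Dtool scripts. The function name wilson_interval and the default z = 1.96 (roughly 95% coverage) are assumptions.

```python
import math


def wilson_interval(positives, total, z=1.96):
    """Wilson score interval for a binomial proportion positives / total."""
    if total <= 0:
        raise ValueError("total must be a positive number of tested sites")
    p = positives / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


# 5 significant sites out of 10 tested versus 50 out of 100:
narrow_lo, narrow_hi = wilson_interval(50, 100)
wide_lo, wide_hi = wilson_interval(5, 10)
```

Both intervals are centred on 0.5, but the interval for 10 tested sites is wider than the one for 100, which is why positions with few tested sites show broader confidence bands.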

Common Parameters

All plotting functions share these common parameters:

  • input_file: Path to the annotated sites file generated by r2d annotate.
  • output_file: Path where the plot will be saved (include file extension, e.g., .png or .svg).
  • filter_field: The name of the column in the input file used to filter significant sites.
  • cutoff: Numeric value defining the threshold for significance.
  • cutoff_type: Specifies the comparison direction, either 'lower' or 'upper', to determine significance.
  • confidence_method: Strategy for displaying confidence intervals: 'loess' (default) or 'binomial'.
  • save_table: (Optional) Path to save the aggregated data as a tab-separated file, for cases where the source data behind the plot are required.

plotMetaTranscript

This function generates a plot showing the distribution of RNA features across a normalized transcript model.

Usage

r2d plotMetaTranscript -i <input_file> -o <output_file> -f <filter_field> -u <cutoff> -t <cutoff_type> [-c <confidence_method>] [-s <save_table>] [-l]

Additional Parameters

  • -l: Display transcript region labels (5' UTR, CDS, 3'UTR) on the plot.

Plot Description

The x-axis represents the relative position along the transcript, normalized to a scale of 0-3, where:

  • 0-1: 5' UTR
  • 1-2: CDS (Coding Sequence)
  • 2-3: 3' UTR

The y-axis shows the proportion of significant sites at each position.

Calculation Method

  1. Each RNA feature is assigned to a metatranscript region (5' UTR, CDS, or 3' UTR) based on the transcript model annotated for the isoform to which the feature is mapped.
  2. The relative position of the feature in the metatranscript region is calculated by comparing the position of the feature to the length of the metatranscript region.
  3. The script bins the normalized transcript positions into intervals.
  4. For each interval, it calculates the ratio of significant sites to total sites.
  5. A smoothed line is drawn using either LOESS regression or binomial confidence intervals.
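Steps 1 and 2 can be sketched as follows. This is a minimal illustration, assuming a 0-based transcript coordinate and a half-open CDS interval [cds_start, cds_end); the function and argument names are hypothetical and do not correspond to r2d annotate column names.

```python
def metatranscript_coordinate(pos, cds_start, cds_end, tx_len):
    """Map a transcript position onto the 0-3 metatranscript scale.

    Assumptions: pos is 0-based, the CDS spans [cds_start, cds_end),
    and the transcript has length tx_len.
    """
    if not 0 <= pos < tx_len:
        raise ValueError("pos must lie within the transcript")
    if pos < cds_start:                     # 5' UTR maps to 0-1
        return pos / cds_start
    if pos < cds_end:                       # CDS maps to 1-2
        return 1 + (pos - cds_start) / (cds_end - cds_start)
    return 2 + (pos - cds_end) / (tx_len - cds_end)   # 3' UTR maps to 2-3


# Hypothetical 1000 nt isoform with a CDS from 100 to 400.
utr5 = metatranscript_coordinate(50, 100, 400, 1000)
cds = metatranscript_coordinate(250, 100, 400, 1000)
utr3 = metatranscript_coordinate(700, 100, 400, 1000)
```

Each site lands at the midpoint of its region here (0.5, 1.5 and 2.5), regardless of how long that region is in the isoform it was mapped to.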

Interpretation

This plot allows you to visualize how RNA features are distributed across different regions of transcripts. For example, you might observe enrichment of modifications in the 3' UTR or depletion near the start codon. The isoform-aware nature of R2Dtool ensures that features are correctly placed based on the specific isoform they were mapped to, providing a more accurate representation than methods that use a single representative transcript per gene.

plotMetaJunction

This function creates a plot showing the distribution of RNA features around splice junctions. By default, the plot shows the proportion of significant features when junctions are at specific distances from sites, i.e. the sites are placed at x = 0 and the positions of the junctions are indicated on the x-axis.

Usage

r2d plotMetaJunction -i <input_file> -o <output_file> -f <filter_field> -u <cutoff> -t <cutoff_type> [-c <confidence_method>] [-s <save_table>] [-r]

The -r flag reverses the x-axis, so the positions of junctions are shown at x = 0, and the distance to the nearest m6A sites is indicated on the x-axis. An example is shown on the README page.

Plot Description

The x-axis represents the distance from the nearest splice junction, with negative values indicating upstream positions and positive values indicating downstream positions.

The y-axis shows the proportion of significant sites at each position relative to the splice junctions.

Calculation Method

  1. The script calculates the distance of each site to its nearest upstream and downstream splice junctions, considering the specific isoform the feature is mapped to.
  2. It then bins these distances and calculates the ratio of significant sites in each bin.
  3. A smoothed line is drawn using either LOESS regression or binomial confidence intervals.
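Step 1 can be sketched as a lookup in a sorted list of junction positions for one isoform. This is an illustrative sketch only: the function name junction_distances is hypothetical, distances are returned unsigned, and treating a site that sits exactly on a junction as having that junction upstream at distance 0 is an assumption.

```python
import bisect


def junction_distances(pos, junctions):
    """Return (upstream_distance, downstream_distance) for one site.

    junctions: sorted transcript coordinates of the isoform's splice junctions.
    A distance is None when no junction exists on that side.
    """
    i = bisect.bisect_right(junctions, pos)   # index of first junction > pos
    upstream = pos - junctions[i - 1] if i > 0 else None
    downstream = junctions[i] - pos if i < len(junctions) else None
    return upstream, downstream


# Hypothetical isoform with junctions at transcript positions 100, 300 and 450.
junctions = [100, 300, 450]
inner = junction_distances(150, junctions)
first_exon = junction_distances(50, junctions)
last_exon = junction_distances(500, junctions)
```

A site in the first exon has no upstream junction and a site in the last exon has no downstream junction, so such sites contribute to only one side of the plot.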

Interpretation

This plot helps visualize how RNA features are distributed around splice junctions. It can reveal patterns such as depletion or enrichment of features near splice sites, which might indicate roles in splicing regulation or be a consequence of the splicing process. The isoform-aware approach ensures that the distances are calculated based on the actual splice junctions present in the isoform where each feature was detected.

plotMetaCodon

This function generates a plot showing the distribution of RNA features around start or stop codons.

Usage

r2d plotMetaCodon -i <input_file> -o <output_file> -f <filter_field> -u <cutoff> -t <cutoff_type> [-c <confidence_method>] [-s <save_table>] (-s | -e)

Additional Parameters

  • -s: Plot distribution around the start codon
  • -e: Plot distribution around the stop codon

Plot Description

The x-axis represents the distance from the start or stop codon, with the codon position at 0.

The y-axis shows the proportion of significant sites at each position relative to the codon.

Calculation Method

  1. The script calculates the distance of each site to the start or stop codon of the specific isoform it's mapped to.
  2. It then bins these distances and calculates the ratio of significant sites in each bin.
  3. A smoothed line is drawn using either LOESS regression or binomial confidence intervals.
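The first two steps can be sketched as below, using one-nucleotide bins. This is a rough illustration rather than the R2Dtool implementation: the function name codon_profile, the input tuple layout and the window size of 100 nt are all assumptions.

```python
from collections import Counter


def codon_profile(sites, window=100):
    """Return {distance: proportion significant} around a codon.

    sites: iterable of (site_pos, codon_pos, significant) tuples, with both
    positions given in the coordinates of the isoform the site is mapped to.
    Negative distances are upstream of the codon, positive are downstream.
    """
    tested = Counter()
    positive = Counter()
    for site_pos, codon_pos, significant in sites:
        distance = site_pos - codon_pos
        if abs(distance) > window:
            continue                      # outside the plotted window
        tested[distance] += 1
        positive[distance] += int(significant)
    return {d: positive[d] / tested[d] for d in sorted(tested)}


# Four hypothetical sites on isoforms whose codons lie at different positions.
profile = codon_profile([
    (105, 100, True),    # 5 nt downstream, significant
    (205, 200, False),   # 5 nt downstream, not significant
    (90, 100, True),     # 10 nt upstream, significant
    (1000, 100, True),   # 900 nt away, outside the window
])
```

Sites from different isoforms are pooled by their distance to their own codon, so two sites 5 nt downstream of different codons fall into the same position and give a proportion of 0.5.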

Interpretation

This plot allows you to examine how RNA features are distributed relative to start or stop codons. It can reveal patterns such as enrichment of modifications near the stop codon, which might suggest roles in translation termination or mRNA stability. The isoform-aware approach ensures that the distances are calculated based on the actual start or stop codon positions in the isoform where each feature was detected.

Rust Implementation and R Scripts

While the plotting functions are implemented as R scripts, they can be easily called using the R2Dtool Rust-based command-line interface. The Rust code handles parameter parsing and calls the appropriate R script with the correct arguments.

The R scripts use the ggplot2 library to generate high-quality, publication-ready plots. They also handle data aggregation and statistical calculations.

This design allows for easy integration of the plotting functions into larger bioinformatics pipelines while leveraging the powerful plotting capabilities of R and ggplot2.

Isoform-specific metaplots using R2Dtool

  1. Isoform-Aware Analysis: Unlike methods that use a single representative transcript per gene, R2Dtool considers the specific isoform each feature is mapped to. This provides a more accurate representation of feature distributions, especially for genes with multiple isoforms.

  2. Flexibility: The filtering approach allows R2Dtool to work with various types of RNA feature data, whether it's based on stoichiometry, statistical significance, or other measures.

  3. Comprehensive Visualization: By providing multiple plot types (metatranscript, metajunction, metacodon), R2Dtool allows researchers to examine RNA feature distributions from different perspectives.

  4. Statistical Rigor: The inclusion of confidence intervals (either through LOESS or binomial methods) provides a measure of uncertainty in the observed patterns.

  5. Integration with Bioinformatics Pipelines: The command-line interface and option to save data tables make it easy to incorporate R2Dtool plots into larger analysis workflows.
