syseitz/predTED

predTED

Calculating the pairwise Tree Edit Distance (TED) between RNA structures is time-consuming, and the cost grows quickly with the number of structures because every pair has to be compared. I have therefore implemented a method that predicts an approximate TED from features of the Dot Bracket Notation. The predictions can be used to prefilter the data, so that the precise calculation only has to be run afterwards on the pairs that need it.

Installation

  1. Create the Conda environment:

    • If you have Conda installed, use:
      conda env create -f environment.yml --yes
    • If you have Mamba installed, use:
      mamba env create -f environment.yml --yes
  2. Activate the environment:

    conda activate predTED

Usage

Run the prediction model

C Version

  1. Compile the C code:
    • Build LightGBM as a static library:
       git clone --recursive https://github.com/microsoft/LightGBM.git
       cd LightGBM
       mkdir build
       cd build
       cmake -DBUILD_STATIC_LIB=ON ..
       make -j
       cd ../..
    • Run:
      # xxd -i model.txt > model.h  # converts the model into a C constant for predTED.c
      g++ -O3 -march=native -mtune=native -ffast-math -funroll-loops -fopenmp \
        -DNDEBUG -I LightGBM/include predTED.c LightGBM/lib_lightgbm.a -lm -o predTED
    • Move it to the bin folder:
      cp predTED ~/.local/bin/predTED
  2. Run the prediction:
    • Execute the compiled program with two RNA structures in Dot Bracket Notation:
      ./predTED "((..))" "(()).."
    • This will output the predicted Tree Edit Distance.

Using predTED as a Python Library

To utilise predTED as a Python library, follow these steps:

  1. Ensure the environment is activated:

    • If not already activated, run:
      conda activate predTED
  2. Import the predTED module:

    • In your Python script, import the module:
      import predTED
  3. Prepare your RNA structures:

    • Define your RNA structures in Dot Bracket Notation as strings.
  4. Call the predict_TED function:

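    • Call predict_TED with the two structures, the feature weights, the number of weights, and the intercept. The example assumes that weights, num_weights, and intercept have already been defined.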
    • Example:
      struct1 = "((..))"
      struct2 = "(()).."
      predicted_ted = predTED.predict_TED(struct1, struct2, weights, num_weights, intercept)
      print(f"Predicted TED: {predicted_ted}")
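
Since the predictions are meant to prefilter pairs before the precise calculation, a workflow along the following lines is conceivable. This is a minimal sketch and not part of predTED: `prefilter_pairs`, `threshold`, and the stand-in predictor `length_gap` are hypothetical names introduced for illustration. In practice `predict` would presumably wrap `predTED.predict_TED` with your own weights and intercept.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def prefilter_pairs(
    structures: Sequence[str],
    predict: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with i < j whose predicted TED is <= threshold."""
    kept = []
    for (i, a), (j, b) in combinations(enumerate(structures), 2):
        if predict(a, b) <= threshold:
            kept.append((i, j))
    return kept


def length_gap(a: str, b: str) -> float:
    """Placeholder predictor (NOT predTED): absolute difference in length."""
    return float(abs(len(a) - len(b)))


structures = ["((..))", "(())..", "((((....))))"]
# With predTED available, the predictor could instead be (assumption):
#   predict = lambda a, b: predTED.predict_TED(a, b, weights, num_weights, intercept)
candidates = prefilter_pairs(structures, length_gap, threshold=2)
print(candidates)  # prints [(0, 1)]
```

Only the pairs in `candidates` would then be passed on to the precise TED calculation. The threshold is a tuning choice that trades speed against the risk of discarding pairs whose true distance is small.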

Visualisation

Below is a plot showing the relationship between predicted and true Tree Edit Distances:

[Figure: TED Prediction Plot]

This plot illustrates the accuracy of the prediction model by comparing predicted TED values to the actual TED values.
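
The agreement that such a plot shows can also be summarised numerically. The sketch below computes the mean absolute error and the Pearson correlation between true and predicted values. Both metrics and the function names are choices made for this illustration; the README does not state which metrics, if any, were used to evaluate the model.

```python
import math
from typing import Sequence


def mean_absolute_error(true: Sequence[float], pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted values."""
    if len(true) != len(pred) or len(true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)
```

Given a list of true TED values and the matching predictions, `mean_absolute_error(true_ted, predicted_ted)` reports the typical size of the error, while `pearson_r(true_ted, predicted_ted)` reports how closely the points follow a straight line.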

Creating your own feature weights

  1. Prepare your data:

    • Ensure you have the RNA structures in a file named structures.txt, with one structure per line in Dot Bracket Notation.
    • Ensure you have the pairwise Tree Edit Distance matrix in a file named ted_matrix.txt, with space-separated values.
  2. Run the script:

    • Execute the main script to compute and save the feature weights:
      python compute_feature_weights.py
    • The script will compute the feature weights and save them in feature_weights.json, sorted by the absolute value of their weights.
  3. Interpret the results:

    • The feature_weights.json file contains the weights for each feature, indicating their importance in predicting the Tree Edit Distance.
    • Features with larger absolute weights are more important for the prediction.
  4. Generate model.h:

    • To make the model available for predTED, convert it into a C constant:
      xxd -i model.txt > model.h
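
Before running the script in step 2, the two input files from step 1 can be checked for consistency. The sketch below is an illustration and not part of the repository. The helper names are hypothetical, and it assumes that row and column i of ted_matrix.txt correspond to line i of structures.txt and that the matrix is symmetric with a zero diagonal.

```python
from pathlib import Path
from typing import List


def load_structures(path: str) -> List[str]:
    """Read one Dot Bracket structure per non-empty line."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def load_ted_matrix(path: str) -> List[List[float]]:
    """Read a matrix of space-separated values, one row per non-empty line."""
    lines = Path(path).read_text().splitlines()
    return [[float(v) for v in line.split()] for line in lines if line.strip()]


def check_inputs(structures: List[str], matrix: List[List[float]]) -> None:
    """Raise ValueError if the matrix does not fit the list of structures."""
    n = len(structures)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError(f"expected a {n}x{n} matrix for {n} structures")
    for i in range(n):
        # Assumed property: a structure has distance 0 to itself.
        if matrix[i][i] != 0:
            raise ValueError(f"non-zero diagonal entry in row {i}")
        # Assumed property: the distance does not depend on argument order.
        for j in range(i):
            if matrix[i][j] != matrix[j][i]:
                raise ValueError(f"matrix is not symmetric at ({i}, {j})")


# Typical use with the file names from step 1:
#   structures = load_structures("structures.txt")
#   matrix = load_ted_matrix("ted_matrix.txt")
#   check_inputs(structures, matrix)
```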

Features

The following features are computed for each RNA structure:

  • internal_loops: Number of internal loops.
  • var_depth_paired: Variance of the depth of paired bases.
  • multiloops: Number of multiloops.
  • max_loop: Size of the largest loop.
  • length: Length of the structure.
  • Many more features beyond those listed here.
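
To make the listed features concrete, the sketch below derives a few of them from a Dot Bracket string. The definitions are assumptions made for illustration and may differ from those used in predTED: depth is the number of enclosing base pairs including the pair itself, the size of a loop is its number of unpaired bases, an internal loop has exactly one enclosed helix and at least one unpaired base (so bulges count), and a multiloop has two or more enclosed helices. Unpaired bases outside all pairs are ignored. The function names are hypothetical.

```python
from statistics import pvariance
from typing import Dict, List


def pair_table(structure: str) -> List[int]:
    """Map each position to its pairing partner, or -1 if unpaired."""
    partner = [-1] * len(structure)
    stack: List[int] = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            partner[i], partner[j] = j, i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return partner


def structure_features(structure: str) -> Dict[str, float]:
    """Compute a handful of features under the assumed definitions above."""
    partner = pair_table(structure)
    depth = 0
    paired_depths: List[int] = []
    internal_loops = multiloops = max_loop = 0
    for i, ch in enumerate(structure):
        if ch == "(":
            depth += 1
            paired_depths.append(depth)
            # Walk the loop closed by the pair (i, partner[i]).
            branches = unpaired = 0
            k = i + 1
            while k < partner[i]:
                if partner[k] > k:  # enclosed helix: count it, then skip over it
                    branches += 1
                    k = partner[k] + 1
                else:  # unpaired base directly inside this loop
                    unpaired += 1
                    k += 1
            max_loop = max(max_loop, unpaired)
            if branches == 1 and unpaired > 0:
                internal_loops += 1
            elif branches >= 2:
                multiloops += 1
        elif ch == ")":
            paired_depths.append(depth)
            depth -= 1
    return {
        "length": len(structure),
        "var_depth_paired": pvariance(paired_depths) if paired_depths else 0.0,
        "internal_loops": internal_loops,
        "multiloops": multiloops,
        "max_loop": max_loop,
    }
```

For example, `structure_features("(.(..).)")` counts one internal loop under these definitions, because the outer pair encloses a single helix flanked by two unpaired bases.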

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions or issues, please open an issue on the GitHub repository.
