Calculating pairwise Tree Edit Distance (TED) for RNA structures can be very time-consuming, especially with a large number of different structures. Therefore, I have implemented a method to make an approximate prediction of the Tree Edit Distance using various features of the Dot Bracket Notation. This approach aims to accelerate computations by prefiltering the data for later precise calculations.
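The prefiltering idea can be sketched as follows. This is a minimal illustration, not the project's implementation: `prefilter_pairs` and `length_difference` are hypothetical names, and the stand-in predictor simply compares lengths where the trained model would normally be called. Only the pairs that pass the cheap check would be handed to an exact TED computation.

```python
from itertools import combinations


def prefilter_pairs(structures, predict, threshold):
    """Return index pairs (i, j) whose predicted distance is <= threshold."""
    structures = list(structures)
    return [
        (i, j)
        for i, j in combinations(range(len(structures)), 2)
        if predict(structures[i], structures[j]) <= threshold
    ]


def length_difference(a, b):
    # Hypothetical stand-in predictor, NOT the predTED model:
    # it only compares the lengths of the two structures.
    return float(abs(len(a) - len(b)))


structures = ["((..))", "(())..", "((((....))))"]
candidates = prefilter_pairs(structures, length_difference, threshold=2)
print(candidates)  # pairs that would go on to the exact TED computation
```

The threshold trades recall for speed: a larger value keeps more pairs for the exact calculation, a smaller value discards more of them early.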
- Create the Conda environment:
- If you have Conda installed, use:
conda env create -f environment.yml --yes
- If you have Mamba installed, use:
mamba env create -f environment.yml --yes
- Activate the environment:
conda activate predTED
- Compile the C code:
- Install LightGBM statically:
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
mkdir build
cd build
cmake -DBUILD_STATIC_LIB=ON ..
make -j
cd ../..
- Run:
#xxd -i model.txt > model.h # to compile the model for predTED.c
g++ -O3 -march=native -mtune=native -ffast-math -funroll-loops -fopenmp \
- Move it to the bin folder:
cp predTED ~/.local/bin/predTED
- Run the prediction:
- Execute the compiled program with two RNA structures in Dot Bracket Notation:
./predTED "((..))" "(()).."
- This will output the predicted Tree Edit Distance.
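Before passing structures to the program, it can help to check that they are well-formed. The sketch below is a hypothetical helper (`is_valid_dot_bracket` is not part of predTED) and assumes the plain notation with only `(`, `)` and `.`, without the extra bracket types sometimes used for pseudoknots.

```python
def is_valid_dot_bracket(structure):
    """Return True if the string uses only '(', ')' and '.' and is balanced."""
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing bracket without a matching opening one
                return False
        elif ch != ".":  # any other character is rejected (assumption)
            return False
    return depth == 0  # every opening bracket must have been closed


for s in ["((..))", "(())..", "(()", ")("]:
    print(s, is_valid_dot_bracket(s))
```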
To utilise predTED as a Python library, follow these steps:
- Ensure the environment is activated:
- If not already activated, run:
conda activate predTED
- Import the predTED module:
- In your Python script, import the module:
import predTED
- Prepare your RNA structures:
- Define your RNA structures in Dot Bracket Notation as strings.
- Call the predict_TED function:
- Use the predict_TED function with the structures, weights, number of weights, and intercept.
- Example:
# weights, num_weights, and intercept must already be defined
struct1 = "((..))"
struct2 = "(()).."
predicted_ted = predTED.predict_TED(struct1, struct2, weights, num_weights, intercept)
print(f"Predicted TED: {predicted_ted}")
Below is a plot showing the relationship between predicted and true Tree Edit Distances:
This plot illustrates the accuracy of the prediction model by comparing predicted TED values to the actual TED values.
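The same comparison can be summarised numerically. The sketch below computes the mean absolute error and the Pearson correlation between predicted and true values using only the standard library. The helper names are hypothetical, and the numbers in the example are made-up toy values chosen to show the calculation, not results from the model.

```python
import math


def mean_absolute_error(predicted, true):
    """Average absolute difference between predicted and true values."""
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Toy values for illustration only (not measured results).
predicted = [1.0, 2.0, 3.0]
true = [2.0, 4.0, 6.0]
print(mean_absolute_error(predicted, true))
print(pearson_r(predicted, true))
```

A high correlation with a large mean absolute error, as in the toy values, would indicate that the prediction ranks pairs well but is off in scale, which may still be acceptable for prefiltering.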
- Prepare your data:
- Ensure you have the RNA structures in a file named structures.txt, with one structure per line in Dot Bracket Notation.
- Ensure you have the pairwise Tree Edit Distance matrix in a file named ted_matrix.txt, with space-separated values.
- Run the script:
- Execute the main script to compute and save the feature weights:
python compute_feature_weights.py
- The script will compute the feature weights and save them in feature_weights.json, sorted by the absolute value of their weights.
- Interpret the results:
- The feature_weights.json file contains the weights for each feature, indicating their importance in predicting the Tree Edit Distance.
- Features with larger absolute weights are more important for the prediction.
- Generate model.h: To make the model available for predTED, convert it into a C constant:
xxd -i model.txt > model.h
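On systems where `xxd` is not available, a header in roughly the same shape can be produced with a few lines of Python. This is a sketch, not part of the repository: `to_c_header` is a hypothetical helper that mirrors the usual `xxd -i` layout (an `unsigned char` array plus an `unsigned int` length), with the array name derived from the file name.

```python
import re
from pathlib import Path


def to_c_header(path):
    """Render a file as a C byte array, similar in shape to `xxd -i` output."""
    data = Path(path).read_bytes()
    # Derive the identifier from the file name, e.g. model.txt -> model_txt.
    name = re.sub(r"[^0-9A-Za-z]", "_", Path(path).name)
    # Twelve bytes per row, written as two-digit lowercase hex literals.
    rows = [
        "  " + ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
        for i in range(0, len(data), 12)
    ]
    body = ",\n".join(rows)
    return (
        f"unsigned char {name}[] = {{\n{body}\n}};\n"
        f"unsigned int {name}_len = {len(data)};\n"
    )
```

It could be used as `Path("model.h").write_text(to_c_header("model.txt"))`. The identifiers that predTED.c expects are not stated here, so check them against the source before relying on this output.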
The following features are computed for each RNA structure:
- internal_loops: Number of internal loops.
- var_depth_paired: Variance of the depth of paired bases.
- multiloops: Number of multiloops.
- max_loop: Size of the largest loop.
- length: Length of the structure.
- many more
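To make two of these features concrete, here is a minimal sketch of how `length` and `var_depth_paired` could be computed from a dot-bracket string. It rests on assumptions that may differ from the project's implementation: the depth of a paired base is taken to be the number of base pairs enclosing it (counting its own pair), and the variance is the population variance. The function names are hypothetical.

```python
def paired_depths(structure):
    """Nesting depth of every paired base, for both '(' and ')' positions."""
    depths = []
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
            depths.append(depth)
        elif ch == ")":
            depths.append(depth)
            depth -= 1
    return depths


def simple_features(structure):
    """Compute length and var_depth_paired under the assumptions stated above."""
    depths = paired_depths(structure)
    if depths:
        mean = sum(depths) / len(depths)
        variance = sum((d - mean) ** 2 for d in depths) / len(depths)
    else:
        variance = 0.0  # no paired bases: define the variance as zero (assumption)
    return {"length": len(structure), "var_depth_paired": variance}


print(simple_features("((..))"))  # {'length': 6, 'var_depth_paired': 0.25}
```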
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or issues, please open an issue on the GitHub repository.
