Original data and code of RFLDA algorithm

seyedrezamirkhani/RFLDA

 
 

RFLDA

Introduction

This project contains the code and data for the paper "A random forest based computational model for predicting novel lncRNA-disease associations" by Yao, Dengju, et al.

The original data and code of the RFLDA algorithm are available in code & data.zip.

The data and code have been re-organised as follows:

  • Data resides in the input_data folder. These are the original Excel files extracted from code & data.zip

  • Source code resides in the src folder

  • Output generated by RFLDA resides in the output_data folder

  • Results of code optimisation tests reside in optimisation_data

R files

| File | Description |
| --- | --- |
| RFLDA.R | The original RFLDA.txt, renamed to RFLDA.R and optimised. For more information see the RFLDA code changes section. This file only declares the functions; it does not call them. |
| test_RFLDA.R | Executes the functions in RFLDA.R in order and records the execution time in optimisation_data/RFLDA_result.csv |
| common_functions.R | Shared functions used by the optimisation test scripts |
| compare_excel_libraries_read.R | Compares the read speed of the openxlsx and readxl packages and records the results in optimisation_data/excel_read_result.csv |
| compare_excel_libraries_write.R | Compares the write speed of the openxlsx and writexl packages and records the results in optimisation_data/excel_write_result.csv |
| compare_parquet_and_excel_libraries.R | Compares the write speed of the openxlsx, writexl and arrow packages and records the results in optimisation_data/parquet_and_excel_write_result.csv |
| compare_randomforest_libraries.R | Compares the training time of the randomForest and ranger packages and records the results in optimisation_data/randomforest_comparison_result.csv |
| optimise_nested_loop.R | Compares joining two datasets using the original nested-loop code against the sqldf package. Warning: the nested loop takes a long time! Records the results in optimisation_data/optimise_nested_loop.csv |
| save_main_R_package_versions.R | Writes the names and versions of the R packages that were used for code optimisation to main_R_package_versions.txt |
| save_R_package_versions.R | Writes the names and versions of all packages in the R environment to R_package_versions.txt |
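
The timing pattern described for test_RFLDA.R and the comparison scripts can be sketched roughly as follows. This is a minimal illustration and not the repository's code: the helper names `time_step` and `record_timings`, the step names, and the CSV column names are all assumptions made for the example.

```r
# Hypothetical helper: time one zero-argument function and return a one-row data frame.
time_step <- function(name, fn) {
  elapsed <- system.time(fn())[["elapsed"]]
  data.frame(step = name, seconds = elapsed, stringsAsFactors = FALSE)
}

# Hypothetical helper: run the named steps in order and write the timings to a CSV file.
record_timings <- function(steps, csv_path) {
  rows <- lapply(names(steps), function(nm) time_step(nm, steps[[nm]]))
  results <- do.call(rbind, rows)
  write.csv(results, csv_path, row.names = FALSE)
  invisible(results)
}

# Illustrative steps only; the real scripts time the RFLDA functions instead.
example_steps <- list(
  build_vector = function() sum(sqrt(seq_len(1e5))),
  sort_values = function() sort(runif(1e5))
)
timings <- record_timings(example_steps, tempfile(fileext = ".csv"))
print(timings)
```

Writing one row per step keeps the CSV easy to compare across runs, which is presumably why the scripts record their results this way.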

RFLDA code changes

The following changes have been made to the original code:

1 - The original code fails to write the LDA object to disk and read it back. This is resolved by converting LDA to a data.frame before saving it. Possibly the code worked with an earlier version of openxlsx, and later changes to that package removed support for writing a matrix to Excel.

2 - The openxlsx package is very slow at writing xlsx files. Additionally, the file lncRNA-disease-ALL.xlsx cannot be opened with LibreOffice Calc. To resolve these issues, writexl is used instead.

3 - The original code converted LDA into a matrix, which is a bug because this data frame contains two columns of text: an R matrix holds a single type, so the conversion coerces every column to character.

4 - Changed the generation of labels for LDExcl0 to use sqldf instead of nested loops, reducing the time taken from approximately 10 hours to 1 minute.

5 - Switched from the randomForest package to ranger, as it supports the use of multiple processor cores.

6 - Added support for Parquet files, which are compact and fast to read.
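
To illustrate why replacing nested loops with a join (change 4) helps, here is a minimal sketch on toy data. It makes several assumptions: the data frames `pairs` and `known`, their column names, and the helper names are hypothetical and are not taken from RFLDA.R, and base R `merge` is swapped in for sqldf so the example runs without extra packages. The idea is the same in both cases: a left join against the known associations replaces the pairwise comparison.

```r
# Hypothetical toy data; the column names are illustrative only.
pairs <- expand.grid(
  lncRNA = c("L1", "L2", "L3"),
  disease = c("D1", "D2"),
  stringsAsFactors = FALSE
)
known <- data.frame(
  lncRNA = c("L1", "L3"),
  disease = c("D2", "D1"),
  stringsAsFactors = FALSE
)

# Nested-loop labelling: nrow(pairs) * nrow(known) comparisons.
label_loop <- function(pairs, known) {
  label <- integer(nrow(pairs))
  for (i in seq_len(nrow(pairs))) {
    for (j in seq_len(nrow(known))) {
      if (pairs$lncRNA[i] == known$lncRNA[j] &&
        pairs$disease[i] == known$disease[j]) {
        label[i] <- 1L
      }
    }
  }
  label
}

# Join-based labelling: a left join marks known pairs with 1 and the rest with 0.
# Assumes `known` has no duplicate (lncRNA, disease) rows.
label_join <- function(pairs, known) {
  known$label <- 1L
  pairs$row_id <- seq_len(nrow(pairs))
  joined <- merge(pairs, known, by = c("lncRNA", "disease"), all.x = TRUE)
  joined$label[is.na(joined$label)] <- 0L
  # merge() reorders rows by key, so restore the original row order.
  joined$label[order(joined$row_id)]
}

stopifnot(identical(label_loop(pairs, known), label_join(pairs, known)))
```

With sqldf, the join would be written as a SQL `LEFT JOIN` over the same two data frames; the exact query used in the repository is not reproduced here.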

Running the R files

You can run the R files directly in RStudio or from the command line using the Rscript utility. Make sure the working directory is set (using the setwd function) to the src folder containing the R files. This ensures that the input files are found and the output files are placed in the expected location.
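
One way to make the working-directory requirement harder to forget is a small wrapper that switches to the src folder, runs a script, and switches back. This is only a sketch: the helper name `run_from_src` is hypothetical and the path in the usage comment is a placeholder for wherever the repository is checked out.

```r
# Hypothetical helper: run a script with the working directory set to src_dir,
# then restore the previous working directory even if the script fails.
run_from_src <- function(script, src_dir) {
  old_wd <- setwd(src_dir)
  on.exit(setwd(old_wd), add = TRUE)
  source(script)
}

# Usage (placeholder path, adjust to your checkout):
# run_from_src("test_RFLDA.R", "path/to/RFLDA/src")
```

Because relative paths inside the scripts are resolved against the working directory, running them from anywhere else would make the input and output folders resolve to the wrong place.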

Technology Stack

Operating System

  • Primary Development OS: Ubuntu 24.04 LTS
  • Compatible OS: Windows 10, macOS 11.0 (Big Sur)

Programming Languages

  • R (version 4.4.1)
  • Bash (5.2.21)

R Package info

The full list of R packages that were installed on the development machine can be found in R_package_versions.txt.

The main R packages are:

| Package | Version |
| --- | --- |
| arrow | 16.1.0 |
| diffdf | 1.0.4 |
| openxlsx | 4.2.5.2 |
| randomForest | 4.7-1.1 |
| ranger | 0.16.0 |
| sqldf | 0.4-11 |
| writexl | 1.5.0 |
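
To check a local environment against the versions listed above, something like the following sketch could be used. The helper names `installed_version` and `compare_versions` are hypothetical and are not part of the repository. Versions are compared as parsed version objects, not as strings, because R prints a version such as 4.7-1.1 as 4.7.1.1.

```r
# Versions of the main packages as listed in this README.
expected <- c(
  arrow = "16.1.0", diffdf = "1.0.4", openxlsx = "4.2.5.2",
  randomForest = "4.7-1.1", ranger = "0.16.0", sqldf = "0.4-11",
  writexl = "1.5.0"
)

# Hypothetical helper: installed version as text, or NA if the package is missing.
installed_version <- function(pkg) {
  if (nzchar(system.file(package = pkg))) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# Hypothetical helper: one row per package, flagging whether the versions agree.
compare_versions <- function(expected) {
  installed <- vapply(names(expected), installed_version, character(1))
  matches <- vapply(seq_along(expected), function(k) {
    !is.na(installed[[k]]) &&
      package_version(installed[[k]]) == package_version(expected[[k]])
  }, logical(1))
  data.frame(
    package = names(expected),
    expected = unname(expected),
    installed = unname(installed),
    matches = matches,
    stringsAsFactors = FALSE
  )
}

print(compare_versions(expected))
```

A mismatch does not necessarily mean the code will fail; it only indicates that the timings and behaviour reported here were obtained with different package versions.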

Code Formatting

The styler package is used to format the R files created in this project. A pre-commit git hook is provided to automate this process. Make sure that styler is installed in your R environment, then install the hook using the ./install-hooks.sh Bash script.
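
As a rough illustration of what the formatting step amounts to, the sketch below calls styler on a folder from within R. It is not the repository's pre-commit hook, whose contents are not shown here, and the helper name `format_r_files` is hypothetical.

```r
# Hypothetical helper: format all R files under `dir` with styler's default style.
format_r_files <- function(dir = ".") {
  if (!requireNamespace("styler", quietly = TRUE)) {
    stop("The styler package is not installed; try install.packages(\"styler\").")
  }
  # style_dir() rewrites the files in place, so commit or back up first.
  styler::style_dir(dir)
}

# Usage, assuming the working directory is the repository root:
# format_r_files("src")
```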
