miRACLe: improving the prediction of miRNA-mRNA interactions by a random contact model

miRACLe (miRNA Analysis by a Contact modeL) is a newly developed miRNA target prediction tool. It combines genome-wide expression profiles and the cumulative weighted context++ score from TargetScan in a random contact model, and then infers miRNA-mRNA interactions (MMIs) by the relative probability of effective contacts. Evaluation by a variety of measures shows that miRACLe consistently outperforms state-of-the-art methods in prediction accuracy, regulatory potential and biological relevance, while having the distinct feature of inferring individual-specific miRNA targets. Empirical tests suggest that on a laptop with a 2.30 GHz Intel Core i7-4712HQ CPU and 16 GB of RAM, our source code implementation requires less than 30 seconds of CPU time to complete the prediction for one sample. Importantly, we show that our model can also be applied to other sequence-based algorithms, such as DIANA-microT-CDS, miRanda-mirSVR and MirTarget4, to improve their predictive power.

2. Executing miRACLe

2.1 Files required

To run the current version of miRACLe, users should provide two data files that describe the expression levels of each miRNA and mRNA for the same samples, plus one additional file that defines the correspondence of samples between the miRNA and mRNA data files. All files are tab-delimited ASCII text files and must comply with the following specifications:

Input miRNA expression file is organized as follows:

miRNA	TCGA-05-4384-01A-01T-1754-13	TCGA-05-4390-01A-02T-1754-13	TCGA-05-4396-01A-21H-1857-13	TCGA-50-5066-01A-01T-1627-13
hsa-let-7a-5p	19.0144	16.2421	19.2817	18.0721
hsa-let-7a-3p	7.31298	6.2094	7.8392	6.2667
hsa-let-7a-2-3p	6.5235	5.4594	3.7004	7.3837
hsa-let-7b-5p	16.9613	15.5496	17.8444	16.9950
hsa-let-7b-3p	7.9248	5.2094	7.6653	6.8201

The first line contains the label Name (miRNA in the example above) followed by the identifiers for each sample in the dataset.

Line format: Name(tab)(sample 1 name)(tab)(sample 2 name)(tab) ... (sample N name)
Example: miRNAName sample_1 sample_2 ... sample_n

The remainder of the file contains data for each of the miRNAs. There is one line for each miRNA. Each line contains the miRNA name and a value for each sample in the dataset.

Input mRNA expression file is organized as follows:

Gene	TCGA-05-4384-01	TCGA-05-4390-01	TCGA-05-4396-01	TCGA-50-5066-01
AARS	10.7094	11.6932	12.4282	11.0464
AASDHPPT	9.9081	9.6716	10.1113	9.98328
AASDH	7.9471	7.2897	8.3216	7.6274
AASS	9.9649	7.7752	9.1723	5.9506
AATF	9.9525	9.5380	9.3670	8.4375

The first line contains the label Name (Gene in the example above) followed by the identifiers for each sample in the dataset.

Line format: Name(tab)(sample 1 name)(tab)(sample 2 name)(tab) ... (sample N name)
Example: GeneName sample_a sample_b ... sample_m

The remainder of the file contains data for each of the mRNAs. There is one line for each mRNA. Each line contains the mRNA name and a value for each sample in the dataset.

Note that the input miRNA/mRNA expression file should be transformed into a non-negative matrix, in order for the main program to execute correctly. Both microarray profiling and RNA sequencing data are accepted as input. To achieve optimal prediction on the sequencing data, we strongly recommend that users provide log2 transformed normalized counts (e.g. RSEM or RPM) as the input for our program.
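The recommendation above can be sketched in R as follows. This is a minimal illustration, not part of miRACLe: the count values are made up, and the pseudocount of 1 is an assumption chosen so that zero counts stay at zero and the result is non-negative.

```r
# Toy normalized counts (hypothetical values): rows are miRNAs, columns are samples.
counts <- matrix(c(0, 3, 7, 1023, 15, 255),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("hsa-let-7a-5p", "hsa-let-7a-3p", "hsa-let-7b-5p"),
                                 c("sample_1", "sample_2")))

# Assumption: a pseudocount of 1 before the log2 transform, so that
# log2(0 + 1) = 0 and every transformed value is non-negative.
logExpr <- log2(counts + 1)

# Sanity checks before the matrix is written out as an input file.
stopifnot(is.numeric(logExpr), !anyNA(logExpr), all(logExpr >= 0))

# Write a tab-delimited file with the name column first, as described above.
out <- data.frame(miRNA = rownames(logExpr), logExpr, check.names = FALSE)
outFile <- tempfile(fileext = ".txt")
write.table(out, outFile, sep = "\t", quote = FALSE, row.names = FALSE)
```

The same checks apply to the mRNA expression file, with Gene as the label of the first column.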

The sample matching file contains two columns that show the correspondence between the sample identifiers in the miRNA expression file and those in the mRNA expression file (miRNA must be the first column and mRNA must be the second column). It also serves as an index that denotes which samples are chosen for analysis. It is organized as follows:

miRNA	Gene
TCGA-50-5066-01A-01T-1627-13	TCGA-50-5066-01
TCGA-05-4384-01A-01T-1754-13	TCGA-05-4384-01
TCGA-05-4390-01A-02T-1754-13	TCGA-05-4390-01
TCGA-05-4396-01A-21H-1857-13	TCGA-05-4396-01

The first line must contain the labels for the sample identifiers in each expression dataset, with the first column for miRNA and the second column for mRNA.

Line format: (sample name in miRNA file)(tab)(sample name in mRNA file)
Example: sample_1 sample_a

The remainder of the file contains sample identifiers used in the miRNA and mRNA expression files. There is one line for each sample. Each line contains the identifiers for that sample.
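For TCGA-style identifiers such as those in the example, the sample matching file can be derived from the barcodes. The sketch below assumes that the first 15 characters of the miRNA barcode equal the mRNA sample identifier, which holds for the example rows shown above but should be verified for other datasets.

```r
# Sample identifiers as they appear in the two expression files.
mirSamples <- c("TCGA-50-5066-01A-01T-1627-13",
                "TCGA-05-4384-01A-01T-1754-13",
                "TCGA-05-4390-01A-02T-1754-13",
                "TCGA-05-4396-01A-21H-1857-13")
tarSamples <- c("TCGA-05-4384-01", "TCGA-05-4390-01",
                "TCGA-05-4396-01", "TCGA-50-5066-01")

# Assumption: the first 15 characters of the miRNA barcode identify the sample.
key  <- substr(mirSamples, 1, 15)
keep <- key %in% tarSamples

# miRNA identifiers go in the first column, mRNA identifiers in the second.
sampleMatch <- data.frame(miRNA = mirSamples[keep],
                          Gene  = key[keep],
                          stringsAsFactors = FALSE)

# Each sample should be paired exactly once.
stopifnot(!anyDuplicated(sampleMatch$miRNA), !anyDuplicated(sampleMatch$Gene))

matchFile <- tempfile(fileext = ".txt")
write.table(sampleMatch, matchFile, sep = "\t", quote = FALSE, row.names = FALSE)
```

Dropping rows from this table is a simple way to restrict the analysis to a subset of samples, since the file also acts as the sample index.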

2.2 Script Execution

miRACLe is written in R and can be downloaded here along with test datasets. The source code of miRACLe consists of three parts, namely, 'FUNCTIONS', 'DATA INPUT' and 'MAIN CODE'. The main function "miracle" in "MAIN PROGRAM" calculates the miRACLe score for each miRNA-mRNA pair at the individual and population levels, and all putative MMIs are ranked by these scores. The essential inputs that the miRACLe algorithm requires consist of two parts:

The first part contains the sequence-based interaction scores (seqScore) for putative miRNA-mRNA pairs. These scores were originally obtained from TargetScan v7.2 (TargetScan7_CWCS_cons and TargetScan7_CWCS), DIANA-microT-CDS (DIANA_microT_CDS), MirTarget v4 (MirTarget4) and miRanda-mirSVR (miRanda_mirSVR), and were compiled by the developers to fit the model. The default is TargetScan7_CWCS_cons. The other scores can be downloaded here.

seqScore = as.matrix(read.table("TargetScan7_CWCS_cons.txt", header = TRUE, sep = "\t"))

Users can also provide their own sequence matching scores, as long as the format of the input file meets the requirements. Specifically, the first line must contain the column labels for mRNAs, miRNAs and their associated interaction scores. The remainder of the file contains RNA identifiers corresponding to those used in the expression files and the score for each miRNA-mRNA pair. Note that the first column must contain the mRNA identifiers, the second column the miRNA identifiers, and the third column the associated scores.
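A custom score file in this three-column layout could be prepared as in the sketch below. The column labels (Gene, miRNA, Score), the pairs and the score values are hypothetical and serve only to show the layout; the score scale and sign convention expected by the model are not specified here.

```r
# Hypothetical miRNA-mRNA pairs and scores, for illustration only.
customScore <- data.frame(
  Gene  = c("AARS", "AASS", "AATF"),                              # column 1: mRNA identifiers
  miRNA = c("hsa-let-7a-5p", "hsa-let-7b-5p", "hsa-let-7a-5p"),   # column 2: miRNA identifiers
  Score = c(0.42, 0.10, 0.87),                                    # column 3: interaction scores
  stringsAsFactors = FALSE
)

# Write the table as tab-delimited text with a header line.
scoreFile <- tempfile(fileext = ".txt")
write.table(customScore, scoreFile, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back the same way the default score file is read.
seqScore <- as.matrix(read.table(scoreFile, header = TRUE, sep = "\t"))
```

Because the table mixes text and numbers, as.matrix returns a character matrix, so the scores need as.numeric before any arithmetic is done on them outside of miRACLe.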

The second part contains paired miRNA-mRNA expression profiles and should be provided by the users.

sampleMatch = as.matrix(read.table("Test_sampleMatch.txt", header = TRUE, sep = "\t"))
mirExpr = as.matrix(read.table("Test_miRNA_expression.txt", header = FALSE, sep = "\t"))
tarExpr = as.matrix(read.table("Test_mRNA_expression.txt", header = FALSE, sep = "\t"))
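Before calling the main function, it can help to confirm that every identifier in the sample matching table occurs in the corresponding expression matrix. The sketch below uses small stand-in matrices with made-up values; checkSamples is a hypothetical helper, not a miRACLe function. It relies on the fact that the expression files are read without a header, so the first row of each matrix holds the sample identifiers.

```r
# Toy stand-ins for the objects read above (all values are hypothetical).
mirExpr <- rbind(c("miRNA", "s1_mir", "s2_mir"),
                 c("hsa-let-7a-5p", "19.01", "16.24"))
tarExpr <- rbind(c("Gene", "s1_rna", "s2_rna"),
                 c("AARS", "10.71", "11.69"))
sampleMatch <- cbind(miRNA = c("s1_mir", "s2_mir"),
                     Gene  = c("s1_rna", "s3_rna"))

# Hypothetical helper: list identifiers in sampleMatch that are absent
# from the first row of each expression matrix.
checkSamples <- function(sampleMatch, mirExpr, tarExpr) {
  mirIds <- mirExpr[1, -1]   # drop the label in the first column
  tarIds <- tarExpr[1, -1]
  list(missingMir = setdiff(sampleMatch[, 1], mirIds),
       missingTar = setdiff(sampleMatch[, 2], tarIds))
}

res <- checkSamples(sampleMatch, mirExpr, tarExpr)
print(res)
```

In this toy input the mRNA identifier s3_rna is reported as missing, which is the kind of mismatch worth resolving before running the prediction.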

The "miracle" function also provides three optional parameters: exprFilter (expression filter; miRNAs/mRNAs that are not expressed in more than a given percentage of samples are removed; default is 1), samSelect (sample selection; users can select a subset of the samples to analyze; by default no selection is applied) and OutputSelect (logical; TRUE returns the top 10 percent of predictions ranked by score, and FALSE returns the whole prediction result; default is TRUE).

miracle(seqScore, sampleMatch, mirExpr, tarExpr) #default
miracle(seqScore, sampleMatch, mirExpr, tarExpr, exprFilter = 1, samSelect, OutputSelect = TRUE) #optional parameters added

We also provide an R package of the algorithm for ease of use.

3. Benchmarking evaluations

The code to reproduce the benchmarking evaluations is written in R.
Generally, the code is arranged into three parts: 'FUNCTIONS', 'INPUT DATA' and 'MAIN CODE'. Users need to download and fill in the relevant input files before running the corresponding analyses.
Files required for the reproduction of the evaluations can be broadly classified into three categories:

Sequence-based predictions (including seqScores for integrative methods)

Data file	Description
TargetScan7_CWCS_cons.txt	cumulative weighted context++ scores for conserved target sites of conserved miRNA families obtained from TargetScan v7.2
TargetScan7_CWCS.txt	cumulative weighted context++ scores for all miRNA-mRNA pairs obtained from TargetScan v7.2
TargetScan7_qMRE_cons.txt	number of conserved target sites of conserved miRNA families obtained from TargetScan v7.2
TargetScan7_qMRE.txt	number of target sites for all miRNA-mRNA pairs obtained from TargetScan v7.2
DIANA_microT_CDS.txt	human interactions with miTG scores greater than 0.7 obtained from DIANA-microT-CDS
miRanda_mirSVR.txt	human conserved miRNA predictions with good mirSVR score obtained from miRanda-mirSVR
miRmap.txt	predictions from miRmap
miRTar2GO.txt	predictions from the “Highly sensitive” prediction set of miRTar2GO
miRTar2GO_HeLa.txt	predictions in HeLa cells from the “Highly sensitive” prediction set of miRTar2GO
MirTarget4.txt	human predictions obtained from miRDB v6.0
miRWalk3.txt	human predictions restricted to 3' UTR obtained from miRWalk v3.0
PITA.txt	the top human predictions with 3/15 flank obtained from PITA
Combine_MMIs.txt	combined predictions from DIANA-microT-CDS, miRanda-mirSVR, MirTarget4, PITA and TargetScan7_CWCS
Symbol_to_ID.txt	paired gene symbols and gene entrez IDs downloaded from HGNC

These predictions are provided in a compressed file Sequence_based_predictions.7z.

Input expression data files (mirExpr & tarExpr)

Data file	Description
HeLa expression data	normalized microarray/RNA-Seq expression data for HeLa cell line
NCI60 data	normalized microarray data for 59 NCI-60 cancer cell lines
TCGA data	log2-transformed RPM/RSEM data for 7991 cancer patients from 32 TCGA cancer types
MCC data	normalized microarray data for 68 tumor tissues and 21 normal tissues

These expression data files are provided along with relevant source codes except that the TCGA expression data files are provided in a compressed file TCGA_data.7z.

Validation data (Reference data)

Experimentally validated MMIs

Data file	Description	validated MMI counts
Vset_HeLa.txt	MMIs that are validated in HeLa cells from TarBase v8.0	34,263
Vset_celllines.txt	MMIs that are validated in cell lines from TarBase v8.0	349,726
Vset_all.txt	validated MMIs obtained from TarBase v8.0	376,205
Vset_hc.txt	high-confidence set compiled from TarBase v8.0, miRTarBase v7.0, miRecords and oncomirDB	10,575
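One simple way to use a validated set is to ask what fraction of the top-ranked predictions it contains. The sketch below illustrates this idea only; it is not the evaluation code shipped with miRACLe. The column names, the toy pairs, the precisionAtK helper and the assumption that a higher score means a stronger prediction are all hypothetical.

```r
# Hypothetical ranked predictions and validated pairs (toy data).
pred <- data.frame(miRNA = c("miR-a", "miR-b", "miR-a", "miR-c"),
                   Gene  = c("G1", "G2", "G3", "G1"),
                   score = c(0.9, 0.7, 0.4, 0.1),
                   stringsAsFactors = FALSE)
validated <- data.frame(miRNA = c("miR-a", "miR-c", "miR-b"),
                        Gene  = c("G1", "G1", "G9"),
                        stringsAsFactors = FALSE)

# Build one key per miRNA-mRNA pair so that pairs can be compared directly.
pairKey <- function(d) paste(d$miRNA, d$Gene, sep = "|")

# Fraction of the k best-scoring predictions found in the validated set.
# Assumption: higher scores indicate stronger predictions.
precisionAtK <- function(pred, validated, k) {
  ranked <- pred[order(pred$score, decreasing = TRUE), ]
  top <- head(ranked, k)
  mean(pairKey(top) %in% pairKey(validated))
}

p1 <- precisionAtK(pred, validated, 1)
p2 <- precisionAtK(pred, validated, 2)
```

With real data, pred would hold the ranked output of a prediction method and validated one of the Vset files listed above, after both are mapped to the same identifier scheme.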

Curated miRNA transfection experiments

Data file	Description
Transet_HeLa_Array.txt	Unified dataset of 5 miRNA transfections in HeLa cell line in which gene expression changes are measured by microarray
Transet_HeLa_Seq.txt	Unified dataset of 25 miRNA transfections in HeLa cell line in which gene expression changes are measured by RNA-Seq
Transet_multi.txt	Unified dataset of 105 non-redundant miRNA transfections that are originally collected from 77 human cell lines or tissues

Known cancer genes

Data file	Description	Molecule counts
Cancer_gene_set	cancer genes obtained from the Cancer Gene Census	723

These reference data files are provided along with relevant source codes.

4. References

miRACLe: improving the prediction of miRNA-mRNA interactions by a random contact model (in preparation)
