Benchmarking GRN inference methods

The full documentation is hosted on ReadTheDocs.
Repository: openproblems-bio/task_grn_inference
Gene regulatory networks (GRNs) are essential for understanding cellular identity and behavior. They are simplified models of gene expression, which is regulated by complex processes involving multiple layers of control, from transcription to post-transcriptional modifications, and incorporating various regulatory elements and non-coding RNAs. Gene transcription is controlled by a regulatory complex that includes transcription factors (TFs), cis-regulatory elements (CREs) such as promoters and enhancers, and essential co-factors. High-throughput datasets, covering thousands of genes, facilitate the use of machine learning approaches to decipher GRNs. The advent of single-cell sequencing technologies, such as scRNA-seq, has made it possible to infer GRNs from a single experiment because of the abundance of samples. This allows researchers to infer condition-specific GRNs, such as for different cell types or diseases, and to study potential regulatory factors associated with these conditions. Combining chromatin accessibility data with gene expression measurements has led to the development of enhancer-driven GRN (eGRN) inference pipelines, which offer significantly improved accuracy over single-modality methods.
Here, we present geneRNIB as a living benchmark platform for GRN inference. This platform provides curated datasets for GRN inference and evaluation, standardized evaluation protocols and metrics, computational infrastructure, and a dynamically updated leaderboard to track state-of-the-art methods. It runs novel GRNs in the cloud, offers competition scores, and stores them for future comparisons, reflecting new developments over time.
The platform supports the integration of new datasets and protocols. When a new feature is added, previously evaluated GRNs are re-assessed, and the leaderboard is updated accordingly. The aim is to evaluate both the accuracy and completeness of inferred GRNs. It is designed for both single-modality and multi-omics GRN inference. Ultimately, it is a community-driven platform.
So far, ten GRN inference methods have been integrated: five single-omics methods (GRNBoost2, GENIE3, Portia, PPCOR, and Scenic) and five eGRN inference methods (Scenic+, CellOracle, FigR, scGLUE, and GRaNIE).
Due to its flexible nature, the platform can incorporate various benchmark datasets and evaluation methods, using either prior knowledge or feature-based approaches. In the current version, because no standardized prior knowledge is available, we benchmark GRNs indirectly. Using interventional data as evaluation datasets, we have developed eight metrics based on a feature-based approach and on Wasserstein distance, accounting for both accuracy and comprehensiveness.
Five datasets have been integrated so far: OPSCA, Nakatake, Norman, Adamson, and Replogle. For each dataset, standardized inference datasets are provided for GRN inference, and evaluation datasets are used for benchmarking. See our publication for details of the methods.
name | roles |
---|---|
Jalil Nourisa | author |
Robrecht Cannoodt | author |
Antoine Passimier | contributor |
Marco Stock | contributor |
Christian Arnold | contributor |
flowchart TB
file_atac_h5ad("<a href='https://github.com/openproblems-bio/task_grn_inference#file-format-NA'>NA</a>")
comp_method[/"<a href='https://github.com/openproblems-bio/task_grn_inference#component-type-method'>method</a>"/]
file_prediction("<a href='https://github.com/openproblems-bio/task_grn_inference#file-format-NA'>NA</a>")
comp_metric_regression[/"<a href='https://github.com/openproblems-bio/task_grn_inference#component-type-metric-regression'>metric_regression</a>"/]
comp_metric_ws[/"<a href='https://github.com/openproblems-bio/task_grn_inference#component-type-ws-distance'>ws_distance</a>"/]
comp_metric[/"<a href='https://github.com/openproblems-bio/task_grn_inference#component-type-metrics'>metrics</a>"/]
file_score("<a href='https://github.com/openproblems-bio/task_grn_inference#file-format-NA'>NA</a>")
file_evaluation_h5ad("<a href='https://github.com/openproblems-bio/task_grn_inference#file-format-NA'>NA</a>")
file_rna_h5ad("<a href='https://github.com/openproblems-bio/task_grn_inference#file-format-multiomics-rna'>multiomics rna</a>")
comp_method_r[/"<a href='https://github.com/openproblems-bio/task_grn_inference#component-type-method-r'>Method r</a>"/]
file_atac_h5ad-.-comp_method
comp_method-.->file_prediction
file_prediction---comp_metric_regression
file_prediction---comp_metric_ws
file_prediction---comp_metric
comp_metric_regression-->file_score
comp_metric_ws-->file_score
comp_metric-->file_score
file_evaluation_h5ad-.-comp_metric_regression
file_rna_h5ad---comp_method
comp_method_r-.->file_prediction
NA
Example file: resources_test/inference_datasets/op_atac.h5ad
A GRN inference method
Arguments:
Name | Type | Description |
---|---|---|
--rna | file | RNA expression for multiomics data. |
--atac | file | (Optional) Peak data for multiomics data. |
--prediction | file | (Optional, Output) GRN prediction. |
--tf_all | file | (Optional) NA. |
--max_n_links | integer | (Optional) NA. Default: 50000. |
--num_workers | integer | (Optional) NA. Default: 4. |
--temp_dir | string | (Optional) NA. Default: output/temdir. |
--seed | integer | (Optional) NA. Default: 32. |
--causal | boolean | (Optional) NA. Default: TRUE. |
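
The `--max_n_links` argument is not described above. One plausible reading is that it caps the number of edges kept in a predicted GRN. The sketch below illustrates that idea only; it is not the component's actual code. The `(source, target, weight)` tuple layout, the ranking by absolute weight, and the name `keep_top_links` are all assumptions made for illustration.

```python
def keep_top_links(edges, max_n_links=50000):
    """Keep at most `max_n_links` edges, ranked by absolute weight.

    `edges` is a list of (source, target, weight) tuples. This layout is
    hypothetical and is not taken from the prediction file format.
    """
    ranked = sorted(edges, key=lambda edge: abs(edge[2]), reverse=True)
    return ranked[:max_n_links]


# Toy network with made-up TF and gene names.
edges = [
    ("TF_A", "gene_1", 0.90),
    ("TF_A", "gene_2", -0.95),
    ("TF_B", "gene_1", 0.10),
]
top = keep_top_links(edges, max_n_links=2)
```

Ranking by absolute value keeps strong repressive edges (negative weights) alongside strong activating ones; a method that stores unsigned importances would rank by the raw weight instead.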
NA
Example file: resources_test/grn_models/op/collectri.csv
Calculates regression scores
Arguments:
Name | Type | Description |
---|---|---|
--prediction | file | GRN prediction. |
--score | file | (Output) File indicating the score of a metric. |
--method_id | string | (Optional) NA. |
--layer | string | (Optional) NA. Default: X_norm. |
--max_n_links | integer | (Optional) NA. Default: 50000. |
--verbose | integer | (Optional) NA. Default: 2. |
--dataset_id | string | (Optional) NA. Default: op. |
--evaluation_data | file | (Optional) Perturbation dataset for benchmarking. |
--tf_all | file | NA. |
--reg_type | string | (Optional) NA. Default: ridge. |
--subsample | integer | (Optional) NA. Default: -1. |
--num_workers | integer | (Optional) NA. Default: 4. |
--apply_tf | boolean | (Optional) NA. Default: TRUE. |
--apply_skeleton | boolean | (Optional) NA. Default: FALSE. |
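
To give a feel for what a regression-based score can measure, the sketch below fits a ridge model that predicts one target gene from the expression of its putative regulators and reports the in-sample R². It is a simplified illustration under stated assumptions, not the metric implemented by this component: the function name `ridge_r2`, the penalty `alpha`, and the synthetic data are all hypothetical, and a real evaluation would normally score held-out samples.

```python
import numpy as np


def ridge_r2(tf_expr, target_expr, alpha=1.0):
    """In-sample R^2 of a ridge fit of one target gene on its regulators.

    tf_expr: (n_samples, n_regulators) array; target_expr: (n_samples,) array.
    """
    # Center both sides so that no intercept term is needed.
    X = tf_expr - tf_expr.mean(axis=0)
    y = target_expr - target_expr.mean()
    # Closed-form ridge solution: (X^T X + alpha * I) w = X^T y.
    weights = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
    residual = y - X @ weights
    return 1.0 - (residual @ residual) / (y @ y)


# Synthetic example: the target depends on two of three regulators.
rng = np.random.default_rng(32)
tf_expr = rng.normal(size=(200, 3))
target_expr = tf_expr @ np.array([1.5, -2.0, 0.0]) + 0.1 * rng.normal(size=200)

r2_true = ridge_r2(tf_expr, target_expr)
# Shuffling the samples of the regulators breaks the link to the target.
r2_shuffled = ridge_r2(rng.permutation(tf_expr), target_expr)
```

The intuition is that regulators proposed by a good GRN should explain more of the target's variation than unrelated features do, which is why the shuffled control scores far lower here.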
Calculates Wasserstein distance for a given GRN and dataset
Arguments:
Name | Type | Description |
---|---|---|
--prediction | file | GRN prediction. |
--score | file | (Output) File indicating the score of a metric. |
--method_id | string | (Optional) NA. |
--layer | string | (Optional) NA. Default: X_norm. |
--max_n_links | integer | (Optional) NA. Default: 50000. |
--verbose | integer | (Optional) NA. Default: 2. |
--dataset_id | string | (Optional) NA. Default: op. |
--ws_consensus | file | NA. |
--ws_distance_background | file | NA. |
--evaluation_data_sc | file | NA. |
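
For readers unfamiliar with the Wasserstein distance, the following sketch computes the one-dimensional (first-order) distance between two empirical samples, for example the expression of a target gene in control cells and in cells where a regulator was perturbed. It shows the distance itself under that assumed use and does not reproduce this component's pipeline; the function name and the toy values are hypothetical.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """First-order Wasserstein distance between two 1-D empirical samples.

    Computed as the area between the two empirical CDFs.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    all_values = sorted(u_sorted + v_sorted)
    total = 0.0
    for left, right in zip(all_values, all_values[1:]):
        # Empirical CDF of each sample on the interval [left, right).
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


# Toy expression values for one target gene (made-up numbers).
control = [0.0, 1.0, 3.0]
perturbed = [5.0, 6.0, 8.0]
distance = wasserstein_1d(control, perturbed)
```

A larger distance means the perturbation shifted the target's expression distribution more strongly. Identical samples give a distance of zero.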
A metric to evaluate the performance of the inferred GRN
Arguments:
Name | Type | Description |
---|---|---|
--prediction | file | GRN prediction. |
--score | file | (Output) File indicating the score of a metric. |
--method_id | string | (Optional) NA. |
--layer | string | (Optional) NA. Default: X_norm. |
--max_n_links | integer | (Optional) NA. Default: 50000. |
--verbose | integer | (Optional) NA. Default: 2. |
--dataset_id | string | (Optional) NA. Default: op. |
NA
Example file: resources_test/scores/score.h5ad
NA
Example file: resources_test/evaluation_datasets/op_perturbation.h5ad
RNA expression for multiomics data.
Example file: resources_test/inference_datasets/op_rna.h5ad
Format:
AnnData object
obs: 'cell_type', 'donor_id'
layers: 'counts', 'X_norm'
Data structure:
Slot | Type | Description |
---|---|---|
obs["cell_type"] | string | (Optional) The annotated cell type of each cell based on RNA expression. |
obs["donor_id"] | string | (Optional) Donor id. |
layers["counts"] | double | (Optional) Counts matrix. |
layers["X_norm"] | double | Normalized values. |
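
The slot table above can be turned into a quick pre-flight check before running a method. The sketch below is one possible way to do that, not part of the task's tooling: `check_rna_slots` is a hypothetical helper, and the example uses a plain stand-in object so that it runs without extra dependencies. With the `anndata` package installed, the object returned by `anndata.read_h5ad(...)` should work the same way, since membership tests on `obs` and `layers` check column and layer names.

```python
from types import SimpleNamespace

# Slot names taken from the table above.
REQUIRED_LAYERS = ("X_norm",)
OPTIONAL_OBS = ("cell_type", "donor_id")
OPTIONAL_LAYERS = ("counts",)


def check_rna_slots(adata):
    """Return (missing required slots, missing optional slots) as lists."""
    missing_required = [
        f'layers["{name}"]' for name in REQUIRED_LAYERS if name not in adata.layers
    ]
    missing_optional = [
        f'obs["{name}"]' for name in OPTIONAL_OBS if name not in adata.obs
    ]
    missing_optional += [
        f'layers["{name}"]' for name in OPTIONAL_LAYERS if name not in adata.layers
    ]
    return missing_required, missing_optional


# Stand-in for an AnnData object with two cells and two genes.
mock_adata = SimpleNamespace(
    obs={"cell_type": ["T cell", "B cell"]},
    layers={"X_norm": [[0.1, 0.2], [0.3, 0.4]]},
)
missing_required, missing_optional = check_rna_slots(mock_adata)
```

Here nothing required is missing, while `obs["donor_id"]` and `layers["counts"]` are reported as absent optional slots.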
A GRN inference method
Arguments:
Name | Type | Description |
---|---|---|
--rna_r | file | (Optional) NA. |
--atac_r | file | (Optional) NA. |
--prediction | file | (Optional, Output) GRN prediction. |
--temp_dir | string | (Optional) NA. Default: output/temdir. |
--num_workers | integer | (Optional) NA. Default: 4. |