Iterative cluster analysis using multi-omics modalities and interpretation with the data translator

This repository contains the source code and results for the iterative clustering of multi-omics data and the interpretation of those clusters with the Biomedical Data Translator project. The project began at the 2022 Bio-IT FAIR Data Hackathon. We use data from the Gabriella Miller Kids First Data Resource Center, which is supported by the NIH Common Fund. This resource contains data from over 11,000 samples, including DNA and RNA as well as clinical information.

In this project, we initially focused on clustering gene expression profiles from RNA-Seq data collected from pediatric tumor samples. We then created a simple and interpretable predictive model to determine the gene expression signatures that differentiate the clusters from one another. To gain additional translational insight into the clusters, we sought to annotate the important genes from each cluster with data from the NCATS Biomedical Data Translator. Our analyses were executed on the Cavatica cloud-based data analysis and sharing platform.
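
As a rough illustration of the annotation step, the sketch below builds a one-hop query in the style of the Translator Reasoner API (TRAPI), asking which diseases are related to a given gene. The README does not state which Translator service, TRAPI version, predicates, or target categories the project used, so the function name `build_trapi_query`, the `biolink:Disease` target, the `biolink:related_to` predicate, and the example gene are all assumptions made for illustration. The code only constructs the JSON payload and does not contact any service.

```python
import json


def build_trapi_query(gene_curie, target_category="biolink:Disease",
                      predicate="biolink:related_to"):
    """Build a one-hop TRAPI-style query graph: gene -> target category.

    Hypothetical helper for illustration; not taken from the project code.
    """
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    # n0 is pinned to a specific gene identifier (CURIE).
                    "n0": {"ids": [gene_curie], "categories": ["biolink:Gene"]},
                    # n1 is left open so the service can fill in matches.
                    "n1": {"categories": [target_category]},
                },
                "edges": {
                    "e0": {"subject": "n0", "object": "n1",
                           "predicates": [predicate]},
                },
            }
        }
    }


# Illustrative gene only (TP53, NCBIGene:7157); not one of the project's results.
query = build_trapi_query("NCBIGene:7157")
print(json.dumps(query, indent=2))
```

In practice, a payload like this would be sent once per important gene to whichever Translator endpoint the analysis targets, and the returned knowledge graph would be summarized per cluster.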

Summary

Our workflow consisted of the following core steps:

  • Wrangle the NICHD Kids First data
    • See the following notebooks
  • Perform unsupervised clustering of gene expression using pvclust, a hierarchical clustering approach that implements a bootstrapping method for assessing the statistical significance of clusters
  • Develop a classification model using xgboost to predict the cluster assignments from the gene expression data. The xgboost model provides feature importance metrics that we use to identify the genes that best differentiate the clusters
  • Annotate results by querying the NCATS Biomedical Data Translator
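
The bootstrapping idea behind the clustering step can be sketched in a few lines. The example below is not pvclust itself and does not reproduce its approximately unbiased (AU) p-values. It only computes a plain bootstrap probability: genes are resampled with replacement, the samples are re-clustered, and each cluster from the original tree is scored by how often it reappears. The correlation distance, average linkage, function names, and synthetic toy data are all assumptions chosen for illustration.

```python
import random
from itertools import combinations


def correlation_distance(x, y):
    """Return 1 - Pearson correlation between two sample profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 1.0  # treat a constant profile as uncorrelated
    return 1.0 - cov / (vx * vy) ** 0.5


def hierarchical_clusters(columns):
    """Average-linkage clustering; return every merged cluster as a frozenset."""
    n = len(columns)
    dist = {frozenset((i, j)): correlation_distance(columns[i], columns[j])
            for i, j in combinations(range(n), 2)}

    def average_linkage(pair):
        a, b = pair
        total = sum(dist[frozenset((i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    active = [frozenset([i]) for i in range(n)]
    formed = set()
    while len(active) > 1:
        # Merge the two closest clusters and record the merged cluster.
        a, b = min(combinations(active, 2), key=average_linkage)
        active = [c for c in active if c != a and c != b] + [a | b]
        formed.add(a | b)
    return formed


def bootstrap_support(matrix, n_boot=100, seed=0):
    """matrix: list of gene rows, each a list of per-sample expression values."""
    rng = random.Random(seed)
    n_genes = len(matrix)

    def columns_of(rows):
        return [list(col) for col in zip(*rows)]

    reference = hierarchical_clusters(columns_of(matrix))
    counts = dict.fromkeys(reference, 0)
    for _ in range(n_boot):
        # Resample genes (rows) with replacement, then re-cluster the samples.
        resampled = [matrix[rng.randrange(n_genes)] for _ in range(n_genes)]
        replicate = hierarchical_clusters(columns_of(resampled))
        for cluster in reference:
            if cluster in replicate:
                counts[cluster] += 1
    return {cluster: hits / n_boot for cluster, hits in counts.items()}


# Synthetic toy data: 30 genes x 4 samples, where samples 0-1 and 2-3 form groups.
rng = random.Random(1)
toy = []
for _ in range(30):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    toy.append([a + rng.gauss(0, 0.1), a + rng.gauss(0, 0.1),
                b + rng.gauss(0, 0.1), b + rng.gauss(0, 0.1)])

support = bootstrap_support(toy)
for cluster, bp in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(cluster), round(bp, 2))
```

Clusters with support close to 1 are stable under resampling, which is the intuition behind using bootstrap-based significance to decide which clusters to carry into the classification step.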

Future directions

Data types

In the future, we hope to integrate additional omics data modalities available through the Kids First Data Resource, such as somatic mutation calls from tumor sequencing, HPO phenotypes, and patient clinical characteristics. We also hope to include additional disease states, such as those covered by the INCLUDE project, which focuses on Down syndrome.

Methodology

In the future we would like to explore the use of feature selection methods, such as recursive feature elimination, to reduce the number of genes required to make cluster predictions. We would then iterate on the clustering process to see if pvclust performs better on the reduced feature set.
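
A minimal sketch of recursive feature elimination follows, under stated assumptions. To stay dependency-free it swaps in a small logistic regression fitted by gradient descent in place of xgboost, uses the absolute weight as the importance score, and handles only two clusters. The function names, hyperparameters, and synthetic data are hypothetical. The loop itself is the technique: fit, drop the least important feature, refit, and repeat until the desired number of features remains.

```python
import math
import random


def fit_logistic(X, y, features, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent on selected features."""
    w = {f: 0.0 for f in features}
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = {f: 0.0 for f in features}
        grad_b = 0.0
        for row, label in zip(X, y):
            z = b + sum(w[f] * row[f] for f in features)
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - label
            for f in features:
                grad_w[f] += err * row[f]
            grad_b += err
        for f in features:
            w[f] -= lr * grad_w[f] / n
        b -= lr * grad_b / n
    return w


def recursive_feature_elimination(X, y, n_keep):
    """Refit after each removal so importances reflect the remaining features."""
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        weights = fit_logistic(X, y, features)
        weakest = min(features, key=lambda f: abs(weights[f]))
        features.remove(weakest)
    return features


# Synthetic data: features 0 and 1 track the cluster label, features 2-5 are noise.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = []
for label in y:
    signal = 1.0 if label else -1.0
    row = [signal + rng.gauss(0, 0.5), signal + rng.gauss(0, 0.5)]
    row += [rng.gauss(0, 1) for _ in range(4)]
    X.append(row)

selected = recursive_feature_elimination(X, y, n_keep=2)
print(sorted(selected))
```

With a real expression matrix, the same loop could wrap the xgboost classifier and its feature importance scores, and the reduced gene set could then be passed back to pvclust to compare cluster stability before and after selection.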

Technical details

Platforms

Dependencies