TF-DFE: Topology-Guided Multi-View Ensemble Learning for Evidence-Aware Somatic Variant Pathogenicity Classification in Cancer Genomics

Paper: Topology-Guided Multi-View Ensemble Learning for Evidence-Aware Somatic Variant Pathogenicity Classification in Cancer Genomics
Authors: Asif Ahamed, Md. Tanvir Hasan, Most. Alisa Tabassum, Ahammad Hossain*, Md. Kamruzzaman, Jayanta Das, A.H.M. Rahmatullah Imon
Corresponding Author: Ahammad Hossain — ahammadstatru@gmail.com

Overview

TF-DFE (Topo-Fractal Dynamic Fuzzy Ensemble) is a scalable, evidence-aware framework for classifying somatic single nucleotide variants (SNVs) as pathogenic drivers or benign passengers. The framework combines:

Standard biological predictors - REVEL, CADD, SIFT, PolyPhen2, GERP++, phyloP, structural/UniProt features
Fractal sequence features - Frequency Chaos Game Representation (FCGR) at k=3 and k=4 (320 dimensions)
Topological features - Persistent homology via giotto-tda (6 dimensions)
Dynamic ensemble selection - Enhanced KNORA-Eliminate strategy

Final dataset: 207,266 balanced somatic SNVs curated from dbNSFP v5.3a (GRCh37)
Performance: MCC = 0.8702 | AUROC = 0.9666 | AUPRC = 0.9768 | Accuracy = 93.48%

Repository Structure

TF-DFE/
├── README.md
├── requirements.txt
├── scripts/
│   ├── 01_somatic_variant_processor.py       # dbNSFP parsing, CIViC/COSMIC rescue, labeling
│   ├── 02_remove_missing_values.py           # Selective missing value removal
│   ├── 03_remove_duplicates.py               # Deduplication on chr/pos/ref/alt
│   ├── 04_structural_feature_engineering.py  # UniProt + AlphaFold2 + SASA features
│   ├── 05_remove_leakage_columns.py          # Leakage column removal
│   ├── 06_dataset_balancing.py               # Downsample to balanced 207,266 variants
│   ├── 07_eda.py                             # Exploratory data analysis + figures
│   └── 08_tf_dfe_model.py                    # Main TF-DFE model, evaluation, SHAP
└── data/
    └── Final_Dataset.csv                     # Balanced dataset (207,266 variants, 24 features)

Pipeline

Run the scripts in order. Each script outputs a CSV that is used as input to the next step.

dbNSFP5.3a_grch37.gz
        │
        ▼
01_somatic_variant_processor.py  →  somatic_variant_dbNSFP.csv  (19,613,687 variants)
        │
        ▼
02_remove_missing_values.py      →  somatic_variant_dbNSFP_Removes_missing_values.csv  (9,241,413)
        │
        ▼
05_remove_leakage_columns.py     →  somatic_variant_dbNSFP_Removes_leakage.csv
        │
        ▼
03_remove_duplicates.py          →  somatic_variant_dbNSFP_Removes_leakage_missing_values_Deduplication.csv  (6,461,982)
        │
        ▼
04_structural_feature_engineering.py  →  Final_Dataset_UniProt_Removes_leakage.csv  (adds UniProt/AlphaFold2 features)
        │
        ▼
06_dataset_balancing.py          →  data/Final_Dataset.csv  (207,266 balanced variants)
        │
        ▼
07_eda.py                        →  EDA_Results/  (figures and statistics)
        │
        ▼
08_tf_dfe_model.py               →  results_publication/  (model results, figures, SHAP)

External Data Requirements

These large/licensed files must be downloaded separately before running the pipeline.

File	Source	Used In
`dbNSFP5.3a_grch37.gz`	dbNSFP	Script 01
`01-Jan-2026-VariantSummaries.tsv`	CIViC Releases	Script 01
`Cosmic_CancerGeneCensus_v102_GRCh37.tsv`	COSMIC	Script 01
`cmc_export.tsv`	COSMIC Mutant Census	Script 01
`oncokb_biomarker_drug_associations.tsv`	OncoKB	Script 01
`uniprot_sprot_human.dat`	UniProt	Script 04
AlphaFold2 PDB files (`AF--model_v.pdb`)	AlphaFold DB	Script 04

Note: Place all external files in the same directory as the scripts before running, or update the file paths inside each script accordingly.

Installation

git clone https://github.com/asifahamed11/TF-DFE.git
cd TF-DFE
pip install -r requirements.txt

Python 3.8 or higher is required.

Quick Start (Skip to Model Training)

If you only want to reproduce the model results using the pre-processed dataset:

# Dataset is already available in data/Final_Dataset.csv
# Update DATA_PATH in script 08 if needed, then run:
python scripts/08_tf_dfe_model.py

Results will be saved to results_publication/.

Dataset

The final balanced dataset (data/Final_Dataset.csv) contains 207,266 somatic SNVs (103,633 pathogenic / 103,633 benign) with the following features:

Feature Group	Features	Dimensions
Genomic coordinates	chr, pos, ref, alt	4
Pathogenicity scores	REVEL, CADD, SIFT, PolyPhen2, GERP++, phyloP	6
Cancer gene annotations	IS_CANCER_GENE, IS_TIER1, IS_ONCOGENE, IS_TSG	4
Structural features	SASA, RELATIVE_SASA, PLDDT_SCORE	3
UniProt functional sites	IS_IN_DOMAIN, IS_ACTIVE_SITE, IS_BINDING_SITE, IS_TRANSMEMBRANE, DISTANCE_TO_ACTIVE_SITE	5
Label	LABEL_PATHOGENIC (0=benign, 1=pathogenic)	1

FCGR (320-dim) and TDA (6-dim) features are computed on-the-fly inside 08_tf_dfe_model.py.

Reproducing Paper Figures

All figures in the manuscript are generated by 08_tf_dfe_model.py and saved to results_publication/ as both .png and .tiff.

EDA figures (Fig 4a, 4b, 4c, 5, 5b) are generated by 07_eda.py and saved to EDA_Results/.

License

This project is licensed under the MIT License. See LICENSE for details.

Contact

Ahammad Hossain (Corresponding Author)
Department of Computer Science & Engineering, Varendra University, Bangladesh
Email: ahammadstatru@gmail.com
ORCID: 0000-0001-9081-5905

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TF-DFE: Topology-Guided Multi-View Ensemble Learning for Evidence-Aware Somatic Variant Pathogenicity Classification in Cancer Genomics

Overview

Repository Structure

Pipeline

External Data Requirements

Installation

Quick Start (Skip to Model Training)

Dataset

Reproducing Paper Figures

License

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
data		data
scripts		scripts
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

TF-DFE: Topology-Guided Multi-View Ensemble Learning for Evidence-Aware Somatic Variant Pathogenicity Classification in Cancer Genomics

Overview

Repository Structure

Pipeline

External Data Requirements

Installation

Quick Start (Skip to Model Training)

Dataset

Reproducing Paper Figures

License

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages