This project is my graduation thesis, a bioinformatics study of colorectal cancer (CRC). Its main goal is to develop cheaper colorectal cancer classification software based on machine learning classification methods. A second goal is to find new potential survival markers (biomarkers) through survival analysis. More information is available in the report.
1. Preprocessing: This is an R project. It merges multiple NCBI GEO datasets into a single dataset, removes batch effects so that values from different datasets are comparable, and reduces the amount of data with differential gene expression (DGE) analysis.
2. Training & Testing (Classification): This is a Python project. It uses the merged dataset to train and test each model. It finds the best parameters for each classifier with grid search and reports several performance metrics.
3. Survival Analysis: This is an R project. It uses overall survival (OS) data to run a survival fit and finds genes that are potential survival markers.
- Set the working directory as desired.
- Set the `gset_names` list with the proper GEO series (dataset) names.
- For each dataset, set the name of the column that contains the MSS/MSI values in the `mss_colnames` list. (Use the same order as the dataset names.)
- Set `exprs_file_name` and `mdata_file_name` as desired. (Must be .csv files.)
- Run all of the code. (RStudio is recommended, but optional.)
- Set the working directory to the input directory.
- Set the `exprs_file_name` and `mdata_file_name` input file names, which will be the output files of `1_get_and_merge_datasets_vX.R`. (Must be .csv files.)
- Set the `corrected_exprs_file_name`, `pca_before_file_name` and `pca_after_file_name` output file names as desired. (`corrected_exprs_file_name` must be a .csv file; `pca_before_file_name` and `pca_after_file_name` must be .svg files.)
- Run all of the code. (RStudio is recommended, but optional.)
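To illustrate what batch-effect removal aims for, here is a minimal Python sketch. It is not the R script's actual method, which this README does not name. It assumes a samples-by-genes matrix and uses per-batch mean centering as a deliberately simple stand-in. The function name `center_batches` and the dataset labels are hypothetical.

```python
import numpy as np


def center_batches(exprs, batches):
    """Subtract each batch's per-gene mean, then restore the global per-gene mean.

    exprs   : array of shape (n_samples, n_genes)
    batches : sequence of length n_samples with one batch label per sample
    """
    exprs = np.asarray(exprs, dtype=float)
    batches = np.asarray(batches)
    corrected = exprs.copy()
    global_mean = exprs.mean(axis=0)
    for label in np.unique(batches):
        mask = batches == label
        # Remove the batch-specific offset of every gene.
        corrected[mask] -= exprs[mask].mean(axis=0)
    return corrected + global_mean


# Toy data: two datasets measuring the same 3 genes, the second shifted upward.
rng = np.random.default_rng(0)
batch_a = rng.normal(loc=8.0, scale=1.0, size=(4, 3))
batch_b = rng.normal(loc=13.0, scale=1.0, size=(4, 3))
exprs = np.vstack([batch_a, batch_b])
batches = ["GSE_A"] * 4 + ["GSE_B"] * 4  # hypothetical dataset labels

corrected = center_batches(exprs, batches)
```

After correction both toy batches share the same per-gene mean, which is what a before/after PCA comparison is meant to show: samples should no longer cluster by source dataset. Plain centering can also erase real biological differences when classes are unevenly spread across batches, so treat this only as an illustration of the idea.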
- Set the working directory to the input directory.
- Set the `exprs_file_name` and `mdata_file_name` input file names, which will be the output files of `2_remove_batch_effect_vX.R`. (Must be .csv files.)
- Set the `deg_exprs_file_name` and `up_and_down_table_file_name` output file names as desired. (Must be .csv files.)
- Run all of the code. (RStudio is recommended, but optional.)
- Changing the input configuration is not recommended. Necessary folders must be created if they do not exist.
- Copy the merged metadata and expression datasets to the input path.
- Set the shuffle split configuration as desired.
- Set the grid search configuration as desired.
- Set the desired classifiers with their parameters. Set `param_grid` to `{}` for default parameters.
- (Optional but recommended) Create a virtual environment and activate it (for Linux and macOS):

      pip install virtualenv
      cd /path/to/2_training_and_testing
      python -m venv venv
      source venv/bin/activate

- Install dependencies from requirements.txt:

      pip install -r requirements.txt

- Run the code:

      python main_vX.py
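The shuffle split, grid search and `param_grid` settings described above can be sketched as follows. This is a minimal example that assumes scikit-learn, which the README does not name explicitly. The synthetic data, the two classifiers and all parameter values are illustrative and are not taken from `main_vX.py`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the merged expression matrix and class labels.
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# Shuffle split configuration (illustrative values).
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

# Each classifier is paired with its param_grid; {} keeps the default parameters.
classifiers = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "logreg": (LogisticRegression(max_iter=1000), {}),
}

results = {}
for name, (estimator, param_grid) in classifiers.items():
    # Grid search evaluates every parameter combination on each shuffle split.
    search = GridSearchCV(estimator, param_grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

With an empty `param_grid`, the search evaluates a single candidate (the estimator's defaults), so the same loop handles tuned and untuned classifiers alike.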
- Set the working directory to the input directory.
- Set `gset_name` to the GEO dataset name, which must include overall survival data.
- Set the `exprs_file_name`, `mdata_file_name` and `up_and_down_table_file_name` input file names. (Must be .csv files.)
- Set the `survival_mdata_file_name`, `survival_exprs_file_name`, `survival_up_and_down_table_file_name` and `survival_p_values_file_name` output file names as desired. (Must be .csv files.)
- For the `gset_name` dataset, set the column names `os_time_column` and `os_event_column`, which contain the overall survival time and event values.
- Run all of the code. (RStudio is recommended, but optional.)
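To show how the overall survival time and event columns feed a survival fit, here is a minimal pure-Python sketch of the Kaplan-Meier estimator. Whether the R script computes exactly this is an assumption. The function name `kaplan_meier` and the toy cohort are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : overall survival time per patient
    events : 1 if the event (death) was observed, 0 if the patient was censored
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all patients that share the same time point.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            # Multiply by the fraction of at-risk patients surviving this time.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        at_risk -= removed
    return curve


# Toy cohort: times in months, event 1 = death observed, 0 = censored.
os_time = [5, 8, 8, 12, 15, 20]
os_event = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(os_time, os_event)
```

One common way to screen a gene as a potential survival marker is to split patients into high and low expression groups, fit one curve per group, and compare the curves with a statistical test such as the log-rank test. That comparison is not shown here.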