This repository contains the implementation of the Combinatorial and Semantic Analysis of Functional Elements (CombSAFE) presented in: Leone M, Galeota E, Masseroli M, Pellizzola M. "Identification, semantic annotation and comparison of combinations of functional elements in multiple biological conditions", 2021
. It is a flexible computational method to identify combinations of static and dynamic functional elements genome-wide, or in a specific genomic region, and how they change across semantically annotated biological conditions.
Given as input a set of ChIP-seq dataset samples and the list of functional elements to be considered, the CombSAFE
pipeline allows:
- considering histone modifications, transcription factors and any other type of dynamic or static genomic features (e.g., CpG islands, partially methylated domains, transposable elements, etc.)
- relying on public repositories to retrieve the considered static functional elements and identify from the input dataset metadata the biological conditions in which the dynamic functional elements of interest were charted
- leveraging natural language processing techniques and biomedical ontologies to complement the identified conditions with semantic annotations about tissue and disease types
- identifying combinations of static and dynamic functional elements throughout the genome in the corresponding omics data, using hidden Markov models
- focusing on specific genomic regions, applying clustering to explore how significant combinations of the functional elements compare among semantically annoted cell types and desease/healthy conditions
- performing functional enrichment analyses based on the genes found in genomic regions with similar combinations of functional elements.
CombSAFE/
|-- README.md
|-- LICENSE
|-- .gitignore
|-- notebook/
| |-- Functional_states_analysis.ipynb
|-- gene_list/
| |-- MYC_associated.txt
| |-- test_list.txt
| |-- tumor_suppressor.txt
|-- CombSAFE/
| |-- CombSAFE.py
|-- CombSAFE.yml
README.md
this fileLICENSE
MIT license file.gitignore
standard .gitignore file for Python projectsnotebook/Functional_states_analysis.ipynb
Python notebook to run a functional state analysisgene_list/
folder where gene name lists are stored for the CombSAFE single gene analysisgene_list/test_list.txt
list of random genesgene_list/tumor_suppressor.txt
list of tumor suppressor genesgene_list/MYC_associated.txt
list of MYC interacting genesCombSAFE/CombSAFE.py
core Python routines called from within the notebook to perform the functional state analysisCombSAFE.yml
dependency yaml file to load in order to perform the CombSAFE analysis
In order to run the CombSAFE
pipeline, please follow the steps below:
- Install the Anaconda package and environment manager from here
- Load the CombSAFE environment with the command:
conda env create -f CombSAFE.yml
- Activate the CombSAFE environment with the command:
conda activate CombSAFE
ON Linux and macOS. On Windows systems digitactivate CombSAFE
- Run the
notebook/Functional_states_analysis.ipynb
NB: The pyGMQL
package additionally requires Java. Please follow the installation procedure here.
NB2: The PyEnsembl
package additionally requires Ensembl data. Please follow the installation procedure here.
NB3: For Windows users, Visual Studio v.14 or higher is required. Please follow the installation procedure here
In the following, we show how to call the functions implemented to easily perform the different steps of our CombSAFE
computational method, providing example resuls for some of them.
combsafe.create_dataset(path, organism, threads=4, from_GEO=False)
Run a ChIP-seq peak calling pipeline from input raw data.
For single-end reads Input files must be structured as follows:
Input_folder/
|-- Raw_Reads/
| |-- 1.rawreads.fastq
| |-- 2.rawreads.fastq
| |-- 3.rawreads.fastq
| |-- 4.rawreads.fastq
| |-- 5.rawreads.fastq
| |-- 6.rawreads.fastq
| |-- 7.rawreads.fastq
| |-- ...
|-- Textual_file.txt
Raw_Reads
a folder containing raw reads in fastq formatTextual_file.txt
a text file containing the following information:file
, filename of the corresponding raw reads file in the Raw_Reads folderfactor
, transcription factor or histone mark used for the analysisdescription
, all available iformations of the biological source from which to extract terms for semantic annotations.
E.g.:
file | factor | description |
---|---|---|
1.rawreads.fastq | CTCF | low passage primary melanoma cultures |
2.rawreads.fastq | H3K4me3 | Bone marrow mononuclear cells |
3.rawreads.fastq | MYC | human primary monocytes isolated from PBMC of healthy donors |
... | ... | ... |
For paired-end reads Input files must be structured as follows:
Input_folder/
|-- Raw_Reads/
| |-- 1.forward_reads.fastq
| |-- 1.reverse_reads.fastq
| |-- 2.forward_reads.fastq
| |-- 2.reverse_reads.fastq
| |-- 3.forward_reads.fastq
| |-- 3.reverse_reads.fastq
| |-- ...
|-- Textual_file.txt
Raw_Reads
a folder containing raw reads in fastq formatTextual_file.txt
a text file containing the following information:file_1
, filename of the corresponding forward raw reads file in the Raw_Reads folderfile_2
, filename of the corresponding reverse raw reads file in the Raw_Reads folderfactor
, transcription factor or histone mark used for the analysisdescription
, all available informations of the biological source from which to extract terms for semantic annotations.
E.g.:
file_1 | file_2 | factor | description |
---|---|---|---|
1.forward_reads.fastq | 1.reverse_reads.fastq | H3K4me1 | Human embryonic stem cells received from the John Doe laboratory |
2.forward_reads.fastq | 2.reverse_reads.fastq | H3K4me3 | Nuclei derived from crude preps of adipose tissue |
3.forward_reads.fastq | 3.reverse_reads.fastq | H3K27me3 | Monocyte-derived macrophage |
... | ... | ... | ... |
If you want to start a functional state analysis on GEO experiments, set the from_GEO
label as True. In that scenario, input files must be structured as follows:
Input_folder/
|-- Textual_file.txt
Textual_file.txt
a text file containing the following information:sample_id
, Id of the samples on GEOfactor
, transcription factor or histone mark used for the analysis
E.g.,
sample_id | factor |
---|---|
GSM648494 | H3K4me1 |
GSM648495 | H3K4me3 |
GSM575280 | H3K27me3 |
... | ... |
Parameters:
- path: path object or file-like object
- input path folder
- organism: string
- reference genome assembly (e.g., "hg19", "hg38", "mm10", "mm39", "rn7", "danrer11", "dm6", "ce11", etc...)
- threads: int, default 4
- number of threads for the ChIp-seq pipeline.
- from_GEO: bool, default False
- if True, CombSAFE downloads raw reads and metadata from the GEO web page of the input GEO GSM Ids
Example:
>> dataset = create_dataset("./Input_folder/", assembly="hg38", threads=20, from_GEO=False)
combsafe.load_dataset(path, assembly, from_GEO=False)
Load the input path for the analysis. Skip this step if you have already generated the daatset.
Input files must be structured as follows:
Input_folder/
|-- Chip_Files/
| |-- 1.narrowPeak.bed
| |-- 2.broadPeak.bed
| |-- 3.narrowPeak.bed
| |-- 4.broadPeak.bed
| |-- 5.narrowPeak.bed
| |-- 6.broadPeak.bed
| |-- 7.narrowPeak.bed
| |-- ...
|-- Textual_file.txt
Chip_Files
a folder containing ChIP-Seq filesTextual_file.txt
a text file containing the following information:sample_id
, univolcal id for each samplefactor
, Transcription Factor or Histone Mark used for the analysisfile
, filename of the corresponding ChIp-seq file in the Chip_Files folderdescription
, all available informations of the biological source from which to extract terms for semantic annotations.
E.g.:
sample_id | factor | file | description |
---|---|---|---|
1 | CTCF | 1.narrowPeak.bed | low passage primary melanoma cultures |
2 | H3K4me3 | 2.narrowPeak.bed | Bone marrow mononuclear cells |
3 | MYC | 3.narrowPeak.bed | human primary monocytes isolated from PBMC of healthy donors |
... | ... | ... |
If your dataset is generate from GEO samples and you want to get the description from the GSM GEO webpage, set the from_GEO
label as True. In that scenario, Textual_file.txt must be structured as follows:
Textual_file.txt
a text file containing the following information:sample_id
, Id of the samples on GEOfactor
, Transcription Factor or Histone Mark used for the analysisfile
, filename of the corresponding ChIp-seq file in the Chip_Files folder
sample_id | factor | file |
---|---|---|
GSM648494 | H3K4me1 | 1.narrowPeak.bed |
GSM648495 | H3K4me3 | 2.broadPeak.bed |
GSM575280 | H3K27me3 | 3.narrowPeak.bed |
... | ... | ... |
Parameters:
- path: path object or file-like object
- input path folder
- organism: string
- reference genome assembly (e.g., "hg19", "hg38", "mm10", "mm39", "rn7", "danrer11", "dm6", "ce11", etc...)
- from_GEO: bool, default False
- if True, CombSAFE downloads raw reads and metadata from the GEO web page of the input GEO GSM Ids
Example:
>> input_path = import_path("./Input_folder/", assembly="hg38", from_GEO=True)
combsafe.generate_semantic_annotations(dataset, ontology_1, ontology_2, disease = False, encode_convert=False)
Generate semantic annotations about tissue and disease types from the input dataset.
Parameters:
- dataset dataset object
- imported dataset object
- ontology_1: str
- url address of the first selected ontology
- ontology_2: str
- url address of the second selected ontology
- disease bool, default False
- set True if one of the selected ontologies is involved in disease concepts
- encode_convert bool, default False
- if true, encode IDs are searched to be converted to GSM
Returns:
- semantic_dataframe
- dataset of biological conditions from the input metadata
Example:
>> ontology_1 = "https://raw.githubusercontent.com/obophenotype/cell-ontology/master/cl-basic.obo"
>> ontology_2 = "https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/main/src/ontology/doid.obo"
>> semantic_df = generate_semantic_annotations(dataset, ontology_1, ontology_2, disease = True, encode_convert=False)
combsafe.plot_factor_freq(semantic_dataframe, n)
Vertical barplot of the factor frequency in the input dataset.
Parameters:
- semantic_dataframe: dataframe
- dataset of biological conditions from the input metadata
- n: int
- number of factors to display in the barplot
Example:
>> plot_factor_freq(semantic_df, 30)
combsafe.generate_fixed_factor_pool(semantic_dataframe, factor_list, number_of_semantic_annotations)
Table containg lists of factors according to the selected parameters.
Parameters:
- semantic_dataframe: dataframe
- dataset of biological conditions from the input metadata
- factor_list: list
- list of factors to include in the analysis
- number_of_semantic_annotations: int
- number of semantic annotations to include in the analysis
Example:
>> generate_fixed_factor_pool(semantic_df, ["CTCF", "MYC"], 5)
combsafe.get_semantic_annotation_list(semantic_dataframe, factor_list)
List of semantic annotations according to the selected factors.
Parameters:
- semantic_dataframe: dataframe
- dataset of biological conditions from the input metadata
- factor_list: list
- list of factors to include in the analysis
Example:
>> get_semantic_annotation_list(semantic_df, ["CTCF", "MYC", "POLR2A", "H3K4me3", "H3K27me3"])
combsafe.extract_data(factor_list)
Combine sample replicas of the listed factors and extract their semantic annotations regarding the conditions in which they were mapped.
Parameters:
- factor_list: list
- list of factors to include in the analysis
Example:
>> extract_data(["CTCF", "MYC", "POLR2A", "H3K4me3", "H3K27me3"])
combsafe.add_custom_tracks(tracks_label, path_to_custom_tracks, index)
Add custom tracks of static genomic elements to the analysis (e.g., CpG islands).
Parameters:
- tracks_label: string
- name of the custom tracks to add for the analysis
- path_to_custom_tracks: path
- UCSC path for downloading custom tracks
Example:
>> add_custom_tracks("CpG_Islands", "./input_files/cpgIslandExt.txt")
combsafe.identify_functional_states(ChromHMM_path, number_of_states, n_core)
identification of combinations of static and dynamic functional elements throughout the genome.
Parameters:
- ChromHMM_path: path
- path to the chromHMM software folder. It can be downloaded here.
- number_of_states: int
- number of combinations of genomic functional elements
- n_core: int
- number of cores to use for the analysis
Returns:
- functional_states_dataframe
- dataset of functional states for each biological conditions
>> functional_states_df = identify_functional_states(chromhmm_path ="./ChromHMM/", number_of_states = 15, n_core = 20)
Alternatively, it is possible to load in house segmentated files from an other HMM segmentation tool and jump to the next step
combsafe.load_custom_segments(input_segment_dir, num_states)
load functional states files from input path.
Parameters:
- input_segment_dir: path
- path to the segmentated file folder.
- number_of_states: int
- number of combinations of genomic functional elements
Returns:
- functional_states_dataframe
- dataset of functional states for each biological conditions
Example:
>> functional_states_df = load_custom_segments(input_segment_dir ="./Input_folder/in_house_segmentated/", num_states=15)
Input files must be structured as follows:
Input_folder/
|-- Segmentated_Files/
| |-- 1.semantic_annotation_15_segments.bed
| |-- 2.semantic_annotation_15_segments.bed
| |-- 3.semantic_annotation_15_segments.bed
| |-- 4.semantic_annotation_15_segments.bed
| |-- 5.semantic_annotation_15_segments.bed
| |-- 6.semantic_annotation_15_segments.bed
| |-- 7.semantic_annotation_15_segments.bed
| |-- ...
|-- emissions.txt
Segmentated_Files
a folder containing raw reads in fastq formatemissions.txt
a text file structured as follows:
State | H3K4me3 | POLR2A | MYC | H3K27me3 |
---|---|---|---|---|
1 | 0.093326 | 0.457892 | 0.143540 | 0.924645 |
2 | 0.793153 | 0.658634 | 0.972344 | 0.487613 |
3 | 0.940996 | 0.000234 | 0.243758 | 0.187461 |
4 | 0.143540 | 0.763471 | 0.872346 | 0.104765 |
... | ... | ... | ... | ... |
combsafe.show_emission_graph(custom_palette=colors)
Show emission parameters heatmap of genome functional states combination.
Parameters:
- custom_palette: list of exadecimals, default=None
- add a list of customized colors in hexadecimal form to be assigned to the functional states.
Example:
>> colors = ['#c9f9ff', '#e6beff', '#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231','#911eb4', '#bcf60c', '#f032e6', '#fffac8', '#fabebe', '#9a6324', '#46f0f0', '#008080']
>> show_emission_graph(custom_palette=colors)
combsafe.show_distance_matrix()
Show distance matrix heatmap about functional states generated from the emission parameters file of an HMM model.
Example:
>> show_distance_matrix()
combsafe.single_gene_analysis(functional_states_dataframe, path_to_gene_list_file, distance_metric )
Given a list of gene symbols in a textual file, the heatmap of the functional states of the related genomic regions is shown.
Parameters:
- functional_states_dataframe: dataframe
- dataframe of functional states
- path_to_gene_list_file: path
- path to the gene list file
- distance_metric: function/string
- distance metric to use for the data. CombSAFE auto-generate from the input
emission.txt
file a special metricfunctional_state_distance
to weight distances among functional states according to the their function. Alternatively,hamming
distance can be selected
- distance metric to use for the data. CombSAFE auto-generate from the input
Example:
>> single_gene_analysis(functional_states_df, "path_to_gene_list/gene_list.txt", distance_metric = funtional_states_distance)
>> single_gene_analysis(functional_states_df, "path_to_gene_list/gene_list.txt", distance_metric = "hamming")
combsafe.show_distance_matrix()
Shows PCA heatmap among semantic annotation for selected components.
Parameters:
- functional_states_dataframe: dataframe
- dataframe of functional states
- number_of_components: int
- number of components for the principal component analysis
Return:
- loadings: dataframe
- PCA loadings
Example:
>> pca_analysis(functional_states_df, 10)
combsafe.genome_reduction(functional_states_dataframe, threshold)
Reduce the initial functional state dataframe to visualize the functional states of the various semantic annotations in the form of a heatmap.
NB: the proportions among the functional states are maintained as in the previous dataframe of functional states.
Parameters:
- functional_states_dataframe: dataframe
- dataframe of functional states
- threshold: int
- value adopted to avoid the over-representation of states that are specific for a small subset of the conditions or for one condition only. Select 100 for full representation
Return:
- reducted_dataframe: dataframe
- reducted dataframe of functional states
Example:
>> reducted_df = genome_reduction(functional_states_df, threshold=90)
combsafe.data_driven_heatmap(reducted_df, distance_metric, min_clust_size, min_sampl)
Show a genome-wide heatmap with the most significant clusters of genomic regions based on their patterns of functional states.
Parameters:
- reducted_df: dataframe
- reducted dataframe of functional states
- distance_metric: function/string
- distance metric to use for the data. CombSAFE auto-generate from the input
emission.txt
file a special metricfunctional_state_distance
to weight distances among functional states according to the their function. Alternatively,hamming
distance can be selected
- distance metric to use for the data. CombSAFE auto-generate from the input
- min_clust_size: int
- minimum number of clusters accetpted for the analysis
- min_sampl: int
- minimum number of samples per cluster accepted for the analysis
Return:
- clustered_dataframe: dataframe
- dataframe of functional states ordered according to the cluster parameters
Example:
>> clustered_heatmap = data_driven_heatmap(reducted_df, functional_states_distance, min_clust_size=10, min_sampl=2)
combsafe.gene_ontology_enrichment_analysis(clustered_dataframe, distance_metric, goea_tool)
Show a genome-wide heatmap with the most significant clusters of genomic regions based on their patterns of functional states.
Parameters:
- clustered_dataframe: dataframe
- adataframe of functional states ordered according to the cluster parameters
- distance_metric: function/string
- distance metric to use for the data. CombSAFE auto-generate from the input
emission.txt
file a special metricfunctional_state_distance
to weight distances among functional states according to the their function. Alternatively,hamming
distance can be selected
- distance metric to use for the data. CombSAFE auto-generate from the input
- goea_tool: str, "great" or "goatool"
- tool for gene ontology enrichment analysis
Example:
>> gene_ontology_enrichment_analysis(clustered_heatmap, goea_tool = "great", distance_metric=functional_states_distance)