Idhandling I #378

Merged 75 commits on Nov 27, 2024

Commits
d0433b8
Add removal of tailing lines to maxquant loader and adjust tests (par…
JuliaS92 Nov 19, 2024
ec3911a
Merge remote-tracking branch 'origin/development' into idhandling
JuliaS92 Nov 19, 2024
76ab9c6
fix test (part two)
JuliaS92 Nov 19, 2024
2245e52
draft function to create mappings to be used around the app
JuliaS92 Nov 19, 2024
e8c9a49
write dicts to DataSet when mat is updated
JuliaS92 Nov 19, 2024
45e846a
make it work if gene names are not present
JuliaS92 Nov 19, 2024
3bc4132
fix and prepare tests
JuliaS92 Nov 19, 2024
474c5b7
Switch to new gene map and adjust intensity plot
JuliaS92 Nov 19, 2024
4c6d13b
Fix page link to LLM
JuliaS92 Nov 19, 2024
c7dc4a2
remove old mapping
JuliaS92 Nov 19, 2024
9a7cbd2
update name and documentation for getter
JuliaS92 Nov 19, 2024
de8be4d
Deduplicate and adjust y-axis label
JuliaS92 Nov 20, 2024
8a53232
Add test for id mapping based on synthetic dataset
JuliaS92 Nov 20, 2024
aa06c74
Simplify mapper at the cost of passing over data twice
JuliaS92 Nov 20, 2024
b724ee0
Use feature repr in volcano plot
JuliaS92 Nov 20, 2024
021896f
Fix failing test on generic loader
JuliaS92 Nov 20, 2024
ba46cd8
Fix yaxis after logtransformation
JuliaS92 Nov 20, 2024
df5b9a2
Adjust dropdown for intensity plot
JuliaS92 Nov 20, 2024
9d63c9d
Make uniprot retrieval handle protein ids and also return all results…
JuliaS92 Nov 20, 2024
c259545
Slim extraction from uniprot entries
JuliaS92 Nov 20, 2024
b8df06a
Add functions to handle features (pro;tein;ids) as input to uniprot r…
JuliaS92 Nov 20, 2024
7738854
Fix extraction test
JuliaS92 Nov 21, 2024
9ff89fa
fix MQ loader
JuliaS92 Nov 21, 2024
5335a57
Handle immunoglobulins, since they don't always have gene names.
JuliaS92 Nov 21, 2024
bf5c62d
Edge case for immunoglobulins
JuliaS92 Nov 21, 2024
4b0594e
Set up test class for id selection
JuliaS92 Nov 21, 2024
8438c6f
Add tests to selection
JuliaS92 Nov 21, 2024
1c1289d
Fix plotly dislay
JuliaS92 Nov 21, 2024
6242743
Fill annotation store when transferring a result to the llm page.
JuliaS92 Nov 21, 2024
238203c
Link out to selected uniprot id instead of querying for the gene name
JuliaS92 Nov 21, 2024
c02f0be
Introduce constants for extracted fields and create text representati…
JuliaS92 Nov 21, 2024
9d920fe
change analysis helper to fit
JuliaS92 Nov 21, 2024
9506c0e
Add element to LLM page to display retrieved data
JuliaS92 Nov 21, 2024
94d52bd
fix tests as far as feasible
JuliaS92 Nov 21, 2024
7fb9233
extract and document getting regulated features
JuliaS92 Nov 22, 2024
d95f07d
Tests for get_regulated_features
JuliaS92 Nov 22, 2024
8d3cad5
Add documentation and tests for get_uniprot_data
JuliaS92 Nov 22, 2024
166aa99
Todos for large numbers of proteins.
JuliaS92 Nov 22, 2024
4c38218
Extract and document display of retrieved information
JuliaS92 Nov 22, 2024
e7cc17e
Fix tests for display_proteins
JuliaS92 Nov 22, 2024
f61649f
move uniprot utils tests
JuliaS92 Nov 22, 2024
bcef330
Refactor uniprot_utils (function names, docstrings, ordering)
JuliaS92 Nov 22, 2024
6e42f58
Add interface to LLM page to select uniprot information
JuliaS92 Nov 22, 2024
56367d9
Fix broken tests
JuliaS92 Nov 22, 2024
baeab65
Add tests for the two remaining public functions in uniprot utils.
JuliaS92 Nov 22, 2024
4ede179
Add important TODO
JuliaS92 Nov 22, 2024
da9ffa6
Speed up id dicts
JuliaS92 Nov 25, 2024
4174c07
Speed up id_dicts
JuliaS92 Nov 25, 2024
01ecbbd
Simplify parsing of the protein name
JuliaS92 Nov 25, 2024
e9f228a
ClassNameCapital
JuliaS92 Nov 25, 2024
adb223e
Add quick fixes and todos addressing comments on https://github.com/M…
JuliaS92 Nov 25, 2024
5dc8f2c
Implement quick wins or add TODOs according to PR comments https://gi…
JuliaS92 Nov 25, 2024
d34cadc
TODOs from PR conversation
JuliaS92 Nov 25, 2024
9b6131a
Merge pull request #383 from MannLabs/annotation-retrieval
JuliaS92 Nov 25, 2024
1e7634a
Merge branch 'idhandling' into idhandling-ii
JuliaS92 Nov 25, 2024
f96dc60
Merge pull request #379 from MannLabs/idhandling-ii
JuliaS92 Nov 25, 2024
df4281c
Fix test
JuliaS92 Nov 26, 2024
c4de9cb
Iterator TODO
JuliaS92 Nov 26, 2024
972a4e9
Raise on invalid gene lookup, write tests for intensity plots from ge…
JuliaS92 Nov 26, 2024
5206286
remove todo
JuliaS92 Nov 26, 2024
e6ced77
Run LLM checkbox list on just one session state key.
JuliaS92 Nov 26, 2024
4500b2f
pass session state elements to display function
JuliaS92 Nov 26, 2024
ac7e71b
2 minor changes
JuliaS92 Nov 26, 2024
0873ea0
Move uniprot retrieval from dict to list, and an empty one in case of…
JuliaS92 Nov 26, 2024
f3ade6d
Add caution comments
JuliaS92 Nov 26, 2024
f1768ab
Split uniprot result selection into two functions
JuliaS92 Nov 26, 2024
bb4e0e2
simplify pathway extraction
JuliaS92 Nov 26, 2024
eb8e43e
make intensity column calling a method of the base loader
JuliaS92 Nov 26, 2024
0df1df3
Document maxquant data handling better
JuliaS92 Nov 26, 2024
7e2ae7c
Last minor changes
JuliaS92 Nov 26, 2024
dc3035d
simplify formatting of uniprot content
JuliaS92 Nov 26, 2024
af2f702
fix test
JuliaS92 Nov 26, 2024
845a1b1
Address nits, renames, copy
JuliaS92 Nov 27, 2024
76e731c
Switch spinner to sqdm progress bar
JuliaS92 Nov 27, 2024
a14e7d2
Show feature repr at preview on LLM page.
JuliaS92 Nov 27, 2024
105 changes: 75 additions & 30 deletions alphastats/dataset/dataset.py
@@ -1,3 +1,4 @@
from collections import defaultdict
from typing import Dict, List, Optional, Tuple, Union

import pandas as pd
@@ -101,23 +102,11 @@ def __init__(
self.mat: pd.DataFrame = mat
self.metadata: pd.DataFrame = metadata
self.preprocessing_info: Dict = preprocessing_info

self._gene_name_to_protein_id_map = (
{
k: v
for k, v in dict(
zip(
self.rawinput[Cols.GENE_NAMES].tolist(),
self.rawinput[Cols.INDEX].tolist(),
)
).items()
if isinstance(k, str) # avoid having NaN as key
}
if Cols.GENE_NAMES in self.rawinput.columns
else {}
)
# TODO This is not necessarily unique, and should ideally raise an error in some of our test-data sets that
# contain isoform ids. E.g. TPM1 occurs 5 times in testfiles/maxquant/proteinGroups.txt with different base Protein IDs.
(
self._gene_to_features_map,
self._protein_to_features_map,
self._feature_to_repr_map,
) = self._create_id_dicts()

print("DataSet has been created.")

@@ -161,6 +150,55 @@ def _check_loader(loader):
"Invalid index_column: consider reloading your data with: AlphaPeptLoader, MaxQuantLoader, DIANNLoader, FragPipeLoader, SpectronautLoader"
)

def _create_id_dicts(self, sep: str = ";") -> Tuple[dict, dict, dict]:
"""
Create mappings from gene and protein to feature, and from feature to representation.
Features are the entities measured in each sample, usually protein groups represented by semicolon separated protein ids.
This is to maintain the many-to-many relationships between the three entities feature, protein and gene.

This method processes the raw input data to generate three dictionaries:
1. gene_to_features_map: Maps each gene to a list of features.
2. protein_to_features_map: Maps each protein to a list of features.
3. feature_to_repr_map: Maps each feature to its representation string.

Args:
sep (str): The separator used to split gene and protein identifiers. Default is ";".

Returns:
Tuple[dict, dict, dict]: A tuple containing three dictionaries:
- gene_to_features_map (dict): A dictionary mapping genes to features.
- protein_to_features_map (dict): A dictionary mapping proteins to features.
- feature_to_repr_map (dict): A dictionary mapping features to their representation strings.
"""

features = set(self.mat.columns.to_list())
gene_to_features_map = defaultdict(list)
protein_to_features_map = defaultdict(list)
feature_to_repr_map = {}

for proteins, feature in zip(
self.rawinput[Cols.INDEX], self.rawinput[Cols.INDEX]
):
if feature not in features:
continue
# TODO: Shorten list if too many ids e.g. to id1;...(19) if 20 ids are present
feature_to_repr_map[feature] = "ids:" + proteins
for protein in proteins.split(sep):
protein_to_features_map[protein].append(feature)

if Cols.GENE_NAMES in self.rawinput.columns:
for genes, feature in zip(
self.rawinput[Cols.GENE_NAMES], self.rawinput[Cols.INDEX]
):
if feature not in features:
continue
if isinstance(genes, str):
for gene in genes.split(sep):
gene_to_features_map[gene].append(feature)
feature_to_repr_map[feature] = genes

return gene_to_features_map, protein_to_features_map, feature_to_repr_map

def _get_preprocess(self) -> Preprocess:
"""Return instance of the Preprocess object."""
return Preprocess(
@@ -199,6 +237,11 @@ def preprocess(
**kwargs,
)
)
(
self._gene_to_features_map,
self._protein_to_features_map,
self._feature_to_repr_map,
) = self._create_id_dicts()

def reset_preprocessing(self):
"""Reset all preprocessing steps"""
@@ -208,6 +251,11 @@ def reset_preprocessing(self):
self.metadata,
self.preprocessing_info,
) = self._get_init_dataset()
(
self._gene_to_features_map,
self._protein_to_features_map,
self._feature_to_repr_map,
) = self._create_id_dicts()

def batch_correction(self, batch: str) -> None:
"""A wrapper for Preprocess.batch_correction(), see documentation there."""
@@ -419,6 +467,7 @@ def plot_volcano(
rawinput=self.rawinput,
metadata=self.metadata,
preprocessing_info=self.preprocessing_info,
feature_to_repr_map=self._feature_to_repr_map,
group1=group1,
group2=group2,
column=column,
@@ -434,26 +483,22 @@ def plot_volcano(

return volcano_plot.plot

def _get_protein_id_for_gene_name(
def _get_features_for_gene_name(
self,
gene_name: str,
) -> str:
"""Get protein id from gene id. If gene id is not present, return gene id, as we might already have a gene id.
'VCL;HEL114' -> 'P18206;A0A024QZN4;V9HWK2;B3KXA2;Q5JQ13;B4DKC9;B4DTM7;A0A096LPE1'
) -> list:
"""Get the features (protein groups) for a gene name. Raises a ValueError if the gene name is not in the (processed) data.
'HEL114' -> ['P18206;A0A024QZN4;V9HWK2;B3KXA2;Q5JQ13;B4DKC9;B4DTM7;A0A096LPE1']

Args:
gene_name (str): Gene name

Returns:
str: Protein id or gene name if not present in the mapping.
list: Features (protein group ids) that the gene name maps to.
"""
if gene_name in self._gene_name_to_protein_id_map:
return self._gene_name_to_protein_id_map[gene_name]

for gene, protein_id in self._gene_name_to_protein_id_map.items():
if gene_name in gene.split(";"):
return protein_id
return gene_name
if gene_name in self._gene_to_features_map:
return self._gene_to_features_map[gene_name]
raise ValueError(f"Gene {gene_name} is not in the (processed) data.")

def plot_intensity(
self,
@@ -492,7 +537,7 @@ def plot_intensity(
if gene_name is None and protein_id is not None:
pass
elif gene_name is not None and protein_id is None:
protein_id = self._get_protein_id_for_gene_name(gene_name)
protein_id = self._get_features_for_gene_name(gene_name)
else:
raise ValueError(
"Either protein_id or gene_name must be provided, but not both."
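The mapping logic added in `_create_id_dicts` can be tried out in isolation. The following is a minimal sketch, not the project's implementation: it assumes the raw input is reduced to plain `(feature, gene_names)` tuples instead of a pandas DataFrame, and the function name `create_id_dicts` and the toy identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical stand-in for the rawinput rows: (feature id, gene names or None).
Row = Tuple[str, Optional[str]]


def create_id_dicts(
    rows: Iterable[Row], features: Iterable[str], sep: str = ";"
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]], Dict[str, str]]:
    """Build gene->features, protein->features and feature->repr maps."""
    kept = set(features)  # only features still present after preprocessing
    gene_to_features = defaultdict(list)
    protein_to_features = defaultdict(list)
    feature_to_repr = {}

    for feature, genes in rows:
        if feature not in kept:
            continue
        # Default representation: the protein ids themselves.
        feature_to_repr[feature] = "ids:" + feature
        for protein in feature.split(sep):
            protein_to_features[protein].append(feature)
        # Gene names, when available, replace the default representation.
        if isinstance(genes, str):
            for gene in genes.split(sep):
                gene_to_features[gene].append(feature)
            feature_to_repr[feature] = genes

    return dict(gene_to_features), dict(protein_to_features), feature_to_repr


rows = [
    ("P1;P2", "TPM1"),
    ("P3", "TPM1;TPM2"),
    ("P4", None),      # no gene name, e.g. some immunoglobulins
    ("P5", "GAPDH"),   # removed during preprocessing, so it is skipped
]
gene_map, protein_map, repr_map = create_id_dicts(rows, ["P1;P2", "P3", "P4"])
```

In this toy data `TPM1` maps to two features, which is the many-to-many case the old single-valued `_gene_name_to_protein_id_map` could not represent.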
1 change: 1 addition & 0 deletions alphastats/dataset/plotting.py
@@ -12,6 +12,7 @@
from alphastats.plots.plot_utils import PlotUtils


# TODO: Remove redundancy with PlotlyObject
class plotly_object(plotly.graph_objs._figure.Figure):
plotting_data = None
preprocessing = None
1 change: 0 additions & 1 deletion alphastats/dataset/preprocessing.py
@@ -335,7 +335,6 @@ def _normalization(self, method: str) -> None:
def _log2_transform(self):
self.mat = np.log2(self.mat)
self.mat = self.mat.replace([np.inf, -np.inf], np.nan)
# TODO: Ideally we wouldn't need to replace infs if all downstream methods can handle them
self.preprocessing_info.update({PreprocessingStateKeys.LOG2_TRANSFORMED: True})
print("Data has been log2-transformed.")

7 changes: 6 additions & 1 deletion alphastats/gui/pages/05_Analysis.py
@@ -4,6 +4,8 @@
from alphastats.gui.utils.analysis_helper import (
display_analysis_result_with_buttons,
gather_parameters_and_do_analysis,
gather_uniprot_data,
get_regulated_features,
)
from alphastats.gui.utils.ui_helper import (
StateKeys,
@@ -92,9 +94,12 @@ def show_start_llm_button(analysis_method: str) -> None:
if StateKeys.LLM_INTEGRATION in st.session_state:
del st.session_state[StateKeys.LLM_INTEGRATION]
st.session_state[StateKeys.LLM_INPUT] = (analysis_object, parameters)
regulated_features = get_regulated_features(analysis_object)
# TODO: Add confirmation prompt if an excessive number of proteins is to be looked up.
gather_uniprot_data(regulated_features)

st.toast("LLM analysis created!", icon="✅")
st.page_link("pages/05_LLM.py", label="=> Go to LLM page..")
st.page_link("pages/06_LLM.py", label="=> Go to LLM page..")


if analysis_result is not None:
40 changes: 34 additions & 6 deletions alphastats/gui/pages/06_LLM.py
@@ -4,17 +4,25 @@
import streamlit as st
from openai import AuthenticationError

from alphastats.dataset.keys import Cols
from alphastats.dataset.plotting import plotly_object
from alphastats.gui.utils.analysis_helper import (
display_figure,
)
from alphastats.gui.utils.llm_helper import (
display_uniprot,
get_display_proteins_html,
llm_connection_test,
set_api_key,
)
from alphastats.gui.utils.ui_helper import StateKeys, init_session_state, sidebar_info
from alphastats.gui.utils.ui_helper import (
StateKeys,
init_session_state,
sidebar_info,
)
from alphastats.llm.llm_integration import LLMIntegration, Models
from alphastats.llm.prompts import get_initial_prompt, get_system_message
from alphastats.plots.plot_utils import PlotlyObject

OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")

@@ -99,7 +107,7 @@ def llm_config():
with c1:
regulated_genes_df = volcano_plot.res[volcano_plot.res["label"] != ""]
regulated_genes_dict = dict(
zip(regulated_genes_df["label"], regulated_genes_df["color"].tolist())
zip(regulated_genes_df[Cols.INDEX], regulated_genes_df["color"].tolist())
)

if not regulated_genes_dict:
@@ -118,17 +126,37 @@ def llm_config():
with c11:
st.write("Upregulated genes")
st.markdown(
get_display_proteins_html(upregulated_genes, True), unsafe_allow_html=True
get_display_proteins_html(
upregulated_genes,
True,
annotation_store=st.session_state[StateKeys.ANNOTATION_STORE],
feature_to_repr_map=st.session_state[
StateKeys.DATASET
]._feature_to_repr_map,
),
unsafe_allow_html=True,
)

with c12:
st.write("Downregulated genes")
st.markdown(
get_display_proteins_html(downregulated_genes, False),
get_display_proteins_html(
downregulated_genes,
False,
annotation_store=st.session_state[StateKeys.ANNOTATION_STORE],
feature_to_repr_map=st.session_state[
StateKeys.DATASET
]._feature_to_repr_map,
),
unsafe_allow_html=True,
)


st.markdown("##### Select which information from Uniprot to supply to the LLM")
display_uniprot(
regulated_genes_dict, st.session_state[StateKeys.DATASET]._feature_to_repr_map
)

st.markdown("##### Prompts generated based on analysis input")

model_name = st.session_state[StateKeys.MODEL_NAME]
@@ -218,8 +246,8 @@ def llm_chat(llm_integration: LLMIntegration, show_all: bool = False):
for artifact in message["artifacts"]:
if isinstance(artifact, pd.DataFrame):
st.dataframe(artifact)
elif "plotly" in str(
type(artifact)
elif isinstance(
artifact, (PlotlyObject, plotly_object)
): # TODO can there be non-plotly types here
st.plotly_chart(artifact)
elif not isinstance(artifact, str):
18 changes: 14 additions & 4 deletions alphastats/gui/utils/analysis.py
@@ -189,12 +189,21 @@ def show_widget(self):
"""Gather parameters for intensity plot analysis."""
super().show_widget()

protein_id = st.selectbox(
"ProteinID/ProteinGroup",
options=self._dataset.mat.columns.to_list(),
protein_id_or_gene_name = st.selectbox(
"Gene or protein identifier to plot",
options=list(self._dataset._gene_to_features_map.keys())
+ list(self._dataset._protein_to_features_map.keys()),
)

self._parameters.update({"protein_id": protein_id})
self._parameters.update(
{
"protein_id": self._dataset._gene_to_features_map[
protein_id_or_gene_name
]
if protein_id_or_gene_name in self._dataset._gene_to_features_map
else self._dataset._protein_to_features_map[protein_id_or_gene_name]
}
)

def _do_analysis(self):
"""Draw Intensity Plot using the IntensityPlot class."""
@@ -327,6 +336,7 @@ def _do_analysis(self):
rawinput=self._dataset.rawinput,
metadata=self._dataset.metadata,
preprocessing_info=self._dataset.preprocessing_info,
feature_to_repr_map=self._dataset._feature_to_repr_map,
group1=self._parameters["group1"],
group2=self._parameters["group2"],
column=self._parameters["column"],
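The intensity-plot widget above offers gene names and protein ids in a single dropdown and resolves the selection to a list of features, checking the gene map first. A minimal sketch of that resolution order, assuming plain dictionaries; the helper names `selectable_identifiers` and `resolve_to_features` and the explicit `ValueError` for unknown identifiers are hypothetical additions, since the widget itself only ever receives keys of the two maps.

```python
from typing import Dict, List


def selectable_identifiers(
    gene_to_features: Dict[str, List[str]],
    protein_to_features: Dict[str, List[str]],
) -> List[str]:
    """Dropdown options: all gene names first, then all protein ids."""
    return list(gene_to_features) + list(protein_to_features)


def resolve_to_features(
    identifier: str,
    gene_to_features: Dict[str, List[str]],
    protein_to_features: Dict[str, List[str]],
) -> List[str]:
    """Resolve a gene name or protein id to its features; genes take precedence."""
    if identifier in gene_to_features:
        return gene_to_features[identifier]
    if identifier in protein_to_features:
        return protein_to_features[identifier]
    # Assumption: fail loudly rather than return the identifier unchanged.
    raise ValueError(f"{identifier} is neither a known gene nor a known protein id.")


gene_map = {"TPM1": ["P1;P2", "P3"]}
protein_map = {"P1": ["P1;P2"], "P2": ["P1;P2"], "P3": ["P3"]}
options = selectable_identifiers(gene_map, protein_map)
tpm1_features = resolve_to_features("TPM1", gene_map, protein_map)
p2_features = resolve_to_features("P2", gene_map, protein_map)
```

Because a gene can map to several features, the value stored under `"protein_id"` is a list, so downstream plotting code has to accept more than one feature.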
49 changes: 49 additions & 0 deletions alphastats/gui/utils/analysis_helper.py
@@ -3,7 +3,9 @@

import pandas as pd
import streamlit as st
from stqdm import stqdm

from alphastats.dataset.keys import Cols
from alphastats.gui.utils.analysis import (
ANALYSIS_OPTIONS,
PlottingOptions,
@@ -13,6 +15,7 @@
StateKeys,
show_button_download_df,
)
from alphastats.llm.uniprot_utils import get_annotations_for_feature
from alphastats.plots.plot_utils import PlotlyObject


@@ -197,3 +200,49 @@ def gather_parameters_and_do_analysis(

else:
raise ValueError(f"Analysis method {analysis_method} not found.")


def gather_uniprot_data(features: list) -> None:
"""
Gathers UniProt data for a list of features and stores it in the session state.

Features that are already in the session state are skipped.

Args:
features (list): A list of features for which UniProt data needs to be gathered.
Returns:
None
"""
for feature in stqdm(
features,
desc="Retrieving uniprot data on regulated features ...",
mininterval=1,
):
if feature in st.session_state[StateKeys.ANNOTATION_STORE]:
continue
# TODO: Add some kind of rate limitation to avoid being locked out by uniprot
st.session_state[StateKeys.ANNOTATION_STORE][feature] = (
get_annotations_for_feature(feature)
)


def get_regulated_features(analysis_object: PlotlyObject) -> list:
"""
Retrieve regulated features from the analysis object.
This function extracts features that are labeled (i.e., have a non-empty label)
from the analysis results. It is specifically designed to work with volcano plots.
Args:
analysis_object (PlotlyObject): An object containing analysis results,
including feature indices and labels.
Returns:
list: A list of regulated features that have non-empty labels.
"""
# TODO: add a method to the AbstractAnalysis class to retrieve regulated features upon analysis to store in the session state. This function here only works for volcano plots.
regulated_features = [
feature
for feature, label in zip(
analysis_object.res[Cols.INDEX], analysis_object.res["label"]
)
if label != ""
]
return regulated_features
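The two helpers above combine into a "select labeled features, then fetch only what is not cached" pattern. The sketch below illustrates it without Streamlit or network access, under stated assumptions: `store` stands in for `st.session_state[StateKeys.ANNOTATION_STORE]`, `fetch` stands in for `get_annotations_for_feature`, and the `min_interval` pause is a hypothetical way to address the rate-limiting TODO, not something this PR implements.

```python
import time
from typing import Any, Callable, Dict, Iterable, List


def regulated_features(indices: Iterable[str], labels: Iterable[str]) -> List[str]:
    """Keep the features whose volcano-plot label is non-empty."""
    return [feature for feature, label in zip(indices, labels) if label != ""]


def gather_annotations(
    features: Iterable[str],
    store: Dict[str, Any],
    fetch: Callable[[str], Any],
    min_interval: float = 0.0,
) -> None:
    """Fill `store` with annotations, skipping features that are already cached."""
    for feature in features:
        if feature in store:
            continue
        store[feature] = fetch(feature)
        if min_interval > 0:
            # Hypothetical throttle: pause after each real request.
            time.sleep(min_interval)


calls: List[str] = []


def fake_fetch(feature: str) -> str:
    """Test double that records which features were actually requested."""
    calls.append(feature)
    return f"annotation for {feature}"


store = {"P1": "cached"}
features = regulated_features(["P1", "P2", "P3"], ["GENE1", "", "GENE3"])
gather_annotations(features, store, fake_fetch)
```

Only `P3` triggers a request here: `P2` has an empty label and `P1` is already in the store, which is why repeated transfers to the LLM page do not re-query the same features.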