
Updated draft for the volcanoplot function #167

Open
vbrennsteiner wants to merge 26 commits into main from enhanced_scatter

Collaborator

@vbrennsteiner vbrennsteiner commented Jan 28, 2026

"After archiving the initial volcanoplot branch, this is the new and improved volcanoplot branch"

This PR adds a volcanoplot wrapper

By popular and LLM demand, there should be a volcanoplot wrapper function in alphatools. As such, it should abstract the following functionalities:

  • scatterplot
  • adding lines to the scatterplot
  • adding anchored labels to the scatterplot
  • adding a legend
  • setting plot x/y limits

Main difference to the previous PR

Previously, layering of plot features was handled by volcano(), which entangled general-purpose plotting functionality (layering plots) with the specific application of a volcanoplot. In this PR, two new methods plus helpers have been added to the plots.py module:

  • layered_plot: Calls a plotting function repeatedly for any number of specified data layers, ensuring that indices are only plotted once and that indices not used by any layer are plotted in a default color. It operates using a PlotConfig instance and a Callable capable of interpreting the parameters from the PlotConfig and layer-specific parameters. An example is added in tutorials/tutorial_02_basic_plotting_workflow.ipynb.
color_dict = {
    "upregulated": BaseColors.get("red"),
    "downregulated": BaseColors.get("blue"),
}

To avoid having to specify basic parameters for each layer, we can summarize them in a PlotConfig instance with flexible attributes to accommodate different plotting functions:

plot_config = pl.make_scatter_config(
    data=adata,
    x_column="x",
    y_column="y",
    scatter_kwargs={"alpha": 0.7, "s": 50},  # add any kind of coloring/marker specification that should apply to all layers
)

The main specification of 'layered_plot' is the layers list, which consists of tuples with 3-4 elements:

  1. Which column in the data to use for filtering (here 'diff_exp_status')
  2. Which value to match in that column (single value or list) (here 'upregulated' for the first and 'downregulated' for the second layer)
  3. Which key to look up in the color_dict for that layer (here identical to the layer match values in 2.)
  4. Optional: kwargs dict for that layer (here, upregulated points should be triangular and slightly larger)
plot_layers = [
    ("diff_exp_status", "upregulated", "upregulated", {"marker": "^", "s": 100}),
    ("diff_exp_status", "downregulated", "downregulated"),
]

The function is called like this on an AnnData or DataFrame object:

fig, axm = create_figure(1, 1, figsize=(6, 6))
ax = axm.next()
Plots.layered_plot(
    ax=ax,
    base_config=plot_config,
    layers=plot_layers,
    color_dict=color_dict,
)
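
The deduplication rule described above (each index is plotted once, and indices matched by no layer fall back to a default color) can be illustrated with a minimal pandas sketch. This is not the implementation in plots.py: `assign_layers` and the `"default"` key are hypothetical names, and the sketch assumes that the first matching layer wins, which the PR description does not state explicitly.

```python
import pandas as pd


def assign_layers(df, layers, default_key="default"):
    """Hypothetical helper: give each row the color key of the first layer matching it."""
    assigned = pd.Series(default_key, index=df.index, dtype=object)
    taken = pd.Series(False, index=df.index)
    for layer in layers:
        column, match, color_key = layer[:3]  # an optional 4th element (kwargs) is ignored here
        values = list(match) if isinstance(match, (list, tuple, set)) else [match]
        mask = df[column].isin(values) & ~taken  # only rows not claimed by an earlier layer
        assigned[mask] = color_key
        taken |= mask
    return assigned


# Toy data: P1 is both a protein of interest and upregulated (assumed example values)
toy = pd.DataFrame(
    {
        "id": ["P1", "P2", "P3", "P4"],
        "diff_exp_status": ["upregulated", "upregulated", "downregulated", "unchanged"],
    }
)
toy_layers = [
    ("id", ["P1"], "POI_hypothesis"),
    ("diff_exp_status", "upregulated", "upregulated"),
    ("diff_exp_status", "downregulated", "downregulated"),
]
layer_keys = assign_layers(toy, toy_layers)
```

Under the first-match assumption, P1 ends up in the "POI_hypothesis" layer only, and P4 receives the default key because no layer matches it.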

New volcano():

Using layered_plot internally, the volcano function is drastically simplified and can be called as demonstrated in tutorials/tutorial_04_volcanoplot.ipynb. It also incorporates labelling of different layers, which can be specified by their color_dict key. volcano internally constructs a PlotConfig instance to handle default arguments, meaning that users do not have to interact with PlotConfig.

As before, the first element refers to the column in data to be used for filtering, the second element specifies a match value or list for that column and the third provides the color lookup for that layer.

pois = ["P10291", "P10292", "P10293", "P10294", "P10295"]
plot_layers = [
    ("id", pois, "POI_hypothesis"),
    ("diff_exp_status", "upregulated", "upregulated"),
    ("diff_exp_status", "downregulated", "downregulated"),
]
color_dict = {
    "POI_hypothesis": BaseColors.get("purple", lighten=0.7), 
    "upregulated": BaseColors.get("orange"),
    "downregulated": BaseColors.get("blue"), 
}
label_layers = [
    "POI_hypothesis",  # label only the points whose layer points to the "POI_hypothesis" color-key
]

Generating the volcanoplot (with anchored labels):

Plots.volcano(
    data=adata,
    x_column="log2fc",
    y_column="neg_log10pval",
    color_dict=color_dict,
    layers=plot_layers,
    label_layers=label_layers,
    x_label_anchors=[-3.5, 3.5],
)

To create something like this:
[image: resulting volcano plot]
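
For readers who want to try the call above on their own table, the following sketch shows one way the three columns used in the example ("log2fc", "neg_log10pval", "diff_exp_status") could be derived with pandas. The input column "pval", the helper name `classify_diff_exp`, and the 0.05 significance cutoff are assumptions made for illustration; they are not taken from alphapepttools.

```python
import numpy as np
import pandas as pd


def classify_diff_exp(df, fc_thresholds=(-1, 1), pval_threshold=0.05):
    """Hypothetical helper: add the y-axis column and a status column for layering."""
    out = df.copy()
    out["neg_log10pval"] = -np.log10(out["pval"])
    significant = out["pval"] < pval_threshold
    out["diff_exp_status"] = "unchanged"
    out.loc[significant & (out["log2fc"] >= fc_thresholds[1]), "diff_exp_status"] = "upregulated"
    out.loc[significant & (out["log2fc"] <= fc_thresholds[0]), "diff_exp_status"] = "downregulated"
    return out


# Assumed example values, not real results
raw = pd.DataFrame(
    {
        "id": ["P10291", "P10292", "P10293", "P10294"],
        "log2fc": [2.0, -1.5, 0.2, 3.0],
        "pval": [0.001, 0.01, 0.5, 0.2],
    }
)
prepared = classify_diff_exp(raw)
```

The last row has a large fold change but a p-value above the cutoff, so it stays "unchanged" and would be drawn in the default color.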

Open points:

  • Pending the merge of Extend pl module docstrings #164, the label_plot method will take AnnData/DataFrame instances with specified columns instead of separate arrays.
  • The capabilities of layered_plot theoretically allow for layering arbitrary Callable instances (plotting functions) with arbitrary global and layer-specific arguments. Perhaps there is a good use case for that in future demonstration notebooks?

Contributor

Copilot AI left a comment

Pull request overview

This pull request introduces a volcano plot wrapper function and a general layered plotting system for alphapepttools. The implementation adds new data manipulation utilities and a flexible PlotConfig dataclass to support hierarchical scatterplot layering, where points can be colored and styled based on multiple overlapping criteria.

Changes:

  • Added layered_plot method to enable hierarchical plotting with automatic point assignment and deduplication
  • Added volcano wrapper function that uses layered_plot internally for differential expression visualization
  • Introduced PlotConfig dataclass and make_scatter_config factory for flexible plot configuration
  • Added utility functions (data_index_to_array, subset_data, _tolist) to support data manipulation
  • Enhanced scatter method with automatic limit calculation and renamed limit parameters from singular to plural

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 16 comments.

| File | Description |
| --- | --- |
| src/alphapepttools/pp/data.py | Added data utility functions for index extraction, subsetting, and list conversion; updated documentation |
| src/alphapepttools/pl/plots.py | Core implementation of layered plotting system, PlotConfig dataclass, volcano function, and helper utilities |
| src/alphapepttools/pl/__init__.py | Exported new PlotConfig and make_scatter_config for public API |
| docs/notebooks/studies/study_01_biomarker_csf.ipynb | Unintentional notebook output changes from re-execution |

Comments suppressed due to low confidence (1)

src/alphapepttools/pl/plots.py:1083

  • The documentation still references the old parameter names 'xlim' and 'ylim', but the actual parameters have been renamed to 'xlims' and 'ylims' (plural). Update the documentation to match the actual parameter names.
        xlim : tuple[float, float], optional
            Limits for the x-axis. By default None.
        ylim : tuple[float, float], optional
            Limits for the y-axis. By default None.


@codecov-commenter

Codecov Report

❌ Patch coverage is 13.48315% with 154 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.32%. Comparing base (cbd4ca3) to head (83498b1).
⚠️ Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/alphapepttools/pl/plots.py | 12.57% | 139 Missing ⚠️ |
| src/alphapepttools/pp/data.py | 16.66% | 15 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #167      +/-   ##
==========================================
- Coverage   73.61%   72.32%   -1.30%     
==========================================
  Files          35       35              
  Lines        1823     1980     +157     
==========================================
+ Hits         1342     1432      +90     
- Misses        481      548      +67     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/alphapepttools/pl/__init__.py | 100.00% <100.00%> (ø) |
| src/alphapepttools/pp/data.py | 76.33% <16.66%> (-5.01%) ⬇️ |
| src/alphapepttools/pl/plots.py | 31.58% <12.57%> (-8.59%) ⬇️ |

... and 4 files with indirect coverage changes


Collaborator

@lucas-diedrich lucas-diedrich left a comment

I like the plots that are returned by the function a lot, they look really clean.


I commented on some specific things associated with the pull request.

I marked issues that were identified by Claude with [Claude] based on the following prompt:

You are a scientific software developer who is putting a lot of thought into user-friendliness of APIs etc. You are working on a software package for the analysis of proteomics data with anndata as data container. Critically review the changes in the current branch by comparing them to the main branch that aim to implement a convenience function for volcano plots

On a general note, I find the plots module increasingly confusing to review:

  • the Plots class does not provide a unifying config, since its config is not used in any of the functions. In combination with the @classmethod pattern, this means that standalone functions would behave equivalently. Using standalone functions would also align the package more closely with the other scverse packages (which use individual functions).
  • the number of utility functions + implicit contracts (e.g. expected behavior of configs) makes it complicated to understand the logic

-----
Uses pandas.isna() to handle both NaN and None values correctly.
"""
keep_mask = ~(pd.isna(x_values) | pd.isna(y_values))
Collaborator

This action might lead to surprising results (imagine all data coords have a missing value and suddenly no point gets plotted) - should it emit a warning?

Collaborator Author

@vbrennsteiner vbrennsteiner Mar 15, 2026

I suppose this is the tradeoff: if a data point has a missing value on either the x or y axis, how would it be plottable at all? I am quite averse to imputing missing values with zeros. With nonstandard axis choices, or when only a single point is concerned, it could lead to false conclusions about points that have no actual values. I'd much rather handle this upstream with imputation.

As for warnings, this would then raise a warning if any point has a missing value, which could be seen as excessive/annoying.

The scenario where this causes (or rather exposes) a problem is when users expect to see e.g. a certain gene but it does not show up in the plot, which would likely cause them to search for it in the data where the missing values could be identified & addressed.

A pure discovery scenario would (in my opinion) not benefit from somehow showing/including points that have no values associated.
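
One possible middle ground between the two positions in this thread is a single summary warning per call, emitted only when at least one point is dropped. The sketch below is an illustration of that idea, not code from the PR: `finite_xy_mask` and its `warn` switch are hypothetical names, and only the `pd.isna` masking line mirrors the excerpt quoted above.

```python
import warnings

import numpy as np
import pandas as pd


def finite_xy_mask(x_values, y_values, warn=True):
    """Hypothetical helper: mask of plottable points, with one summary warning."""
    x = np.asarray(x_values, dtype=object)
    y = np.asarray(y_values, dtype=object)
    keep_mask = ~(pd.isna(x) | pd.isna(y))  # same rule as in the reviewed snippet
    n_dropped = int((~keep_mask).sum())
    if warn and n_dropped > 0:
        warnings.warn(
            f"{n_dropped} of {len(keep_mask)} points have a missing x or y value and are not plotted.",
            UserWarning,
            stacklevel=2,
        )
    return keep_mask


# Usage: capture the warning to show that it fires once for two dropped points
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mask = finite_xy_mask([1.0, None, 3.0], [2.0, 5.0, float("nan")])
```

Callers who consider the message noisy could pass `warn=False`, which keeps the silent behavior discussed above.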

>>> _get_plot_lims(values, 1.1, set_left=0)
(0, 3.3)
"""
series = pd.Series(values)
Collaborator

Why is this conversion necessary?

Wouldn't

max(abs(min(values)), abs(max(values)))

produce equivalent results?
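
As a note on this question, and not as the author's stated rationale: the two approaches agree on clean input but can differ when the values contain NaN. `pd.Series.min()` and `pd.Series.max()` skip NaN by default, whereas the built-in `min` and `max` compare element by element and return results that depend on where the NaN sits. The function names below are hypothetical.

```python
import math

import pandas as pd


def abs_max_builtin(values):
    """The reviewer's one-liner, using Python built-ins."""
    return max(abs(min(values)), abs(max(values)))


def abs_max_series(values):
    """The same quantity via a pandas Series, which skips NaN by default."""
    series = pd.Series(values)
    return max(abs(series.min()), abs(series.max()))


clean = [-2.0, 1.0, 3.0]
nan_first = [float("nan"), -2.0, 3.0]

same_on_clean = abs_max_builtin(clean) == abs_max_series(clean)
builtin_is_nan = math.isnan(abs_max_builtin(nan_first))
series_result = abs_max_series(nan_first)
```

If missing values are always removed before this helper is called, the built-in version would behave the same and the conversion could be dropped.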

cls,
# Required data parameters
data: ad.AnnData | pd.DataFrame,
x_column: str,
Collaborator

could these values all default to the standardized output of the tl.diff_exp functions?

We would expect that the users just plug in our DEG results (and if they don't, they can change it)

Collaborator

Never mind, I see that the diff-exp outputs are not really standardized at the moment as they vary depending on which kinds of contrasts are passed (t_value__condition1_condition2). Should we update that?

Collaborator Author

The diff exp outputs are standardized with respect to the necessary columns, but we keep the method-specific columns around as well in case users want them. The standardized columns are in tl.defaults.py, so we could set column defaults to those

Collaborator

I would do that

xlim: tuple[float, float] | None = None,
ylim: tuple[float, float] | None = None,
figure_kwargs: dict | None = None,
xlims: tuple[float, float] | None = None,
Collaborator

This is a breaking change - Is it necessary to rename the arguments in this PR?
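
If the rename is kept, a common way to soften a breaking keyword change is to accept the old names for a transition period and emit a DeprecationWarning. The sketch below shows that general pattern only; `resolve_renamed_kwarg` and `scatter_sketch` are hypothetical stand-ins and do not reflect the actual signature of the scatter method.

```python
import warnings


def resolve_renamed_kwarg(new_value, old_value, new_name, old_name):
    """Hypothetical helper: prefer the new keyword, accept the old one with a warning."""
    if old_value is None:
        return new_value
    if new_value is not None:
        raise TypeError(f"Pass either '{new_name}' or '{old_name}', not both.")
    warnings.warn(
        f"'{old_name}' is deprecated; use '{new_name}' instead.",
        DeprecationWarning,
        stacklevel=3,  # point at the caller of the plotting function
    )
    return old_value


def scatter_sketch(xlims=None, ylims=None, xlim=None, ylim=None):
    """Stand-in for a plotting function that renamed xlim/ylim to xlims/ylims."""
    xlims = resolve_renamed_kwarg(xlims, xlim, "xlims", "xlim")
    ylims = resolve_renamed_kwarg(ylims, ylim, "ylims", "ylim")
    return xlims, ylims


# Old-style call still works but warns
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_result = scatter_sketch(xlim=(0, 1))
```

Existing notebooks that pass `xlim` would keep running, and the old keywords could be removed in a later release.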

layers: list[tuple] | None = None,
color_dict: dict[str, str | tuple] | None = None,
# Volcano-specific thresholds
x_thresholds: tuple | None = (-1, 1),
Collaborator

I think the None option does not add value here

max_labels: int | None = None,
x_label_anchors: list[float] | None = None,
y_display_start: float | None = 1,
y_padding_factor: float | None = 4,
Collaborator

Would drop None default

display_id_column: str | None = None,
max_labels: int | None = None,
x_label_anchors: list[float] | None = None,
y_display_start: float | None = 1,
Collaborator

Would drop None default

left = -abs_max * padding_factor
right = abs_max * padding_factor
else:
left = series.min() * padding_factor
Collaborator

[Claude] This assumes that the minimal value of series is negative (otherwise the limit is shifted to the right)
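
The issue can be reproduced with all-positive values: multiplying a positive minimum by a factor above 1 moves the left limit into the data. The sketch below contrasts that with a sign-aware variant. Both functions are hypothetical illustrations, not the code of `_get_plot_lims`; `naive_limits` mirrors the `left = series.min() * padding_factor` line quoted above and assumes the right limit is computed analogously, and `padding_factor` is assumed to be at least 1.

```python
def naive_limits(values, padding_factor=1.1):
    """Multiply both extremes by the padding factor, regardless of sign."""
    return min(values) * padding_factor, max(values) * padding_factor


def sign_aware_limits(values, padding_factor=1.1):
    """Hypothetical alternative: always move each limit away from the data."""
    lo, hi = min(values), max(values)
    left = lo * padding_factor if lo < 0 else lo / padding_factor
    right = hi * padding_factor if hi > 0 else hi / padding_factor
    return left, right


positive = [2.0, 3.0]
negative = [-3.0, -1.0]

naive_left, _ = naive_limits(positive)        # left limit lands above the smallest point
safe_left, safe_right = sign_aware_limits(positive)
_, naive_right_neg = naive_limits(negative)   # right limit lands below the largest point
_, safe_right_neg = sign_aware_limits(negative)
```

An additive padding based on the data range would be another option; it also handles an extreme of exactly zero, which the multiplicative variants leave unpadded.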

data: ad.AnnData | pd.DataFrame | None = None
_extra: dict | None = None # Store additional fields

def __post_init__(self):
Collaborator

from dataclasses import dataclass, field

@dataclass
class Data:
    extra: dict[str, int] = field(default_factory=dict)
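
Applied to a config-like class, the suggestion above could look like the following sketch. `PlotConfigSketch` and its fields are hypothetical and much smaller than the PlotConfig in this PR. The second part shows why `default_factory` is needed: the dataclasses module rejects a literal `{}` default with a ValueError, because a single dict would otherwise be shared by all instances.

```python
from dataclasses import dataclass, field


@dataclass
class PlotConfigSketch:
    """Hypothetical, reduced stand-in for a plot configuration."""

    x_column: str = "x"
    y_column: str = "y"
    extra: dict = field(default_factory=dict)  # a fresh dict per instance, never None


first = PlotConfigSketch()
second = PlotConfigSketch()
first.extra["marker"] = "^"  # does not leak into `second`

# A literal mutable default is refused when the class is created
try:

    @dataclass
    class SharedDefault:
        extra: dict = {}

    mutable_default_rejected = False
except ValueError:
    mutable_default_rejected = True
```

With this pattern the `dict | None = None` default and the corresponding None check in `__post_init__` would no longer be necessary.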


4 participants