Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lefesrPlotClad function #63

Open
wants to merge 5 commits into
base: devel
Choose a base branch
from
Open

Conversation

sdgamboa
Copy link
Contributor

I'll keep working here the progress made in #61

Feedback so far:

  1. Is there any way to deal with the overlapping labels for publication? Maybe an option to only label internal nodes instead?
  2. An option to choose other color palettes (including a color-blind option) would be great.

@sdgamboa
Copy link
Contributor Author

sdgamboa commented Sep 18, 2024

Add lefserPlotClad:

  • lefserPlotClad supports both "lefser_df" and "lefser_df_all". However, the plot
    for "lefser_df" would not be very informative. Users should use the lefserAllRanks function if interested in the internal nodes (taxonomic ranks).
  • An option for choosing palettes or custom colors has been added.
  • Added unit tests.
  • The user can choose which taxonomic ranks are plotted (to avoid crowded labels).
suppressPackageStartupMessages(library(lefser))
data("zeller14")
z14 <- zeller14[, zeller14$study_condition != "adenoma"]
tn <- get_terminal_nodes(rownames(z14))
z14tn <- z14[tn, ]
z14tn_ra <- relativeAb(z14tn)

res <- lefser(z14tn_ra, groupCol = "study_condition")
#> The outcome variable is specified as 'study_condition' and the reference category is 'control'.
#>  See `?factor` or `?relevel` to change the reference category.
resAll <- lefserAllRanks(relab = z14tn_ra, groupCol = "study_condition")
#> The outcome variable is specified as 'study_condition' and the reference category is 'control'.
#>  See `?factor` or `?relevel` to change the reference category.
#> The outcome variable is specified as 'study_condition' and the reference category is 'control'.
#>  See `?factor` or `?relevel` to change the reference category.
#> The outcome variable is specified as 'study_condition' and the reference category is 'control'.
#>  See `?factor` or `?relevel` to change the reference category.
#> The outcome variable is specified as 'study_condition' and the reference category is 'control'.
#>  See `?factor` or `?relevel` to change the reference category.
#> The outcome variable is specified as 'study_condition' and the reference category is 'control'.
#>  See `?factor` or `?relevel` to change the reference category.
#> Warning in lefser(relab = x, ...): Convert counts to relative abundances with
#> 'relativeAb()'
#> The outcome variable is specified as 'study_condition' and the reference category is 'control'.
#>  See `?factor` or `?relevel` to change the reference category.
#> Warning in lefser(relab = x, ...): Convert counts to relative abundances with
#> 'relativeAb()'
#> The outcome variable is specified as 'study_condition' and the reference category is 'control'.
#>  See `?factor` or `?relevel` to change the reference category.
#> The outcome variable is specified as 'study_condition' and the reference category is 'control'.
#>  See `?factor` or `?relevel` to change the reference category.

x <- lefserPlotClad(df = res)
#> Woriking with lefser_df. Consider using lefserAll.
#> Using palette: colorblind
x

y <- lefserPlotClad(df = resAll)
#> Working with lefser_df_all
#> Using palette: colorblind
y

z <- lefserPlotClad(df = resAll, showTipLabels = TRUE, showNodeLabels = c("c"))
#> Working with lefser_df_all
#> Using palette: colorblind
z

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.0 (2024-04-24)
#>  os       Pop!_OS 22.04 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2024-09-17
#>  pandoc   3.1.12.3 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package                  * version  date (UTC) lib source
#>  abind                      1.4-5    2016-07-21 [1] RSPM (R 4.4.0)
#>  ape                        5.8      2024-04-11 [1] RSPM (R 4.4.0)
#>  aplot                      0.2.3    2024-06-17 [1] RSPM (R 4.4.0)
#>  backports                  1.5.0    2024-05-23 [1] RSPM (R 4.4.0)
#>  base64enc                  0.1-3    2015-07-28 [1] RSPM (R 4.4.0)
#>  beachmat                   2.21.6   2024-09-05 [1] Bioconductor 3.20 (R 4.4.0)
#>  beeswarm                   0.4.0    2021-06-01 [1] RSPM (R 4.4.0)
#>  Biobase                  * 2.65.1   2024-08-28 [1] Bioconductor 3.20 (R 4.4.0)
#>  BiocGenerics             * 0.51.1   2024-09-03 [1] Bioconductor 3.20 (R 4.4.0)
#>  BiocNeighbors              1.99.0   2024-09-05 [1] Bioconductor 3.20 (R 4.4.0)
#>  BiocParallel               1.39.0   2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#>  BiocSingular               1.21.3   2024-09-10 [1] Bioconductor 3.20 (R 4.4.0)
#>  Biostrings                 2.73.1   2024-06-02 [1] Bioconductor 3.20 (R 4.4.0)
#>  bluster                    1.15.1   2024-09-06 [1] Bioconductor 3.20 (R 4.4.0)
#>  boot                       1.3-31   2024-08-28 [2] RSPM (R 4.4.0)
#>  brio                       1.1.5    2024-04-24 [1] RSPM (R 4.4.0)
#>  checkmate                  2.3.2    2024-07-29 [1] RSPM (R 4.4.0)
#>  cli                        3.6.3    2024-06-21 [1] RSPM (R 4.4.0)
#>  cluster                    2.1.6    2023-12-01 [2] CRAN (R 4.4.0)
#>  codetools                  0.2-20   2024-03-31 [2] CRAN (R 4.4.0)
#>  coin                       1.4-3    2023-09-27 [1] RSPM (R 4.4.0)
#>  colorspace                 2.1-1    2024-07-26 [1] RSPM (R 4.4.0)
#>  crayon                     1.5.3    2024-06-20 [1] RSPM (R 4.4.0)
#>  data.table                 1.16.0   2024-08-27 [1] RSPM (R 4.4.0)
#>  DBI                        1.2.3    2024-06-02 [1] RSPM (R 4.4.0)
#>  DECIPHER                   3.1.4    2024-06-12 [1] Bioconductor 3.20 (R 4.4.0)
#>  decontam                   1.25.0   2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#>  DelayedArray               0.31.11  2024-08-04 [1] Bioconductor 3.20 (R 4.4.0)
#>  DelayedMatrixStats         1.27.3   2024-08-08 [1] Bioconductor 3.20 (R 4.4.0)
#>  digest                     0.6.37   2024-08-19 [1] RSPM (R 4.4.0)
#>  DirichletMultinomial       1.47.0   2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#>  dplyr                      1.1.4    2023-11-17 [1] RSPM (R 4.4.0)
#>  evaluate                   0.24.0   2024-06-10 [1] RSPM (R 4.4.0)
#>  fansi                      1.0.6    2023-12-08 [1] RSPM (R 4.4.0)
#>  farver                     2.1.2    2024-05-13 [1] RSPM (R 4.4.0)
#>  fastmap                    1.2.0    2024-05-15 [1] RSPM (R 4.4.0)
#>  foreign                    0.8-87   2024-06-26 [2] RSPM (R 4.4.0)
#>  Formula                    1.2-5    2023-02-24 [1] RSPM (R 4.4.0)
#>  fs                         1.6.4    2024-04-25 [1] RSPM (R 4.4.0)
#>  generics                   0.1.3    2022-07-05 [1] RSPM (R 4.4.0)
#>  GenomeInfoDb             * 1.41.1   2024-05-24 [1] Bioconductor 3.20 (R 4.4.0)
#>  GenomeInfoDbData           1.2.12   2024-05-08 [1] Bioconductor
#>  GenomicRanges            * 1.57.1   2024-06-12 [1] Bioconductor 3.20 (R 4.4.0)
#>  ggbeeswarm                 0.7.2    2023-04-29 [1] RSPM (R 4.4.0)
#>  ggfun                      0.1.6    2024-08-28 [1] RSPM (R 4.4.0)
#>  ggplot2                    3.5.1    2024-04-23 [1] RSPM (R 4.4.0)
#>  ggplotify                  0.1.2    2023-08-09 [1] RSPM (R 4.4.0)
#>  ggrepel                    0.9.6    2024-09-07 [1] RSPM (R 4.4.0)
#>  ggtree                     3.13.1   2024-08-18 [1] Bioconductor 3.20 (R 4.4.0)
#>  glue                       1.7.0    2024-01-09 [1] RSPM (R 4.4.0)
#>  gridExtra                  2.3      2017-09-09 [1] RSPM (R 4.4.0)
#>  gridGraphics               0.5-1    2020-12-13 [1] RSPM (R 4.4.0)
#>  gtable                     0.3.5    2024-04-22 [1] RSPM (R 4.4.0)
#>  Hmisc                      5.1-3    2024-05-28 [1] RSPM (R 4.4.0)
#>  htmlTable                  2.4.3    2024-07-21 [1] RSPM (R 4.4.0)
#>  htmltools                  0.5.8.1  2024-04-04 [1] RSPM (R 4.4.0)
#>  htmlwidgets                1.6.4    2023-12-06 [1] RSPM (R 4.4.0)
#>  httr                       1.4.7    2023-08-15 [1] RSPM (R 4.4.0)
#>  igraph                     2.0.3    2024-03-13 [1] RSPM (R 4.4.0)
#>  IRanges                  * 2.39.2   2024-07-17 [1] Bioconductor 3.20 (R 4.4.0)
#>  irlba                      2.3.5.1  2022-10-03 [1] RSPM (R 4.4.0)
#>  jsonlite                   1.8.8    2023-12-04 [1] RSPM (R 4.4.0)
#>  knitr                      1.48     2024-07-07 [1] RSPM (R 4.4.0)
#>  labeling                   0.4.3    2023-08-29 [1] RSPM (R 4.4.0)
#>  lattice                    0.22-6   2024-03-20 [2] CRAN (R 4.4.0)
#>  lazyeval                   0.2.2    2019-03-15 [1] RSPM (R 4.4.0)
#>  lefser                   * 1.15.8   2024-09-18 [1] Bioconductor
#>  libcoin                    1.0-10   2023-09-27 [1] RSPM (R 4.4.0)
#>  lifecycle                  1.0.4    2023-11-07 [1] RSPM (R 4.4.0)
#>  lme4                       1.1-35.5 2024-07-03 [1] RSPM (R 4.4.0)
#>  lpSolve                    5.6.20   2023-12-10 [1] RSPM (R 4.4.0)
#>  magrittr                   2.0.3    2022-03-30 [1] RSPM (R 4.4.0)
#>  MASS                       7.3-61   2024-06-13 [2] RSPM (R 4.4.0)
#>  Matrix                     1.7-0    2024-03-22 [2] CRAN (R 4.4.0)
#>  MatrixGenerics           * 1.17.0   2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#>  matrixStats              * 1.4.1    2024-09-08 [1] RSPM (R 4.4.0)
#>  mediation                  4.5.0    2019-10-08 [1] RSPM (R 4.4.0)
#>  mgcv                       1.9-1    2023-12-21 [2] CRAN (R 4.4.0)
#>  mia                        1.13.36  2024-08-28 [1] Bioconductor 3.20 (R 4.4.0)
#>  minqa                      1.2.8    2024-08-17 [1] RSPM (R 4.4.0)
#>  modeltools                 0.2-23   2020-03-05 [1] RSPM (R 4.4.0)
#>  multcomp                   1.4-26   2024-07-18 [1] RSPM (R 4.4.0)
#>  MultiAssayExperiment       1.31.5   2024-08-23 [1] Bioconductor 3.20 (R 4.4.0)
#>  munsell                    0.5.1    2024-04-01 [1] RSPM (R 4.4.0)
#>  mvtnorm                    1.3-1    2024-09-03 [1] RSPM (R 4.4.0)
#>  nlme                       3.1-166  2024-08-14 [1] RSPM (R 4.4.0)
#>  nloptr                     2.1.1    2024-06-25 [1] RSPM (R 4.4.0)
#>  nnet                       7.3-19   2023-05-03 [2] CRAN (R 4.4.0)
#>  patchwork                  1.2.0    2024-01-08 [1] RSPM (R 4.4.0)
#>  permute                    0.9-7    2022-01-27 [1] RSPM (R 4.4.0)
#>  pillar                     1.9.0    2023-03-22 [1] RSPM (R 4.4.0)
#>  pkgconfig                  2.0.3    2019-09-22 [1] RSPM (R 4.4.0)
#>  plyr                       1.8.9    2023-10-02 [1] RSPM (R 4.4.0)
#>  purrr                      1.0.2    2023-08-10 [1] RSPM (R 4.4.0)
#>  R6                         2.5.1    2021-08-19 [1] RSPM (R 4.4.0)
#>  rbiom                      1.0.3    2021-11-05 [1] RSPM (R 4.4.0)
#>  Rcpp                       1.0.13   2024-07-17 [1] RSPM (R 4.4.0)
#>  RcppParallel               5.1.9    2024-08-19 [1] RSPM (R 4.4.0)
#>  reprex                     2.1.1    2024-07-06 [1] CRAN (R 4.4.0)
#>  reshape2                   1.4.4    2020-04-09 [1] RSPM (R 4.4.0)
#>  rlang                      1.1.4    2024-06-04 [1] RSPM (R 4.4.0)
#>  rmarkdown                  2.28     2024-08-17 [1] RSPM (R 4.4.0)
#>  rpart                      4.1.23   2023-12-05 [2] CRAN (R 4.4.0)
#>  rstudioapi                 0.16.0   2024-03-24 [1] RSPM (R 4.4.0)
#>  rsvd                       1.0.5    2021-04-16 [1] RSPM (R 4.4.0)
#>  S4Arrays                   1.5.7    2024-08-06 [1] Bioconductor 3.20 (R 4.4.0)
#>  S4Vectors                * 0.43.2   2024-07-17 [1] Bioconductor 3.20 (R 4.4.0)
#>  sandwich                   3.1-0    2023-12-11 [1] RSPM (R 4.4.0)
#>  ScaledMatrix               1.13.0   2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#>  scales                     1.3.0    2023-11-28 [1] RSPM (R 4.4.0)
#>  scater                     1.33.4   2024-07-21 [1] Bioconductor 3.20 (R 4.4.0)
#>  scuttle                    1.15.4   2024-08-14 [1] Bioconductor 3.20 (R 4.4.0)
#>  sessioninfo                1.2.2    2021-12-06 [1] RSPM (R 4.4.0)
#>  SingleCellExperiment       1.27.2   2024-05-24 [1] Bioconductor 3.20 (R 4.4.0)
#>  slam                       0.1-53   2024-09-02 [1] RSPM (R 4.4.0)
#>  SparseArray                1.5.33   2024-09-10 [1] Bioconductor 3.20 (R 4.4.0)
#>  sparseMatrixStats          1.17.2   2024-06-12 [1] Bioconductor 3.20 (R 4.4.0)
#>  stringi                    1.8.4    2024-05-06 [1] RSPM (R 4.4.0)
#>  stringr                    1.5.1    2023-11-14 [1] RSPM (R 4.4.0)
#>  SummarizedExperiment     * 1.35.1   2024-06-28 [1] Bioconductor 3.20 (R 4.4.0)
#>  survival                   3.7-0    2024-06-05 [2] RSPM (R 4.4.0)
#>  testthat                   3.2.1.1  2024-04-14 [1] RSPM (R 4.4.0)
#>  TH.data                    1.1-2    2023-04-17 [1] RSPM (R 4.4.0)
#>  tibble                     3.2.1    2023-03-20 [1] RSPM (R 4.4.0)
#>  tidyr                      1.3.1    2024-01-24 [1] RSPM (R 4.4.0)
#>  tidyselect                 1.2.1    2024-03-11 [1] RSPM (R 4.4.0)
#>  tidytree                   0.4.6    2023-12-12 [1] RSPM (R 4.4.0)
#>  treeio                     1.29.1   2024-08-18 [1] Bioconductor 3.20 (R 4.4.0)
#>  TreeSummarizedExperiment   2.13.0   2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#>  UCSC.utils                 1.1.0    2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#>  utf8                       1.2.4    2023-10-22 [1] RSPM (R 4.4.0)
#>  vctrs                      0.6.5    2023-12-01 [1] RSPM (R 4.4.0)
#>  vegan                      2.6-8    2024-08-28 [1] RSPM (R 4.4.0)
#>  vipor                      0.4.7    2023-12-18 [1] RSPM (R 4.4.0)
#>  viridis                    0.6.5    2024-01-29 [1] RSPM (R 4.4.0)
#>  viridisLite                0.4.2    2023-05-02 [1] RSPM (R 4.4.0)
#>  withr                      3.0.1    2024-07-31 [1] RSPM (R 4.4.0)
#>  xfun                       0.47     2024-08-17 [1] RSPM (R 4.4.0)
#>  XVector                    0.45.0   2024-05-01 [1] Bioconductor 3.20 (R 4.4.0)
#>  yaml                       2.3.10   2024-07-26 [1] RSPM (R 4.4.0)
#>  yulab.utils                0.1.7    2024-08-26 [1] RSPM (R 4.4.0)
#>  zlibbioc                   1.51.1   2024-06-05 [1] Bioconductor 3.20 (R 4.4.0)
#>  zoo                        1.8-12   2023-04-13 [1] RSPM (R 4.4.0)
#> 
#>  [1] /home/user/R/x86_64-pc-linux-gnu-library/4.4
#>  [2] /home/user/apps/R-4.4.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Created on 2024-09-17 with reprex v2.1.1

@sdgamboa sdgamboa mentioned this pull request Sep 18, 2024
2 tasks
@sdgamboa sdgamboa marked this pull request as ready for review September 18, 2024 03:40
@shbrief
Copy link
Contributor

shbrief commented Sep 18, 2024

I have a couple questions:

  1. Why does the lefserAllRanks function return this message, #> Warning in lefser(relab = x, ...): Convert counts to relative abundances with #> 'relativeAb()'? Input is already in a relative abundance form...
  2. Is there a reason that we should keep both lefser and lefserAllRanks functions? Does LEfSe implement corresponding functions separately?

@LiNk-NY
Copy link
Contributor

LiNk-NY commented Sep 18, 2024

  1. Why does the lefserAllRanks function return this message, #> Warning in lefser(relab = x, ...): Convert counts to relative abundances with #> 'relativeAb()'? Input is already in a relative abundance form...

It is tripping this conditional code:

lefser/R/lefser.R

Lines 248 to 251 in 0ca4811

if (checkAbundances && !identical(all.equal(colSums(relab_data),
rep(1e6, ncol(relab_data)),
check.attributes = FALSE),
TRUE)) {

@sdgamboa
Copy link
Contributor Author

sdgamboa commented Sep 18, 2024

@shbrief,

Why does the lefserAllRanks function return this message, #> Warning in lefser(relab = x, ...): Convert counts to relative abundances with #> 'relativeAb()'? Input is already in a relative abundance form...

lefserAllRanks uses the mia::splitbyRanks function to agglomerate relative abundances and then runs lefser at each taxonomic level. When this happens the total sum of the columns ranges from 9800 to 10000 (or something like that). I guess this is what triggers the warning.

Is there a reason that we should keep both lefser and lefserAllRanks functions? Does LEfSe implement corresponding functions separately?

lefserAllRanks uses lefser. I think lefser is great if the user wants to know which of the input features are differentially abundant. However, if the user wants to know which clades are DA, they would need lefserAllRanks to get order, class, phylum, etc. This is what the original lefse-conda software does and it's implemented in the microbiomeMarker package.

@shbrief
Copy link
Contributor

shbrief commented Sep 19, 2024

@sdgamboa Thanks for explaining!

I was wondering whether it's better/worth combining lefser and lefserAllRanks with some conditional argument because lefserAllRanks already have lefser inside. But it sounds like keeping them separately might be cleaner.

About the warning message on relativeAb, should we silence it if relative abundance is provided correctly to the lefserAllRanks function, but lefser inside lefserAllRanks received non-relative-abundance table because of the split at the different rank? I think it's unfair/confusing to return a warning message when there is no action to take.

@sdgamboa
Copy link
Contributor Author

I could silence it. I was wondering, though, if the warning is really needed (or if it could be modified). Relative abundance could be set to 100 instead of 1e6 and still be relative abundance (I think this is actually more common than 1e6), but the user would still get a warning.

@shbrief
Copy link
Contributor

shbrief commented Sep 19, 2024

@sdgamboa That makes sense. Should we update the code to check whether the input is relative abundance at any scale? Here is the quick LLM-generated checker. :)

check_relative_abundance <- function(data, tolerance = 1e-5) {
  # Check if all values are non-negative
  if (any(data < 0)) {
    return(FALSE)
  }
  
  # Calculate column sums
  col_sums <- colSums(data)
  
  # Check if all column sums are approximately equal
  mean_sum <- mean(col_sums)
  if (all(abs(col_sums - mean_sum) / mean_sum < tolerance)) {
    # Calculate the proportion of each column
    proportions <- apply(data, 2, function(x) x / sum(x))
    
    # Check if the proportions for each column sum to approximately 1
    prop_sums <- colSums(proportions)
    if (all(abs(prop_sums - 1) < tolerance)) {
      return(TRUE)
    }
  }
  
  return(FALSE)
}


## Add taxonomic information to rowData
## This step is necessary for mia to work
.rowNames2RowData <- function(x) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this part breaks if the row name of the input se is not the concatenation of each clade with the prefix (e.g., gingival dataset which has rowData properly populated already but rowName is in otu).

Copy link
Contributor Author

@sdgamboa sdgamboa Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. That's the case. That's how zelle14 is formatted (taxonomy in rownames). I think we should encourage only one data format (with the taxonomy in the rowData), but we would need either an updated version of zeller14 (cMD3) or use a different dataset.

Copy link
Contributor Author

@sdgamboa sdgamboa Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that having zeller14 ready for analysis already takes two extra steps:

  1. Getting terminal nodes.
  2. Calculating relative abundance.

A third would be incorporating the taxonomy into the rowData.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, having the taxonomy in the rowData would imply that all features are of the same level. I think there has been some discussion about it: #41 (comment)

Copy link
Contributor

@shbrief shbrief Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does lefserAllRanks require inputs with only the terminal nodes? If that's the case, can we just add get_terminal_node inside the lefserAllRanks?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants