Create functions for converting ensembl ids to gene symbols #12

jashapiro · 2024-11-21T01:40:58Z

Closes #2

Here I am adding two functions for gene id to symbol conversion. One function converts an arbitrary list of Ensembl ids, and the other resets the row indexes of the SCE to gene symbols.

I considered making these more flexible, either with bidirectional conversion or arbitrary column names, but I decided to keep these limited in scope; neither one is particularly complex at this point, and I think if complexity is needed, the user may well to need to write a custom function anyway.

The one thing I have not yet added is gene symbol deduplication. I do want to do that in a future PR, but for now I want to keep this straightforward. Deduplication may require more options, and I think is better tackled as a separate task.

One thing to note here is that I added a data file for testing: this is actually the test file for SCPCL000001_processed.rds, so random data, but with fairly complete structure. I left the other tests using random data, but it might be worth using the same test data there too?

Let me know if you see other places where there may be useful options that I didn't consider, but keep in mind that I want to keep this as simple as possible, if not simpler.

sjspielman · 2024-11-21T12:53:54Z

The one thing I have not yet added is gene symbol deduplication. I do want to do that in a future PR, but for now I want to keep this straightforward. Deduplication may require more options, and I think is better tackled as a separate task.

This will be important to implement, although definitely in a separate PR, because otherwise this is going to have sort of limited utility given that the main use case we see it for is when working with Seurat objects -

> s <- SeuratObject::CreateSeuratObject(counts(converted_sce), assay = "RNA")
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
Error in validObject(.Object) : 
  invalid class “LogMap” object: Duplicate rownames not allowed

sjspielman

This seems fine to me for a simple first implementation of the functionality, and tests cover probably as much as we need them to cover.

I left some small comments, the biggest one being the need for an S4Vectors import, and docs will need to be rebuilt too. And, I noted in my other comment, I am in favor of implementing de-duplication sooner rather than later!

sjspielman · 2024-11-21T12:32:24Z

R/convert-gene-ids.R

+    "`ensembl_ids` must be a character vector." = is.character(ensembl_ids),
+    "`sce` must contain both a `gene_ids` and `gene_symbol` column in the row data." =
+      all(c("gene_ids", "gene_symbol") %in% names(rowData(sce))),
+    "`leave_na` must be TRUE or FALSE." = is.logical(leave_na)


This will pass if leave_na is NA, but seems unlikely that will actually happen, so this is probably fine.

That is fine because the default is false and NA evaluates to false in an if statement.

sjspielman · 2024-11-21T12:33:12Z

R/convert-gene-ids.R

+
+  missing_symbols <- is.na(gene_symbols)
+  if (!leave_na && any(missing_symbols)) {
+    warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.")


Suggested change

warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.")

warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input Ensembl ids for missing values.")

The input id may not actually be an Ensembl id...

In this case, we might change the warning to not use the ensembl_ids variable name?

Not all input ids are present in sce `gene_ids`, using input ids for missing values.

Only if we change the argument name, which I don't want to do because it tells people the kind of ids that are expected.

I won't die on this hill, but this confuses me even more...

we tell them ensembl ids are "expected" via argument/variable names

they don't have to provide ensembl ids since the code is flexible enough

if they don't provide ensembl ids, the error will tell them there's a "problem" with their ensembl ids. If I were in this camp, the error message would confuse me!

We are telling them there is a problem with the values they gave to the ensembl_ids argument. It isn't saying that there was a necessarily problem with the ids, but that there may not be a gene symbol for the id.

Which is actually a problem with the error message, now that I read it. How about:

Suggested change

warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.")

warning("Not all `ensembl_ids` values have corresponding gene symbols, using input ids for missing values.")

Ah ok ok ok - since the argument is called ensembl_ids to begin with (that's not only an internal variable name is the thing I ~~glanced~~ glossed(?) over!), they know what they've done. ✅

R/convert-gene-ids.R

sjspielman · 2024-11-21T12:49:49Z

R/convert-gene-ids.R

+#' @return A SingleCellExperiment object with row names set as gene symbols.
+#' @export
+#'
+#' @import SingleCellExperiment


Suggested change

#' @import SingleCellExperiment

#' @import SingleCellExperiment

#' @import S4Vectors

While the tests for this pass, I was unable to run this function locally:

Error: 'metadata' is not an exported object from 'namespace:SingleCellExperiment'

Turns out this comes from S4Vectors!

I'm not sure how you are doing the local tests, but it should pass if you load SingleCellExperiment, or if if you actually load the package. The following works in a fresh R session:

> library(rOpenScPCA) > sce <- readRDS(testthat::test_path("data", "scpca_sce.rds")) > sce2 <- sce_to_symbols(sce)

As does:

> sce <- readRDS(testthat::test_path("data", "scpca_sce.rds")) > sce2 <- rOpenScPCA::sce_to_symbols(sce)

I loaded with devtools::load_all(). The docs suggest that load_all should actually load up more than library, so the fact that this doesn't suffice surprises/confuses me:

Whereas loadNamespace() and library() only load package dependencies when they are needed, load_all() loads all packages referenced in Imports at load time.

It works as expected when explicitly load library(SingleCellExperiment).

In a separate fresh R session without any sneaky renv business, I still got the error:

> remotes::install_github("AlexsLemonade/rOpenScPCA@fe08c631cf61437710b5bac8155978c5bf9a2bea") ######### I quit and restarted R here ########## > library(rOpenScPCA) > sce <- readRDS("~/ALSF/open-scpca/rOpenScPCA/tests/testthat/data/scpca_sce.rds") > converted_sce <- sce_to_symbols(sce) Error in metadata(sce) : could not find function "metadata" In addition: Warning message: In sce_to_symbols(sce) : Not all rows have gene symbols, using ensembl ids for missing values.

More context:

> library(rOpenScPCA) > sce <- readRDS("~/ALSF/open-scpca/rOpenScPCA/tests/testthat/data/scpca_sce.rds") > converted_sce <- sce_to_symbols(sce) Error in metadata(sce) : could not find function "metadata" In addition: Warning message: In sce_to_symbols(sce) : Not all rows have gene symbols, using ensembl ids for missing values. > > > > sessionInfo() R version 4.4.0 (2024-04-24) Platform: aarch64-apple-darwin20 Running under: macOS 15.1 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0 locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 time zone: America/New_York tzcode source: internal attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] rOpenScPCA_0.1.0 loaded via a namespace (and not attached): [1] GenomeInfoDb_1.40.1 R6_2.5.1 [3] SparseArray_1.4.8 zlibbioc_1.50.0 [5] Matrix_1.7-1 lattice_0.22-6 [7] MatrixGenerics_1.16.0 matrixStats_1.4.1 [9] abind_1.4-8 GenomeInfoDbData_1.2.12 [11] S4Arrays_1.4.1 XVector_0.44.0 [13] Biobase_2.64.0 BiocGenerics_0.50.0 [15] UCSC.utils_1.0.0 stats4_4.4.0 [17] GenomicRanges_1.56.2 IRanges_2.38.1 [19] grid_4.4.0 DelayedArray_0.30.1 [21] compiler_4.4.0 httr_1.4.7 [23] tools_4.4.0 crayon_1.5.3 [25] jsonlite_1.8.9 S4Vectors_0.42.1 [27] SummarizedExperiment_1.34.0 SingleCellExperiment_1.28.1

That seems to indicate that S4Vectors is loaded. I'm not sure what is going wrong.

Confirming that I see the same error in a fresh install.

So tests are probably only passing because:

rOpenScPCA/tests/testthat/test-calculate-clusters.R

Line 1 in 9c5472c

suppressPackageStartupMessages(library(SingleCellExperiment))

rOpenScPCA/tests/testthat/test-evaluate-clusters.R

Line 1 in 9c5472c

suppressPackageStartupMessages(library(SingleCellExperiment))

rOpenScPCA/tests/testthat/test-evaluate-clusters.R

Line 1 in 9c5472c

suppressPackageStartupMessages(library(SingleCellExperiment))

...My bad! We should definitely not explicitly load it, because we apparently miss other (confusing but real) errors.

sjspielman · 2024-11-21T12:59:39Z

R/convert-gene-ids.R

+#' gene_symbols
+#' ### [1] "TP53" "MYCN"
+#' }
+ensembl_to_symbol <- function(ensembl_ids, sce, leave_na = FALSE) {


Given that there's already a bit of a level of abstraction here, is there a reason to not pass in a gene_symbol vector here rather than the full sce? I may be answering my own question - I suspect to confirm that the source of the gene symbols is our SCE object as we have mapped symbols and ids already?

Yes, I don't want people to have to extract anything.

jashapiro

Thank you for your review. I am not sure why you got the error you did: if the tests pass it means that loading the package will work as expected.

jashapiro · 2024-11-21T13:30:53Z

R/convert-gene-ids.R

+    "`ensembl_ids` must be a character vector." = is.character(ensembl_ids),
+    "`sce` must contain both a `gene_ids` and `gene_symbol` column in the row data." =
+      all(c("gene_ids", "gene_symbol") %in% names(rowData(sce))),
+    "`leave_na` must be TRUE or FALSE." = is.logical(leave_na)


That is fine because the default is false and NA evaluates to false in an if statement.

jashapiro · 2024-11-21T13:31:42Z

R/convert-gene-ids.R

+#' gene_symbols
+#' ### [1] "TP53" "MYCN"
+#' }
+ensembl_to_symbol <- function(ensembl_ids, sce, leave_na = FALSE) {


Yes, I don't want people to have to extract anything.

jashapiro · 2024-11-21T13:32:39Z

R/convert-gene-ids.R

+
+  missing_symbols <- is.na(gene_symbols)
+  if (!leave_na && any(missing_symbols)) {
+    warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.")


The input id may not actually be an Ensembl id...

jashapiro · 2024-11-21T13:41:42Z

R/convert-gene-ids.R

+#' @return A SingleCellExperiment object with row names set as gene symbols.
+#' @export
+#'
+#' @import SingleCellExperiment


I'm not sure how you are doing the local tests, but it should pass if you load SingleCellExperiment, or if if you actually load the package. The following works in a fresh R session:

> library(rOpenScPCA) > sce <- readRDS(testthat::test_path("data", "scpca_sce.rds")) > sce2 <- sce_to_symbols(sce)

As does:

> sce <- readRDS(testthat::test_path("data", "scpca_sce.rds")) > sce2 <- rOpenScPCA::sce_to_symbols(sce)

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>

for more information, see https://pre-commit.ci

So odd that this seems to be only sometimes, and passing tests

jashapiro · 2024-11-21T16:08:27Z

S4vectors now explicitly called out. Still don't understand why that one is needed when usually importing SingleCellExperiment is sufficient, but I'm not going to worry about it.

sjspielman

Confirmed the (weird) error is gone! Onward 🚀

jashapiro added 4 commits November 20, 2024 13:57

Add initial gene conversion functions

7b06068

Update gene id conversion and add docs

84c8b54

add test file and tests

b762979

Update documentation

322883d

jashapiro requested a review from sjspielman November 21, 2024 01:40

sjspielman reviewed Nov 21, 2024

View reviewed changes

jashapiro added 2 commits November 21, 2024 08:23

remove DE genes from test SCE singler results

1e648cd

compress

3d578a2

jashapiro commented Nov 21, 2024

View reviewed changes

jashapiro and others added 7 commits November 21, 2024 08:43

Update R/convert-gene-ids.R

9a39d19

Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

eaac295

for more information, see https://pre-commit.ci

update documentation

fe08c63

Adding S4Vectors explicitly

d1c698c

So odd that this seems to be only sometimes, and passing tests

remove library loading from tests

b6cbed3

minor warning text updates

67171c5

update warning text again

85d043e

jashapiro requested a review from sjspielman November 21, 2024 16:08

sjspielman approved these changes Nov 21, 2024

View reviewed changes

jashapiro merged commit 758edb4 into main Nov 21, 2024
2 checks passed

jashapiro deleted the jashapiro/2-gene-symbols branch November 21, 2024 17:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create functions for converting ensembl ids to gene symbols #12

Create functions for converting ensembl ids to gene symbols #12

jashapiro commented Nov 21, 2024

sjspielman commented Nov 21, 2024

sjspielman left a comment

sjspielman Nov 21, 2024

jashapiro Nov 21, 2024

sjspielman Nov 21, 2024

jashapiro Nov 21, 2024

sjspielman Nov 21, 2024

jashapiro Nov 21, 2024

sjspielman Nov 21, 2024

jashapiro Nov 21, 2024

jashapiro Nov 21, 2024

sjspielman Nov 21, 2024 •

edited

Loading

sjspielman Nov 21, 2024

jashapiro Nov 21, 2024 •

edited

Loading

sjspielman Nov 21, 2024 •

edited

Loading

sjspielman Nov 21, 2024

jashapiro Nov 21, 2024

jashapiro Nov 21, 2024

sjspielman Nov 21, 2024 •

edited

Loading

sjspielman Nov 21, 2024

jashapiro Nov 21, 2024

jashapiro left a comment

jashapiro Nov 21, 2024

jashapiro Nov 21, 2024

jashapiro Nov 21, 2024

jashapiro Nov 21, 2024 •

edited

Loading

jashapiro commented Nov 21, 2024

sjspielman left a comment

	warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.")
	warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input Ensembl ids for missing values.")

	warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.")
	warning("Not all `ensembl_ids` values have corresponding gene symbols, using input ids for missing values.")

	#' @import SingleCellExperiment
	#' @import SingleCellExperiment
	#' @import S4Vectors

Create functions for converting ensembl ids to gene symbols #12

Create functions for converting ensembl ids to gene symbols #12

Conversation

jashapiro commented Nov 21, 2024

sjspielman commented Nov 21, 2024

sjspielman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sjspielman Nov 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jashapiro Nov 21, 2024 • edited Loading

Choose a reason for hiding this comment

sjspielman Nov 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sjspielman Nov 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jashapiro Nov 21, 2024 • edited Loading

Choose a reason for hiding this comment

jashapiro commented Nov 21, 2024

sjspielman left a comment

Choose a reason for hiding this comment

sjspielman Nov 21, 2024 •

edited

Loading

jashapiro Nov 21, 2024 •

edited

Loading

sjspielman Nov 21, 2024 •

edited

Loading

sjspielman Nov 21, 2024 •

edited

Loading

jashapiro Nov 21, 2024 •

edited

Loading