Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create functions for converting ensembl ids to gene symbols #12

Merged
merged 13 commits into from
Nov 21, 2024

Conversation

jashapiro
Copy link
Member

Closes #2

Here I am adding two functions for gene id to symbol conversion. One function converts an arbitrary list of Ensembl ids, and the other resets the row indexes of the SCE to gene symbols.

I considered making these more flexible, either with bidirectional conversion or arbitrary column names, but I decided to keep these limited in scope; neither one is particularly complex at this point, and I think if complexity is needed, the user may well to need to write a custom function anyway.

The one thing I have not yet added is gene symbol deduplication. I do want to do that in a future PR, but for now I want to keep this straightforward. Deduplication may require more options, and I think is better tackled as a separate task.

One thing to note here is that I added a data file for testing: this is actually the test file for SCPCL000001_processed.rds, so random data, but with fairly complete structure. I left the other tests using random data, but it might be worth using the same test data there too?

Let me know if you see other places where there may be useful options that I didn't consider, but keep in mind that I want to keep this as simple as possible, if not simpler.

@jashapiro jashapiro requested a review from sjspielman November 21, 2024 01:40
@sjspielman
Copy link
Member

The one thing I have not yet added is gene symbol deduplication. I do want to do that in a future PR, but for now I want to keep this straightforward. Deduplication may require more options, and I think is better tackled as a separate task.

This will be important to implement, although definitely in a separate PR, because otherwise this is going to have sort of limited utility given that the main use case we see it for is when working with Seurat objects -

> s <- SeuratObject::CreateSeuratObject(counts(converted_sce), assay = "RNA")
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
Error in validObject(.Object) : 
  invalid class “LogMap” object: Duplicate rownames not allowed

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine to me for a simple first implementation of the functionality, and tests cover probably as much as we need them to cover.

I left some small comments, the biggest one being the need for an S4Vectors import, and docs will need to be rebuilt too. And, I noted in my other comment, I am in favor of implementing de-duplication sooner rather than later!

"`ensembl_ids` must be a character vector." = is.character(ensembl_ids),
"`sce` must contain both a `gene_ids` and `gene_symbol` column in the row data." =
all(c("gene_ids", "gene_symbol") %in% names(rowData(sce))),
"`leave_na` must be TRUE or FALSE." = is.logical(leave_na)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will pass if leave_na is NA, but seems unlikely that will actually happen, so this is probably fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is fine because the default is false and NA evaluates to false in an if statement.


missing_symbols <- is.na(gene_symbols)
if (!leave_na && any(missing_symbols)) {
warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.")
warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input Ensembl ids for missing values.")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The input id may not actually be an Ensembl id...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, we might change the warning to not use the ensembl_ids variable name?

Not all input ids are present in sce `gene_ids`, using input ids for missing values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if we change the argument name, which I don't want to do because it tells people the kind of ids that are expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't die on this hill, but this confuses me even more...

  • we tell them ensembl ids are "expected" via argument/variable names
  • they don't have to provide ensembl ids since the code is flexible enough
  • if they don't provide ensembl ids, the error will tell them there's a "problem" with their ensembl ids. If I were in this camp, the error message would confuse me!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are telling them there is a problem with the values they gave to the ensembl_ids argument. It isn't saying that there was a necessarily problem with the ids, but that there may not be a gene symbol for the id.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is actually a problem with the error message, now that I read it. How about:

Suggested change
warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.")
warning("Not all `ensembl_ids` values have corresponding gene symbols, using input ids for missing values.")

Copy link
Member

@sjspielman sjspielman Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok ok ok - since the argument is called ensembl_ids to begin with (that's not only an internal variable name is the thing I glanced glossed(?) over!), they know what they've done. ✅

R/convert-gene-ids.R Outdated Show resolved Hide resolved
#' @return A SingleCellExperiment object with row names set as gene symbols.
#' @export
#'
#' @import SingleCellExperiment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' @import SingleCellExperiment
#' @import SingleCellExperiment
#' @import S4Vectors

While the tests for this pass, I was unable to run this function locally:

Error: 'metadata' is not an exported object from 'namespace:SingleCellExperiment'

Turns out this comes from S4Vectors!

Copy link
Member Author

@jashapiro jashapiro Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how you are doing the local tests, but it should pass if you load SingleCellExperiment, or if if you actually load the package. The following works in a fresh R session:

> library(rOpenScPCA)
> sce <- readRDS(testthat::test_path("data", "scpca_sce.rds"))
> sce2 <- sce_to_symbols(sce)

As does:

> sce <- readRDS(testthat::test_path("data", "scpca_sce.rds"))
> sce2 <- rOpenScPCA::sce_to_symbols(sce)

Copy link
Member

@sjspielman sjspielman Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I loaded with devtools::load_all(). The docs suggest that load_all should actually load up more than library, so the fact that this doesn't suffice surprises/confuses me:

Whereas loadNamespace() and library() only load package dependencies when they are needed, load_all() loads all packages referenced in Imports at load time.

It works as expected when explicitly load library(SingleCellExperiment).

In a separate fresh R session without any sneaky renv business, I still got the error:

> remotes::install_github("AlexsLemonade/rOpenScPCA@fe08c631cf61437710b5bac8155978c5bf9a2bea")

######### I quit and restarted R here ##########

> library(rOpenScPCA)
> sce <- readRDS("~/ALSF/open-scpca/rOpenScPCA/tests/testthat/data/scpca_sce.rds")
> converted_sce <- sce_to_symbols(sce)
Error in metadata(sce) : could not find function "metadata"
In addition: Warning message:
In sce_to_symbols(sce) :
  Not all rows have gene symbols, using ensembl ids for missing values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More context:

> library(rOpenScPCA)
> sce <- readRDS("~/ALSF/open-scpca/rOpenScPCA/tests/testthat/data/scpca_sce.rds")
> converted_sce <- sce_to_symbols(sce)
Error in metadata(sce) : could not find function "metadata"
In addition: Warning message:
In sce_to_symbols(sce) :
  Not all rows have gene symbols, using ensembl ids for missing values.
>
>
>
> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS 15.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rOpenScPCA_0.1.0

loaded via a namespace (and not attached):
 [1] GenomeInfoDb_1.40.1         R6_2.5.1                   
 [3] SparseArray_1.4.8           zlibbioc_1.50.0            
 [5] Matrix_1.7-1                lattice_0.22-6             
 [7] MatrixGenerics_1.16.0       matrixStats_1.4.1          
 [9] abind_1.4-8                 GenomeInfoDbData_1.2.12    
[11] S4Arrays_1.4.1              XVector_0.44.0             
[13] Biobase_2.64.0              BiocGenerics_0.50.0        
[15] UCSC.utils_1.0.0            stats4_4.4.0               
[17] GenomicRanges_1.56.2        IRanges_2.38.1             
[19] grid_4.4.0                  DelayedArray_0.30.1        
[21] compiler_4.4.0              httr_1.4.7                 
[23] tools_4.4.0                 crayon_1.5.3               
[25] jsonlite_1.8.9              S4Vectors_0.42.1           
[27] SummarizedExperiment_1.34.0 SingleCellExperiment_1.28.1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems to indicate that S4Vectors is loaded. I'm not sure what is going wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming that I see the same error in a fresh install.

Copy link
Member

@sjspielman sjspielman Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So tests are probably only passing because:

suppressPackageStartupMessages(library(SingleCellExperiment))

suppressPackageStartupMessages(library(SingleCellExperiment))

suppressPackageStartupMessages(library(SingleCellExperiment))

...My bad! We should definitely not explicitly load it, because we apparently miss other (confusing but real) errors.

#' gene_symbols
#' ### [1] "TP53" "MYCN"
#' }
ensembl_to_symbol <- function(ensembl_ids, sce, leave_na = FALSE) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that there's already a bit of a level of abstraction here, is there a reason to not pass in a gene_symbol vector here rather than the full sce? I may be answering my own question - I suspect to confirm that the source of the gene symbols is our SCE object as we have mapped symbols and ids already?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't want people to have to extract anything.

Copy link
Member Author

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review. I am not sure why you got the error you did: if the tests pass it means that loading the package will work as expected.

"`ensembl_ids` must be a character vector." = is.character(ensembl_ids),
"`sce` must contain both a `gene_ids` and `gene_symbol` column in the row data." =
all(c("gene_ids", "gene_symbol") %in% names(rowData(sce))),
"`leave_na` must be TRUE or FALSE." = is.logical(leave_na)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is fine because the default is false and NA evaluates to false in an if statement.

#' gene_symbols
#' ### [1] "TP53" "MYCN"
#' }
ensembl_to_symbol <- function(ensembl_ids, sce, leave_na = FALSE) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't want people to have to extract anything.


missing_symbols <- is.na(gene_symbols)
if (!leave_na && any(missing_symbols)) {
warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The input id may not actually be an Ensembl id...

#' @return A SingleCellExperiment object with row names set as gene symbols.
#' @export
#'
#' @import SingleCellExperiment
Copy link
Member Author

@jashapiro jashapiro Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how you are doing the local tests, but it should pass if you load SingleCellExperiment, or if if you actually load the package. The following works in a fresh R session:

> library(rOpenScPCA)
> sce <- readRDS(testthat::test_path("data", "scpca_sce.rds"))
> sce2 <- sce_to_symbols(sce)

As does:

> sce <- readRDS(testthat::test_path("data", "scpca_sce.rds"))
> sce2 <- rOpenScPCA::sce_to_symbols(sce)

@jashapiro
Copy link
Member Author

S4vectors now explicitly called out. Still don't understand why that one is needed when usually importing SingleCellExperiment is sufficient, but I'm not going to worry about it.

@jashapiro jashapiro requested a review from sjspielman November 21, 2024 16:08
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed the (weird) error is gone! Onward 🚀

@jashapiro jashapiro merged commit 758edb4 into main Nov 21, 2024
2 checks passed
@jashapiro jashapiro deleted the jashapiro/2-gene-symbols branch November 21, 2024 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create functions for converting ensembl ids to gene symbolss
2 participants