-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create functions for converting ensembl ids to gene symbols #12
Conversation
This will be important to implement, although definitely in a separate PR, because otherwise this is going to have sort of limited utility given that the main use case we see it for is when working with Seurat objects -
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine to me for a simple first implementation of the functionality, and tests cover probably as much as we need them to cover.
I left some small comments, the biggest one being the need for an S4Vectors import, and docs will need to be rebuilt too. And, I noted in my other comment, I am in favor of implementing de-duplication sooner rather than later!
"`ensembl_ids` must be a character vector." = is.character(ensembl_ids), | ||
"`sce` must contain both a `gene_ids` and `gene_symbol` column in the row data." = | ||
all(c("gene_ids", "gene_symbol") %in% names(rowData(sce))), | ||
"`leave_na` must be TRUE or FALSE." = is.logical(leave_na) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will pass if leave_na
is NA
, but seems unlikely that will actually happen, so this is probably fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fine because the default is false and NA evaluates to false in an if
statement.
R/convert-gene-ids.R
Outdated
|
||
missing_symbols <- is.na(gene_symbols) | ||
if (!leave_na && any(missing_symbols)) { | ||
warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.") | |
warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input Ensembl ids for missing values.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The input id may not actually be an Ensembl id...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, we might change the warning to not use the ensembl_ids
variable name?
Not all input ids are present in sce `gene_ids`, using input ids for missing values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only if we change the argument name, which I don't want to do because it tells people the kind of ids that are expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I won't die on this hill, but this confuses me even more...
- we tell them ensembl ids are "expected" via argument/variable names
- they don't have to provide ensembl ids since the code is flexible enough
- if they don't provide ensembl ids, the error will tell them there's a "problem" with their ensembl ids. If I were in this camp, the error message would confuse me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are telling them there is a problem with the values they gave to the ensembl_ids
argument. It isn't saying that there was a necessarily problem with the ids, but that there may not be a gene symbol for the id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is actually a problem with the error message, now that I read it. How about:
warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.") | |
warning("Not all `ensembl_ids` values have corresponding gene symbols, using input ids for missing values.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok ok ok - since the argument is called ensembl_ids
to begin with (that's not only an internal variable name is the thing I glanced glossed(?) over!), they know what they've done. ✅
#' @return A SingleCellExperiment object with row names set as gene symbols. | ||
#' @export | ||
#' | ||
#' @import SingleCellExperiment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#' @import SingleCellExperiment | |
#' @import SingleCellExperiment | |
#' @import S4Vectors |
While the tests for this pass, I was unable to run this function locally:
Error: 'metadata' is not an exported object from 'namespace:SingleCellExperiment'
Turns out this comes from S4Vectors
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how you are doing the local tests, but it should pass if you load SingleCellExperiment
, or if if you actually load the package. The following works in a fresh R session:
> library(rOpenScPCA)
> sce <- readRDS(testthat::test_path("data", "scpca_sce.rds"))
> sce2 <- sce_to_symbols(sce)
As does:
> sce <- readRDS(testthat::test_path("data", "scpca_sce.rds"))
> sce2 <- rOpenScPCA::sce_to_symbols(sce)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I loaded with devtools::load_all()
. The docs suggest that load_all
should actually load up more than library
, so the fact that this doesn't suffice surprises/confuses me:
Whereas loadNamespace() and library() only load package dependencies when they are needed, load_all() loads all packages referenced in Imports at load time.
It works as expected when explicitly load library(SingleCellExperiment)
.
In a separate fresh R session without any sneaky renv business, I still got the error:
> remotes::install_github("AlexsLemonade/rOpenScPCA@fe08c631cf61437710b5bac8155978c5bf9a2bea")
######### I quit and restarted R here ##########
> library(rOpenScPCA)
> sce <- readRDS("~/ALSF/open-scpca/rOpenScPCA/tests/testthat/data/scpca_sce.rds")
> converted_sce <- sce_to_symbols(sce)
Error in metadata(sce) : could not find function "metadata"
In addition: Warning message:
In sce_to_symbols(sce) :
Not all rows have gene symbols, using ensembl ids for missing values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More context:
> library(rOpenScPCA)
> sce <- readRDS("~/ALSF/open-scpca/rOpenScPCA/tests/testthat/data/scpca_sce.rds")
> converted_sce <- sce_to_symbols(sce)
Error in metadata(sce) : could not find function "metadata"
In addition: Warning message:
In sce_to_symbols(sce) :
Not all rows have gene symbols, using ensembl ids for missing values.
>
>
>
> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS 15.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rOpenScPCA_0.1.0
loaded via a namespace (and not attached):
[1] GenomeInfoDb_1.40.1 R6_2.5.1
[3] SparseArray_1.4.8 zlibbioc_1.50.0
[5] Matrix_1.7-1 lattice_0.22-6
[7] MatrixGenerics_1.16.0 matrixStats_1.4.1
[9] abind_1.4-8 GenomeInfoDbData_1.2.12
[11] S4Arrays_1.4.1 XVector_0.44.0
[13] Biobase_2.64.0 BiocGenerics_0.50.0
[15] UCSC.utils_1.0.0 stats4_4.4.0
[17] GenomicRanges_1.56.2 IRanges_2.38.1
[19] grid_4.4.0 DelayedArray_0.30.1
[21] compiler_4.4.0 httr_1.4.7
[23] tools_4.4.0 crayon_1.5.3
[25] jsonlite_1.8.9 S4Vectors_0.42.1
[27] SummarizedExperiment_1.34.0 SingleCellExperiment_1.28.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems to indicate that S4Vectors
is loaded. I'm not sure what is going wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirming that I see the same error in a fresh install.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So tests are probably only passing because:
suppressPackageStartupMessages(library(SingleCellExperiment)) |
suppressPackageStartupMessages(library(SingleCellExperiment)) |
suppressPackageStartupMessages(library(SingleCellExperiment)) |
...My bad! We should definitely not explicitly load it, because we apparently miss other (confusing but real) errors.
#' gene_symbols | ||
#' ### [1] "TP53" "MYCN" | ||
#' } | ||
ensembl_to_symbol <- function(ensembl_ids, sce, leave_na = FALSE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that there's already a bit of a level of abstraction here, is there a reason to not pass in a gene_symbol vector here rather than the full sce? I may be answering my own question - I suspect to confirm that the source of the gene symbols is our SCE object as we have mapped symbols and ids already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I don't want people to have to extract anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your review. I am not sure why you got the error you did: if the tests pass it means that loading the package will work as expected.
"`ensembl_ids` must be a character vector." = is.character(ensembl_ids), | ||
"`sce` must contain both a `gene_ids` and `gene_symbol` column in the row data." = | ||
all(c("gene_ids", "gene_symbol") %in% names(rowData(sce))), | ||
"`leave_na` must be TRUE or FALSE." = is.logical(leave_na) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fine because the default is false and NA evaluates to false in an if
statement.
#' gene_symbols | ||
#' ### [1] "TP53" "MYCN" | ||
#' } | ||
ensembl_to_symbol <- function(ensembl_ids, sce, leave_na = FALSE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I don't want people to have to extract anything.
R/convert-gene-ids.R
Outdated
|
||
missing_symbols <- is.na(gene_symbols) | ||
if (!leave_na && any(missing_symbols)) { | ||
warning("Not all `ensembl_ids` are present in sce `gene_ids`, using input ids for missing values.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The input id may not actually be an Ensembl id...
#' @return A SingleCellExperiment object with row names set as gene symbols. | ||
#' @export | ||
#' | ||
#' @import SingleCellExperiment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how you are doing the local tests, but it should pass if you load SingleCellExperiment
, or if if you actually load the package. The following works in a fresh R session:
> library(rOpenScPCA)
> sce <- readRDS(testthat::test_path("data", "scpca_sce.rds"))
> sce2 <- sce_to_symbols(sce)
As does:
> sce <- readRDS(testthat::test_path("data", "scpca_sce.rds"))
> sce2 <- rOpenScPCA::sce_to_symbols(sce)
Co-authored-by: Stephanie Spielman <stephanie.spielman@gmail.com>
for more information, see https://pre-commit.ci
So odd that this seems to be only sometimes, and passing tests
S4vectors now explicitly called out. Still don't understand why that one is needed when usually importing SingleCellExperiment is sufficient, but I'm not going to worry about it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed the (weird) error is gone! Onward 🚀
Closes #2
Here I am adding two functions for gene id to symbol conversion. One function converts an arbitrary list of Ensembl ids, and the other resets the row indexes of the SCE to gene symbols.
I considered making these more flexible, either with bidirectional conversion or arbitrary column names, but I decided to keep these limited in scope; neither one is particularly complex at this point, and I think if complexity is needed, the user may well to need to write a custom function anyway.
The one thing I have not yet added is gene symbol deduplication. I do want to do that in a future PR, but for now I want to keep this straightforward. Deduplication may require more options, and I think is better tackled as a separate task.
One thing to note here is that I added a data file for testing: this is actually the test file for
SCPCL000001_processed.rds
, so random data, but with fairly complete structure. I left the other tests using random data, but it might be worth using the same test data there too?Let me know if you see other places where there may be useful options that I didn't consider, but keep in mind that I want to keep this as simple as possible, if not simpler.