Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using SCT counts with runEscape? #138

Closed
alexwskh opened this issue Jan 16, 2025 · 9 comments
Closed

Using SCT counts with runEscape? #138

alexwskh opened this issue Jan 16, 2025 · 9 comments

Comments

@alexwskh
Copy link

alexwskh commented Jan 16, 2025

Hi there, and if the developer of this package happens to read this, thank you for developing this tool.

I have an integrated SCTransformed Seurat object from five separate rnaseq experiments. My understanding is that if I have run PrepSCTFindMarkers, the counts layer of the SCT assay should be adjusted for the integrated model. I want to runEscape on my Seurat object, and it works-but it is defaulting to the RNA assay counts instead of the SCT assay.

Is there a way I can tell it which assay to use? Thank you.

@ncborcherding
Copy link
Member

Hey @alexwskh,

As of now, runEscape() only uses RNA counts. This is due to the different methodologies requiring raw counts. You can bypass this using escape.matrix() using the SCT assay directly and then adding the results as a new assay.

If you are playing around with it, I would love to get more info on SCT vs RNA count results that you see. It is an unknown in the field and might spark future parameters to allow users to select assays.

Thanks again,
Nick

@alexwskh
Copy link
Author

alexwskh commented Jan 16, 2025

Hey @ncborcherding , thank you so much for the quick reply. I have some quick results here to chat about and a question if you have time.

Just as a bit of background, each individual scRNAseq run has had its counts adjusted by decontX in an attempt to compensate for ambient RNA, and doubletfinder was used to identify possible doublets for removal. Objects were then SCTransformed and integrated via Seurat. This is actually a subset of the main dataset, but that shouldn't matter.

I have two signatures I'm interested in. Our lab studies neuroblastoma, and we have a Mesenchymal(MES) and Adrenergic(ADRN) signatures. I created assays from both the RNA assay counts and the SCT counts, and normalized according to the number of features.

RNA assay, non-normalized
Image
SCT assay, non-normalized
Image

So, not really any major differences, however it gets weird when I normalize with performNormalization.

RNA assay, normalized
Image
SCT assay, normalized
Image

I'm unsure why the normalization is so different for the SCT counts. I'm guessing it's something about the number of features after SCT scaling?

Also, here's a question for you. I don't want to bias myself and cherry pick things, and the non-normalized expression score looks better. The enrichment of the mesenchymal score in the top clusters is exactly what I want to see, and you lose a lot of signal intensity when normalizing.

So my question is, does normalizing by number of features detected not artificially deflate scores in cells that are more transcriptionally complex? From my understanding the ssGSEA scoring is taking into consideration the size of the gene set and the number of genes... If mesenchymal cells happen to be doing more things biologically than adrenergic cells and require a broader range of transcripts, they will have a lower normalized score...

Given that the SCT counts should be corrected for sequencing depth across conditions, do you think I am able to work with the expression score metric directly instead of normalizing? Or is that an incorrect assumption?

@ncborcherding
Copy link
Member

@alexwskh

This is an excellent rundown!! Thank you so much for taking the time.

For the normalization, are you using performNormalization() (this is the normalization method in runEscape() and escape.matrix()) with a scale.factor? Or using the default settings? The default setting will calculate the presence of genes within a gene set across each cell and use that as a denominator for normalization.

So my question is, does normalizing by number of features detected not artificially deflate scores in cells that are more transcriptionally complex? From my understanding the ssGSEA scoring is taking into consideration the size of the gene set and the number of genes... If mesenchymal cells happen to be doing more things biologically than adrenergic cells and require a broader range of transcripts, they will have a lower normalized score...

This is a real possibility, normalizing by features may be a blunt instrument in very complex cells. There is no problem with reporting non-normalized enrichment values, just like unequal distribution of RNA within these cells.

But to clarify, the ssGSEA implementation in escape reports raw enrichment score unless you normalize, so there is no two levels of normalization.

Given that the SCT counts should be corrected for sequencing depth across conditions, do you think I am able to work with the expression score metric directly instead of normalizing? Or is that an incorrect assumption?

Did you use PrepSCTFindMarkers() to update the counts for SCT? That will apply the correction to the SCT count slot. If so, it might make sense to use it to reduce noise. But you will need to check about the features that are carried over - both for the sake of your enrichment scores and if you end up normalizing.

Interesting stuff!!

Nick

@alexwskh
Copy link
Author

alexwskh commented Jan 20, 2025

@ncborcherding Happy to help when you're spending the time developing tools like this.

For the normalization, are you using performNormalization() (this is the normalization method in runEscape() and escape.matrix()) with a scale.factor? Or using the default settings? The default setting will calculate the presence of genes within a gene set across each cell and use that as a denominator for normalization.

The pictures I showed you were normalized with performNormalization using the nFeature_RNA parameter (Pretty sure, I'll confirm... I was playing with my script a lot!). It is also a bit strange when I normalize with the default settings (or vice versa). So to confirm, setting normalize = TRUE for escape.matrix is equivalent to running performNoramlization on default settings? Default would be log1p (enrichment score / #genes from gene set that are present), versus log1p (enrichment score / #total detected genes in each sample) ?

Did you use PrepSCTFindMarkers() to update the counts for SCT? That will apply the correction to the SCT count slot. If so, it might make sense to use it to reduce noise. But you will need to check about the features that are carried over - both for the sake of your enrichment scores and if you end up normalizing.

Yeah I did run PrepSCTFindMarkers before extracting the counts, as my understanding is that it strips away the individual SCT scaling for each sample and readjusts as a whole. When you say "it might make sense to use it to reduce noise", do you mean employing normalization if I use SCT? Or that PrepSCTFFindMarkers itself would be good to use?

As for the #features, taking a look at the assays on the seurat object can see the SCT assay has 21562 features while the RNA assay has 25109, so it definitely cuts some as it transforms the counts.

Here are the #counts and #genes for the cells. I feel a bit better seeing that the high-nFeature/nCount cells with the high mesenchymal scores in the top are separate from the high nFeature cells with the higher Adrn score in the bottom right... So it's not just deeper sequencing/more genes detected solely influencing these scores. At least that's my interpretation, what about yours?

Image

@ncborcherding
Copy link
Member

So to confirm, setting normalize = TRUE for escape.matrix is equivalent to running performNoramlization on default settings? Default would be log1p (enrichment score / #genes from gene set that are present), versus log1p (enrichment score / #total detected genes in each sample)

yes the normalize = TRUE will run performNormalization()``` with the default settings. If using a scale.factor` the normalization is (enrichment score / #total genes in each cell) - the default normalization is log1p (enrichment score / # genes detected per gene set)

Yeah I did run PrepSCTFindMarkers before extracting the counts, as my understanding is that it strips away the individual SCT scaling for each sample and readjusts as a whole. When you say "it might make sense to use it to reduce noise", do you mean employing normalization if I use SCT? Or that PrepSCTFFindMarkers itself would be good to use?

My understanding is that until you use PrepSCTFFindMarkers(), the count assay in the SCT slot is unmodified from the raw counts. If you want to take advantage of the SCT normalization, you would have to run ``PrepSCTFFindMarkers()beforeescape.matrix()`

As for the #features, taking a look at the assays on the seurat object can see the SCT assay has 21562 features while the RNA assay has 25109, so it definitely cuts some as it transforms the counts.

Good to know - so there is some internal filtering when running SCT, but it is maintaining like ~85% of the feature space.

@alexwskh
Copy link
Author

Great, thanks for the continued discussion.

My understanding is that until you use PrepSCTFFindMarkers(), the count assay in the SCT slot is unmodified from the raw counts. If you want to take advantage of the SCT normalization, you would have to run ``PrepSCTFFindMarkers()beforeescape.matrix()`

Hmmmm. Here are some relevant entries from the Seurat web pages. It's possible I'm misunderstanding but...

From SCTransform page
"This function calls sctransform::vst. The sctransform package is available at https://github.com/satijalab/sctransform. Use this function as an alternative to the NormalizeData, FindVariableFeatures, ScaleData workflow. Results are saved in a new assay (named SCT by default) with counts being (corrected) counts, data being log1p(counts), scale.data being pearson residuals; sctransform::vst intermediate results are saved in misc slot of new assay."

From PrepSCTFindMarkers page
Given a merged object with multiple SCT models, this function uses minimum of the median UMI (calculated using the raw UMI counts) of individual objects to reverse the individual SCT regression model using minimum of median UMI as the sequencing depth covariate. The counts slot of the SCT assay is replaced with recorrected counts and the data slot is replaced with log1p of recorrected counts.

I took from this that the SCT counts slot is a modified value, and if you want the raw counts you need to go back to the RNA assay and extract counts from there. I can confirm this as when I am trying to extract all cells with transgenic GFP mRNA, if I use the SCT assay counts I lose some cells (ones with very low count value likely). Either way, I did run the PrepSCTFindMarkers before running escape.matrix, so that should be accounted for!

@alexwskh
Copy link
Author

@ncborcherding Just a second question, I am seeing some strange results when I do default normalization on the raw counts from the RNA assay... Any thoughts?

Image

@ncborcherding
Copy link
Member

Hey @alexwskh

What version are you currently using? The default performNormalization() function recently changed as it was creating issues downstream with differential analysis (see issue #131).

Nick

@alexwskh
Copy link
Author

alexwskh commented Feb 3, 2025

@ncborcherding Looks like I am using the correct version (2.2.2)?

other attached packages:
 [1] escape_2.2.2                biomaRt_2.58.2              scDblFinder_1.16.0          SingleCellExperiment_1.24.0
 [5] SummarizedExperiment_1.32.0 Biobase_2.62.0              GenomicRanges_1.54.1        GenomeInfoDb_1.38.8        
 [9] IRanges_2.36.0              S4Vectors_0.40.2            BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
[13] matrixStats_1.5.0           vctrs_0.6.5                 readr_2.1.5                 tablesgg_0.9-1             
[17] readxl_1.4.3                openxlsx_4.2.7.1            enrichR_3.2                 devtools_2.4.5             
[21] usethis_3.1.0               Matrix_1.6-5                spacexr_2.2.1               forstringr_1.0.0           
[25] stringr_1.5.1               ggplot2_3.5.1               hdf5r_1.3.11                patchwork_1.3.0            
[29] dplyr_1.1.4                 Seurat_5.2.0                SeuratObject_5.0.2          sp_2.1-4 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants