Hi,
I’m working on a large single-cell RNA-seq dataset consisting of approximately 200,000 cells and four distinct batches. I’ve been using kBET to evaluate batch correction after integration, but I consistently observe very high rejection rates (close to 1) and p-values near zero, regardless of whether I use cell type annotations or integration-derived clusters as the label vector.
Given the size of my dataset, I suspect this may be related to the sensitivity of kBET to large sample sizes, as noted in [Issue #80].
I’d really appreciate your guidance on the following:
- Are there recommended parameter settings (e.g., adjusting k0, or other options) for running kBET on very large datasets?
- Would you recommend downsampling the dataset to a smaller subset of cells (e.g., 10k or 20k)? If so, is there an optimal sample size that maintains statistical power without inflating the rejection rate?
Thanks in advance for any help or clarification you can provide!