Hello,
I am using the Predict.py script to generate a gene expression matrix for a cohort. However, the resulting matrix contains some NA values.
Based on my understanding of the code, these NAs likely occur when SNPs required by a gene's prediction model are completely missing from the genotype data for all samples in my cohort, making it impossible to calculate the expression value for that gene.
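To check whether this explanation fits, one could look at how the NAs are distributed per gene: a gene whose model SNPs are absent from the genotype data should be NA for every sample, not just some. Below is a minimal pandas sketch of that check. It assumes the predicted expression has been loaded into a DataFrame with one row per sample and one column per gene; the toy data and gene names are made up for illustration and are not actual Predict.py output.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the predicted expression matrix (samples x genes).
# Gene names and values are hypothetical.
expr = pd.DataFrame(
    {
        "GENE_A": [0.12, -0.40, 0.33],
        "GENE_B": [np.nan, np.nan, np.nan],  # NA for every sample
        "GENE_C": [0.05, np.nan, -0.21],     # NA for only some samples
    },
    index=["S1", "S2", "S3"],
)

# Fraction of samples with NA, computed per gene (column).
na_fraction = expr.isna().mean()

# Genes that are NA for all samples (consistent with missing model SNPs).
all_na_genes = na_fraction[na_fraction == 1.0].index.tolist()

# Genes that are NA for only part of the cohort (would need another explanation).
partial_na_genes = na_fraction[(na_fraction > 0) & (na_fraction < 1)].index.tolist()

print("NA in all samples:", all_na_genes)
print("NA in some samples:", partial_na_genes)
```

If every affected gene lands in the first list, that would be consistent with the missing-SNP explanation; any gene in the second list would suggest a different cause.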
My central question is: What is the recommended best practice for handling these NA values in the predicted expression matrix for downstream analyses (e.g., association studies)?
Specifically, I am weighing two common approaches and would appreciate guidance:
1. Imputation with 0 (or another value): Replacing NA with zero, implicitly assuming that the missing prediction is equivalent to no expression or a baseline level.
2. Deletion: Removing the entire column (gene) that contains any NA values. This is simple but results in loss of data.
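For concreteness, here is a minimal pandas sketch of what I mean by the two options. It assumes a samples-by-genes DataFrame; the toy values and gene names are hypothetical, and this is only an illustration of the two operations, not a recommendation of either.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the predicted expression matrix (samples x genes).
# Gene names and values are hypothetical.
expr = pd.DataFrame(
    {
        "GENE_A": [0.12, -0.40, 0.33],
        "GENE_B": [np.nan, np.nan, np.nan],  # no prediction for any sample
    },
    index=["S1", "S2", "S3"],
)

# Option 1: imputation, replacing every NA with 0.
imputed = expr.fillna(0.0)

# Option 2: deletion, dropping any gene (column) that contains an NA.
dropped = expr.dropna(axis=1, how="any")

print(imputed)
print(dropped)
```

One thing I notice with option 1: a gene that is NA for all samples becomes a constant column of zeros, so it has zero variance and cannot contribute anything to an association test anyway.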