Viral data.ipynb
You are looking at NA values. Plot 54 is great: it shows that the NAs are not randomly distributed over the genome but occur at specific locations.
Additionally, you could check the missing frequency (or non-missing frequency) per individual. You already do the mutation rate calculation for variants and individuals, so just do the same with the NAs. This is useful because we want to exclude individuals that have too many NAs (an indication of bad-quality sequencing).
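A minimal sketch of the per-individual check, assuming the viral data sit in a pandas DataFrame with one row per individual and one column per variant, and that missing calls are stored as NaN. The toy frame and its names (viral, ind_1, pos_1, ...) are made up for illustration and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are individuals, columns are variants,
# NaN marks a missing call.
viral = pd.DataFrame(
    {
        "pos_1": [0, 1, np.nan, 0],
        "pos_2": [1, np.nan, np.nan, 0],
        "pos_3": [0, 0, np.nan, 1],
    },
    index=["ind_1", "ind_2", "ind_3", "ind_4"],
)

# Fraction of missing calls per individual (across variants).
missing_per_individual = viral.isna().mean(axis=1)
# Fraction of missing calls per variant (across individuals).
missing_per_variant = viral.isna().mean(axis=0)

print(missing_per_individual.sort_values(ascending=False))
print(missing_per_variant.sort_values(ascending=False))
```

Sorting the per-individual rates puts the worst-sequenced samples at the top, which makes it easy to decide on an exclusion cut-off.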
Clinical data.ipynb
One detail: in df['GT'].value_counts() there are two levels called "Mixed genotype detected" and "Mixed". It seems they should be coded the same. Can you use a common level for any downstream analysis?
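One possible way to merge the two levels, sketched on a made-up stand-in for the clinical table. Only the column name GT and the two level names come from the notebook; the other genotype values are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for the clinical table; only the GT column matters here.
df = pd.DataFrame({"GT": ["A", "Mixed", "D", "Mixed genotype detected", "A"]})

# Map both spellings to one common level; other genotypes are left unchanged.
df["GT"] = df["GT"].replace({"Mixed genotype detected": "Mixed"})

print(df["GT"].value_counts())
```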
Of all the variables in the clinical dataset, keep only the following for further analysis: IGM_ID (host id), gilead_id (pathogen_id), GT, COUNTRY, ETHNICITY, RACE, SEX, AGE, OAV_EXPERIENCE, BASELINE_HBVDNA_IU/mL, BASELINE_HBVDNA_Dil_IU/mL, BASELINE_HBEAG_STATUS.
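The subsetting could look roughly like this. The column names are the ones listed above; the one-row frame and the extra column SOME_OTHER_COLUMN are hypothetical and only there to make the sketch runnable.

```python
import pandas as pd

keep = [
    "IGM_ID", "gilead_id", "GT", "COUNTRY", "ETHNICITY", "RACE", "SEX", "AGE",
    "OAV_EXPERIENCE", "BASELINE_HBVDNA_IU/mL", "BASELINE_HBVDNA_Dil_IU/mL",
    "BASELINE_HBEAG_STATUS",
]

# Hypothetical clinical table: the twelve columns plus one extra.
df = pd.DataFrame({name: [None] for name in keep + ["SOME_OTHER_COLUMN"]})

# Fail loudly if an expected column is absent, then subset.
missing = [name for name in keep if name not in df.columns]
if missing:
    raise KeyError(f"columns not found: {missing}")
clinical = df[keep].copy()

print(clinical.columns.tolist())
```

Checking for absent columns first gives a clearer error than a bare indexing failure if a name is misspelled in the source file.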
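Plink introduction.ipynb
You can start exploring the association analysis done in PLINK: https://www.cog-genomics.org/plink/1.9/assoc.
PLINK offers both simpler and more sophisticated models. We will start with a simple model (one outcome, one predictor: y ~ x) and then extend this model with covariates.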
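To make the two steps concrete, here is a plain least-squares sketch in NumPy. It is not PLINK's implementation. The data are simulated, and the names x (an allele count), age (a covariate) and fit_ols are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data (made up): x is an allele count in {0, 1, 2}, age is a covariate.
x = rng.integers(0, 3, size=n).astype(float)
age = rng.normal(50, 10, size=n)
y = 1.0 + 0.5 * x + 0.02 * age + rng.normal(0, 1, size=n)


def fit_ols(columns, outcome):
    """Least-squares coefficients for outcome ~ 1 + columns."""
    design = np.column_stack([np.ones(len(outcome))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


simple = fit_ols([x], y)         # y ~ x
adjusted = fit_ols([x, age], y)  # y ~ x + age

print("y ~ x       (intercept, beta_x):          ", simple)
print("y ~ x + age (intercept, beta_x, beta_age):", adjusted)
```

The second fit only adds a column to the design matrix, which is all that "extending the model with covariates" means at this level.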
You will also need to do QC on the genotype data side. Keywords here are missing genotypes, missing individuals, minor allele frequency, and Hardy-Weinberg equilibrium.
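As a rough illustration of the last two keywords, the sketch below computes the minor allele frequency and a Hardy-Weinberg p-value for one biallelic variant from its genotype counts. The function name maf_and_hwe and the counts are made up. The chi-square approximation with one degree of freedom is a simplification chosen here, and exact tests are often preferred when counts are small.

```python
import math


def maf_and_hwe(n_aa, n_ab, n_bb):
    """Minor allele frequency and chi-square HWE p-value (1 df) from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    maf = min(p, q)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return maf, p_value


# Hypothetical counts that match HWE proportions for p = 0.6.
maf, p_value = maf_and_hwe(n_aa=360, n_ab=480, n_bb=160)
print(maf, p_value)
```

PLINK 1.9 provides filters along these lines (--geno, --mind, --maf, --hwe), so in practice the thresholds can be applied there rather than by hand.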
General remarks
Happy to see that there is pandas-plink.
It is good practice to avoid spaces in file names.
Out of curiosity: what Python version are you using?
I now use the same code for "Mixed genotype detected" and "Mixed".
I kept only the variables you specified.
Viral data
I plotted the missing rates per individual and per variant. I dropped the variants with a missing rate over 15%, which removes 168 variants (out of 5393). I didn't drop any individuals, since the worst case has about 15% missing values.
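For reference, a variant filter of this kind can be sketched as follows. The toy matrix and the names viral, threshold and kept are hypothetical; only the 15% cut-off comes from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: rows are individuals, columns are variants.
viral = pd.DataFrame(
    {
        "pos_1": [0, 1, 0, 1, 0],
        "pos_2": [1, np.nan, np.nan, 0, 1],  # 40% missing
        "pos_3": [0, 0, 1, 1, 0],
    }
)

threshold = 0.15  # the 15% cut-off mentioned above
variant_missing = viral.isna().mean(axis=0)

# Keep variants at or below the threshold and report how many were dropped.
kept = viral.loc[:, variant_missing <= threshold]
n_dropped = viral.shape[1] - kept.shape[1]
print(f"dropped {n_dropped} of {viral.shape[1]} variants")
```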