This project exemplifies the use of Dask for processing larger-than-memory datasets. Datasets of this size cannot be opened in Excel at all, and are slow to process with Pandas. With Dask, which provides multi-core execution, it takes less than one minute to aggregate and filter data containing millions of rows and to write the results to parquet files: all on a single laptop. The official Dask tutorial can be found here: official-dask.
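As a minimal sketch of this pattern (the file and column names here are illustrative assumptions, not the project's actual ones):

```python
import dask.dataframe as dd

# Lazily open a larger-than-memory dataset; only a task graph is built here.
df = dd.read_csv("/data/original/measurements_*.csv")

# Filter and aggregate; partitions are processed in parallel on all cores.
result = (
    df[df["intensity"] > 0]
    .groupby("peptide_id")["intensity"]
    .mean()
    .to_frame()
)

# Nothing is actually computed until the results are written to parquet.
result.to_parquet("/data/curated/mean_intensity/")
```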
The goal of this project is to identify peptide biomarkers related to kidney dysfunction in 10000 patients with different levels of kidney function. There are several steps:
- Extract and filter the peptidomic data from the internal relational database
- Transform the peptidomic data into a pivot table (peptides * patients)
- Normalise the peptidomic data
- Regress the peptide intensity against eGFR, a measure of kidney function, while correcting for confounders (age, sex, history of disease)
- Analyse the distribution of the regression errors
We will use the following Python libraries:
- dask.dataframe to process the large peptidomic data
- Pandas for basic data analysis and cleaning
- statsmodels for regression analysis of peptides vs eGFR
- Matplotlib and Seaborn for plotting
All data curated from the relational database are stored under the /data/original/ directory.
There are more than 25000 measurable peptides in a single patient sample, so the total peptidomic dataset is large (from a few to tens of gigabytes, i.e. larger than memory). It is too large to be handled efficiently by Pandas, which motivates the use of Dask.
A screenshot of the first few lines:
The clinical data contain clinical parameters such as age, sex, and eGFR for the patients, which I queried from the relational database. They need to be cleaned to satisfy the inclusion criteria.
All Python code is stored under the /python/ directory.
All results generated by the code are stored under the /data/curated/ directory.
- Filter the patients so that every patient has age, sex, and eGFR available
- Curate the clinical parameters so that they are within a physiological range (e.g. age > 1000 should be removed)
- IDA (initial data analysis) to look at the missingness of all variables
- Check the normality of age, eGFR, and acr with visualisation
- Check whether the incidence of diseases is balanced between males and females
- IDA to look at the distribution of uCreat and acr per study group
- Visualise the distributions in histograms (a sketch of these cleaning and IDA steps follows this list)
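A minimal sketch of the cleaning and IDA steps with Pandas. The file name, the `disease` column, and the plausibility cut-offs are assumptions for illustration:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

clin = pd.read_csv("/data/original/clinical.csv")  # assumed file name

# Inclusion criteria: age, sex and eGFR must all be available.
clin = clin.dropna(subset=["age", "sex", "eGFR"])

# Keep only physiologically plausible values (cut-offs are assumptions).
clin = clin[clin["age"].between(0, 120) & clin["eGFR"].between(0, 200)]

# IDA: fraction of missing values per variable.
print(clin.isna().mean().sort_values(ascending=False))

# IDA: visual normality check for age, eGFR and acr.
for col in ["age", "eGFR", "acr"]:
    sns.histplot(clin[col].dropna(), kde=True)
    plt.savefig(f"/figures/ida_{col}.png")
    plt.clf()

# IDA: is disease incidence balanced between males and females?
print(pd.crosstab(clin["sex"], clin["disease"], normalize="index"))
```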
- A pipeline to filter:
  1) patients recorded in the clinical data (as we cleaned above)
  2) sequenced peptides from 1)
  3) peptides which pass a certain frequency threshold from 2); the default is to appear in at least 50% of patients
- Export the filtered results to parquet files (see the sketch below)
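A sketch of this filter pipeline, assuming the peptidomic data are stored in long format (one row per patient-peptide measurement); the file paths and column names are illustrative:

```python
import dask.dataframe as dd
import pandas as pd

pep = dd.read_parquet("/data/original/peptides.parquet")  # assumed long format
clin = pd.read_csv("/data/curated/clinical_clean.csv")    # cleaned clinical data

# 1) Keep only patients present in the cleaned clinical data.
patients = clin["patient_id"].unique().tolist()
pep = pep[pep["patient_id"].isin(patients)]

# 2)/3) Keep peptides detected in at least 50% of the remaining patients.
counts = pep.groupby("peptide_id")["patient_id"].nunique().compute()
frequent = counts[counts >= 0.5 * len(patients)].index.tolist()
pep = pep[pep["peptide_id"].isin(frequent)]

# Export the filtered long-format table to parquet.
pep.to_parquet("/data/curated/peptides_filtered/")
```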
- A pipeline to transform the parquet files to pivot format (which is easier for regression)
- Normalise the data with three options: logarithmic, rank-normalisation, or no transformation
- Export the results to csv (see the sketch below)
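A sketch of the pivot-and-normalise step; the column names are assumptions, and the rank option is implemented here as a rank-based inverse-normal transform:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Long format -> pivot table: peptides as rows, patients as columns.
long_df = pd.read_parquet("/data/curated/peptides_filtered/")
pivot = long_df.pivot_table(index="peptide_id", columns="patient_id",
                            values="intensity")

def normalise(df, method="log"):
    """Three options: 'log', 'rank' (rank-based inverse normal), or 'none'."""
    if method == "log":
        return np.log(df)              # assumes strictly positive intensities
    if method == "rank":
        ranks = df.rank(axis=1)        # per-peptide ranks across patients
        n = df.notna().sum(axis=1)
        q = (ranks - 0.375).div(n + 0.25, axis=0)  # Blom offsets
        return pd.DataFrame(norm.ppf(q), index=df.index, columns=df.columns)
    return df                          # 'none': leave the data as-is

normalise(pivot, method="log").to_csv("/data/curated/pivot_log.csv")
```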
- Join the peptidomic data (containing peptide intensity) and the clinical data (containing eGFR)
- Regress the peptide intensity against eGFR, corrected for age, sex, and diseases
- Adjust the significance level for multiple testing according to the FDR
- Store the regression results (coefficients, adjusted p-values, etc.) in a csv (see the sketch below)
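A sketch of the per-peptide regression with statsmodels; the joined file name, the `pep_` column prefix, and the covariate names are assumptions:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

# One row per patient: peptide ln-intensities joined with clinical covariates.
data = pd.read_csv("/data/curated/joined.csv")
peptide_cols = [c for c in data.columns if c.startswith("pep_")]  # assumed naming

rows = []
for pep in peptide_cols:
    # Q() quotes column names that are not valid Python identifiers.
    fit = smf.ols(f"Q('{pep}') ~ eGFR + age + C(sex) + C(disease)", data=data).fit()
    rows.append({"peptide": pep,
                 "coef": fit.params["eGFR"],
                 "pvalue": fit.pvalues["eGFR"]})

res = pd.DataFrame(rows)
# Benjamini-Hochberg adjustment controls the FDR across all peptides tested.
res["p_adj"] = multipletests(res["pvalue"], method="fdr_bh")[1]
res.to_csv("/data/curated/regression_results.csv", index=False)
```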
- Check the assumptions of OLS with diagnostic plots (see the sketch below), namely:
- whether the errors are normally distributed (top)
- whether the expected value of the errors is 0 (middle)
- whether the errors have the same variance (bottom)
The plots are stored under the /figures/ directory.
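A sketch of the three stacked diagnostic plots for one fitted model (`fit` as returned by the statsmodels OLS above):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

def ols_diagnostics(fit, path):
    """Stacked checks on the OLS errors: normality, zero mean, equal variance."""
    resid, fitted = fit.resid, fit.fittedvalues
    fig, axes = plt.subplots(3, 1, figsize=(6, 12))

    # Top: QQ plot -- are the errors normally distributed?
    sm.qqplot(resid, line="45", fit=True, ax=axes[0])

    # Middle: residuals vs fitted -- is the expected error 0 everywhere?
    axes[1].scatter(fitted, resid, s=5)
    axes[1].axhline(0, color="red")

    # Bottom: scale-location -- do the errors have the same variance?
    axes[2].scatter(fitted, np.sqrt(np.abs(resid / resid.std())), s=5)

    fig.savefig(path)
    plt.close(fig)
```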
For the most significant peptides, their ln-intensities across four eGFR groups were plotted in box plots. Here, 0 = no kidney function loss and 3 = most severe kidney function loss.
For example, the intensity of peptide 1 is found to drop as eGFR drops, while keeping age, sex, and disease constant:
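A sketch of how such a box plot might be produced; the eGFR cut-offs and column names are assumptions:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("/data/curated/joined.csv")  # assumed joined table from above

# Bin eGFR into four groups: 0 = no kidney function loss ... 3 = most severe.
data["egfr_group"] = pd.cut(data["eGFR"], bins=[0, 30, 60, 90, np.inf],
                            labels=[3, 2, 1, 0])

sns.boxplot(x="egfr_group", y="pep_1", data=data, order=[0, 1, 2, 3])
plt.xlabel("eGFR group (0 = no loss, 3 = most severe)")
plt.ylabel("ln-intensity of peptide 1")
plt.savefig("/figures/peptide1_boxplot.png")
```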
- We could identify X peptides which are significantly correlated with kidney function.
- The significant peptides will be confirmed at the transcription level and further examined with pathway analysis.