-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdata_preprocessing.Rmd
83 lines (66 loc) · 3.11 KB
/
data_preprocessing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
---
title: "Ensemble Boolean Model Data Preprocessing"
author: "[John Zobolas](https://github.com/bblodfon)"
output:
html_document:
theme: united
---
## Intro
The purpose of this R markdown document is to extract from one or multiple
**models** directories the models' **stable states** and **boolean link
operators** to proper R data structures and save them for later use. This data
extraction can take some time and it is better to have the data in tab-delimited
files, ready to be loaded as matrices before the analysis is done for each cell
line.
## Data Structure and Output
The structure of the data and input to this document should be a
single directory (`input.dir`). This directory must have
one or many `cell line` directories, each one having:
- A **models** directory (the models produced by the `Gitsbe` module in
`.gitsbe` format).
- A **model_predictions** file
- An **observed_synergies** file
- Maybe a **training_data** or **steady_state** file!
The output will be two more files in each cell line directory: **models_stable_state**
and **models_equations** or **models_link_operator**.
**Note** that if instead of a `models` directory a compressed file `models.tar.gz` is there, you will have to run the command `tar xzvf models.tar.gz`.
## Data extraction
Necessary libraries:
```{r Load libraries, message = FALSE}
library(usefun)
library(emba)
```
For input, we provide the relative path to the data dir:
```{r Input}
#input.dir = "/atopo/cell-lines-2500"
input.dir = "/cascade/cell-lines-2500"
data.dir = paste0(getwd(), input.dir)
cell.line.dirs = list.dirs(path = data.dir, recursive = FALSE)
```
Note that below R code does not subset the models to the **unique** ones
(those that have the exactly the same boolean equations, i.e. same link
operators in our models). Such a processing should be done (if it is desirable)
when the models' data is loaded in a seperate analysis document.
```{r Data extraction and saving, eval=FALSE}
for (cell.line.dir in cell.line.dirs) {
# cell line name same as the directory name
cell.line = basename(cell.line.dir)
print(paste0("Saving files for ", cell.line, " cell line..."))
models.dir = paste0(cell.line.dir, "/models")
model.predictions.file = paste0(cell.line.dir, "/model_predictions")
models.stable.state.file = paste0(cell.line.dir, "/models_stable_state")
#models.equations.file = paste0(cell.line.dir, "/models_equations")
models.equations.file = paste0(cell.line.dir, "/models_link_operator")
if (file.exists(model.predictions.file)) {
# for having the row names in the same order as in the 'model_predictions' file
model.predictions = get_model_predictions(model.predictions.file)
models = rownames(model.predictions)
models.stable.state = get_stable_state_from_models_dir(models.dir)
models.stable.state = models.stable.state[models,]
save_mat_to_file(mat = models.stable.state, file = models.stable.state.file)
models.equations = get_link_operators_from_models_dir(models.dir)
models.equations = models.equations[models,]
save_mat_to_file(mat = models.equations, file = models.equations.file)
}
}
```