git clone --branch main https://github.com/biosustain/pankb_db.git pankb_db
cd pankb_db
```

## 1. ETL scripts

The ETL (Extract-Transform-Load) scripts:
-1) extracts information about pangenomes from the Microsoft Azure Blob Storage *.json files. The storage serves as the data lake;
-2) transforms it into the Django- and MongoDB-compatible model;
-3) loads the transformed data into a MongoDB database instance.
+1) extract information about pangenomes from the Microsoft Azure Blob Storage *.json files. The storage serves as the data lake;
+2) transform it into the Django- and MongoDB-compatible model;
+3) load the transformed data into a MongoDB database instance;
+4) (optionally) upload the logs needed for statistics and quality control to the Azure Blob Storage after the pipeline scripts are executed.
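
The steps listed above can be pictured as a small pipeline. The following is a minimal sketch only, not the repository's actual code: the function names, the record fields (`species`, `genomes`, `genome_count`), and the in-memory stand-ins for the Blob Storage and the MongoDB collection are all hypothetical.

```python
import json

# Hypothetical stand-in for the data lake: blob name -> JSON text.
FAKE_BLOBS = {
    "species_a/summary.json": '{"species": "Species A", "genomes": 12}',
    "species_b/summary.json": '{"species": "Species B", "genomes": 7}',
}


def extract(blobs):
    """Step 1: read and parse every *.json blob (sorted by name)."""
    return [json.loads(text) for name, text in sorted(blobs.items())
            if name.endswith(".json")]


def transform(records):
    """Step 2: reshape raw records into documents for the target collection."""
    return [{"name": r["species"], "genome_count": int(r["genomes"])}
            for r in records]


def load(documents, collection):
    """Step 3: append documents to the target (a plain list stands in for MongoDB)."""
    collection.extend(documents)
    return len(documents)


def run_etl(blobs, collection, log=None):
    """Run the steps in order; step 4 (log upload) is modelled as an optional list."""
    inserted = load(transform(extract(blobs)), collection)
    if log is not None:
        log.append(f"inserted {inserted} documents")
    return inserted


collection, log = [], []
run_etl(FAKE_BLOBS, collection, log)
```

Keeping each step as a separate function mirrors the extract/transform/load split and makes it possible to check the intermediate result of every step on its own.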

Initially, the database tables are created by the Django web framework, on which the PanKB website is built. This is achieved by setting the parameter `managed = True` in the `models.py` files.

@@ -38,24 +39,35 @@ The python packages versions to be installed can be found in the `requirements.t
```
pip install -r requirements.txt
```
+or
+```
+pip3 install -r requirements.txt
+```

### 1.2. Execute the ETL Scripts
Before executing any scripts, create a `.env` file in the `/etl` subfolder. To populate a self-deployed MongoDB instance, use the following content:
```
## Do not put this file under version control!

-MONGODB_NAME = 'pankb' # the db name
-MONGO_INITDB_ROOT_USERNAME = 'allDbAdmin' # the db admin name
-MONGO_INITDB_ROOT_PASSWORD = '<any password you choose>' # the db admin pass
+# The MongoDB database name
+MONGODB_NAME = 'pankb'
+
+# The MongoDB root username
+MONGO_INITDB_ROOT_USERNAME = 'allDbAdmin'
+
+# The MongoDB root password
+MONGO_INITDB_ROOT_PASSWORD = '<any password you choose>'

## Azure Blob Storage Connection String
-BLOB_STORAGE_CONN_STRING = '<copy the Azure Blob Storage connection string from the Azure web portal>'
+BLOB_STORAGE_CONN_STRING = '<copy the Azure Blob Storage connection string from the Azure web portal>'
```
Or, to populate a cloud-based Azure CosmosDB for MongoDB instance:
```
## Do not put this file under version control!

-MONGODB_NAME = 'pankb' # the db name
+# The MongoDB database name
+MONGODB_NAME = 'pankb'
+
## MongoDB-PROD (Azure CosmosDB for MongoDB) Connection String
MONGODB_CONN_STRING = '<copy the Azure CosmosDB for MongoDB connection string from the Azure web portal>'

@@ -70,17 +82,24 @@ Then, edit the included `etl/config.py` file setting the following parameters:
Finally, the ETL scripts must be executed in the following order:
1. `organisms.py`
-2. `gene_annotations.py`
+2. `gene_annotations.py`
3. `gene_info.py`
4. `genome_info.py`
5. `pathway_info.py`
+
```
python3 <insert the respective script name here>
```
The scripts are deliberately not joined into a single pipeline because, in practice, it is more convenient to run them one by one. This allows:
- quality control after each step;
- monitoring that storage and RAM do not run out on the DEV server and that the CPUs on both the DEV and PROD servers are not overloaded (via the "Metrics" section of the Azure Portal or with the help of a remote IDE, e.g., PyCharm).
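
For readers who still want some automation while keeping the manual checkpoints described above, here is a minimal sketch of a helper that runs the scripts in the documented order and pauses after each one. It is an illustration under assumptions, not part of the repository: `run_in_order`, `run_with_python`, and `ask_operator` are hypothetical names, and the scripts are assumed to signal failure through a non-zero exit code.

```python
import subprocess
import sys

# Script order as documented above.
ETL_SCRIPTS = [
    "organisms.py",
    "gene_annotations.py",
    "gene_info.py",
    "genome_info.py",
    "pathway_info.py",
]


def run_in_order(scripts, run, confirm):
    """Run each script; stop on the first failure or when the operator declines.

    `run(script)` must return an exit code; `confirm(script)` returns True
    to continue. Returns the list of scripts that completed successfully.
    """
    done = []
    for script in scripts:
        if run(script) != 0:
            break
        done.append(script)
        if not confirm(script):
            break
    return done


def run_with_python(script):
    """Run one script with the current interpreter and return its exit code."""
    return subprocess.run([sys.executable, script]).returncode


def ask_operator(script):
    """Pause for manual quality control before the next script."""
    answer = input(f"{script} finished. Continue? [y/N] ")
    return answer.strip().lower() == "y"
```

Calling `run_in_order(ETL_SCRIPTS, run_with_python, ask_operator)` from the folder that contains the scripts would execute them one at a time and wait for confirmation after each step, which leaves room for quality control and resource monitoring.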

+A good practice is to clean up unnecessary Docker images and containers and then restart the Docker daemon with the following commands:
+```
+docker system prune
+sudo systemctl restart docker
+```
+
## 2. Self-deployed MongoDB

### 2.1. Development configuration on Ubuntu servers
@@ -98,11 +117,18 @@ Create a file with the name ".env" under the /projects/pankb_web/pankb_db/mongod
```
## Do not put this file under version control!

-## MongoDB: Docker Compose Env Variables
-MONGO_INITDB_ROOT_USERNAME = 'allDbAdmin' # also the remote CosmosDB admin username: DbAdmin
-MONGO_INITDB_ROOT_PASSWORD = '<any password you choose>' # also the remote CosmosDB admin pass
+# The MongoDB root username
+MONGO_INITDB_ROOT_USERNAME = 'allDbAdmin'
+
+# The MongoDB root password
+MONGO_INITDB_ROOT_PASSWORD = '<any password you choose>'
+
+# The MongoDB database admin username
MONGODB_USERNAME = 'pankbDbOwner'
+
+# The MongoDB database admin password
MONGODB_PASSWORD = '<any password you choose>'
+
MONGODB_AUTH_SOURCE = 'pankb'
```
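
To illustrate how the three `MONGODB_*` values in the file above fit together, here is a minimal sketch that parses `.env`-style text and assembles a MongoDB connection URI. It is an assumption-laden illustration, not the project's code: `parse_env` and `mongo_uri` are hypothetical helpers, the host `localhost` and port `27017` are assumed defaults, and the sample password is made up. The parser expects comments on their own lines and does not handle inline comments.

```python
from urllib.parse import quote_plus


def parse_env(text):
    """Parse KEY = 'value' lines; blank lines and lines starting with # are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def mongo_uri(env, host="localhost", port=27017):
    """Build a connection URI for the database user; credentials are percent-encoded."""
    user = quote_plus(env["MONGODB_USERNAME"])
    password = quote_plus(env["MONGODB_PASSWORD"])
    auth_source = quote_plus(env["MONGODB_AUTH_SOURCE"])
    return f"mongodb://{user}:{password}@{host}:{port}/?authSource={auth_source}"


# Made-up sample content; never hard-code real credentials.
SAMPLE = """
## Do not put this file under version control!
MONGODB_USERNAME = 'pankbDbOwner'
MONGODB_PASSWORD = 'p@ss/word'
MONGODB_AUTH_SOURCE = 'pankb'
"""

uri = mongo_uri(parse_env(SAMPLE))
```

Percent-encoding the credentials matters because characters such as `@` or `/` in a password would otherwise be read as URI delimiters.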
Change to the appropriate folder and build the containers with Docker Compose: