Commit 311938b

Minor fixes in the README file
1 parent 47c8bbf commit 311938b

1 file changed, +40 -14 lines changed
README.md

Lines changed: 40 additions & 14 deletions
@@ -1,4 +1,4 @@
-# ETL scripts for the PanKB Website Databases & Self-deployed MongoDB
+# The PanKB website DBs: ETL & Self-deployed MongoDB instance
 
 The repo contains:
 - ETL scripts used to populate the PanKB website DEV and PROD databases;
@@ -15,16 +15,17 @@ cd pankb_web
 ```
 Clone the PanKB git repo into the subdirectory /pankb_db and change to it:
 ```
-git clone --branch develop https://github.com/biosustain/pankb_db.git pankb_db
+git clone --branch main https://github.com/biosustain/pankb_db.git pankb_db
 cd pankb_db
 ```
 
 ## 1. ETL scripts
 
 The ETL (Extract-Transform-Load) scripts:
-1) extracts information about pangenomes from the Microsoft Azure Blob Storage *.json files. The storage serves as the data lake;
-2) transforms it into the Django- and MongoDB-compatible model;
-3) loads the transformed data into a MongoDB database instance.
+1) extract information about pangenomes from the Microsoft Azure Blob Storage *.json files. The storage serves as the data lake;
+2) transform it into the Django- and MongoDB-compatible model;
+3) load the transformed data into a MongoDB database instance;
+4) (optionally) upload the logs needed for statistics and quality control to the Azure Blob Storage after the pipeline scripts are executed.
 
 Initially, the database tables are created by the Django web framework, which the PanKB website is built on. It is achieved by setting the parameter `managed = True` in the `models.py` files.
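As a rough illustration of the extract, transform and load steps listed in this hunk, the following is a minimal sketch only. It does not reproduce the repository's scripts: the record fields (`species`, `genomes`, `genome_count`) are hypothetical, a JSON string stands in for a blob from the data lake, and a plain Python list stands in for a MongoDB collection.

```python
import json


def extract(raw_json):
    """Parse one JSON document (a stand-in for a *.json blob from the data lake)."""
    return json.loads(raw_json)


def transform(records):
    """Reshape raw records into flat documents; the field names are hypothetical."""
    return [
        {"species": record["species"].strip(), "genome_count": int(record["genomes"])}
        for record in records
    ]


def load(documents, collection):
    """Append documents to the target (a list standing in for a MongoDB collection)."""
    collection.extend(documents)
    return len(documents)


# Illustrative input only, not real PanKB data.
raw = '[{"species": " Escherichia coli ", "genomes": "3"}]'
target = []
loaded = load(transform(extract(raw)), target)
```

Keeping the three steps as separate functions mirrors the numbered list and makes each one easy to check on its own.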
 
@@ -38,24 +39,35 @@ The python packages versions to be installed can be found in the `requirements.t
 ```
 pip install -r requirements.txt
 ```
+or
+```
+pip3 install -r requirements.txt
+```
 
 ### 1.2. Execute the ETL Scripts
 Before executing any scripts, create the `.env` file under the subfolder `/etl` with the following content in case of populating a self-deployed MongoDB instance:
 ```
 ## Do not put this file under version control!
 
-MONGODB_NAME = 'pankb' # the db name
-MONGO_INITDB_ROOT_USERNAME = 'allDbAdmin' # the db admin name
-MONGO_INITDB_ROOT_PASSWORD = '<any password you choose>' # the db admin pass
+# The MongoDB database name
+MONGODB_NAME = 'pankb'
+
+# The MongoDB root username
+MONGO_INITDB_ROOT_USERNAME = 'allDbAdmin'
+
+# The MongoDB root password
+MONGO_INITDB_ROOT_PASSWORD = '<any password you choose>'
 
 ## Azure Blob Storage Connection String
-BLOB_STORAGE_CONN_STRING = '<copy the Azure Blob Storage connection string from the Azure web portal>'
+BLOB_STORAGE_CONN_STRING = '<copy the Azure Blob Storage connection string from the Azure web portal>'
 ```
 or in case of populating a cloud-based Azure CosmosDB for MongoDB instance:
 ```
 ## Do not put this file under version control!
 
-MONGODB_NAME = 'pankb' # the db name
+# The MongoDB database name
+MONGODB_NAME = 'pankb'
+
 ## MongoDB-PROD (Azure CosmosDB for MongoDB) Connection String
 MONGODB_CONN_STRING = '<copy the Azure CosmosDB for MongoDB connection string from the Azure web portal>'
 
@@ -70,17 +82,24 @@ Then, edit the included `etl/config.py` file setting the following parameters:
 
 Finally, the ETL scripts must be executed in the following order:
 1. `organisms.py`
-2. `gene_annotations.py`
+2. `gene_annotations.py`
 3. `gene_info.py`
 4. `genome_info.py`
 5. `pathway_info.py`
+
 ```
 python3 <insert the respective script name here>
 ```
 The scripts were not joined into one pipeline, because in practice it is more convenient to run them one by one for the sake of:
 - quality control after each step;
 - monitoring that the storage and RAM are not running out on the DEV server and CPUs both on the DEV and PROD servers are not overloaded (via "Metrics" section on the Azure Portal or with the help of a Remote IDE, e.g., PyCharm).
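The README deliberately keeps the scripts separate so that an operator can inspect results between steps. A small helper that preserves that manual checkpoint might look like the sketch below. It is not part of the repository: `run_in_order`, `run_script` and the confirmation prompt are hypothetical names, and the sketch assumes the scripts sit in the current working directory.

```python
import subprocess
import sys

# Execution order as listed in the README.
ETL_SCRIPTS = [
    "organisms.py",
    "gene_annotations.py",
    "gene_info.py",
    "genome_info.py",
    "pathway_info.py",
]


def run_script(script):
    """Run one script with the current interpreter; raises CalledProcessError on failure."""
    subprocess.run([sys.executable, script], check=True)


def run_in_order(scripts, runner=run_script, confirm=input):
    """Run scripts one at a time, pausing after each so the operator can do quality control."""
    completed = []
    for index, script in enumerate(scripts):
        runner(script)
        completed.append(script)
        is_last = index == len(scripts) - 1
        # Stop unless the operator explicitly confirms that the next step may start.
        if not is_last and confirm(f"{script} finished. Continue? [y/N] ").strip().lower() != "y":
            break
    return completed
```

Calling `run_in_order(ETL_SCRIPTS)` from the `etl` folder would then prompt after each script. The `runner` and `confirm` parameters exist only so the loop can be exercised without launching real processes.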
 
+A good practice is to clean up unnecessary Docker images and containers and then restart the Docker daemon with the following commands:
+```
+docker system prune
+sudo systemctl restart docker
+```
+
 ## 2. Self-deployed MongoDB
 
 ### 2.1. Development configuration on Ubuntu servers
@@ -98,11 +117,18 @@ Create a file with the name ".env" under the /projects/pankb_web/pankb_db/mongod
 ```
 ## Do not put this file under version control!
 
-## MongoDB: Docker Compose Env Variables
-MONGO_INITDB_ROOT_USERNAME = 'allDbAdmin' # also the remote CosmosDB admin username: DbAdmin
-MONGO_INITDB_ROOT_PASSWORD = '<any password you choose>' # also the remote CosmosDB admin pass
+# The MongoDB root username
+MONGO_INITDB_ROOT_USERNAME = 'allDbAdmin'
+
+# The MongoDB root password
+MONGO_INITDB_ROOT_PASSWORD = '<any password you choose>'
+
+# The MongoDB database admin username
 MONGODB_USERNAME = 'pankbDbOwner'
+
+# The MongoDB database admin password
 MONGODB_PASSWORD = '<any password you choose>'
+
 MONGODB_AUTH_SOURCE = 'pankb'
 ```
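To show how the values in this `.env` file fit together, here is a minimal sketch that parses the `KEY = 'value'` lines and assembles a standard MongoDB connection URI. It is an illustration only: `parse_env` and `build_mongo_uri` are hypothetical helpers, the host `localhost` and port `27017` are assumptions, and trailing inline comments (as in the pre-commit version of the file) are not handled.

```python
from urllib.parse import quote_plus


def parse_env(text):
    """Parse KEY = 'value' lines, skipping blank lines and full-line comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and the quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def build_mongo_uri(env, host="localhost", port=27017):
    """Build a mongodb:// URI; the host and port defaults are assumptions."""
    # Credentials must be percent-escaped before they are placed in the URI.
    user = quote_plus(env["MONGODB_USERNAME"])
    password = quote_plus(env["MONGODB_PASSWORD"])
    auth_source = env["MONGODB_AUTH_SOURCE"]
    return f"mongodb://{user}:{password}@{host}:{port}/?authSource={auth_source}"
```

Escaping the credentials matters because a password containing characters such as `@` or `/` would otherwise break the URI.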
 Change to the appropriate folder and build the containers with Docker Compose:
