Commit 7dc3e94

Merge pull request #202 from cancervariants/staging
Staging
2 parents 2f9b644 + 6f8f685 commit 7dc3e94

37 files changed (+5191, -5848 lines)

.ebextensions/02_app_config.config

Lines changed: 3 additions & 1 deletion
@@ -5,7 +5,9 @@ commands:
     command: "yum install -y awscli"
   03_install_unzip:
     command: "yum install -y unzip"
-  04_export_eb_env_var:
+  04_eb_packages:
+    command: "/var/app/venv/staging-LQM1lest/bin/pip install uvloop websockets httptools typing-extensions"
+  05_export_eb_env_var:
     command: "export $(cat /opt/elasticbeanstalk/deployment/env | xargs)"
 
 container_commands:

.github/workflows/release.yaml

Lines changed: 2 additions & 2 deletions
@@ -9,8 +9,8 @@ jobs:
     runs-on: ubuntu-latest
 
     steps:
-      - uses: actions/checkout@v2
-      - uses: actions/setup-python@v2
+      - uses: actions/checkout@v3
+      - uses: actions/setup-python@v4
       - name: Install dependencies
        run: |
          python3 -m pip install --upgrade pip

.pre-commit-config.yaml

Lines changed: 0 additions & 1 deletion
@@ -9,5 +9,4 @@ repos:
       - id: check-added-large-files
         args: ['--maxkb=1024']
         exclude: ^tests/data
-      - id: detect-aws-credentials
       - id: detect-private-key

Pipfile

Lines changed: 7 additions & 11 deletions
@@ -4,26 +4,22 @@ verify_ssl = true
 name = "pypi"
 
 [packages]
-"ga4gh.vrs" = {version = ">=0.7.5.dev1", extras = ["extras"]}
-civicpy = "*"
+"ga4gh.vrs" = "==0.8.0dev0"
+civicpy = ">=2.0.0"
 requests = "*"
 jsondiff = "*"
 pydantic = "*"
 requests-cache = "*"
-gene-normalizer = ">=0.1.25"
-disease-normalizer = ">=0.2.12"
-thera-py = ">=0.3.4"
+gene-normalizer = {version = "==0.1.30", extras = ["dev"]}
+disease-normalizer = {version = "==0.2.15", extras = ["dev"]}
+thera-py = {version = "==0.3.7", extras = ["dev"]}
 neo4j = "*"
 uvicorn = "*"
 fastapi = "*"
-uvloop = "*"
-websockets = "*"
-httptools = "*"
-typing-extensions = "*"
 boto3 = "*"
 botocore = "*"
-variation-normalizer = ">= 0.4.0a7"
-"ga4gh.vrsatile.pydantic" = ">=0.0.11"
+variation-normalizer = "==0.5.1"
+"ga4gh.vrsatile.pydantic" = "==0.0.11"
 asyncclick = "*"
 
 [dev-packages]

README.md

Lines changed: 48 additions & 9 deletions
@@ -30,15 +30,13 @@ Once Pipenv is installed, clone the repo and install the package requirements in
 ```sh
 git clone https://github.com/cancervariants/metakb
 cd metakb
-pipenv lock
-pipenv sync
+pipenv lock && pipenv sync
 ```
 
 If you intend to provide development support, install the development dependencies:
 
 ```sh
-pipenv lock --dev
-pipenv sync
+pipenv lock --dev && pipenv sync
 ```
 
 ### Setting up Neo4j
@@ -49,14 +47,14 @@ First, follow the [desktop setup instructions](https://neo4j.com/developer/neo4j
 
 Once you have opened Neo4j desktop, use the "New" button in the upper-left region of the window to create a new project. Within that project, click the "Add" button in the upper-right region of the window and select "Local DBMS". The name of the DBMS doesn't matter, but the password will be used later to connect the database to MetaKB (we have been using "admin" by default). Click "Create". Then, click the row within the project screen corresponding to your newly-created DBMS, and click the green "Start" button to start the database service.
 
-The graph will initially be empty, but once you have successfully loaded data, Neo4j Desktop provides an interface for exploring and visualizing relationships within the graph. To access it, click the blue "Open" button. The prompt at the top of this window processes [Cypher queries](https://neo4j.com/docs/cypher-refcard/current/); to start, try `MATCH (n:Statement {id:"civic.eid:5818"}) RETURN n`. Buttons on the left-hand edge of the results pane let you select graph, tabular, or textual output.
+The graph will initially be empty, but once you have successfully loaded data, Neo4j Desktop provides an interface for exploring and visualizing relationships within the graph. To access it, click the blue "Open" button. The prompt at the top of this window processes [Cypher queries](https://neo4j.com/docs/cypher-refcard/current/); to start, try `MATCH (n:Statement {id:"civic.eid:1409"}) RETURN n`. Buttons on the left-hand edge of the results pane let you select graph, tabular, or textual output.
 
 
 ### Setting up normalizers
 
 The MetaKB calls a number of normalizer libraries to transform resource data and resolve incoming search queries. These will be installed as part of the package requirements, but require additional setup.
 
-First, [download and install Amazon's DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.DownloadingAndRunning.html). Once installed, in a separate terminal instance, navigate to its source directory and run the following to start the database instance:
+First, [follow these instructions for deploying DynamoDB locally on your computer](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.DownloadingAndRunning.html). Once set up, in a separate terminal instance, navigate to its source directory and run the following to start the database instance:
 
 ```sh
 java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb
@@ -65,10 +63,10 @@ java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb
 Next, navigate to the `site-packages` directory of your virtual environment. Assuming Pipenv is installed to your user directory, this should be something like:
 
 ```sh
-cd ~/.local/share/virtualenvs/metakb-<various characters>/python3.7/site-packages/ # replace <various characters>
+cd ~/.local/share/virtualenvs/metakb-<various characters>/lib/python<python-version>/site-packages/ # replace <various characters> and <python-version>
 ```
 
-Next, initialize the [Variation Normalizer](https://github.com/cancervariants/variation-normalization) by following the instructions in the [README](https://github.com/cancervariants/variation-normalization#installation).
+Next, initialize the [Variation Normalizer](https://github.com/cancervariants/variation-normalization) by following the instructions in the [README](https://github.com/cancervariants/variation-normalization#installation). When setting up the UTA database, [these docs](https://github.com/ga4gh/vrs-python/tree/main/docs/setup_help) may be helpful.
 
 
 The MetaKB can acquire all other needed normalizer data, except for that of [OMIM](https://www.omim.org/downloads), which must be manually placed:
@@ -79,9 +77,46 @@ mkdir -p data/omim
 cp ~/YOUR/PATH/TO/mimTitles.txt data/omim/omim_<date>.tsv # replace <date> with date of data acquisition formatted as YYYYMMDD
 ```
 
+### Environment Variables
+
+MetaKB relies on several environment variables being set in order to work.
+
+* Always Required:
+  * `UTA_DB_URL`
+    * Used in the Variation Normalizer, which relies on UTA Tools
+    * Format: `driver://user:pass@host/database/schema`
+    * More info can be found [here](https://github.com/GenomicMedLab/uta-tools#connecting-to-the-database)
+
+  Example:
+
+  ```shell script
+  export UTA_DB_URL=postgresql://uta_admin:password@localhost:5432/uta/uta_20210129
+  ```
+
+* Required when using the `--load_normalizers_db` or `--force_load_normalizers_db` arguments in CLI commands:
+  * `RXNORM_API_KEY`
+    * Used in the Therapy Normalizer to retrieve RxNorm data
+    * RxNorm requires a UMLS license, which you can register for [here](https://www.nlm.nih.gov/research/umls/index.html). You must set the `RXNORM_API_KEY` environment variable to your API key, which can be found in the [UTS 'My Profile' area](https://uts.nlm.nih.gov/uts/profile) after signing in.
+
+  Example:
+
+  ```shell script
+  export RXNORM_API_KEY={rxnorm_api_key}
+  ```
+
+  * `DATAVERSE_API_KEY`
+    * Used in the Therapy Normalizer to retrieve HemOnc data
+    * HemOnc.org data requires a Harvard Dataverse API key. After creating a user account on the [Harvard Dataverse site](https://dataverse.harvard.edu/), you can follow [these instructions](https://guides.dataverse.org/en/latest/user/account.html) to generate a key. You must set the `DATAVERSE_API_KEY` environment variable to your API key.
+
+  Example:
+
+  ```shell script
+  export DATAVERSE_API_KEY={dataverse_api_key}
+  ```
+
 ### Loading data
 
-Once Neo4j and DynamoDB instances are both active, and necessary normalizer data has been placed, run the MetaKB CLI with the `--initialize_normalizers` flag to acquire all other necessary normalizer source data, and execute harvest, transform, and load operations into the graph datastore.
+Once Neo4j and DynamoDB instances are both running, and necessary normalizer data has been placed, run the MetaKB CLI with the `--initialize_normalizers` flag to acquire all other necessary normalizer source data, and execute harvest, transform, and load operations into the graph datastore.
 
 In the MetaKB project root, run the following:
 
@@ -90,6 +125,8 @@ pipenv shell
 python3 -m metakb.cli --db_url=bolt://localhost:7687 --db_username=neo4j --db_password=<neo4j-password-here> --load_normalizers_db
 ```
 
+For more information on the different CLI arguments, see the [CLI README](docs/cli/README.md).
+
 ### Starting the server
 
 Once data has been loaded successfully, use the following to start service on localhost port 8000:
@@ -98,6 +135,8 @@ Once data has been loaded successfully, use the following to start service on lo
 uvicorn metakb.main:app --reload
 ```
 
+Ensure that both the MetaKB Neo4j and normalizer databases are running.
+
 Navigate to [http://localhost:8000/api/v2](http://localhost:8000/api/v2) in your browser to enter queries.
 
 ## Running tests

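The README's new Environment Variables section lists what must be set before the service or CLI runs. Purely as an illustration (this helper is not part of the commit; the grouping of variables simply mirrors the README text above), a small pre-flight check could look like:

```python
# Hypothetical pre-flight check for the environment variables described in the
# README above. Not part of the MetaKB codebase; names mirror the docs only.
from os import environ

ALWAYS_REQUIRED = ["UTA_DB_URL"]
NORMALIZER_LOAD_REQUIRED = ["RXNORM_API_KEY", "DATAVERSE_API_KEY"]


def check_env(loading_normalizers: bool = False) -> None:
    """Raise if any required environment variable is unset or empty."""
    needed = ALWAYS_REQUIRED + (NORMALIZER_LOAD_REQUIRED if loading_normalizers else [])
    missing = [name for name in needed if not environ.get(name)]
    if missing:
        raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")


if __name__ == "__main__":
    check_env(loading_normalizers=True)
```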
docs/cli/README.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+# MetaKB CLI
+
+More information on MetaKB CLI arguments:
+
+* `--db_url`
+  * URL endpoint for the application Neo4j database. Can also be provided via the environment variable `METAKB_DB_URL`.
+
+* `--db_username`
+  * Username to provide to the application Neo4j database. Can also be provided via the environment variable `METAKB_DB_USERNAME`.
+
+* `--db_password`
+  * Password to provide to the application Neo4j database. Can also be provided via the environment variable `METAKB_DB_PASSWORD`.
+
+* `--load_normalizers_db`
+  * Check the normalizers' (therapy, disease, and gene) DynamoDB database and load data if source data is not present.
+
+* `--force_load_normalizers_db`
+  * Load all normalizers' (therapy, disease, and gene) data into the DynamoDB database. Overrides `--load_normalizers_db` if both are selected.
+
+* `--normalizers_db_url`
+  * URL endpoint of the normalizers' (therapy, disease, and gene) DynamoDB database. Set to `http://localhost:8000` by default.
+
+* `--load_latest_cdms`
+  * Deletes all nodes from the MetaKB Neo4j database and loads it with the latest source transformed CDM files stored locally in the `metakb/data` directory. This bypasses having to run the source harvest and transform steps. Exclusive with `--load_target_cdm` and `--load_latest_s3_cdms`.
+
+* `--load_target_cdm`
+  * Load a source's transformed CDM file at the specified path. This bypasses having to run the source harvest and transform steps. Exclusive with `--load_latest_cdms` and `--load_latest_s3_cdms`.
+
+* `--load_latest_s3_cdms`
+  * Deletes all nodes from the MetaKB Neo4j database, retrieves the latest source transformed CDM files from the public S3 bucket, and loads the Neo4j database with the retrieved data. This bypasses having to run the source harvest and transform steps. Exclusive with `--load_latest_cdms` and `--load_target_cdm`.

metakb/__init__.py

Lines changed: 0 additions & 4 deletions
@@ -7,10 +7,6 @@
 PROJECT_ROOT = Path(__file__).resolve().parents[1]
 
 if 'METAKB_NORM_EB_PROD' in environ:
-    environ['VARIATION_NORM_EB_PROD'] = "true"
-    environ['GENE_NORM_EB_PROD'] = "true"
-    environ['THERAPY_NORM_EB_PROD'] = "true"
-    environ['DISEASE_NORM_EB_PROD'] = "true"
     LOG_FN = "/tmp/metakb.log"
 else:
     LOG_FN = "metakb.log"

metakb/cli.py

Lines changed: 19 additions & 4 deletions
@@ -114,11 +114,21 @@ class CLI:
             "from VICC S3 bucket, and load the database with retrieved "
             "data. Exclusive with --load_latest_cdms and load_target_cdm.")
     )
+    @click.option(
+        "--update_cached",
+        "-u",
+        is_flag=True,
+        default=False,
+        required=False,
+        help=("`True` if civicpy cache should be updated. Note this will take several "
+              "minutes. `False` if local cache should be used.")
+    )
     async def update_metakb_db(
         db_url: str, db_username: str, db_password: str,
         load_normalizers_db: bool, force_load_normalizers_db: bool,
         normalizers_db_url: str, load_latest_cdms: bool,
-        load_target_cdm: Optional[Path], load_latest_s3_cdms: bool
+        load_target_cdm: Optional[Path], load_latest_s3_cdms: bool,
+        update_cached: bool
     ):
         """Execute data harvest and transformation from resources and upload
         to graph datastore.
@@ -141,7 +151,7 @@ async def update_metakb_db(
         if load_normalizers_db or force_load_normalizers_db:
            CLI()._load_normalizers_db(force_load_normalizers_db)
 
-        CLI()._harvest_sources()
+        CLI()._harvest_sources(update_cached)
         await CLI()._transform_sources()
 
         # Load neo4j database
@@ -225,7 +235,7 @@ def _retrieve_s3_cdms(self) -> str:
         return newest_version
 
     @staticmethod
-    def _harvest_sources() -> None:
+    def _harvest_sources(update_cached) -> None:
        """Run harvesting procedure for all sources."""
        echo_info("Harvesting sources...")
        # TODO: Switch to using constant
@@ -238,7 +248,12 @@ def _harvest_sources() -> None:
            echo_info(f"Harvesting {source_str}...")
            start = timer()
            source: Harvester = source_class()
-           source_successful = source.harvest()
+           if source_str == "civic" and update_cached:
+               # Use latest civic data
+               echo_info("(civicpy cache is also being updated)")
+               source_successful = source.harvest(update_cache=True)
+           else:
+               source_successful = source.harvest()
            end = timer()
            if not source_successful:
                echo_info(f'{source_str} harvest failed.')

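As a quick illustration of how the new `--update_cached` flag flows from the command line into the harvest step, here is a self-contained sketch. It deliberately uses plain `click` and a stand-in harvester class rather than the project's real `CLI`/asyncclick machinery, so every name other than the flag itself is illustrative:

```python
# Simplified, hypothetical sketch of the --update_cached flag added in this
# commit. FakeCivicHarvester stands in for the real CIViC harvester.
import click


class FakeCivicHarvester:
    """Stand-in harvester with the same harvest(update_cache=...) shape."""

    def harvest(self, update_cache: bool = False) -> bool:
        # A real harvester would refresh or reuse the civicpy cache here.
        return True


@click.command()
@click.option("--update_cached", "-u", is_flag=True, default=False,
              help="Update the civicpy cache before harvesting "
                   "(may take several minutes); otherwise use the local cache.")
def update_metakb_db(update_cached: bool) -> None:
    """Harvest sources, optionally refreshing the civicpy cache first."""
    source = FakeCivicHarvester()
    if update_cached:
        click.echo("(civicpy cache is also being updated)")
        ok = source.harvest(update_cache=True)
    else:
        ok = source.harvest()
    click.echo("harvest succeeded" if ok else "harvest failed")


if __name__ == "__main__":
    update_metakb_db()
```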
metakb/database.py

Lines changed: 1 addition & 1 deletion
@@ -457,7 +457,7 @@ def _add_statement(tx, statement: Dict, added_ids: Set[str]):
     @staticmethod
     def get_secret():
         """Get secrets for MetaKB instances."""
-        secret_name = environ['METAKB_DB_PASSWORD']
+        secret_name = environ['METAKB_DB_SECRET']
         region_name = "us-east-2"
 
         # Create a Secrets Manager client

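For context on the renamed environment variable: the value of `METAKB_DB_SECRET` is used as the secret name passed to AWS Secrets Manager. The diff only shows the first lines of `get_secret`, so the following is a hedged sketch of how such a lookup typically works with `boto3`, not the repo's exact implementation:

```python
# Hedged sketch of a Secrets Manager lookup keyed by METAKB_DB_SECRET.
# The error handling and JSON payload shape are assumptions, not repo code.
import json
from os import environ

import boto3
from botocore.exceptions import ClientError


def get_secret(region_name: str = "us-east-2") -> dict:
    """Return the MetaKB database secret payload as a dict."""
    secret_name = environ["METAKB_DB_SECRET"]
    client = boto3.session.Session().client(
        service_name="secretsmanager", region_name=region_name
    )
    try:
        response = client.get_secret_value(SecretId=secret_name)
    except ClientError as e:
        raise RuntimeError(f"Unable to retrieve secret {secret_name}") from e
    # Assumes the secret is stored as a JSON string.
    return json.loads(response["SecretString"])
```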
metakb/harvesters/base.py

Lines changed: 4 additions & 14 deletions
@@ -6,18 +6,18 @@
 
 from metakb import APP_ROOT, DATE_FMT
 
-logger = logging.getLogger('metakb')
+logger = logging.getLogger("metakb.harvesters.base")
 logger.setLevel(logging.DEBUG)
 
 
 class Harvester:
     """A base class for content harvesters."""
 
-    def __init__(self):
+    def __init__(self) -> None:
         """Initialize Harvester class."""
         self.assertions = []
 
-    def harvest(self):
+    def harvest(self) -> bool:
         """
         Retrieve and store records from a resource. Records may be stored in
         any manner, but must be retrievable by :method:`iterate_records`.
@@ -27,16 +27,6 @@ def harvest(self):
         """
         raise NotImplementedError
 
-    def iter_assertions(self):
-        """
-        Yield all :class:`ClinSigAssertion` records for the resource.
-
-        :return: An iterator
-        :rtype: Iterator[:class:`ClinSigAssertion`]
-        """
-        for statement in self.assertions:
-            yield statement
-
     def create_json(self, items: Dict[str, List],
                     filename: Optional[str] = None) -> bool:
@@ -59,7 +49,7 @@ def create_json(self, items: Dict[str, List],
             if filename is None:
                 filename = f"{src}_harvester_{today}.json"
             with open(src_dir / filename, "w+") as f:
-                json.dump(composite_dict, f, indent=4)
+                f.write(json.dumps(composite_dict, indent=4))
         except Exception as e:
             logger.error(f"Unable to create json: {e}")
             return False

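With `harvest()` now annotated to return `bool`, concrete harvesters are expected to implement it and report success. A hypothetical subclass might look like the following; the class name, record payload, and filename are illustrative and not part of the repo:

```python
# Illustrative Harvester subclass; everything except the base class import
# is made up for demonstration purposes.
from typing import Dict, List

from metakb.harvesters.base import Harvester


class ExampleHarvester(Harvester):
    """Toy harvester that emits a fixed set of records."""

    def harvest(self) -> bool:
        records: Dict[str, List] = {
            "assertions": [{"id": "example.assertion:1", "description": "demo"}],
            "evidence": [],
        }
        # create_json writes the harvested records to JSON on disk and
        # returns False (after logging) if writing fails.
        return self.create_json(records, filename="example_harvester.json")
```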