Note: This repo has been archived because LINK (Library) has been decommissioned.
This pipeline is designed to run with Apache Beam using the Dataflow runner. It has not been tested with other Beam backends, but it should work with them with minimal modifications. Please see the Apache Beam SDK documentation for more info.
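For context, the runner is just a pipeline option in Beam. The sketch below is a generic, minimal example (not the pipeline in this repo) of pointing a Python Beam pipeline at the Dataflow runner; the project, bucket, and region names are placeholders.

```python
# Minimal Beam pipeline showing runner selection via PipelineOptions.
# All names (project, bucket, region) are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',             # use 'DirectRunner' to test locally
    project='my-gcp-project',
    temp_location='gs://my-bucket/temp',
    region='europe-west1',
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'Create' >> beam.Create(['a', 'small', 'test', 'collection'])
     | 'Upper' >> beam.Map(lambda text: text.upper())
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/demo'))
```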
Use Python 2 with pip and virtualenv.
- Generate a mirror of the MEDLINE FTP site in a Google Storage bucket (any other storage provider supported by the Python Beam SDK should also work), e.g. using rclone:
  - Download pre-built rclone binaries rather than platform-packaged ones, as they tend to be more up-to-date.
  - Configure rclone with the MEDLINE FTP server (ftp.ncbi.nlm.nih.gov) and your target GCP project (my-gcp-project-buckets):

    ```sh
    rclone config
    ```

    MEDLINE must be configured with username `anonymous` and password `anonymous`.
  - Generate a full mirror:

    ```sh
    rclone sync -v medline-ftp:pubmed/baseline my-gcp-project-buckets:my-medline-bucket/baseline
    ```
  - Update with new files:

    ```sh
    rclone sync -v medline-ftp:pubmed/updatefiles my-gcp-project-buckets:my-medline-bucket/updatefiles
    ```
  - Note: you can use the `--dry-run` argument to test.
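  Once the sync finishes, you can optionally confirm the files landed in the bucket. The snippet below is an illustrative sketch (not part of this repo) using the `google-cloud-storage` client; the project and bucket names are the placeholders from the rclone examples above.

  ```python
  # List a few of the mirrored baseline files to confirm the sync worked.
  from google.cloud import storage

  client = storage.Client(project='my-gcp-project')   # placeholder project id
  for blob in client.list_blobs('my-medline-bucket', prefix='baseline/', max_results=5):
      print('%s (%d bytes)' % (blob.name, blob.size))
  ```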
- Install tooling:

  ```sh
  sudo apt-get install python-dev virtualenv build-essential git libxml2-dev libxslt-dev zlib1g-dev tmux
  ```
- Download the pipeline:

  ```sh
  git clone https://github.com/opentargets/library-beam
  cd library-beam
  ```
- Create a virtual environment to manage dependencies in:

  ```sh
  virtualenv venv --python=python2
  source venv/bin/activate
  ```
- Install the pipeline into the virtual environment:

  ```sh
  python setup.py install  # note: this needs between 3.75 GB and 7.5 GB of RAM
  pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.0/en_core_web_lg-2.2.0.tar.gz
  ```
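  As a quick sanity check that the spaCy model installed above loads correctly (an illustrative snippet, not part of the repo; loading `en_core_web_lg` itself needs several GB of RAM):

  ```python
  # Load the large English model installed above and run it on one sentence.
  import spacy

  nlp = spacy.load('en_core_web_lg')
  doc = nlp(u'BRCA1 mutations are associated with an increased risk of breast cancer.')
  print([(ent.text, ent.label_) for ent in doc.ents])
  ```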
- Grant the Cloud Build Service Agent role to the default compute service account (`numberHidden-compute@developer.gserviceaccount.com`).
- Change the value for the vocabulary info under `modules/vocabulary.py`.
- Run the pipeline:

  ```sh
  python -m main \
    --project open-targets-library \
    --job_name medline201911 \
    --runner DataflowRunner \
    --temp_location gs://medline_2019_11/temp \
    --setup_file ./setup.py \
    --worker_machine_type n1-highmem-32 \
    --input_baseline gs://medline_2019_11/baseline/pubmed19n*.xml.gz \
    --input_updates gs://medline_2019_11/updatefiles/pubmed19n*.xml.gz \
    --output_enriched gs://medline_2019_11/analyzed/pubmed19 \
    --output_splitted gs://medline_2019_11/splitted/pubmed19 \
    --max_num_workers 32 \
    --region europe-west1 \
    --zone europe-west1-d
  ```
The job can be monitored via the Google Dataflow console. Note that the "wall time" it displays is not wall-clock time in the usual sense; it is accumulated per thread and per worker.
In total the pipeline takes approximately 4 hours.
The directory gcp contains the infrastructure scripts to generate the Elasticsearch cluster.
- Create a virtual environment to manage dependencies in:

  ```sh
  virtualenv venv_elasticsearch --python=python2
  source venv_elasticsearch/bin/activate
  pip install -r venv_elasticsearch.txt
  ```
- Run the job that loads the JSONs into Elasticsearch.

  WARNING: the loading scripts currently take a lot of time, particularly the concept one (24h+). It is good to use `screen`, `tmux`, or similar, so the job keeps going after a disconnect and can be recovered.

  ```sh
  python load2es.py publication bioentity taggedtext concept --es http://es:9200
  ```
  Note: Elasticsearch must have the International Components for Unicode (ICU) analysis plugin installed, i.e.

  ```sh
  /usr/share/elasticsearch/bin/elasticsearch-plugin -s install analysis-icu
  ```
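  A quick way to confirm the plugin is present and that documents were loaded is the official `elasticsearch` Python client. The snippet below is a sketch (not part of the repo), with the host and index name taken from the examples in this README.

  ```python
  # Check that the ICU analysis plugin is installed and count loaded documents.
  from elasticsearch import Elasticsearch

  es = Elasticsearch(['http://es:9200'])
  print(es.cat.plugins())                  # should list analysis-icu
  index = 'pubmed-18-concept'              # one of the indices populated by load2es.py
  if es.indices.exists(index=index):
      print('%s: %d documents' % (index, es.count(index=index)['count']))
  ```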
- Increase the Elasticsearch capacity for the adjacency matrix aggregation (used by the LINK tool):

  ```sh
  curl -XPUT 'http://myesnode1:9200/pubmed-18-concept/_settings' -H 'Content-Type: application/json' -d'
  {
      "index" : {
          "max_adjacency_matrix_filters" : 500
      }
  }'
  ```
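  If you prefer to do this from Python rather than curl, the same setting can be applied with the `elasticsearch` client (a sketch, with host and index name as in the curl example above):

  ```python
  # Raise the adjacency-matrix filter limit on the concept index.
  from elasticsearch import Elasticsearch

  es = Elasticsearch(['http://myesnode1:9200'])
  es.indices.put_settings(
      index='pubmed-18-concept',
      body={'index': {'max_adjacency_matrix_filters': 500}},
  )
  ```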
When controlling this process from a Google Cloud machine, make sure it has sufficient access scopes enabled.