Skip to content

Commit ca1585c

Browse files
authored
Merge pull request #10 from CogStack/devel
Merge devel into master.
2 parents a1f039b + 13bb489 commit ca1585c

File tree

11 files changed

+975
-299
lines changed

11 files changed

+975
-299
lines changed

.gitattributes

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
###############################
2+
# Git Line Endings #
3+
###############################
4+
5+
# Set default behaviour to automatically normalize line endings.
6+
* text=auto
7+
8+
# Force batch scripts to always use CRLF line endings so that if a repo is accessed
9+
# in Windows via a file share from Linux, the scripts will work.
10+
*.{cmd,[cC][mM][dD]} text eol=crlf
11+
*.{bat,[bB][aA][tT]} text eol=crlf
12+
13+
# Force bash scripts to always use LF line endings so that if a repo is accessed
14+
# in Unix via a file share from Windows, the scripts will work.
15+
*.sh text eol=lf
16+
17+
###############################
18+
# Git Large File System (LFS) #
19+
###############################
20+
21+
# Archives
22+
*.7z filter=lfs diff=lfs merge=lfs -text
23+
*.br filter=lfs diff=lfs merge=lfs -text
24+
*.gz filter=lfs diff=lfs merge=lfs -text
25+
*.tar filter=lfs diff=lfs merge=lfs -text
26+
*.zip filter=lfs diff=lfs merge=lfs -text
27+
28+
# Documents
29+
*.pdf filter=lfs diff=lfs merge=lfs -text
30+
31+
# Images
32+
*.gif filter=lfs diff=lfs merge=lfs -text
33+
*.ico filter=lfs diff=lfs merge=lfs -text
34+
*.jpg filter=lfs diff=lfs merge=lfs -text
35+
*.pdf filter=lfs diff=lfs merge=lfs -text
36+
*.png filter=lfs diff=lfs merge=lfs -text
37+
*.psd filter=lfs diff=lfs merge=lfs -text
38+
*.webp filter=lfs diff=lfs merge=lfs -text
39+
40+
# Fonts
41+
*.woff2 filter=lfs diff=lfs merge=lfs -text
42+
43+
# Other
44+
*.exe filter=lfs diff=lfs merge=lfs -text

.gitignore

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,11 @@
1+
# IDE
12
.DS_Store
23
.idea
4+
.vscode
5+
6+
# python
37
*.pyc
48
venv
9+
__pycache__
10+
venv-test
11+
.mypy_cache

Dockerfile

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,12 @@
1-
FROM python:3.8
1+
FROM python:3.9
22

33
WORKDIR /app
44

55
# configure and install the required packages
66
ENV PYTHONPATH="/app:${PYTHONPATH}"
77
COPY ./requirements.txt /app
88

9+
RUN pip3 install --upgrade pip
910
RUN pip3 install -r requirements.txt
1011

1112
# copy the source and config files

README.md

Lines changed: 20 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,6 @@ Firstly install the Python libraries specified in `requirements.txt`.
1515
To run:
1616
`python main.py --config config/config.yml`
1717

18-
1918
# Configuration
2019
The ingestion process properties are configured in `config.yml`.
2120

@@ -31,18 +30,36 @@ Entries under `sink` key specify the ElasticSearch sink.
3130

3231
#### Security and SSL certificates
3332
This only applies when ElasticSearch cluster is using X-Pack / Open Distro and requires secure connections with using SSL certificates. Under the key `source` and/or `sink` can be optionally specified `security` entry that will contain necessary configuration -- these are:
33+
' `ca-file-path` -- path to CA certificate file (PEM) used solely with SSL, (DO NOT USE THIS along with the other ca/client parameters mentioned underneath, use it solely)
3434
- `ca-certs-path` -- the path to CA certificates file (PEM),
3535
- `client-cert-path` -- the path to client certificate file (PEM),
3636
- `client-key-path` -- the path to client key (PEM).
3737

38-
### Extra ES connection options
38+
### Extra ES sink/source connection options
3939
Entires under the key `extra_params` are optional, useful for test cases or deployments where we only use internal resources:
4040
- `use_ssl` -- use ssl connection
4141
- `verify_certs` -- verify SSL certificates
4242

43+
- `credentials`
44+
- `username` and `password` can be used to provide connection credentials
45+
- `use-api-key` if this is enabled the username and password fields will be used as api_id and api_key
46+
47+
### NLP service
48+
- `endpoint-url`
49+
- `endpoint-request-mode` , this is either left empty, or in case of use with the GATE NLP Annie annotation service it should be set to `gate-nlp`
50+
- `use-bulk-indexing` ingest in bulk mode (1000 docs / bulk chunk),
51+
52+
- `credentials`
53+
- `username` and `password` can be used to provide connection credentials
54+
4355
### Fields mapping
4456
Entries under `mapping` key define the mapping of the document fields for the ingestion.
4557

58+
The sub-entry `index-ingest-mode` defines:
59+
- `same-index`: `False` , set to `True` if you wish to ingest annotations into the same index
60+
- `use-nested-objects` : 'False` , set to True if you wish to ingest into the same index with nested object type, useful but beware of search query speed impact
61+
- `es-nested-object-schema-mapping` : for medcat annotations use `medcat-separate-index` or `medcat-nested-object` , for GATE-nlp use `gate-nlp-separate-index` or `gate-nlp-nested-object`
62+
4663
The sub-entry `source` defines the field names that contain:
4764
- `text-field` - the free text to be processed,
4865
- `docid-field` - the unique identifier of the document,
@@ -57,7 +74,7 @@ The sub-entry `batch` defines the possible portion of documents to be processed
5774
- `threads` - the number of processing threads to speed up the ingestion.
5875

5976
The sub-entry `sink` specifies additional options during sending the processed annotations:
60-
- `split-index-by-field` - the name of the field in the returned annotations the value of which will be used as a prefix for the index name (e.g., used to send annotations of different types to separate indices).
77+
- `split-index-by-field` - the name of the field in the returned annotations the value of which will be used as a prefix for the index name (e.g., used to send annotations of different types to separate indices). If you don't want this functionality simply leave the field empty, otherwise , to split by annotation type use `type`
6178

6279
The sub-entry `nlp` specifies additional options during processing the documents with NLP:
6380
- `skip-processed-doc-check` - whether to skip checking for already processed documents in ElasticSearch,

config/config.yml

Lines changed: 46 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -1,49 +1,72 @@
11
source:
22
es:
3-
hosts: ["https://cogstack:cogstack@localhost:9200"]
4-
index-name: 'cogstack_observations_view'
5-
extra_params:
6-
use_ssl: False
7-
verify_certs: False
3+
hosts: ["http://127.0.0.1:9200"]
4+
credentials:
5+
username : "elastic"
6+
password : "admin"
7+
use-api-key : False
8+
index-name: 'medical_reports_text'
9+
extra-params:
10+
use-ssl: False
11+
verify-certs: False
812
security:
13+
ca-file-path : "/app/config/cert.pem"
914
ca-certs-path: "/app/config/root-ca.pem"
1015
client-cert-path: "/app/config/client.pem"
1116
client-key-path: "/app/config/client.key"
1217

1318
sink:
1419
es:
15-
hosts: ["http://cogstack:cogstack@localhost:9200"]
16-
index-name: 'cogstack_atomic_annotations'
17-
extra_params:
18-
use_ssl: False
19-
verify_certs: False
20+
hosts: ["http://127.0.0.1:9200"]
21+
credentials:
22+
username : "elastic"
23+
password : "admin"
24+
use-api-key : False
25+
index-name: 'medical_reports_text_annotations'
26+
extra-params:
27+
use-ssl: False
28+
verify-certs: False
2029
security:
30+
ca-file-path : "/app/config/cert.pem"
2131
ca-certs-path: "/app/config/root-ca.pem"
2232
client-cert-path: "/app/config/client.pem"
2333
client-key-path: "/app/config/client.key"
2434

2535
nlp-service:
26-
endpoint-url: 'http://localhost:8095/api/process'
36+
endpoint-url: 'http://localhost:5555/api/process'
37+
endpoint-request-mode : ''
38+
use-bulk-indexing : True
39+
annotation-response:
40+
dict-key : "annotations"
41+
result-key : "result"
42+
credentials :
43+
username : ""
44+
password : ""
2745

2846
mapping:
47+
index-ingest-mode:
48+
same-index: False
49+
use-nested-objects: False
50+
es-nested-object-schema-mapping : ""
2951
source:
30-
text-field: 'document_content'
31-
docid-field: 'encounter_id'
52+
text-field: 'document'
53+
docid-field: '_id'
3254
persist-fields:
33-
- 'encounter_id'
34-
- 'patient_id'
35-
- 'encounter_start'
36-
- 'encounter_end'
55+
- '_id'
3756
batch:
38-
date-field: 'encounter_start'
57+
date-field: 'dct'
3958
date-format: 'yyyy-MM-dd'
4059
python-date-format: '%Y-%m-%d'
41-
interval: 30
42-
date-start: '2010-01-01'
43-
date-end: '2018-06-01'
60+
interval: 30
61+
date-start: '1999-01-01'
62+
date-end: '2021-02-01'
4463
threads: 8
4564
sink:
46-
split-index-by-field: 'type'
65+
split-index-by-field: '' # 'type', splits into different indices with separate prefix
4766
nlp:
48-
skip-processed-doc-check: 'true'
67+
skip-processed-doc-check: True
4968
annotation-id-field: 'id'
69+
70+
# DEBUG = 10 , INFO = 20, WARNING = 30, ERROR = 40, CRITICAL = 50
71+
logging-level: "20"
72+

docker-compose.yml

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,5 @@
11
version: "3.5"
22
services:
3-
43
annotation-ingester:
54
build: .
65
shm_size : 128mb

ingester/__main__.py

Lines changed: 48 additions & 42 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,6 @@
88
from ingester.nlp_service import *
99
from ingester.annotations_indexer import *
1010

11-
1211
class AppConfig:
1312
"""
1413
The configuration file for the indexer application
@@ -32,7 +31,6 @@ def __init__(self, file_path):
3231
except FileNotFoundError:
3332
raise Exception("Cannot open configuration file")
3433

35-
3634
if __name__ == "__main__":
3735
# parse the input parameters
3836
parser = argparse.ArgumentParser(description='ElasticSearch-to-ElasticSearch annotations indexer')
@@ -43,76 +41,84 @@ def __init__(self, file_path):
4341
parser.print_usage()
4442
exit(0)
4543

46-
# setup logging
47-
log_format = '[%(asctime)s] [%(levelname)s] %(name)s: %(message)s'
48-
logging.basicConfig(format=log_format, level=logging.INFO)
49-
5044
try:
5145
config = AppConfig(args.config)
5246

53-
# set up Elastic logger for the initialization time
54-
logging.getLogger('elasticsearch').setLevel(level=logging.INFO)
47+
# setup logging
48+
log_format = '[%(asctime)s] [%(levelname)s] %(name)s: %(message)s'
49+
if config.params['logging-level']:
50+
logging.basicConfig(format=log_format, level=int(config.params['logging-level']))
51+
else:
52+
logging.basicConfig(format=log_format, level=logging.INFO)
53+
logging.getLogger('elasticsearch')
5554

5655
# initialize the elastic source
5756
source_params = config.params['source']
58-
if 'security' in source_params['es']:
59-
src_sec = source_params['es']['security']
60-
source_ssl_config = SslConnectionConfig(ca_certs_path=src_sec['ca-certs-path'],
61-
client_cert_path=src_sec['client-cert-path'],
62-
client_key_path=src_sec['client-key-path'])
63-
es_source_conf = ElasticConnectorConfig(hosts=source_params['es']['hosts'],
64-
ssl_config=source_ssl_config)
65-
elif 'extra_params' in source_params['es']:
66-
e_params = source_params['es']['extra_params']
67-
es_source_conf = ElasticConnectorConfig(hosts=source_params['es']['hosts'], extra_params=e_params)
68-
else:
69-
es_source_conf = ElasticConnectorConfig(hosts=source_params['es']['hosts'])
70-
es_source_conn = ElasticConnector(es_source_conf)
7157

58+
source_credentials = source_params['es']['credentials'] if source_params['es']['credentials'] else None
59+
source_extra_params = source_params['es']['extra-params'] if source_params['es']['extra-params'] else None
60+
source_security = source_params['es']['security'] if source_params['es']['security'] else None
61+
62+
source_ssl_config = SslConnectionConfig(ca_file_path=source_security['ca-file-path'],
63+
ca_certs_path=source_security['ca-certs-path'],
64+
client_cert_path=source_security['client-cert-path'],
65+
client_key_path=source_security['client-key-path'])
66+
67+
es_source_conf = ElasticConnectorConfig(hosts=source_params['es']['hosts'], credentials=source_credentials, extra_params=source_extra_params,
68+
ssl_config=source_ssl_config)
69+
70+
es_source_conn = ElasticConnector(es_source_conf)
7271
es_source = ElasticRangedIndexer(es_source_conn, source_params['es']['index-name'])
7372

7473
# initialize NLP service
75-
nlp_service = NlpService(config.params['nlp-service']['endpoint-url'])
74+
nlp_service = NlpService(config.params['nlp-service']['endpoint-url'],
75+
endpoint_request_mode=config.params['nlp-service']['endpoint-request-mode'],
76+
use_bulk_indexing=config.params['nlp-service']['use-bulk-indexing'],
77+
username=config.params['nlp-service']['credentials']['username'],
78+
password=config.params['nlp-service']['credentials']['password'])
7679

7780
# initialize the elastic sink
7881
sink_params = config.params['sink']
79-
if 'security' in sink_params['es']:
80-
sink_sec = sink_params['es']['security']
81-
sink_ssl_config = SslConnectionConfig(ca_certs_path=sink_sec['ca-certs-path'],
82-
client_cert_path=sink_sec['client-cert-path'],
83-
client_key_path=sink_sec['client-key-path'])
84-
85-
es_sink_conf = ElasticConnectorConfig(hosts=sink_params['es']['hosts'],
86-
ssl_config=sink_ssl_config)
87-
elif 'extra_params' in sink_params['es']:
88-
e_params = sink_params['es']['extra_params']
89-
es_sink_conf = ElasticConnectorConfig(hosts=sink_params['es']['hosts'], extra_params=e_params)
90-
else:
91-
es_sink_conf = ElasticConnectorConfig(hosts=sink_params['es']['hosts'])
82+
83+
sink_credentials = sink_params['es']['credentials'] if sink_params['es']['credentials'] else None
84+
sink_extra_params = sink_params['es']['extra-params'] if sink_params['es']['extra-params'] else None
85+
sink_security = sink_params['es']['security'] if sink_params['es']['security'] else None
86+
87+
sink_ssl_config = SslConnectionConfig(ca_file_path=sink_security['ca-file-path'],
88+
ca_certs_path=sink_security['ca-certs-path'],
89+
client_cert_path=sink_security['client-cert-path'],
90+
client_key_path=sink_security['client-key-path'])
91+
92+
es_sink_conf = ElasticConnectorConfig(hosts=sink_params['es']['hosts'], credentials=sink_credentials, extra_params=sink_extra_params,
93+
ssl_config=sink_ssl_config)
9294

9395
es_sink_conn = ElasticConnector(es_sink_conf)
9496
es_sink = ElasticIndexer(es_sink_conn, sink_params['es']['index-name'])
9597

96-
9798
# initialize the indexer
9899
mapping = config.params['mapping']
99-
indexer = BatchAnnotationsIndexer(nlp_service=nlp_service,
100+
annoation_indexer_config = AnnotationIndexerConfig(nlp_service=nlp_service,
100101
source_indexer=es_source,
101102
source_text_field=mapping['source']['text-field'],
102103
source_docid_field=mapping['source']['docid-field'],
103104
source_fields_to_persist=mapping['source']['persist-fields'],
104-
split_index_by_field=mapping['sink']['split-index-by-field'],
105105
sink_indexer=es_sink,
106+
split_index_by_field=mapping['sink']['split-index-by-field'],
106107
source_batch_date_field=mapping['source']['batch']['date-field'],
107108
batch_date_format=mapping['source']['batch']['date-format'],
108-
skip_doc_check=mapping['nlp']['skip-processed-doc-check'].lower() == "true",
109+
skip_doc_check=mapping['nlp']['skip-processed-doc-check'],
109110
nlp_ann_id_field=mapping['nlp']['annotation-id-field'],
110111
threads=mapping['source']['batch']['threads'],
111112
python_date_format=mapping['source']['batch']['python-date-format'],
112-
interval=mapping['source']['batch']['interval'])
113+
interval=mapping['source']['batch']['interval'],
114+
same_index_ingest=mapping['index-ingest-mode']['same-index'],
115+
use_nested_objects=mapping['index-ingest-mode']['use-nested-objects'],
116+
es_nested_object_schema_mapping=mapping['index-ingest-mode']['es-nested-object-schema-mapping'])
117+
118+
indexer = BatchAnnotationsIndexer(annoation_indexer_config)
119+
113120
except Exception as e:
114-
log = logging.getLogger('main')
115-
log.error("Cannot initialize the application: " + str(e))
121+
logging.error("Cannot initialize the application: " + str(e))
116122
exit(1)
117123

118124
# set up the Elastic logger to be more verbose and run the indexer

0 commit comments

Comments
 (0)