Update record_indexer README and environment variables
amywieliczka committed Jun 25, 2024
1 parent 9237b15 commit 94c6390
Showing 7 changed files with 57 additions and 19 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -208,7 +208,7 @@ export MOUNT_CODEBASE=<path to rikolti, for example: /Users/awieliczka/Projects/
In order to run the indexer code, make sure the following variables are set:

```
-export RIKOLTI_ES_ENDPOINT= # ask for endpoint url
+export OPENSEARCH_ENDPOINT= # ask for endpoint url
```

Also make sure to set your temporary AWS credentials and the region so that the mwaa-local-runner container can authenticate when talking to the OpenSearch API:
2 changes: 1 addition & 1 deletion dags/shared_tasks/indexing_tasks.py
@@ -30,7 +30,7 @@ def update_stage_index_for_collection_task(
    print(
        f"\n\nReview indexed records at: https://rikolti-data.s3.us-west-2."
        f"amazonaws.com/index.html#{version.rstrip('/')}/data/ \n\n"
-        f"Or on opensearch at: {os.environ.get('RIKOLTI_ES_ENDPOINT')}"
+        f"Or on opensearch at: {os.environ.get('OPENSEARCH_ENDPOINT')}"
        "/_dashboards/app/dev_tools#/console with query:\n"
        f"{json.dumps(dashboard_query, indent=2)}\n\n\n"
    )
3 changes: 1 addition & 2 deletions env.example
@@ -35,9 +35,8 @@ export NUXEO_PASS=
export CONTENT_ROOT=file:///usr/local/airflow/rikolti_content

# indexer
-export RIKOLTI_ES_ENDPOINT= # ask for endpoint url
+export OPENSEARCH_ENDPOINT= # ask for endpoint url
export RIKOLTI_HOME=/usr/local/airflow/dags/rikolti
-export RIKOLTI_ES_STAGE_ALIAS=rikolti-stg

# indexer when run locally via aws-mwaa-local-runner
# export AWS_ACCESS_KEY_ID=
50 changes: 45 additions & 5 deletions record_indexer/README.md
@@ -1,16 +1,56 @@
-## Create OpenSearch index template
+# Indexing

-To create the [index template](https://www.elastic.co/guide/en/elasticsearch/reference/7.9/index-templates.html) for rikolti:
+We push all records that have been run through the Rikolti harvesting pipeline into an OpenSearch index.

-Make sure that RIKOLTI_ES_ENDPOINT is set in your environment.
Records must adhere strictly to the fields specified in [our index template](index_templates/record_index_config.py). Please review the [OpenSearch documentation on index templates](https://opensearch.org/docs/latest/im-plugin/index-templates/) for more information.

Our `record_indexer` component is designed to remove any fields that are not in our index template. The `record_indexer` indexes records by collection into indices identified by aliases.
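
As a rough illustration of the field-removal step described above, the sketch below filters a record down to an allowed set of field names. The function name `remove_unexpected_fields` and the sample field names are hypothetical and are not taken from the `record_indexer` code; it only shows the general idea of dropping keys that are not in the template.

```python
def remove_unexpected_fields(record: dict, allowed_fields: set) -> dict:
    """Return a copy of `record` containing only keys in `allowed_fields`."""
    return {key: value for key, value in record.items() if key in allowed_fields}


# Hypothetical field names, for illustration only.
allowed = {"title", "collection_id"}
record = {"title": "Example", "collection_id": "123", "scratch_note": "drop me"}

# The original record is left untouched; only the copy is filtered.
cleaned = remove_unexpected_fields(record, allowed)
print(cleaned)
```

Returning a filtered copy rather than mutating the input keeps the unfiltered record available for logging which fields were dropped.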

## Configuring the Record Indexer - AWS and Docker Options

The Record Indexer indexes records by sending requests to the configured `OPENSEARCH_ENDPOINT`, the API endpoint for an OpenSearch instance. Rikolti supports authenticating against an AWS-hosted OpenSearch endpoint (via IAM permissions and/or `AWS_*` environment variables) or using basic auth against a dev OpenSearch Docker container.

### AWS Hosted OpenSearch
If you're trying to set up the record_indexer to communicate with an AWS hosted OpenSearch instance, set the `OPENSEARCH_ENDPOINT` to the AWS-provided endpoint. Make sure your machine or your AWS account has access, and, if relevant, set the following environment variables: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_REGION`.

### Dev OpenSearch Docker Container
There is also a docker compose file for an OpenSearch dev environment. Run `docker-compose up` to start an OpenSearch instance with API access at https://localhost:9200. The default username is `admin` and the default password is `Rikolti_05`. See the [OpenSearch Docker Container Documentation](https://hub.docker.com/r/opensearchproject/opensearch) for details.

Send requests to the OpenSearch REST API to verify the docker container is working.

> By default, OpenSearch uses self-signed TLS certificates. The `-k` short option skips the certificate verification step so that requests don't fail.
```
curl -X GET "https://localhost:9200/_cat/indices" -ku admin:Rikolti_05
```
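
For reference, the same check can be sketched in Python using only the standard library. This is an illustrative equivalent of the `curl` call above, not code from this repository: the helper names are hypothetical, and disabling certificate verification (the counterpart of `-k`) is only appropriate for the local dev container.

```python
import base64
import ssl
import urllib.request


def build_cat_indices_request(endpoint: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for /_cat/indices with basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    url = endpoint.rstrip("/") + "/_cat/indices"
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def insecure_context() -> ssl.SSLContext:
    """TLS context that skips certificate checks, like `curl -k` (dev only)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before setting CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch_indices(endpoint: str, user: str, password: str) -> str:
    """Send the request; requires the dev container to be running."""
    req = build_cat_indices_request(endpoint, user, password)
    with urllib.request.urlopen(req, context=insecure_context()) as resp:
        return resp.read().decode()
```

With the container up, `print(fetch_indices("https://localhost:9200/", "admin", "Rikolti_05"))` should list the indices in the same way the `curl` command does.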

To use this docker container with the record indexer, you will have to configure:

```
export OPENSEARCH_USER=admin
export OPENSEARCH_PASS=Rikolti_05
export OPENSEARCH_IGNORE_TLS=True
```

**To use this OpenSearch docker container with mwaa-local-runner, set the previous values and the below endpoint in dags/startup.sh:**

```
export OPENSEARCH_ENDPOINT=https://host.docker.internal:9200/
```

## Initializing an OpenSearch instance to work with Rikolti

Create an index template for rikolti:

Make sure that `OPENSEARCH_ENDPOINT` and the relevant authentication variables are set in your environment.

```
python -m record_indexer.index_templates.rikolti_template
```

-This creates a template that will be used whenever an index with name matching `rikolti*` is added to the cluster.
+This creates a record template that will be used for documents in any index with a name matching `rikolti*` that is added to the cluster.

-## Run indexer from command line
+## Running the Record Indexer

TODO: We don't currently support running the indexer from the command line

8 changes: 4 additions & 4 deletions record_indexer/initialize_rikolti_indices.sh
@@ -5,10 +5,10 @@ printf "\n\n>>> Creating the 'rikolti-dev-index' index\n"
curl -X PUT "https://localhost:9200/rikolti-dev-index" -ku admin:Rikolti_05

printf "\n\n>>> Creating the 'rikolti' template\n"
-export RIKOLTI_ES_ENDPOINT=https://localhost:9200/
-export RIKOLTI_ES_USER=admin
-export RIKOLTI_ES_PASS="Rikolti_05"
-export RIKOLTI_ES_IGNORE_TLS=True
+export OPENSEARCH_ENDPOINT=https://localhost:9200/
+export OPENSEARCH_USER=admin
+export OPENSEARCH_PASS="Rikolti_05"
+export OPENSEARCH_IGNORE_TLS=True
python -m record_indexer.index_templates.rikolti_template

printf "\n>>> Creating the 'rikolti-stg' alias\n"
9 changes: 4 additions & 5 deletions record_indexer/settings.py
@@ -6,11 +6,11 @@

load_dotenv()

-es_user = os.environ.get("RIKOLTI_ES_USER")
-es_pass = os.environ.get("RIKOLTI_ES_PASS")
+es_user = os.environ.get("OPENSEARCH_USER")
+es_pass = os.environ.get("OPENSEARCH_PASS")

def verify_certs():
-    return not os.environ.get("RIKOLTI_ES_IGNORE_TLS", False)
+    return not os.environ.get("OPENSEARCH_IGNORE_TLS", False)

def get_auth():
    if es_user and es_pass:
@@ -23,7 +23,6 @@ def get_auth():
credentials, os.environ.get("AWS_REGION", "us-west-2"))


-ENDPOINT = os.environ.get("RIKOLTI_ES_ENDPOINT", False)
+ENDPOINT = os.environ.get("OPENSEARCH_ENDPOINT", False)
if ENDPOINT:
    ENDPOINT = ENDPOINT.rstrip("/")
-STAGE_ALIAS = os.environ.get("RIKOLTI_ES_STAGE_ALIAS")
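
One thing to note about `verify_certs` as written: `os.environ.get` returns a string, and any non-empty string is truthy in Python, so setting `OPENSEARCH_IGNORE_TLS=False` would still disable certificate verification. A minimal sketch of stricter parsing follows. The helper `env_flag`, its `environ` parameter, and the accepted spellings are assumptions for illustration, not part of this commit.

```python
import os

# Spellings treated as "on"; this list is an assumption, adjust as needed.
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Interpret an environment variable as a boolean flag."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_VALUES


def verify_certs(environ=None):
    """Verify TLS certificates unless OPENSEARCH_IGNORE_TLS is explicitly on."""
    return not env_flag("OPENSEARCH_IGNORE_TLS", environ=environ)
```

Passing a plain dict as `environ` makes the behaviour easy to check without touching the real environment, for example `verify_certs({"OPENSEARCH_IGNORE_TLS": "False"})`.
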
2 changes: 1 addition & 1 deletion record_indexer/update_stage_index.py
@@ -7,7 +7,7 @@

def update_stage_index_for_collection(collection_id: str, version_pages: list[str]):
    ''' update stage index with a new set of collection records '''
-    index = get_index_for_alias(settings.STAGE_ALIAS)
+    index = get_index_for_alias("rikolti-stg")

    # delete existing records
    delete_collection_records_from_index(collection_id, index)