Index version path #1012

Merged
merged 15 commits into from
Jul 11, 2024

@amywieliczka (Collaborator) commented Jun 25, 2024

Schema updates and changes to the existing stage indexing processes:

  • Rename update_stage_index to index_collection and add alias as a parameter to index_collection to generalize this function.
  • Rename add_page to index_page, change the bulk OpenSearch action from create to index, and pass refresh: true as a parameter to the bulk request. The index action creates documents that don't already exist and overwrites documents that do. The refresh: true parameter makes re-indexed documents visible to search immediately (at an index-time cost to the cluster), so that the delete by query request that follows runs against the most up-to-date version of all records.
  • Add page, version_path, and indexed_at to records just prior to indexing. indexed_at is defined as the datetime at the start of indexing.
  • Move the invocation of delete_collection_records_from_index to happen just after the bulk indexing, and update delete_collection_records_from_index to first query for "outdated records" - records of the given collection ID, that don't match the given data version - before then deleting all these outdated records. The query helps us report out the versions of each outdated record.
  • Add the version and index to the SNS event sent to the registry at the end of the Airflow update_stage_index_for_collection_task.
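
The indexing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the helper names (`stamp_records`, `build_bulk_body`, `build_outdated_query`), the `id` and `collection_id` field names, and all sample values are hypothetical. Only the `index` bulk action, the `refresh` parameter, and the three added fields come from the description.

```python
import json


def stamp_records(records, page, version_path, indexed_at):
    """Return copies of the records with page, version_path, and indexed_at added."""
    return [
        {**record, "page": page, "version_path": version_path,
         "indexed_at": indexed_at}
        for record in records
    ]


def build_bulk_body(records, index):
    """Build an NDJSON body for the _bulk API using the `index` action.

    `index` creates a document when the _id is new and overwrites it
    otherwise; `create` would fail on an _id that already exists.
    Assumes each record carries its document ID under "id" (hypothetical).
    """
    lines = []
    for record in records:
        action = {"index": {"_index": index, "_id": record["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline


def build_outdated_query(collection_id, version_path):
    """Match records of one collection whose version_path differs.

    The `collection_id` field name is an assumption for illustration.
    """
    return {
        "query": {
            "bool": {
                "filter": [{"term": {"collection_id": collection_id}}],
                "must_not": [{"term": {"version_path": version_path}}],
            }
        }
    }
```

In this sketch the bulk body would be sent to `/_bulk?refresh=true`, and the query body would be sent first to `_search` (to report the versions of outdated records) and then to `_delete_by_query`.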

Migration script:

  • Adds the values version_path: initial, indexed_at: <time migration script started>, and page: unknown to records already in the index, via a re-index.
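
One way such a migration could be expressed is as a `_reindex` request with a Painless script that sets the three fields on each document. This is a sketch only: the PR description says the values are added "via a re-index" but does not show how, so the function name, the index names, and the choice of a scripted `_reindex` are all assumptions.

```python
def build_migration_reindex_body(source_index, dest_index, started_at):
    """Build a _reindex request body that adds default field values.

    started_at is the time the migration began, e.g. an ISO 8601 string.
    """
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "script": {
            "lang": "painless",
            # Set the three new fields on every document as it is copied.
            "source": (
                "ctx._source.version_path = params.version_path; "
                "ctx._source.indexed_at = params.indexed_at; "
                "ctx._source.page = params.page;"
            ),
            "params": {
                "version_path": "initial",
                "indexed_at": started_at,
                "page": "unknown",
            },
        },
    }
```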

Publish Processes:

  • Renamed update_stage_index_for_collection_task to index_collection_task to generalize between -stg and -prd index aliases.
  • Added stage_collection_task and publish_collection_task, which both call index_collection_task with a different alias.
  • Created a publish_collection DAG
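
The relationship between the generalized task and its two callers might look like the following plain-Python sketch. It leaves out Airflow entirely and returns a summary instead of contacting OpenSearch; the function signatures and the returned dictionary are hypothetical, and only the alias names and the `verbed` expression are taken from this PR.

```python
def index_collection(alias, collection_id, version_path):
    """Hypothetical stand-in for the generalized indexing step."""
    verbed = "published" if alias == "rikolti-prd" else "staged"
    return {
        "alias": alias,
        "collection_id": collection_id,
        "version": version_path,
        "message": f"Collection {collection_id} {verbed} at {version_path}",
    }


def stage_collection(collection_id, version_path):
    """Index into the stage alias."""
    return index_collection("rikolti-stg", collection_id, version_path)


def publish_collection(collection_id, version_path):
    """Index into the production alias."""
    return index_collection("rikolti-prd", collection_id, version_path)
```

Keeping the alias as the only difference between the two wrappers means that any fix to the indexing logic applies to both stage and production.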

Pooling:

  • Specifies that all actions hitting OpenSearch should run in the rikolti_opensearch_pool. Since we have just one cluster across all stage and prod indices, every Airflow task hitting OpenSearch should be added to this pool. We can configure the pool using the Airflow UI, and should monitor the OpenSearch cluster's performance using the CloudWatch Dashboard.

Developer Candy:

  • Adds a dashboard to the docker-compose file for the record_indexer, for developer ease. To run the record_indexer locally, you must set OPENSEARCH_IGNORE_TLS=True in the environment.
  • Adds an initialization script to add the rikolti-stg and rikolti-prd aliases to a new opensearch cluster (as one would get when running a new docker compose).
  • Updates the README.
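
As a rough sketch of what the alias initialization and the TLS setting could involve: the request body below follows the OpenSearch `_aliases` API, but the helper names, the alias-to-index mapping, and the way the environment variable is parsed are assumptions, not the script added in this PR.

```python
import os


def build_alias_actions(alias_to_index):
    """Build an _aliases request body that adds each alias to its index.

    The target indices must already exist before the request is sent.
    """
    return {
        "actions": [
            {"add": {"index": index, "alias": alias}}
            for alias, index in alias_to_index.items()
        ]
    }


def ignore_tls(environ=None):
    """Return True when OPENSEARCH_IGNORE_TLS is set to a truthy string.

    The accepted values ("true", "1") are an assumption for illustration.
    """
    environ = os.environ if environ is None else environ
    return environ.get("OPENSEARCH_IGNORE_TLS", "").lower() in ("true", "1")
```

For example, `build_alias_actions({"rikolti-stg": "some-index", "rikolti-prd": "some-index"})` yields a body with two `add` actions, where `some-index` is a placeholder for whichever index a new local cluster starts with.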

@amywieliczka amywieliczka force-pushed the index-version-path branch 2 times, most recently from 69400d3 to dacc492 Compare June 26, 2024 00:26
@amywieliczka amywieliczka marked this pull request as ready for review June 27, 2024 21:22
)
verbed = "published" if alias == 'rikolti-prd' else "staged"
Contributor

You verbed the noun verb!

bibliotechy previously approved these changes Jul 8, 2024
@bibliotechy
Contributor

@amywieliczka This looks great. Glad we are using pools!

description=(
"Creates an empty index at rikolti-<name>; if no name "
"provided, uses the current timestamp. Adds the index to "
"the rikolti-stg alias."
Collaborator

Super minor thing, but this description isn't quite right.

barbarahui previously approved these changes Jul 9, 2024
@amywieliczka amywieliczka dismissed stale reviews from barbarahui and bibliotechy via 60ab54c July 11, 2024 20:15
@amywieliczka amywieliczka merged commit efbf06e into main Jul 11, 2024
2 checks passed