diff --git a/documentation/ingestion_server/guides/index.md b/documentation/ingestion_server/guides/index.md index 43e211488e3..77646be9728 100644 --- a/documentation/ingestion_server/guides/index.md +++ b/documentation/ingestion_server/guides/index.md @@ -8,4 +8,7 @@ config mapping test deploy +migrate +upgrade +troubleshoot ``` diff --git a/documentation/ingestion_server/guides/migrate.md b/documentation/ingestion_server/guides/migrate.md new file mode 100644 index 00000000000..9de9fa58b62 --- /dev/null +++ b/documentation/ingestion_server/guides/migrate.md @@ -0,0 +1,256 @@ +# Index migration runbook + +From time to time, we will need to update our Elasticsearch indices. These +modifications can be classified into two broad-strokes categories, depending on +whether the changes affect the main consumer of the indices, the API. + +## Migration types + +### API-free + +These changes are safe modifications to the ES schema that do not affect the +API. As such they do not need any migration process. Examples: + +- addition of new fields or subfields +- removal of fields that are not referenced or used by the API +- changing the type to another compatible type (like `text` ↔ `keyword` ) + +For API-free changes, we deploy the ingestion server and perform one of the two: + +- standard data-refresh (either triggered manually or as scheduled) +- [manual index upgrade](/ingestion_server/guides/upgrade.md) + +The indices will be updated to the new schema and will be made available to the +API. + +### API-involved + +These changes are modifications to fields that already are in use by the API and +involve code changes in both the ingestion server and the API. 
Examples: + +- removal of a field +- changing the type to an incompatible type +- renaming of a field + +Changes of this kind require us to deploy the API in precise coordination with +the promotion of the new index, for these reasons: + +- If the API deployment lags behind index promotion, the old field that the API + uses will disappear. +- If the API deployment runs ahead of index promotion, the new field the API + uses will not be present. + +This runbook documents guidelines and processes for API-involved migrations. + +Our goal is to break down an API-involved change into multiple small, atomic +changes, with each step affecting at most one of the ingestion server or the API +and ensuring that the API and ES remain compatible throughout the process. + +## Pull request guidelines + +A change that involves modification to the ES index as well as its usage in the +API requires at least three steps, each associated with exactly one PR that +modifies exactly one of the ingestion server or the API to allow them to be +deployed independently. + +1. Change the ES index mapping in the ingestion server. Ensure that the change + is purely additive, keeping the old fields unchanged and creating new fields + that contain the data the API will need. + + This PR should make changes only within the `ingestion_server/` directory, + more specifically the following two files concerned with ES mappings and + document schemas: + + - [`es_mapping.py`](https://github.com/WordPress/openverse/tree/main/ingestion_server/ingestion_server/es_mapping.py) + - [`elasticsearch_models.py`](https://github.com/WordPress/openverse/tree/main/ingestion_server/ingestion_server/elasticsearch_models.py) + +2. Update the API code to reference and use the new ES fields added in the + previous step. Ensure that the old fields become unreferenced. + + The PR should make changes only within the `api/` directory. + +3.
Change the ES index mapping in the ingestion server to remove the old, + now-unreferenced fields. + + Like PR number 1, this PR should also make changes only within the + `ingestion_server/` directory. + +```{tip} +Get the PRs reviewed in advance so that the entire process has been vetted by +the team and there are no surprises or delays when the plans have been set into +motion. +``` + +```{caution} +Each PR in the chain should branch from, and point to, its predecessor in the +chain so that CI continues to pass for each PR. +``` + +### Example + +Assume we have a field `foo` with type `text` in the index. It has a subfield +`keyword` with type `keyword`. The API uses `foo.keyword` for all purposes. We +want the `foo` field to have type `keyword` and for the API to use `foo` instead +of `foo.keyword`. To accomplish this without downtime, we need three PRs: + +1. Changing `foo` to type `keyword` would be an API-free change because it is a + type change between two compatible types and does not affect the nested field + `foo.keyword` that is in use by the API. Technically the outer field can be + assumed to be "new" because it was not being used at all. + +2. Then we make an API change to use `foo` directly instead of `foo.keyword`. + Any other accommodations to make use of `foo` can be made in this step. In + this case `foo` will be the same as `foo.keyword` so no other changes will be + needed. + +3. Removal of the `foo.keyword` field would now also be an API-free change + because the field would no longer be in use. + +## Migration process + +The entire migration process can be classified into 3 phases. + +```{mermaid} +flowchart TD + subgraph api[API] + API + end + + subgraph elasticsearch[Elasticsearch] + image --> image-old + image-filtered --> image-old-filtered + audio --> audio-old + audio-filtered --> audio-old-filtered + end + + API --> image + API --> image-filtered + API --> audio + API --> audio-filtered +``` + +### Create the new fields + +1. 
Merge [PR number 1](#pull-request-guidelines). +2. Perform a [manual index upgrade](/ingestion_server/guides/upgrade.md). + +At the close of this phase we have all the new information for the API to use. + +```{mermaid} +flowchart TD + subgraph api[API] + API + end + + subgraph elasticsearch[Elasticsearch] + image -.-> image-old + image-filtered -.-> image-old-filtered + audio -.-> audio-old + audio-filtered -.-> audio-old-filtered + image --> image-mid + image-filtered --> image-mid-filtered + audio --> audio-mid + audio-filtered --> audio-mid-filtered + end + + API --> image + API --> image-filtered + API --> audio + API --> audio-filtered + + style image-old opacity:0.3 + style image-old-filtered opacity:0.3 + style audio-old opacity:0.3 + style audio-old-filtered opacity:0.3 +``` + +### Use the new fields instead of the old + +1. Merge [PR number 2](#pull-request-guidelines). This will automatically deploy + the API to staging. +2. Verify that the staging API continues to work. +3. [Deploy the API](/api/guides/deploy.md) to production. +4. Verify that the production API continues to work. + +At the close of this phase the API is exclusively using the new fields and the +old ones have become unreferenced. + +```{mermaid} +flowchart TD + subgraph api[API] + old[API] + new[New API] + end + + subgraph elasticsearch[Elasticsearch] + image --> image-mid + image-filtered --> image-mid-filtered + audio --> audio-mid + audio-filtered --> audio-mid-filtered + end + + old -.-> image + old -.-> image-filtered + old -.-> audio + old -.-> audio-filtered + new --> image + new --> image-filtered + new --> audio + new --> audio-filtered + + style old opacity:0.3 +``` + +### Remove the old fields + +1. Merge [PR number 3](#pull-request-guidelines). +2. Perform a [manual index upgrade](/ingestion_server/guides/upgrade.md). 
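
As a rough illustration of the three phases, the sketch below models the `foo` example from earlier as plain Python dictionaries shaped like Elasticsearch mapping properties, and checks that the field the API reads is mapped at every stage. The stage labels and the helpers `field_exists` and `incompatible_stages` are hypothetical and exist only for this illustration; they are not part of the ingestion server or the API, and the dictionaries are not the project's real mapping.

```python
# Hypothetical model of the `foo` example; not the project's real mapping.
TEXT_WITH_KEYWORD = {
    "foo": {"type": "text", "fields": {"keyword": {"type": "keyword"}}}
}
KEYWORD_WITH_KEYWORD = {
    "foo": {"type": "keyword", "fields": {"keyword": {"type": "keyword"}}}
}
KEYWORD_ONLY = {"foo": {"type": "keyword"}}

# (stage label, mapping properties in ES, field path the API queries)
STAGES = [
    ("before migration", TEXT_WITH_KEYWORD, "foo.keyword"),
    ("after PR 1 and index upgrade", KEYWORD_WITH_KEYWORD, "foo.keyword"),
    ("after PR 2 and API deployment", KEYWORD_WITH_KEYWORD, "foo"),
    ("after PR 3 and index upgrade", KEYWORD_ONLY, "foo"),
]


def field_exists(properties, path):
    """Return True if a top-level field, or one of its subfields, is mapped."""
    name, _, subfield = path.partition(".")
    field = properties.get(name)
    if field is None:
        return False
    if not subfield:
        return True
    return subfield in field.get("fields", {})


def incompatible_stages(stages):
    """List the stages at which the API would query an unmapped field."""
    return [
        label
        for label, properties, api_field in stages
        if not field_exists(properties, api_field)
    ]


if __name__ == "__main__":
    for label, properties, api_field in STAGES:
        status = "ok" if field_exists(properties, api_field) else "BROKEN"
        print(f"{label}: API reads {api_field!r}: {status}")
```

If a stage were skipped, for example by applying the mapping from PR 3 while the API still reads `foo.keyword`, `incompatible_stages` would report it, which is the situation the phased process is designed to avoid.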
+ +```{mermaid} +flowchart TD + subgraph api[API] + new[New API] + end + + subgraph elasticsearch[Elasticsearch] + image -.-> image-mid + image-filtered -.-> image-mid-filtered + audio -.-> audio-mid + audio-filtered -.-> audio-mid-filtered + image --> image-final + image-filtered --> image-final-filtered + audio --> audio-final + audio-filtered --> audio-final-filtered + end + + new --> image + new --> image-filtered + new --> audio + new --> audio-filtered + + style image-mid opacity:0.3 + style image-mid-filtered opacity:0.3 + style audio-mid opacity:0.3 + style audio-mid-filtered opacity:0.3 +``` + +You're done! + +```{mermaid} +flowchart TD + subgraph api[API] + new[New API] + end + + subgraph elasticsearch[Elasticsearch] + image --> image-final + image-filtered --> image-final-filtered + audio --> audio-final + audio-filtered --> audio-final-filtered + end + + new --> image + new --> image-filtered + new --> audio + new --> audio-filtered +``` diff --git a/documentation/ingestion_server/guides/troubleshoot.md b/documentation/ingestion_server/guides/troubleshoot.md new file mode 100644 index 00000000000..7a50af95bfa --- /dev/null +++ b/documentation/ingestion_server/guides/troubleshoot.md @@ -0,0 +1,59 @@ +# Troubleshooting + +This guide describes various manual steps to troubleshoot issues with the +ingestion server's processes, such as database transfer and ES indexing. + +## Interrupt indexing + +The ingestion server performs indexing using indexer workers, whose primary +purpose is to create documents from the API database and index them in +Elasticsearch. + +Indexer workers are EC2 instances that are stopped by default when indexing is +not taking place. The ingestion server starts them, provides them with the +necessary information to perform the indexing, and once they report back to the +ingestion server with a completion message, they are shut down again.
+ +Sometimes it is necessary to manually interrupt indexing, for example to limit +the size of a test/staging index. To do so, follow these steps. + +1. Determine the active ingestion worker machines from the AWS EC2 dashboard. + They will be named `indexer-worker-(dev|prod)` and will be in the "running" + state. + +2. SSH into the machine using its public IP. + + ```console + $ ssh ec2-user@ + ``` + +3. Determine the name of the active `indexer_worker` container and pause it. + + ```console + $ docker ps + $ docker pause + ``` + +4. Repeat steps 2 and 3 for each active ingestion worker machine. Leave the SSH + sessions open. + +5. Wait for a few minutes and keep an eye on the document count in the + Elasticsearch index that is currently being created. It may increase a + little because of timing effects but should stop after a few minutes. + +6. From each of the open SSH sessions, send a completion notification to the + ingestion server's internal IP address. + + ```console + $ curl \ + -X POST \ + -H "Content-Type: application/json" \ + -d '{"error":false}' \ + http://:8001/worker_finished + ``` + +7. Terminate the SSH sessions and stop the indexer worker EC2 machines from the + AWS EC2 dashboard. + +8. The ingestion server will then instruct ES to start the next step of indexing, + i.e. replication. diff --git a/documentation/ingestion_server/guides/upgrade.md b/documentation/ingestion_server/guides/upgrade.md new file mode 100644 index 00000000000..08e7711ed0a --- /dev/null +++ b/documentation/ingestion_server/guides/upgrade.md @@ -0,0 +1,117 @@ +# Manual index upgrade runbook + +A manual index upgrade is similar to a data-refresh except for two key +differences. + +- Each step of the process, from index creation, filtered index creation and + then promotion is done manually by SSH-ing into the ingestion server and + indexer workers. +- It is faster than a complete data-refresh as there is no transfer of data from + the catalog database to the API database.
+ +## Steps + +### Staging deployment + +1. [Deploy the ingestion server](/ingestion_server/guides/deploy.md) to staging. + This step ensures that the latest schema will be used for the new indices. + +2. Determine the real names of the indices behind the following aliases. + + - `image` + - `image-filtered` + - `audio` + - `audio-filtered` + + This information is useful to know what index to use if a rollback is needed + and to know what index to delete once the upgrade is complete. You can use + [Elasticvue](https://elasticvue.com) for this. + + ```{tip} + In staging, the filtered indices may also point to the default index. + ``` + +3. Perform [reindexing](/ingestion_server/reference/task_api.md#reindex) of all + media types. Let's say you use the suffix `abcd` for these indices. New + indices for each media type, like `image-abcd` and `audio-abcd`, will be + created. + + ```{caution} + Staging indices are supposed to be smaller and should not have the same + number of documents as the production dataset. You can + [interrupt the indexing process](/ingestion_server/guides/troubleshoot.md#interrupt-indexing) + once a satisfactory fraction (like ~50%) has been indexed. + ``` + + Wait for the indices to be replicated (and status green) before proceeding. + +4. [Point aliases](/ingestion_server/reference/task_api.md#point_alias) (both + default and filtered) for each media type to the new index. + + - `image` → `image-abcd` + - `image-filtered` → `image-abcd` + - `audio` → `audio-abcd` + - `audio-filtered` → `audio-abcd` + +5. Verify that the staging API continues to work. + + - If the staging API reports errors, immediately switch back the aliases to + the old indices. + - If the staging API works, + [delete the old indices](/ingestion_server/reference/task_api.md#delete_index) + to recover the free space. + +### Production deployment + +1. [Deploy the ingestion server](/ingestion_server/guides/deploy.md) to + production. 
This step ensures that the latest schema will be used for the new + indices. + +2. Determine the real names of the indices behind the following aliases. + + - `image` + - `image-filtered` + - `audio` + - `audio-filtered` + + This information is useful to know what index to use if a rollback is needed + and to know what index to delete once the upgrade is complete. You can use + [Elasticvue](https://elasticvue.com) for this. + +3. Perform [reindexing](/ingestion_server/reference/task_api.md#reindex) of both + media types. Let's say you use the suffix `abcd` for these indices. New + indices for each media type, like `image-abcd` and `audio-abcd`, will be + created. + + Wait for the indices to be replicated (and status green) before proceeding. + +4. Perform + [creation of filtered indices](/ingestion_server/reference/task_api.md#create_and_populate_filtered_index) + for all media types. New filtered indices for each media type, like + `image-abcd-filtered` and `audio-abcd-filtered`, will be created. + + Wait for the indices to be replicated (and status green) before proceeding. + +5. [Point aliases](/ingestion_server/reference/task_api.md#point_alias) for each + media type to the new default and filtered indices. + + - `image` → `image-abcd` + - `image-filtered` → `image-abcd-filtered` + - `audio` → `audio-abcd` + - `audio-filtered` → `audio-abcd-filtered` + +6. Verify that the production API continues to work. + + - If the production API reports errors, immediately switch back the aliases + to the old indices. + - If the production API works, + [delete the old indices](/ingestion_server/reference/task_api.md#delete_index) + to recover the free space. + +## Rollback + +In this process, we are creating the new indices first, remapping the aliases, +and then removing the old indices. So if there is an issue with the new indices, +we can immediately switch back the aliases to the old ones and restore +functionality. 
Then we can investigate the new indices, as they will still +be present, just unused. diff --git a/documentation/ingestion_server/reference/index.md b/documentation/ingestion_server/reference/index.md index 8b5749210fe..d4d53c4b025 100644 --- a/documentation/ingestion_server/reference/index.md +++ b/documentation/ingestion_server/reference/index.md @@ -7,4 +7,5 @@ elasticsearch safety notifications data_refresh +task_api ``` diff --git a/documentation/ingestion_server/reference/task_api.md b/documentation/ingestion_server/reference/task_api.md new file mode 100644 index 00000000000..b2e23b3526f --- /dev/null +++ b/documentation/ingestion_server/reference/task_api.md @@ -0,0 +1,125 @@ +# Ingestion server API + +The ingestion server exposes an API at the `/task` endpoint to schedule various +tasks and get updates about their status and progress. + +New tasks can be created using the `POST` method and a payload as described +below. The response for this request provides an endpoint (containing a task's +unique ID) that can be used to retrieve the task information using the `GET` +method. + +## REINDEX + +If a complete data-refresh is not required, a new index can be created using the +`REINDEX` action. This action will create a new index for the given media type +using the data from the API database. A suffix can be provided for the index; +otherwise a random UUID will be used. + +### Example + +```console +$ curl \ + -X POST \ + -H 'Content-Type: application/json' \ + -d '{"model": "image", "action": "REINDEX", "index_suffix": "20231106"}' \ + http://localhost:8001/task +``` + +## CREATE_AND_POPULATE_FILTERED_INDEX + +This endpoint creates a filtered index for a media type out of an existing +index. A `REINDEX` job must be followed by this job to ensure that the new index +has an associated filtered index as well before we promote it.
+ +### Body + +```typescript +{ + model: "image" | "audio" + action: "CREATE_AND_POPULATE_FILTERED_INDEX" + destination_index_suffix: string +} +``` + +```{caution} +Destination suffix here implies the suffix of the existing unfiltered index. +The filtered index will be created with "-filtered" appended to the destination +suffix. +``` + +### Example + +```console +$ curl \ + -X POST \ + -H 'Content-Type: application/json' \ + -d '{"model": "image", "action": "CREATE_AND_POPULATE_FILTERED_INDEX", "destination_index_suffix": "20231106"}' \ + http://localhost:8001/task +``` + +## POINT_ALIAS + +This endpoint maps the index to a given alias. When an index is aliased to the +name of the media type (`image` or `audio`) or name + "-filtered", it becomes +the default or filtered index for that media type respectively. + +### Body + +```typescript +{ + model: "image" | "audio" + action: "POINT_ALIAS" + index_suffix: string + alias: string // should be model or model + "-filtered" +} +``` + +### Example + +```console +$ curl \ + -X POST \ + -H 'Content-Type: application/json' \ + -d '{"model": "image", "action": "POINT_ALIAS", "index_suffix": "20231106", "alias": "image-filtered"}' \ + http://localhost:8001/task +``` + +## DELETE_INDEX + +This endpoint deletes the given index for a given media type. + +```{danger} +Index deletion is an irreversible destructive operation. Please ensure that you +do not delete an index that is currently in use as the default or filtered index +for a media type. +``` + +### Body + +```typescript +{ + model: "image" | "audio" + action: "DELETE_INDEX" + index_suffix: string +} +``` + +### Example + +```console +$ curl \ + -X POST \ + -H 'Content-Type: application/json' \ + -d '{"model": "image", "action": "DELETE_INDEX", "index_suffix": "20231106"}' \ + http://localhost:8001/task +```
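
## Combining actions

The four actions above share one request shape, so a sequence of them can be scripted. The following is a minimal sketch, assuming the ingestion server is reachable at `http://localhost:8001` as in the curl examples above. The helper names `task_payload`, `production_upgrade_plan` and `submit_task` are hypothetical and are not part of the ingestion server. The sketch also assumes that an index name is the model followed by `-` and the suffix, so that the filtered index created from suffix `abcd` is addressed with the suffix `abcd-filtered`; verify this against the server before relying on it.

```python
import json
import urllib.request

# Assumption: same host and port as the curl examples on this page.
TASK_URL = "http://localhost:8001/task"


def task_payload(model, action, **fields):
    """Build the JSON body for one task, e.g. REINDEX or POINT_ALIAS."""
    return {"model": model, "action": action, **fields}


def production_upgrade_plan(suffix, models=("image", "audio")):
    """Return payloads in runbook order: reindex, filtered index, aliases."""
    plan = [task_payload(m, "REINDEX", index_suffix=suffix) for m in models]
    plan += [
        task_payload(
            m,
            "CREATE_AND_POPULATE_FILTERED_INDEX",
            destination_index_suffix=suffix,
        )
        for m in models
    ]
    for m in models:
        # Default alias points to the new unfiltered index.
        plan.append(task_payload(m, "POINT_ALIAS", index_suffix=suffix, alias=m))
        # Assumption: the filtered index is addressed as "<suffix>-filtered".
        plan.append(
            task_payload(
                m,
                "POINT_ALIAS",
                index_suffix=f"{suffix}-filtered",
                alias=f"{m}-filtered",
            )
        )
    return plan


def submit_task(payload, url=TASK_URL):
    """POST one payload to the task endpoint and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Printing `json.dumps(production_upgrade_plan("abcd"), indent=2)` gives a plan that can be reviewed before anything is sent. The sketch does not wait for a task to finish, so each task's status still needs to be checked (and the indices confirmed green) before submitting the next payload with `submit_task`.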