The purpose of this service is to:
- Provide an API for searching offender records in NOMIS via Elastic Search (ES)
- Keep the Elastic Search (ES) prison index up to date with changes from Prison systems (NOMIS)
- Rebuild the index when required without an outage
This service subscribes to the prison offender events. When an event is received a message is put onto the event queue. The event queue then processes that message: the latest offender record is retrieved via the `prison-api` and upserted into the offender index. If the message processing fails then the message is transferred onto the event dead letter queue (DLQ).
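When debugging this flow locally it can help to peek at what is sitting on a queue. The sketch below uses the AWS CLI against the localstack setup described later in this README; the queue name is illustrative, the real names are whatever `localstack/setup-sns.sh` creates.

```bash
# Illustrative queue URL - check localstack/setup-sns.sh for the queues actually created
QUEUE_URL=http://localhost:4566/queue/prisoner_offender_index_dlq

# Approximate number of messages waiting and in flight
aws --endpoint-url=http://localhost:4566 sqs get-queue-attributes \
  --queue-url "$QUEUE_URL" \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible

# Peek at a message body (it becomes visible again after the visibility timeout)
aws --endpoint-url=http://localhost:4566 sqs receive-message --queue-url "$QUEUE_URL"
```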
This service maintains two indexes, `prisoner-search-a` and `prisoner-search-b`, known in the code as `INDEX_A` and `INDEX_B`.

In normal running one of these indexes will be "active" while the other is dormant and not in use.

When we are ready to rebuild the index, the "other", non-active index is transitioned into an `in-progress` state of `true`.
PUT /prisoner-index/build-index
The entire NOMIS offender base is retrieved and over several hours the other index is fully populated.
Once the index has finished, if there are no errors then the [housekeeping cronjob](#housekeeping-cronjob) will mark the index as complete and switch to the new index.
If the index build fails - there are messages left on the index dead letter queue - then the new index will remain inactive until the DLQ is empty. It may take user intervention to clear the DLQ if some messages are genuinely unprocessable (rather than just failed due to e.g. network issues).
Two ES runtime exceptions, ElasticsearchException and ElasticSearchIndexingException, are caught during the re-indexing process to safeguard the integrity of the index status. Once caught, the inError status flag is set on the IndexStatus. The flag ensures that manipulation of the index is forbidden when in this state. Only cancelling the index process will reset the flag and subsequently allow a rebuild of the index to be invoked.
PUT /prisoner-index/cancel-index
Given that the state of each index (including its `in-progress` flag) is itself held in ES as a single "document", when the INDEX_A/INDEX_B indexes switch there are actually two changes:
- The document in `offender-index-status` is updated to indicate which index is currently active
- The ES `current-index` is switched to point at the active index. This means external clients can safely use the `offender` index without any knowledge of the INDEX_A/INDEX_B indexes (a quick way to check this is shown below).
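A quick way to check the alias and the status document, assuming the local localstack Elastic Search on port 4571 (use the es-proxy described further down for real environments):

```bash
# List aliases and the physical index each one points at
curl -s 'http://localhost:4571/_cat/aliases?v'

# Read the single status "document" held in the offender-index-status index
curl -s 'http://localhost:4571/offender-index-status/_search?pretty'
```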
Indexes can be switched without rebuilding if they are both marked as `"inProgress": false` and `"inError": false`.

PUT /prisoner-index/switch-index
There is a Kubernetes CronJob which runs on a schedule to perform the following tasks:
- Checks if an index build has completed and if so then marks the build as complete (which switches the search to the new index)
- A threshold is set for each environment (in the helm values file) and the index will not be marked as complete until this threshold is met. This is to prevent switching to an index that does not look correct; in that case manual intervention will be required to complete the index build (e.g. calling the `/mark-complete` endpoint manually).
The CronJob calls the endpoint `/prisoner-index/queue-housekeeping`, which is not secured by Spring Security. To prevent external calls to the endpoint it has been secured in the ingress instead.
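To confirm the housekeeping job is running in an environment, the standard kubectl commands are enough (the dev namespace is used here, as elsewhere in this README):

```bash
# Show the CronJobs in the namespace together with their last schedule time
kubectl -n prisoner-offender-search-dev get cronjobs

# Look at the pods created by recent runs of the job
kubectl -n prisoner-offender-search-dev get pods --sort-by=.metadata.creationTimestamp | tail
```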
`localstack` is used to emulate the AWS SQS and Elastic Search services. Any commands in `localstack/setup-sns.sh` and `localstack/setup-es.sh` will be run when `localstack` starts, so these contain the commands to create the appropriate queues.
Localstack listens on two main ports: 4566 for SNS and SQS, and 4571 for Elastic Search.
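A couple of quick sanity checks once localstack is up, assuming the default ports above:

```bash
# Confirm the queues defined in localstack/setup-sns.sh have been created
aws --endpoint-url=http://localhost:4566 sqs list-queues

# Confirm the embedded Elastic Search is responding
curl -s 'http://localhost:4571/_cluster/health?pretty'
```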
Unfortunately localstack needs to be started differently depending on whether you are going to run prisoner offender search in a docker container, or in IntelliJ and in tests.

If running search in docker, `ES_HOSTNAME` needs to be set to `localstack`, otherwise it should be set to `localhost`. This is because when clients connect Elastic Search returns a URL for subsequent calls, and that hostname is different when running in docker versus connecting from a laptop.
The Elastic Search part of localstack takes a long time to start and will only be fully up and running once you see

```
[INFO] Running on http://0.0.0.0:4571 (CTRL + C to quit)
```

in the localstack logs.
Starting the services is therefore a two-step process:
- Start everything apart from prisoner offender search and wait for localstack to start fully
- Start prisoner offender search
To start up localstack and other dependencies, with prisoner offender search running in docker too:

```
docker-compose up localstack oauth prisonapi
```

Once localstack has started then run

```
docker-compose up --detach
```

to start prisoner offender search too. To check that it has all started correctly go to http://localhost:8080/health and check that the status is `UP`.
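The same check can be scripted, assuming the standard Spring Boot health response:

```bash
# Should print UP once everything has started
curl -s http://localhost:8080/health | jq -r .status
```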
To start up localstack and other dependencies with prisoner offender search running in IntelliJ:

```
ES_HOSTNAME=localhost docker-compose up --scale prisoner-offender-search=0
```

To then run prisoner offender search from the command line:

```
SPRING_PROFILES_ACTIVE=dev ./gradlew bootRun
```

Alternatively create a Spring Boot run configuration with an active profile of `dev` and main class `uk.gov.justice.digital.hmpps.prisonersearch.PrisonerOffenderSearch`.
If just running the tests then

```
docker-compose -f docker-compose-localstack-tests.yml up
```

will just start localstack, as the other dependencies are mocked out.

```
./gradlew test
```

will then run all the tests.
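While iterating it can be quicker to run a subset of the tests; the class name pattern below is purely illustrative:

```bash
# Run only the tests whose class name matches the pattern
./gradlew test --tests "*PrisonerSearch*"
```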
Since localstack persists data between runs it may be necessary to delete the localstack temporary data:

Mac:

```
rm -rf $TMPDIR/data
```

Linux:

```
sudo rm -rf /tmp/localstack
```
Please note the above will not work on a Mac using Docker Desktop since docker network host mode is not supported on a Mac.

For a Mac it is recommended to run all components except prisoner-offender-search (see below) and then run prisoner-offender-search externally:

```
SPRING_PROFILES_ACTIVE=dev ./gradlew bootRun
```
Then obtain a token and start an index build:

```
TOKEN=$(curl --location --request POST "http://localhost:8090/auth/oauth/token?grant_type=client_credentials" --header "Authorization: Basic $(echo -n prisoner-offender-search-client:clientsecret | base64)" | jq -r .access_token)

curl --location --request PUT "http://localhost:8080/prisoner-index/build-index" --header "Authorization: Bearer $TOKEN" | jq -r

curl --location --request GET "http://localhost:8080/info" | jq -r
```

If the info endpoint shows 52 records then mark the index build as complete:

```
curl --location --request PUT "http://localhost:8080/prisoner-index/mark-complete" --header "Authorization: Bearer $TOKEN" | jq -r
```

The index can then be searched:

```
curl --location --request POST "http://localhost:8080/prisoner-search/match" --header "Authorization: Bearer $TOKEN" --header 'Content-Type: application/json' \
--data-raw '{
    "lastName": "Smith"
}' | jq -r

curl --location --request POST "http://localhost:4571/prisoner-search-a/_search" | jq
```
Or to just run `localstack`, which is useful when running against a non-local test system. The environment then needs `spring.profiles.active=localstack` and `sqs.provider=full-localstack`:

```
TMPDIR=/private$TMPDIR docker-compose up localstack
```

In all of the above the application should use the host network to communicate with `localstack`, since the AWS client will try to read messages from localhost rather than the `localstack` network.
There are two handy scripts to add messages to the queue with data that matches either the dev environment or data in the test Docker version of the apps.

Purging a local queue:

```
aws --endpoint-url=http://localhost:4566 sqs purge-queue --queue-url http://localhost:4566/queue/prisoner_offender_index_queue
```
The recommended regression test is as follows:

- A partial build of the index - see the Rebuilding an index instructions below. The rebuild does not need to be completed, but expect the info endpoint to show something like this:

```
"index-status": {
    "id": "STATUS",
    "currentIndex": "INDEX_A",
    "startIndexTime": "2020-09-23T10:08:33",
    "inProgress": true
},
"index-size": {
    "INDEX_A": 579543,
    "INDEX_B": 521
},
"index-queue-backlog": "578975"
```

So long as the index is being populated and the `"index-queue-backlog"` figure is decreasing after some time (e.g. 10 minutes), it demonstrates the application is working.
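One way to watch the backlog figure without refreshing the page by hand; the jq path assumes the fields sit at the top level of the info response as in the snippet above:

```bash
# Print the index queue backlog once a minute
while true; do
  curl -s https://prisoner-offender-search-dev.hmpps.service.justice.gov.uk/info | jq -r '."index-queue-backlog"'
  sleep 60
done
```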
Check the health endpoint to show the Index DLQ is not building up with errors, e.g. https://prisoner-search-dev.hmpps.service.justice.gov.uk/health. The following would be a valid state, since `MessagesOnDLQ` is zero:

```
"indexQueueHealth": {
    "status": "UP",
    "details": {
        "MessagesOnQueue": 41834,
        "MessagesInFlight": 4,
        "dlqStatus": "UP",
        "MessagesOnDLQ": 0
    }
}
```
The build can either be left to run or cancelled using the following endpoint:

```
curl --location --request PUT 'https://prisoner-search-dev.hmpps.service.justice.gov.uk/prisoner-index/cancel-index' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <some token>'
```
Access to the raw Elastic Search indexes is only possible from the Cloud Platform `prisoner-offender-search` family of namespaces. For instance

```
curl http://aws-es-proxy-service:9200/_cat/indices
```

in any environment would return a list of all indexes, e.g.

```
green open prisoner-search-a tlGst8dmS2aE8knxfxJsfQ 5 1 2545309 1144511 1.1gb 578.6mb
green open offender-index-status v9traPPRS9uo7Ui0J6ixOQ 1 1 1 0 10.7kb 5.3kb
green open prisoner-search-b OMcdEir_TgmTP-tzybwp7Q 5 1 2545309 264356 897.6mb 448.7mb
green open .kibana_2 _rVcHdsYQAKyPiInmenflg 1 1 43 1 144.1kb 72kb
green open .kibana_1 f-CWilxMRyyihpBWBON1yw 1 1 39 6 176.3kb 88.1kb
```
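From within the namespace (or via the port-forward described in the snapshot section below) individual indexes can also be queried directly, for example to get a document count:

```bash
# Count documents in one of the physical indexes
curl -s 'http://aws-es-proxy-service:9200/prisoner-search-a/_count?pretty'
```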
To rebuild an index the credentials used must have the role `PRISONER_INDEX`, therefore it is recommended to use client credentials with `ROLE_PRISONER_INDEX` added, and to pass in your username when getting a token. In the test and local dev environments the `prisoner-offender-search-client` has conveniently been given `ROLE_PRISONER_INDEX`.

The rebuilding of the index can be sped up by increasing the number of pods handling the reindex, e.g.

```
kubectl -n prisoner-offender-search-dev scale --replicas=8 deployment/prisoner-offender-search
```
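Remember to scale back down once the rebuild has finished; the replica count below is illustrative, the real value is whatever the helm values file sets for the environment:

```bash
# Return to the normal replica count (2 here is illustrative - see the helm values file)
kubectl -n prisoner-offender-search-dev scale --replicas=2 deployment/prisoner-offender-search
```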
After obtaining a token for the environment, invoke the reindex with a curl command or Postman, e.g.

```
curl --location --request PUT 'https://prisoner-offender-search-dev.hmpps.service.justice.gov.uk/prisoner-index/build-index' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <some token>'
```

For production environments where access is blocked by inclusion lists this will need to be done from within a Cloud Platform pod.
Next monitor the progress of the rebuild via the info endpoint, e.g. https://prisoner-offender-search-dev.hmpps.service.justice.gov.uk/info. This will return details like the following:

```
"index-status": {
    "id": "STATUS",
    "currentIndex": "INDEX_A",
    "startIndexTime": "2020-09-23T10:08:33",
    "inProgress": true
},
"index-size": {
    "INDEX_A": 702344,
    "INDEX_B": 2330
},
"index-queue-backlog": "700000"
```

When `"index-queue-backlog"` has reached `"0"` then all indexing messages have been processed. Check that the dead letter queue is empty via the health check, e.g. https://prisoner-offender-search-dev.hmpps.service.justice.gov.uk/health
This should show the queue's DLQ count at zero, e.g.

```
"indexQueueHealth": {
    "status": "UP",
    "details": {
        "MessagesOnQueue": 0,
        "MessagesInFlight": 0,
        "dlqStatus": "UP",
        "MessagesOnDLQ": 0
    }
},
```
The indexing is then ready to be marked as complete using another call to the service, e.g.

```
curl --location --request PUT 'https://prisoner-offender-search-dev.hmpps.service.justice.gov.uk/prisoner-index/mark-complete' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <some token>'
```
One last check of the info endpoint should confirm the new state, e.g.

```
"index-status": {
    "id": "STATUS",
    "currentIndex": "INDEX_B",
    "startIndexTime": "2020-09-23T10:08:33",
    "endIndexTime": "2020-09-25T11:27:22",
    "inProgress": false
},
"index-size": {
    "INDEX_A": 702344,
    "INDEX_B": 702344
},
"index-queue-backlog": "0"
```
Pay careful attention to `"currentIndex"` - this shows the actual index being used by clients.
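A one-liner to pull out just the active index, with a jq path matching the info output above:

```bash
curl -s https://prisoner-offender-search-dev.hmpps.service.justice.gov.uk/info | jq -r '."index-status".currentIndex'
```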
## Restore from a snapshot (if both indexes have become corrupt/empty)
If we are restoring from a snapshot it means that both the current index and the other index are broken, and we need to delete them to be able to restore from the snapshot. Every night a scheduled job takes a snapshot of the whole cluster, called `latest`, and this is the snapshot that should be restored.
- To restore we first need to port-forward to the ES instance (replace NAMESPACE with the affected namespace):

  ```
  kubectl -n <NAMESPACE> port-forward svc/es-proxy 9200:9200
  ```

- Delete the current indexes:

  ```
  curl -XDELETE 'http://localhost:9200/_all'
  ```

- Then we can start the restore (SNAPSHOT_NAME for the overnight snapshot is `latest`):

  ```
  curl -XPOST 'http://localhost:9200/_snapshot/<NAMESPACE>/<SNAPSHOT_NAME>/_restore' --data '{"include_global_state": true}'
  ```
`include_global_state: true` is set so that the global state of the cluster snapshot is copied over. The default when restoring, however, is `include_global_state: false`. If only restoring a single index it could be bad to overwrite the global state, but as we are restoring the full cluster we set it to `true`.
### To view the state of the indexes while restoring from a snapshot
#### Cluster health

```
curl -XGET 'http://localhost:9200/_cluster/health'
```
The cluster health status is: green, yellow or red. On the shard level, a red status indicates that the specific shard is not allocated in the cluster, yellow means that the primary shard is allocated but replicas are not, and green means that all shards are allocated. The index level status is controlled by the worst shard status. The cluster status is controlled by the worst index status.
#### Shards

```
curl -XGET 'http://localhost:9200/_cat/shards'
```
The shards command is the detailed view of what nodes contain which shards. It will tell you if it’s a primary or replica, the number of docs, the bytes it takes on disk, and the node where it’s located.
#### Recovery

```
curl -XGET 'http://localhost:9200/_cat/recovery'
```

Returns information about ongoing and completed shard recoveries.
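A useful variant while a restore is running is to restrict the output to shards that are still recovering:

```bash
# Only show recoveries currently in progress, with column headers
curl -XGET 'http://localhost:9200/_cat/recovery?active_only=true&v'
```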
### To take a manual snapshot, perform the following steps:

- You can't take a snapshot if one is currently in progress. To check, run the following command:

  ```
  curl -XGET 'http://localhost:9200/_snapshot/_status'
  ```

- Run the following command to take a manual snapshot:

  ```
  curl -XPUT 'http://localhost:9200/_snapshot/<NAMESPACE>/snapshot-name'
  ```

You can now use the restore commands above to restore the snapshot if needed.
#### To remove a snapshot

```
curl -XDELETE 'http://localhost:9200/_snapshot/<NAMESPACE>/snapshot-name'
```
### Other commands which will help when looking at restoring a snapshot

To see all snapshot repositories, run the following command (normally there will only be one, as we have one per namespace):

```
curl -XGET 'http://localhost:9200/_snapshot?pretty'
```

To see all snapshots for the namespace run the following command:

```
curl -XGET 'http://localhost:9200/_snapshot/<NAMESPACE>/_all?pretty'
```
#### General logs (filtering out the offender update)

```
traces
| where cloud_RoleName == "prisoner-offender-search"
| where message !startswith "Updating offender"
| order by timestamp desc
```
#### General logs including spring startup

```
traces
| where cloud_RoleInstance startswith "prisoner-offender-search"
| order by timestamp desc
```
#### Interesting exceptions

```
exceptions
| where cloud_RoleName == "prisoner-offender-search"
| where operation_Name != "GET /health"
| where customDimensions !contains "health"
| where details !contains "HealthCheck"
| order by timestamp desc
```
#### Indexing requests

```
requests
| where cloud_RoleName == "prisoner-offender-search"
//| where timestamp between (todatetime("2020-08-06T18:20:00") .. todatetime("2020-08-06T18:22:00"))
| order by timestamp desc
```
#### Prison API requests during index build
```
requests
| where cloud_RoleName == "prison-api"
| where name == "GET OffenderResourceImpl/getOffenderNumbers"
| where customDimensions.clientId == "prisoner-offender-search-client"
```
```
requests
| where cloud_RoleName == "prison-api"
| where name == "GET OffenderResourceImpl/getOffender"
| where customDimensions.clientId == "prisoner-offender-search-client"
| order by timestamp desc
```