The ingest component fetches existing documents from the Ocean Best Practices repository, extracts text if necessary, and indexes the document metadata into OpenSearch. It is also responsible for managing supported ontologies and vocabularies in a managed Neptune instance.
- Prerequisites
- Ingesting Documents
- Bulk Ingesting Documents
- Deleting Documents
- Updating Documents
While you can do everything you need to do in order to deploy via the AWS console, this documentation is written as though you're deploying via the AWS CLI.
You will need to have an AWS profile configured locally in order to interact with the CLI:
This assumes you are using an admin role within the account and do not yet have a user account.
- Create or acquire a user and get the appropriate credentials
- There are two ways to add a profile to the AWS CLI
  - By adding the user to .aws/credentials and .aws/config
    - This can be achieved by opening them with your text editor or IDE, for example:
      vim ~/.aws/credentials
  - You can also add a profile via the CLI by running
    aws configure --profile name_of_profile
    - You will be prompted for 4 things
      - Access key ID
      - Secret access key
      - Default region name
      - Default output format
- You can then replace {aws_profile_with_credentials} with name_of_profile
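To confirm that a profile ended up where the CLI expects it, a small check along these lines can help. This is a minimal sketch: the function name `profile_in_credentials` is hypothetical, and it only looks for the `[name_of_profile]` section header in a credentials file (defaulting to `~/.aws/credentials`). It does not validate the keys themselves.

```shell
# Hypothetical helper: succeeds if the credentials file has a [profile] section.
# Usage: profile_in_credentials name_of_profile [path/to/credentials]
profile_in_credentials() {
  # -F: fixed string, -x: match the whole line, -q: quiet (exit status only)
  grep -Fxq "[$1]" "${2:-$HOME/.aws/credentials}"
}
```

To verify the keys as well, `aws sts get-caller-identity --profile name_of_profile` asks AWS which identity the credentials map to.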
When triggering a function via the CLI you'll need its name. You can find the name of the function from the AWS Lambda Console or via the CLI:
aws lambda list-functions
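Since an account may contain many functions, it can be convenient to narrow the listing to one stage. The sketch below assumes that function names start with the stage name (as in prod-obp-cdk-delete-document); the helper name `list_stage_functions` is hypothetical. It relies on the CLI's `--query` option with the JMESPath `starts_with` function.

```shell
# Hypothetical helper: print the names of Lambda functions whose name starts
# with the given stage prefix (e.g. prod-obp-cdk).
# Usage: list_stage_functions prod-obp-cdk
list_stage_functions() {
  aws lambda list-functions \
    --query "Functions[?starts_with(FunctionName, '$1')].FunctionName" \
    --output text
}
```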
The ingest component leverages a third-party text extractor which can be found here: (https://github.com/skylander86/lambda-text-extractor). A fork is also available in the Element 84 organization in case anything changes with the original repository: (https://github.com/Element84/lambda-text-extractor).
The Deploying documentation includes enough information to deploy the ingest component. However, it might be worth reading through some basic OpenSearch and Neptune documentation to become comfortable with the service that provides our main search and tagging indices.
Documents are ingested when they are added to DSpace. Ingest of a new document is triggered automatically when the DSpace RSS reader determines a new document has been added to the source repository and posts its UUID to our ingest SNS topic. The RSS reader runs on a regular interval, which defaults to every 5 minutes.
You can manually ingest a document by posting its UUID to SNS:
aws sns publish --topic-arn {AVAILABLE_DOCUMENT_TOPIC_ARN} --message cf05c46d-e1aa-4d95-bf44-4e9c0aaa7a37
Replace {AVAILABLE_DOCUMENT_TOPIC_ARN} with the ARN of the SNS topic, which you can find in the AWS console or via the AWS CLI.
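Because the message body is just the UUID, a typo would be queued as-is. The sketch below wraps the publish command with a format check first. Both function names are hypothetical, and the check only verifies the canonical 8-4-4-4-12 hexadecimal shape, not that the document exists in DSpace. Topic ARNs can be listed with `aws sns list-topics`.

```shell
# Hypothetical helper: succeeds only for a canonical 8-4-4-4-12 hex UUID.
is_uuid() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
}

# Hypothetical wrapper: validate the UUID, then queue the document for ingest.
# Usage: queue_document_for_ingest TOPIC_ARN DOCUMENT_UUID
queue_document_for_ingest() {
  if ! is_uuid "$2"; then
    echo "not a valid UUID: $2" >&2
    return 1
  fi
  aws sns publish --topic-arn "$1" --message "$2"
}
```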
When a new document is queued for ingest the following occurs:
- Metadata is fetched from DSpace and saved to S3.
- Binary file (e.g. PDF) is fetched from DSpace and saved to S3.
- If a binary file exists the ingest component triggers text extraction and the raw text is saved to S3.
- The metadata and raw text are run through the tagging routine. This routine uses the "terms" index which is made up of keywords from our managed ontologies and vocabularies.
- Metadata, raw text, and matching tags are indexed into our search index.
The ingest component uses the lambda-text-extractor library to perform serverless and asynchronous OCR text extraction of PDF files. You can find details on how to install and deploy the library on its repository page.
The text extractor function can be shared across environments and only needs to be deployed once.
Assumes familiarity with OpenSearch terminology.
The search index (defaults to "documents") contains metadata, raw text, and matching tags for a document. The index and mapping are automatically created by the indexer function and should not be managed manually. If for some reason the index needs to be created manually, you can simply trigger the ingest of a document and the index will be created for you.
You can view the partial index mapping used to create the search index here. This is a partial mapping because if a field is not explicitly listed we just use the default OpenSearch mapping for it.
Assumes familiarity with OpenSearch terminology.
The terms index (defaults to "terms") contains the list of keywords extracted from our managed ontologies and vocabularies along with the queries used to match percolated documents. The index and mapping are automatically created when an ontology or vocabulary is indexed. Please see the Neptune documentation for more information on how to manage ontologies and vocabularies.
You can manually trigger a bulk ingest of the source repository with the bulk-ingester function. You can do this via the AWS Lambda Console or the AWS CLI:
aws lambda invoke --function-name {STAGE}-bulk-ingester output.json
Replace {STAGE} with the target stage name (e.g. prod-obp-cdk). The CLI requires an output file argument; the function's response is written to output.json.
The bulk ingester queues all documents available in the source repository for ingest. It does this by posting document UUIDs to the ingest SNS topic, which means that every document runs through the entire ingest routine. This is an asynchronous and potentially long process. After triggering a bulk ingest, please allow time for documents to be ingested.
The bulk ingester does not remove old documents. This is due to a limitation in the DSpace API.
DSpace does not expose a reliable API for identifying when a document is withdrawn from the source repository, so there is no automated process to remove a withdrawn document from the search index.
Deleting documents from the search index is a manual process. If/when a document is marked as withdrawn in the source repository you can delete it from the search index by invoking the delete-document function:
aws lambda invoke --function-name arn:aws:lambda:us-east-1:063582114381:function:[STAGE]-delete-document --payload '{"uuid":"[DOCUMENT_UUID]"}' --cli-binary-format raw-in-base64-out [OUTPUT]
Replace:
- [STAGE] with the target stage name (e.g. prod-obp-cdk)
- [DOCUMENT_UUID] with the UUID of the document to delete
- [OUTPUT] with the name of the file the JSON result will be written to
For example, if you have a document with a UUID of dc1ef50c-298c-409b-91ac-54a8be75f776 and a function named prod-obp-cdk-delete-document (you can get the name from the AWS Lambda Console), you would run:
aws lambda invoke --function-name arn:aws:lambda:us-east-1:063582114381:function:prod-obp-cdk-delete-document --payload '{"uuid":"dc1ef50c-298c-409b-91ac-54a8be75f776"}' --cli-binary-format raw-in-base64-out output.json
You should get a response:
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
and the output file (output.json in this example) should contain:
- if a document was deleted:
{"took":467,"timed_out":false,"total":1,"deleted":1,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
- if no document was deleted:
{"took":17,"timed_out":false,"total":0,"deleted":0,"batches":0,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
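The only field that differs meaningfully between the two results is "deleted", so a script can check that one value. The sketch below is a hypothetical helper (`deleted_count`) that assumes the compact, single-line JSON shown above, with no whitespace after the colon. If jq is installed, `jq .deleted output.json` is a more robust alternative.

```shell
# Hypothetical helper: print the "deleted" count from a delete-document
# output file. Assumes compact JSON such as {"total":1,"deleted":1,...}.
# Usage: deleted_count output.json
deleted_count() {
  sed -n 's/.*"deleted":\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}
```

For example, `[ "$(deleted_count output.json)" -gt 0 ] && echo "document removed"` reports a successful deletion.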
The process for updating a document is exactly the same as ingesting a new document. You can manually update a document by following the Ingesting Documents instructions.
Occasionally the search index may get out of sync with the source repository due to limitations in the DSpace API and how the search repository is notified of new documents. The index-rectifier function is designed to perform a diff between the source and search repositories on a regular schedule (defaults to every 2 days). This function performs the following diff:
- If a document from the source repository has a more recent lastModified date, the document is queued for ingest.
- If a document from the source repository has a bitstream (binary file) with a more recent lastModified date, the document is queued for ingest.
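The two rules above can be sketched as a pair of shell functions. This is an illustration only, not the index-rectifier's actual code: the function names are hypothetical, and it assumes that each source date is compared against the corresponding date stored in the search index and that all timestamps are ISO 8601 UTC strings in the same format (e.g. 2024-01-02T03:04:05Z), so that string order matches time order.

```shell
# Hypothetical helper: succeeds if the first timestamp is strictly later than
# the second. The "x" prefix keeps expr from treating a value as an operator.
is_newer() {
  LC_ALL=C expr "x$1" \> "x$2" >/dev/null
}

# Hypothetical sketch of the diff rule: re-ingest when either the document or
# its bitstream has a more recent lastModified date in the source repository.
# Usage: needs_reingest SOURCE_DOC_DATE SOURCE_BITSTREAM_DATE \
#                       INDEXED_DOC_DATE INDEXED_BITSTREAM_DATE
needs_reingest() {
  is_newer "$1" "$3" || is_newer "$2" "$4"
}
```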
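You can manually trigger the index-rectifier function via the AWS Lambda Console or the AWS CLI:
aws lambda invoke --function-name {STAGE}-index-rectifier output.json
Replace {STAGE} with the target stage name (e.g. prod-obp-cdk). The CLI requires an output file argument; the function's response is written to output.json.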