
Troubleshooting

This section outlines steps to assist you with resolving issues deploying, configuring and using the Amazon S3 Find and Forget solution.

If you're unable to resolve an issue using this information you can report the issue on GitHub.

Expected Results Not Found

If the Find phase does not identify the expected objects for the matches in the deletion queue, verify the following:

  • You have chosen the relevant data mappers for the matches in the deletion queue.
  • Your data mappers are referencing the correct S3 locations.
  • Your data mappers have been configured to search the correct columns.
  • All partitions have been loaded into the Glue Data Catalog.

Job appears stuck in QUEUED/RUNNING status

If a job remains in a QUEUED or RUNNING status for much longer than expected, there may be an issue relating to:

  • AWS Fargate accessing the ECR service endpoint. Enabling the required network access from the subnets/security groups in which Forget Fargate tasks are launched will unblock the job without requiring manual intervention. For more information see VPC Configuration in the User Guide.
  • Errors in job table stream processor. Check the logs of the stream processor Lambda function for errors.
  • Unhandled state machine execution errors. If there are no errors in the job event history which indicate an issue, check the state machine execution history of the execution with the same name as the blocked job ID.
  • The containers have exhausted memory or vCPU capacity while processing large (multi-GB) files. See also Service Level Monitoring.
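To check the second point in the list above, you can search the stream processor function's logs from the CLI. This is a sketch: the exact Lambda function name is generated by CloudFormation and will differ per deployment, so the `StreamProcessor` filter and the placeholder name are assumptions to adjust for your stack.

```shell
# Locate the stream processor log group (name is stack-specific;
# adjust the contains() filter to match your deployment)
aws logs describe-log-groups \
  --log-group-name-prefix /aws/lambda/ \
  --query "logGroups[?contains(logGroupName, 'StreamProcessor')].logGroupName"

# Search the last hour of that log group for errors
aws logs filter-log-events \
  --log-group-name "/aws/lambda/<stream-processor-function-name>" \
  --filter-pattern "ERROR" \
  --start-time "$(($(date +%s) - 3600))000"
```

Note that `--start-time` takes milliseconds since the epoch, hence the appended `000`.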

If the state machine is still executing but in a non-recoverable state, you can stop the state machine execution manually, which will trigger an Exception job event; the job will then enter a FAILED status.

If this doesn't resolve the issue or the execution isn't running, you can manually update the job status to FAILED or remove the job and any associated events from the Jobs table*.

* WARNING: You should manually intervene only when there has been a fatal error from which the system cannot recover.
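Stopping the blocked execution can be done from the CLI as well as the console. The following is a sketch assuming you have the state machine ARN from the CloudFormation stack resources; `<execution-arn>` is the ARN of the execution whose name matches the blocked job ID, and the `--error`/`--cause` values are illustrative.

```shell
# List running executions and find the one named after the blocked job ID
aws stepfunctions list-executions \
  --state-machine-arn <state-machine-arn> \
  --status-filter RUNNING

# Stop it, which triggers the Exception job event described above
aws stepfunctions stop-execution \
  --execution-arn <execution-arn> \
  --error "ManualStop" \
  --cause "Stopping non-recoverable execution"
```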

Job appears stuck in FORGET_COMPLETED_CLEANUP_IN_PROGRESS

If you are running a job with a very large queue of matches (more than 100K entries), the Lambda function that removes the processed matches from the queue after a successful job may need a large amount of memory to process the deletions efficiently within its 15-minute timeout. If the job status doesn't update after an hour, the queue size doesn't decrease, and you see a spike in memory usage for the Stream Processor Lambda function, the system is likely stuck and you won't be able to run any other job.

To troubleshoot this scenario:

  1. Edit the DynamoDB item for the given JobID and change the Status to COMPLETED_CLEANUP_FAILED in order to be able to start a new job.
    • Using the DynamoDB AWS Console, choose the JobTable (the name will be something like <stackName>-DDBStack-XXXXXXXXXXXXX-JobTableXXXXXX-XXXXXXXXXXXX) and choose Explore table Items.
    • Insert the Job ID in the ID (Partition Key) field.
    • In the Sk (Sort Key) choose Equal to and insert the Job ID in the text field, then select Run.
    • Select the item in the results table, and choose Edit item.
    • Deselect View DynamoDB JSON and then modify the JobStatus value to COMPLETED_CLEANUP_FAILED.
    • Choose Save changes.
  2. Update the solution CloudFormation stack and specify a larger LambdaJobsMemorySize parameter value. See the AWS Lambda Operator Guide for valid values.
  3. Run a new Job and monitor the Stream Processor Lambda memory usage during the next jobs.
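Steps 1 and 2 above can also be performed from the CLI. This is a sketch: the table name, stack name, key attribute names (`Id`/`Sk`), and the memory value are assumptions; confirm them against your table's key schema and the stack's parameter list before running.

```shell
# Step 1: mark the stuck job as COMPLETED_CLEANUP_FAILED
aws dynamodb update-item \
  --table-name "<stackName>-DDBStack-XXXXXXXXXXXXX-JobTableXXXXXX-XXXXXXXXXXXX" \
  --key '{"Id": {"S": "<job-id>"}, "Sk": {"S": "<job-id>"}}' \
  --update-expression "SET JobStatus = :s" \
  --expression-attribute-values '{":s": {"S": "COMPLETED_CLEANUP_FAILED"}}'

# Step 2: raise the jobs Lambda memory via the stack parameter
aws cloudformation update-stack \
  --stack-name "<stackName>" \
  --use-previous-template \
  --parameters ParameterKey=LambdaJobsMemorySize,ParameterValue=2048 \
  --capabilities CAPABILITY_IAM
```

If your stack was deployed with other non-default parameters, pass `ParameterKey=<name>,UsePreviousValue=true` for each of them so the update does not reset them.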

Job status: COMPLETED_CLEANUP_FAILED

A COMPLETED_CLEANUP_FAILED status indicates that the job has completed, but an error occurred when removing the processed matches from the deletion queue.

Some possible causes for this are:

  • The stream processor Lambda function does not have permissions to manipulate the DynamoDB table.
  • The item has been manually removed from the deletion queue table via a direct call to the DynamoDB API.

You can find more details of the cause by checking the job event history for a CleanupFailed event, then viewing the event data.

As the processed matches will still be on the queue, you can choose to either:

  • Manually remove the processed matches via the solution web interface or APIs.
  • Take no action — the matches will remain in the queue and be re-processed during the next deletion job run.

Job status: FAILED

A FAILED status indicates that the job has terminated due to a generic exception.

Some possible causes for this are:

  • One of the tasks in the main step function failed.
  • There was a permissions issue encountered in one of the solution components.
  • The state machine execution time has timed out, or has exceeded the service quota for state machine execution history.

To find information on what caused the failure, check the deletion job log for an Exception event and inspect that event's event data.

Errors relating to Step Functions such as timeouts or exceeding the permitted execution history length, may be resolvable by increasing the waiter configuration as described in Performance Configuration.

Job status: FIND_FAILED

A FIND_FAILED status indicates that the job has terminated because one or more data mapper queries failed to execute.

If you are using Athena and Glue as data mappers, you should first verify the following:

  • You have granted permissions to the Athena IAM role for access to the S3 buckets referenced by your data mappers and any AWS KMS keys used to encrypt the S3 objects. For more information see Permissions Configuration in the User Guide.
  • The concurrency setting for the solution does not exceed the limits for concurrent Athena queries for your AWS account or the Athena workgroup the solution is configured to use. For more information see Performance Configuration in the User Guide.
  • Your data is compatible within the solution limits.

If you made any changes whilst verifying the prior points, you should attempt to run a new deletion job.

To find further details of the cause of the failure, inspect the event data for any QueryFailed events in the deletion job log.

Athena queries may fail if the length of a query sent to Athena exceeds the Athena query string length limit (see Athena Service Quotas). If queries fail for this reason, you will need to reduce the number of matches queued when running a deletion job.

To troubleshoot Athena queries further, find the QueryId from the event data and match this to the query in the Athena Query History. You can use the Athena Troubleshooting guide for Athena troubleshooting steps.
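Rather than searching the Athena Query History by hand, you can look up the failed query directly with the QueryId from the event data. A minimal sketch, assuming `<query-id>` is the value taken from the QueryFailed event:

```shell
# Retrieve the state and error details of the failed query
aws athena get-query-execution \
  --query-execution-id <query-id> \
  --query "QueryExecution.Status"
```

The `StateChangeReason` field in the output usually contains the underlying error message.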

Job status: FORGET_FAILED

A FORGET_FAILED status indicates that the job has terminated because a fatal error occurred during the forget phase of the job. S3 objects may have been modified.

Check the job log for a ForgetPhaseFailed event. Examining the event data for this event will provide you with more information about the underlying cause of the failure.

Job status: FORGET_PARTIALLY_FAILED

A FORGET_PARTIALLY_FAILED status indicates that the job has completed, but that the forget phase was unable to process one or more objects.

Each object that was not correctly processed will result in a message sent to the object dead letter queue ("DLQ"; see DLQUrl in the CloudFormation stack outputs) and an ObjectUpdateFailed event in the job event history containing error information. Check the content of any ObjectUpdateFailed events to ascertain the root cause of an issue.
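You can inspect the DLQ messages from the CLI. This sketch assumes the output key is `DLQUrl` as stated above; setting `--visibility-timeout 0` lets you peek at messages without hiding them from other consumers, and the messages are not deleted.

```shell
# Resolve the DLQ URL from the CloudFormation stack outputs
DLQ_URL=$(aws cloudformation describe-stacks \
  --stack-name "<stackName>" \
  --query "Stacks[0].Outputs[?OutputKey=='DLQUrl'].OutputValue" \
  --output text)

# Peek at up to 10 failed-object messages without removing them
aws sqs receive-message \
  --queue-url "$DLQ_URL" \
  --max-number-of-messages 10 \
  --visibility-timeout 0
```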

Verify the following:

  • No other processes created a new version of existing objects while the job was running. When the system creates a new version of an object, an integrity check verifies that no new versions of the object were created during processing and that no delete marker was created for it. If either case is detected, an ObjectUpdateFailed event will be present in the job event history and a rollback will be attempted. If the rollback fails, an ObjectRollbackFailed event containing error information will be present in the job event history.
  • You have granted permissions to the Fargate task IAM role for access to the S3 buckets referenced by your data mappers and any AWS KMS keys used to encrypt the data. For more information see Permissions Configuration in the User Guide.
  • You have configured the VPC used for the Fargate tasks according to the VPC Configuration section.
  • Your data is compatible within the solution limits.
  • Your data is not corrupted.

To reprocess the objects, run a new deletion job.