Skip to content

CUMULUS-3959: Add documentation to resolve failing db migrations #3896

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jan 28, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added a `/granules` [endpoint](https://nasa.github.io/cumulus-api/#bulk-update-granules) `PATCH/bulkPatch` which applies PATCH to a list of granules. For its payload, this endpoint takes a list of granules (the updates to be made to the granule, similar to the pre-existing `PATCH`), a `dbConcurrency` and `dbMaxPool` variables for configuring concurrency and database thoroughput for postgres to tailor to performance and database needs.
- **CUMULUS-3919**
- Added terraform variables `disableSSL` and `rejectUnauthorized` to `tf-modules/cumulus-rds-tf` module.
- **CUMULUS-3959**
- Added documentation to help DAACs troubleshoot database migration issues.
- **CUMULUS-3978**
- Added `iops` and `throughput` options to `elasticsearch_config` variable
in `tf-modules/data-persistence`; These two options are necessary for gp3 EBS volume type.
Expand Down
1 change: 1 addition & 0 deletions website/sidebars.js
Original file line number Diff line number Diff line change
Expand Up @@ -183,6 +183,7 @@ const sidebars = {
'troubleshooting/troubleshooting-deployment',
'troubleshooting/rerunning-workflow-executions',
'troubleshooting/reindex-elasticsearch',
'troubleshooting/troubleshooting-database-migrations',
],
},
{
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
---
id: troubleshooting-database-migrations
title: Troubleshooting Database Migrations
hide_title: false
---

This document provides guidance on how to troubleshoot database migration when upgrading to v18.5.1.

:::info

This document is specifically intended to provide a potential resolution for issues with the database migration to v18.5.1+.
The advice in this document may be used for migrations to other versions, but please be mindful they may not provide the same resolution.
:::

## Failing Database Migrations

When trying to deploy the v18.5.1 of Cumulus, specifically the `data-persistence-tf` module, there is a chance the terraform deployment task times out. The timeout usually occurs when the `<prefix>-postgres-db-migration` lambda is invoked. It will attempt to run the migrations, which may take longer than the Lambda's timeout allows. This occurs when the lambda is trying to run a new migration that has not already been applied (e.g. is NOT in the PostgreSQL `Migrations Table`).

### Discerning Failing Migrations

The issue can be resolved by simply skipping the offending migration. To see which migration that is; you can query the PostgreSQL `knex_migrations` table and compare them to the ones you should have in your release (in this case 18.5.1). If one or more migrations is missing, that migration is potentially taking too long for the Lambda and causing issues.

### Next Steps

When you discover which migration(s) are missing from the `Migrations Table`. These steps can be taken to resolve:
- Manually apply the migration (ex. `update_executions_deletion_constraint`) using SQL applied directly to your database. The SQL commands can be found in the `db` package of the Cumulus codebase
- Manually add the entry into the `knex_migrations` table. Subsequent runs of the `<prefix>-postgres-db-migration` Lambda will check this table and skip any migrations listed. This means that your manual SQL updates will be respected and the Lambda will not try to process the same migration.
- Re-deploy `data-persistence-tf` after verifying the above successfully completed

### Other potential issues

Migration Lambda timeouts (or other related timeouts) may be caused by `VACUUM` statements in the migration. If you find your deployment not completing despite the migration table and the migrations being consistent, this may be the issue. In v18.5.2, some `VACUUM` statements have been removed to tailor for this issue. If you find yourself struggling with v18.5.1 despite completing the above steps, please upgrade to v18.5.2 and try deployment again. The removal of potentially long-running migrations may resolve the issue.

### Miscellaneous

If the instructions in this document are not helping you with your issues(s) reagarding troubleshooting database migration, please contact the Cumulus Team for support.