
Sync 1.29.1 changes #33

Merged
merged 46 commits into from
Jul 27, 2023

Conversation

BenderArenadata

No description provided.

AJR-VMware and others added 30 commits April 24, 2023 13:22
To support specific use cases, currently mainly gpcopy, gpbackup should place its timestamp+pid lockfile in the backup directory instead of in /tmp when the correct flags are provided.

This is to facilitate the highly parallel metadata-only backups that gpcopy uses when dumping metadata.

The flags required to elicit this behavior are --metadata-only, --no-history, and --backup-dir.
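A minimal sketch of the lock-path selection this describes, assuming hypothetical option and function names and an illustrative lockfile name pattern (not gpbackup's actual API):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// backupOptions mirrors the three flags named above; field names are assumptions.
type backupOptions struct {
	MetadataOnly bool
	NoHistory    bool
	BackupDir    string
}

// lockFilePath picks where the timestamp+pid lockfile lives. Only the
// specific flag combination used by gpcopy redirects it away from /tmp.
func lockFilePath(opts backupOptions, timestamp string, pid int) string {
	name := fmt.Sprintf("gpbackup_%s_%d.lck", timestamp, pid)
	if opts.MetadataOnly && opts.NoHistory && opts.BackupDir != "" {
		return filepath.Join(opts.BackupDir, name)
	}
	return filepath.Join("/tmp", name)
}

func main() {
	gpcopyStyle := backupOptions{MetadataOnly: true, NoHistory: true, BackupDir: "/data/backups"}
	fmt.Println(lockFilePath(gpcopyStyle, "20230424132200", 12345)) // lands in the backup dir
	fmt.Println(lockFilePath(backupOptions{}, "20230424132200", 12345)) // default /tmp
}
```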
We're updating our CI to validate against the newest version of datadomain.  Update vault vars to
point to the new instances.
This test relies on a flag that was only added in gpbackup 1.28.0, so skip it if we're testing with an
older version.
The releng team changed some "rhel" artifacts to "el" to reflect a more general approach to RHEL binary
compatibility.  Update our CI to pull from the correct locations.
We test most of our plugin functionality in the gpbackup CI, so add a
test to our plugin suite to cover the delete_replica function added to
our ddboost plugin.
We have historically written out history records only at the end of the
process, and in many failure cases this means no record at all is
available. This commit restructures things to write out the full history entry
very early on, but leaves Status as "In progress" until the cleanup phase,
where it is updated to show whether the backup was a success or a
failure.
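A rough sketch of this early-write/late-finalize flow, with invented type, field, and function names rather than gpbackup's actual history API:

```go
package main

import "fmt"

type historyEntry struct {
	Timestamp string
	Status    string
}

// writeHistory stands in for persisting the entry to the history store.
func writeHistory(entry historyEntry) {
	fmt.Printf("history: %+v\n", entry)
}

func runBackup() (err error) {
	entry := historyEntry{Timestamp: "20230725123456", Status: "In progress"}
	writeHistory(entry) // written early, so a record survives most failures

	defer func() {
		// Cleanup phase: finalize the status based on how the backup ended.
		if err != nil {
			entry.Status = "Failure"
		} else {
			entry.Status = "Success"
		}
		writeHistory(entry)
	}()

	// ... back up metadata and data here ...
	return nil
}

func main() {
	_ = runBackup()
}
```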
Our current cleanup mechanisms rely on a recover() call to pick up
FatalOnError.  However, this mechanism does not work if the panic is
called from a separate goroutine.  As such, our cleanup has some gaps
in it that need to be plugged up.

We identify four cases where a panic can be invoked from a goroutine.
In these cases we set up a channel for bubbling the panics up to the
main process, and invoke gplog.Fatal() at the earliest opportunity.

We structure the defers such that waitgroups are still marked Done()
even in case of a panic, so that we can keep the deadlock protection
they offer.

In the future we will avoid invoking panic() from within goroutines,
so that this handling will not be necessary.
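A minimal sketch of the pattern described above, with invented names: the goroutine recovers its own panic, forwards it over a channel, and still marks the WaitGroup Done so the deadlock protection is preserved. The print stands in for gplog.Fatal().

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	panicChan := make(chan error, 1)

	wg.Add(1)
	go func() {
		defer wg.Done() // runs even if the goroutine panics
		defer func() {
			if r := recover(); r != nil {
				panicChan <- fmt.Errorf("worker panicked: %v", r)
			}
		}()
		panic("simulated FatalOnError from a worker")
	}()

	wg.Wait()
	select {
	case err := <-panicChan:
		fmt.Println("fatal at earliest opportunity:", err)
	default:
	}
}
```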
GPDB7 has removed support for QuickLZ.  Any tables backed up with
QuickLZ will cause an error on restore, unless this GUC is set.  This
GUC will allow the server to silently fall back to using zstd, and allow
for successful restores.
Previously our CI plugin tests had been calling noops for the cleanup
step.  Replace these with calls to delete_directory where possible.

An additional cleanup job will be scheduled to run occasionally to
handle things that are not easily cleaned up from plugin storage in
this manner.
Currently our plugin jobs leave behind a lot of data and don't clean it
up at all.  Prior commit added some in-test cleanup, but it still leaves
behind a fair bit.  Add a CI job that will run weekly and drop all
contents from our plugin storage.

Note that this requires the job to be in a pipeline sitting around for a
week.  This may not be the case for short-lived dev pipelines.  Might be
worth considering making a skinny pipeline to leave sitting in the DP
concourse so that the DP storage also gets cleaned automatically.
A previous change to this code restructured how we track which tables
are being backed up, and their status.  This led to the case where, if a
piped copy command failed, the error was captured but never displayed.
Instead we'd spin forever on an "Unknown" state in our sync.Map, causing
permanent hangs.

This commit restores the original behavior: if a pipe fails, the
backup of the table is marked "Complete" in the sync.Map, and the error
is handled correctly downstream.
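A sketch of that restored behavior, using placeholder names: when the piped copy command fails, record the error but still mark the table "Complete" in the sync.Map so nothing downstream spins on an "Unknown" entry.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

func backupTable(states *sync.Map, table string, runCopy func() error) error {
	if err := runCopy(); err != nil {
		// Mark the table finished even on pipe failure; the error is
		// returned for downstream handling instead of being swallowed.
		states.Store(table, "Complete")
		return fmt.Errorf("COPY failed for %s: %w", table, err)
	}
	states.Store(table, "Complete")
	return nil
}

func main() {
	var states sync.Map
	err := backupTable(&states, "public.foo", func() error {
		return errors.New("broken pipe")
	})
	fmt.Println(err)
}
```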
Given the issues with this function misbehaving, it needs better
coverage.  This commit adds some basic test coverage of it.  We also
tweak how FilePathInfo is structured, so that it is simpler to mock out
where data is copied during backup.
If tables are included in a backup set that inherit from, or are
inherited by, other tables and those other parent or child tables are
not included in the backup set, gpbackup should ignore those inheritance
links when constructing CREATE TABLE statements for included tables.

This commit also adds a --no-inherit flag, which will cause filtered
backups to ignore child tables of included tables to preserve the old
behavior.
The syntax and output of "go test cover" have changed since this script
was originally written, so this commit updates those to maintain the
nicely formatted script output we expect.

It also adds the rest of the packages with unit test files to the list
of those for which coverage is generated.
Some of our CI systems have a different "not found" error for this test
than others.  Expect a more general message to keep CI happy.
The recent local-lockfile changes neglect to consider that the target
directory may not exist before gpbackup runs, since the lock is taken before
gpbackup would normally create the backup-dir directories.  In this
instance, make the required directory on COORDINATOR so that the lockfile
can be taken.
Gppkg is deprecated, and has been removed from GPDB7 CI images.  Our CI
needs to be updated to use the new version, which is kept in a separate
repo.

This commit adds logic to build GPDB7 gppkgs with the new gppkg
version, and also conditionally checks for those gppkgs when running our
GPDB7 test jobs in CI.

Also, we pin the ccp_src version for now.  A recent change broke testing
on Ubuntu 18.04, and we still need to test on that.
If we're performing a normal restore of a backup taken with a version
below 1.26.0, the SegmentCount value is missing from the config file and
so it is set to 0.  This can cause problems down the line, so this
commit sets it to be equal to the current cluster size and adds some
tests to catch similar issues in the future.
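A hedged sketch of that fallback, with assumed type and field names (only SegmentCount is mentioned in the commit):

```go
package main

import "fmt"

type backupConfig struct {
	SegmentCount int
}

// effectiveSegmentCount falls back to the current cluster size when the
// backup predates 1.26.0 and therefore has no SegmentCount in its config.
func effectiveSegmentCount(cfg backupConfig, currentClusterSize int) int {
	if cfg.SegmentCount == 0 {
		return currentClusterSize
	}
	return cfg.SegmentCount
}

func main() {
	fmt.Println(effectiveSegmentCount(backupConfig{}, 8))                 // prints 8
	fmt.Println(effectiveSegmentCount(backupConfig{SegmentCount: 4}, 8)) // prints 4
}
```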
Previous changes in support of distributed snapshots for backing up data
introduced bugs in how we handle errors when piping copy commands to
the server. A prior attempt at fixing this was insufficient, so here we
overhaul how the goroutines doing distributed data backup report
errors, and how those errors are handled.

Also, some changes are necessary to our integration test suite to
support testing of this code path. Specifically, the distributed code
path is only invoked if there are multiple connections to the cluster
available. So, we update to run all our tests with two connections
open.
A recent change missed this one; add it.
It seems that Concourse does not enforce isolation between tasks in the
same job.  This means that the build_binaries tasks were having a bug
where the go_components.tar.gz file pulled in for each platform was
non-deterministic, and we were getting GLIBC version errors because the
gppkgs job was, for example, building the RHEL7 gppkg with the RHEL8
binaries.

This commit refactors our build_binaries step so that there is a
separate job for each platform, and the correct binaries will land where
expected.
The recent gppkg changes missed a step when updating the regression
pipeline.  It requires the input to be called out in the task yaml for
it to appear in the docker filesystem.
We've had issues in our CI with dummy_seclabel not getting installed,
which causes several downstream tests to fail, but only after they've run
for a long time.

Since we don't have complete control over gpconfig, try to bypass it.
It's only tweaking the postgresql.conf files, which is simple enough to do
locally by invoking ssh.
This test has been a bit flaky in our CI.  This is being addressed in 2
ways.

First, we're going to introduce usage of the FlakeAttempts test
decorator to automatically retry, reducing red jobs.

Second, restructure the transaction rollback/commit to use defer so that
subsequent tests aren't broken by having an orphan transaction left
over.
The timeout added when doing a restore can occasionally cause failures of
otherwise successful restores, in cases where the COPY commands from the
server lag too far behind the writer opening its half of the pipe.  A
full RCA of the causes and implications of this is not complete, but for
now a smarter timeout helps resolve errors seen in CI.

The scheme put in place is an exponential backoff approach, starting at
10ms and scaling up to about 5m27s, for a total of 15 retries and about
12 minutes.  Logs of the retry delay begin after the 10s wait, to help
keep users informed of what's happening.
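A rough sketch of an exponential backoff loop like the one described, with invented constants, names, and logging; the retry count and ceiling here are illustrative rather than gpbackup's exact values:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForPipeWithBackoff polls pipeReady, doubling the delay from 10ms and
// logging the wait once it exceeds roughly 10s, then gives up after a fixed
// number of attempts.
func waitForPipeWithBackoff(pipeReady func() bool) error {
	delay := 10 * time.Millisecond
	for attempt := 0; attempt < 16; attempt++ {
		if pipeReady() {
			return nil
		}
		if delay >= 10*time.Second {
			fmt.Printf("pipe not ready, retrying in %v\n", delay)
		}
		time.Sleep(delay)
		delay *= 2 // 10ms, 20ms, ... up to roughly 5m27s on the final wait
	}
	return errors.New("timed out waiting for pipe")
}

func main() {
	attempts := 0
	err := waitForPipeWithBackoff(func() bool { attempts++; return attempts > 3 })
	fmt.Println(err) // <nil> once the pipe shows up
}
```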
This catalog column was not present upstream and may not be very useful, so it
was removed from the catalog in GP7. The grammar will still be accepted in GP7,
but ignored.
Also fix the pipeline path for output to use the current directory rather than a
hardcoded location that may not be accurate.
AJR-VMware and others added 12 commits June 6, 2023 14:56
These are supposed to go inside the spawned goroutines, not outside.  Move
them.
In GPDB7 gp_toolkit was refactored to be an extension. Some of our
extension tests trip over this change, so this commit updates them as
needed.
Add more table rows to ensure stats get generated properly and update expected
values as needed.
When including a given table in the filter, we now include that table's
parent tables (and their parents, and so on), that table's children, and
any other parents of those children, such that all INHERITS statements
in any related tables are satisfied.

This is a tweak to the behavior introduced in b290ade, so that we no longer
follow *every* inheritance link and collect "sibling" tables of the
originally included table that do not share common children with it.
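The traversal can be pictured with a small sketch under assumed data structures (simple maps rather than gpbackup's catalog queries): ancestors are always pulled in recursively, children are expanded only for originally included tables and their descendants, and each child's other parents come along via the ancestor walk, so unrelated siblings stay out.

```go
package main

import "fmt"

func inheritanceClosure(included []string, parents, children map[string][]string) map[string]bool {
	result := map[string]bool{}
	expanded := map[string]bool{} // tables whose children we have followed

	var addAncestors func(t string)
	addAncestors = func(t string) {
		for _, p := range parents[t] {
			if !result[p] {
				result[p] = true
				addAncestors(p)
			}
		}
	}

	var addWithChildren func(t string)
	addWithChildren = func(t string) {
		result[t] = true
		addAncestors(t) // also satisfies a child's INHERITS from its "other" parents
		if expanded[t] {
			return
		}
		expanded[t] = true
		for _, c := range children[t] {
			addWithChildren(c)
		}
	}

	for _, t := range included {
		addWithChildren(t)
	}
	return result
}

func main() {
	parents := map[string][]string{"child": {"included", "other_parent"}, "included": {"grandparent"}}
	children := map[string][]string{"included": {"child"}, "other_parent": {"child"}, "grandparent": {"included", "sibling"}}
	// "sibling" is excluded because it shares no children with "included".
	fmt.Println(inheritanceClosure([]string{"included"}, parents, children))
}
```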
In a scenario where a user is restoring a backup in a read-only backup
directory, gprestore will currently fail at the end because it cannot
write the restore report file to that directory.  This is a legitimate
use case, so we should treat this as a warning rather than an error.
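A minimal sketch of that downgrade, with invented names and paths: attempt to write the restore report, but turn a write failure into a warning instead of a fatal error.

```go
package main

import (
	"fmt"
	"os"
)

func writeRestoreReport(path string, contents []byte) {
	if err := os.WriteFile(path, contents, 0644); err != nil {
		// A read-only backup directory is a legitimate setup, so warn
		// instead of failing the whole restore at the very end.
		fmt.Printf("WARNING: unable to write report file %s: %v\n", path, err)
		return
	}
}

func main() {
	writeRestoreReport("/read-only-mount/gprestore_report", []byte("restore completed"))
}
```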
We frequently see spurious failures when a single test fails and leaves
a role or other object around and thus causes many subsequent tests to
fail, so this commit adds DROP statements for the most common offenders
to make debugging failed tests much easier.
The CPU_HARD_QUOTA_LIMIT and CPU_SOFT_PRIORITY parameters for resource
groups have been respectively renamed to CPU_MAX_PERCENT and CPU_WEIGHT,
with no change in underlying functionality, so this commit updates
related code accordingly.
The gp_toolkit extension should be installed by default in GPDB 7, but
some tests are failing intermittently because it seems not to exist for
them.  For now, we'll create it manually before running tests to avoid
the issue.
GPDB 7 does not currently support distributing by enum columns, so we
remove the table with an enum column from the set of objects we use for
the end-to-end plugin tests in 7.
As a follow-up to 0896ede: we don't want to fail a restore at the end
because an error file can't be written when there is a legitimate reason
for that, such as restoring a backup on a read-only filesystem, so we
treat that as a warning rather than an error.

Stolb27 commented Jul 25, 2023

I'm going to reapply ac8731e above 8bdcb48 and 7346404


Stolb27 commented Jul 25, 2023

Tests failed because of greenplum-db#692. Let's wait for a reaction; otherwise we'll repack it ourselves.

Alternatively, we can sync 1.28.1 first and mask the problem.

InnerLife0 previously approved these changes Jul 26, 2023
Stolb27 and others added 3 commits July 26, 2023 11:29
Implement --report-dir option for gprestore

The new option allows creating report files in a directory different from --backup-dir.
The main code changes were made to the filepath code, which now takes the new option as the default path for reports, with a fallback to the content dir.
Unit and end_to_end tests were updated to demonstrate the new behavior.

Cherry-picked-from: ac8731e
to reapply above 8bdcb48 and 7346404
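Illustrative only: one way the report-path fallback might be structured. FilePathInfo is named in the PR, but the fields and the GetReportDirectory method here are assumptions, not the actual implementation.

```go
package main

import (
	"fmt"
	"path/filepath"
)

type FilePathInfo struct {
	ReportDir  string // from --report-dir, may be empty
	ContentDir string // per-content backup directory (the --backup-dir location)
}

// GetReportDirectory prefers --report-dir and falls back to the content dir.
func (f FilePathInfo) GetReportDirectory() string {
	if f.ReportDir != "" {
		return f.ReportDir
	}
	return f.ContentDir
}

func main() {
	fpInfo := FilePathInfo{ReportDir: "/tmp/reports", ContentDir: "/data/backups/gpseg-1"}
	fmt.Println(filepath.Join(fpInfo.GetReportDirectory(), "gprestore_report"))
}
```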
@Stolb27 Stolb27 dismissed InnerLife0’s stale review July 26, 2023 09:33

The merge-base changed after approval.

@InnerLife0 InnerLife0 self-requested a review July 26, 2023 12:21
@Stolb27 Stolb27 merged commit 29c117b into master Jul 27, 2023
1 of 2 checks passed
@Stolb27 Stolb27 deleted the 1.29.1-sync branch July 27, 2023 13:32