
Sync 1.29.1 changes #33

Merged
merged 46 commits into from
Jul 27, 2023

Conversation

BenderArenadata

No description provided.

AJR-VMware and others added 30 commits April 24, 2023 13:22
To support specific use cases, currently mainly gpcopy, gpbackup should place its timestamp+pid lockfile in the backup directory instead of in /tmp when the correct flags are provided.

This is to facilitate the highly parallel metadata-only backups that gpcopy uses when dumping metadata.

The flags required to elicit this behavior are --metadata-only, --no-history, and --backup-dir.
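A minimal sketch of the lock-path selection this describes, assuming hypothetical option and function names and an illustrative lockfile name pattern (not gpbackup's actual API):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// backupOptions mirrors the three flags named above; field names are assumptions.
type backupOptions struct {
	MetadataOnly bool
	NoHistory    bool
	BackupDir    string
}

// lockFilePath picks where the timestamp+pid lockfile lives. Only the
// specific flag combination used by gpcopy redirects it away from /tmp.
func lockFilePath(opts backupOptions, timestamp string, pid int) string {
	name := fmt.Sprintf("gpbackup_%s_%d.lck", timestamp, pid)
	if opts.MetadataOnly && opts.NoHistory && opts.BackupDir != "" {
		return filepath.Join(opts.BackupDir, name)
	}
	return filepath.Join("/tmp", name)
}

func main() {
	gpcopyStyle := backupOptions{MetadataOnly: true, NoHistory: true, BackupDir: "/data/backups"}
	fmt.Println(lockFilePath(gpcopyStyle, "20230424132200", 12345)) // lands in the backup dir
	fmt.Println(lockFilePath(backupOptions{}, "20230424132200", 12345)) // default /tmp
}
```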
We're updating our CI to validate against the newest version of datadomain.  Update vault vars to
point to the new instances.
This test relies on a flag that was only added in gpbackup 1.28.0, so skip it if we're testing with an
older version.
The releng team changed some "rhel" artifacts to "el" to reflect a more general approach to RHEL binary
compatibility.  Update our CI to pull from the correct locations.
We test most of our plugin functionality in the gpbackup CI, so add a
test to our plugin suite to cover the delete_replica function added to
our ddboost plugin.
We have historically written out history records only at the end of the
process, and in many failure cases this means no record at all is
available. This commit restructures things to write out the full history entry
very early on, but leaves Status as "In progress" until the cleanup phase,
where it is updated to show whether the backup was a success or a
failure.
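A rough sketch of this early-write/late-finalize flow, with invented type, field, and function names rather than gpbackup's actual history API:

```go
package main

import "fmt"

type historyEntry struct {
	Timestamp string
	Status    string
}

// writeHistory stands in for persisting the entry to the history store.
func writeHistory(entry historyEntry) {
	fmt.Printf("history: %+v\n", entry)
}

func runBackup() (err error) {
	entry := historyEntry{Timestamp: "20230725123456", Status: "In progress"}
	writeHistory(entry) // written early, so a record survives most failures

	defer func() {
		// Cleanup phase: finalize the status based on how the backup ended.
		if err != nil {
			entry.Status = "Failure"
		} else {
			entry.Status = "Success"
		}
		writeHistory(entry)
	}()

	// ... back up metadata and data here ...
	return nil
}

func main() {
	_ = runBackup()
}
```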
Our current cleanup mechanisms rely on a recover() call to pick up
FatalOnError.  However, this mechanism does not work if the panic is
called from a separate goroutine.  As such, our cleanup has some gaps
in it that need to be plugged up.

We identify four cases where a panic can be invoked from a goroutine.
In these cases we set up a channel for bubbling the panics up to the
main process, and invoke gplog.Fatal() at the earliest opportunity.

We structure the defers such that waitgroups are still marked Done()
even in case of a panic, so that we can keep the deadlock protection
they offer.

In the future we will avoid invoking panic() from within goroutines,
so that this handling will not be necessary.
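A minimal sketch of the pattern described above, with invented names: the goroutine recovers its own panic, forwards it over a channel, and still marks the WaitGroup Done so the deadlock protection is preserved. The print stands in for gplog.Fatal().

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	panicChan := make(chan error, 1)

	wg.Add(1)
	go func() {
		defer wg.Done() // runs even if the goroutine panics
		defer func() {
			if r := recover(); r != nil {
				panicChan <- fmt.Errorf("worker panicked: %v", r)
			}
		}()
		panic("simulated FatalOnError from a worker")
	}()

	wg.Wait()
	select {
	case err := <-panicChan:
		fmt.Println("fatal at earliest opportunity:", err)
	default:
	}
}
```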
GPDB7 has removed support for QuickLZ.  Any tables backed up with
QuickLZ will cause an error on restore, unless this GUC is set.  This
GUC will allow the server to silently fall back to using zstd, and allow
for successful restores.
Previously our CI plugin tests had been calling noops for the cleanup
step.  Replace these with calls to delete_directory where possible.

An additional cleanup job will be scheduled to run occasionally to
handle things that are not easily cleaned up from plugin storage in
this manner.
Currently our plugin jobs leave behind a lot of data and don't clean it
up at all.  Prior commit added some in-test cleanup, but it still leaves
behind a fair bit.  Add a CI job that will run weekly and drop all
contents from our plugin storage.

Note that this requires the job to be in a pipeline sitting around for a
week.  This may not be the case for short-lived dev pipelines.  Might be
worth considering making a skinny pipeline to leave sitting in the DP
concourse so that the DP storage also gets cleaned automatically.
A previous change to this code restructured how we track which tables
are being backed up, and their status.  This led to the case where, if a
piped copy command failed, the error was captured but never displayed.
Instead we'd spin forever on an "Unknown" state in our sync.Map, causing
permanent hangs.

This commit restores the original behavior: if a pipe fails, the
backup of the table is marked "Complete" in the sync.Map, and the error
is handled correctly downstream.
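A sketch of that restored behavior, using placeholder names: when the piped copy command fails, record the error but still mark the table "Complete" in the sync.Map so nothing downstream spins on an "Unknown" entry.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

func backupTable(states *sync.Map, table string, runCopy func() error) error {
	if err := runCopy(); err != nil {
		// Mark the table finished even on pipe failure; the error is
		// returned for downstream handling instead of being swallowed.
		states.Store(table, "Complete")
		return fmt.Errorf("COPY failed for %s: %w", table, err)
	}
	states.Store(table, "Complete")
	return nil
}

func main() {
	var states sync.Map
	err := backupTable(&states, "public.foo", func() error {
		return errors.New("broken pipe")
	})
	fmt.Println(err)
}
```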
Given the issues with this function misbehaving, it needs better
coverage.  This commit adds some basic test coverage of it.  We also
tweak how FilePathInfo is structured, so that it is simpler to mock out
where data is copied during backup.
If tables are included in a backup set that inherit from, or are
inherited by, other tables and those other parent or child tables are
not included in the backup set, gpbackup should ignore those inheritance
links when constructing CREATE TABLE statements for included tables.

This commit also adds a --no-inherit flag, which will cause filtered
backups to ignore child tables of included tables to preserve the old
behavior.
The syntax and output of "go test cover" have changed since this script
was originally written, so this commit updates those to maintain the
nicely formatted script output we expect.

It also adds the rest of the packages with unit test files to the list
of those for which coverage is generated.
Some of our CI systems have a different "not found" error for this test
than others.  Expect a more general message to keep CI happy.
The recent local-lockfile changes neglect to consider that the target
directory may not exist before gpbackup runs, since the lock is taken before
gpbackup would normally create the backup-dir directories.  In this
instance, make the required directory on COORDINATOR so that the lockfile
can be taken.
Gppkg is deprecated, and has been removed from GPDB7 CI images.  Our CI
needs to be updated to use the new version, which is kept in a separate
repo.

This commit adds logic to build GPDB7 gppkgs with the new gppkg
version, and also conditionally checks for those gppkgs when running our
GPDB7 test jobs in CI.

Also, we pin the ccp_src version for now.  A recent change broke testing
on Ubuntu 18.04, and we still need to test on that.
If we're performing a normal restore of a backup taken with a version
below 1.26.0, the SegmentCount value is missing from the config file and
so it is set to 0.  This can cause problems down the line, so this
commit sets it to be equal to the current cluster size and adds some
tests to catch similar issues in the future.
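A hedged sketch of that fallback, with assumed type and field names (only SegmentCount is mentioned in the commit):

```go
package main

import "fmt"

type backupConfig struct {
	SegmentCount int
}

// effectiveSegmentCount falls back to the current cluster size when the
// backup predates 1.26.0 and therefore has no SegmentCount in its config.
func effectiveSegmentCount(cfg backupConfig, currentClusterSize int) int {
	if cfg.SegmentCount == 0 {
		return currentClusterSize
	}
	return cfg.SegmentCount
}

func main() {
	fmt.Println(effectiveSegmentCount(backupConfig{}, 8))                 // prints 8
	fmt.Println(effectiveSegmentCount(backupConfig{SegmentCount: 4}, 8)) // prints 4
}
```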
Previous changes in support of distributed snapshots for backing up data
introduced bugs in how we handle errors when piping copy commands to
the server. A prior attempt at fixing this was insufficient, so here we
overhaul how the goroutines doing distributed data backup report
errors, and how those errors are handled.

Also, some changes are necessary to our integration test suite to
support testing of this code path. Specifically, the distributed code
path is only invoked if there are multiple connections to the cluster
available. So, we update to run all our tests with two connections
open.
A recent change missed this one; add it.
It seems that Concourse does not enforce isolation between tasks in the
same job.  This means that the build_binaries tasks were having a bug
where the go_components.tar.gz file pulled in for each platform was
non-deterministic, and we were getting GLIBC version errors because the
gppkgs job was, for example, building the RHEL7 gppkg with the RHEL8
binaries.

This commit refactors our build_binaries step so that there is a
separate job for each platform, and the correct binaries will land where
expected.
The recent gppkg changes missed a step when updating the regression
pipeline.  It requires the input to be called out in the task yaml for
it to appear in the docker filesystem.
We've had issues in our CI with dummy_seclabel not getting installed,
which causes several downstream tests to fail, but only after they've run
for a long time.

Since we don't have complete control over gpconfig, try to bypass it.
It's only tweaking the postgresql.conf files, which is simple enough to do
locally by invoking ssh.
This test has been a bit flaky in our CI.  This is being addressed in 2
ways.

First, we're going to introduce usage of the FlakeAttempts test
decorator to automatically retry, reducing red jobs.

Second, restructure the transaction rollback/commit to use defer so that
subsequent tests aren't broken by having an orphan transaction left
over.
The timeout added when doing a restore can occasionally cause failures of
otherwise successful restores, in cases where the COPY commands from the
server lag too far behind the writer opening its half of the pipe.  A
full RCA of the causes and implications of this is not complete, but for
now a smarter timeout helps resolve errors seen in CI.

The scheme put in place is an exponential backoff approach, starting at
10ms and scaling up to about 5m27s, for a total of 15 retries and about
12 minutes.  Logs of the retry delay begin after the 10s wait, to help
keep users informed of what's happening.
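A rough sketch of an exponential backoff loop like the one described, with invented constants, names, and logging; the retry count and ceiling here are illustrative rather than gpbackup's exact values:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForPipeWithBackoff polls pipeReady, doubling the delay from 10ms and
// logging the wait once it exceeds roughly 10s, then gives up after a fixed
// number of attempts.
func waitForPipeWithBackoff(pipeReady func() bool) error {
	delay := 10 * time.Millisecond
	for attempt := 0; attempt < 16; attempt++ {
		if pipeReady() {
			return nil
		}
		if delay >= 10*time.Second {
			fmt.Printf("pipe not ready, retrying in %v\n", delay)
		}
		time.Sleep(delay)
		delay *= 2 // 10ms, 20ms, ... up to roughly 5m27s on the final wait
	}
	return errors.New("timed out waiting for pipe")
}

func main() {
	attempts := 0
	err := waitForPipeWithBackoff(func() bool { attempts++; return attempts > 3 })
	fmt.Println(err) // <nil> once the pipe shows up
}
```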
This catalog column was not present upstream and may not be very useful, so it
was removed from the catalog in GP7. The grammar will still be accepted in GP7,
but ignored.
Also fix the pipeline path for output to use the current directory rather than a
hardcoded location that may not be accurate.
AJR-VMware and others added 12 commits June 6, 2023 14:56
These are supposed to go inside the spawned goroutines, not outside.  Move
them.
In GPDB7 gp_toolkit was refactored to be an extension. Some of our
extension tests trip over this change, so this commit updates them as
needed.
Add more table rows to ensure stats get generated properly and update expected
values as needed.
When including a given table in the filter, we now include that table's
parent tables (and their parents, and so on), that table's children, and
any other parents of those children, such that all INHERITS statements
in any related tables are satisfied.

This is a tweak to the behavior introduced in b290ade, so that we no longer
follow *every* inheritance link and collect "sibling" tables of the
originally included table that do not share common children with it.
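The traversal can be pictured with a small sketch under assumed data structures (simple maps rather than gpbackup's catalog queries): ancestors are always pulled in recursively, children are expanded only for originally included tables and their descendants, and each child's other parents come along via the ancestor walk, so unrelated siblings stay out.

```go
package main

import "fmt"

func inheritanceClosure(included []string, parents, children map[string][]string) map[string]bool {
	result := map[string]bool{}
	expanded := map[string]bool{} // tables whose children we have followed

	var addAncestors func(t string)
	addAncestors = func(t string) {
		for _, p := range parents[t] {
			if !result[p] {
				result[p] = true
				addAncestors(p)
			}
		}
	}

	var addWithChildren func(t string)
	addWithChildren = func(t string) {
		result[t] = true
		addAncestors(t) // also satisfies a child's INHERITS from its "other" parents
		if expanded[t] {
			return
		}
		expanded[t] = true
		for _, c := range children[t] {
			addWithChildren(c)
		}
	}

	for _, t := range included {
		addWithChildren(t)
	}
	return result
}

func main() {
	parents := map[string][]string{"child": {"included", "other_parent"}, "included": {"grandparent"}}
	children := map[string][]string{"included": {"child"}, "other_parent": {"child"}, "grandparent": {"included", "sibling"}}
	// "sibling" is excluded because it shares no children with "included".
	fmt.Println(inheritanceClosure([]string{"included"}, parents, children))
}
```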
In a scenario where a user is restoring a backup in a read-only backup
directory, gprestore will currently fail at the end because it cannot
write the restore report file to that directory.  This is a legitimate
use case, so we should treat this as a warning rather than an error.
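A minimal sketch of that downgrade, with invented names and paths: attempt to write the restore report, but turn a write failure into a warning instead of a fatal error.

```go
package main

import (
	"fmt"
	"os"
)

func writeRestoreReport(path string, contents []byte) {
	if err := os.WriteFile(path, contents, 0644); err != nil {
		// A read-only backup directory is a legitimate setup, so warn
		// instead of failing the whole restore at the very end.
		fmt.Printf("WARNING: unable to write report file %s: %v\n", path, err)
		return
	}
}

func main() {
	writeRestoreReport("/read-only-mount/gprestore_report", []byte("restore completed"))
}
```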
We frequently see spurious failures when a single test fails and leaves
a role or other object around and thus causes many subsequent tests to
fail, so this commit adds DROP statements for the most common offenders
to make debugging failed tests much easier.
The CPU_HARD_QUOTA_LIMIT and CPU_SOFT_PRIORITY parameters for resource
groups have been respectively renamed to CPU_MAX_PERCENT and CPU_WEIGHT,
with no change in underlying functionality, so this commit updates
related code accordingly.
The gp_toolkit extension should be installed by default in GPDB 7, but
some tests are failing intermittently because it seems not to exist for
them.  For now, we'll create it manually before running tests to avoid
the issue.
GPDB 7 does not currently support distributing by enum columns, so we
remove the table with an enum column from the set of objects we use for
the end-to-end plugin tests in 7.
As a follow-up to 0896ede: we don't want to fail a restore at the end
because an error file can't be written when there is a legitimate reason
for that, such as restoring a backup on a read-only filesystem, so we
treat that as a warning rather than an error.

Stolb27 commented Jul 25, 2023

I'm going to reapply ac8731e above 8bdcb48 and 7346404


Stolb27 commented Jul 25, 2023

Tests failed because of greenplum-db#692. Let's wait for a reaction; otherwise we'll repack it ourselves.

Alternatively, we can sync 1.28.1 first and mask the problem.

InnerLife0 previously approved these changes Jul 26, 2023
Stolb27 and others added 3 commits July 26, 2023 11:29
Implement --report-dir option for gprestore

The new option allows creating report files in a directory different from --backup-dir.
The main code changes were made to the filepath code, which now takes the new option as the default path for reports, with a fallback to the content dir.
Unit and end_to_end tests were updated to demonstrate the new behavior.

Cherry-picked-from: ac8731e
to reapply above 8bdcb48 and 7346404
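Illustrative only: one way the report-path fallback might be structured. FilePathInfo is named in the PR, but the fields and the GetReportDirectory method here are assumptions, not the actual implementation.

```go
package main

import (
	"fmt"
	"path/filepath"
)

type FilePathInfo struct {
	ReportDir  string // from --report-dir, may be empty
	ContentDir string // per-content backup directory (the --backup-dir location)
}

// GetReportDirectory prefers --report-dir and falls back to the content dir.
func (f FilePathInfo) GetReportDirectory() string {
	if f.ReportDir != "" {
		return f.ReportDir
	}
	return f.ContentDir
}

func main() {
	fpInfo := FilePathInfo{ReportDir: "/tmp/reports", ContentDir: "/data/backups/gpseg-1"}
	fmt.Println(filepath.Join(fpInfo.GetReportDirectory(), "gprestore_report"))
}
```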
@Stolb27 Stolb27 dismissed InnerLife0’s stale review July 26, 2023 09:33

The merge-base changed after approval.

@InnerLife0 InnerLife0 self-requested a review July 26, 2023 12:21
@Stolb27 Stolb27 merged commit 29c117b into master Jul 27, 2023
1 of 2 checks passed
@Stolb27 Stolb27 deleted the 1.29.1-sync branch July 27, 2023 13:32