Add fsfreeze functionality to snapshot #1083

ejweber · 2024-04-24T21:06:20Z

Which issue(s) this PR fixes:

longhorn/longhorn#2187

What this PR does / why we need it:

Attempt to bind mount the file system on the root partition of a Longhorn volume and freeze it with fsfreeze if requested by snapshot command line argument or snapshot gRPC command. Otherwise, just sync like we have always done.
Attempt to unfreeze the file system on the root partition during frontend shutdown in case we are in the middle of a snapshot.
Attempt to unfreeze the file system on the root partition of a Longhorn volume during frontend startup in case its engine previously crashed before an unfreeze was issued.

Additional documentation or context

Depends on longhorn/types#8.

We need to roll back the debugging commit and fix the imports before merging.

ejweber · 2024-04-26T21:03:50Z

I found a pretty convenient way to modify the integration test that was written for this feature's original implementation to work with the current implementation. It tests basic functionality (e.g., when Snapshot is called with shouldFreeze == true, we can successfully detect the mounted file system, do the bind mount, and attempt a freeze).

Otherwise, I manually tested the following scenarios using this modified engine and the modified instance manager. In the interest of time, I didn't write up an exhaustive description of how the tests were executed, but here are the basic commands involved:

Run a simple Longhorn volume in a Docker container.
docker run --privileged -v /:/host --rm --net host --name test longhornio/longhorn-engine:2149c73d launch-simple-longhorn test 10g tgt-blockdev
Make a file system on the volume and mount it.
mkfs.ext4 /dev/longhorn/test
mount /dev/longhorn/test /mnt/test
Constantly write data to the file system while snapshotting, etc.
while true; do dd of=/mnt/test/file if=/dev/urandom bs=16M; done
Watch the dirty page cache while freeze is ongoing.
while true; do cat /proc/meminfo | grep -i dirty; sleep 0.5; done
Check inside the Docker container for mounted volumes.
docker exec test bash -c " while true; do mount | grep test; sleep 0.5; done"
Create a snapshot with or without a request to freeze.
bin/longhorn -url localhost:10015 snapshot create --freeze-fs
bin/longhorn -url localhost:10015 snapshot create
Check for interesting processes and determine if they are stuck and why.
while true; do ps -eo pid,ppid,pgid,user,stat,pcpu,comm,wchan:32 | grep -e freeze -e dmsetup -e " dd" -e laun; sleep 1; done
Delete and recreate Longhorn processes using instance-manager.
docker exec test longhorn-instance-manager process delete --name test-e
docker exec test longhorn-instance-manager process delete --name test-r
docker exec test longhorn-instance-manager process create --name test-r --binary /usr/local/bin/longhorn --port-count 15 --port-args --listen,localhost: -- replica /volume/ --size 10g
docker exec test longhorn-instance-manager process create --name test-e --binary /usr/local/bin/longhorn --port-count 1 --port-args --listen,localhost: -- controller test --frontend tgt-blockdev --size 10g --current-size 10g --replica tcp://localhost:10000
Kill the whole Longhorn volume immediately.
kill -9 -<process_group of the launch process>

Here are the test cases. All tests were executed on a known "good" kernel (Rocky 9.3 with an upstream 6.8.7-1.el9.elrepo.x86_64 kernel installed).

Take a snapshot without --freeze-fs.

Check:

Logs clearly indicate system sync.
Dirty pages trend toward zero before the snapshot is finished (but don't necessarily reach it).

Take a snapshot with --freeze-fs.

Check:
Logs clearly indicate freeze and unfreeze.
Logs do not indicate system sync.
Loop running mount command clearly shows mount being created and destroyed.
Loop running cat command clearly shows cache being flushed.

Take a snapshot with --freeze-fs. Ctrl+C the instance manager container while the freeze is ongoing.

Check:

Logs clearly indicate freeze.
Logs indicate a timeout unfreezing. (This is because we automatically try to unfreeze during shutdown, but unfreeze is blocked by freeze. We return anyway.)
Logs may indicate a later attempt to unfreeze. (This is because our freeze attempt finally fails when I/O errors return and then we attempt to undo it. It's really not particularly helpful or interesting, but it is expected.)
The snapshot fails with an indication that freeze failed.
The dd command gets unstuck (once I/O errors hit after the iSCSI timeout).
The file system can be unmounted.

Take a snapshot with --freeze-fs. kill -9 the instance manager container while the freeze is ongoing.

Check:

Logs clearly indicate freeze.
The snapshot fails with an error that includes an EOF.
The dd command gets unstuck (once I/O errors hit after the iSCSI timeout).
The file system can be unmounted.

Take a snapshot with --freeze-fs. kill -9 the instance manager container BETWEEN FREEZE AND LOCK (need code modification). Then, restart instance-manager.

Check:

Logs clearly indicate freeze.
The snapshot fails with an error that includes an EOF.
After instance-manager is restarted:
- Logs clearly indicate unfreeze (from the instance-manager startup logic).
- The dd command gets unstuck.
- The file system can be unmounted.

Take a snapshot with --freeze-fs. Stop the engine using a call to instance-manager BETWEEN FREEZE AND LOCK (need code modification).

Check:

Logs clearly indicate freeze.
The snapshot fails with an error that includes an EOF.
Logs clearly indicate unfreeze (from the shutdown frontend logic).
The dd command gets unstuck.
The file system can be unmounted.

Take a snapshot with --freeze-fs. kill -9 the engine process (but not instance-manager) BETWEEN FREEZE AND LOCK (need code modification).

Check:

Logs clearly indicate freeze.
The snapshot fails with an error that includes an EOF.
To recover:
- Use a call to instance-manager to delete the engine and replica processes.
- Use a call to instance-manager to start the engine and replica processes.
- After this:
  - Logs clearly indicate unfreeze (from the startup frontend logic).
  - The dd command gets unstuck.
  - The file system can be unmounted.

Take a snapshot with --freeze-fs. Then, quickly take another snapshot.

Check:

Logs clearly indicate freeze and unfreeze for the first snapshot.
The second snapshot fails with an indication that the file system is already being frozen.

mergify · 2024-04-27T14:10:08Z

This pull request is now in conflict. Could you fix it @ejweber? 🙏


ejweber · 2024-05-02T21:01:56Z

Reworking some things in response to @PhanLe1010's feedback. Will reopen when I am fairly confident in the changes again.


ejweber · 2024-05-03T21:12:20Z

Moved back to ready for review @PhanLe1010. Sorry for the delay. I responded to all of your comments and made changes with respect to some of them.

I also fixed an issue Codefactor had with my integration test and ensured we actually cleaned up the directories within the instance-manager container we bind mounted to (in case a long running instance-manager used too many!).

I reran the test cases in #1083 (comment) with the changes.

mergify · 2024-05-04T05:39:50Z

This pull request is now in conflict. Could you fix it @ejweber? 🙏

PhanLe1010 · 2024-05-08T05:25:59Z

ensured we actually cleaned up the directories within the instance-manager container we bind mounted to (in case a long running instance-manager used too many!)

Could you point me to this? Thank you

ejweber · 2024-05-08T14:36:47Z

Sorry for the confusion. This commit uses CleanupMountPoint in two places:

One where we previously just used Unmount (so did not clean up the directory).
One where we previously did not Unmount at all. (This was an oversight that could have led to mounts for shut down engines accumulating over time.)

PhanLe1010

LGTM

Thank you for the great implementation!


james-munson

Just a couple of nits in comments, no issues. LGTM


ejweber · 2024-05-10T22:11:34Z

Since longhorn/types#8 merged and I have two approving reviews, I will remove my debug commits and rebase in preparation for merge soon.


ejweber · 2024-05-13T15:48:21Z

Added 66fa2e6 for @shuo-wu's comments, removed debugging helpers, and pointed back to upstream types. Rebase is a mess given the number of commits / time in development and the dependency changes since I started, so we can use squash instead. Ready to go pending @shuo-wu's approval.

shuo-wu · 2024-05-14T01:50:15Z

After resolving the rebase and conflicts, it's good to go

ejweber changed the title ~~Add fsfreeze functionality to snapshot.~~ Add fsfreeze functionality to snapshot Apr 24, 2024

ejweber mentioned this pull request Apr 24, 2024

Add fsFreeze field to VolumeSnapshot longhorn/longhorn-instance-manager#477

Merged

ejweber force-pushed the 2187-fsfreeze branch 2 times, most recently from ad96ec2 to 2149c73 Compare April 25, 2024 20:19

ejweber mentioned this pull request Apr 25, 2024

Add a global FreezeFS setting and a volume FreezeFS field longhorn/longhorn-manager#2744

Merged

ejweber force-pushed the 2187-fsfreeze branch from ead07e0 to fa7f15a Compare April 26, 2024 20:49

ejweber force-pushed the 2187-fsfreeze branch from fa7f15a to 434c08b Compare April 26, 2024 21:20

ejweber mentioned this pull request Apr 29, 2024

[IMPROVEMENT] Use fsfreeze instead of sync before snapshot longhorn/longhorn#2187

Closed

ejweber force-pushed the 2187-fsfreeze branch 3 times, most recently from c1fb482 to 94f3cdb Compare April 29, 2024 16:57

PhanLe1010 reviewed Apr 29, 2024

ejweber marked this pull request as draft May 2, 2024 21:01

ejweber force-pushed the 2187-fsfreeze branch 2 times, most recently from a9d3226 to 5af2582 Compare May 3, 2024 19:08

ejweber added 11 commits May 3, 2024 19:57

Add fsfreeze functionality to Snapshot

8ceac31

Squash the POC implementation history to avoid confusion. Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

Handle near-simultaneous snapshots more safely

5c6277a

Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

Add debugging helpers

2240844

TODO: Remove this before merging. Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

Add fsfreeze related integration test

2827d2e

Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

Reword file system to filesystem

fcd0a8d

Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

Use ExecuteNoTimeout instead of freezeTimeout

18e841c

Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

Ensure freezePoints are removed and prefer to unfreeze freezePoints

d57877e

Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

Log snapshot errors server-side

24adad2

Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

Move engine startup unfreeze logic earlier

efa24fd

Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

Don't fail on engine startup if unfreeze is stuck

848ce5a

Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

Use a secure temporary directory during integration tests

21e2a65

Stop Codefactor from complaining. Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

ejweber force-pushed the 2187-fsfreeze branch from 5af2582 to 21e2a65 Compare May 3, 2024 21:07

ejweber marked this pull request as ready for review May 3, 2024 21:10

Merge remote-tracking branch 'origin/master' into 2187-fsfreeze

0495a64

PhanLe1010 previously approved these changes May 8, 2024

james-munson reviewed May 9, 2024

Dockerfile.dapper (outdated, resolved)

james-munson reviewed May 10, 2024

pkg/controller/control.go (outdated, resolved)

james-munson previously approved these changes May 10, 2024

Fix typos in comments

1167971

Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

ejweber dismissed stale reviews from james-munson and PhanLe1010 via 1167971 May 10, 2024 21:49

shuo-wu reviewed May 11, 2024

ejweber added 2 commits May 13, 2024 15:12

Use upstream types

e382878

Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

Merge remote-tracking branch 'origin/master' into 2187-fsfreeze

b36caa9

ejweber force-pushed the 2187-fsfreeze branch from 5945485 to 8305e21 Compare May 13, 2024 15:25

Revert "Add debugging helpers"

f879631

This reverts commit 2240844. Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

ejweber force-pushed the 2187-fsfreeze branch from 8305e21 to b06a43e Compare May 13, 2024 15:31

Clarify log and return empty from helper functions

66fa2e6

Longhorn 2187 Signed-off-by: Eric Weber <eric.weber@suse.com>

ejweber force-pushed the 2187-fsfreeze branch from b06a43e to 66fa2e6 Compare May 13, 2024 15:43

shuo-wu approved these changes May 14, 2024

ejweber merged commit e39b7f0 into longhorn:master May 14, 2024
7 of 8 checks passed
