Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fsfreeze functionality to snapshot #1083

Merged
merged 17 commits into from
May 14, 2024
Merged

Conversation

ejweber
Copy link
Collaborator

@ejweber ejweber commented Apr 24, 2024

Which issue(s) this PR fixes:

longhorn/longhorn#2187

What this PR does / why we need it:

  • Attempt to bind mount the file system on the root partition of a Longhorn volume and freeze it with fsfreeze if requested by snapshot command line argument or snapshot gRPC command. Otherwise, just sync like we have always done.
  • Attempt to unfreeze the file system on the root partition during frontend shutdown in case we are in the middle of a snapshot.
  • Attempt to unfreeze the file system on the root partion of a Longhorn volume during frontend startup in case its engine previously crashed before an unfreeze was issued.

Additional documentation or context

Depends on longhorn/types#8.

We need to roll back the debugging commit and fix the imports before merging.

@ejweber ejweber changed the title Add fsfreeze functionality to snapshot. Add fsfreeze functionality to snapshot Apr 24, 2024
@ejweber ejweber force-pushed the 2187-fsfreeze branch 2 times, most recently from ad96ec2 to 2149c73 Compare April 25, 2024 20:19
@ejweber
Copy link
Collaborator Author

ejweber commented Apr 26, 2024

I found a pretty convenient way to modify the integration test that was written for this feature's original implementation to work with the current implementation. It tests basic functionality (e.g., when Snapshot is called with shouldFreeze == true, we can successfully detect the mounted file system, do the bind mound, and attempt a freeze).

Otherwise, I manually tested the following scenarios using this modified engine and the modified instance manager. In the interest of time, I didn't write up an exhaustive description of how the tests were executed, but here are the basic commands involved:

  • Run a simple Longhorn volume in a Docker container.
    docker run --privileged -v /:/host --rm --net host --name test longhornio/longhorn-engine:2149c73d launch-simple-longhorn test 10g tgt-blockdev
  • Make a file system on the volume and mount it.
    mkfs.ext4 /dev/longhorn/test
    mount /dev/longhorn/test /mnt/test
  • Constantly write data to the file system while snapshotting, etc.
    while true; do dd of=/mnt/test/file if=/dev/urandom bs=16M; done
  • Watch the dirty page cache while freeze is ongoing.
    while true; do cat /proc/meminfo | grep -i dirty; sleep 0.5; done
  • Check inside the Docker container for mounted volumes.
    docker exec test bash -c " while true; do mount | grep test; sleep 0.5; done"
  • Create a snapshot with or without a request to freeze.
    bin/longhorn -url localhost:10015 snapshot create --freeze-fs
    bin/longhorn -url localhost:10015 snapshot create
  • Check for interesting processes and determine if they are stuck and why.
    while true; do ps -eo pid,ppid,pgid,user,stat,pcpu,comm,wchan:32 | grep -e freeze -e dmsetup -e " dd" -e laun; sleep 1; done
  • Delete and recreate Longhorn processes using instance-manager.
    docker exec test longhorn-instance-manager process delete --name test-e
    docker exec test longhorn-instance-manager process delete --name test-r
    docker exec test longhorn-instance-manager process create --name test-r --binary /usr/local/bin/longhorn --port-count 15 --port-args --listen,localhost: -- replica /volume/ --size 10g
    docker exec test longhorn-instance-manager process create --name test-e --binary /usr/local/bin/longhorn --port-count 1 --port-args --listen,localhost: -- controller test --frontend tgt-blockdev --size 10g --current-size 10g --replica tcp://localhost:10000
  • Kill the whole Longhorn volume immediately.
    kill -9 -<process_group of the launch process>

Here are the test cases. All tests were executed on a known "good" kernel (Rocky 9.3 with an upstream 6.8.7-1.el9.elrepo.x86_64 kernel installed).


Take a snapshot without --freeze-fs.

Check:

  • Logs clearly indicate system sync.
  • Dirty pages trend toward zero before the snapshot is finished (but don't necessarily reach it).

Take a snapshot with --freeze-fs.

Check:
Logs clearly indicate freeze and unfreeze.
Logs do not indicate system sync.
Loop running mount command clearly shows mount being created and destroyed.
Loop running cat command clearly show cache being flushed.


Take a snapshot with --freeze-fs. Ctrl+C the instance manager container while the freeze is ongoing.

Check:

  • Logs clearly indicate freeze.
  • Logs indicate a timeout unfreezing. (This is because we automatically try to unfreeze during shutdown, but unfreeze is blocked by freeze. We return anyway.)
  • Logs may indicate a later attempt to unfreeze. (This is because our freeze attempt finally fails when I/O errors return and then we attempt to undo it. It's really not particularly helpful or interesting, but it is expected.)
  • The snapshot fails with an indication that freeze failed.
  • The dd command gets unstuck (once I/O errors hit after the iSCSI timeout).
  • The file system can be unmounted.

Take a snapshot with --freeze-fs. kill -9 the instance manager container while the freeze is ongoing.

Check:

  • Logs clearly indicate freeze.
  • The snapshot fails with an error that includes an EOF.
  • The dd command gets unstuck (once I/O errors hit after the iSCSI timeout).
  • The file system can be unmounted.

Take a snapshot with --freeze-fs. kill -9 the instance manager container BETWEEN FREEZE AND LOCK (need code modification). Then, restart instance-manager.

Check:

  • Logs clearly indicate freeze.
  • The snapshot fails with an error that includes an EOF.
  • After instance-manager is restarted:
    • Logs clearly indicate unfreeze (from the instance-manager startup logic).
    • The dd command gets unstuck.
    • The file system can be unmounted.

Take a snapshot with --freeze-fs. Stop the engine using a call to instance-manager BETWEEN FREEZE AND LOCK (need code modification).

Check:

  • Logs clearly indicate freeze.
  • The snapshot fails with an error that includes an EOF.
  • Logs clearly indicate unfreeze (from the shutdown frontend logic).
  • The dd command gets unstuck.
  • The file system can be unmounted.

Take a snapshot with --freeze-fs. kill -9 the engine process (but not instance-manager) BETWEEN FREEZE AND LOCK (need code modification).

  • Logs clearly indicate freeze.
  • The snapshot fails with an error that includes an EOF.
  • To recover:
    • Use a call to instance-manager to delete the engine and replica processes.
    • Use a call to instance-manager to start the engine and replica processes.
    • After this:
      • Logs clearly indicate unfreeze (from the startup frontend logic).
      • The dd command gets unstuck.
      • The file system can be unmounted.

Take a snapshot with --freeze-fs. Then, quickly take another snapshot.

  • Logs clearly indicate freeze and unfreeze for first snapshot.
  • The snapshot fails with an indication that the file system is already being frozen.

Copy link

mergify bot commented Apr 27, 2024

This pull request is now in conflict. Could you fix it @ejweber? 🙏

pkg/util/fsfreeze.go Outdated Show resolved Hide resolved
pkg/controller/control.go Show resolved Hide resolved
pkg/controller/control.go Outdated Show resolved Hide resolved
pkg/frontend/tgt/frontend.go Outdated Show resolved Hide resolved
pkg/frontend/tgt/frontend.go Outdated Show resolved Hide resolved
pkg/frontend/tgt/frontend.go Outdated Show resolved Hide resolved
@ejweber ejweber marked this pull request as draft May 2, 2024 21:01
@ejweber
Copy link
Collaborator Author

ejweber commented May 2, 2024

Reworking some things in response to @PhanLe1010's feedback. Will reopen again when I am back to fairly confident.

@ejweber ejweber force-pushed the 2187-fsfreeze branch 2 times, most recently from a9d3226 to 5af2582 Compare May 3, 2024 19:08
ejweber added 11 commits May 3, 2024 19:57
Squash the POC implementation history to avoid confusion.

Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
TODO: Remove this before merging.

Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
Stop Codefactor from complaining.

Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
@ejweber ejweber marked this pull request as ready for review May 3, 2024 21:10
@ejweber
Copy link
Collaborator Author

ejweber commented May 3, 2024

Moved back to ready for review @PhanLe1010. Sorry for the delay. I responded to all of your comments and made changes with respect to some of them.

I also fixed an issue Codefactor had with my integration test and ensured we actually cleaned up the directories within the instance-manager container we bind mounted to (in case a long running instance-manager used too many!).

I reran #1083 (comment) with the changes.

Copy link

mergify bot commented May 4, 2024

This pull request is now in conflict. Could you fix it @ejweber? 🙏

@PhanLe1010
Copy link
Contributor

ensured we actually cleaned up the directories within the instance-manager container we bind mounted to (in case a long running instance-manager used too many!)

Could you point me to this? Thank you

@ejweber
Copy link
Collaborator Author

ejweber commented May 8, 2024

Sorry for the confusion. This commit uses CleanupMountPoint in two places:

  • One where we previously just used Unmount (so did not clean up the directory).
  • One where we previously did not Unmount at all. (This was an oversight that could have led to mounts for shut down engines accumulating over time.)

PhanLe1010
PhanLe1010 previously approved these changes May 8, 2024
Copy link
Contributor

@PhanLe1010 PhanLe1010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thank you for the great implementation!

Dockerfile.dapper Outdated Show resolved Hide resolved
james-munson
james-munson previously approved these changes May 10, 2024
Copy link
Contributor

@james-munson james-munson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of nits in comments, no issues. LGTM

Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
@ejweber ejweber dismissed stale reviews from james-munson and PhanLe1010 via 1167971 May 10, 2024 21:49
@ejweber
Copy link
Collaborator Author

ejweber commented May 10, 2024

Since longhorn/types#8 merged and I have two approving reviews, I will remove my debug commits and rebase in preparation for merge soon.

pkg/controller/control.go Outdated Show resolved Hide resolved
pkg/controller/control.go Outdated Show resolved Hide resolved
Dockerfile.dapper Outdated Show resolved Hide resolved
pkg/frontend/tgt/frontend.go Outdated Show resolved Hide resolved
pkg/util/fsfreeze.go Outdated Show resolved Hide resolved
Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
This reverts commit 2240844.

Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 2187

Signed-off-by: Eric Weber <eric.weber@suse.com>
@ejweber
Copy link
Collaborator Author

ejweber commented May 13, 2024

Added 66fa2e6 for @shuo-wu's comments, removed debugging helpers, and pointed back to upstream types. Rebase is a mess given the number of commits / time in development and the dependency changes since I started, so we can use squash instead. Ready to go pending @shuo-wu's approval.

@shuo-wu
Copy link
Contributor

shuo-wu commented May 14, 2024

After resolving the rebase and conflicts, it's good to go

@ejweber ejweber merged commit e39b7f0 into longhorn:master May 14, 2024
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants