Fix kernel hang when snapshot cleaner thread is not stopped properly #156

Open
ahuja-gautam opened this issue Mar 18, 2024 · 0 comments

@ahuja-gautam

When NOVA is unmounted, nova_save_snapshots() stops the snapshot cleaner
kthread with kthread_stop(). If the cleaner kthread calls schedule() while
kthread_stop() is blocked in wait_for_completion(), the kthread sleeps
forever waiting for a wake-up that never arrives, and the unmount hangs.

linux-nova/fs/nova/snapshot.c

Lines 1301 to 1306 in 976a4d1

int nova_save_snapshots(struct super_block *sb)
{
	struct nova_sb_info *sbi = NOVA_SB(sb);
	if (sbi->snapshot_cleaner_thread)
		kthread_stop(sbi->snapshot_cleaner_thread);

linux-nova/fs/nova/snapshot.c

Lines 1319 to 1326 in 976a4d1

static void snapshot_cleaner_try_sleeping(struct nova_sb_info *sbi)
{
	DEFINE_WAIT(wait);
	prepare_to_wait(&sbi->snapshot_cleaner_wait, &wait, TASK_INTERRUPTIBLE);
	schedule();
	finish_wait(&sbi->snapshot_cleaner_wait, &wait);
}

Reproduction:

  1. Mount a fresh nova instance using the 'mount -t NOVA -o init' command

  2. Unmount nova

  3. Remount nova at the same mount point

  4. Repeat steps 2 and 3 in a tight loop until the kernel hangs. In our
    experiments, we reproduced the hang within 40 to 480 seconds, with an
    average of 254 seconds.

We wrote a script and helper C program to reproduce the bug
(Makefile and driver.c).
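
The reproduction steps above can be sketched as a small C loop around
mount(2) and umount(2). This is only an illustration under stated
assumptions, not the authors' driver.c: remount_loop() is a hypothetical
name, the device and mount point are supplied by the caller, and remounting
without options after the initial 'init' mount is an assumption. It must run
as root on a kernel with NOVA loaded. If the hang triggers, umount() simply
never returns.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: mount a fresh NOVA instance once with "init",
 * then unmount and remount it up to max_cycles times.
 *
 * Returns the number of completed unmount/remount cycles, or -1 if any
 * mount() or umount() call fails.
 */
long remount_loop(const char *dev, const char *mnt, long max_cycles)
{
	long i;

	/* Step 1: fresh instance, equivalent to 'mount -t NOVA -o init'. */
	if (mount(dev, mnt, "NOVA", 0, "init") != 0) {
		fprintf(stderr, "initial mount: %s\n", strerror(errno));
		return -1;
	}

	for (i = 0; i < max_cycles; i++) {
		/* Step 2: unmount. This is the call that blocks on a hang. */
		if (umount(mnt) != 0) {
			fprintf(stderr, "umount: %s\n", strerror(errno));
			return -1;
		}
		/* Step 3: remount at the same mount point (no "init"). */
		if (mount(dev, mnt, "NOVA", 0, NULL) != 0) {
			fprintf(stderr, "remount: %s\n", strerror(errno));
			return -1;
		}
	}

	/* Leave the system unmounted when the loop finishes cleanly. */
	if (umount(mnt) != 0) {
		fprintf(stderr, "final umount: %s\n", strerror(errno));
		return -1;
	}
	return i;
}
```

A caller would pass its own persistent-memory device and mount point, which
are not specified in this report.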

Fix:
In snapshot_cleaner_try_sleeping(), call schedule() only if
kthread_should_stop() is false, so the kthread is never scheduled out after
a stop has been requested.

prepare_to_wait(&sbi->snapshot_cleaner_wait, &wait, TASK_INTERRUPTIBLE);
if (!kthread_should_stop())
    schedule();
finish_wait(&sbi->snapshot_cleaner_wait, &wait);

This fix follows standard practice in other Linux file systems such as
UBIFS and NFS.
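
The check-before-sleep pattern in the fix can be modeled in userspace. The
following is a minimal sketch, not NOVA code: it assumes POSIX threads, and
struct cleaner, cleaner_main(), cleaner_start(), cleaner_stop(), and
run_start_stop_cycles() are hypothetical names. A mutex and condition
variable stand in for the kernel wait queue, and a should_stop flag stands
in for kthread_should_stop().

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the cleaner thread's state. */
struct cleaner {
	pthread_t thread;
	pthread_mutex_t lock;
	pthread_cond_t wait;	/* plays the role of snapshot_cleaner_wait */
	bool should_stop;	/* plays the role of kthread_should_stop() */
};

static void *cleaner_main(void *arg)
{
	struct cleaner *c = arg;

	pthread_mutex_lock(&c->lock);
	/*
	 * Re-check the stop flag right before every sleep. If the stop
	 * request arrived first, the thread never goes to sleep at all.
	 */
	while (!c->should_stop)
		pthread_cond_wait(&c->wait, &c->lock);
	pthread_mutex_unlock(&c->lock);
	return NULL;
}

static int cleaner_start(struct cleaner *c)
{
	c->should_stop = false;
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->wait, NULL);
	return pthread_create(&c->thread, NULL, cleaner_main, c);
}

/* Analog of kthread_stop(): set the flag, wake the thread, wait for exit. */
static int cleaner_stop(struct cleaner *c)
{
	int ret;

	pthread_mutex_lock(&c->lock);
	c->should_stop = true;
	pthread_cond_signal(&c->wait);
	pthread_mutex_unlock(&c->lock);

	ret = pthread_join(c->thread, NULL);
	pthread_cond_destroy(&c->wait);
	pthread_mutex_destroy(&c->lock);
	return ret;
}

/* Start and stop the thread n times; return how many cycles completed. */
int run_start_stop_cycles(int n)
{
	int done = 0;

	for (int i = 0; i < n; i++) {
		struct cleaner c;

		if (cleaner_start(&c) != 0)
			break;
		if (cleaner_stop(&c) != 0)
			break;
		done++;
	}
	return done;
}
```

If cleaner_main() called pthread_cond_wait() once without testing
should_stop first, a stop request issued before the wait would be missed and
pthread_join() would block, which is the userspace counterpart of the hang
in this report.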

The linked patch fixes this bug. We ran the same scripts for 10 million
iterations over 17 hours, and the bug did not trigger. The bug was
discovered using Metis, a new tool that finds file system bugs using model
checking.

Signed-off-by: Gautam Ahuja <gaahuja@cs.stonybrook.edu>
Signed-off-by: Yifei Liu <yifeliu@cs.stonybrook.edu>
Signed-off-by: Erez Zadok <ezk@cs.stonybrook.edu>
