Commit 0c7325f

Merged PR 12878455: Fix for the Rego "metadata desync" bug
[cherry-picked from 421b12249544a334e36df33dc4846673b2a88279]

This set of changes fixes the [Metadata Desync with UVM State](https://msazure.visualstudio.com/One/_workitems/edit/33232631/) bug by reverting the Rego policy state on mount failures and on some types of unmount failure.

For mounts, a small piece of cleanup code is added to ensure we close the dm-crypt device if we fail to mount it. Aside from this, the change is relatively straightforward: if we get a failure, the cleanup functions will remove the directory and any dm-devices, and we will revert the Rego metadata.

For unmounts, careful consideration is needed: if the directory has been unmounted successfully (or perhaps even partially), and we then get an error, we cannot just revert the policy state, as this may allow the host to use a broken or empty mount as one of the image layers. See 615c9a0bdf's commit message for more detailed thoughts.

The solution I opted for is that, for non-trivial unmount failures (i.e. not policy denial, not invalid mountpoint), we block all further mount, unmount, container creation and deletion attempts. I think this is OK since we really do not expect unmounts to fail. This is especially the case for us, since the only writable disk mount we have is the shared scratch disk, which we do not unmount at all unless we're about to kill the UVM.

Testing
-------

The "Rollback policy state on mount errors" commit message has instructions for making a deliberately corrupted VHD (one that still passes the verifyinfo extraction) that will cause a mount error.

The other way to make mount / unmount fail, and thus test this change, is to mount a tmpfs or ro bind in relevant places.

To make unmount fail:

    mkdir /run/gcs/c/.../rootfs/a && mount -t tmpfs none /run/gcs/c/.../rootfs/a

or

    mkdir /run/gcs/mounts/scsi/m1/a && mount -t tmpfs none /run/gcs/mounts/scsi/m1/a

To make mount fail:

    mount -o ro --bind /run/mounts/scsi /run/mounts/scsi

or

    mount --bind -o ro /run/gcs/c /run/gcs/c

Once a failure is triggered, one can make the operations work again by `umount`ing the tmpfs or ro bind.

What about other operations?
----------------------------

This PR covers mount and unmount of disks, overlays and 9p. Aside from setting `metadata.matches` as part of the narrowing scheme, and `metadata.started` to prevent re-using a container ID, Rego does not use persistent state for any other operations. Since it's not clear whether reverting the state would be semantically correct (we would need to carefully consider exactly what the side effects are of, say, failing to execute a process, start a container, or send a signal), and adding the revert to those operations does not currently affect much behaviour, I've opted not to apply the metadata revert to those for now.

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
1 parent 314dc3d commit 0c7325f

12 files changed: +932 -42 lines changed
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+//go:build linux
+// +build linux
+
+package gcs
+
+import (
+	"context"
+	"fmt"
+	"os"
+	"runtime"
+	"time"
+
+	"github.com/Microsoft/hcsshim/internal/log"
+	"github.com/Microsoft/hcsshim/pkg/amdsevsnp"
+	"github.com/sirupsen/logrus"
+)
+
+// UnrecoverableError logs the error and then puts the current thread into an
+// infinite sleep loop. This is to be used instead of panicking, as the
+// behaviour of GCS panics is unpredictable. This function can be extended to,
+// for example, try to shutdown the VM cleanly.
+func UnrecoverableError(err error) {
+	buf := make([]byte, 300*(1<<10))
+	stackSize := runtime.Stack(buf, true)
+	stackTrace := string(buf[:stackSize])
+
+	errPrint := fmt.Sprintf(
+		"Unrecoverable error in GCS: %v\n%s",
+		err, stackTrace,
+	)
+	isSnp := amdsevsnp.IsSNP()
+	if isSnp {
+		errPrint += "\nThis thread will now enter an infinite loop."
+	}
+	log.G(context.Background()).WithError(err).Logf(
+		logrus.FatalLevel,
+		"%s",
+		errPrint,
+	)
+
+	if !isSnp {
+		panic("Unrecoverable error in GCS: " + err.Error())
+	} else {
+		fmt.Fprintf(os.Stderr, "%s\n", errPrint)
+		for {
+			time.Sleep(time.Hour)
+		}
+	}
+}
