@micromaomao micromaomao requested a review from a team as a code owner November 11, 2025 18:50
@micromaomao micromaomao requested a review from anmaxvl November 11, 2025 18:52
@micromaomao micromaomao force-pushed the tingmao_github/merge-msrc-to-main branch 6 times, most recently from 041e99f to fb118fe on November 14, 2025 17:18
@micromaomao micromaomao force-pushed the tingmao_github/merge-msrc-to-main branch from fb118fe to 9fca8e9 on November 24, 2025 17:57
[Cherry-picked from ba849d8464ff019fe751872d458168bc9aec2256]

While the entire PR contains several (perhaps unrelated) changes, the individual commits can be reviewed on their own, and there is some background in the commit messages.

I've tested this with azcri-containerd as well as with the functional tests in hcsshim. Some of the tests fail for unrelated reasons (can't do VPMem, no GPU, WCOW), but I haven't noticed anything that could be caused by these changes. I also deployed these changes to a VM and can run some confidential workloads manually.

Outstanding work that will not be done in this PR:

- Deny unmounting of in-use layers (this should already be impossible due to the directories being in use, but an attempt can leave the rego metadata inconsistent)

Missing Go tests that will not be part of this PR (I have a local stash for these, but they are not fully working; can be done separately):

- Functional test for checking that a bad OCISpec.Rootfs is rejected (tested manually)
- Functional test for checking that a bad containerID is rejected (tested manually)

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
Since the virtual pod ID is used to construct paths, it needs to be validated in
the same way we validate container/sandbox IDs.  This only applies to
confidential containers.
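
A minimal sketch of the kind of validation this implies (the pattern and helper
name here are illustrative, not the actual hcsshim code):

```go
import (
	"fmt"
	"regexp"
)

// Restrict IDs to characters that are safe to embed in filesystem paths,
// ruling out path separators and traversal sequences such as "../".
var safeIDPattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9_\-]*$`)

func validateVirtualPodID(id string) error {
	if !safeIDPattern.MatchString(id) {
		return fmt.Errorf("virtual pod ID %q is invalid", id)
	}
	return nil
}
```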

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
…il for confidential

[cherry-picked from 73a7151ad8f15aad9f0e4eef6b86ed1a1a5a0953]

This fixes a vulnerability where the host can provide arbitrary commands as
hooks in the OCISpecification, which will get executed when passed to runc.

Normally, ociSpec.Hooks is cleared explicitly by the host, and so we should
never get passed hooks from a genuine host.  The only case where OCI hooks would
be set is if we are running a GPU container, in which case guest-side code in
GCS will add hooks to set up the device - see addNvidiaDeviceHook.

This restriction will also apply to the hooks we set this way ourselves;
however, this is fine for now as we do not support confidential GPUs.  When we
do support GPU confidential containers, we will need to pull this enforcement
into Rego and allow the policy to determine whether a hook should be allowed
(along with enforcing which devices may be exposed to the container).

Even after this fix, we still have unenforced fields in the OCISpecification
that we receive from the host, which may allow the host to, for example, add
device nodes; we should harden this further in a future release.
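
A minimal sketch of the guest-side enforcement described above, assuming
illustrative names (per the above, the check also rejects the GPU hooks the
guest would add itself, which is acceptable while confidential GPUs are
unsupported):

```go
import (
	"errors"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// rejectHooks refuses any OCI spec that reaches the guest with hooks
// populated when running in confidential mode.
func rejectHooks(spec *specs.Spec, confidential bool) error {
	if !confidential || spec.Hooks == nil {
		return nil
	}
	h := spec.Hooks
	if len(h.Prestart) > 0 || len(h.CreateRuntime) > 0 ||
		len(h.CreateContainer) > 0 || len(h.StartContainer) > 0 ||
		len(h.Poststart) > 0 || len(h.Poststop) > 0 {
		return errors.New("OCI hooks are not allowed in confidential mode")
	}
	return nil
}
```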

Tested using the following host-side reproduction patch:

```diff
diff --git a/internal/hcsoci/hcsdoc_lcow.go b/internal/hcsoci/hcsdoc_lcow.go
index db94d73..a270c13ac 100644
--- a/internal/hcsoci/hcsdoc_lcow.go
+++ b/internal/hcsoci/hcsdoc_lcow.go
@@ -94,6 +94,22 @@ func createLinuxContainerDocument(ctx context.Context, coi *createOptionsInterna
 		return nil, err
 	}

+	isSandbox := spec.Annotations[annotations.KubernetesContainerType] == "sandbox"
+
+	if !isSandbox {
+		if spec.Hooks == nil {
+			spec.Hooks = &specs.Hooks{}
+		}
+		timeout := 5000000
+		spec.Hooks.StartContainer = append(spec.Hooks.StartContainer, specs.Hook{
+			Path:    "/bin/sh",
+			Args:    []string{"/bin/sh", "-c", "echo hacked by hooks > /hacked.txt"},
+			Env:     []string{"PATH=/usr/local/bin:/usr/bin:/bin:/sbin"},
+			Timeout: &timeout,
+		})
+		log.G(ctx).Info("Injected hook into OCISpec")
+	}
+
 	log.G(ctx).WithField("guestRoot", guestRoot).Debug("hcsshim::createLinuxContainerDoc")
 	return &linuxHostedSystem{
 		SchemaVersion:    schemaversion.SchemaV21(),
```

Non-confidential scenarios should not be affected.

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
[cherry-picked from 5469e5c013e08dfca8aaf31dae8f617831fd5ac7]

For an arbitrary fragment offered by the host, we currently need to interpret the fragment code as a new Rego module in order for the framework to check it.  However, we did not check that the namespace of the fragment, which is used as the namespace of the module, does not conflict with any of our own namespaces (e.g. framework, policy, or api).  This means that as soon as we load it for policy checking, a malicious fragment could override enforcement points defined in the framework, including `load_fragment`, and allow itself in, even if we would otherwise have denied it because of a wrong issuer, etc.

This can be tested with the below fragment:

	package framework

	svn := "1"
	framework_version := "0.3.1"

	load_fragment := {"allowed": true, "add_module": true}
	enforcement_point_info := {
		"available": true,
		"unknown": false,
		"invalid": false,
		"version_missing": false,
		"default_results": {"allowed": true},
		"use_framework": true
	}

	mount_device := {"allowed": true}
	rw_mount_device := {"allowed": true}
	mount_overlay := {"allowed": true}
	create_container := {"allowed": true, "env_list": null, "allow_stdio_access": true}
	unmount_device := {"allowed": true}
	rw_unmount_device := {"allowed": true}
	unmount_overlay := {"allowed": true}
	exec_in_container := {"allowed": true, "env_list": null}
	exec_external := {"allowed": true, "env_list": null, "allow_stdio_access": true}
	shutdown_container := {"allowed": true}
	signal_container_process := {"allowed": true}
	plan9_mount := {"allowed": true}
	plan9_unmount := {"allowed": true}
	get_properties := {"allowed": true}
	dump_stacks := {"allowed": true}
	runtime_logging := {"allowed": true}
	scratch_mount := {"allowed": true}
	scratch_unmount := {"allowed": true}

However, our namespace parsing was also susceptible to attacks, and we might end up concluding that the fragment namespace is safe even though, when the Rego engine parses it, it sees a package of "framework". This can be demonstrated with the following fragment:

	package #aa
	framework

	svn := "0"
	framework_version := "0.3.1"

	load_fragment := {"allowed": true, "add_module": true}

And so we also ensure that the `package ...` line can't contain anything unusual.
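
A minimal sketch of both checks, with an illustrative reserved-namespace list
and regex (not the exact hcsshim implementation):

```go
import (
	"fmt"
	"regexp"
)

// Namespaces a fragment's package must never collide with.
var reservedNamespaces = map[string]bool{
	"framework": true,
	"policy":    true,
	"api":       true,
}

// Require the package declaration to be a single unambiguous line, so the
// namespace we extract is exactly what the Rego engine will see: no
// comments, line breaks, or extra tokens smuggled in.
var packageLine = regexp.MustCompile(`^package ([a-zA-Z_][a-zA-Z0-9_]*)$`)

func fragmentNamespace(pkgLine string) (string, error) {
	m := packageLine.FindStringSubmatch(pkgLine)
	if m == nil {
		return "", fmt.Errorf("malformed package declaration: %q", pkgLine)
	}
	if reservedNamespaces[m[1]] {
		return "", fmt.Errorf("fragment namespace %q is reserved", m[1])
	}
	return m[1], nil
}
```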

Furthermore, since it can still be risky to inject arbitrary, untrusted Rego code into our execution context, we add another check prior to loading the fragment as a module: the fragment must first pass the issuer and feed checks, which do not require loading it.
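
The resulting order of operations, sketched with hypothetical types and
helpers:

```go
// Fragment is a stand-in for the parsed fragment structure; the names here
// are hypothetical, not hcsshim's actual types.
type Fragment struct {
	Issuer, Feed, PackageLine string
}

// Data-only validation; no Rego from the fragment is evaluated.
func checkIssuerAndFeed(f Fragment) error { return nil }

// Compiles the fragment as a module and queries load_fragment.
func addModuleAndEnforce(f Fragment) error { return nil }

func loadFragment(f Fragment) error {
	// 1. Issuer/feed checks run before the fragment's code is touched.
	if err := checkIssuerAndFeed(f); err != nil {
		return err
	}
	// 2. Namespace check on the raw source, as sketched above.
	if _, err := fragmentNamespace(f.PackageLine); err != nil {
		return err
	}
	// 3. Only now is the fragment loaded into the Rego engine.
	return addModuleAndEnforce(f)
}
```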

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
[cherry-picked from 0ca40bb4f130b3508f4a130011463070328d40d0]

- rego: Fix missing error reason when mounting a rw device to an existing mount point.

  This fixes a missing error message introduced in the last round of security
  fixes.  It's not hugely important, but it eases debugging if we get policy
  denials on mounting the scratch for whatever reason.  Also adds a test for it.

- Remove a no-op from rego

  Checked with @<Matthew Johnson (AR)> earlier that this basically does nothing
  and is just something left over.  However, I will not actually add a remove op
  for `metadata.started` for now.

This PR is targeting the conf-aci branch on ADO because the commit being fixed
is not on main yet.  This should be backported to main together with the fixes
from last month.

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
[cherry-picked from 421b12249544a334e36df33dc4846673b2a88279]

This set of changes fixes the [Metadata Desync with UVM
State](https://msazure.visualstudio.com/One/_workitems/edit/33232631/) bug by
reverting the Rego policy state on mount failures and some types of unmount
failures.

For mounts, some minor cleanup code is added to ensure we close down the
dm-crypt device if we fail to mount it.  Aside from this, it is relatively
straightforward: if we get a failure, the cleanup functions will remove the
directory and any dm-devices, and we will revert the Rego metadata.
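
A minimal sketch of the cleanup pattern, with hypothetical helpers standing in
for the real dm-crypt and mount plumbing:

```go
import "os"

// Hypothetical stand-ins for the real storage plumbing.
func setupDMCrypt(dev string) (string, error) { return dev, nil }
func closeDMCrypt(dev string) error           { return nil }
func doMount(dev, target string) error        { return nil }

func mountScratch(devPath, target string) (err error) {
	cryptDev, err := setupDMCrypt(devPath)
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			// On any later failure, tear down what we created so the
			// guest state matches the reverted Rego metadata.
			_ = closeDMCrypt(cryptDev)
			_ = os.RemoveAll(target)
		}
	}()
	if err = os.MkdirAll(target, 0o755); err != nil {
		return err
	}
	return doMount(cryptDev, target)
}
```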

For unmounts, more careful consideration is needed: if the directory has been
unmounted successfully (or even partially), and we then get an error, we cannot
just revert the policy state, as this may allow the host to use a broken or
empty mount as one of the image layers.  See 615c9a0bdf's commit message for
more detailed thoughts.

The solution I opted for is: for non-trivial unmount failure cases (i.e. not a
policy denial, not an invalid mountpoint), we block all further mount, unmount,
container creation and deletion attempts.  I think this is OK since we really
do not expect unmounts to fail - especially in our case, since the only
writable disk mount we have is the shared scratch disk, which we do not unmount
at all unless we're about to kill the UVM.
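
A minimal sketch of the blocking behaviour, assuming an illustrative guard
that the host-request handlers would consult (not the actual implementation):

```go
import (
	"errors"
	"sync/atomic"
)

// Set once a non-trivial unmount failure leaves the guest state and the
// Rego metadata potentially out of sync; never cleared.
var policyPoisoned atomic.Bool

var errPoisoned = errors.New(
	"a previous unmount failure left policy metadata in an unknown state; " +
		"refusing further mount, unmount and container operations")

// Called at the top of every mount/unmount/create/delete handler.
func checkNotPoisoned() error {
	if policyPoisoned.Load() {
		return errPoisoned
	}
	return nil
}

// Called when an unmount fails for a reason other than a policy denial or
// an invalid mountpoint (those do not change guest state).
func recordUnmountFailure(err error) error {
	policyPoisoned.Store(true)
	return err
}
```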

Testing
-------

The "Rollback policy state on mount errors" commit message has some instruction
for making a deliberately corrupted (but still passes the verifyinfo extraction)
VHD that will cause a mount error.  The other way we could make mount / unmount
fail, and thus test this change, is by mounting a tmpfs or ro bind in relevant
places:

To make unmount fail:

    mkdir /run/gcs/c/.../rootfs/a && mount -t tmpfs none /run/gcs/c/.../rootfs/a

or

    mkdir /run/gcs/mounts/scsi/m1/a && mount -t tmpfs none /run/gcs/mounts/scsi/m1/a

To make mount fail:

    mount -o ro --bind /run/mounts/scsi /run/mounts/scsi

or

    mount --bind -o ro /run/gcs/c /run/gcs/c

Once a failure is triggered, one can make things work again by `umount`ing the
tmpfs or read-only bind mount.

What about other operations?
----------------------------

This PR covers mount and unmount of disks, overlays and 9p.  Aside from setting
`metadata.matches` as part of the narrowing scheme, and except for
`metadata.started` to prevent re-using a container ID, Rego does not use
persistent state for any other operations.  Since it's not clear whether
reverting the state would be semantically correct (we would need to carefully
consider exactly what the side effects are of, say, failing to execute a
process, start a container, or send a signal), and adding the revert to those
operations does not currently affect much behaviour, I've opted not to apply
the metadata revert to those for now.

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
…fidential containers

[cherry-picked from f81b450894206a79fff4d63182ff034ba503ebdb]

This PR contains 2 commits. The first one is the fix:

**bridge: Force sequential message handling for confidential containers**

This fixes a vulnerability (and reduces the surface for other similar potential
vulnerabilities) in confidential containers where, if the host sends a
mount/unmount modification request concurrently with an ongoing CreateContainer
request, it could re-order or skip image layers for the container being
started.

While this could be fixed by adding mutex lock/unlock around the individual
modifyMappedVirtualDisk/modifyCombinedLayers/CreateContainer functions, we
decided that, in order to prevent any more of this class of issues, the UVM,
when running in confidential mode, should simply not allow concurrent requests
(with an exception for any actually long-running requests, which for now is
just waitProcess).
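
A minimal sketch of the idea, with a hypothetical dispatch function (the real
bridge code is structured differently):

```go
import "sync"

var bridgeMu sync.Mutex

// dispatch handles one bridge message. In confidential mode everything is
// serialized behind a single mutex, except genuinely long-running RPCs
// (currently just waitProcess), which would otherwise stall the bridge.
func dispatch(confidential bool, msgType string, handle func()) {
	if confidential && msgType != "waitProcess" {
		bridgeMu.Lock()
		defer bridgeMu.Unlock()
		handle()
		return
	}
	go handle() // non-confidential or long-running: stay concurrent
}
```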

The second commit adds a log entry for when the processing thread blocks.  This
will make it easier to debug should the GCS hang on a request.

This PR is created on ADO targeting the conf branch as this security
vulnerability is not public yet. This fix should be backported to main once
deployed.

Related work items: #33357501, #34327300
Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
[cherry-picked from 055ee5eb4a802cb407575fb6cc1e9b07069d3319]

guest/network: Restrict hostname to valid characters

Because we write this hostname to /etc/hosts, without proper validation the
host can trick us into writing arbitrary data there, which can, for example,
redirect names like ip6-localhost (but likely not localhost itself) to an
attacker-controlled IP address.

We implement a check here that the host-provided DNS name in the OCI spec is
valid.

ACI actually restricts this to 5-63 characters of a-zA-Z0-9 and '-', where the
first and last characters cannot be '-'.  This aligns with the Kubernetes
restriction; c.f. IsValidDnsLabel in Compute-ACI.  However, there is no
consistent official agreement on what a valid hostname can contain.  RFC 952
says that a "domain name" can be up to 24 characters of A-Z0-9, '.' and '-';
RFC 1123 expands this to 255 characters; but RFC 1035 claims that domain names
can in fact contain anything if quoted (as long as the length is within 255
characters), and this is confirmed again in RFC 2181.  In practice we see names
with underscores, most commonly _dmarc.example.com.  curl allows 0-9a-zA-Z and
-._|~ and any other codepoints from \u0001-\u001f and above \u007f:
https://github.com/curl/curl/blob/master/lib/urlapi.c#L527-L545

With the above in mind, this commit allows up to 255 characters of a-zA-Z0-9 and
'_', '-' and '.'.  This change is applied to all LCOW use cases.
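
A minimal sketch of the check (the exact pattern and length bound in the
shipped code may differ; the error output below prints the actual regex used):

```go
import (
	"fmt"
	"regexp"
)

// Allow only [a-zA-Z0-9_.-]; everything else, notably whitespace and
// newlines that would corrupt /etc/hosts, is rejected.
var validHostname = regexp.MustCompile(`^[a-zA-Z0-9_\-.]{0,255}$`)

func validateHostname(hostname string) error {
	if !validHostname.MatchString(hostname) {
		return fmt.Errorf("hostname %q invalid: must match %s",
			hostname, validHostname)
	}
	return nil
}
```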

This fix can be tested with the below code to bypass any host-side checks:

	+++ b/internal/hcsoci/hcsdoc_lcow.go
	@@ -52,6 +52,10 @@ func createLCOWSpec(ctx context.Context, coi *createOptionsInternal) (*specs.Spe
			spec.Linux.Seccomp = nil
		}

	+	if spec.Annotations[annotations.KubernetesContainerType] == "sandbox" {
	+		spec.Hostname = "invalid-hostname\n1.1.1.1 ip6-localhost ip6-loopback localhost"
	+	}
	+
		return spec, nil
	}

Output:

	time="2025-10-01T15:13:41Z" level=fatal msg="run pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: failed to create container f2209bb2960d5162fc9937d3362e1e2cf1724c56d1296ba2551ce510cb2bcd43: guest RPC failure: hostname \"invalid-hostname\\n1.1.1.1 ip6-localhost ip6-loopback localhost\" invalid: must match ^[a-zA-Z0-9_\\-\\.]{0,999}$: unknown"

Related work items: #34370598
Closes: https://msazure.visualstudio.com/One/_workitems/edit/34370598
Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
…nts, and prevent unmounting or deleting in-use things

[cherry-picked from d0334883cd43eecbb401a6ded3e0317179a3e54b]

This set of changes adds some checks (when running with a confidential policy)
to prevent the host from trying to clean up mounts, overlays, or the container
states dir while the container is running (or while the overlay has not been
unmounted yet).  This is done by enhancing the existing `hostMounts` utility,
as well as adding a `terminated` flag to the Container struct.

The correct order of operations should always be:

- mount read-only layers and scratch (in any order; individual containers (not
  the sandbox) might not have their own scratch)
- mount the overlay
- start the container
- container terminates
- unmount the overlay
- unmount read-only layers and scratch

The starting-up order is implied, and we now explicitly deny e.g. unmounting a
layer/scratch before unmounting the overlay, or unmounting the overlay while
the container has not terminated.

We also deny deleteContainerState requests when the container is running or when
the overlay is mounted.  Doing so when a container is running can result in
unexpectedly deleting its files, which breaks it in unpredictable ways and is
bad.
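
A minimal sketch of the lifecycle checks this implies, with illustrative state
tracking (the real implementation builds on `hostMounts` and the Container
struct):

```go
import "errors"

// Illustrative per-container state; hcsshim's actual tracking differs.
type containerState struct {
	overlayMounted bool
	terminated     bool
}

func (c *containerState) allowUnmountLayerOrScratch() error {
	if c.overlayMounted {
		return errors.New("cannot unmount layer/scratch while overlay is mounted")
	}
	return nil
}

func (c *containerState) allowUnmountOverlay() error {
	if !c.terminated {
		return errors.New("cannot unmount overlay while container is running")
	}
	return nil
}

func (c *containerState) allowDeleteContainerState() error {
	if !c.terminated || c.overlayMounted {
		return errors.New("cannot delete state of an in-use container")
	}
	return nil
}
```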

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
[cherry-picked from 1dd0b7ea0b0f91d3698f6008fb0bd5b0de777da6]

Blocks mount option passing for 9p (where it was accidental in the first place)
and for SCSI disks.

- guest: Restrict plan9 share names to digits only in confidential mode
- hcsv2/uvm: Restrict SCSI mount options in confidential mode
  (the only one we allow is `ro`)
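
A minimal sketch of both restrictions, with illustrative function names:

```go
import (
	"fmt"
	"regexp"
)

var digitsOnly = regexp.MustCompile(`^[0-9]+$`)

// Plan9 share names end up in mount parameters, so in confidential mode
// they are restricted to digits only.
func validatePlan9ShareName(name string) error {
	if !digitsOnly.MatchString(name) {
		return fmt.Errorf("plan9 share name %q not allowed in confidential mode", name)
	}
	return nil
}

// SCSI mount options from the host are allow-listed; only "ro" passes.
func validateSCSIMountOptions(opts []string) error {
	for _, o := range opts {
		if o != "ro" {
			return fmt.Errorf("mount option %q not allowed in confidential mode", o)
		}
	}
	return nil
}
```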

Related work items: #34370380
Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
@micromaomao micromaomao force-pushed the tingmao_github/merge-msrc-to-main branch from 9fca8e9 to 0bd76ae on November 24, 2025 17:59