Improve pgalloc huge page awareness.
This CL addresses the following major issues:

- Both MM and the application page allocator (pgalloc) are agnostic as to
  whether the underlying memory file will be THP-backed. Instead, both attempt
  to align hugepage-sized and larger allocations to hugepage boundaries, such
  that if the memory file happens to support THP then such allocations will be
  appropriately aligned to use THP. This is suboptimal since many allocations
  do not benefit from THP, resulting in memory underutilization.

- When an application releases memory to the sentry, the sentry unconditionally
  releases that memory to the host, rather than allowing it to be reused for
  future allocations, in order to ensure that new allocations are uniformly
  decommitted (use no memory): cl/145016083. In most cases, this should have
  relatively little performance impact; since releasing memory from the
  application to the OS is expensive even outside of gVisor, application memory
  allocators optimizing for performance already limit the rate at which they
  release memory to the OS. However, in applications that involve frequent
  process creation and exit (e.g. build systems), this practice prevents reuse
  of memory deallocated by exiting processes for memory allocated by new
  processes, resulting in both performance degradation and a spike in memory
  usage (since the sentry may not have released all deallocated memory to the
  host by the time new allocations occur).

These issues are especially relevant to platforms based on hardware
virtualization, where acquiring memory from the host is significantly more
expensive due to EPT/NPT fault overhead; when effective, THP reduces the
frequency with which said cost is incurred by a factor of 512, and page reuse
avoids incurring it at all.

Thus:

- Instead of inferring whether THP use is desired from allocation size,
  indicate this explicitly as AllocOpts.Huge, and only set it to true for
  allocations for non-stack private anonymous mappings.

- Add AllocateCallerCommit, a new possible value for AllocOpts.Mode that
  indicates that the caller will commit all pages in the allocation. In such
  cases, pgalloc can reuse deallocated pages without risking increased memory
  usage. AllocateCallerCommit is used primarily for page faults on a THP-backed
  region.

- Allow different chunks in pgalloc.MemoryFile's backing file to have varying
  THP-ness, indicated to the host using MADV_HUGEPAGE/NOHUGEPAGE.

- Split pgalloc.MemoryFile's existing page metadata set into two sets tracking
  deallocated pages for small/huge-page-backed regions respectively; two sets
  tracking in-use pages for small/huge-page-backed regions respectively; and a
  fifth set tracking memory accounting state.

- Add MemoryFileOpts.DisableMemoryAccounting; this is primarily intended for
  pgalloc tests, but may also be applicable to disk-backed MemoryFiles.

Cleanup:

- Remove MemoryFile.usageSwapped; the UpdateUsage() optimization it enabled,
  described in updateUsageLocked(), was based on the condition that
  MemoryFile.mu would be locked throughout the call to updateUsageLocked(),
  which was invalidated by cl/337865250.

- Remove MemoryFileOpts.ManualZeroing, which is unused.

Using THP for application memory requires setting
/sys/kernel/mm/transparent_hugepage/shmem_enabled to "advise", in order to
allow runsc to request THP from the kernel.

After this CL, pgalloc.MemoryFile still releases memory to the host as fast as
possible, limiting the effectiveness of page recycling. A following CL adds
optional memory release throttling to improve this.

PiperOrigin-RevId: 481741148
nixprime authored and gvisor-bot committed Oct 20, 2023
1 parent 57606c7 commit 16ffda3
Showing 81 changed files with 3,561 additions and 2,417 deletions.
2 changes: 2 additions & 0 deletions nogo.yaml
@@ -190,8 +190,10 @@ analyzers:
 - pkg/gohacks/noescape_unsafe.go # Special case.
 - pkg/ring0/pagetables/allocator_unsafe.go # Special case.
 - pkg/sentry/fsutil/host_file_mapper_unsafe.go # Special case.
+- pkg/sentry/pgalloc/pgalloc_unsafe.go # Special case.
 - pkg/sentry/platform/kvm/bluepill_unsafe.go # Special case.
 - pkg/sentry/platform/kvm/machine_unsafe.go # Special case.
+- pkg/sentry/platform/pgalloc/pgalloc_unsafe.go # Special case.
 - pkg/sentry/platform/systrap/stub_unsafe.go # Special case.
 - pkg/sentry/platform/systrap/syscall_thread_unsafe.go # Special case.
 - pkg/sentry/platform/systrap/sysmsg_thread_unsafe.go # Special case.
17 changes: 6 additions & 11 deletions pkg/context/context.go
@@ -63,17 +63,12 @@ type Blocker interface {
 	BlockWithTimeoutOn(waiter.Waitable, waiter.EventMask, time.Duration) (time.Duration, bool)

 	// UninterruptibleSleepStart indicates the beginning of an uninterruptible
-	// sleep state (equivalent to Linux's TASK_UNINTERRUPTIBLE). If deactivate
-	// is true and the Context represents a Task, the Task's AddressSpace is
-	// deactivated.
-	UninterruptibleSleepStart(deactivate bool)
+	// sleep state (equivalent to Linux's TASK_UNINTERRUPTIBLE).
+	UninterruptibleSleepStart()

 	// UninterruptibleSleepFinish indicates the end of an uninterruptible sleep
-	// state that was begun by a previous call to UninterruptibleSleepStart. If
-	// activate is true and the Context represents a Task, the Task's
-	// AddressSpace is activated. Normally activate is the same value as the
-	// deactivate parameter passed to UninterruptibleSleepStart.
-	UninterruptibleSleepFinish(activate bool)
+	// state that was begun by a previous call to UninterruptibleSleepStart.
+	UninterruptibleSleepFinish()
 }

 // NoTask is an implementation of Blocker that does not block.
@@ -147,10 +142,10 @@ func (nt *NoTask) BlockWithTimeoutOn(w waiter.Waitable, mask waiter.EventMask, d
 }

 // UninterruptibleSleepStart implements Blocker.UninterruptibleSleepStart.
-func (*NoTask) UninterruptibleSleepStart(bool) {}
+func (*NoTask) UninterruptibleSleepStart() {}

 // UninterruptibleSleepFinish implements Blocker.UninterruptibleSleepFinish.
-func (*NoTask) UninterruptibleSleepFinish(bool) {}
+func (*NoTask) UninterruptibleSleepFinish() {}

 // Context represents a thread of execution (hereafter "goroutine" to reflect
 // Go idiosyncrasy). It carries state associated with the goroutine across API
1 change: 1 addition & 0 deletions pkg/hostarch/BUILD
@@ -34,6 +34,7 @@ go_library(
         "addr.go",
         "addr_range.go",
         "addr_range_seq_unsafe.go",
+        "addr_unsafe.go",
         "hostarch.go",
         "hostarch_arm64.go",
         "hostarch_x86.go",
35 changes: 32 additions & 3 deletions pkg/hostarch/addr.go
@@ -33,9 +33,10 @@ type Addr uintptr
 // expected to ever come up in practice.
 func (v Addr) AddLength(length uint64) (end Addr, ok bool) {
 	end = v + Addr(length)
-	// The second half of the following check is needed in case uintptr is
-	// smaller than 64 bits.
-	ok = end >= v && length <= uint64(^Addr(0))
+	// As of this writing (Go 1.19), addrAtLeast64b is required to prevent the
+	// compiler from generating a tautological `length <= MaxUint64` check on
+	// 64-bit architectures.
+	ok = end >= v && (addrAtLeast64b || length <= uint64(^Addr(0)))
 	return
 }

@@ -75,6 +76,16 @@ func (v Addr) IsPageAligned() bool {
 	return IsPageAligned(v)
 }

+// HugePageOffset returns the offset of v into the current huge page.
+func (v Addr) HugePageOffset() uint64 {
+	return uint64(HugePageOffset(v))
+}
+
+// IsHugePageAligned returns true if v.HugePageOffset() == 0.
+func (v Addr) IsHugePageAligned() bool {
+	return IsHugePageAligned(v)
+}
+
 // AddrRange is a range of Addrs.
 //
 // type AddrRange <generated by go_generics>
@@ -85,12 +96,30 @@ func (v Addr) ToRange(length uint64) (AddrRange, bool) {
 	return AddrRange{v, end}, ok
 }

+// MustToRange is equivalent to ToRange, but panics if the end of the range
+// wraps around.
+//
+//go:nosplit
+func (v Addr) MustToRange(length uint64) AddrRange {
+	ar, ok := v.ToRange(length)
+	if !ok {
+		panic("hostarch.Addr.ToRange() wraps")
+	}
+	return ar
+}
+
 // IsPageAligned returns true if ar.Start.IsPageAligned() and
 // ar.End.IsPageAligned().
 func (ar AddrRange) IsPageAligned() bool {
 	return ar.Start.IsPageAligned() && ar.End.IsPageAligned()
 }

+// IsHugePageAligned returns true if ar.Start.IsHugePageAligned() and
+// ar.End.IsHugePageAligned().
+func (ar AddrRange) IsHugePageAligned() bool {
+	return ar.Start.IsHugePageAligned() && ar.End.IsHugePageAligned()
+}
+
 // String implements fmt.Stringer.String.
 func (ar AddrRange) String() string {
 	return fmt.Sprintf("[%#x, %#x)", ar.Start, ar.End)
22 changes: 22 additions & 0 deletions pkg/hostarch/addr_unsafe.go
@@ -0,0 +1,22 @@
+// Copyright 2022 The gVisor Authors.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package hostarch
+
+import (
+	"unsafe"
+)
+
+// This is used in addr.go:Addr.AddLength().
+const addrAtLeast64b = unsafe.Sizeof(Addr(0)) >= 8
8 changes: 4 additions & 4 deletions pkg/lisafs/client.go
Expand Up @@ -290,9 +290,9 @@ func (c *Client) CloseFD(ctx context.Context, fd FDID, flush bool) {

req := CloseReq{FDs: toClose}
var resp CloseResp
ctx.UninterruptibleSleepStart(false)
ctx.UninterruptibleSleepStart()
err := c.SndRcvMessage(Close, uint32(req.SizeBytes()), req.MarshalBytes, resp.CheckedUnmarshal, nil, req.String, resp.String)
ctx.UninterruptibleSleepFinish(false)
ctx.UninterruptibleSleepFinish()
if err != nil {
log.Warningf("lisafs: batch closing FDs returned error: %v", err)
}
@@ -305,9 +305,9 @@ func (c *Client) SyncFDs(ctx context.Context, fds []FDID) error {
 	}
 	req := FsyncReq{FDs: fds}
 	var resp FsyncResp
-	ctx.UninterruptibleSleepStart(false)
+	ctx.UninterruptibleSleepStart()
 	err := c.SndRcvMessage(FSync, uint32(req.SizeBytes()), req.MarshalBytes, resp.CheckedUnmarshal, nil, req.String, resp.String)
-	ctx.UninterruptibleSleepFinish(false)
+	ctx.UninterruptibleSleepFinish()
 	return err
 }