RFC-16: New capability for the PMU #22
210 changes: 210 additions & 0 deletions src/proposed/0160-pmu.md
<!--
SPDX-License-Identifier: CC-BY-SA-4.0
Copyright 2024 UNSW
-->

# New capability for the PMU

- Author: Krishnan Winter
- Proposed: 2024-02-02

## Summary

This RFC proposes a new seL4 kernel object that gives user-space processes
secure access to the Performance Monitoring Unit (PMU) hardware.

## Motivation

Profiling support currently uses the PMU through an ad-hoc interface that is
designed for debugging and is consequently only available in a specific
benchmarking configuration of the kernel. This interface cannot be used in a
production system, as it is inherently insecure.

However, PMU access is required by (sufficiently privileged) user-level
components even in production systems. Specific use cases are:

- Thermal management (i.e. preventing the processor from overheating)
- Energy management (controlling clock rate, on/off-lining cores based on
current computational needs)

Such resource management requires utilisation information that is only
accessible through the PMU. However, the PMU is also a covert channel that
exposes information about the execution of user-level components (as well as
the kernel). Therefore, PMU access needs to be explicitly authorised, which
means we need an access-control model for the PMU.

Once such an access-control model is in place, the developer-focussed profiling
support should be adapted to using this model, rather than relying on a specific
build of the kernel.

## Guide-level explanation

We propose the addition of a PMU object, `seL4_PMU`, and a new object invocation
API call, `seL4_PMU_Set()`.

### New Concepts

#### seL4_PMU

This new object will be responsible for managing the PMU itself. Accesses to the
PMU will be marshalled through invocations on this object. This will provide
fine-grained access control over the PMU hardware and functionality.

Capabilities to this object are badged, and the badge will represent the
specific PMU counters authorised. A cap will need to be handed to each process
that wishes to access the PMU.
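
As an illustration of how a badge could represent the authorised counters, the
sketch below treats the badge as a bitmask with one bit per counter. This
encoding is an assumption made for the example; the RFC does not fix a badge
layout, and `pmu_badge_t`, `pmu_badge_from_counters()` and
`pmu_badge_permits()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: bit i of the badge is set when the holder of the
 * capability may use PMU counter i. The RFC does not define this layout. */
typedef uint64_t pmu_badge_t;

/* Build a badge that authorises the given counter indices.
 * Indices that do not fit in the badge are ignored. */
static pmu_badge_t pmu_badge_from_counters(const unsigned *counters, unsigned n)
{
    pmu_badge_t badge = 0;
    for (unsigned i = 0; i < n; i++) {
        if (counters[i] < 64) {
            badge |= (pmu_badge_t)1 << counters[i];
        }
    }
    return badge;
}

/* True only if every counter in `requested` is authorised by `badge`. */
static bool pmu_badge_permits(pmu_badge_t badge, uint64_t requested)
{
    return (requested & ~badge) == 0;
}
```

Under this assumed encoding, a check of this kind would run on every
invocation, so a process can only touch the counters its capability names.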

#### seL4_PMU_Set()

`seL4_PMU_Set()` is the invocation on the PMU object. The exact name of this API
call is undecided.

There are different possible models for interacting with the PMU. For example,
there could be an asynchronous model where a PMU operation is requested, and the
PMU sends a notification when the operation is completed, allowing the PMU user
to request the next operation(s). This requires two system calls for obtaining
each PMU event.

We instead propose a synchronous model, which uses a single, blocking system
call for requesting operations and obtaining the result.

Specifically the invoker provides information on the events it wants to monitor
on which counter(s), which starts the PMU operation. When the PMU generates an
interrupt from a counter overflow, the kernel returns from this blocking call to
the application, and returns a reference to the overflowing counter. The
application can then repeat the call, indicating which (if any) counters to
reset or leave unchanged.
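
The shape of a client of this synchronous model might look roughly like the
following sketch. Everything here is hypothetical: the request and reply
structures, the `pmu_invoke_fn` function pointer that stands in for the
blocking invocation, and the `pmu_fake_invoke()` test double that takes the
place of the kernel. None of these names or layouts are part of the proposal,
whose API is still undecided.

```c
#include <stdint.h>

#define PMU_MAX_COUNTERS 8u   /* hypothetical limit, not from the RFC */

/* Hypothetical request: which counters to run and which to reset on entry. */
typedef struct {
    uint32_t event[PMU_MAX_COUNTERS]; /* event ID programmed into each counter */
    uint32_t enable_mask;             /* bit i set: counter i is in use */
    uint32_t reset_mask;              /* bit i set: reset counter i first */
} pmu_request_t;

/* Hypothetical reply: which counter(s) overflowed and woke the caller. */
typedef struct {
    uint32_t overflow_mask;
} pmu_reply_t;

/* Stand-in for the blocking invocation; returns 0 on success. */
typedef int (*pmu_invoke_fn)(const pmu_request_t *req, pmu_reply_t *reply);

/* Issue `rounds` blocking calls, tallying overflows per counter. After each
 * call only the counters that overflowed are marked for reset; the others
 * are left unchanged. */
static void pmu_sample_loop(pmu_invoke_fn invoke, pmu_request_t *req,
                            unsigned rounds, uint64_t tally[PMU_MAX_COUNTERS])
{
    req->reset_mask = req->enable_mask;   /* first call: reset all in use */
    for (unsigned r = 0; r < rounds; r++) {
        pmu_reply_t reply = { 0 };
        if (invoke(req, &reply) != 0) {
            return;                       /* invocation failed: stop sampling */
        }
        for (unsigned c = 0; c < PMU_MAX_COUNTERS; c++) {
            if (reply.overflow_mask & (1u << c)) {
                tally[c]++;
            }
        }
        req->reset_mask = reply.overflow_mask & req->enable_mask;
    }
}

/* Test double standing in for the kernel: reports an overflow on counter 0,
 * then counter 1, alternating on each call. */
static int pmu_fake_invoke(const pmu_request_t *req, pmu_reply_t *reply)
{
    static unsigned calls = 0;
    reply->overflow_mask = (1u << (calls % 2u)) & req->enable_mask;
    calls++;
    return 0;
}
```

The point of the sketch is that each round costs a single call: the reply to
one invocation supplies the reset mask for the next.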

Potentially, the user will be able to set up a shared memory region with the
kernel, where the kernel can place all the data it has collected, such as the
counter values, IP and call-stack trace.

This functionality can be used to implement statistical profilers in user-space
that record these events, comparable to the functionality perf provides on
Linux systems.
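
A minimal sketch of what such a user-space statistical profiler might do with
the sampled instruction pointers is shown below. It simply buckets samples by
address range. The names `profile_hist_t`, `profile_record()` and
`profile_hottest()`, and the fixed bucket count, are assumptions made for
illustration and are not part of the proposal.

```c
#include <stddef.h>
#include <stdint.h>

#define PROFILE_BUCKETS 64   /* hypothetical histogram size */

/* Histogram of sampled instruction pointers over a fixed address range. */
typedef struct {
    uint64_t base;                   /* lowest address covered */
    uint64_t bucket_size;            /* bytes per bucket, must be non-zero */
    uint64_t hits[PROFILE_BUCKETS];  /* samples seen per bucket */
    uint64_t dropped;                /* samples outside the covered range */
} profile_hist_t;

/* Record one sampled instruction pointer. */
static void profile_record(profile_hist_t *h, uint64_t ip)
{
    if (h->bucket_size == 0 || ip < h->base) {
        h->dropped++;
        return;
    }
    uint64_t idx = (ip - h->base) / h->bucket_size;
    if (idx >= PROFILE_BUCKETS) {
        h->dropped++;
        return;
    }
    h->hits[idx]++;
}

/* Index of the bucket with the most samples (lowest index wins ties). */
static size_t profile_hottest(const profile_hist_t *h)
{
    size_t best = 0;
    for (size_t i = 1; i < PROFILE_BUCKETS; i++) {
        if (h->hits[i] > h->hits[best]) {
            best = i;
        }
    }
    return best;
}
```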

## Reference-level explanation

The PMU object will abstract over the PMU hardware itself. It will allow us to
configure PMU counters to count a given event, set a counter to overflow after a
given number of events has occurred, and start and halt the PMU. This will be
done through an invocation on the PMU object, with the relevant arguments and
return values.
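
One common way to make a counter overflow after a chosen number of events is
to preload it so that it wraps after exactly that many increments. The sketch
below assumes an up-counting counter of `width` bits that raises its overflow
when it wraps to zero; actual counter widths and overflow semantics are
architecture-specific, and `pmu_preload_for_period()` is a hypothetical helper
rather than part of the proposed API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compute the value to write into an up-counting counter of `width` bits so
 * that it wraps to zero after exactly `period` events.
 * Returns false if the width is unsupported or the period cannot be
 * represented (zero, or larger than 2^width). */
static bool pmu_preload_for_period(unsigned width, uint64_t period,
                                   uint64_t *preload)
{
    if (width == 0 || width > 64 || period == 0) {
        return false;
    }
    uint64_t mask = (width == 64) ? UINT64_MAX
                                  : (((uint64_t)1 << width) - 1);
    if (period - 1 > mask) {
        return false;                 /* period exceeds 2^width */
    }
    /* 2^width - period, computed modulo 2^width. */
    *preload = ((uint64_t)0 - period) & mask;
    return true;
}
```

For example, a 32-bit counter that should overflow after 1000 events would be
preloaded with `0xFFFFFC18` under these assumptions.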

PMUs are implemented differently across architectures. The register maps and
control mechanisms differ, therefore we will need a kernel implementation per
architecture. We will also have to ensure that for each micro-architecture, the
event the user has requested is actually implemented on the SoC.

The following is a brief description of the state of the PMU hardware for each
architecture that seL4 supports:

1. On ARM these basic mechanisms exist. However, the number of counters
available and the selection of hardware/software events differ between
implementations. Additionally, some implementations have more powerful
features, such as snapshot registers, which may be useful to leverage.

2. On RISC-V there does not seem to be a single agreed-upon design of the PMU at
this time. The current privileged specification [1] describes very limited
PMU support. It offers a number of counters and events, but does not support
generating an interrupt when a counter overflows. The ratified “Sscofpmf”
extension [2] provides these overflow interrupts, but implementations are
not required to provide it. Currently on Linux, perf record checks whether
this extension is implemented before enabling interrupts on RISC-V [3]. At
this time, it seems too early to say what PMU hardware will generally be
supported by RISC-V implementations.

3. x86 PMU implementations differ between manufacturers. For instance, Intel
have “Precise Event Based Sampling” (PEBS) and AMD have “Instruction Based
Sampling” (IBS). Additionally, there are subtle differences in the commonly
supported features, such as different register mappings and different naming
conventions. We can determine which system is in use by checking the CPUID
vendor string, and then leverage vendor-specific features such as PEBS’
precise IP tracking.
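
As a sketch of the vendor check mentioned for x86, the code below decodes the
12-character vendor string that CPUID leaf 0 returns in EBX, EDX and ECX (in
that order, low byte first) and classifies it. The register values are passed
in as parameters so the logic can be exercised on any host. The names
`pmu_vendor_t`, `cpuid_vendor_string()` and `pmu_vendor_from_cpuid()` are
hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef enum {
    PMU_VENDOR_UNKNOWN,
    PMU_VENDOR_INTEL,
    PMU_VENDOR_AMD
} pmu_vendor_t;

/* Assemble the vendor string from the CPUID leaf 0 register values.
 * The characters are stored in EBX, EDX, ECX order, low byte first. */
static void cpuid_vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx,
                                char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    for (int r = 0; r < 3; r++) {
        for (int b = 0; b < 4; b++) {
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xff);
        }
    }
    out[12] = '\0';
}

/* Classify the vendor so that vendor-specific PMU features can be chosen. */
static pmu_vendor_t pmu_vendor_from_cpuid(uint32_t ebx, uint32_t edx,
                                          uint32_t ecx)
{
    char vendor[13];
    cpuid_vendor_string(ebx, edx, ecx, vendor);
    if (strcmp(vendor, "GenuineIntel") == 0) {
        return PMU_VENDOR_INTEL;
    }
    if (strcmp(vendor, "AuthenticAMD") == 0) {
        return PMU_VENDOR_AMD;
    }
    return PMU_VENDOR_UNKNOWN;
}
```

In user-level code built with GCC or Clang on x86, the register values can be
obtained with `__get_cpuid()` from `<cpuid.h>`; how the kernel would read them
is left open here.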

## Drawbacks

One potentially major drawback of a generic PMU interface is that it may not
support certain features that are only available on particular architectures
or micro-architectures. For instance, supporting all the unique features of
AMD and Intel x86 processors would lead to a fairly complex implementation.
Even for ARM SoCs, each board can implement a number of additional PMU
features, and accounting for all of these configurations would similarly
increase complexity.

## Rationale and alternatives

Other alternatives have been proposed and tested. One was to use a process
similar to VPPI events, where a PMU IRQ is first handled in the kernel and then
sent to a user-space fault handler (the profiler). However, this idea is
flawed: the fault handler of every process in the system would have to be the
profiler, and issues arise when an interrupt is generated while the idle thread
is running. A flag could be added to the TCB so that, with an additional
syscall, the user can register a TCB to be profiled, and samples taken while an
“un-profiled” thread was running can be discarded. However, this is not an
optimal workaround.

Another proposal was to add a stage to the interrupt handling of PMU events
only, and to use the existing benchmark log buffer to pass sample data. The
interrupt was ‘intercepted’ in the interrupt handling logic, where we saved the
IP and call-stack, and was then handed to its handler, which is our profiler.
This approach used the same method of adding a flag to the TCB as above.
However, issues arose surrounding setup of the log buffer.

These are both rather hacky approaches, and not solutions that you would want to
have in a production build.

## Prior art

Current approaches for benchmarking and profiling in seL4 do not meet our
requirements. These are particularly focused on profiling the kernel rather than
user-space applications.

The current infrastructure is focused on tracking utilisation and kernel
entries, and also providing tracepoints within the kernel. This is not
particularly useful for our application. However, some features can be
informative, such as the kernel log buffer. We do not plan on replacing or
modifying any of the existing benchmarking infrastructure.

There is also a kernel profiling system, which records the number of samples
for each IP. This does not apply to our use case of allowing user-space
applications to access the PMU.

Additionally, on ARM systems, the only way to get access to the PMU from
user-space is to configure the kernel to export access to the PMU registers,
making the PMU an uncontrolled resource.

All these implementations require a specific benchmarking configuration of the
kernel to be built, which makes them unsuitable for production systems.

## Unresolved questions

1. There are several unresolved questions related to supporting different
architectures. As RISC-V’s PMU support is not mature, we are not sure how it
will develop. Whilst there is an extension that provides overflow interrupt
support, it is not currently mandatory for implementations.

2. For x86, determining how to support advanced features such as PEBS/IBS is
also undecided.

3. How we will track which events are implemented on different SoCs is
undecided. Linux stores these events in JSON files and has a tool that
generates C files from them.

4. What the API will actually look like is still under debate and not finalised.

5. How we will pass sample data back to user-space is not fully decided.

6. How will the PMU object affect verification? Initially it will not be
available in verification builds of seL4, but it is not clear whether this
functionality will be able to be verified at a future stage.

7. How will we multiplex PMU counters?

## References

[1] <https://drive.google.com/file/d/1RiAIOVoN1E7bv6_kEzcgATkhbeUdqu5t/view>

[2] <https://github.com/riscv-non-isa/riscv-sbi-doc/releases/download/v2.0/riscv-sbi.pdf>

[3] <https://github.com/torvalds/linux/blob/861c0981648f5b64c86fd028ee622096eb7af05a/drivers/perf/riscv_pmu_sbi.c#L810>