New capability for the PMU

Author: Krishnan Winter
Proposed: 2024-02-02

Summary

This RFC proposes adding a new kernel object to seL4 that gives user-space processes secure access to the Performance Monitoring Unit (PMU) hardware.

Motivation

The current profiling support uses the PMU through an ad-hoc interface that is designed for debugging, and is consequently only available in a specific benchmarking configuration of the kernel. This interface cannot be used in a production system because it is inherently insecure.

However, PMU access is required by (sufficiently privileged) user-level components even in production systems. Specific use cases are:

Thermal management (i.e. preventing the processor from overheating)
Energy management (controlling clock rate, on/off-lining cores based on current computational needs)

Such resource management requires utilisation information that is only accessible through the PMU. Obviously the PMU presents a covert channel that exposes information about execution of user-level components (as well as the kernel). Therefore, PMU access needs to be explicitly authorised, which means we need an access-control model for the PMU.

Once such an access-control model is in place, the developer-focussed profiling support should be adapted to use this model, rather than relying on a specific build of the kernel.

Guide-level explanation

We propose the addition of a PMU object, seL4_PMU, and a new object invocation API call, seL4_PMU_Set().

New Concepts

seL4_PMU

This new object will be responsible for managing the PMU itself. Accesses to the PMU will be marshalled through invocations on this object. This will provide fine-grained access control over the PMU hardware and functionality.

Capabilities to this object are badged, and the badge will represent the specific PMU counters authorised. A cap will need to be handed to each process that wishes to access the PMU.
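One way to read the badge is as a bitmask with one bit per counter. The following is a minimal sketch under that assumption; the RFC does not fix the badge encoding, and `pmu_badge_t` and `pmu_badge_allows` are hypothetical names that are not part of the seL4 API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: bit i of the badge authorises counter i. */
typedef uint64_t pmu_badge_t;

/* Returns true only if every requested counter is authorised by the badge.
 * requested_counters uses the same one-bit-per-counter layout. */
static bool pmu_badge_allows(pmu_badge_t badge, uint64_t requested_counters)
{
    return (requested_counters & ~badge) == 0;
}
```

With this encoding, a cap badged `0x5` would allow requests on counters 0 and 2 and reject any request that touches counter 1.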

seL4_PMU_Set()

seL4_PMU_Set() is the invocation on the PMU object. The exact name of this API call is undecided.

There are different possible models for interacting with the PMU. For example, there could be an asynchronous model where a PMU operation is requested, and the PMU sends a notification when the operation is completed, allowing the PMU user to request the next operation(s). This requires two system calls for obtaining each PMU event.

We instead propose a synchronous model, which uses a single, blocking system call for requesting operations and obtaining the result.

Specifically, the invoker specifies which events it wants to monitor on which counter(s), and this starts the PMU operation. When a counter overflow raises a PMU interrupt, the kernel returns from the blocking call to the application and passes back a reference to the overflowing counter. The application can then repeat the call, indicating which counters (if any) to reset and which to leave unchanged.
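The shape of such a profiling loop can be sketched without a kernel. In the example below, `mock_pmu_set` is a stand-in for the blocking invocation and simply replays a scripted sequence of overflows. All names, the reset-mask argument, and the return convention are assumptions for illustration, since the real API is still undecided.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_COUNTERS 4

/* Stand-in for the PMU object. A real invocation would block in the kernel
 * until a counter overflows; here the overflows come from a fixed script. */
typedef struct {
    const int *script;        /* index of the overflowing counter per call */
    size_t len;
    size_t pos;
    uint32_t last_reset_mask; /* records what the caller asked to reset */
} mock_pmu_t;

/* Hypothetical blocking call: returns the overflowing counter index,
 * or -1 once the script is exhausted. */
static int mock_pmu_set(mock_pmu_t *pmu, uint32_t reset_mask)
{
    pmu->last_reset_mask = reset_mask;
    if (pmu->pos >= pmu->len) {
        return -1;
    }
    return pmu->script[pmu->pos++];
}

/* One blocking call per sample. The first call arms every counter; each
 * later call resets only the counter that just overflowed.
 * Returns the total number of overflows recorded in samples[]. */
static size_t profile_loop(mock_pmu_t *pmu, unsigned samples[NUM_COUNTERS])
{
    uint32_t reset_mask = (1u << NUM_COUNTERS) - 1;
    size_t total = 0;

    for (;;) {
        int counter = mock_pmu_set(pmu, reset_mask);
        if (counter < 0 || counter >= NUM_COUNTERS) {
            break;
        }
        samples[counter]++;
        total++;
        reset_mask = 1u << counter;
    }
    return total;
}
```

The point of the sketch is that a single call both delivers the previous result and requests the next operation, which is what distinguishes the synchronous model from the two-call asynchronous one.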

Potentially, the user will be able to set up a shared memory region with the kernel, where the kernel can place all the data it has collected, such as the counter values, instruction pointer (IP) and call-stack trace.

This functionality can be used to implement statistical profilers in user-space that record these events, comparable to the functionality perf provides for Linux systems.

Reference-level explanation

The PMU object will abstract over the PMU hardware itself. It will allow us to set up PMU counters to count a given event, to make a counter overflow after a given number of events has occurred, and to start and halt the PMU. This will be done through an invocation on the PMU object, with the relevant arguments and return values.
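A common way to make an up-counting counter overflow after a chosen number of events is to preload it with the counter's range minus that number. The sketch below assumes counters that count upwards and signal overflow when they wrap; `pmu_preload_value` is a hypothetical helper, not a proposed kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Computes the value to write into an up-counting counter of width_bits
 * bits so that it wraps after `period` further events.
 * period must be between 1 and 2^width_bits - 1; otherwise returns false
 * and leaves *preload untouched. */
static bool pmu_preload_value(unsigned width_bits, uint64_t period,
                              uint64_t *preload)
{
    if (width_bits == 0 || width_bits > 64 || period == 0) {
        return false;
    }
    uint64_t max = (width_bits == 64) ? UINT64_MAX
                                      : (UINT64_C(1) << width_bits) - 1;
    if (period > max) {
        return false;
    }
    /* Equivalent to 2^width_bits - period, written to avoid overflow. */
    *preload = max - period + 1;
    return true;
}
```

For a 32-bit counter and a period of 1000 events, this gives 2^32 - 1000, so the 1000th event wraps the counter to zero.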

PMUs are implemented differently across architectures. Because the register maps and control mechanisms differ, we will need a kernel implementation per architecture. We will also have to ensure that, for each micro-architecture, the event the user has requested is actually implemented on the SoC.

The following is a brief description of the state of the PMU hardware for each architecture that seL4 supports:
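A per-platform table is one simple way to check that a requested event is implemented. The sketch below is an assumption about how such a table could look, not a proposed kernel interface; `pmu_platform_events_t` and `pmu_event_implemented` are hypothetical names, and any event numbers used with them are illustrative only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the events one platform implements. */
typedef struct {
    const char *name;        /* platform or SoC name, for diagnostics */
    const uint16_t *events;  /* implemented event numbers */
    size_t num_events;
} pmu_platform_events_t;

/* Returns true if `event` appears in the platform's table.
 * A linear scan is enough for a sketch; a real table could be sorted
 * or stored as a bitmap. */
static bool pmu_event_implemented(const pmu_platform_events_t *platform,
                                  uint16_t event)
{
    for (size_t i = 0; i < platform->num_events; i++) {
        if (platform->events[i] == event) {
            return true;
        }
    }
    return false;
}
```

A request for an event that is not in the table would then be rejected before any counter is programmed.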

On ARM these basic mechanisms exist. However, the number of counters available and the selection of hardware/software events differ between implementations. Additionally, some implementations have more powerful features, such as snapshot registers, which may be useful to leverage.
On RISC-V there does not seem to be a single agreed-upon design of the PMU at this time. The current privileged specification [1] describes very limited PMU support. The spec offers a number of counters and events, but does not support generating interrupts once an overflow has occurred. The ratified “Sscofpmf” extension [2] provides support for these overflow interrupts, but is not required to be implemented. Currently on Linux, perf record checks whether this extension is implemented before enabling interrupts on RISC-V [3]. At this time, it seems too early to say what PMU hardware will generally be supported by RISC-V implementations.
x86 PMU implementations differ between manufacturers. For instance, Intel has “Precise Event-Based Sampling” (PEBS) and AMD has “Instruction-Based Sampling” (IBS). Additionally, there are subtle differences in the commonly supported features, such as different mappings of registers and different naming conventions. We can determine which vendor's PMU is in use by checking the CPUID vendor string, and then leverage vendor-specific features such as PEBS’ precise IP tracking.
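The x86 vendor check can be sketched as follows. CPUID leaf 0 returns the 12-byte vendor string in EBX, EDX and ECX, in that order. The function below takes those register values as arguments so that it does not depend on executing CPUID itself; `pmu_vendor_t` and `pmu_vendor_from_cpuid` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

typedef enum {
    PMU_VENDOR_UNKNOWN,
    PMU_VENDOR_INTEL,
    PMU_VENDOR_AMD
} pmu_vendor_t;

/* Copies the four bytes of a register into dst, lowest byte first. */
static void unpack4(char *dst, uint32_t reg)
{
    for (int i = 0; i < 4; i++) {
        dst[i] = (char)((reg >> (8 * i)) & 0xff);
    }
}

/* Classifies the vendor from the CPUID leaf 0 register values. */
static pmu_vendor_t pmu_vendor_from_cpuid(uint32_t ebx, uint32_t edx,
                                          uint32_t ecx)
{
    char vendor[13];

    unpack4(vendor + 0, ebx);
    unpack4(vendor + 4, edx);
    unpack4(vendor + 8, ecx);
    vendor[12] = '\0';

    if (strcmp(vendor, "GenuineIntel") == 0) {
        return PMU_VENDOR_INTEL;
    }
    if (strcmp(vendor, "AuthenticAMD") == 0) {
        return PMU_VENDOR_AMD;
    }
    return PMU_VENDOR_UNKNOWN;
}
```

The vendor alone does not say whether PEBS or IBS is present on a given part, so a real implementation would also need to query the relevant feature bits.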

Drawbacks

One potentially major drawback of making a generic PMU interface is that we may not support certain features that are available only on particular architectures or micro-architectures. For instance, supporting all the unique features available on AMD and Intel x86 processors will lead to a fairly complex implementation. Even for ARM SoCs, each board can have a number of additional PMU features implemented, which will similarly lead to an increase in complexity if we try to account for all of these different configurations.

Rationale and alternatives

Other alternatives have been proposed and tested. One such alternative was to use a process similar to VPPI events, where a PMU IRQ is first handled in the kernel and then sent to a user-space fault handler (the profiler). However, this idea is flawed: it means that the fault handler of every process in the system has to be the profiler, and issues arise when an interrupt is generated while the idle thread is running. A flag could be added to the TCB and, with an additional syscall, the user could register a TCB to be profiled, so that any samples taken whilst an “un-profiled” thread was running can be discarded. However, this is not an optimal workaround.

Another proposal was to add a stage to the interrupt handling for PMU events only, and to use the existing benchmark log buffer to pass sample data. The interrupt was ‘intercepted’ in the interrupt-handling logic, where the IP and call stack were saved, and was then handed to its handler, which is our profiler. This approach used the same method of adding a flag to the TCB as above. However, issues arose surrounding setup of the log buffer.

These are both rather hacky approaches, and not solutions that you would want to have in a production build.

Prior art

Current approaches for benchmarking and profiling in seL4 do not meet our requirements. These are particularly focused on profiling the kernel rather than user-space applications.

The current infrastructure is focused on tracking utilisation and kernel entries, and also providing tracepoints within the kernel. This is not particularly useful for our application. However, some features can be informative, such as the kernel log buffer. We do not plan on replacing or modifying any of the existing benchmarking infrastructure.

There is also a kernel profiling system present, which records the number of samples for each IP. This is not applicable to our goal of allowing PMU access to user-space applications.

Additionally, on ARM systems, the only way to get access to the PMU from user-space is to configure the kernel to export access to the PMU registers, making the PMU an uncontrolled resource.

All these implementations rely on building a specific benchmarking configuration of the kernel, meaning that they are not desirable for production systems.

Unresolved questions

There are several unresolved questions related to supporting different architectures. As RISC-V’s PMU support is not mature, we are not sure how it will develop. Whilst there is an extension that provides overflow interrupt support, it is not currently mandatory for implementations.
For x86, how to support advanced features such as PEBS/IBS is also undecided.
How we will track which events are implemented on different SoCs is undecided. Linux stores these events in JSON files and has a tool to generate C files from them.
What the API will actually look like is still under debate and not finalised.
How we will pass sample data back to userspace is not fully decided.
How will the PMU object affect verification? Initially it will not be available in verification builds of seL4, but it is not clear whether this functionality will be able to be verified at a future stage.
How will we multiplex PMU counters?

References

[1] https://drive.google.com/file/d/1RiAIOVoN1E7bv6_kEzcgATkhbeUdqu5t/view

[2] https://github.com/riscv-non-isa/riscv-sbi-doc/releases/download/v2.0/riscv-sbi.pdf

[3] https://github.com/torvalds/linux/blob/861c0981648f5b64c86fd028ee622096eb7af05a/drivers/perf/riscv_pmu_sbi.c#L810
