From 2037b994e9bc22b99f49a3583a45ddb9f12960ea Mon Sep 17 00:00:00 2001
From: Gerwin Klein
Date: Fri, 14 Jun 2024 19:37:33 +1000
Subject: [PATCH] import RFC-16

See https://sel4.atlassian.net/browse/RFC-16

Signed-off-by: Gerwin Klein
---
 src/proposed/0160-pmu.md | 210 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 210 insertions(+)
 create mode 100644 src/proposed/0160-pmu.md

diff --git a/src/proposed/0160-pmu.md b/src/proposed/0160-pmu.md
new file mode 100644
index 0000000..2f9f0c3
--- /dev/null
+++ b/src/proposed/0160-pmu.md
@@ -0,0 +1,210 @@
+
+
+# New capability for the PMU
+
+- Author: Krishnan Winter
+- Proposed: 2024-02-02
+
+## Summary
+
+This RFC proposes adding a new kernel object to seL4 to provide secure access
+for user-space processes to the Performance Monitoring Unit (PMU) hardware.
+
+## Motivation
+
+Present profiling support uses the PMU through an ad-hoc interface that is
+designed for debugging and is consequently only available in a specific
+benchmarking configuration of the kernel. The same interface cannot be used in a
+production system as it is inherently insecure.
+
+However, PMU access is required by (sufficiently privileged) user-level
+components even in production systems. Specific use cases are:
+
+- Thermal management (i.e. preventing the processor from overheating)
+- Energy management (controlling clock rate, on/off-lining cores based on
+  current computational needs)
+
+Such resource management requires utilisation information that is only
+accessible through the PMU. Obviously, the PMU presents a covert channel that
+exposes information about execution of user-level components (as well as the
+kernel). Therefore, PMU access needs to be explicitly authorised, which means we
+need an access-control model for the PMU.
+
+Once such an access-control model is in place, the developer-focussed profiling
+support should be adapted to use this model, rather than relying on a specific
+build of the kernel.
+
+## Guide-level explanation
+
+We propose the addition of a PMU object, `seL4_PMU`, and a new object invocation
+API call, `seL4_PMU_Set()`.
+
+### New Concepts
+
+#### seL4_PMU
+
+This new object will be responsible for managing the PMU itself. Accesses to the
+PMU will be marshalled through invocations on this object. This will provide
+fine-grained access control over the PMU hardware and functionality.
+
+Capabilities to this object are badged, and the badge will represent the
+specific PMU counters authorised. A cap will need to be handed to each process
+that wishes to access the PMU.
+
+#### seL4_PMU_Set()
+
+seL4_PMU_Set() is the invocation on the PMU object. The exact name of this API
+call is undecided.
+
+There are different possible models for interacting with the PMU. For example,
+there could be an asynchronous model where a PMU operation is requested, and the
+PMU sends a notification when the operation is completed, allowing the PMU user
+to request the next operation(s). This requires two system calls for obtaining
+each PMU event.
+
+We instead propose a synchronous model, which uses a single, blocking system
+call for requesting operations and obtaining the result.
+
+Specifically, the invoker provides information on the events it wants to monitor
+on which counter(s), which starts the PMU operation. When the PMU generates an
+interrupt from a counter overflow, the kernel returns from this blocking call to
+the application, and returns a reference to the overflowing counter. The
+application can then repeat the call, indicating which (if any) counters to
+reset or leave unchanged.
+
+Potentially, the user will be able to set up a shared memory region with the
+kernel, where the kernel can place all the data it has collected, such as the
+counter values, instruction pointer (IP) and call-stack trace.
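The synchronous model described above can be sketched in C as follows. This is a minimal simulation, not the proposed kernel API: `pmu_request_t`, `mock_pmu_set()`, `run_profiler()`, `NUM_COUNTERS` and the event IDs are hypothetical names and values chosen for illustration, and the mock assumes that all armed counters advance at the same rate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_COUNTERS 4 /* hypothetical; real PMUs differ in counter count */

/* Hypothetical per-counter request: event to count and overflow period. */
typedef struct {
    uint32_t event;  /* illustrative event ID */
    uint64_t period; /* events until overflow; 0 leaves the counter unused */
} pmu_request_t;

/* Simulated hardware state (stands in for the real PMU registers). */
static uint64_t remaining[NUM_COUNTERS]; /* events left until overflow */
static bool armed[NUM_COUNTERS];         /* counter is currently counting */

/*
 * Mock of the single blocking call (NOT the real seL4 API). Counters selected
 * in reset_mask are re-armed with their period; the others are left unchanged.
 * Returns the index of the next counter to overflow, or -1 if none is armed.
 * Simplification: every armed counter advances at the same rate.
 */
static int mock_pmu_set(const pmu_request_t *req, uint32_t reset_mask)
{
    int first = -1;

    for (int i = 0; i < NUM_COUNTERS; i++) {
        if (reset_mask & (1u << i)) {
            remaining[i] = req[i].period;
            armed[i] = req[i].period > 0;
        }
        if (armed[i] && (first < 0 || remaining[i] < remaining[first])) {
            first = i;
        }
    }
    if (first < 0) {
        return -1;
    }

    /* "Block" until the overflow: advance every armed counter. */
    uint64_t elapsed = remaining[first];
    for (int i = 0; i < NUM_COUNTERS; i++) {
        if (armed[i]) {
            remaining[i] -= elapsed;
        }
    }
    armed[first] = false; /* stays idle until the caller resets it */
    return first;
}

/* Profiler loop: one blocking call per sample, re-arming only what fired. */
static int run_profiler(const pmu_request_t *req, int samples, int *hits)
{
    assert(req != NULL && hits != NULL);

    uint32_t reset_mask = (1u << NUM_COUNTERS) - 1; /* first call arms all */
    int taken = 0;

    while (taken < samples) {
        int counter = mock_pmu_set(req, reset_mask);
        if (counter < 0) {
            break; /* nothing armed, so nothing can overflow */
        }
        hits[counter]++;            /* a real profiler would record a sample */
        taken++;
        reset_mask = 1u << counter; /* leave the other counters unchanged */
    }
    return taken;
}
```

For example, with counter 0 given a period of 100 and counter 1 a period of 250, `run_profiler(req, 5, hits)` returns 5 with `hits[0] == 4` and `hits[1] == 1`: the first 400 simulated events contain overflows of counter 0 at 100, 200, 300 and 400, and one overflow of counter 1 at 250. Each sample costs a single call, which is the property that distinguishes this model from the asynchronous one.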
+
+This functionality can be used to implement statistical profilers in user-space
+that record these events, comparable to the functionality perf provides for
+Linux systems.
+
+## Reference-level explanation
+
+The PMU object will abstract over the PMU hardware itself, allowing us to set up
+PMU counters to count a given event, set overflows to occur after a
+certain number of events have occurred, and start and halt
+the PMU. This will be done through an invocation on the PMU object, with the
+relevant arguments and return variables.
+
+PMUs are implemented differently across architectures. The register maps and
+control mechanisms differ, so we will need a kernel implementation per
+architecture. We will also have to ensure that for each micro-architecture, the
+event the user has requested is actually implemented on the SoC.
+
+The following is a brief description of the state of the PMU hardware for each
+architecture that seL4 supports:
+
+1. On ARM these basic mechanisms exist; however, the number of counters
+   available and the selection of hardware/software events differ between
+   implementations. Additionally, some implementations have more powerful
+   features, such as snapshot registers, which may be useful to leverage.
+
+2. On RISC-V there does not seem to be a single agreed-upon design of the PMU at
+   this time. The current privileged specification [1] describes very limited
+   PMU support. The spec offers a number of counters and events but does
+   not support generating interrupts once an overflow has occurred. The ratified
+   “Sscofpmf” extension [2] provides support for these overflow interrupts, but
+   is not required to be implemented. Currently on Linux, perf record checks to
+   see if this extension is implemented before enabling interrupts on RISC-V
+   [3]. At this time, it seems too early to say what PMU hardware will generally
+   be supported by RISC-V implementations.
+
+3. x86 PMU implementations differ between manufacturers. For instance, Intel
+   have “Precise Event-Based Sampling” (PEBS) and AMD have their
+   “Instruction-Based Sampling” (IBS) feature. Additionally, there are subtle
+   differences in the commonly supported features, such as different mappings of
+   registers, and different naming conventions. We can determine which system is
+   in use by checking the CPUID vendor string, and leverage features such as
+   PEBS’ precise IP tracking.
+
+## Drawbacks
+
+One potentially major drawback of making a generic PMU interface is that we may
+not support certain features that are available on different
+architectures/micro-architectures. For instance, supporting all the unique
+features available on AMD and Intel x86 processors will lead to a fairly complex
+implementation. Even for ARM SoCs, each board can have a number of additional
+PMU features implemented, which will similarly lead to an increase in complexity
+if we try to account for all of these different configurations.
+
+## Rationale and alternatives
+
+Other alternatives have been proposed and tested. One such was to use a process
+similar to VPPI events, where a PMU IRQ is first handled in the kernel, and then
+sent to a user-space fault handler (the profiler). However, this idea is
+certainly flawed, as it means that the fault handler of every process in the
+system has to be the profiler, and issues arise when we generate an interrupt
+when the idle thread is running. A flag could be added to the TCB, and with an
+additional syscall, the user can register a TCB to be profiled, and we can
+discard any samples that were taken whilst an “un-profiled” thread was running.
+However, this is not an optimal workaround.
+
+Another proposal was to add a stage in the interrupt handling of just PMU
+events, and use the existing benchmark log buffer to pass sample data. The
+interrupt was ‘intercepted’ in the interrupt handling logic, and we saved the IP
+and call-stack here, then handed the interrupt to its handler, which is our
+profiler. The same method of adding a flag to the TCB was used. However,
+issues arose surrounding setup of the log buffer.
+
+These are both rather hacky approaches, and not solutions that you would want to
+have in a production build.
+
+## Prior art
+
+Current approaches for benchmarking and profiling in seL4 do not meet our
+requirements. These are particularly focused on profiling the kernel rather than
+user-space applications.
+
+The current infrastructure is focused on tracking utilisation and kernel
+entries, and also providing tracepoints within the kernel. This is not
+particularly useful for our application. However, some features can be
+informative, such as the kernel log buffer. We do not plan on replacing or
+modifying any of the existing benchmarking infrastructure.
+
+There is also a kernel profiling system present, which records the number of
+samples for each IP. This is not applicable for our application of allowing PMU
+access to user-space applications.
+
+Additionally, on ARM systems, the only way to get access to the PMU from
+user-space is to configure the kernel to export access to the PMU registers,
+making the PMU an uncontrolled resource.
+
+All these implementations rely on a specific benchmarking configuration of the
+kernel being built, meaning that they are not suitable for production systems.
+
+## Unresolved questions
+
+1. There are several unresolved questions related to supporting different
+   architectures. As RISC-V’s PMU support is not mature, we are not sure how it
+   will develop. Whilst there is an extension that provides overflow interrupt
+   support, it is not currently mandatory for implementations.
+
+2. For x86, determining how to support advanced features such as PEBS/IBS is
+   also undecided.
+
+3. How we will track which events are implemented on different SoCs is
+   undecided. Linux stores these events in JSON files and has a tool
+   to generate C files from them.
+
+4. What the API will actually look like is still under debate and not finalised.
+
+5. How we will pass sample data back to userspace is not fully decided.
+
+6. How will the PMU object affect verification? Initially it will not be
+   available in verification builds of seL4, but it is not clear whether this
+   functionality will be able to be verified at a future stage.
+
+7. How will we multiplex PMU counters?
+
+## References
+
+[1]
+
+[2]
+
+[3]