seL4 · lsf37 · Jun 14, 2024 · Aug 4, 2024
diff --git a/src/proposed/0160-pmu.md b/src/proposed/0160-pmu.md
@@ -0,0 +1,210 @@
+<!--
+  SPDX-License-Identifier: CC-BY-SA-4.0
+  Copyright 2024 UNSW
+-->
+
+# New capability for the PMU
+
+- Author: Krishnan Winter
+- Proposed: 2024-02-02
+
+## Summary
+
+This RFC proposes adding a new kernel object to seL4 that gives user-space
+processes secure access to the Performance Monitoring Unit (PMU) hardware.
+
+## Motivation
+
+Current profiling support uses the PMU through an ad-hoc interface that is
+designed for debugging and is consequently only available in a specific
+benchmarking configuration of the kernel. The same interface cannot be used in a
+production system as it is inherently insecure.
+
+However, PMU access is required by (sufficiently privileged) user-level
+components even in production systems. Specific use cases are:
+
+- Thermal management (i.e. preventing the processor from overheating)
+- Energy management (controlling clock rate, on/off-lining cores based on
+  current computational needs)
+
+Such resource management requires utilisation information that is only
+accessible through the PMU. At the same time, the PMU is a covert channel that
+exposes information about the execution of user-level components (as well as
+the kernel). PMU access therefore needs to be explicitly authorised, which means
+we need an access-control model for the PMU.
+
+Once such an access-control model is in place, the developer-focussed profiling
+support should be adapted to use this model, rather than relying on a specific
+build of the kernel.
+
+## Guide-level explanation
+
+We propose the addition of a PMU object, `seL4_PMU`, and a new object invocation
+API call, `seL4_PMU_Set()`.
+
+### New Concepts
+
+#### seL4_PMU
+
+This new object will be responsible for managing the PMU itself. Accesses to the
+PMU will be mediated by invocations on this object. This will provide
+fine-grained access control over the PMU hardware and functionality.
+
+Capabilities to this object are badged, and the badge represents the specific
+PMU counters that the holder is authorised to use. A cap will need to be handed
+to each process that wishes to access the PMU.
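
One way a badge could encode counter authorisation is as a bitmask with one bit per counter. The following is a minimal sketch under that assumption; the RFC does not fix an encoding, and `pmu_badge_t`, `pmu_badge_for()` and `pmu_badge_allows()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: bit n set means counter n may be used by the holder. */
typedef uint64_t pmu_badge_t;

/* Build a badge that authorises exactly the listed counter indices. */
pmu_badge_t pmu_badge_for(const unsigned *counters, unsigned n)
{
    pmu_badge_t badge = 0;
    for (unsigned i = 0; i < n; i++) {
        assert(counters[i] < 64); /* only 64 counters fit in this encoding */
        badge |= (pmu_badge_t)1 << counters[i];
    }
    return badge;
}

/* A request is allowed only if it touches no counter outside the badge. */
bool pmu_badge_allows(pmu_badge_t badge, uint64_t requested_mask)
{
    return (requested_mask & ~badge) == 0;
}
```

With this encoding, the check on each invocation is a single mask operation, and handing out a cap with a narrower badge restricts a process to a subset of the counters.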
+
+#### seL4_PMU_Set()
+
+`seL4_PMU_Set()` is the invocation on the PMU object. The exact name of this API
+call is undecided.
+
+There are different possible models for interacting with the PMU. For example,
+there could be an asynchronous model where a PMU operation is requested, and the
+PMU sends a notification when the operation is completed, allowing the PMU user
+to request the next operation(s). This requires two system calls for obtaining
+each PMU event.
+
+We instead propose a synchronous model, which uses a single, blocking system
+call for requesting operations and obtaining the result.
+
+Specifically, the invoker specifies which events it wants to monitor on which
+counter(s), and this starts the PMU operation. When a counter overflows and the
+PMU raises an interrupt, the kernel returns from the blocking call to the
+application and identifies the counter that overflowed. The application can
+then repeat the call, indicating which (if any) counters to reset and which to
+leave unchanged.
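
The synchronous flow described here can be sketched as a user-level loop around a single blocking call. Since the real API is undecided, the sketch below replaces the kernel and the hardware with a small simulation. `pmu_request_t`, `pmu_sim_t`, `pmu_set_sim()` and `profile()` are hypothetical names, and the request layout is an assumption rather than the proposed interface.

```c
#include <assert.h>
#include <stdint.h>

#define PMU_NUM_COUNTERS 4u /* assumed counter count, for illustration only */

/* Hypothetical request layout; the RFC leaves the real API undecided. */
typedef struct {
    uint32_t event[PMU_NUM_COUNTERS];  /* event number to count per counter */
    uint64_t period[PMU_NUM_COUNTERS]; /* events until the counter overflows */
    uint32_t enable_mask;              /* counters that should run */
    uint32_t reset_mask;               /* counters to reload before resuming */
} pmu_request_t;

/* Simulated PMU state standing in for the hardware and the kernel. */
typedef struct {
    uint64_t remaining[PMU_NUM_COUNTERS]; /* events left before overflow */
} pmu_sim_t;

/*
 * Stand-in for the proposed blocking call. Reloads the counters named in
 * reset_mask, "runs" until the first enabled counter overflows, and returns
 * that counter's index, or -1 if no counter is enabled.
 */
int pmu_set_sim(pmu_sim_t *pmu, const pmu_request_t *req)
{
    int first = -1;

    assert((req->enable_mask >> PMU_NUM_COUNTERS) == 0);
    for (unsigned c = 0; c < PMU_NUM_COUNTERS; c++) {
        if (req->reset_mask & (1u << c)) {
            pmu->remaining[c] = req->period[c];
        }
        if ((req->enable_mask & (1u << c)) &&
            (first < 0 || pmu->remaining[c] < pmu->remaining[first])) {
            first = (int)c;
        }
    }
    if (first < 0) {
        return -1;
    }

    /* Advance simulated time to the overflow of the chosen counter. */
    uint64_t elapsed = pmu->remaining[first];
    for (unsigned c = 0; c < PMU_NUM_COUNTERS; c++) {
        if (req->enable_mask & (1u << c)) {
            pmu->remaining[c] -= elapsed;
        }
    }
    return first;
}

/*
 * Profiler loop: arm every enabled counter on the first call, then block
 * repeatedly, record which counter overflowed, and reset only that counter.
 */
void profile(pmu_sim_t *pmu, pmu_request_t *req, unsigned rounds,
             unsigned hits[])
{
    req->reset_mask = req->enable_mask;
    for (unsigned i = 0; i < rounds; i++) {
        int c = pmu_set_sim(pmu, req);
        if (c < 0) {
            break;
        }
        hits[c]++;
        req->reset_mask = 1u << c; /* leave the other counters running */
    }
}
```

Each loop iteration corresponds to one system call in the proposed model, compared with the two calls per event that the asynchronous alternative would need.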
+
+Potentially, the user will be able to set up a shared memory region with the
+kernel, where the kernel can place the data it has collected, such as counter
+values, the instruction pointer (IP) and a call-stack trace.
+
+This functionality can be used to implement statistical profilers in user space
+that record these events, comparable to the functionality perf provides on
+Linux systems.
+
+## Reference-level explanation
+
+The PMU object will abstract over the PMU hardware itself. It will allow us to
+configure PMU counters to count a given event, to set an overflow to occur after
+a given number of events, and to start and halt the PMU. This will be done
+through an invocation on the PMU object, with the relevant arguments and return
+values.
+
+PMUs are implemented differently across architectures. The register maps and
+control mechanisms differ, so we will need a kernel implementation per
+architecture. We will also have to ensure, for each micro-architecture, that the
+event the user has requested is actually implemented on the SoC.
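
A per-platform check of requested events could be as simple as a bitmap lookup before any counter is programmed. The sketch below assumes a 64-entry event-number space and a per-platform bitmap. `pmu_platform_t`, `pmu_event_implemented()` and `pmu_events_valid()` are hypothetical names, and how the bitmap would be populated is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PMU_MAX_EVENT 64u /* assumed size of the event-number space */

/* Hypothetical per-platform description of which events exist. */
typedef struct {
    uint64_t implemented; /* bit n set means event number n is available */
} pmu_platform_t;

/* True if the platform implements the given event number. */
bool pmu_event_implemented(const pmu_platform_t *p, uint32_t event)
{
    assert(p != NULL);
    if (event >= PMU_MAX_EVENT) {
        return false;
    }
    return ((p->implemented >> event) & 1u) != 0;
}

/* Reject a whole request if any requested event is not implemented. */
bool pmu_events_valid(const pmu_platform_t *p, const uint32_t *events,
                      unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        if (!pmu_event_implemented(p, events[i])) {
            return false;
        }
    }
    return true;
}
```

Validating the whole request up front means an invocation either programs all requested counters or none, which keeps the failure case simple for the caller.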
+
+The following is a brief description of the state of the PMU hardware for each
+architecture that seL4 supports:
+
+1. On ARM, these basic mechanisms exist. However, the number of counters
+   available and the selection of hardware/software events differ between
+   implementations. Additionally, some implementations have more powerful
+   features, such as snapshot registers, which may be useful to leverage.
+
+2. On RISC-V there does not seem to be a single agreed-upon design of the PMU at
+   this time. The current privileged specification [1] describes very limited
+   PMU support. The spec offers a number of counters and events, but does not
+   support generating an interrupt when a counter overflows. The ratified
+   “Sscofpmf” extension [2] provides support for these overflow interrupts, but
+   is not required to be implemented. Currently on Linux, perf record checks
+   whether this extension is implemented before enabling interrupts on RISC-V
+   [3]. At this time, it seems too early to say what PMU hardware will generally
+   be supported by RISC-V implementations.
+
+3. x86 PMU implementations differ between manufacturers. For instance, Intel
+   have “Processor Event-Based Sampling” (PEBS) and AMD have “Instruction-Based
+   Sampling” (IBS). Additionally, there are subtle differences in the commonly
+   supported features, such as different mappings of registers, and different
+   naming conventions. We can determine which system is in use by checking the
+   CPUID vendor string, and leverage features such as PEBS’ precise IP
+   tracking.
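
The vendor check mentioned for x86 can be sketched as follows. CPUID leaf 0 returns the 12-byte vendor string in EBX, EDX and ECX, in that order. To keep the example portable, the function takes the three register values as arguments rather than executing the instruction. `pmu_vendor_t` and `pmu_vendor_from_cpuid()` are hypothetical names, not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    PMU_VENDOR_UNKNOWN,
    PMU_VENDOR_INTEL, /* would select the PEBS-capable code path */
    PMU_VENDOR_AMD    /* would select the IBS-capable code path */
} pmu_vendor_t;

/* Write the four bytes of a register value, least significant byte first. */
static void put_reg(char *dst, uint32_t reg)
{
    for (unsigned i = 0; i < 4; i++) {
        dst[i] = (char)((reg >> (8 * i)) & 0xffu);
    }
}

/* Classify the CPU from the register values returned by CPUID leaf 0. */
pmu_vendor_t pmu_vendor_from_cpuid(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
    char vendor[13] = {0};

    put_reg(vendor + 0, ebx);
    put_reg(vendor + 4, edx);
    put_reg(vendor + 8, ecx);
    assert(vendor[12] == '\0'); /* the buffer stays NUL-terminated */

    if (strcmp(vendor, "GenuineIntel") == 0) {
        return PMU_VENDOR_INTEL;
    }
    if (strcmp(vendor, "AuthenticAMD") == 0) {
        return PMU_VENDOR_AMD;
    }
    return PMU_VENDOR_UNKNOWN;
}
```

The vendor only narrows down which family of features may exist. Whether a given feature such as PEBS or IBS is actually present would still need to be confirmed through the corresponding feature bits.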
+
+## Drawbacks
+
+One potentially major drawback of a generic PMU interface is that it may not
+support features that are specific to particular architectures or
+micro-architectures. For instance, supporting all the unique features available
+on AMD and Intel x86 processors will lead to a fairly complex implementation.
+Even for ARM SoCs, each board can have a number of additional PMU features
+implemented, which will similarly lead to an increase in complexity if we try
+to account for all of these different configurations.
+
+## Rationale and alternatives
+
+Other alternatives have been proposed and tested. One was to use a process
+similar to VPPI events, where a PMU IRQ is first handled in the kernel and then
+sent to a user-space fault handler (the profiler). However, this approach is
+flawed: the fault handler of every process in the system has to be the
+profiler, and issues arise when an interrupt is generated while the idle thread
+is running. A flag could be added to the TCB so that, with an additional
+syscall, the user can register a TCB to be profiled, and we can discard any
+samples taken whilst an “un-profiled” thread was running. However, this is not
+an optimal workaround.
+
+Another proposal was to add a stage to the interrupt handling of PMU events
+only, and to use the existing benchmark log buffer to pass sample data. The
+interrupt was ‘intercepted’ in the interrupt handling logic, where we saved the
+IP and call-stack, and was then handed to its handler, which is our profiler.
+This used the same method of adding a flag to the TCB as above. However, issues
+arose surrounding setup of the log buffer.
+
+These are both rather hacky approaches, and not solutions that you would want to
+have in a production build.
+
+## Prior art
+
+Current approaches for benchmarking and profiling in seL4 do not meet our
+requirements. They focus on profiling the kernel rather than user-space
+applications.
+
+The current infrastructure is focused on tracking utilisation and kernel
+entries, and also providing tracepoints within the kernel. This is not
+particularly useful for our application. However, some features can be
+informative, such as the kernel log buffer. We do not plan on replacing or
+modifying any of the existing benchmarking infrastructure.
+
+There is also a kernel profiling system present, which records the number of
+samples for each IP. This does not address our goal of giving user-space
+applications access to the PMU.
+
+Additionally, on ARM systems, the only way to get access to the PMU from
+user-space is to configure the kernel to export access to the PMU registers,
+making the PMU an uncontrolled resource.
+
+All these implementations rely on a specific benchmarking configuration of the
+kernel, meaning that they are not suitable for production systems.
+
+## Unresolved questions
+
+1. There are several unresolved questions related to supporting different
+   architectures. As RISC-V’s PMU support is not mature, we are not sure how it
+   will develop. Whilst there is an extension that provides overflow interrupt
+   support, it is not currently mandatory for implementations.
+
+2. For x86, determining how to support advanced features such as PEBS/IBS is
+   also undecided.
+
+3. How we will track which events are implemented on different SoCs is
+   undecided. Linux stores these events in JSON files and has a tool that
+   generates C files from them.
+
+4. What the API will actually look like is still under debate and not finalised.
+
+5. How we will pass sample data back to userspace is not fully decided.
+
+6. How will the PMU object affect verification? Initially it will not be
+   available in verification builds of seL4, and it is not clear whether this
+   functionality can be verified at a future stage.
+
+7. How will we multiplex PMU counters?
+
+## References
+
+[1] <https://drive.google.com/file/d/1RiAIOVoN1E7bv6_kEzcgATkhbeUdqu5t/view>
+
+[2] <https://github.com/riscv-non-isa/riscv-sbi-doc/releases/download/v2.0/riscv-sbi.pdf>
+
+[3] <https://github.com/torvalds/linux/blob/861c0981648f5b64c86fd028ee622096eb7af05a/drivers/perf/riscv_pmu_sbi.c#L810>