Stack scribbling does not work for tasks with pow2 stack and no other RAM. #1876
Labels: developer-experience, kernel
tl;dr: there is a subtle bug in the kernel's code for initializing task stacks, which breaks the operation of `humility stackmargin` in the case where a task has a power-of-two-sized stack and no other data in RAM.
Deets
I've been analyzing stack usage on production firmware, preparing for an oft-delayed toolchain upgrade, and I noticed that `idle` tends to show zero margin. Okay, fine, `idle` is the world's simplest task, maybe it uses all of its stack (typically 256 bytes). Since it contains no conditional behavior, that won't change.
So then I noticed that `eeprom` on the PSC also reports zero stackmargin. I'm less familiar with that task, so I dug into it. It turns out that the task is, in fact, very nearly out of stack, but not 100% out of stack. The problem comes back to how `stackmargin` is implemented (see also #1872): it assumes that the stack memory is initialized with a recognizable bit pattern, and scans the stack for the lowest-addressed word that isn't that bit pattern to determine the deepest stack write that has occurred in a task's life.
Here's what `eeprom`'s stack looks like:
The task's stack pointer is currently `0x24001510`, which tracks, because the 32 bytes starting at that address look like a standard exception frame that was deposited when the kernel context-switched away from the task.
The 16 bytes of free stack (!) above it are... gobbledygook? But not the gobbledygook that we expect, which should be `baddcafe`.
I resized the task to have 512 bytes of stack, and the behavior persisted: the task parked with 240 bytes used, but now 272 bytes were apparently random.
This strongly suggested that the stack initialization code wasn't working right, since SRAM comes out of reset in a random-ish state, which persists until we initialize it.
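To make the failure mode concrete, here is a minimal sketch of the kind of scan described above. It is not the actual Humility implementation: the names `stack_margin_bytes` and `snapshot_with_free_words`, the stand-in value for live stack contents, and the junk values in the usage note are all made up for illustration. It assumes the stack is available as a word-by-word snapshot, lowest address first.

```rust
/// Fill pattern that unused stack words are expected to hold.
const STACK_FILL: u32 = 0xbadd_cafe;

/// Returns how many bytes at the bottom of the stack still hold the fill
/// pattern, scanning upward from the lowest-addressed word (index 0) and
/// stopping at the first word that differs.
fn stack_margin_bytes(stack: &[u32]) -> usize {
    let untouched_words = stack
        .iter()
        .take_while(|&&word| word == STACK_FILL)
        .count();
    untouched_words * std::mem::size_of::<u32>()
}

/// Builds a hypothetical 256-byte (64-word) snapshot in which the top
/// 240 bytes are in use and the bottom 16 bytes contain `free_fill`.
fn snapshot_with_free_words(free_fill: [u32; 4]) -> Vec<u32> {
    // 0x1234_5678 is an arbitrary stand-in for live stack contents.
    let mut stack = vec![0x1234_5678u32; 64];
    stack[..4].copy_from_slice(&free_fill);
    stack
}
```

With a properly scribbled stack, `stack_margin_bytes(&snapshot_with_free_words([STACK_FILL; 4]))` returns 16. If the four free words instead hold leftover SRAM contents, say `[0xdead_beef, 0, 1, 2]`, the scan stops at the very first word and the result is 0, even though 16 bytes are actually free.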
The stack initialization code is in the kernel, and --- because the kernel doesn't trust tasks in general --- is done best-effort: the kernel looks to see if the task owns a chunk of memory that contains its claimed initial stack pointer, and if so, the kernel will initialize it. If not, the kernel just moves on, which is what appeared to be happening here.
`eeprom` and `idle` have a property in common, which is unusual: neither uses RAM outside the stack. Both tasks are also configured (generally) to have a power-of-two-sized stack (256 bytes). On ARM, where stacks are "full descending" (the stack pointer points to the last used word of stack), the initial stack pointer is four bytes above the last (highest-addressed) word of the stack. Because we use a stack-then-data layout to avoid stack clash, normally this means the initial stack pointer points (harmlessly) at the first data word.
But if there are no data words, it points outside the RAM region.
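The address arithmetic can be sketched as follows. This is an illustration only: `Region` and `initial_sp` are hypothetical names rather than the kernel's own types, and the base address `0x24001500` is inferred from the figures above (stack pointer at `0x24001510` with 16 bytes free below it), not read out of a build.

```rust
/// A half-open address range [base, base + size), standing in for one of
/// a task's RAM regions. Hypothetical type, not the kernel's own.
#[derive(Clone, Copy, Debug)]
struct Region {
    base: u32,
    size: u32,
}

impl Region {
    /// True if `addr` lies inside the region.
    fn contains(&self, addr: u32) -> bool {
        addr >= self.base && addr - self.base < self.size
    }
}

/// With a stack-then-data layout the stack occupies the bottom of the
/// task's RAM, so a full-descending stack starts with its stack pointer
/// at `base + stack_size`, just past the highest stack word.
fn initial_sp(ram: Region, stack_size: u32) -> u32 {
    ram.base + stack_size
}
```

For a 256-byte stack followed by data, e.g. `Region { base: 0x2400_1500, size: 512 }`, the initial stack pointer `0x24001600` is the address of the first data word and `contains` returns true. For a region that is exactly the 256-byte stack, the same initial stack pointer is the first address past the region, so `contains` returns false, while `contains(initial_sp - 4)` is still true.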
That makes the kernel's stack initialization code misfire: the `contains` check there should be applied to the word just below the initial stack pointer, not the initial stack pointer itself. It's a classic off-by-one error (or in this case, off-by-four). Whoops!
Impact
This doesn't actually affect running systems in any way: the contents of the unused part of the stack are undefined, correct programs do not reference them, and nothing inside the system relies on the `baddcafe` initialization pattern. Unlike in many systems, stack initialization is not a security mechanism in Hubris, since any data in the uninitialized stack area will be from the prior incarnation of the same task.
However, it does break the `stackmargin` diagnostic tool, so we should fix it.
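A rough sketch of the shape such a fix might take, under loud assumptions: this is a toy model, not the kernel's actual code. `RamRegion`, `scribble_stack`, and the `check_word_below` flag are hypothetical, RAM is modeled as a `Vec<u32>`, and the stack is assumed to start at the region base (stack-then-data). The only point is the address that gets tested for ownership.

```rust
/// Fill pattern written into the stack before the task first runs.
const STACK_FILL: u32 = 0xbadd_cafe;

/// Toy model of a task-owned RAM region: `words[i]` lives at address
/// `base + 4 * i`. Hypothetical type, not the kernel's own.
struct RamRegion {
    base: u32,
    words: Vec<u32>,
}

impl RamRegion {
    /// True if `addr` lies inside the region.
    fn contains(&self, addr: u32) -> bool {
        addr.checked_sub(self.base)
            .map_or(false, |offset| (offset as usize) < self.words.len() * 4)
    }
}

/// Best-effort stack scribbling: if the task owns a region holding its
/// stack, fill every word from the region base up to `initial_sp`.
/// Returns true if a region was found and filled.
///
/// `check_word_below == false` models the behavior described in this
/// issue (test the initial SP itself); `true` models the proposed fix
/// (test the highest stack word, four bytes below the initial SP).
fn scribble_stack(
    regions: &mut [RamRegion],
    initial_sp: u32,
    check_word_below: bool,
) -> bool {
    let probe = if check_word_below {
        initial_sp.wrapping_sub(4)
    } else {
        initial_sp
    };
    for region in regions.iter_mut() {
        if region.contains(probe) {
            // Assumes the stack starts at the region base.
            let stack_words = ((initial_sp - region.base) / 4) as usize;
            for word in &mut region.words[..stack_words] {
                *word = STACK_FILL;
            }
            return true;
        }
    }
    // No owning region found: move on without scribbling.
    false
}
```

In this model, a 64-word region at `0x24001500` with an initial stack pointer of `0x24001600` is skipped when `check_word_below` is false and fully filled when it is true. A 128-word region at the same base is filled up to the initial stack pointer either way, which matches the observation that only tasks with no data after the stack are affected.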