Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack scribbling does not work for tasks with pow2 stack and no other RAM. #1876

Closed
cbiffle opened this issue Sep 20, 2024 · 0 comments · Fixed by #1882
Closed

Stack scribbling does not work for tasks with pow2 stack and no other RAM. #1876

cbiffle opened this issue Sep 20, 2024 · 0 comments · Fixed by #1882
Labels
developer-experience Fixing this would have a positive impact on developer experience kernel Relates to the Hubris kernel

Comments

@cbiffle
Copy link
Collaborator

cbiffle commented Sep 20, 2024

tl;dr: there is a subtle bug in the kernel's code for initializing task stacks, which breaks the operation of humility stackmargin in the case where a task has a power-of-two-sized stack and no other data in RAM.

Deets

I've been analyzing stack usage on production firmware, preparing for an oft-delayed toolchain upgrade, and I noticed that idle tends to show zero margin. Okay, fine, idle is the world's simplest task, maybe it uses all of its stack (typically 256 bytes). Since it contains no conditional behavior, that won't change.

So then I noticed that eeprom on the PSC also reports zero stackmargin. I'm less familiar with that task, so I dug into it. It turns out that the task is, in fact, very nearly out of stack, but not 100% out of stack. The problem comes back to how stackmargin is implemented (see also #1872): it assumes that the stack memory is initialized with a recognizable bit pattern, and scans the stack for the lowest-addressed word that isn't that bit pattern to determine the deepest stack write that has occurred in a task's life.

Here's what eeprom's stack looks like:

$ humility readmem -w 0x24001500 256
humility: attached to 0483:3754:001A002D4D4B500D20373831 via ST-Link V3
                   \/        4        8        c
0x24001500 | e79ebdcb 7bed052f 6ca0fa34 d35182f8 | ..../..{4..l..Q.
0x24001510 | 240015b2 00000003 00000000 240015d8 | ...$...........$
0x24001520 | 00000000 08013335 08013566 41000000 | ....53..f5.....A
0x24001530 | 00000000 00000000 00000000 00000000 | ................
0x24001540 | 00000000 00000000 00000000 00000000 | ................
0x24001550 | 00000000 00000000 00000000 00000000 | ................
0x24001560 | 00000000 00000000 00000000 00000000 | ................
0x24001570 | 00000000 00000000 240015d8 240015b2 | ...........$...$
0x24001580 | 00000000 240015f8 00000000 080135f4 | .......$.....5..
0x24001590 | 00020001 0000ffff 240015d8 00020001 | ...........$....
0x240015a0 | 00000000 00000000 00000000 ffffffff | ................
0x240015b0 | 000032a1 01000000 00000000 00000000 | .2..............
0x240015c0 | 00000000 00000000 00000000 00000000 | ................
0x240015d0 | 00000000 00000000 00000000 00000000 | ................
0x240015e0 | 00000000 00000000 00000000 00000000 | ................
0x240015f0 | 00000000 00000000 00000000 080132ef | .............2..

The task's stack pointer is currently 0x24001510, which tracks, because the 32 bytes starting at that address look like a standard exception frame that was deposited when the kernel context-switched away from the task.

The 16 bytes of free stack (!) above it are... gobbledygook? But not the gobbledygook that we expect, which should be baddcafe.

I resized the task to have 512 bytes of stack, and the behavior persisted: the task parked with 240 bytes used, but now 272 bytes were apparently random.

This strongly suggested that the stack initialization code wasn't working right, since SRAM comes out of reset in a random-ish state, which persists until we initialize it.

The stack initialization code is in the kernel, and --- because the kernel doesn't trust tasks in general --- is done best-effort: the kernel looks to see if the task owns a chunk of memory that contains its claimed initial stack pointer, and if so, the kernel will initialize it. If not, the kernel just moves on, which is what appeared to be happening here.

eeprom and idle have a property in common, which is unusual: neither uses RAM outside the stack. Both tasks are also configured (generally) to have a power-of-two-sized stack (256 bytes). On ARM, where stacks are "full descending" (the stack pointer points to the last used word of stack), the initial stack pointer is four bytes above the end of the stack. Because we use a stack-then-data layout to avoid stack clash, normally this means the initial stack pointer points (harmlessly) at the first data word.

But if there are no data words, it points outside the RAM region.

Which makes this code misfire:

    // Before we set our frame, find the region that contains our initial stack
    // pointer, and zap the region from the base to the stack pointer with a
    // distinct (and storied) pattern.
    if let Some(region) = task
        .region_table()
        .iter()
        .find(|region| region.contains(initial_stack))
    {

contains here should be applied to the word just below the initial stack pointer, not the initial stack pointer itself. It's a classic off-by-one error (or in this case, off-by-four). Whoops!

Impact

This doesn't actually affect running systems in any way: the contents of the unused part of the stack are undefined, correct programs do not reference them, and nothing inside the system relies on the baddcafe initialization pattern. Unlike many systems, stack initialization is not a security mechanism in Hubris, since any data in the uninitialized stack area will be from the prior incarnation of the same task.

However, it does break the stackmargin diagnostic tool, so we should fix it.

@cbiffle cbiffle added kernel Relates to the Hubris kernel developer-experience Fixing this would have a positive impact on developer experience labels Sep 20, 2024
cbiffle added a commit that referenced this issue Sep 30, 2024
Right now, if a task has no non-stack RAM usage and a power-of-two
stack, the kernel's stack scribbling code skips it. This causes humility
stackmargin to emit bogus values, but has no other implications at the
moment.

The cause is a classic off-by-one error, testing the address just above
the stack instead of within the bounds of the stack. This hits a valid
region if there is at least one byte of non-stack data in RAM, but only
then.

This fixes that by testing the top word in the stack.

Fixes #1876.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
developer-experience Fixing this would have a positive impact on developer experience kernel Relates to the Hubris kernel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant