Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bounce #1550

Merged
merged 2 commits into from
Jan 16, 2025
Merged

Fix bounce #1550

merged 2 commits into from
Jan 16, 2025

Conversation

vext01
Copy link
Contributor

@vext01 vext01 commented Jan 16, 2025

We've been experiencing non-deterministic segfaults, assertion failures
and Lua-level errors in bounce.lua for a while:
#1544

After a couple of days of systematically narrowing down the cause, it
appears to be something to do with the re-use of the PT resources in
collect.c.

In theory reusing the PT stuff for subsequent tracing sessions on the
same thread should be just fine -- I've even found examples on the
internet where people do this -- however, something is wrong with our
implementation, so I'm disabling the caching for now.

This commit does several things related to this:

  • Fixes a bug in the aux buffer read code in the case that the ring
    buffer wraps around. We were incorrectly bumping the length of the
    recorded trace, which would cause trace corruption in the form of
    uninitialised reads when we come to decode the trace:

    -    trace->len += size + head;
    +    trace->len += head;

    Later changes (not re-using the aux buffer) will actually mean that
    the wraparound case can't happen (since the aux buffer is the same
    size as the allocated copyout buffer), but I'm leaving it in in case
    we ever decide to re-use perf stuff again.

    Sadly this fix alone isn't enough to fix bounce.lua.

  • Removes caching of perf-related mmap buffers and the perf file
    descriptor itself. This basically reverts bedafd9. This means each
    tracing session now allocates fresh memory and opens a fresh perf fd.
    As part of this change we re-introduce PacketParserState::Init and
    add a couple of other permissible packets that I've seen come up when
    using try_repeat on the test suite.

  • Ensures that the collector's resources are freed by calling
    hwt_perf_free_collector(), even if collecting a trace is aborted by
    (e.g.) "trace too long". We were failing to do this, meaning that
    resources would potentially leak. Since we no longer cache the perf
    fd, and this is closed in hwt_perf_free_collector(), if we don't
    call that, then the thread is effectively blocked from ever tracing
    again: trying to open perf agaiin would give "resource busy".

This appears to fix the original bounce.lua and the reduced one that me an Lukas devised.

I did 3 try_ci runs of this and all suceeded! Fingers crossed.

@@ -765,6 +742,11 @@ bool hwt_perf_free_collector(struct hwt_perf_ctx *tr_ctx,
if (tr_ctx->stop_fds[0] != -1) {
close(tr_ctx->stop_fds[0]);
}
if (tr_ctx->perf_fd >= 0) {
/* fprintf(stderr, "close perf\n"); */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gah.

@ltratt
Copy link
Contributor

ltratt commented Jan 16, 2025

Please force push an update.

We've been experiencing non-deterministic segfaults, assertion failures
and Lua-level errors in bounce.lua for a while:
ykjit#1544

After a couple of days of systematically narrowing down the cause, it
appears to be something to do with the re-use of the PT resources in
collect.c.

In theory reusing the PT stuff for subsequent tracing sessions on the
same thread *should* be just fine -- I've even found examples on the
internet where people do this -- however, something is wrong with our
implementation, so I'm disabling the caching for now.

This commit does several things related to this:

 - Fixes a bug in the aux buffer read code in the case that the ring
   buffer wraps around. We were incorrectly bumping the length of the
   recorded trace, which would cause trace corruption in the form of
   uninitialised reads when we come to decode the trace:

   ```diff
   -    trace->len += size + head;
   +    trace->len += head;
   ```

   Later changes (not re-using the aux buffer) will actually mean that
   the wraparound case can't happen (since the aux buffer is the same
   size as the allocated copyout buffer), but I'm leaving it in in case
   we ever decide to re-use perf stuff again.

   Sadly this fix alone isn't enough to fix bounce.lua.

 - Removes caching of perf-related mmap buffers and the perf file
   descriptor itself. This basically reverts bedafd9. This means each
   tracing session now allocates fresh memory and opens a fresh perf fd.
   As part of this change we re-introduce `PacketParserState::Init` and
   add a couple of other permissible packets that I've seen come up when
   using try_repeat on the test suite.

 - Ensures that the collector's resources are freed by calling
   `hwt_perf_free_collector()`, even if collecting a trace is aborted by
   (e.g.) "trace too long". We were failing to do this, meaning that
   resources would potentially leak. Since we no longer cache the perf
   fd, and this is closed in `hwt_perf_free_collector()`, if we don't
   call that, then the thread is effectively blocked from ever tracing
   again: trying to open perf agaiin would give "resource busy".
@vext01
Copy link
Contributor Author

vext01 commented Jan 16, 2025

Force pushed the fix. Sorry about that!

@ltratt ltratt added this pull request to the merge queue Jan 16, 2025
Merged via the queue into ykjit:master with commit 3ddf2b3 Jan 16, 2025
2 checks passed
@vext01 vext01 deleted the fix-bounce branch January 16, 2025 11:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants