Detect system info + available BPF features on startup #70

Merged: patnebe merged 25 commits into javierhonduco:main from feature-detection on Oct 1, 2024

Conversation

patnebe
Collaborator

@patnebe patnebe commented Sep 17, 2024

Context

  • Early detection of system info and BPF features to ensure the required runtime dependencies are met.
  • This information will also provide a better UX through improved error reporting on startup.

Changes

  • Add a new BPF program to detect what BPF features are available on the host.
  • Check that procfs and tracefs are mounted.
  • Check support for tracepoints, perf_events, and a few bpf helpers.
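As a rough illustration of the mount check listed above, here is a minimal sketch that parses `/proc/mounts`. It is not the code from this PR (the diff fragments in the review use a simple `Path::exists()` check); the tracefs mount points and the helper names `is_mounted` and `required_mounts_detected` are assumptions made for the example.

```rust
use std::fs;

// PROCFS_PATH appears in the PR; the tracefs paths below are assumed, common locations.
const PROCFS_PATH: &str = "/proc";
const TRACEFS_PATHS: [&str; 2] = ["/sys/kernel/tracing", "/sys/kernel/debug/tracing"];

/// Hypothetical helper: returns true if `mounts` (in /proc/mounts format)
/// lists a filesystem of type `fstype` mounted at `target`.
fn is_mounted(mounts: &str, target: &str, fstype: &str) -> bool {
    mounts.lines().any(|line| {
        let mut fields = line.split_whitespace();
        // /proc/mounts fields: device, mount point, filesystem type, options, ...
        let (_dev, mount_point, ty) = (fields.next(), fields.next(), fields.next());
        mount_point == Some(target) && ty == Some(fstype)
    })
}

/// Hypothetical helper: true if procfs and at least one tracefs mount are present.
fn required_mounts_detected() -> bool {
    match fs::read_to_string("/proc/mounts") {
        Ok(mounts) => {
            is_mounted(&mounts, PROCFS_PATH, "proc")
                && TRACEFS_PATHS
                    .iter()
                    .any(|path| is_mounted(&mounts, path, "tracefs"))
        }
        // If /proc/mounts cannot be read, procfs is most likely not mounted.
        Err(_) => false,
    }
}
```

Parsing the mount table rather than only checking that a directory exists distinguishes an empty directory from a mounted filesystem, at the cost of a little more code.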

Test Plan

  • Manual tests + Unit(? / integration?) tests.

@patnebe patnebe changed the title WIP: Detect system info + available BPF features on startup Detect system info + available BPF features on startup Sep 23, 2024
@patnebe patnebe marked this pull request as ready for review September 23, 2024 09:21
Owner

@javierhonduco javierhonduco left a comment

Looks very good overall!! Great progress so far!

We might want to look into the kernel test errors. Another thing we could do is move all this code, which is pretty much self-contained (and potentially reusable), to a top-level crate, similar to how lightswitch-proto is implemented. What do you think?

}

fn tracefs_mount_detected() -> bool {
return Path::new(PROCFS_PATH).exists();
Owner

nice! very clean


fn get_trace_sched_event_id(trace_event: &str) -> Result<u32> {
if !tracefs_mount_detected() {
return Err(anyhow!("Failed to detect tracefs"));
Owner

We should probably have custom errors. I appreciate that currently the project abuses anyhow, and we should change this, but let's try to make these errors into their own variants of an enum. Happy to send some hints if you need them!

Collaborator Author

Fair point. Added a few custom errors to this file
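For readers unfamiliar with the pattern being discussed, a custom error enum could look roughly like the sketch below. The type name `FeatureDetectionError`, its variants, and `check_tracefs` are hypothetical and are not the errors actually added in this PR; the sketch uses only the standard library rather than a helper crate.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type: one variant per failure mode instead of ad-hoc anyhow! strings.
#[derive(Debug, PartialEq)]
enum FeatureDetectionError {
    TracefsNotMounted,
    EventIdRead { event: String },
    EventIdParse { event: String, raw: String },
}

impl fmt::Display for FeatureDetectionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::TracefsNotMounted => write!(f, "failed to detect tracefs"),
            Self::EventIdRead { event } => write!(f, "failed to read event={} id", event),
            Self::EventIdParse { event, raw } => {
                write!(f, "failed to parse id {:?} for event={}", raw, event)
            }
        }
    }
}

// Implementing std::error::Error keeps the type usable with `?` into
// Box<dyn Error> (and into anyhow-style error wrappers).
impl Error for FeatureDetectionError {}

/// Hypothetical caller: returns a typed error that callers can match on.
fn check_tracefs(mounted: bool) -> Result<(), FeatureDetectionError> {
    if mounted {
        Ok(())
    } else {
        Err(FeatureDetectionError::TracefsNotMounted)
    }
}
```

With typed variants, startup code can match on the exact failure and print a targeted hint, which fits the PR's stated goal of better error reporting.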

err
)),
},
Err(_) => Err(anyhow!("Failed to read event={} id", trace_event)),
Owner

Using or_else + ? might simplify the code
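A sketch of the kind of simplification suggested here, replacing nested `match` arms with error conversion plus `?`. Note that it uses `map_err`, a close relative of the `or_else` mentioned in the comment; the `EventIdError` type and the function names are hypothetical and not taken from the PR.

```rust
use std::fs;
use std::io;
use std::num::ParseIntError;
use std::path::Path;

/// Hypothetical error type for this example only.
#[derive(Debug)]
enum EventIdError {
    Read(io::Error),
    Parse(ParseIntError),
}

/// Parses the contents of a tracefs `id` file (a decimal number plus a newline).
fn parse_event_id(raw: &str) -> Result<u32, EventIdError> {
    raw.trim().parse::<u32>().map_err(EventIdError::Parse)
}

/// Reads and parses an event id; each fallible step converts its error and
/// returns early with `?`, so no nested match is needed.
fn read_event_id(id_path: &Path) -> Result<u32, EventIdError> {
    let raw = fs::read_to_string(id_path).map_err(EventIdError::Read)?;
    parse_event_id(&raw)
}
```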

}

fn tracepoints_detected() -> bool {
let mut tracepoints_supported = true;
Owner

style nit: no need for this variable, feel free to just return either true or false directly

return tracepoints_supported;
}

if unsafe { close(fd) } != 0 {
Owner

We can look into wrapping this into a type that on drop() attempts to close this. The Drop trait is akin to RAII in C++
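One way to get the close-on-drop behaviour described here is to lean on the standard library's `OwnedFd`, whose `Drop` implementation closes the descriptor. The sketch below assumes a Unix target; the `ProbeFd` wrapper and `open_dev_null` are hypothetical names, and the PR may have solved this differently.

```rust
use std::fs::File;
use std::io;
use std::os::fd::{AsRawFd, FromRawFd, IntoRawFd, OwnedFd, RawFd};

/// Hypothetical wrapper around a raw descriptor (e.g. one returned by a syscall).
/// The inner OwnedFd closes the descriptor when ProbeFd goes out of scope,
/// so no explicit `close(fd)` call is needed on any return path.
struct ProbeFd(OwnedFd);

impl ProbeFd {
    /// # Safety
    /// `raw` must be an open descriptor that nothing else owns or closes.
    unsafe fn from_raw(raw: RawFd) -> Self {
        ProbeFd(unsafe { OwnedFd::from_raw_fd(raw) })
    }

    /// Borrows the raw descriptor for use in further syscalls.
    fn raw(&self) -> RawFd {
        self.0.as_raw_fd()
    }
}

/// Stand-in for a syscall that hands back a raw fd: opens /dev/null and
/// releases ownership of the descriptor so the wrapper can adopt it.
fn open_dev_null() -> io::Result<ProbeFd> {
    let raw = File::open("/dev/null")?.into_raw_fd();
    // SAFETY: `raw` was just released by `into_raw_fd`, so we are its only owner.
    Ok(unsafe { ProbeFd::from_raw(raw) })
}
```

Unlike a manual `close`, errors from the implicit close are ignored, which is usually acceptable for a short-lived probing descriptor.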

Collaborator Author

patnebe commented Sep 26, 2024

> We might want to look into the kernel test errors,

FWIW, the error seemed to be related to the unavailability of the hrtimer_start_range_ns kprobe (or maybe kprobes in general on the VM?). Perhaps some kernel flag is missing(?). So instead of spending time figuring out how to get that to work, I decided to hook the BPF feature detection program to a tracepoint instead. The main reason is that we already rely on tracepoints and currently don't use kprobes anywhere else.

> another thing we could do was moving all this code that is pretty much self-contained (and potentially reusable) to a top level crate similarly to how lightswitch-proto is implemented. What do you think?

That sounds reasonable. Maybe lightswitch_sys_probe would be a good name for this crate? Open to suggestions :)

&& self.software_perfevents_support_detected
&& self.tracepoints_support_detected
&& bpf_features.can_load_trivial_bpf_program
&& bpf_features.has_ring_buf
Owner

We don't use ring buffers now, and when we do, most likely it will be dynamically chosen depending on the availability of ring buffers

Collaborator Author

@patnebe patnebe Sep 30, 2024

Right. I've relaxed the ringbuf requirement.
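To make the outcome of this thread concrete, here is a sketch of a requirements check in which ring buffer support is recorded but not required. The field names come from the diff fragment above; the `SystemInfo` and `BpfFeatures` struct layout and the `can_use_ring_buf` helper are assumptions, and the real check in the PR likely covers more conditions.

```rust
/// Assumed layout: BPF features reported by the detection program.
#[derive(Debug, Default, Clone, Copy)]
struct BpfFeatures {
    can_load_trivial_bpf_program: bool,
    has_ring_buf: bool,
}

/// Assumed layout: everything detected at startup.
#[derive(Debug, Default, Clone, Copy)]
struct SystemInfo {
    software_perfevents_support_detected: bool,
    tracepoints_support_detected: bool,
    bpf_features: BpfFeatures,
}

impl SystemInfo {
    /// Hard requirements only; ring buffers are deliberately left out.
    fn has_minimal_requirements(&self) -> bool {
        self.software_perfevents_support_detected
            && self.tracepoints_support_detected
            && self.bpf_features.can_load_trivial_bpf_program
    }

    /// Optional capability (hypothetical helper) that callers could use to
    /// pick a ring buffer dynamically when one is available.
    fn can_use_ring_buf(&self) -> bool {
        self.bpf_features.has_ring_buf
    }
}
```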

})
}

pub fn has_minimal_requirements(&self) -> bool {
Owner

Nice!

@@ -0,0 +1,7 @@
#ifdef __TARGET_ARCH_x86
#include "vmlinux_x86.h"
Owner

Given how large vmlinux is and that we might want to keep just one copy, what do you think of using either symlinks or including the file from the other crate directly here?

Collaborator Author

Done. Changed this to a symlink

@javierhonduco
Owner

> FWIW, the error seemed to be related to the unavailability of the hrtimer_start_range_ns kprobe (or maybe kprobes in general on the VM?). Perhaps some kernel flag is missing(?). So instead of spending time figuring out how to get that to work, I decided to hook the BPF feature detection program to a tracepoint instead. The main reason is that we already rely on tracepoints and currently don't use kprobes anywhere else.

Makes sense!

> That sounds reasonable. Maybe lightswitch_sys_probe would be a good name for this crate? Open to suggestions :)

What about lightswitch-features-probing / lightswitch-capabilities or something like this? -sys is typically used for C bindings

}
}

// TODO: How can we make this an integration/system test?
Owner

It's totally fine to leave it here. Personally I don't think the difference between purely no-IO unit tests and integration tests matters that much, unless we need to test a binary, in which case we should use the tool level folder instead. The standard in the Rust world is to put integration tests in tests/, so here they would go in lightswitch-capabilities/tests.

Owner

@javierhonduco javierhonduco left a comment

Great job! LGTM

@patnebe patnebe merged commit d510e79 into javierhonduco:main Oct 1, 2024
4 checks passed
@patnebe patnebe deleted the feature-detection branch October 1, 2024 12:44