Populate client side trace's local address via tcp kprobes (#1989)
Summary: Populate client side trace's local address via tcp kprobes
This change populates client side trace's `local_addr` and `local_port`
columns for the following use cases:
1. To provide more consistency for the protocol data tables. Empty
columns make it difficult for end users to understand what is being
traced and make the tables less useful
2. To facilitate addressing a portion of the short lived process
problems (#1638)
For 2, the root of the issue is that `df.ctx["pod"]` syntax relies on
the
[px.upid_to_pod_name](https://docs.px.dev/reference/pxl/udf/upid_to_pod_name/)
function. If a PEM misses the short lived process during its metadata
update, this function fails to resolve the pod name. For client side
traces where the pod is making an outbound connection (non localhost),
the `local_addr` column provides an alternative pod name lookup for
short lived processes when the pod is long lived. This means the
following would be equivalent to the `df.ctx["pod"]` lookup:
`px.pod_id_to_pod_name(px.ip_to_pod_id(df.local_addr))`.
I intend to follow this PR with a compiler change that will make
`df.ctx["pod"]` try both methods should `px.upid_to_pod_name` fail to
resolve. This will allow the existing pxl scripts to display the
previously missed short lived processes.
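The intended lookup order can be sketched in plain Python. This is only an
illustration of the fallback, not the compiler change itself: `UPID_TO_POD`,
`IP_TO_POD_ID`, `POD_ID_TO_NAME` and `resolve_pod` are hypothetical stand-ins
for the PEM's metadata state and for `px.upid_to_pod_name`, `px.ip_to_pod_id`
and `px.pod_id_to_pod_name`, and the sample values are made up.

```python
# Hypothetical metadata snapshots (assumed for illustration only).
# A short-lived process may be missing from the UPID table if the PEM
# did not see it during a metadata update.
UPID_TO_POD = {"upid-long-lived": "default/web-0"}
IP_TO_POD_ID = {"10.0.0.7": "pod-id-1"}
POD_ID_TO_NAME = {"pod-id-1": "default/web-0"}


def resolve_pod(upid: str, local_addr: str) -> str:
    """Try the UPID lookup first, then fall back to the local address."""
    pod = UPID_TO_POD.get(upid, "")
    if pod:
        return pod
    # Fallback: local_addr -> pod ID -> pod name.
    pod_id = IP_TO_POD_ID.get(local_addr, "")
    return POD_ID_TO_NAME.get(pod_id, "")


# Known UPID: resolved directly.
print(resolve_pod("upid-long-lived", "10.0.0.7"))
# Missed short-lived process: resolved through local_addr instead.
print(resolve_pod("upid-short-lived-curl", "10.0.0.7"))
# Localhost traffic has no pod IP, so the fallback cannot help.
print(repr(resolve_pod("upid-short-lived-curl", "127.0.0.1")))
```

The fallback only helps when `local_addr` is a pod IP, which is why it covers
non-localhost outbound connections and not loopback traffic.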
**Alternatives**
Another approach I considered was expanding our use of the `sock_alloc`
kprobe. I used ftrace on a simple curl command to see what other options
could be used (`sudo trace-cmd record -F -p function_graph curl
http://google.com`). The `socket` syscall calls `sock_alloc`, which
would be another mechanism for accessing the `struct sock`. I decided
against this approach because I don't think it's viable to assume that
the same thread/process that calls `socket` will be the one that does
the later syscalls (how our BPF maps are set up). It's common to have a
forking web server model, which means a different process/thread can
call `socket` than the ones that later read/write to it.
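The forking concern can be illustrated with a small simulation. It assumes,
purely for illustration, a map keyed by process ID that is filled when
`socket` is called. The names `sock_by_pid`, `on_socket` and `on_write` are
hypothetical and do not correspond to Pixie's actual BPF maps or probes.

```python
from typing import Dict, Optional

# Hypothetical stand-in for a BPF map keyed by the calling process ID.
sock_by_pid: Dict[int, str] = {}


def on_socket(pid: int, sock: str) -> None:
    """Simulated probe on socket(): remember the sock for the calling process."""
    sock_by_pid[pid] = sock


def on_write(pid: int) -> Optional[str]:
    """Simulated probe on a later write(): look up the sock by the caller's PID."""
    return sock_by_pid.get(pid)


PARENT_PID, WORKER_PID = 100, 101

# The parent process creates the socket, then forks a worker that inherits it.
on_socket(PARENT_PID, "sock@0xabc")

same_process = on_write(PARENT_PID)   # lookup succeeds
forked_worker = on_write(WORKER_PID)  # lookup misses: the worker never called socket()

print(same_process, forked_worker)
```

In this sketch the worker's lookup misses because the entry was stored under
the parent's PID, which is the failure mode described above.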
**Probe stability**
These probes appear to be stable across our oldest and newest supported
kernels. These functions exist in the
[tcp_prot](https://elixir.bootlin.com/linux/v4.14.336/source/net/ipv4/tcp_ipv4.c#L2422),
[tcpv6_prot](https://elixir.bootlin.com/linux/v4.14.336/source/net/ipv6/tcp_ipv6.c#L1941)
structs and I've seen that other projects and bcc tools use these
probes. This makes me believe that these functions have a pretty well
defined interface.
Relevant Issues: #1829, #1638
Type of change: /kind feature
Test Plan: New tests verify that ipv4 and ipv6 cases work
- [x] Ran `for i in $(seq 0 1000); do curl http://google.com/$i; sleep
2; done` within a pod and verified that `local_addr` is populated with
this change and `px.pod_id_to_pod_name(px.ip_to_pod_id(df.local_addr))`
works for pod name resolution.
- [x] Verified that, without this change, the above curl test produces
traces with an empty `local_addr`

- Tested on the following k8s offerings and machine images:
  - [x] GKE COS and Ubuntu
  - [x] EKS Amazon Linux 2
Changelog Message: Populate socket tracer data table `local_addr` and
`local_port` columns for client side traces.
---------
Signed-off-by: Dom Del Nano <ddelnano@gmail.com>