ARP collector error: rtnetlink NeighMessage has a wrong attribute data length #2849
Comments
I can confirm this behaviour upgrading from 1.6.1 to 1.7.0 on nixos.
Note: We reboot the server in question every day, and if the unit comes up during boot (systemd with After=network.target) it does NOT display this error (three days in a row so far). However, if the unit is restarted the error starts to appear immediately.
Systemd Unit:
|
The "rtnetlink NeighMessage has a wrong attribute data length" error comes from the github.com/jsimonetti/rtnetlink package that the collector uses to fetch the ARP cache via rtnetlink.
// errInvalidNeighMessageAttr is returned when neigh attributes are malformed.
errInvalidNeighMessageAttr = errors.New("rtnetlink NeighMessage has a wrong attribute data length")
There are only two places in that code which can produce such an error:
func (a *NeighAttributes) decode(ad *netlink.AttributeDecoder) error {
	for ad.Next() {
		switch ad.Type() {
		case unix.NDA_UNSPEC:
			// unused attribute
		case unix.NDA_DST:
			l := len(ad.Bytes())
			if l != 4 && l != 16 {
				return errInvalidNeighMessageAttr
			}
			a.Address = ad.Bytes()
		case unix.NDA_LLADDR:
			// Allow IEEE 802 MAC-48, EUI-48, EUI-64, or 20-octet
			// IP over InfiniBand link-layer addresses
			l := len(ad.Bytes())
			if l != 6 && l != 8 && l != 20 {
				return errInvalidNeighMessageAttr
			}
			a.LLAddress = ad.Bytes()
...
This implies that you have neighbor cache entries of type NDA_DST or NDA_LLADDR which have an unexpected length. Are you able to paste the contents of your neighbor cache, using the command |
|
@jpds Thanks, that doesn't look like anything unusual. This isn't really a bug in node_exporter per se, and probably needs some input from the rtnetlink package authors (@jsimonetti help...?) As a workaround, you can revert to the /proc/net/arp parsing method (used by node_exporter < 1.7.0) with |
I guess the only thing "special" about my particular setup is that I'm using With the workaround flag, the collector succeeds:
|
These neighbors' LLADDR are clearly each 6 bytes, and the DST is either 4 or 16 bytes, so they should not cause this issue. I wonder if I'd be curious to see what is returned if those size constraints are loosened. Would that be something you could test in your setup? I can see if I can easily send you a nix derivation that includes a patch. |
Additionally, the output of |
Sure.
|
Same problem here with seemingly even simpler networking setup. Arch with systemd-networkd and kernel
Does not seem like the values the go library expects at all, however I don't know what nla_len is exactly and if that is different than the number of bytes the go lib expects. It seems like nla_len is always 4 greater than the payload. So I'm guessing the problem is the empty LLADDR reported for the "0.0.0.0" destination:
edit: this last line also appears in the dump from @jpds |
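To illustrate the observation above that nla_len is always 4 greater than the payload: a netlink attribute begins with a 4-byte header (a 16-bit length followed by a 16-bit type), and the length field counts the header itself. The following is only a minimal sketch, not the parser used by rtnetlink. The `attr` type and `parseAttrs` function are hypothetical names, and the sketch assumes a little-endian host. It shows that an attribute with nla_len = 4 carries an empty payload.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// nlaHeaderLen is the size of the attribute header: nla_len plus nla_type.
const nlaHeaderLen = 4

// attr is a hypothetical, simplified view of one netlink attribute.
type attr struct {
	Type    uint16
	Payload []byte
}

// parseAttrs walks a buffer of netlink attributes. nla_len counts header plus
// payload, and each attribute is padded to a 4-byte boundary.
// Assumption: little-endian host byte order.
func parseAttrs(b []byte) ([]attr, error) {
	var out []attr
	for len(b) >= nlaHeaderLen {
		l := int(binary.LittleEndian.Uint16(b[0:2]))
		t := binary.LittleEndian.Uint16(b[2:4])
		if l < nlaHeaderLen || l > len(b) {
			return nil, errors.New("invalid nla_len")
		}
		out = append(out, attr{Type: t, Payload: b[nlaHeaderLen:l]})
		// Skip to the next attribute, honouring 4-byte alignment.
		aligned := (l + 3) &^ 3
		if aligned > len(b) {
			aligned = len(b)
		}
		b = b[aligned:]
	}
	return out, nil
}

func main() {
	// NDA_DST (type 1) carrying 0.0.0.0, then NDA_LLADDR (type 2) with no payload.
	buf := []byte{
		8, 0, 1, 0, 0, 0, 0, 0,
		4, 0, 2, 0,
	}
	attrs, err := parseAttrs(buf)
	if err != nil {
		panic(err)
	}
	for _, a := range attrs {
		fmt.Printf("type=%d nla_len=%d payload_len=%d\n", a.Type, len(a.Payload)+nlaHeaderLen, len(a.Payload))
	}
}
```

A zero-length payload like the second attribute here is what the strict `l != 6 && l != 8 && l != 20` check quoted earlier in the thread would reject.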
You could be on to something here. Let me see if I can make a reproducer. WRT |
FWIW, I tried to prove a suspicion that it is related to network namespaces, but alas that proved fruitless. node_exporter works as expected, even when running in a non-global netns. |
This looks like it is the issue currently. It appears I think the correct course of action would be to allow zero-length |
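As a rough illustration of the idea of tolerating zero-length link-layer addresses, a relaxed length check might look like the sketch below. This is not the code from the proposal PR. `checkLLAddr`, its `allowEmpty` parameter, and `errInvalidAttr` are hypothetical names chosen for this example. Only the accepted lengths 6, 8, and 20 come from the decoder quoted earlier in the thread.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidAttr is a stand-in for the package's own error value.
var errInvalidAttr = errors.New("wrong attribute data length")

// checkLLAddr validates the length of a link-layer address payload.
// With allowEmpty set, a zero-length payload (as seen on NOARP entries)
// is accepted instead of being reported as malformed.
func checkLLAddr(b []byte, allowEmpty bool) error {
	switch len(b) {
	case 6, 8, 20:
		// MAC-48 / EUI-48, EUI-64, or 20-octet IP over InfiniBand.
		return nil
	case 0:
		if allowEmpty {
			return nil
		}
	}
	return errInvalidAttr
}

func main() {
	empty := []byte{}
	mac := []byte{0x08, 0x00, 0x27, 0x01, 0x02, 0x03}

	fmt.Println("strict, empty:", checkLLAddr(empty, false))
	fmt.Println("relaxed, empty:", checkLLAddr(empty, true))
	fmt.Println("strict, mac:", checkLLAddr(mac, false))
}
```

Whether the decoder should accept such entries, skip them, or drop the length check entirely is exactly what the rest of this thread debates, so treat the flag as one possible design rather than the chosen one.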
Opened a proposal PR: jsimonetti/rtnetlink#199 |
Updated and released https://github.com/jsimonetti/rtnetlink/releases/tag/v1.4.0 |
Dependabot will bump that in the next couple days. I'll see about doing a cherry-pick of the PR once it's ready. |
The
const char *ll_addr_n2a(const unsigned char *addr, int alen, int type,
			char *buf, int blen)
{
	int i;
	int l;

	if (alen == 4 &&
	    (type == ARPHRD_TUNNEL || type == ARPHRD_SIT
	     || type == ARPHRD_IPGRE))
		return inet_ntop(AF_INET, addr, buf, blen);
	if (alen == 16 && (type == ARPHRD_TUNNEL6 || type == ARPHRD_IP6GRE))
		return inet_ntop(AF_INET6, addr, buf, blen);
	if (alen == 7 && type == ARPHRD_AX25)
		return ax25_ntop(AF_AX25, addr, buf, blen);
	if (alen == 7 && type == ARPHRD_NETROM)
		return netrom_ntop(AF_NETROM, addr, buf, blen);
	if (alen == 5 && type == ARPHRD_ROSE)
		return rose_ntop(AF_ROSE, addr, buf, blen);

	snprintf(buf, blen, "%02x", addr[0]);
	for (i = 1, l = 2; i < alen && l < blen; i++, l += 3)
		snprintf(buf + l, blen - l, ":%02x", addr[i]);
	return buf;
}
The cf. (abridged):
/* ARP protocol HARDWARE identifiers. */
#define ARPHRD_NETROM	0	/* from KA9Q: NET/ROM pseudo */
#define ARPHRD_AX25	3	/* AX.25 Level 2 */
/* Dummy types for non ARP hardware */
#define ARPHRD_ROSE	270
#define ARPHRD_TUNNEL	768	/* IPIP tunnel */
#define ARPHRD_TUNNEL6	769	/* IP6IP6 tunnel */
#define ARPHRD_SIT	776	/* sit0 device - IPv6-in-IPv4 */
#define ARPHRD_IPGRE	778	/* GRE over IP */
#define ARPHRD_IP6GRE	823	/* GRE over IPv6 */
Technically, the
	snprintf(buf, blen, "%02x", addr[0]);
	for (i = 1, l = 2; i < alen && l < blen; i++, l += 3)
		snprintf(buf + l, blen - l, ":%02x", addr[i]);
	return buf;
Since this code would still expect at least one byte of valid data in |
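The generic hex-formatting tail of the C function above reads `addr[0]` unconditionally, so it assumes at least one byte of address data. A Go rendering of the same colon-separated formatting, with an explicit guard for the empty case, could look like the following sketch. `formatLLAddr` is a hypothetical helper written for this illustration. It is not part of iproute2, rtnetlink, or node_exporter, and it omits the tunnel, AX.25, NET/ROM, and ROSE special cases.

```go
package main

import (
	"fmt"
	"strings"
)

// formatLLAddr renders a link-layer address as colon-separated hex bytes.
// Unlike the C tail quoted above, it returns an empty string for a
// zero-length address instead of reading a first byte that does not exist.
func formatLLAddr(addr []byte) string {
	if len(addr) == 0 {
		return ""
	}
	var sb strings.Builder
	for i, b := range addr {
		if i > 0 {
			sb.WriteByte(':')
		}
		fmt.Fprintf(&sb, "%02x", b)
	}
	return sb.String()
}

func main() {
	fmt.Printf("%q\n", formatLLAddr([]byte{0x08, 0x00, 0x27, 0xaa, 0xbb, 0xcc}))
	fmt.Printf("%q\n", formatLLAddr(nil))
}
```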
@jsimonetti I think it's worth studying the |
(disclaimer, I had only a quick glance at that code) For example, I see the zero-lladdr entries if I specifically query for it:
# ip neigh get 0.0.0.0 dev eth0
0.0.0.0 dev eth0 lladdr 08 NOARP
This filtering should be done in the application and not in the |
I agree, but since node_exporter is calling the code as follows, it returns the rtnetlink error to the function caller:
neighbors, err := conn.Neigh.List()
if err != nil {
	return nil, err
}
If a zero-length lladdr is valid, and there is something else in the neighbor struct that we can test on, then I don't see any problem with the rtnetlink package returning such entries. |
See prometheus/node_exporter#2849 and https://github.com/jsimonetti/rtnetlink/releases/tag/v1.4.0 for discussion. Change-Id: Id9d9630bf32c121ec059a3ee1de9b4e8aa42fb92 Reviewed-on: https://review.monogon.dev/c/monogon/+/2402 Reviewed-by: Tim Windelschmidt <tim@monogon.tech> Tested-by: Jenkins CI
I'm pretty sure this "NOARP" situation should already be correctly handled by node_exporter, as explicitly commented by me when I wrote the code:
for _, n := range neighbors {
	// Neighbors will also contain IPv6 neighbors, but since this is purely an ARP collector,
	// restrict to AF_INET. Also skip entries which have state NUD_NOARP to conform to output
	// of /proc/net/arp.
	if n.Family == unix.AF_INET && n.State&unix.NUD_NOARP == 0 {
		ifIndexEntries[n.Index]++
	}
}
So it's really just up to the rtnetlink package to return such entries instead of an error. |
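To make the filtering step concrete, here is a self-contained sketch of the same loop. The `neigh` struct and `countARPEntries` function are hypothetical stand-ins for the rtnetlink neighbor type and the collector code. The constants mirror the Linux values of AF_INET, AF_INET6, and NUD_NOARP, and are inlined so the example does not depend on golang.org/x/sys/unix.

```go
package main

import "fmt"

// Inlined copies of the Linux constants, for illustration only.
const (
	afInet   = 2    // AF_INET
	afInet6  = 10   // AF_INET6
	nudNoARP = 0x40 // NUD_NOARP
)

// neigh is a hypothetical stand-in for the neighbor fields the collector reads.
type neigh struct {
	Family uint16
	Index  uint32
	State  uint16
	LLAddr []byte
}

// countARPEntries counts IPv4 neighbors per interface index, skipping
// entries flagged NUD_NOARP, as the collector loop quoted above does.
func countARPEntries(neighbors []neigh) map[uint32]uint32 {
	counts := make(map[uint32]uint32)
	for _, n := range neighbors {
		if n.Family == afInet && n.State&nudNoARP == 0 {
			counts[n.Index]++
		}
	}
	return counts
}

func main() {
	neighbors := []neigh{
		// Ordinary IPv4 neighbor on interface 2.
		{Family: afInet, Index: 2, State: 0x02, LLAddr: []byte{0x08, 0x00, 0x27, 0x01, 0x02, 0x03}},
		// NOARP entry with an empty link-layer address on a tunnel interface.
		{Family: afInet, Index: 5, State: nudNoARP},
		// IPv6 neighbor, ignored by an ARP collector.
		{Family: afInet6, Index: 2, State: 0x02, LLAddr: []byte{0x08, 0x00, 0x27, 0x0a, 0x0b, 0x0c}},
	}
	fmt.Println(countARPEntries(neighbors))
}
```

In this sketch the entry with the empty link-layer address is dropped because of its state flag, not because of its address length. That is why the filter only helps once the listing call returns such entries instead of failing.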
Correct me if I'm wrong, but isn't that exactly what jsimonetti/rtnetlink#200 does? |
It may do, but if it does, it may just be by fortuitous coincidence. I based the node_exporter arp collector on the Linux kernel source code which is responsible for generating the entries when viewing /proc/net/arp (net/core/neighbour.c) - functions |
Also having the same error log on Debian 12 with node_exporter v1.7.0.
|
In light of these errors, it might be best to completely remove the length check from rtnetlink. I'll have a PR ready in a few minutes to address this. |
I get this problem and have a wireguard vpn tunnel (running in docker on network mode host). Maybe something to that. |
I'm fairly certain that node_exporter (indirectly) trips over the
Note the absent link-layer address in the NDA_LLADDR attribute, since point-to-point tunnel interfaces typically do not rely on them, along with the
struct rtattr {
	unsigned short rta_len;
	unsigned short rta_type;
};
node_exporter would handle such entries correctly, if the rtnetlink package returned such entries without error, since node_exporter checks for the
For comparison's sake, this is a typical rtnetlink response for a "normal" IPv4 neighbor on an ethernet / wifi link:
We have jsimonetti/rtnetlink#200 provides a workaround whereby the rtnetlink package will skip empty In any case, node_exporter needs to tag a new release which is built against a more recent rtnetlink package. The continuing confirmations of this bug are in part due to the fact that node_exporter v1.7.0 is built against a version of rtnetlink which returns an error when it encounters such empty |
This bug should have been fixed with #2864, but that one has not been released yet. |
Unfortunately #2864 doesn't solve the problem, at least not for me. I had to build node_exporter with this version jsimonetti/rtnetlink#208. |
@jC3rny Thank you for reporting. I am still trying to come up with a good testcase for this. Could you please describe your network conditions under which this occurs? |
Virtual Machine: Node Exporter version 1.7.0 using the linux arm version and running with ./node_exporter using defaults
The node_exporter log output starts showing an error with the following output:
ts=2024-02-01T17:36:24.728Z caller=collector.go:169 level=error msg="collector failed" name=arp duration_seconds=0.000242241 err="could not get ARP entries: rtnetlink NeighMessage has a wrong attribute data length"
ts=2024-02-01T17:36:29.731Z caller=collector.go:169 level=error msg="collector failed" name=arp duration_seconds=0.000115001 err="could not get ARP entries: rtnetlink NeighMessage has a wrong attribute data length"
This error occurs on my armv7 device too. I'm just getting flooded with this message over and over, so I have decided to hold off on using the exporter until this has been corrected. |
#2909 should fix this. |
Thanks for fixing this! I was bitten by this today on one
(final: prev: {
  prometheus-node-exporter = prev.prometheus-node-exporter.overrideAttrs
    (_: _: {
      version = "unstable-20240201";
      __noChroot =
        true; # escape sandbox (on build host must be set: nix.settings.sandbox = "relaxed" !!!), because we need to download modules during build
      # also we must build with --impure
      buildFlags = "-mod=mod";
      preBuild = ''
        GOPROXY="direct"
        GOSUMDB="sum.golang.org"
      '';
      src = prev.fetchFromGitHub {
        rev = "57de74a5f63feb222d4506afd2e8f384247fc51a";
        owner = "prometheus";
        repo = "node_exporter";
        sha256 = "sha256-4Zed9joc2JfMwkQoxk32hWuPa6L6OzQfx8IcyUKh+dE=";
      };
    });
})
The hocus pocus with |
Great, thanks for confirming. |
@SuperQ would it be possible to have this in a release? Is there already a planned roadmap? Thank you. |
I wish we had automated, monthly releases, but I just can't justify spending (significant) time on it without anyone paying for it. But yeah, I agree, it's probably time for a new release. |
@discordianfish when do you expect to release the fix for ARM7? Thank you very much |
For anyone waiting, v1.8.0 has been released 🎉 https://github.com/prometheus/node_exporter/releases/tag/v1.8.0 |
Host operating system: output of
uname -a
Linux 5.14.21-150500.55.36-default #1 SMP PREEMPT_DYNAMIC Tue Oct 31 08:37:43 UTC 2023 (e7a2e23) x86_64 x86_64 x86_64 GNU/Linux
openSUSE Leap 15.5
node_exporter version: output of
node_exporter --version
node_exporter command line flags
defaults
node_exporter log output
Are you running node_exporter in Docker?
no
What did you do that produced an error?
upgraded from version 1.6.0
What did you expect to see?
node_arp_entries
node_scrape_collector_success{collector="arp"} 1
What did you see instead?
node_scrape_collector_success{collector="arp"} 0