Add per-zone cpu metrics. #9968

Open
jmcarp wants to merge 1 commit into main from jmcarp/zone-metrics

Contributor

@jmcarp jmcarp commented Mar 4, 2026

Add a new oximeter instrument for tracking per-zone cpu statistics with kstat, and use it in sled-agent. We add metrics for cpu_nsec_{user,sys,waitrq}, for a total cardinality of triple the number of internal zones.

Note: low priority for review, but also likely quick—all changes are mechanical.

@jmcarp jmcarp requested a review from bnaecker March 4, 2026 03:07
@jmcarp jmcarp force-pushed the jmcarp/zone-metrics branch from 5a6c908 to e18753c Compare March 4, 2026 17:26
@jmcarp jmcarp force-pushed the jmcarp/zone-metrics branch from e18753c to e67c7fa Compare March 13, 2026 15:15
Contributor Author

jmcarp commented Mar 13, 2026

Marking this ready for review. I ran this branch on a local omicron, and it did what it was supposed to. For example:

$ oxide experimental system timeseries query --query 'get zone:cpu_nsec_sys | filter timestamp > @now() - 5m && zone_type ~= "cockroachdb" | last 5'
{
  "tables": [
    {
      "name": "zone:cpu_nsec_sys",
      "timeseries": [
        {
          "fields": {
            "rack_id": {
              "type": "uuid",
              "value": "9044ae97-0bc8-4e5f-a171-169cf5d18ba3"
            },
            "sled_revision": {
              "type": "u32",
              "value": 0
            },
            "sled_id": {
              "type": "uuid",
              "value": "c97f3d85-9423-46d9-b025-53bc5579520c"
            },
            "zone_id": {
              "type": "uuid",
              "value": "9797dd6f-3d45-46d5-ba9f-cfc65c76a55f"
            },
            "sled_serial": {
              "type": "string",
              "value": "helios-sse"
            },
            "sled_model": {
              "type": "string",
              "value": "i86pc"
            },
            "zone_name": {
              "type": "string",
              "value": "oxz_cockroachdb_9797dd6f-3d45-46d5-ba9f-cfc65c76a55f"
            },
            "zone_type": {
              "type": "string",
              "value": "cockroachdb"
            }
          },
          "points": {
            "start_times": [
              "2026-03-13T20:54:47.792508616Z",
              "2026-03-13T21:04:23.920085184Z",
              "2026-03-13T21:04:33.922636344Z",
              "2026-03-13T21:04:43.925654919Z",
              "2026-03-13T21:04:53.927270623Z"
            ],
            "timestamps": [
              "2026-03-13T21:04:23.920085184Z",
              "2026-03-13T21:04:33.922636344Z",
              "2026-03-13T21:04:43.925654919Z",
              "2026-03-13T21:04:53.927270623Z",
              "2026-03-13T21:05:03.929757821Z"
            ],
            "values": [
              {
                "metric_type": "delta",
                "values": {
                  "type": "integer",
                  "values": [
                    8972933810,
                    49302468,
                    262251295,
                    114674545,
                    130900584
                  ]
                }
              }
            ]
          }
        },
        {
          "fields": {
            "sled_revision": {
              "type": "u32",
              "value": 0
            },
            "zone_name": {
              "type": "string",
              "value": "oxz_cockroachdb_c2f21953-f259-4b13-b9a9-f2bc7efe723e"
            },
            "sled_model": {
              "type": "string",
              "value": "i86pc"
            },
            "rack_id": {
              "type": "uuid",
              "value": "9044ae97-0bc8-4e5f-a171-169cf5d18ba3"
            },
            "sled_serial": {
              "type": "string",
              "value": "helios-sse"
            },
            "zone_id": {
              "type": "uuid",
              "value": "c2f21953-f259-4b13-b9a9-f2bc7efe723e"
            },
            "sled_id": {
              "type": "uuid",
              "value": "c97f3d85-9423-46d9-b025-53bc5579520c"
            },
            "zone_type": {
              "type": "string",
              "value": "cockroachdb"
            }
          },
          "points": {
            "start_times": [
              "2026-03-13T20:54:47.792474339Z",
              "2026-03-13T21:04:23.920080766Z",
              "2026-03-13T21:04:33.922624130Z",
              "2026-03-13T21:04:43.925651803Z",
              "2026-03-13T21:04:53.927266395Z"
            ],
            "timestamps": [
              "2026-03-13T21:04:23.920080766Z",
              "2026-03-13T21:04:33.922624130Z",
              "2026-03-13T21:04:43.925651803Z",
              "2026-03-13T21:04:53.927266395Z",
              "2026-03-13T21:05:03.929748763Z"
            ],
            "values": [
              {
                "metric_type": "delta",
                "values": {
                  "type": "integer",
                  "values": [
                    8557461128,
                    48655252,
                    226145731,
                    100237179,
                    133696241
                  ]
                }
              }
            ]
          }
        },
        {
          "fields": {
            "sled_revision": {
              "type": "u32",
              "value": 0
            },
            "sled_model": {
              "type": "string",
              "value": "i86pc"
            },
            "zone_id": {
              "type": "uuid",
              "value": "cfd03cbb-afb5-4cd2-9f7a-5f4f2f2f8cf5"
            },
            "sled_id": {
              "type": "uuid",
              "value": "c97f3d85-9423-46d9-b025-53bc5579520c"
            },
            "sled_serial": {
              "type": "string",
              "value": "helios-sse"
            },
            "rack_id": {
              "type": "uuid",
              "value": "9044ae97-0bc8-4e5f-a171-169cf5d18ba3"
            },
            "zone_name": {
              "type": "string",
              "value": "oxz_cockroachdb_cfd03cbb-afb5-4cd2-9f7a-5f4f2f2f8cf5"
            },
            "zone_type": {
              "type": "string",
              "value": "cockroachdb"
            }
          },
          "points": {
            "start_times": [
              "2026-03-13T20:54:48.046663605Z",
              "2026-03-13T21:04:23.920090485Z",
              "2026-03-13T21:04:33.922646645Z",
              "2026-03-13T21:04:43.925659718Z",
              "2026-03-13T21:04:53.927276545Z"
            ],
            "timestamps": [
              "2026-03-13T21:04:23.920090485Z",
              "2026-03-13T21:04:33.922646645Z",
              "2026-03-13T21:04:43.925659718Z",
              "2026-03-13T21:04:53.927276545Z",
              "2026-03-13T21:05:03.929765927Z"
            ],
            "values": [
              {
                "metric_type": "delta",
                "values": {
                  "type": "integer",
                  "values": [
                    9045903012,
                    51329329,
                    152421172,
                    77563250,
                    128780468
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  ]
}
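
Because the values above are deltas of CPU nanoseconds, each one can be turned into a rough utilization figure by dividing by the wall-clock length of its interval (`timestamps[i] - start_times[i]`). A minimal sketch, assuming the interval has already been converted to nanoseconds; `cpu_fraction` is a hypothetical helper, not part of this PR:

```rust
/// Fraction of one CPU consumed during a sampling interval.
///
/// `delta_cpu_nsec` is a delta value from the query output and
/// `interval_nsec` is the wall-clock length of that interval.
/// Returns `None` for an empty interval to avoid dividing by zero.
fn cpu_fraction(delta_cpu_nsec: u64, interval_nsec: u64) -> Option<f64> {
    if interval_nsec == 0 {
        return None;
    }
    Some(delta_cpu_nsec as f64 / interval_nsec as f64)
}
```

For the second point of the first timeseries, `cpu_fraction(49_302_468, 10_002_551_160)` works out to roughly 0.5% of one CPU spent in system time.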

@jmcarp jmcarp marked this pull request as ready for review March 13, 2026 15:25
@jmcarp jmcarp force-pushed the jmcarp/zone-metrics branch 3 times, most recently from 75f714c to 23011a9 Compare March 13, 2026 20:47
.unwrap()
.datum
.value();
assert!(
Collaborator

This change looks unrelated. Is this to address an existing test flake?

Contributor Author

Reverting to submit in a separate patch.

Contributor Author

Actually, I think this just got fixed in #10040.

/// and we can avoid the complexity of per-zone tracking or maintaining a
/// shared mapping of zone metadata.
fn parse_zone_name(zone_name: &str) -> Option<ZoneMetadata> {
    let rest = zone_name.strip_prefix(ZONE_PREFIX)?;
Collaborator

It looks like we return None if the zone doesn't have the right prefix, but if we fail to parse it for another reason (no _, for example), we still return Some(_) with an empty UUID. Is that right? Can you explain the reasoning here? Is it maybe better to use only the zone name, without any interpretation?

Contributor Author

I would say extracting the zone type is useful, so that we can easily ask questions like "show me all the cpu utilization for cockroach", or maybe more to the point, "aggregate cpu utilization by service". I tried simplifying this a bit: if the zone name matches oxz_TYPE_UUID (i.e. is a proper oximeter zone), parse into components, else just submit the zone name and omit the type and id.

Collaborator

That sounds good to me, thanks!
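
The simplified rule agreed on in this thread (parse `oxz_TYPE_UUID` into components, otherwise keep only the zone name) might look roughly like the sketch below. This is not the code from the PR: the value of `ZONE_PREFIX`, the fields of `ZoneMetadata`, and the `looks_like_uuid` helper are assumptions for illustration, and the UUID is checked by shape with the standard library rather than parsed with the `uuid` crate.

```rust
// Assumed prefix, based on the zone names in the query output.
const ZONE_PREFIX: &str = "oxz_";

/// Hypothetical metadata extracted from a well-formed zone name.
#[derive(Debug, PartialEq)]
struct ZoneMetadata<'a> {
    zone_type: &'a str,
    zone_id: &'a str,
}

/// Shape check only: 36 characters, hyphens at the usual positions,
/// hex digits everywhere else.
fn looks_like_uuid(s: &str) -> bool {
    s.len() == 36
        && s.bytes().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}

/// Returns `Some` only when the name matches `oxz_TYPE_UUID`; any other
/// name yields `None`, so the caller can submit the bare zone name and
/// omit the type and id.
fn parse_zone_name(zone_name: &str) -> Option<ZoneMetadata<'_>> {
    let rest = zone_name.strip_prefix(ZONE_PREFIX)?;
    // Split at the last underscore so a type containing underscores stays whole.
    let (zone_type, zone_id) = rest.rsplit_once('_')?;
    if zone_type.is_empty() || !looks_like_uuid(zone_id) {
        return None;
    }
    Some(ZoneMetadata { zone_type, zone_id })
}
```

With this shape there is no partially parsed result: a name either yields both a type and an id, or neither.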

None => (String::new(), Uuid::nil()),
};

let user_metric = zone::CpuNsecUser {
Collaborator

For physical CPU kstats, we represent the state for the cumulative CPU time (sys, user, etc) as a field. Here, they're different timeseries. Why not do the same thing here?

Contributor Author

I had a reason for this, but now that you ask, I don't think I can defend it. I was thinking that the cpu stats refer to cpu microstates, which partition time into different states and whose rates of change should AIUI add up to some fixed value. The zone stats are a bit different: I believe they're actually derived by adding up thread microstates across the zone, and they don't necessarily add up to a fixed value, so I wondered if we should treat them as independent metrics. But I'm not convinced by my own reasoning, so I'm changing this back to a single metric labeled by state.

Collaborator

Ok, sounds good. I think the consistency is valuable at this point.
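
To illustrate the outcome of this thread, one timeseries with the CPU state carried as a field could be modeled as below. This is only a sketch of the idea: `CpuState`, `ZoneCpuSample`, and `samples_for_zone` are hypothetical names, not the oximeter types used in the PR.

```rust
/// Hypothetical CPU states; the variants mirror the
/// cpu_nsec_{user,sys,waitrq} kstats named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CpuState {
    User,
    Sys,
    Waitrq,
}

/// One sample of a single metric, distinguished by its `state` field
/// instead of by a separate timeseries name per state.
#[derive(Debug, PartialEq, Eq)]
struct ZoneCpuSample {
    zone_name: String,
    state: CpuState,
    cumulative_nsec: u64,
}

/// Expand one zone's three cumulative counters into three samples
/// that share the same metric and differ only in `state`.
fn samples_for_zone(zone_name: &str, user: u64, sys: u64, waitrq: u64) -> Vec<ZoneCpuSample> {
    vec![
        (CpuState::User, user),
        (CpuState::Sys, sys),
        (CpuState::Waitrq, waitrq),
    ]
    .into_iter()
    .map(|(state, cumulative_nsec)| ZoneCpuSample {
        zone_name: zone_name.to_string(),
        state,
        cumulative_nsec,
    })
    .collect()
}
```

Either layout yields three series per zone; the difference is that a query can filter or group on the state field rather than naming three separate metrics.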

}

#[tokio::test]
async fn test_kstat_sampler() {
Collaborator

I don't think we need this test. It's testing the sampler itself, unrelated to the new zone CPU stats.

Contributor Author

Huh, that's right. I more or less copied and pasted this test from cpu.rs/link.rs without thinking too hard about it. It doesn't necessarily belong there either—maybe we can condense and move those tests into sampler.rs later—but for now I'm just deleting it from zone.rs.

})
.collect();
assert_eq!(zone_names.len(), 3);
assert!(
Collaborator

Kind of a style nit, but I find it easier to follow this assertion if we collect the names into a set, and assert its length is 1.

Contributor Author

This test got refactored with the change to a single metric labeled by state.
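
For reference, the set-based style suggested in this thread could be sketched as follows. The `(zone_name, state)` tuple representation and the `distinct_zone_names` helper are assumptions for illustration; the actual test works on oximeter samples.

```rust
use std::collections::HashSet;

/// Collect the distinct zone names seen across a batch of samples.
/// Each sample is modeled here as a hypothetical (zone_name, state) pair.
fn distinct_zone_names<'a>(samples: &[(&'a str, &'a str)]) -> HashSet<&'a str> {
    samples.iter().map(|(zone_name, _state)| *zone_name).collect()
}
```

A test can then assert `distinct_zone_names(&samples).len() == 1` to state directly that every sample came from the same zone, instead of comparing the collected names pairwise.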

@jmcarp jmcarp force-pushed the jmcarp/zone-metrics branch 2 times, most recently from f39d03b to 1d89fbf Compare March 16, 2026 19:00
@jmcarp jmcarp force-pushed the jmcarp/zone-metrics branch from 1d89fbf to e7fbdee Compare March 16, 2026 19:48
Contributor Author

jmcarp commented Mar 17, 2026

Ok, I think all the issues from review should be addressed now.
