Add metrics json #761

Open
wants to merge 82 commits into
base: master

@Hooloovoo Hooloovoo commented Aug 16, 2022

This adds a new sanoid --monitor-metrics-json option that reports snapshot monitoring data as JSON.

Example output from the scenario in test_two_criticals_hourly_two_warnings_daily is:

{
  "schema_version": 202204041,
  "overall_snapshot_health_issues": 2,
  "snapshot_info": {
    "sanoid-test-2": {
      "daily": {
        "crit_age_seconds": 172800,
        "has_snapshots": 1,
        "warn_age_seconds": 100800,
        "monitor_dont_warn": 0,
        "newest_age_seconds": 104401,
        "snapshot_health_issues": 1,
        "newest_snapshot_ctime_seconds": 1660685678,
        "monitor_dont_crit": 0
      },
      "hourly": {
        "monitor_dont_warn": 0,
        "has_snapshots": 1,
        "crit_age_seconds": 21600,
        "warn_age_seconds": 17400,
        "snapshot_health_issues": 2,
        "newest_age_seconds": 104401,
        "newest_snapshot_ctime_seconds": 1660685678,
        "monitor_dont_crit": 0
      },
      "monthly": {
        "snapshot_health_issues": 0,
        "newest_age_seconds": 104401,
        "monitor_dont_warn": 0,
        "crit_age_seconds": 3456000,
        "has_snapshots": 1,
        "warn_age_seconds": 2764800,
        "monitor_dont_crit": 0,
        "newest_snapshot_ctime_seconds": 1660685678
      }
    },
    "sanoid-test-1": {
      "hourly": {
        "newest_snapshot_ctime_seconds": 1660685678,
        "monitor_dont_crit": 0,
        "monitor_dont_warn": 0,
        "crit_age_seconds": 21600,
        "has_snapshots": 1,
        "warn_age_seconds": 5400,
        "snapshot_health_issues": 2,
        "newest_age_seconds": 104401
      },
      "monthly": {
        "monitor_dont_crit": 0,
        "newest_snapshot_ctime_seconds": 1660685678,
        "snapshot_health_issues": 0,
        "newest_age_seconds": 104401,
        "monitor_dont_warn": 0,
        "warn_age_seconds": 2764800,
        "crit_age_seconds": 3456000,
        "has_snapshots": 1
      },
      "daily": {
        "newest_age_seconds": 104401,
        "snapshot_health_issues": 1,
        "crit_age_seconds": 115200,
        "has_snapshots": 1,
        "warn_age_seconds": 100800,
        "monitor_dont_warn": 0,
        "monitor_dont_crit": 0,
        "newest_snapshot_ctime_seconds": 1660685678
      }
    }
  }
}
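
As a rough illustration of how a consumer might read this output, here is a minimal Python sketch that lists every path/type pair with a non-zero `snapshot_health_issues` value. The `SAMPLE` string is a trimmed copy of the example above, and `find_issues` is a hypothetical helper name, not part of sanoid.

```python
import json

# Trimmed copy of the example output above: one path, two snapshot types.
SAMPLE = """
{
  "schema_version": 202204041,
  "overall_snapshot_health_issues": 2,
  "snapshot_info": {
    "sanoid-test-1": {
      "hourly": {"snapshot_health_issues": 2, "newest_age_seconds": 104401},
      "monthly": {"snapshot_health_issues": 0, "newest_age_seconds": 104401}
    }
  }
}
"""


def find_issues(metrics):
    """Return (severity, path, snapshot_type) for every entry with a non-zero severity."""
    issues = []
    for path, types in metrics["snapshot_info"].items():
        for snap_type, info in types.items():
            severity = info["snapshot_health_issues"]
            if severity > 0:
                issues.append((severity, path, snap_type))
    # Most severe first, then alphabetical so the order is stable.
    return sorted(issues, key=lambda item: (-item[0], item[1], item[2]))


if __name__ == "__main__":
    for severity, path, snap_type in find_issues(json.loads(SAMPLE)):
        print(severity, path, snap_type)
```

In practice the JSON would come from the command's standard output rather than from an embedded string.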

For the same scenario, sanoid --monitor-snapshots produces:

CRIT: sanoid-test-1 newest hourly snapshot is 1d 5h 0m 0s old (should be < 6h 0m 0s), CRIT: sanoid-test-2 newest hourly snapshot is 1d 5h 0m 0s old (should be < 6h 0m 0s), WARN: sanoid-test-1 newest daily snapshot is 1d 5h 0m 0s old (should be < 1d 4h 0m 0s), WARN: sanoid-test-2 newest daily snapshot is 1d 5h 0m 0s old (should be < 1d 4h 0m 0s)
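
To show how the JSON fields relate to this text, the sketch below rebuilds one message fragment from a single path/type entry. It is an independent Python approximation, not sanoid's own formatting code: `format_age` and `describe` are hypothetical names, dropping the day field when it is zero is inferred from the "6h 0m 0s" in the output above, and the case where `has_snapshots` is 0 is not handled.

```python
def format_age(seconds):
    """Render seconds as e.g. '1d 5h 0m 0s'; the day field is dropped when zero."""
    days, rest = divmod(int(seconds), 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    parts = [f"{hours}h", f"{minutes}m", f"{secs}s"]
    if days:
        parts.insert(0, f"{days}d")
    return " ".join(parts)


def describe(path, snap_type, info):
    """Build one CRIT/WARN fragment for a path/type entry, or None if it is healthy."""
    severity = info["snapshot_health_issues"]
    if severity == 0:
        return None
    if severity == 2:
        label, limit = "CRIT", info["crit_age_seconds"]
    else:
        label, limit = "WARN", info["warn_age_seconds"]
    return (
        f"{label}: {path} newest {snap_type} snapshot is "
        f"{format_age(info['newest_age_seconds'])} old "
        f"(should be < {format_age(limit)})"
    )


if __name__ == "__main__":
    # Illustrative values only; the age is rounded to a whole hour.
    entry = {
        "snapshot_health_issues": 2,
        "crit_age_seconds": 21600,
        "warn_age_seconds": 5400,
        "newest_age_seconds": 104400,
    }
    print(describe("sanoid-test-1", "hourly", entry))
```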

The structure of the JSON is as follows:

"schema_version": the date, year first, followed by an incrementing digit (YYYYMMDDX); it is bumped on backwards-incompatible changes
"overall_snapshot_health_issues": matches the exit code of `sanoid --monitor-snapshots` (0 fine, 1 warning, 2 critical)
"snapshot_info": {
  "[path_name]": {
    "[type, e.g. daily]": {
      "crit_age_seconds": the setting for e.g. daily_crit from sanoid.conf, expressed in seconds
      "has_snapshots": does this path/type combination have any snapshots? (0 no, 1 yes)
      "warn_age_seconds": the setting for e.g. daily_warn from sanoid.conf, expressed in seconds
      "monitor_dont_warn": was monitor_dont_warn set? (0 no, 1 yes)
      "newest_age_seconds": the age of the newest snapshot of this path/type in seconds
      "snapshot_health_issues": does this path/type have any health issues? (0 no, 1 warn, 2 critical); the equivalent of `sanoid --monitor-snapshots` for just this path/type
      "newest_snapshot_ctime_seconds": the creation time of the most recent snapshot of this path/type, as a Unix timestamp in seconds
      "monitor_dont_crit": was monitor_dont_crit set? (0 no, 1 yes)
    }
  }
}

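Two parts of this structure can be sketched in Python: splitting `schema_version` into its date and revision digit, and finding the worst per-type severity. Both function names are hypothetical. The idea that `overall_snapshot_health_issues` equals the highest per-type value is an assumption that happens to hold for the example output; it is not something the description states.

```python
def split_schema_version(version):
    """Split a YYYYMMDDX integer into (year, month, day, revision)."""
    date_part, revision = divmod(version, 10)
    year, month_day = divmod(date_part, 10000)
    month, day = divmod(month_day, 100)
    return year, month, day, revision


def worst_severity(metrics):
    """Highest snapshot_health_issues value across all path/type pairs (0 if there are none)."""
    return max(
        (
            info["snapshot_health_issues"]
            for types in metrics["snapshot_info"].values()
            for info in types.values()
        ),
        default=0,
    )


if __name__ == "__main__":
    print(split_schema_version(202204041))
    sample = {
        "snapshot_info": {
            "sanoid-test-1": {
                "hourly": {"snapshot_health_issues": 2},
                "daily": {"snapshot_health_issues": 1},
            }
        }
    }
    # Assumption: a consumer could compare this with overall_snapshot_health_issues.
    print(worst_severity(sample))
```

A consumer could use the revision digit to refuse input from a schema it does not understand.
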
This requires the Perl JSON module. I have updated packages/debian/control to add this dependency, but the other package types will still need to be updated.

This deliberately builds on my earlier pull request, #729, to ensure that the existing behaviour of sanoid --monitor-snapshots is tested and that this new functionality does not alter it.

I have been running this on my servers for most of this year (it was adding the tests that took the most time) with no issues. I use the code at https://gitlab.com/aaron-w/sanoid_prometheus to export these into Prometheus and alert on them. This addresses #675 as far as Sanoid snapshot information is concerned.
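
For readers unfamiliar with the export step, the sketch below shows one way the JSON could be flattened into the Prometheus text exposition format. It is not the code from the repository linked above: the `sanoid` metric prefix, the `path` and `type` label names, and the function names are all hypothetical choices made for illustration.

```python
def escape_label(value):
    """Escape backslash, double quote and newline as the Prometheus text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_prometheus(metrics, prefix="sanoid"):
    """Flatten the metrics JSON into Prometheus text exposition lines."""
    lines = [
        f"{prefix}_overall_snapshot_health_issues "
        f"{metrics['overall_snapshot_health_issues']}"
    ]
    for path, types in sorted(metrics["snapshot_info"].items()):
        for snap_type, info in sorted(types.items()):
            labels = f'path="{escape_label(path)}",type="{escape_label(snap_type)}"'
            # One sample per numeric field, named after the JSON key.
            for field, value in sorted(info.items()):
                lines.append(f"{prefix}_{field}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {
        "overall_snapshot_health_issues": 2,
        "snapshot_info": {
            "tank": {
                "hourly": {
                    "snapshot_health_issues": 2,
                    "newest_age_seconds": 104401,
                }
            }
        },
    }
    print(to_prometheus(sample), end="")
```

Using the dataset path and snapshot type as labels keeps one metric name per JSON field, so an alert rule can match on `snapshot_health_issues` across all datasets.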

In the future, it would make sense to add further top-level objects to the JSON output of the same command (sanoid --monitor-metrics-json) for the information returned by --monitor-health and --monitor-capacity. The code and JSON structure were designed to make that easier, but these were less urgent for me because other ZFS Prometheus exporters already provide this information. I also wanted to check that I could get this merged before spending more time on additional enhancements.

@Hooloovoo
Author

Tests all pass.

time ./run-tests.sh 
Running test 1_one_year ... [PASS]
Running test 2_dst_handling ... [PASS]
Running test 3_monitor_snapshots ... [PASS]

real	0m0.000s
user	72m59.031s
sys	314m29.532s

@Hooloovoo
Author

I just built a deb package and installed it in a clean VM and it seems to work.

@Hooloovoo Hooloovoo marked this pull request as ready for review August 17, 2022 19:42
@Hooloovoo
Author

Is there anything more I can do to help get this merged?
