feat: node exporter mixin large update #2665
Very nice! This needs a thorough review though. For this, can you first clean up the commit history and add DCO sign-off, then ping us when you're ready to get this reviewed?
This needs a DCO sign-off. You can use `git commit -s --amend` to add it.
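The sign-off step can be tried out safely in a throwaway repository. The sketch below is an illustration only: the identity, file names, and commit messages are made up, and it assumes a Git version whose `git rebase` supports `--signoff`. It amends the latest commit with `git commit -s --amend`, then signs off every commit after a base commit in one pass.

```shell
set -eu

# Work in a throwaway repository so no real history is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical identity used for the Signed-off-by trailer in this demo.
git config user.name "Example Dev"
git config user.email "dev@example.com"

# A base commit standing in for the upstream branch.
echo base > base.txt
git add base.txt
git commit -q -m "base"
base=$(git rev-parse HEAD)

# Two commits made without a sign-off, standing in for the PR's commits.
echo one > one.txt
git add one.txt
git commit -q -m "first change"
echo two > two.txt
git add two.txt
git commit -q -m "second change"

# Latest commit only: the command suggested in the review comment.
git commit -q --amend -s --no-edit

# Every commit after the base: replay them, adding the trailer where missing.
git rebase -q --signoff "$base"

# Count commits in the range that still lack a Signed-off-by trailer.
unsigned=0
for c in $(git rev-list "$base"..HEAD); do
  if ! git log -1 --format=%B "$c" | grep -q '^Signed-off-by: '; then
    unsigned=$((unsigned + 1))
  fi
done
echo "commits still missing a sign-off: $unsigned"
```

On a real PR branch the rewritten commits then have to be force-pushed, for example with `git push --force-with-lease`.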
Yes, will clean this up after alerts #2644 is merged.
Add UIDs to all dashboards. Add units and descriptions to all panels which were missing them. Modify alerts descriptions and summaries as needed for linting. Signed-off-by: Ryan J. Geyer <me@ryangeyer.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
* Add mountpoint to NodeFilesystem alerts
  This helps to identify alerting filesystem.
* Decrease NodeFilesystem pending time to 15m
  30m is too long and there is a risk of running out of disk space/inodes completely if something is filling up disk very fast (like log file).
* Add CPU and memory alerts
* Add failed systemd service alert
* Decrease NodeNetwork*Errs pending period
* Set 'at' everywhere as preposition for instance
* Add NodeDiskIOSaturation alert
* Add %(nodeExporterSelector)s to Network and conntrack alerts
* Add diskDevice selector
* Fix NodeMemoryHighUtilization alert
* Add NodeSystemSaturation and NodeMemoryMajorPagesFaults
* Decrease NodeSystemdServiceFailed severity to warning
* Extend alert description
* Add comma after 'mounted on'
* Add thresholds for memory alerts
* Add thresholds for memory, disk and system alerts
* Set severity to NodeCPUHighUsage to info
* Convert graph panels to timeseries panel
  ...With default style (opacity, tooltip etc).
  Also:
  Change 'logical core' line style to dotted
  Update Disk I/O time metric to dots
* Move dashboard parameters to config
* Add overview row
* Add Cpu Usage stat panel
* Add network dash
  - Add interfaces overview panel
  - Add oper status timeline
  - Add common lib with reused elements (templates, queries)
  - Add common panels with shared style to be used across this mixin
* Remove external panels lib
* Add fleet dashboard
* Update fleet dash
* Add CPU and memory to fleet
* Add common cpu/memory/disk/network panels on fleet
* add network errors panel as points
* Fix alerts column in fleet table
* Add support for multiple group and instance labels
* Add sockstat to network dashboard
* Add netstat to network dashboard
* Change span to gridPos. Make overview row smaller.
  - gridPos supports tiny panels height.
* add reboot annotation
* Add system dashboard
* add filesystem row
* Add disk and fs dashboard
* Add memory dashboard
* Add memory generic counters to memory dashboard
* Update common lib
* Update OOM killer panel
* Add common annotations: kernelChange, OOMkill
* Add mountpoint to NodeFilesystem alerts
  - This helps to identify alerting filesystem.
* Add CPU and memory alerts
* Add failed systemd service alert
* Decrease NodeNetwork*Errs pending period
* Set 'at' everywhere as preposition for instance
* Add NodeDiskIOSaturation alert
* Add %(nodeExporterSelector)s to Network and conntrack alerts
* Add diskDevice selector
* Fix NodeMemoryHighUtilization alert
* Add NodeSystemSaturation and NodeMemoryMajorPagesFaults
* Decrease NodeSystemdServiceFailed severity to warning
* Remove unused import
* Add ability to set custom dashboardUID
* Add mountpoint to NodeFilesystem alerts
* Add failed systemd service alert
* Remove systemd panel
  - systemd collector is disabled by default
* Add some lint exclusions.
  - Add UIDs to all dashboards.
  - Add units and descriptions to all panels which were missing them.
  - Modify alerts descriptions and summaries as needed for linting.
* Add multi-cluster dashboard lint exclusions
* Extend alert description
* Add thresholds for memory, disk and system alerts
* Set severity to NodeCPUHighUsage to info
* Fix broken diskSpaceUsage link
* Fix cpuIdle panel units
* Change cpuUsage to use $__rate_interval
* Fix cpu usage (replace with nodeQuerySelector)
* Fix units (seconds->s)
* Fix iops units
* Add %(nodeQuerySelector)s to alerts queries
* Add support for multi in job
* Fix Pagesout metric
* Add total and available memory metrics
* Update context switches description
* Add network descriptions
* Change pipe to | from / in AxisLabel
* Update network descriptions
* Add timezone metric
---------
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Ryan J. Geyer <me@ryangeyer.com>
Instead, one can redefine grafanaDashboardIDs in _config Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
to stay under mimir's default limit of 20 alerts per group. Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
This fixes an issue where, after selecting a node with a specific datasource, the link did not use that datasource and therefore showed no data. Signed-off-by: Emily Ahlstrand Rager <emily.rager@grafana.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
@discordianfish, @SuperQ, Hi!
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
Signed-off-by: Vitaly Zhuravlev <v-zhuravlev@users.noreply.github.com>
* Add node-observ-lib
* Remove trends support (not in 10.0 schema)
* Make filteringSelector for logs dashboard configurable
* Temp change dependency (until PR is merged for commonlib)
* Refactor config
* Update jsonnetfile.json
* Update README
* Add separate loki example
* Add sep file example
* Add gitignore to node-observ-lib
* Fix typo in node default filteringSelector
* Prep alert group names for macos
* Add macos-observ-lib
* Change overview dashboard: show networkErrorsAndDroppedPerSec instead of networkErrorPerSec for Linux/MacOS
* Add more alerts
* Move alerts to sep file
* Breaking: Update layout
  To allow to locally import linux from macos
* Bring back NodeFilesystemAlmostOutOfFiles alert
* Show only errors when they occur
* Only show network interfaces that had traffic change at least once during selected dashboard interval
Closing in favor of #2861
Links and data links are provided for better navigation between views.
Checklist:
Various dashboard improvements: