-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate memory PSI collection and visualization tools #4141
Integrate memory PSI collection and visualization tools #4141
Conversation
Yeah, I see the Yetus warnings... Will fix them. |
487eb68
to
bd41930
Compare
if _, err := os.Stat(PIDFile); err == nil { | ||
log.Noticef("Memory PSI Collector is already running") | ||
return | ||
} | ||
// Create a PID file | ||
err := createPIDFile() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is not atomic - it is kind of the same discussion as #4106 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can hardly be a problem, as the standalone version will be started manually.
But now I remembered that I wanted to detach it from the terminal at the start.
I dislike using Golang. I've struggled to merge a simple feature for over a week. The initial C PoC was ready in one day. It seems needlessly complex. |
bd41930
to
5148dab
Compare
5148dab
to
52f492c
Compare
Use dup3 instead of dup2, to support ARM |
52f492c
to
147c6b7
Compare
pkg/dom0-ztools/rootfs/bin/eve
Outdated
@@ -27,6 +27,8 @@ Welcome to EVE! | |||
dump-stacks | |||
dump-memory | |||
memory-monitor-update-config | |||
psi-collector-start | |||
psi-collector-stop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for other commands we use on/off as a parameter - f.e. https://github.com/lf-edge/eve/pull/4141/files#diff-a3b42769797ef2e752af884f98fd0877b6034d918bae243c3f165d7d710987a1L25
(also this comment comes from the guy who did not use on/off himself but start/stop)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of remote access, on/off looks fine. You either turn the access on or off. It's like a property. In the case of this tool, you have to start and stop it. It's a process, not a property. But I like the idea of adding a subcommand: "star|stop". Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced with "psi-collector start|stop"
fb78e2a
to
2981b6f
Compare
I see a strange failure in the Go tests for my PR.
Isn't it expected, @milan-zededa? |
Looks like a race in the test, I will have a look... |
@@ -241,13 +251,47 @@ func listenDebug(log *base.LogObject, stacksDumpFileName, memDumpFileName string | |||
} | |||
|
|||
info := ` | |||
This server exposes the net/http/pprof API.</br> | |||
For examples on how to use it, see: <a href="https://pkg.go.dev/net/http/pprof">https://pkg.go.dev/net/http/pprof</a></br> | |||
<a href="debug/pprof/">pprof methods</a></br></br> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now you removed this link? I used that in the browser ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returned the link
3bc2b2d
to
0969b2b
Compare
Virtualization test suite failed because of the docker rate limit, again:
I'm also trying to understand what happened to the Smoke Tests. I see some failure
here: https://github.com/lf-edge/eve/actions/runs/10527286917/job/29170125936?pr=4141 |
Meanwhile, I tried to rerun the failed tests... |
@@ -292,6 +337,46 @@ func listenDebug(log *base.LogObject, stacksDumpFileName, memDumpFileName string | |||
return | |||
} | |||
})) | |||
mux.Handle("/memory-monitor/psi-collector/start", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's out of the scope of this PR, but how about we move from http.Mux to chi router, which supports middleware and you can separate POST and GET methods as well? We use it in metadata server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm... I don't mind. Feel free to implement or fire a ticket so as not to forget it =)
'fullAvg10 fullAvg60 fullAvg300 fullTotal' | ||
|
||
|
||
def visualize_memory_pressure(log_file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
holywar
we can actually utilise typing of Python3 and you can write something like
deff visualize_memory_pressure(log_file: pathlib.Path):
/holywar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we can... Will try it next time =D
eve-kernel-amd64-v6.1.38-generic 8859f43ee8c9: eve_defconfig: Enable PSI (Pressure Stall Information). eve-kernel-amd64-v6.1.38-rt 37bc37278441: eve_defconfig: Enable PSI (Pressure Stall Information). eve-kernel-arm64-v5.10.104-nvidia 7b9a55e0b659: defconfig: Enable PSI in default configuration. eve-kernel-arm64-v6.1.38-generic acbfadebeb76: eve_defconfig: Enable PSI in default configuration. bb8ed98284d6: eve_defconfig: Remove ST33HTPM support. The second commit in the 5.10 kernel update is not logically a part of this patch set, but a leftover from a previous one. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit introduces a Pressure Stall Information (PSI) collector to the AgentLog component in Pillar. The collector periodically gathers memory pressure data from the system, logging it in a managed file. The code also includes functionality for handling the file size to ensure that only recent data is kept. This collector is designed to be shared between the AgentLog component and a future standalone tool for enhanced memory pressure monitoring. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit introduces new constants for the memory-monitor directories within the Pillar codebase. Specifically, it adds paths for the memory-monitor output directory and the PSI log file. This change ensures that Pillar components are aware of and can correctly reference the locations used by the memory-monitor. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit adds functionality to the Pillar internal debug API, allowing the start and stop of the Memory PSI collector via HTTP POST requests. New endpoints `/memory-monitor/psi-collector/start` and `/memory-monitor/psi-collector/stop` have been introduced, enabling control over the PSI collector process. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
…ript. This commit extends the `eve` script with new commands `psi-collector start|stop`. These commands interact with the internal debug API to start and stop the Memory PSI collector, respectively. This addition makes it convenient to control the PSI collector directly from the terminal, with the results stored in `/persist/memory-monitor/output/psi.txt`. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit adds a standalone version of the PSI Collector, allowing it to be built and used independently on older versions of EVE where the collector is not integrated into the root filesystem. The new component includes a Makefile for building the binary, a README with instructions, and a basic main.go implementation that handles the PSI collection process. This ensures compatibility and provides memory pressure monitoring capabilities for legacy systems. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit adds a new tool, `psi-visualizer`, designed to visualize PSI data, which is essential for understanding memory pressure patterns over time. The tool uses Python with Plotly to create interactive plots from PSI logs generated by the PSI Collector. The visualizer includes a Makefile for environment setup, a `requirements.txt` for dependencies, and a `visualize.py` script that reads PSI data and generates plots. A README is also included to guide users through setup and usage, making it easier to analyze memory pressure visually. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
…ertions. Added the `testify/require` package to enhance test assertions by allowing tests to stop immediately when a required condition is not met. This addition helps ensure that tests do not continue when critical assertions fail, providing better control over test flows. Vendor files for the require module have been included to support this functionality. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
This commit changes the listenDebug function in the agentlog package to ListenDebug, making it public. The purpose of this change is to allow the function to be utilized in tests, enabling test scenarios that require the pillar debug API. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
…agement. This commit introduces a comprehensive suite of tests for the PSI collector integration and the file size management functionality within the `agentlog` package. These tests cover various scenarios including starting and stopping the PSI collector, emulating memory pressure stats, verifying file content after operations, and ensuring the correct handling of edge cases like non-triggered truncation and improper threshold sizes. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
…synchronization. This commit introduces two mutex locks to ensure thread-safe handling of PSI data and synchronization during testing. The first lock (PsiMutex) is added to protect access to the PSI files, preventing race conditions between the PSI data producer in the tests and the consumer in memprofile. The second lock (psiProducerMutex) ensures that only one PSI data producer runs during tests, avoiding conflicts and ensuring consistent test results. Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
0969b2b
to
122406e
Compare
Fixed the comments from @milan-zededa and @uncleDecart |
No, the test expects list with one IP but in this case we got |
The tests go reeeed 😭😭😭 |
The tests are green. Several approves are here. Merging. |
This PR introduces a toolset for collecting and visualizing Pressure Stall Information (PSI) data, fully integrated into the EVE framework. The primary purpose of this PR is to enable the collection of PSI data, which is crucial for analyzing memory pressure patterns in the system. Understanding these patterns will allow us to develop more effective monitoring tools that can predict and mitigate potential Out-of-Memory (OOM) situations.
To support this functionality, the EVE Linux kernel configuration has been adapted to enable the CONFIG_PSI option. This change ensures that PSI data can be collected on systems running EVE, providing the necessary insights into memory pressure.
The PR includes several key enhancements. First, it integrates a PSI collector into the Pillar's AgentLog component, making it accessible via the internal debug API. This allows for the dynamic start and stop of the PSI collector through HTTP requests, as well as through the eve command-line tool for ease of use. Additionally, a standalone version of the PSI collector is provided, enabling its deployment on older versions of EVE where it is not yet integrated into the root filesystem.
The PR also introduces a PSI visualizer tool, designed to process and display the collected PSI data in an interactive format. This visualization is essential for understanding memory pressure over time, providing insights that can inform the development of new triggers and monitoring mechanisms to preempt OOM conditions.
TODO: