re-manager container stalls #44
Update: the test computer has not reproduced this stalling behavior thus far. This is not directly related, but we will try to limit docker logging, which does not rotate its logs by default.
2022-09-15: Waldo froze overnight running scans. Users say the re container was not pushing CPU to 100% this time. We will continue to investigate for trends in CPU usage. Attempted to diagnose by looking for log feedback after qserver commands. The qserver commands did not reveal any new logging info; all commands are met with timeout errors. Since logging has not been useful, we are experimenting with tweaking the log parameters. After consulting the Docker logging docs, we have added the following fields to the docker daemon config:
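For reference, a log-rotation setup for Docker's default json-file driver looks roughly like the sketch below; the size and file-count values here are placeholders rather than the exact fields applied on Waldo.

```bash
# sketch: enable log rotation for Docker's default json-file driver
# (values are placeholders, not necessarily what was deployed)
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
sudo systemctl restart docker   # the daemon must be restarted for the change to apply
```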
We are running scans overnight to see if stability results.
After the changes implemented by ddkohler, bluesky still froze overnight during scans. This is the message the RE manager is giving me as it tries to stop:

```
[I 2022-09-16 12:11:26,292 bluesky_queueserver.manager.manager] Downloading the lists of existing plans and devices from the worker environment
[E 2022-09-16 12:11:31,285 bluesky_queueserver.manager.start_manager] Timeout detected by Watchdog. RE Manager malfunctioned and must be restarted
[I 2022-09-16 12:11:31,285 bluesky_queueserver.manager.manager] Starting ZMQ server at 'tcp://*:60615'
[I 2022-09-16 12:11:31,285 bluesky_queueserver.manager.manager] ZMQ control channels: encryption disabled
[I 2022-09-16 12:11:31,288 bluesky_queueserver.manager.manager] Starting RE Manager process
[I 2022-09-16 12:11:31,305 bluesky_queueserver.manager.manager] Downloading the lists of existing plans and devices from the worker environment
[E 2022-09-16 12:11:36,299 bluesky_queueserver.manager.start_manager] Timeout detected by Watchdog. RE Manager malfunctioned and must be restarted
[I 2022-09-16 12:11:36,299 bluesky_queueserver.manager.manager] Starting ZMQ server at 'tcp://*:60615'
[I 2022-09-16 12:11:36,299 bluesky_queueserver.manager.manager] ZMQ control channels: encryption disabled
```
One further difference from previous bluesky stalls this week: with previous freezes, restarting docker was ineffective at solving the issue; a timeout issue still occurred, and re-composing docker was what was needed to restore functionality. This time, the user restarted a yaq daemon that was reporting itself as not busy but also seems to have disconnected. The daemon was restarted in the foreground. Afterwards, the user restarted docker and functionality was restored without having to use
@kelsonoram disconnection messages on the daemon side just mean that a yaqc client was closed (i.e. it's normal operation, no issue). Also, can we be sure the disconnect was from a client on the bluesky end of things? Disconnect messages are common and can happen from e.g. yaqc-qtpy. My guess is that restarting docker was the actual fix. If restarting the daemon was the fix, it means something actually went wrong on the daemon side that we failed to detect. @kelsonoram, next time, if you believe the daemon is at fault, can you try some client calls/queries in yaqc-qtpy to see if anything seems wrong with it? i.e. can you
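For example, a quick sanity check of a daemon from python could look like this sketch (the port number is a placeholder for the daemon in question; `get_position` applies to has-position daemons):

```python
# quick yaq daemon sanity check; 12345 is a placeholder port
import yaqc

c = yaqc.Client(12345)
print(c.id())            # daemon name/kind, confirms the connection works
print(c.busy())          # should be False if the daemon thinks it is idle
print(c.get_position())  # for has-position daemons, confirms it still answers queries
```

If those calls respond promptly and `busy()` is False, the daemon itself is probably fine.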
@ddkohler I will do that next time. I just thought it was something different worth noting. I was going to see if just restarting docker fixed the issue next time.
Tail of messages hook:
So the last message processed was a read of the

Neither of those devices appears stalled, and they respond appropriately to external communication. Today when I encountered this, the manager process was responding, though I still got a timeout error for
The scan completed with a

So, threads to pull on:
It's useful to practice using madbg [1] beforehand so that we can all be ready to use it when encountering this issue. Please try the following. First make sure madbg is installed in your environment. Run the following script, which will CPU-lock after a few seconds.

```python
import asyncio

async def nice():
    while True:
        print("nice")
        await asyncio.sleep(1)

async def evil():
    i = 0
    while True:
        print("evil")
        await asyncio.sleep(2)
        i += 1
        if i > 3:
            while True:
                1 + 1

loop = asyncio.new_event_loop()
loop.create_task(nice())
loop.create_task(evil())
loop.run_forever()
```

Now in your command line run:

```
blaise@jasmine conda@edge ~ $ madbg attach 1350483
> /home/blaise/Desktop/play.py(2)<module>()
      1 while True:
----> 2     1+1

ipdb>
```

This will leave you in an interactive debug session where you can try typing
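A few standard pdb/ipdb commands are useful once attached (a generic reference, not output from the actual session):

```
ipdb> where   # show the full stack of the frame the process is stuck in
ipdb> up      # move up a frame to see the caller
ipdb> p i     # print a local variable (here, the counter from evil())
ipdb> ll      # list the source around the current line
```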
Just some how-tos and what is useful to do when this stall occurs so we can get info:

**htop and info about the offending process**

(The relevant htop keybinds are summarized in the sketch at the end of this comment.)

To get info about the processes, in particular which one(s) are actually chewing through CPU: Use the

If htop is not installed:

If there are a lot of entries with green text in

Note: this originally contained instructions to go into settings, but the options required have keybinds, so I updated it to use those instead.

Make note of this info (probably best to screenshot it, honestly... I didn't, because I got the info I wanted out of it, but for purposes of generalizing what to do and recording as best we can...).

There should be three processes listed; the one at the top of the tree is the Watchdog.

**Assess state of manager process responding to inputs**

Assess whether the RE manager is responding to requests. This can be tested explicitly via the CLI:
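Later comments in this thread use `qserver status` for this check; something along these lines, run from a machine that can reach the manager's ZMQ port:

```bash
# query the RE manager status over ZMQ (the manager listens on port 60615 per the logs above)
qserver status
```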
Which will spit out a long JSON reply if it works, or a shorter TimeoutError if it does not.

**Determine at what part of the plan the stall occurred**

Once again, start a shell in the re-manager container and run:
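Something along these lines; the container name and the path to the hooks log are placeholders and will differ on the actual deployment:

```bash
# open a shell in the re-manager container (container name is a placeholder)
docker exec -it re-manager bash
# then, inside the container, show the most recent hook messages
tail -n 30 /path/to/hooks.log   # log path is a placeholder
```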
This will print the last 30 lines of the hooks log file. Possibly using
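As promised above, a sketch of the htop workflow using its standard keybinds:

```
htop        # start htop
# F5 / t : toggle tree view, so the Watchdog -> manager -> worker hierarchy is visible
# H      : hide userland process threads (collapses the many green thread entries)
# F4     : filter the list, e.g. by typing part of the process name
```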
An annotated example of htop (taken while idle, so CPU usage is low, but the red ellipse in the

The ellipse in the
Thanks so much for putting in the detail work @ksunden.
Just a thought: it would probably be a good idea to do the

I doubt it is processing messages, as if it were, I'd expect the scan to continue towards completion. But worth checking.
Bluesky stalled on Waldo over the weekend. The stall followed the same pattern as previous stalls, with the CPU usage for the RE-manager at 100%. The scan that stalled was a 2D frequency scan. The attune-delay was not being applied, so the delay stages should not have been moving. I followed the great plan laid out by @ksunden; the results follow.

**Htop results**

Here is a screenshot of the htop results, hiding the non-relevant results. Like @ksunden saw earlier, the worker has the highest memory usage and the high CPU usage. Unlike what was seen by ksunden earlier, the PIDs for the worker and the manager processes were both low. This indicates that the manager was not restarted at some point (I believe; feel free to correct me otherwise).

**Attempt to assess state of manager process responding to inputs**

Similar to what @ksunden saw before, the manager was responding to requests. I saw timestamped status messages coming in regularly. To be thorough, I did a qserver status request and got the following result:
**Hooks file and finding that hp was last read before the stall**

I then looked into the hooks log file. This showed that when the scan stalled, the last message processed was 'hp' and the next message would be 'twin2'. This is the second time it stalled on reading an hp message. I do not know yet whether this is a coincidence or the source of our troubles. The previous checkpoint of the hooks log is posted below. I tailed the hooks.log several times to check whether the worker was still processing messages. The output never changed, so the worker was not processing messages.
**madbg failed attempt**

I tried the madbg method but it did not work. The output of my attempt is posted below:
**Getting bluesky restarted**

After following these steps in tracking down the problem, I tried to get bluesky working again. I tried the command
The
The process that you would have wanted to attach to would have been
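Roughly, from a shell inside the re-manager container (the grep pattern and PID below are illustrative only):

```bash
# locate the worker process, then attach madbg to it
ps -ef | grep bluesky
madbg attach 12345   # replace 12345 with the worker's PID from htop/ps
```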
Okay, so N=2: it has occurred while attempting to
I note that the last line looks to be cut off; that could be a copy-paste error, or it could indicate that the buffer never got flushed to the output file..., hmmmm
Good catch, that is a copy-and-paste error. That last line should be:
I updated the comment to accurately reflect what was given in the hooks log. Sorry about that, @ksunden.
When I or @untzag get another chance to look at this, we may implement some additional logging to get at what is happening here, but for now documenting as thoroughly as we can helps bring confidence to the conclusion that it is something happening while reading the

I have a couple of logs I may wish to put in bluesky itself to ensure that it is happening in the actual hardware communication, but more than likely it is. Beyond that, some logs in yaqc and/or yaqc-bluesky may be warranted. I've looked at darn near every line of code that gets run for that operation in that process short of looking at python itself, and come up short on an explanation thus far: bluesky, yaqc-bluesky, yaqc, fastavro. The most suspicious thing so far is a potential infinite loop in yaqc, but even there my analysis of the function is that it should idle the CPU instead of spinning it, so I'm not sure it fits the presentation. More fine-grained logging may be the best we can do to zero in on this... unfortunately the feedback loop is slow, but I have confidence we can resolve this eventually; we just need to find it first.
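As a sketch of what "more fine-grained logging" could look like, a generic timing/log bracket around a suspect call; this is an illustration only, not the actual patch, and the decorated function would be whatever we suspect of hanging:

```python
# bracket a suspect call with timestamps to pin down where the hang starts
import functools
import logging
import time

logger = logging.getLogger("stall-hunt")

def bracketed(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        logger.debug("entering %s", fn.__name__)
        try:
            return fn(*args, **kwargs)
        finally:
            logger.debug("leaving %s after %.3f s", fn.__name__, time.monotonic() - start)
    return wrapper
```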
Lazily using the logged-in account at the lab machine, but this is @ksunden. Stall did occur overnight, same symptoms: last Msg was read of hp, spinning CPU on the worker, environment destroy successfully relieved the spinning. Before doing environment destroy, I attempted a few extra steps:
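For reference, the environment destroy and subsequent re-open mentioned above correspond to queueserver CLI calls along these lines:

```bash
qserver environment destroy   # forcibly tear down the stalled worker environment
qserver environment open      # start a fresh worker environment afterwards
```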
We have another stall! This is the first stall since the extra logging features were installed by @ksunden.

**htop results**

Here is a screenshot of the filtered htop results. Like before, you see the worker is taking all of the CPU usage.

**Hooks log results**

Below I have the output of the hooks log from the most recent checkpoint. I have saved elsewhere a full checkpoint log that was not a part of the stall. I saved this because more info is not bad, but I don't think it will reveal much of anything.
I hope this is helpful in tracking down the source of our problem.
If I interpret this correctly, the problem is that calls to hp dependent on port 38708 never get back to the run engine? @kelsonoram does that port correspond to the thorlabs motor controlling the harmonic crystal?
@ddkohler The port 38708 corresponds to the lightcon-topas4-motor for the hp_RP6_Stage.
Whichever daemon it actually is, I think this log rules out bluesky as the problem and suggests a deeper look into yaqc-bluesky processes.
Question: what are the keys for the position identifiers for that motor?

```python
import yaqc  # import added for completeness

c = yaqc.Client(38708)
c.get_position_identifier_options()
```

If I remember correctly for that motor it is

If it is empty (no discrete positions configured), then I'm really not sure what is going on. If it is not empty, I may have an idea regarding our dependency, fastavro, not erroring in a case where I would have expected it to when reading strings that are incomplete, and thus causing some problems on rare occasions when the bytes don't make it in time to be parsed. (Specifically, if it looks like I should have a length-3 string, but only provide 2 bytes, it will just give you the length-2 string and leave the last byte hanging.) I still do not have an explanation for how that causes the CPU utilization we are seeing, but it was something I noticed when thinking on this, something that would be quite rare and dependent on timing we don't fully control. Blaise and I have discussed another round of logging, which we will try to implement. Soooo, if we do implement the next round of logging and then do not see the stall happen for a significant amount of time, there may be a round of logging where we take out the logs in the critical portion, but try to catch it afterwards and try to trigger it again so we are confident about what is happening.
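A minimal sketch of the suspected fastavro behavior, assuming truncation works as described above (illustration only, not something observed on Waldo):

```python
# illustrate the theory: avro strings are length-prefixed, and a read from a
# truncated buffer may return a shorter string instead of raising
import io
import fastavro

schema = {"type": "string"}

buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, "abc")
data = buf.getvalue()              # varint length (3) followed by b"abc"

truncated = io.BytesIO(data[:-1])  # drop the last byte, as if it hadn't arrived yet
value = fastavro.schemaless_reader(truncated, schema)
print(repr(value))  # per the theory, "ab" comes back rather than an error
```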
I am additionally curious about the WS_Wheel_1 read prior: in the cut-off portion of that log, what was the identifier that it read? It could have been that one that caused it, and if it was truncated from what we expect, that would confirm this theory (though again, I'm not sure on the mechanism of it taxing the CPU...). That motor has longer names than most of the others, and thus would be more likely to have this particular error.
This theory also explains why it is a problem for waldo and not the other systems, as none of the other systems use discrete motors. Still, I wish I had an explanation for the mechanism of the high CPU utilization... that is still a mystery to me.
Here are the position identifier options:
Can you do 38707 as well, please?
Yes, I don't know why I didn't add that one before:
Okay, I've wavered back and forth a bit; the mechanics of this are super subtle and timing-based in ways that are difficult to describe, and potentially self-correcting in many instances, which makes it harder to find. If I'm right, many of the strings that are read may actually be truncated; we just don't notice because we don't usually care about the strings (they don't make it to the wt5 files, for instance).
Another stall, another series of logs:

**htop**

**Hooks log**

The stall occurred in hp once again, but a different motor than yesterday. It failed when waiting for the response from the hp_SHG_Crystal at port 38705. Immediately before this motor polling, it was talking to the hp_Crystal_2 motor (at port 38704). The relevant logs are below:
**Position Identifiers**

The position identifiers for the daemon that failed us here and the one preceding it are below:

hp_SHG_Crystal position identifier
hp_Crystal_2 position identifier
Some other things to note which are probably not that relevant:
The only plausible reason for this that I can think of at the moment is that the OS is killing it because it needs to recover resources... even that doesn't make a ton of sense to me, though. As for the ps system... I think we are far enough down the path with waldo that we should first resolve what we see there, then revisit whether the ps system needs anything different.
Executive Summary
This is an ongoing investigation into recurring symptoms that seem very similar to #41 and #42.
Namely, some scans, particularly large ones (such as 3D scans), fail to complete.
The problem does not seem to relate to memory pressure from the containers.
Thus far, the problem has only been seen on Waldo.
Initial report, Symptoms, troubleshooting
2022-09-08
After upgrading to Bluesky v1.10.0, scans stall despite low memory usage (~280 MB).
CLI requests to the re-manager container do not complete; the docker app reports 100% CPU usage.
The same symptoms have appeared 3 times today on that system, so it is very reproducible there.
Operation is restored by restarting containers.
Actions taken