
re-manager container stalls #44

Open
ddkohler opened this issue Sep 8, 2022 · 34 comments

@ddkohler
Contributor

ddkohler commented Sep 8, 2022

Executive Summary

This is an ongoing investigation into recurring symptoms that look very similar to #41 and #42.
Namely, some scans, particularly large ones (such as 3D scans), fail to complete.
The problem does not appear to be related to memory pressure from the containers.
Thus far, it has only been seen on Waldo.

Initial report, Symptoms, troubleshooting

2022-09-08

After upgrading to Bluesky v1.10.0, scans stall despite low memory usage (~280 MB).
CLI requests to the re-manager container do not complete; the Docker app reports 100% CPU usage.
The same symptoms have appeared three times today on that system, so it is very reproducible there.

Operation is restored by restarting containers.

Actions taken

  • currently trying to reproduce this on the test computer.
@ddkohler
Contributor Author

Update: Test computer has not reproduced this stalling behavior thus far.

This is not directly related, but we will try to limit docker logging, which does not rotate its log by default.
https://docs.docker.com/config/containers/logging/configure/

@ddkohler
Contributor Author

ddkohler commented Sep 15, 2022

2022-09-15

Waldo froze overnight running scans. Users say the re container was not pushing CPU to 100% this time. We will continue to watch for trends in CPU usage. We attempted to diagnose by looking for log feedback after qserver commands, but the qserver commands did not reveal any new logging info; all commands are met with timeout errors.

Since logging has not been useful, we are experimenting with tweaking the log parameters. After consulting the Docker logging docs, we have added the following fields to the docker daemon config:

"log-driver": "local",
"log-opts": {
  "mode": "non-blocking"
}

We are running scans overnight to see whether stability improves.

@kelsonoram

kelsonoram commented Sep 16, 2022

After the changes implemented by ddkohler, bluesky still froze overnight during scans.
The issue of 100% CPU usage is not seen this time. Docker is using about 14% CPU, which is fine. About 4.3 GB of memory is being used by vmmem, but that also seems normal, and the value is relatively constant.
There is a difference in the RE manager logs. Before, they were just the table of points that docker spits out every once in a while. This time (probably due to ddkohler's log changes), the logs included where it failed. Unfortunately, the RE manager was unaware of an issue. The user was able to try to interrupt the scans, at which point the RE manager began trying to close out and then failed repeatedly in a loop. This was not something we were able to do before the logger changes; before, nothing happened when trying to interrupt.

This is the message the RE manager is giving me as it tries to stop:
[I 2022-09-16 12:11:26,275 bluesky_queueserver.manager.manager] Starting RE Manager process

[I 2022-09-16 12:11:26,292 bluesky_queueserver.manager.manager] Downloading the lists of existing plans and devices from the worker environment

[E 2022-09-16 12:11:31,285 bluesky_queueserver.manager.start_manager] Timeout detected by Watchdog. RE Manager malfunctioned and must be restarted

[I 2022-09-16 12:11:31,285 bluesky_queueserver.manager.manager] Starting ZMQ server at 'tcp://*:60615'

[I 2022-09-16 12:11:31,285 bluesky_queueserver.manager.manager] ZMQ control channels: encryption disabled

[I 2022-09-16 12:11:31,288 bluesky_queueserver.manager.manager] Starting RE Manager process

[I 2022-09-16 12:11:31,305 bluesky_queueserver.manager.manager] Downloading the lists of existing plans and devices from the worker environment

[E 2022-09-16 12:11:36,299 bluesky_queueserver.manager.start_manager] Timeout detected by Watchdog. RE Manager malfunctioned and must be restarted

[I 2022-09-16 12:11:36,299 bluesky_queueserver.manager.manager] Starting ZMQ server at 'tcp://*:60615'

[I 2022-09-16 12:11:36,299 bluesky_queueserver.manager.manager] ZMQ control channels: encryption disabled

@kelsonoram

One further difference from previous bluesky stalls this week: with previous freezes, restarting Docker was ineffective at solving the issue at hand (there was still a timeout issue occurring), and re-composing docker was what was needed to restore functionality. This time, the user restarted a yaq daemon that was reporting itself as not busy but also seemed to have disconnected. The daemon was restarted in the foreground. Afterwards, the user restarted docker and functionality was restored without having to use docker compose up --build.

@ddkohler
Contributor Author

This time, the user restarted a yaq daemon that was reporting itself as not busy, but it also seems to have disconnected. The daemon was restarted in the foreground.

@kelsonoram disconnection messages on the daemon side just mean that a yaqc client was closed (i.e. it's normal operation, no issue). Also, can we be sure the disconnect was from a client on the bluesky end of things? Disconnect messages are common and can happen from e.g. yaqc-qtpy

My guess is that restarting docker was the actual fix. If restarting the daemon was the fix, it means something actually went wrong on the daemon side that we failed to detect. @kelsonoram next time, if you believe the daemon is at fault, can you try some client calls/queries against it (e.g. in yaqc-qtpy) to see if anything seems wrong with it? i.e. can you measure and get_measured as expected?
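
For reference, a spot check from a Python shell on the host might look something like the following (just a sketch; the port is an example, so substitute whichever daemon is suspect, and measure/get_measured only apply to sensor daemons):

import yaqc

# connect to the suspect daemon (example port; use the daemon's actual port)
c = yaqc.Client(39003)
print(c.id())            # does the daemon identify itself at all?
print(c.busy())          # is it stuck reporting busy?
c.measure()              # request a fresh measurement (sensor daemons only)
print(c.get_measured())  # did the measurement complete and return data?

If those calls hang or error, that points at the daemon; if they behave normally, the daemon is probably fine.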

@kelsonoram

@ddkohler I will do that next time. I just thought it was something different of note. I was going to see if just restarting docker fixed the issue next time.

@ksunden
Member

ksunden commented Sep 22, 2022

Tail of messages hook:

# tail -n30 hooks.log
save(None, *() **{}, run=None)
[event] document emitted with data keys dict_keys(['daq_channel1', 'daq_channel1_signal', 'daq_channel1_baseline', 'daq_channel2', 'daq_channel2_signal', 'daq_channel2_baseline', 'daq_channel3', 'daq_channel3_signal', 'daq_channel3_baseline', 'daq_channel4', 'hp', 'hp_destination', 'hp_Delay_1', 'hp_Delay_1_destination', 'hp_Delay_1_position_identifier', 'hp_Crystal_1', 'hp_Crystal_1_destination', 'hp_Crystal_1_position_identifier', 'hp_Delay_2', 'hp_Delay_2_destination', 'hp_Delay_2_position_identifier', 'hp_Crystal_2', 'hp_Crystal_2_destination', 'hp_Crystal_2_position_identifier', 'hp_SHG_Crystal', 'hp_SHG_Crystal_destination', 'hp_SHG_Crystal_position_identifier', 'hp_RP5_Stage', 'hp_RP5_Stage_destination', 'hp_RP5_Stage_position_identifier', 'hp_WS_Wheel_1', 'hp_WS_Wheel_1_destination', 'hp_WS_Wheel_1_position_identifier', 'hp_RP6_Stage', 'hp_RP6_Stage_destination', 'hp_RP6_Stage_position_identifier', 'hp_Crystal_Stage_2', 'hp_Crystal_Stage_2_destination', 'hp_Crystal_Stage_2_position_identifier', 'hp_WS_Wheel_2', 'hp_WS_Wheel_2_destination', 'hp_WS_Wheel_2_position_identifier', 'hp_FH_IDL', 'hp_FH_IDL_destination', 'hp_d1', 'hp_d1_destination', 'hp_d1_offset', 'hp_d1_stage', 'hp_d1_stage_destination', 'hp_d2', 'hp_d2_destination', 'hp_d2_offset', 'hp_d2_stage', 'hp_d2_stage_destination', 'd1', 'd1_destination', 'd1_offset', 'd1_stage', 'd1_stage_destination']) (run_uid='4e3b87bb-5be3-4807-b4c7-04521f3403df')
Msg('checkpoint', obj=None, args=(), kwargs={}, run=None)
checkpoint(None, *() **{}, run=None)
Msg('set', obj=<yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)>, args=(0.06499999999999984,), kwargs={'group': 'set-0e6aa8'}, run=None)
set(<yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)>, *(0.06499999999999984,) **{'group': 'set-0e6aa8'}, run=None)
Msg('wait', obj=None, args=(), kwargs={'group': 'set-0e6aa8'}, run=None)
wait(None, *() **{'group': 'set-0e6aa8'}, run=None)
{<yaqc_bluesky._status.Status object at 0x7fa64ffafd30>}
The object <yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)> reports set is done with status True
None
Msg('trigger', obj=<yaqc_bluesky.Device to host.docker.internal:39003 (gage-compuscope:daq)>, args=(), kwargs={'group': 'trigger-6373d9'}, run=None)
trigger(<yaqc_bluesky.Device to host.docker.internal:39003 (gage-compuscope:daq)>, *() **{'group': 'trigger-6373d9'}, run=None)
Msg('trigger', obj=<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, args=(), kwargs={'group': 'trigger-6373d9'}, run=None)
trigger(<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, *() **{'group': 'trigger-6373d9'}, run=None)
Msg('trigger', obj=<yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)>, args=(), kwargs={'group': 'trigger-6373d9'}, run=None)
trigger(<yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)>, *() **{'group': 'trigger-6373d9'}, run=None)
The object <yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)> reports trigger is done with status True.
Msg('wait', obj=None, args=(), kwargs={'group': 'trigger-6373d9'}, run=None)
The object <yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)> reports trigger is done with status True.
wait(None, *() **{'group': 'trigger-6373d9'}, run=None)
{<yaqc_bluesky._status.Status object at 0x7fa64fcd5b80>, <yaqc_bluesky._status.Status object at 0x7fa64fde3ca0>, <yaqc_bluesky._status.Status object at 0x7fa64fd7aee0>}
The object <yaqc_bluesky.Device to host.docker.internal:39003 (gage-compuscope:daq)> reports trigger is done with status True.
None
Msg('create', obj=None, args=(), kwargs={'name': 'primary'}, run=None)
create(None, *() **{'name': 'primary'}, run=None)
Msg('read', obj=<yaqc_bluesky.Device to host.docker.internal:39003 (gage-compuscope:daq)>, args=(), kwargs={}, run=None)
read(<yaqc_bluesky.Device to host.docker.internal:39003 (gage-compuscope:daq)>, *() **{}, run=None)
Msg('read', obj=<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, args=(), kwargs={}, run=None)
read(<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, *() **{}, run=None)

So the last message processed was a read of the hp, and the next message would be a read of d1

Neither of those devices appear stalled, and they respond appropriately to external communication.

Today when I encountered this, the manager process was responding, though I still got a timeout error for qserver re pause immediate.

qserver environment destroy worked as expected, killing the worker process. The worker process had a low PID (~140) and the manager process had a PID of ~19000, indicating that it had been restarted at some point, most likely.

The scan completed with a failed exit status indicating that the environment was destroyed.

So threads to pull on:

  • attune daemon.
    • This is the last device communicated with
    • the stall has only been observed on scans with attune axes (though it could easily just be chance)
  • attune delay daemon
    • This would be the next device to be communicated with
    • That said, it does not appear that the read was ever attempted, so this is less likely.
    • That said, there have been other oddities about this hardware reporting an odd position
  • Possibly still an upstream thing relating to the communication between Worker and Manager processes
    • The manager PID being high makes me believe that it has been restarted, though the worker was never restarted.
    • The worker is spinning, but the manager is not
    • It could be that the worker is spinning on something that it expects the manager to reply to but because the manager was restarted it is not listening.

@untzag
Member

untzag commented Sep 22, 2022

It's useful to practice using madbg [1] beforehand so that we can all be ready to use it when encountering this issue. Please try the following.

First make sure madbg is installed in your environment.

Run the following script which will CPU lock after a few seconds.

import asyncio

async def nice():
    # well-behaved task: yields to the event loop every second
    while True:
        print("nice")
        await asyncio.sleep(1)

async def evil():
    # misbehaving task: after a few iterations it enters a tight loop
    # that never awaits, pegging the CPU and starving the event loop
    i = 0
    while True:
        print("evil")
        await asyncio.sleep(2)
        i += 1
        if i > 3:
            while True:
                1+1


loop = asyncio.new_event_loop()
loop.create_task(nice())
loop.create_task(evil())
loop.run_forever()

Now in your command line run madbg attach <pid>. I get the following.

blaise@jasmine conda@edge ~ $ madbg attach 1350483
> /home/blaise/Desktop/play.py(2)<module>()
      1 while True:
----> 2     1+1

ipdb>

This will leave you in an interactive debug session where you can try typing bt.

[1] https://github.com/kmaork/madbg

@ksunden
Member

ksunden commented Sep 23, 2022

Just some how-tos and what is useful to do when this stall occurs so we can get info:

htop and info about the offending process

To get info about the processes, in particular which one(s) are actually chewing through CPU:

Use the (>_) button in Docker Desktop to launch a command shell inside of the docker container for re_manager.
Run htop.

If htop is not installed:
run apt install htop

If there are a lot of entries with green text in Command, hit H (shift+h) to hide them; those are threads, which obscure what we are looking for.
It is helpful to view the tree mode, so if there are no indentations showing the parent/child relationships in the Command column, hit t (lower case) or F5.

Note: this originally contained instructions to go into settings, but the options required have keybinds, so updated to use those instead.

Make note of this info (Probably best to screenshot it, honestly... I didn't because I got the info I wanted out of it, but for purposes of generalizing what to do and recording as best we can...)
Make note of the PID(s) that are consuming appreciable CPU.

There should be three processes listed: the one at the top of the tree is the Watchdog.
The two at equal level underneath the watchdog are the Manager and Worker.
The Worker will be the one that is using the most RAM (around 200-300 MB).
In the instance I saw yesterday, it was the Worker that was using CPU.
The important metric is CPU%, which is the instantaneous CPU usage, rather than TIME+, which is the cumulative CPU usage.
You will also see an entry for htop itself; this can be ignored.

assess state of manager process responding to inputs

Assess whether the RE manager is responding to requests:
While we have verbose logging on, if bluesky-cmds is running, you will see a sea of status messages printed to the logs (visible in either bluesky-cmds or docker desktop) with essentially the only update being the timestamp.
This is an immediate indication that the manager is in fact responding.

This can be tested explicitly via the CLI:

qserver status

Which will spit out a long JSON reply if it works or a shorter TimeoutError if it does not.

determine at what part of the plan the stall occurred

Once again, start a shell in the re-manager container using the (>_) button

run:

tail -n30 hooks.log

This will print the last 30 lines of the hooks log file.
The number of lines can be changed by changing the number in the command.
In general, getting back to the previous checkpoint is a good idea, and maybe a few lines beyond that, as I did for the example above.

Possibly using madbg

To be clear, I'm not fully convinced madbg will help us here... while a really cool project that works in a rather mysterious way, I was unable to get it to attach at all while a plan was just spinning (I tested with an intentionally silly plan that has a long loop that just adds).

I think that the use of threads by queueserver makes it not quite work the way we need it to

But anyway, at least worth a shot.

Once again, start a shell in the re-manager container using (>_)

Run madbg attach <PID from htop>

see what happens... if you get to an ipdb> prompt, try pressing w and hitting enter

Honestly, I'm not experienced enough with the Python debuggers to give actually good advice, but if this were to work, it would hopefully tell us more about what is going on. It is more of an interactive "try and see" than something with a firm flowchart of things to do.

attempt to recover without restarting docker entirely

While in my experience it has not worked when the queue is stalled, a good first step is to try to pause the queue:

qserver re pause immediate

I saw TimeoutError in response to this command, but if it works, CPU usage should go down and you can try either resuming or halting the plan at that stage:

qserver re resume
qserver re halt

Those two will only work if the pause was successful.
If this works, it is quite possible that the resume will just continue as normal and the acquisition can be simply completed.
The issue, though, is that it hasn't worked.

However, if the pause is unsuccessful, you can do

qserver environment destroy

to kill the worker environment, at which point the offending process (if it is the Worker, as it was yesterday) will be closed.

It can be reopened (and in fact if bluesky-cmds is running, it will re-open immediately).
Once it is ready, the queue can be started again (it does take some time to set up).

There is also

qserver manager stop safe off

Though if it gets to that, not sure there is any real benefit over restarting docker...

@ksunden
Member

ksunden commented Sep 23, 2022

[screenshot: annotated htop output]

An annotated example of htop (taken while idle, so CPU usage is low, but the red ellipse in the CPU% column is where I'd expect the 100% CPU to be if the presentation is the same as I saw yesterday.)

The ellipse in the RES column is showing the memory usage that is used to identify the Worker process.
The ellipse in the Command column shows the tree structure that is used to identify the Watchdog process.
The remaining process from that set, therefore, is the Manager.

@untzag
Member

untzag commented Sep 23, 2022

Thanks so much for putting in the detail work @ksunden.

@ksunden
Member

ksunden commented Sep 23, 2022

Just a thought:

Would probably be a good idea to do the tail of the hooks log multiple times before shutting down the worker, to make sure that the program is in fact not processing messages.

I doubt it is processing messages, as if it were, I'd expect the scan to continue towards completion, but worth checking.

@kelsonoram

kelsonoram commented Sep 26, 2022

Bluesky stalled on Waldo over the weekend. The stall followed the same pattern as previous stalls, with the CPU usage for the RE-manager at 100%. The stalled scan was a 2D frequency scan. The attune-delay was not being applied, so the delay stages should not have been moving.

I followed the great plan laid out by @ksunden and the following are the results:

Htop results

Here is a screenshot of the htop results, hiding the non-relevant entries. Like @ksunden saw earlier, the worker has the highest memory usage and the high CPU usage. Unlike what ksunden saw earlier, the PIDs for the worker and the manager processes were both low. This indicates that the manager was not restarted at some point (I believe; feel free to correct me otherwise).
stalled htop

Attempt to assess state of manager process responding to inputs

Similar to what @ksunden saw before, the manager was responding to requests. I saw timestamped status messages coming in regularly. To be thorough, I did a qserver status request and got the following result:

(base) C:\Windows\system32>qserver status
Arguments: ['status']
C:\Users\Wright\miniconda3\lib\site-packages\zmq\_future.py:442: RuntimeWarning: Proactor event loop does not implement add_reader family of methods required for zmq. Registering an additional selector thread for add_reader support via tornado. Use `asyncio.set_event_loop_policy(WindowsSelectorEventLoopPolicy())` to avoid this warning.
  return self._get_loop().call_later(delay, callback)
10:59:32 - MESSAGE:
{'devices_allowed_uid': '6aeeec39-f45c-43d5-a565-9b1c4718db29',
 'devices_existing_uid': 'f55ca387-c408-4a28-8f6f-295599f4328e',
 'items_in_history': 46,
 'items_in_queue': 39,
 'lock': {'environment': False, 'queue': False},
 'lock_info_uid': '37b00059-e4ca-4cf2-a989-fec6a3266eff',
 'manager_state': 'executing_queue',
 'msg': 'RE Manager v0.0.16',
 'pause_pending': False,
 'plan_history_uid': 'f9fdef9d-7c5e-4c00-b7c6-b33b571a1276',
 'plan_queue_mode': {'loop': False},
 'plan_queue_uid': '409e94d3-a033-448a-bd84-7b65dcd83d5a',
 'plans_allowed_uid': '9502c798-7e4f-4087-8c3b-38d3d9102009',
 'plans_existing_uid': '057764f8-9857-44a4-9bc2-6efa07ab37ec',
 'queue_stop_pending': False,
 're_state': 'running',
 'run_list_uid': '5b60e537-1814-4877-86f1-26c9454e1748',
 'running_item_uid': '10ab9909-9bde-41f6-ae73-6e1dfaf574db',
 'task_results_uid': 'e9a663d6-0d69-4551-9676-bb4211ef4487',
 'worker_background_tasks': 0,
 'worker_environment_exists': True,
 'worker_environment_state': 'executing_plan'}

Hooks file and finding that hp was the last read before the stall

I then looked into the hooks log file. This showed that when the scan stalled, the last message processed was a read of 'hp' and the next message would have been a read of 'twin2'. This is the second time it has stalled on reading hp. I do not know yet whether this is a coincidence or the source of our troubles. The hooks log from the previous checkpoint is posted below. I did the tail of hooks.log several times to check whether the worker was still processing messages; the output never changed, so the worker was not processing messages.

None
Msg('create', obj=None, args=(), kwargs={'name': 'primary'}, run=None)
create(None, *() **{'name': 'primary'}, run=None)
Msg('read', obj=<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, args=(), kwargs={}, run=None)
read(<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, *() **{}, run=None)
Msg('read', obj=<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, args=(), kwargs={}, run=None)
read(<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, *() **{}, run=None)
Msg('read', obj=<yaqc_bluesky.Device to host.docker.internal:38003 (attune:twin2)>, args=(), kwargs={}, run=None)
read(<yaqc_bluesky.Device to host.docker.internal:38003 (attune:twin2)>, *() **{}, run=None)
Msg('save', obj=None, args=(), kwargs={}, run=None)
save(None, *() **{}, run=None)
[event] document emitted with data keys dict_keys(['daq_ai0', 'daq_ai1', 'daq_ai2', 'daq_ai3', 'daq_ai0_a', 'daq_ai0_b', 'daq_ai0_c', 'daq_ai0_d', 'daq_ai0_diff_abcd', 'daq_ai0_diff_ab', 'daq_ai0_diff_ad', 'hp', 'hp_destination', 'hp_Delay_1', 'hp_Delay_1_destination', 'hp_Delay_1_position_identifier', 'hp_Crystal_1', 'hp_Crystal_1_destination', 'hp_Crystal_1_position_identifier', 'hp_Delay_2', 'hp_Delay_2_destination', 'hp_Delay_2_position_identifier', 'hp_Crystal_2', 'hp_Crystal_2_destination', 'hp_Crystal_2_position_identifier', 'hp_SHG_Crystal', 'hp_SHG_Crystal_destination', 'hp_SHG_Crystal_position_identifier', 'hp_RP5_Stage', 'hp_RP5_Stage_destination', 'hp_RP5_Stage_position_identifier', 'hp_WS_Wheel_1', 'hp_WS_Wheel_1_destination', 'hp_WS_Wheel_1_position_identifier', 'hp_RP6_Stage', 'hp_RP6_Stage_destination', 'hp_RP6_Stage_position_identifier', 'hp_Crystal_Stage_2', 'hp_Crystal_Stage_2_destination', 'hp_Crystal_Stage_2_position_identifier', 'hp_WS_Wheel_2', 'hp_WS_Wheel_2_destination', 'hp_WS_Wheel_2_position_identifier', 'hp_FH_IDL', 'hp_FH_IDL_destination', 'hp_d1', 'hp_d1_destination', 'hp_d1_offset', 'hp_d1_stage', 'hp_d1_stage_destination', 'hp_d2', 'hp_d2_destination', 'hp_d2_offset', 'hp_d2_stage', 'hp_d2_stage_destination', 'twin2', 'twin2_destination', 'twin2_Delay_1', 'twin2_Delay_1_destination', 'twin2_Delay_1_position_identifier', 'twin2_Crystal_1', 'twin2_Crystal_1_destination', 'twin2_Crystal_1_position_identifier', 'twin2_Delay_2', 'twin2_Delay_2_destination', 'twin2_Delay_2_position_identifier', 'twin2_Crystal_2', 'twin2_Crystal_2_destination', 'twin2_Crystal_2_position_identifier', 'twin2_SHG_Crystal', 'twin2_SHG_Crystal_destination', 'twin2_SHG_Crystal_position_identifier', 'twin2_RP_Stage', 'twin2_RP_Stage_destination', 'twin2_RP_Stage_position_identifier', 'twin2_DFG_CS_1', 'twin2_DFG_CS_1_destination', 'twin2_DFG_CS_1_position_identifier', 'twin2_d2', 'twin2_d2_destination', 'twin2_d2_offset', 'twin2_d2_stage', 'twin2_d2_stage_destination']) (run_uid='5a0d247d-37c9-4ef8-9fcc-92b1c0dfd373')
Msg('checkpoint', obj=None, args=(), kwargs={}, run=None)
checkpoint(None, *() **{}, run=None)
Msg('set', obj=<yaqc_bluesky.Device to host.docker.internal:38003 (attune:twin2)>, args=(2185.7923497267757,), kwargs={'group': 'set-685059'}, run=None)
set(<yaqc_bluesky.Device to host.docker.internal:38003 (attune:twin2)>, *(2185.7923497267757,) **{'group': 'set-685059'}, run=None)
Msg('wait', obj=None, args=(), kwargs={'group': 'set-685059'}, run=None)
wait(None, *() **{'group': 'set-685059'}, run=None)
{<yaqc_bluesky._status.Status object at 0x7fb7432083d0>}
The object <yaqc_bluesky.Device to host.docker.internal:38003 (attune:twin2)> reports set is done with status True
None
Msg('trigger', obj=<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, args=(), kwargs={'group': 'trigger-6ac0ca'}, run=None)
trigger(<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, *() **{'group': 'trigger-6ac0ca'}, run=None)
Msg('trigger', obj=<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, args=(), kwargs={'group': 'trigger-6ac0ca'}, run=None)
trigger(<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, *() **{'group': 'trigger-6ac0ca'}, run=None)
Msg('trigger', obj=<yaqc_bluesky.Device to host.docker.internal:38003 (attune:twin2)>, args=(), kwargs={'group': 'trigger-6ac0ca'}, run=None)
The object <yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)> reports trigger is done with status True.
trigger(<yaqc_bluesky.Device to host.docker.internal:38003 (attune:twin2)>, *() **{'group': 'trigger-6ac0ca'}, run=None)
Msg('wait', obj=None, args=(), kwargs={'group': 'trigger-6ac0ca'}, run=None)
The object <yaqc_bluesky.Device to host.docker.internal:38003 (attune:twin2)> reports trigger is done with status True.
wait(None, *() **{'group': 'trigger-6ac0ca'}, run=None)
{<yaqc_bluesky._status.Status object at 0x7fb742bbef10>, <yaqc_bluesky._status.Status object at 0x7fb7432081c0>, <yaqc_bluesky._status.Status object at 0x7fb742bbe940>}
The object <yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)> reports trigger is done with status True.
None
Msg('create', obj=None, args=(), kwargs={'name': 'primary'}, run=None)
create(None, *() **{'name': 'primary'}, run=None)
Msg('read', obj=<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, args=(), kwargs={}, run=None)
read(<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, *() **{}, run=None)
Msg('read', obj=<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, args=(), kwargs={}, run=None)
read(<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, *() **{}, run=None)

madbg failed attempt

I tried the madbg method but it did not work. The output of my attempt is posted below:

# madbg attach 18047
Traceback (most recent call last):
  File "/usr/local/bin/madbg", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/madbg/__main__.py", line 34, in attach
    attach_to_process(pid, port, connect_timeout=timeout)
  File "/usr/local/lib/python3.9/dist-packages/madbg/api.py", line 34, in attach_to_process
    _inject_set_trace(pid, ip, port)
  File "/usr/local/lib/python3.9/dist-packages/madbg/api.py", line 28, in _inject_set_trace
    inject_py(pid, f'__import__("signal").signal({sig_num},lambda _,f:__import__("madbg").set_trace(f,"{ip}",{port}))')
  File "/usr/local/lib/python3.9/dist-packages/hypno/hypno.py", line 42, in inject_py
    inject(pid, str(temp.name))
  File "/usr/local/lib/python3.9/dist-packages/pyinjector/pyinjector.py", line 103, in inject
    return injector.inject(library_path)
  File "/usr/local/lib/python3.9/dist-packages/pyinjector/pyinjector.py", line 87, in inject
    call_c_func(libinjector.injector_inject, self.injector_p, library_path, pointer(handle))
  File "/usr/local/lib/python3.9/dist-packages/pyinjector/pyinjector.py", line 66, in call_c_func
    raise exception_cls(func.__name__, ret, libinjector.injector_error())
pyinjector.pyinjector.InjectorError: injector_inject returned -1: The target process unexpectedly terminated with exit code 127.

Getting bluesky restarted

After following these steps in tracking down the problem, I tried to get bluesky working again. I tried the command qserver re pause immediate, but it did not work. The attempt is shown below:

(base) C:\Windows\system32>qserver re pause immediate
Arguments: ['re', 'pause', 'immediate']
C:\Users\Wright\miniconda3\lib\site-packages\zmq\_future.py:442: RuntimeWarning: Proactor event loop does not implement add_reader family of methods required for zmq. Registering an additional selector thread for add_reader support via tornado. Use `asyncio.set_event_loop_policy(WindowsSelectorEventLoopPolicy())` to avoid this warning.
  return self._get_loop().call_later(delay, callback)
11:13:39 - MESSAGE:
{'msg': 'Timeout occurred', 'success': None}

The qserver environment destroy command was successful in restarting the worker without restarting docker. The scans were then able to continue.

@ksunden
Member

ksunden commented Sep 26, 2022

The process that you would have wanted to attach to was 141; the one you actually attached to was htop.

@ksunden
Member

ksunden commented Sep 26, 2022

Okay, so with N=2 it has occurred while attempting to read the hp; it is looking more and more solid that that is where to look to fix this... just not sure why that would be yet.

@ksunden
Member

ksunden commented Sep 26, 2022

I note that the last line looks to be cut off. That could be a copy-paste error, or it could indicate that the buffer never got flushed to the output file... hmmmm

@kelsonoram

Good catch, that is a copy and paste error. That last line should be:
read(<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, *() **{}, run=None)

@kelsonoram

I updated the comment to accurately reflect what was given in the hooks log. Sorry about that @ksunden

@ksunden
Member

ksunden commented Sep 27, 2022

When I or @untzag get another chance to look at this, we may implement some additional logging to get at what is happening here. For now, documenting as thoroughly as we can helps bring confidence to the conclusion that it is something happening while reading the hp (possibly attune daemons in general, but both of the best-documented instances are hp).

I have a couple logs I may wish to put in bluesky itself to ensure that it is happening in the actual hardware communication, but more than likely it is.

Beyond that, some logs in yaqc and/or yaqc-bluesky may be warranted.
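
As a rough sketch of the kind of logging I mean on the yaqc-bluesky side (illustrative only; what we actually add may live elsewhere and look different), one could wrap Device.read at runtime so a stall shows exactly which read never returned:

import functools
import logging

import yaqc_bluesky

logger = logging.getLogger("stall-debug")

_original_read = yaqc_bluesky.Device.read

@functools.wraps(_original_read)
def _logged_read(self, *args, **kwargs):
    # bracket the call so a stall shows a "start" with no matching "done"
    # for whichever device the worker was reading at the time
    logger.info("read start: %r", self)
    result = _original_read(self, *args, **kwargs)
    logger.info("read done: %r", self)
    return result

yaqc_bluesky.Device.read = _logged_read

Something like this would be dropped in wherever the devices are constructed, before the plan runs.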

I've looked at darn near every line of code that gets run for that operation in that process short of looking at python itself and come up short on an explanation thus far: bluesky, yaqc-bluesky, yaqc, fastavro.

The most suspicious thing so far is a potential infinite loop in yaqc, but even there my analysis of the function is that it should idle the CPU rather than spin it, so I'm not sure it fits the presentation.
That said, it is the only non-fixed-length loop I can see in that whole stack below the bluesky run engine itself.

More fine-grained logging may be the best we can do to zero in on this... unfortunately the feedback loop is slow, but I have confidence we can resolve this eventually; we just need to find it first.

@kelsonoram

Lazily using the logged in account at the lab machine, but this is @ksunden.

Stall did occur overnight, same symptoms: last Msg was read of hp, spinning CPU on the worker, environment destroy successfully relieved the spinning

Before doing environment destroy, I attempted a few extra steps:

  • Did madbg with the correct PID, but it did not work (just hung)
  • I tried restarting lightcon motors, attune delay and attune, to see if that jolt kicked it in the right way, to no avail.
    • This leads me to think it may not be the loop I said I was suspicious of... specifically because even if I am wrong about CPU usage in that loop, it should error out of that loop on lost connection.

@kelsonoram

We have another stall! This is the first stall since the extra logging features were installed by @ksunden.
This stall exhibited the same behavior as the others: the CPU usage for the worker in the RE-manager is at 100% and the manager was still responding to requests. I looked at htop again and at the more extensive hooks log. It looks like our problem occurred in the FunctionResponse of trying to read hp's RP6 stage.

htop results

Here is a screenshot of the filtered htop results. Like before, you can see the worker is taking all of the CPU usage.
htop results for 20220929

Hooks log results

Below I have the output of the hooks log from the most recent checkpoint. I have saved elsewhere a full checkpoint log that was not a part of the stall. I saved this because more info is not bad, but I don't think it will reveal much of anything.

Msg('checkpoint', obj=None, args=(), kwargs={}, run=None)
checkpoint(None, *() **{}, run=None)
Msg('set', obj=<yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)>, args=(-0.43000000000000016,), kwargs={'group': 'set-5ca32d'}, run=None)
set(<yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)>, *(-0.43000000000000016,) **{'group': 'set-5ca32d'}, run=None)
FunctionArgs(name='HasPosition.set', host='host.docker.internal', port=39001, args=(-0.43000000000000016,), kwargs={})
FunctionArgs(name='Base._wait_until_still', host='host.docker.internal', port=39001, args=(), kwargs={})
FunctionResponse(name='Base._wait_until_still', host='host.docker.internal', port=39001, data=<yaqc_bluesky._status.Status object at 0x7f97d4a0fb80>)
FunctionResponse(name='HasPosition.set', host='host.docker.internal', port=39001, data=<yaqc_bluesky._status.Status object at 0x7f97d4a0fb80>)
Msg('wait', obj=None, args=(), kwargs={'group': 'set-5ca32d'}, run=None)
wait(None, *() **{'group': 'set-5ca32d'}, run=None)
{<yaqc_bluesky._status.Status object at 0x7f97d4a0fb80>}
The object <yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)> reports set is done with status True
None
Msg('trigger', obj=<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, args=(), kwargs={'group': 'trigger-df95ae'}, run=None)
trigger(<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, *() **{'group': 'trigger-df95ae'}, run=None)
FunctionArgs(name='HasMeasureTrigger.trigger', host='host.docker.internal', port=39003, args=(), kwargs={})
FunctionArgs(name='Base._wait_until_still', host='host.docker.internal', port=39003, args=(), kwargs={})
FunctionResponse(name='Base._wait_until_still', host='host.docker.internal', port=39003, data=<yaqc_bluesky._status.Status object at 0x7f97d4a0f5b0>)
FunctionResponse(name='HasMeasureTrigger.trigger', host='host.docker.internal', port=39003, data=<yaqc_bluesky._status.Status object at 0x7f97d4a0f5b0>)
Msg('trigger', obj=<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, args=(), kwargs={'group': 'trigger-df95ae'}, run=None)
trigger(<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, *() **{'group': 'trigger-df95ae'}, run=None)
FunctionArgs(name='Base.trigger', host='host.docker.internal', port=38001, args=(), kwargs={})
FunctionResponse(name='Base.trigger', host='host.docker.internal', port=38001, data=<yaqc_bluesky._status.Status object at 0x7f97cfb490d0>)
The object <yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)> reports trigger is done with status True.
Msg('trigger', obj=<yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)>, args=(), kwargs={'group': 'trigger-df95ae'}, run=None)
trigger(<yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)>, *() **{'group': 'trigger-df95ae'}, run=None)
FunctionArgs(name='Base.trigger', host='host.docker.internal', port=39001, args=(), kwargs={})
FunctionResponse(name='Base.trigger', host='host.docker.internal', port=39001, data=<yaqc_bluesky._status.Status object at 0x7f97ce232fa0>)
The object <yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)> reports trigger is done with status True.
Msg('wait', obj=None, args=(), kwargs={'group': 'trigger-df95ae'}, run=None)
wait(None, *() **{'group': 'trigger-df95ae'}, run=None)
{<yaqc_bluesky._status.Status object at 0x7f97ce232fa0>, <yaqc_bluesky._status.Status object at 0x7f97d4a0f5b0>, <yaqc_bluesky._status.Status object at 0x7f97cfb490d0>}
The object <yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)> reports trigger is done with status True.
None
Msg('create', obj=None, args=(), kwargs={'name': 'primary'}, run=None)
create(None, *() **{'name': 'primary'}, run=None)
Msg('read', obj=<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, args=(), kwargs={}, run=None)
read(<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, *() **{}, run=None)
FunctionArgs(name='Base.read', host='host.docker.internal', port=39003, args=(), kwargs={})
FunctionArgs(name='IsSensor._read', host='host.docker.internal', port=39003, args=(OrderedDict(), 1664455872.3688958), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=39003, args=(OrderedDict(), 1664455872.3688958), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=39003, data=OrderedDict())
FunctionResponse(name='IsSensor._read', host='host.docker.internal', port=39003, data=OrderedDict([('daq_ai0', {'value': -7.366943359375009e-06, 'timestamp': 1664455872.3688958}), ('daq_ai1', {'value': 2.8955078124999976e-06, 'timestamp>
FunctionResponse(name='Base.read', host='host.docker.internal', port=39003, data=OrderedDict([('daq_ai0', {'value': -7.366943359375009e-06, 'timestamp': 1664455872.3688958}), ('daq_ai1', {'value': 2.8955078124999976e-06, 'timestamp': 16>
Msg('read', obj=<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, args=(), kwargs={}, run=None)
read(<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, *() **{}, run=None)
FunctionArgs(name='Base.read', host='host.docker.internal', port=38001, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38001, args=(OrderedDict(), 1664455872.372715), kwargs={})
FunctionArgs(name='HasDependent._read', host='host.docker.internal', port=38001, args=(OrderedDict(), 1664455872.372715), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38001, args=(OrderedDict(), 1664455872.372715), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38001, data=OrderedDict([('hp_destination', {'value': 550.3577325261418, 'timestamp': 1664455872.372715}), ('hp_position', {'value': 550.3577325261418, 'timestamp': 1>
FunctionArgs(name='Base.read', host='host.docker.internal', port=38701, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38701, args=(OrderedDict(), 1664455872.3783102), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38701, args=(OrderedDict(), 1664455872.3783102), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38701, data=OrderedDict([('hp_Delay_1_destination', {'value': 3.8164004743682463, 'timestamp': 1664455872.3783102}), ('hp_Delay_1_position', {'value': 3.81625, 'times>
FunctionResponse(name='HasPosition._read', host='host.docker.internal', port=38701, data=OrderedDict([('hp_Delay_1', {'value': 3.81625, 'timestamp': 1664455872.3783102}), ('hp_Delay_1_destination', {'value': 3.8164004743682463, 'timesta>
FunctionResponse(name='Base.read', host='host.docker.internal', port=38701, data=OrderedDict([('hp_Delay_1', {'value': 3.81625, 'timestamp': 1664455872.3783102}), ('hp_Delay_1_destination', {'value': 3.8164004743682463, 'timestamp': 166>
FunctionArgs(name='Base.read', host='host.docker.internal', port=38702, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38702, args=(OrderedDict(), 1664455872.3846343), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38702, args=(OrderedDict(), 1664455872.3846343), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38702, data=OrderedDict([('hp_Crystal_1_destination', {'value': -2.856083263387436, 'timestamp': 1664455872.3846343}), ('hp_Crystal_1_position', {'value': -2.85535172>
FunctionResponse(name='HasPosition._read', host='host.docker.internal', port=38702, data=OrderedDict([('hp_Crystal_1', {'value': -2.855351721773416, 'timestamp': 1664455872.3846343}), ('hp_Crystal_1_destination', {'value': -2.8560832633>
FunctionResponse(name='Base.read', host='host.docker.internal', port=38702, data=OrderedDict([('hp_Crystal_1', {'value': -2.855351721773416, 'timestamp': 1664455872.3846343}), ('hp_Crystal_1_destination', {'value': -2.856083263387436, '>
FunctionArgs(name='Base.read', host='host.docker.internal', port=38703, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38703, args=(OrderedDict(), 1664455872.3931024), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38703, args=(OrderedDict(), 1664455872.3931024), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38703, data=OrderedDict([('hp_Delay_2_destination', {'value': 39.37194601278524, 'timestamp': 1664455872.3931024}), ('hp_Delay_2_position', {'value': 39.3718582938693>
FunctionResponse(name='HasPosition._read', host='host.docker.internal', port=38703, data=OrderedDict([('hp_Delay_2', {'value': 39.37185829386933, 'timestamp': 1664455872.3931024}), ('hp_Delay_2_destination', {'value': 39.37194601278524,>
FunctionResponse(name='Base.read', host='host.docker.internal', port=38703, data=OrderedDict([('hp_Delay_2', {'value': 39.37185829386933, 'timestamp': 1664455872.3931024}), ('hp_Delay_2_destination', {'value': 39.37194601278524, 'timest>
FunctionArgs(name='Base.read', host='host.docker.internal', port=38704, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38704, args=(OrderedDict(), 1664455872.3995526), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38704, args=(OrderedDict(), 1664455872.3995526), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38704, data=OrderedDict([('hp_Crystal_2_destination', {'value': 6.5421741153368, 'timestamp': 1664455872.3995526}), ('hp_Crystal_2_position', {'value': 6.541487839372>
FunctionResponse(name='HasPosition._read', host='host.docker.internal', port=38704, data=OrderedDict([('hp_Crystal_2', {'value': 6.541487839372575, 'timestamp': 1664455872.3995526}), ('hp_Crystal_2_destination', {'value': 6.542174115336>
FunctionResponse(name='Base.read', host='host.docker.internal', port=38704, data=OrderedDict([('hp_Crystal_2', {'value': 6.541487839372575, 'timestamp': 1664455872.3995526}), ('hp_Crystal_2_destination', {'value': 6.5421741153368, 'time>
FunctionArgs(name='Base.read', host='host.docker.internal', port=38705, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38705, args=(OrderedDict(), 1664455872.4057093), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38705, args=(OrderedDict(), 1664455872.4057093), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38705, data=OrderedDict([('hp_SHG_Crystal_destination', {'value': 0.26, 'timestamp': 1664455872.4057093}), ('hp_SHG_Crystal_position', {'value': 0.25983514046328215, >
FunctionResponse(name='HasPosition._read', host='host.docker.internal', port=38705, data=OrderedDict([('hp_SHG_Crystal', {'value': 0.25983514046328215, 'timestamp': 1664455872.4057093}), ('hp_SHG_Crystal_destination', {'value': 0.26, 't>
FunctionResponse(name='Base.read', host='host.docker.internal', port=38705, data=OrderedDict([('hp_SHG_Crystal', {'value': 0.25983514046328215, 'timestamp': 1664455872.4057093}), ('hp_SHG_Crystal_destination', {'value': 0.26, 'timestamp>
FunctionArgs(name='Base.read', host='host.docker.internal', port=38706, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38706, args=(OrderedDict(), 1664455872.4133527), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38706, args=(OrderedDict(), 1664455872.4133527), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38706, data=OrderedDict([('hp_RP5_Stage_destination', {'value': 213.1696142857143, 'timestamp': 1664455872.4133527}), ('hp_RP5_Stage_position', {'value': 213.16961428>
FunctionResponse(name='HasPosition._read', host='host.docker.internal', port=38706, data=OrderedDict([('hp_RP5_Stage', {'value': 213.1696142857143, 'timestamp': 1664455872.4133527}), ('hp_RP5_Stage_destination', {'value': 213.1696142857>
FunctionResponse(name='Base.read', host='host.docker.internal', port=38706, data=OrderedDict([('hp_RP5_Stage', {'value': 213.1696142857143, 'timestamp': 1664455872.4133527}), ('hp_RP5_Stage_destination', {'value': 213.1696142857143, 'ti>
FunctionArgs(name='Base.read', host='host.docker.internal', port=38707, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38707, args=(OrderedDict(), 1664455872.4197304), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38707, args=(OrderedDict(), 1664455872.4197304), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38707, data=OrderedDict([('hp_WS_Wheel_1_destination', {'value': 119.99294329432945, 'timestamp': 1664455872.4197304}), ('hp_WS_Wheel_1_position', {'value': 119.99294>
FunctionResponse(name='HasPosition._read', host='host.docker.internal', port=38707, data=OrderedDict([('hp_WS_Wheel_1', {'value': 119.99294329432945, 'timestamp': 1664455872.4197304}), ('hp_WS_Wheel_1_destination', {'value': 119.9929432>
FunctionResponse(name='Base.read', host='host.docker.internal', port=38707, data=OrderedDict([('hp_WS_Wheel_1', {'value': 119.99294329432945, 'timestamp': 1664455872.4197304}), ('hp_WS_Wheel_1_destination', {'value': 119.99294329432945,>
FunctionArgs(name='Base.read', host='host.docker.internal', port=38708, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38708, args=(OrderedDict(), 1664455872.4277976), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38708, args=(OrderedDict(), 1664455872.4277976), kwargs={})

I hope this is helpful in tracking down the source of our problem.

@ddkohler
Contributor Author

ddkohler commented Sep 29, 2022

If I interpret this correctly, the problem is that calls to the hp dependent on port 38708 never get back to the run engine? @kelsonoram does that port correspond to the thorlabs motor controlling the harmonic crystal?

@kelsonoram

@ddkohler The port 38708 corresponds to the lightcon-topas4-motor for the hp_RP6_Stage.
Since this is the first time we had this extensive logging, I don't know if the problem has anything to do with this motor specifically or if it just happened to be this motor among all the hp motors.

@ddkohler
Contributor Author

Whichever daemon it actually is, I think this log rules out bluesky as the problem and suggests a deeper look into yaqc-bluesky processes.

@ksunden
Member

ksunden commented Sep 29, 2022

Question: what are the keys for the position identifiers for that motor?

import yaqc

c = yaqc.Client(38708)
c.get_position_identifier_options()

If I remember correctly for that motor it is {"ON", "OFF"} (not sure of order, but that doesn't matter)

If it is empty (no discrete positions configured), then I'm really not sure what is going on.

If it is not empty, I may have an idea regarding our dependency, fastavro, not erroring in a case where I would have expected it to when reading strings that are incomplete, and thus causing some problems on rare occasions when the bytes don't make it in time to be parsed.

(specifically, if it looks like I should have a length 3 string, but only provide 2 bytes, it will just give you the length 2 string and leave the last byte hanging)
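
A minimal sketch of the suspected behavior (illustrative only; whether an installed fastavro actually behaves this way may depend on the version and on whether the C extension is in use):

import io

from fastavro import schemaless_reader, schemaless_writer

schema = {"type": "string"}

# encode a 3-character string, then drop the last byte to simulate bytes
# that haven't arrived from the socket yet
buf = io.BytesIO()
schemaless_writer(buf, schema, "abc")
truncated = io.BytesIO(buf.getvalue()[:-1])

# if truncation is silent, this prints "ab" and leaves the missing byte
# hanging rather than raising; that is the case I would have expected to error
try:
    print(schemaless_reader(truncated, schema))
except Exception as e:
    print("reader raised:", e)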

I still do not have an explanation for how that causes the CPU utilization we are seeing, but was something I noticed when thinking on this, that would be quite rare, and dependent on timing we don't have control over fully.

Blaise and I have discussed another round of logging, which we will try to implement.
That said, if this is the reason, then it's extremely possible that logging in the critical portion of the code to observe this will actually end up shifting probabilities to where we don't see it (but it isn't "fixed", just "hiding")

Soooo, if we do implement the next round of logging and then do not see the stall happen for a significant amount of time, there may be a round of logging where we take out the logs in the critical portion, but try to catch it afterwards and try to trigger it again so we are confident about what is happening.

@ksunden
Member

ksunden commented Sep 29, 2022

I am additionally curious about the WS_Wheel_1 read just prior: in the cut-off portion of that log, what was the identifier that it read? It could have been that one that caused it, and if it was truncated from what we expect, that would confirm this theory (though again, not sure on the mechanism of it taxing the CPU...).

That motor has longer names than most of the others, and thus would be more likely to have this particular error.

@ksunden
Member

ksunden commented Sep 29, 2022

This also explains why it is a problem for Waldo and not the other systems, as none of the other systems use discrete motors.

Still, I wish I had an explanation for the mechanism of the high CPU utilization... that is still a mystery to me

@kelsonoram

Here are the position identifier options:

In [3]: c = yaqc.Client(38708)
   ...: c.get_position_identifier_options()
Out[3]: ['Rotate 90 deg', 'No rotation']

@ksunden
Member

ksunden commented Sep 29, 2022

Can you do 38707 as well, please?

@kelsonoram

Yes, I don't know why I didn't add that one before:

In [2]: c = yaqc.Client(38707)
   ...: c.get_position_identifier_options()
Out[2]: ['600-765', '755-1030', '1030-1385', '1385-1550', '1550-1970', '1970-2600']

@ksunden
Member

ksunden commented Sep 29, 2022

Okay, I've wavered back and forth a bit. The mechanics of this are super subtle and timing-based in ways that are difficult to describe, and potentially self-correcting in many instances, which makes it harder to find.

If I'm right, we may actually have many of the strings that are read be truncated; we just don't notice because we don't usually care about the strings (they don't make it to the wt5 files, for instance)

@kelsonoram

Another stall, another series of logs:

htop

htop results 20220930

Hooks log

The stall occurred in hp once again, but at a different motor than yesterday. It failed while waiting for the response from the hp_SHG_Crystal at port 38705. Immediately before polling this motor, it was talking to the hp_Crystal_2 motor (at port 38704). The relevant logs are below:

Msg('checkpoint', obj=None, args=(), kwargs={}, run=None)
checkpoint(None, *() **{}, run=None)
Msg('set', obj=<yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)>, args=(-0.4850000000000001,), kwargs={'group': 'set-2fb4ac'}, run=None)
set(<yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)>, *(-0.4850000000000001,) **{'group': 'set-2fb4ac'}, run=None)
FunctionArgs(name='HasPosition.set', host='host.docker.internal', port=39001, args=(-0.4850000000000001,), kwargs={})
FunctionArgs(name='Base._wait_until_still', host='host.docker.internal', port=39001, args=(), kwargs={})
FunctionResponse(name='Base._wait_until_still', host='host.docker.internal', port=39001, data=<yaqc_bluesky._status.Status object at 0x7fd32804dc70>)
FunctionResponse(name='HasPosition.set', host='host.docker.internal', port=39001, data=<yaqc_bluesky._status.Status object at 0x7fd32804dc70>)
Msg('wait', obj=None, args=(), kwargs={'group': 'set-2fb4ac'}, run=None)
wait(None, *() **{'group': 'set-2fb4ac'}, run=None)
{<yaqc_bluesky._status.Status object at 0x7fd32804dc70>}
The object <yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)> reports set is done with status True
None
Msg('trigger', obj=<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, args=(), kwargs={'group': 'trigger-ad2e16'}, run=None)
trigger(<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, *() **{'group': 'trigger-ad2e16'}, run=None)
FunctionArgs(name='HasMeasureTrigger.trigger', host='host.docker.internal', port=39003, args=(), kwargs={})
FunctionArgs(name='Base._wait_until_still', host='host.docker.internal', port=39003, args=(), kwargs={})
FunctionResponse(name='Base._wait_until_still', host='host.docker.internal', port=39003, data=<yaqc_bluesky._status.Status object at 0x7fd349ab30d0>)
FunctionResponse(name='HasMeasureTrigger.trigger', host='host.docker.internal', port=39003, data=<yaqc_bluesky._status.Status object at 0x7fd349ab30d0>)
Msg('trigger', obj=<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, args=(), kwargs={'group': 'trigger-ad2e16'}, run=None)
trigger(<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, *() **{'group': 'trigger-ad2e16'}, run=None)
FunctionArgs(name='Base.trigger', host='host.docker.internal', port=38001, args=(), kwargs={})
FunctionResponse(name='Base.trigger', host='host.docker.internal', port=38001, data=<yaqc_bluesky._status.Status object at 0x7fd32aa77b20>)
Msg('trigger', obj=<yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)>, args=(), kwargs={'group': 'trigger-ad2e16'}, run=None)
The object <yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)> reports trigger is done with status True.
trigger(<yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)>, *() **{'group': 'trigger-ad2e16'}, run=None)
FunctionArgs(name='Base.trigger', host='host.docker.internal', port=39001, args=(), kwargs={})
FunctionResponse(name='Base.trigger', host='host.docker.internal', port=39001, data=<yaqc_bluesky._status.Status object at 0x7fd32aa77160>)
Msg('wait', obj=None, args=(), kwargs={'group': 'trigger-ad2e16'}, run=None)
The object <yaqc_bluesky.Device to host.docker.internal:39001 (attune-delay:d1)> reports trigger is done with status True.
wait(None, *() **{'group': 'trigger-ad2e16'}, run=None)
{<yaqc_bluesky._status.Status object at 0x7fd32aa77b20>, <yaqc_bluesky._status.Status object at 0x7fd349ab30d0>, <yaqc_bluesky._status.Status object at 0x7fd32aa77160>}
The object <yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)> reports trigger is done with status True.
None
Msg('create', obj=None, args=(), kwargs={'name': 'primary'}, run=None)
create(None, *() **{'name': 'primary'}, run=None)
Msg('read', obj=<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, args=(), kwargs={}, run=None)
read(<yaqc_bluesky.Device to host.docker.internal:39003 (gage-chopping:daq)>, *() **{}, run=None)
FunctionArgs(name='Base.read', host='host.docker.internal', port=39003, args=(), kwargs={})
FunctionArgs(name='IsSensor._read', host='host.docker.internal', port=39003, args=(OrderedDict(), 1664550137.190457), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=39003, args=(OrderedDict(), 1664550137.190457), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=39003, data=OrderedDict())
FunctionResponse(name='IsSensor._read', host='host.docker.internal', port=39003, data=OrderedDict([('daq_ai0', {'value': -1.045074462890625e-05, 'timestamp': 1664550137.190457}), ('daq_ai1', {'value': -2.420043945312493e-06, 'timestamp'>
FunctionResponse(name='Base.read', host='host.docker.internal', port=39003, data=OrderedDict([('daq_ai0', {'value': -1.045074462890625e-05, 'timestamp': 1664550137.190457}), ('daq_ai1', {'value': -2.420043945312493e-06, 'timestamp': 166>
Msg('read', obj=<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, args=(), kwargs={}, run=None)
read(<yaqc_bluesky.Device to host.docker.internal:38001 (attune:hp)>, *() **{}, run=None)
FunctionArgs(name='Base.read', host='host.docker.internal', port=38001, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38001, args=(OrderedDict(), 1664550137.1937857), kwargs={})
FunctionArgs(name='HasDependent._read', host='host.docker.internal', port=38001, args=(OrderedDict(), 1664550137.1937857), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38001, args=(OrderedDict(), 1664550137.1937857), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38001, data=OrderedDict([('hp_destination', {'value': 549.1488193300384, 'timestamp': 1664550137.1937857}), ('hp_position', {'value': 549.1488193300384, 'timestamp': >
FunctionArgs(name='Base.read', host='host.docker.internal', port=38701, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38701, args=(OrderedDict(), 1664550137.1974971), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38701, args=(OrderedDict(), 1664550137.1974971), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38701, data=OrderedDict([('hp_Delay_1_destination', {'value': 3.8210821724195463, 'timestamp': 1664550137.1974971}), ('hp_Delay_1_position', {'value': 3.8209375, 'tim>
FunctionResponse(name='HasPosition._read', host='host.docker.internal', port=38701, data=OrderedDict([('hp_Delay_1', {'value': 3.8209375, 'timestamp': 1664550137.1974971}), ('hp_Delay_1_destination', {'value': 3.8210821724195463, 'times>
FunctionResponse(name='Base.read', host='host.docker.internal', port=38701, data=OrderedDict([('hp_Delay_1', {'value': 3.8209375, 'timestamp': 1664550137.1974971}), ('hp_Delay_1_destination', {'value': 3.8210821724195463, 'timestamp': 1>
FunctionArgs(name='Base.read', host='host.docker.internal', port=38702, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38702, args=(OrderedDict(), 1664550137.2025602), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38702, args=(OrderedDict(), 1664550137.2025602), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38702, data=OrderedDict([('hp_Crystal_1_destination', {'value': -2.8479981504738907, 'timestamp': 1664550137.2025602}), ('hp_Crystal_1_position', {'value': -2.8473160>
FunctionResponse(name='HasPosition._read', host='host.docker.internal', port=38702, data=OrderedDict([('hp_Crystal_1', {'value': -2.847316064885251, 'timestamp': 1664550137.2025602}), ('hp_Crystal_1_destination', {'value': -2.8479981504>
FunctionResponse(name='Base.read', host='host.docker.internal', port=38702, data=OrderedDict([('hp_Crystal_1', {'value': -2.847316064885251, 'timestamp': 1664550137.2025602}), ('hp_Crystal_1_destination', {'value': -2.8479981504738907, >
FunctionArgs(name='Base.read', host='host.docker.internal', port=38703, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38703, args=(OrderedDict(), 1664550137.2075458), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38703, args=(OrderedDict(), 1664550137.2075458), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38703, data=OrderedDict([('hp_Delay_2_destination', {'value': 39.32391605124373, 'timestamp': 1664550137.2075458}), ('hp_Delay_2_position', {'value': 39.3236443525403>
FunctionResponse(name='HasPosition._read', host='host.docker.internal', port=38703, data=OrderedDict([('hp_Delay_2', {'value': 39.32364435254034, 'timestamp': 1664550137.2075458}), ('hp_Delay_2_destination', {'value': 39.32391605124373,>
FunctionResponse(name='Base.read', host='host.docker.internal', port=38703, data=OrderedDict([('hp_Delay_2', {'value': 39.32364435254034, 'timestamp': 1664550137.2075458}), ('hp_Delay_2_destination', {'value': 39.32391605124373, 'timest>
FunctionArgs(name='Base.read', host='host.docker.internal', port=38704, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38704, args=(OrderedDict(), 1664550137.2118905), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38704, args=(OrderedDict(), 1664550137.2118905), kwargs={})
FunctionResponse(name='Base._read', host='host.docker.internal', port=38704, data=OrderedDict([('hp_Crystal_2_destination', {'value': 6.610320868235731, 'timestamp': 1664550137.2118905}), ('hp_Crystal_2_position', {'value': 6.6111301990>
FunctionResponse(name='HasPosition._read', host='host.docker.internal', port=38704, data=OrderedDict([('hp_Crystal_2', {'value': 6.611130199070004, 'timestamp': 1664550137.2118905}), ('hp_Crystal_2_destination', {'value': 6.610320868235>
FunctionResponse(name='Base.read', host='host.docker.internal', port=38704, data=OrderedDict([('hp_Crystal_2', {'value': 6.611130199070004, 'timestamp': 1664550137.2118905}), ('hp_Crystal_2_destination', {'value': 6.610320868235731, 'ti>
FunctionArgs(name='Base.read', host='host.docker.internal', port=38705, args=(), kwargs={})
FunctionArgs(name='HasPosition._read', host='host.docker.internal', port=38705, args=(OrderedDict(), 1664550137.2165103), kwargs={})
FunctionArgs(name='Base._read', host='host.docker.internal', port=38705, args=(OrderedDict(), 1664550137.2165103), kwargs={})

Position Identifiers

The position identifier options for the daemon whose read never returned (port 38705, hp_SHG_Crystal) and for the one read immediately before it (port 38704, hp_Crystal_2) are below:

hp_SHG_Crystal position identifier

In [2]: c = yaqc.Client(38705)
   ...: c.get_position_identifier_options()
Out[2]: ['Max SHG', 'Off']

hp_Crystal_2 position identifier

In [3]: c = yaqc.Client(38704)
   ...: c.get_position_identifier_options()
Out[3]: []
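Next time the stall reproduces, it may also be worth checking whether the daemons themselves are still answering while the RE worker hangs, to separate a hung daemon from a hung worker. A minimal sketch, run on the host rather than inside the container, assuming the core id() and busy() messages are exposed by these daemons as usual for yaq:

import yaqc

# Ports taken from the log above; adjust to whatever the scan actually uses.
ports = [39001, 39003, 38001, 38701, 38702, 38703, 38704, 38705]

for port in ports:
    try:
        c = yaqc.Client(port)
        print(port, c.id()["name"], "busy:", c.busy())
    except Exception as e:
        print(port, "unreachable:", e)

If 38705 answers happily here while the worker is stuck, that points back at the worker/container side rather than the daemon.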

@kelsonoram

Some other things to note, which are probably not that relevant:

  1. Ever since the most recent extra logging was implemented by @ksunden, the bluesky-cmds GUI has been closing whenever the system stalls. I thought I might have accidentally closed it Wednesday evening, but I made sure to leave the GUI running Thursday night and it had closed by the time I came in.

  2. Ryan has had the ps table's bluesky crash on him in recent weeks. Yesterday (I think; sometime this week for sure), before he restarted Docker, I looked at the stats for the RE-manager container and its CPU was at 100%. That adds to our sample size and suggests this issue may not be specific to Waldo/light conversion. The ps table does behave differently, though: it fails when it tries to start a scan, but a scan that is already running does not fail, which is roughly the opposite of Waldo, where a running scan stalls but adding scans is fine. (A sketch for capturing container CPU/memory stats over time follows this list.)
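Since the CPU symptom differs between incidents (100% on the ps table, ~14% here), it might help to record container stats continuously so the trend across a stall can be reviewed after the fact instead of only at the moment someone notices. A minimal sketch, assuming the Docker CLI is available on the host and writing to a hypothetical stats.csv:

import subprocess, time, datetime

# Append one docker-stats sample per minute to a CSV for later review.
FMT = "{{.Name}},{{.CPUPerc}},{{.MemUsage}}"
with open("stats.csv", "a") as f:
    while True:
        out = subprocess.run(
            ["docker", "stats", "--no-stream", "--format", FMT],
            capture_output=True, text=True,
        ).stdout
        stamp = datetime.datetime.now().isoformat()
        for line in out.splitlines():
            f.write(f"{stamp},{line}\n")
        f.flush()
        time.sleep(60)

Left running in the background, this gives a per-minute CPU/memory history for every container, which should make it obvious whether the 100% CPU state precedes the stall or follows it.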

@ksunden
Member

ksunden commented Sep 30, 2022

Ever since the most recent extra logging was implemented by @ksunden, bluesky-cmds GUI has been closing when it stalls.

The only plausible reason for this that I can think of at the moment is that the OS is killing it to recover resources... and even that doesn't make a ton of sense to me.
I will say that I did notice it was closed when I came in on Tuesday, though I had no reason to believe it was not simply closed intentionally.
At this stage I think chasing that is a red herring, so let's stay the course and try to resolve this with logging.
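If we do want to test the resource-recovery idea cheaply rather than shelve it, one option is to log the GUI process's memory next to the machine's available memory, so that if bluesky-cmds disappears again we can see whether the host was actually under pressure at the time. A minimal sketch, assuming psutil is installed and that the process can be found by matching "bluesky-cmds" in its command line (that match string is a guess and may need adjusting):

import time, psutil

def find_gui():
    # Guess: match any process whose command line mentions bluesky-cmds.
    for p in psutil.process_iter(["pid", "cmdline"]):
        if any("bluesky-cmds" in part for part in (p.info["cmdline"] or [])):
            return p
    return None

with open("gui_memory.log", "a") as f:
    while True:
        gui = find_gui()
        avail = psutil.virtual_memory().available // 2**20
        if gui is None:
            f.write(f"{time.ctime()} gui not running, {avail} MB available\n")
        else:
            rss = gui.memory_info().rss // 2**20
            f.write(f"{time.ctime()} gui rss {rss} MB, {avail} MB available\n")
        f.flush()
        time.sleep(60)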

As for the ps system... I think we are far enough down the path with Waldo that we should first resolve what we see there, then revisit whether the ps system needs anything different.
