-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
multipathd.service systemctl service should have restart enabled. #100
Comments
I feel pretty confident with setting Restart=on-abort Possibly with modified StartLimits. I'm a little worried that if we restart on timeouts, we could get into a situation were some block device issue is keeping multipathd from starting up correctly, and we just keep timing out over and over again. So we'd need something like (for example) StartLimitBurst=3 and we'd possibly need to set TimeoutSec to make sure it makes sense with our limits. Thoughts, @mwilck |
Just a warning about a possible issue. If multipathd crashes while doing the DM_TABLE_LOAD_CMD ioctl, then you can end up with partially created device. When multipathd re-starts then it will log the message: configure failed at map discovery and multipath -ll will return 7793.879783 | libmp_mapinfo: map mpatha has no targets dmsetup shows: dmsetup ls To fix the issue you need to run dmsetup remove to force the removal or if you add/remove paths then it seems re-finish the setup and you get a device that can be used. |
Oh. Thanks for the heads-up. We should actually handle that regardless of whether we enable auto-restarts. Since you can reproduce that just by running:
dm devices with no tables certainly aren't common, but they're also not that improbable. If they don't have a mpath dm uid, we should just ignore them. If they do have a dm uid that starts with "mpath-" it's a little trickier. Generally, multipath feels free to manage any device that has a single target of type "multipath" and a uid that starts with "mpath-". But in this case, there is no target information at all, just the "mpath-" prefixed uid. I feel like multipath should probably check for these and clean them up anyways, since they are far more likely to be failed multipath devices than anything else, but I'd be interested in hearing any objections. |
This makes sense. The defaults for
If there were some block device issue, one or more multipathd threads would hang in uninterruptible sleep, and systemd would probably fail to stop the service in the first place. Therefore I'm not too worried about too many restarts. With async path checkers, I think that such a situation should hardly ever occur. So I believe that With a very high number of block devices, a service (re)start might take a long time, but we can't do much about that upstream. We could just write documentation how to increase |
Have you seen actual evidence of this happening? If yes, do you have any insight why the crash happened?
Hm, so this is perhaps a regression of the I agree that multipathd should clean up these maps as suggested by @bmarzins. |
@mwilck I have not seen this in production. @Sasthan-SK had asked me about restart and so while reviewing the code I noticed it. I have only seen it in testing where I sigkill -9 multipathd when it's doing that ioctl. |
I have a patch that I'll be sending shortly that just moves the check for the UUID prefix before the checks of the target info to avoid error messages about the target number for devices that aren't supposed to be checked by multipathd at all, and also doesn't fail dm_get_maps() for DMP_NOT_FOUND. |
Except the @mikechristie just brought up a case where multipathd would die during startup. If there were a large number of paths, I could see not getting to map_discovery(), where this error occurred, for over 2 seconds. That would mean that multipathd would keep getting restarted forever. So I was thinking of bumping up StartLimitIntervalSec, perhaps not as much as I suggested earlier, but I would feel better with something like StartLimitIntervalSec=30s
It probably is, and I'm probably just being paranoid. multipath getting stuck in interruptible sleep isn't something I really see, so endless restarts and timeouts isn't very likely. |
Right, good point. I wonder if it's wise that we signal readiness to systemd only after having set up all maps. It doesn't play well with systemd's restart logic, which seems to be targeted primarily at services that restart quickly, at least with the default settings. As for being paranoid, I don't recall a single case of multipathd failing to start in production (as opposed to multiple cases of multipathd failing to stop) in the last 7 years. So we can set the restart limits quite strictly without risking anything, like
It'd still be an improvement, because we currently don't restart at all. |
The reason I switched to thinking about a smaller StartLimitIntervalSec is that once you hit StartLimitBurst, systemd will stop even manual restarts until StartLimitIntervalSec has expired. On the other hand, unless we do something like you suggest, if multipathd times out on startup (by default after 90 seconds), it will just keep getting restarted forever. And if multipathd does get stopped from restarting, a |
Right, I'd forgotten about that. Maybe 600 was over-agressive. 300 (slightly more than 3x90) might be a good guess for now. |
Observed that https://github.com/opensvc/multipath-tools/blob/master/multipathd/multipathd.service.in does not have
Restart
enabled. In case of any multipathd crash/failures, service would not be auto-restarted. Ideally it should be enabled.Is there a specific reason to not enable it ?
Thanks,
SK
The text was updated successfully, but these errors were encountered: