OSD removal #186
Conversation
Some inline comments throughout. One thing that remains unclear to me is how the software behaves if there is a partial failure while removing a disk. Will running the command again pick up where it left off and proceed?
I believe there's a scenario with the current code in which the OSD can be removed from the Ceph cluster but then fails to be removed from the database. I don't think the code allows the OSD to still be removed from the database in this scenario, as steps like reweightOSD, outDownOSD, etc. will get ENOENT back from the underlying ceph commands.
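For illustration, one way to make such a step tolerant of ENOENT so that a re-run can proceed; the helper name and the error-text matching below are just a sketch, not the actual microceph code:

```go
// Hypothetical sketch only: a reweight step that treats ENOENT from the ceph
// CLI as "already done", so a re-run after a partial failure can proceed.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func reweightOSD(osd int64, weight float64) error {
	out, err := exec.Command("ceph", "osd", "crush", "reweight",
		fmt.Sprintf("osd.%d", osd), fmt.Sprintf("%f", weight)).CombinedOutput()
	if err != nil {
		// Assumption: the ceph CLI reports a missing OSD as "Error ENOENT: ...".
		if strings.Contains(string(out), "ENOENT") {
			return nil // OSD already gone from the CRUSH map, nothing to do
		}
		return fmt.Errorf("reweight osd.%d: %w: %s", osd, err, out)
	}
	return nil
}

func main() {
	if err := reweightOSD(1, 0); err != nil {
		fmt.Println(err)
	}
}
```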
Personally, I think there should be a "point of no return", upon which there may be errors, but from where the code cannot pick up. In the case of the DB failing to remove the entry, that should only be possible if something is really wrong (like filesystem corruption on the DB files). In case of such an error, I think it would be crucial to let the user know that the OSD is out of the cluster, but that manual intervention may be needed.
What do you propose as the point of no return? There are several steps involved, and I think almost all of them can reasonably check whether they still need to be performed. They start with scaleDownFailureDomain; a simplified sketch of the whole sequence follows below. Once you start the purge of the OSD you will have challenges continuing, though the necessary information should still be in the dqlite database. But if purge, wipe or remove fail (which is more likely in the event of a disk failure), then you won't be able to delete the OSD from the database without running database commands directly.
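Roughly, the sequence looks like this (a simplified sketch with illustrative stand-in names, not the exact implementation):

```go
// Simplified sketch of the removal sequence discussed above; the step
// functions are illustrative stubs, not the actual microceph code.
package main

import "fmt"

func scaleDownFailureDomain() error               { return nil }
func reweightOSD(osd int64, weight float64) error { return nil }
func outDownOSD(osd int64) error                  { return nil }
func purgeOSD(osd int64) error                    { return nil }
func wipeDevice(osd int64) error                  { return nil }
func removeFromDatabase(osd int64) error          { return nil }

func removeOSD(osd int64) error {
	// Each of these early steps can check whether it still needs to run,
	// so retrying before the purge is safe.
	if err := scaleDownFailureDomain(); err != nil {
		return err
	}
	if err := reweightOSD(osd, 0); err != nil {
		return err
	}
	if err := outDownOSD(osd); err != nil {
		return err
	}
	// From the purge onwards the OSD no longer exists in Ceph, so a retry
	// must tolerate "not found" errors for the remaining Ceph-side steps.
	if err := purgeOSD(osd); err != nil {
		return fmt.Errorf("purge osd.%d: %w", osd, err)
	}
	if err := wipeDevice(osd); err != nil {
		return fmt.Errorf("wipe osd.%d: %w", osd, err)
	}
	// A failure here leaves a stale record in the dqlite database that can
	// only be cleaned up by operating on the database directly.
	if err := removeFromDatabase(osd); err != nil {
		return fmt.Errorf("osd.%d removed from Ceph but still in the database: %w", osd, err)
	}
	return nil
}

func main() {
	if err := removeOSD(1); err != nil {
		fmt.Println(err)
	}
}
```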
I'd argue the last step, the DB removal, should be the point of no return, because any other failure can be worked around via command flags (i.e. a 'force' flag). If they can't be worked around, I'd argue that's a design defect. Operating on the DB is where I would draw the line, because it requires not just operator-level knowledge of Ceph, but knowledge of MicroCeph's internals.
I agree with this particular point. I think that a database removal failure is a fatal error and requires fixing the database. However, my primary point is about the case where removal of the device fails due to, say, an I/O error or timeout at the wipe phase. At that point, the remove-disk command cannot be run again, because actions like the reweight of the OSD will return an ENOENT error, which will cause the command to fail; the user is then unable to proceed further, much less remove the device from the database.
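An alternative to special-casing ENOENT would be for each Ceph-side step to first check current cluster state and skip work that is already done, e.g. (illustrative sketch only):

```go
// Illustrative only: query current state so that a re-run of remove-disk can
// skip steps that are already done instead of failing on ENOENT.
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// osdExists reports whether the given OSD id is still known to the cluster,
// based on `ceph osd ls` (which prints one OSD id per line).
func osdExists(osd int64) (bool, error) {
	out, err := exec.Command("ceph", "osd", "ls").Output()
	if err != nil {
		return false, fmt.Errorf("ceph osd ls: %w", err)
	}
	want := strconv.FormatInt(osd, 10)
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if strings.TrimSpace(line) == want {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	exists, err := osdExists(1)
	if err != nil {
		fmt.Println(err)
		return
	}
	if !exists {
		fmt.Println("osd.1 already gone from Ceph, skipping Ceph-side steps")
	}
}
```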
Force-pushed from bca0149 to af82acd
Thanks for all the valuable input. I've pushed an update, hopefully addressing the comments. With regard to failure modes and robustness, I've updated the semantics. Let me know what you think.
Force-pushed from af82acd to 78d273b
Implement `microceph disk remove` command. Update auto failure domain behaviour to also allow downgrading. Some updates for flaky single node CI tests. Update go modules. Signed-off-by: Peter Sabaini <peter.sabaini@canonical.com>
Force-pushed from 78d273b to c514af3
Hola @lmlg, I've pushed an update and answered some comments, please take a look :-)
LGTM, just a few questions overall.
LGTM!
Implement `microceph disk remove` command. Update auto failure domain behaviour to also allow downgrading. Some updates for flaky single node CI tests.
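As a rough illustration of the CLI side, a `disk remove` subcommand could be wired up with cobra along these lines; the flag name and behaviour shown are assumptions for illustration, not necessarily what this PR implements:

```go
// Illustrative cobra wiring for a "disk remove" subcommand. The flag name and
// behaviour are assumptions for illustration only.
package main

import (
	"fmt"
	"os"
	"strconv"

	"github.com/spf13/cobra"
)

func main() {
	var bypassSafety bool // hypothetical flag, for illustration

	cmd := &cobra.Command{
		Use:   "remove <osd-id>",
		Short: "Remove an OSD from the cluster",
		Args:  cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			osd, err := strconv.ParseInt(args[0], 10, 64)
			if err != nil {
				return fmt.Errorf("invalid OSD id %q: %w", args[0], err)
			}
			// The real command would call into the removal sequence here.
			fmt.Printf("would remove osd.%d (bypass safety checks: %v)\n", osd, bypassSafety)
			return nil
		},
	}
	cmd.Flags().BoolVar(&bypassSafety, "bypass-safety-checks", false,
		"skip safety checks before removal (illustrative)")

	if err := cmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```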