OSD removal #186

Merged: 1 commit merged into canonical:main on Sep 6, 2023

Conversation

@sabaini (Collaborator) commented Aug 18, 2023

Implement the `microceph disk remove` command. Update the auto failure domain behaviour to also allow downgrading.

Some updates for flaky single node CI tests.
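
For context, here is a rough sketch of how a `disk remove` subcommand could be wired up with cobra; the command name matches the PR description, but the argument, flag, and wiring below are illustrative assumptions, not the actual microceph CLI.

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

// newDiskRemoveCmd sketches a "disk remove" subcommand. The OSD argument and
// the --bypass-safety-checks flag are assumptions for illustration only.
func newDiskRemoveCmd() *cobra.Command {
	var bypassSafety bool

	cmd := &cobra.Command{
		Use:   "remove <osd-id>",
		Short: "Remove an OSD from the cluster",
		Args:  cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			// The real command would call the removal API here.
			fmt.Printf("removing %s (bypass safety checks: %v)\n", args[0], bypassSafety)
			return nil
		},
	}
	cmd.Flags().BoolVar(&bypassSafety, "bypass-safety-checks", false,
		"skip the stop/destroy safety checks")
	return cmd
}

func main() {
	disk := &cobra.Command{Use: "disk"}
	disk.AddCommand(newDiskRemoveCmd())
	_ = disk.Execute()
}
```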

@wolsen (Collaborator) left a comment

Some inline comments throughout. One thing that remains unclear to me: if you have a partial failure while removing a disk, how does the software behave? Will running the command again pick up where it left off and proceed?

I believe there's a scenario with the current code in which the OSD can be removed from the Ceph cluster, but fail to get removed from the database. I don't think that the code allows for the OSD to still be removed from the database in this scenario, as commands like reweightOSD, outDownOSD, etc. will return an ENOENT when running the ceph commands.

Inline review threads (resolved): microceph/cmd/microceph/disk_remove.go, microceph/database/disk_extras.go, microceph/ceph/crush.go, microceph/ceph/osd.go, microceph/api/disks.go
@lmlg (Contributor) commented Aug 18, 2023

> Some inline comments throughout. One thing that remains unclear to me: if you have a partial failure while removing a disk, how does the software behave? Will running the command again pick up where it left off and proceed?
>
> I believe there's a scenario with the current code in which the OSD can be removed from the Ceph cluster, but fail to get removed from the database. I don't think that the code allows for the OSD to still be removed from the database in this scenario, as commands like reweightOSD, outDownOSD, etc. will return an ENOENT when running the ceph commands.

Personally, I think there should be a "point of no return", past which there may still be errors, but from which the code cannot pick up. In the case of the DB failing to remove the entry, that should only be possible if something is really wrong (like filesystem corruption on the DB files). In case of such an error, I think it would be crucial to let the user know that the OSD is out of the cluster, but that manual intervention may be needed.

@wolsen (Collaborator) commented Aug 22, 2023

> Some inline comments throughout. One thing that remains unclear to me: if you have a partial failure while removing a disk, how does the software behave? Will running the command again pick up where it left off and proceed?
>
> I believe there's a scenario with the current code in which the OSD can be removed from the Ceph cluster, but fail to get removed from the database. I don't think that the code allows for the OSD to still be removed from the database in this scenario, as commands like reweightOSD, outDownOSD, etc. will return an ENOENT when running the ceph commands.
>
> Personally, I think there should be a "point of no return", past which there may still be errors, but from which the code cannot pick up. In the case of the DB failing to remove the entry, that should only be possible if something is really wrong (like filesystem corruption on the DB files). In case of such an error, I think it would be crucial to let the user know that the OSD is out of the cluster, but that manual intervention may be needed.

What do you propose as the point of no return? There are several steps involved, and I think almost all of them can reasonably check, when performed, whether they still need to be performed. They are:

scaleDownFailureDomain
reweightOSD
(safetyCheckStop)
downOSD
stopOSD
(safetyCheckDestroy)
purgeOSD
wipeDevice
removeOSD
deleteFromDatabase

Once you start the purge of the OSD, you will have challenges continuing, though the necessary information should still be in the dqlite database. But if purge, wipe, or remove fail (which is more likely in the event of a disk failure), then you won't be able to delete the OSD from the database without running database commands directly.
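
To make the ordering concrete, here is a rough Go sketch of the sequence above; the step names come from the list in this comment, but the signatures and placeholder bodies are assumptions, not the actual microceph code.

```go
package main

import (
	"context"
	"fmt"
)

// step pairs a human-readable name with an action, so a failure report can
// say exactly which stage broke. Bodies are placeholders for the real
// ceph/database calls.
type step struct {
	name string
	run  func(ctx context.Context, osd int64) error
}

func removalSteps() []step {
	placeholder := func(ctx context.Context, osd int64) error { return nil }
	return []step{
		{"scaleDownFailureDomain", placeholder}, // possibly downgrade the failure domain
		{"reweightOSD", placeholder},            // drain data off the OSD
		{"safetyCheckStop", placeholder},        // refuse to continue if stopping is unsafe
		{"downOSD", placeholder},
		{"stopOSD", placeholder},
		{"safetyCheckDestroy", placeholder},     // refuse to continue if destroying is unsafe
		{"purgeOSD", placeholder},               // past here, re-running gets harder
		{"wipeDevice", placeholder},
		{"removeOSD", placeholder},
		{"deleteFromDatabase", placeholder},     // the proposed "point of no return"
	}
}

// runRemoval executes the steps in order and stops at the first failure,
// reporting which step failed so the operator knows where things stand.
func runRemoval(ctx context.Context, osd int64) error {
	for _, s := range removalSteps() {
		if err := s.run(ctx, osd); err != nil {
			return fmt.Errorf("%s failed for osd.%d: %w", s.name, osd, err)
		}
	}
	return nil
}

func main() {
	if err := runRemoval(context.Background(), 3); err != nil {
		fmt.Println(err)
	}
}
```

The open question in the thread is what should happen inside that loop when a step fails partway through.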

@lmlg (Contributor) commented Aug 22, 2023

> Some inline comments throughout. One thing that remains unclear to me: if you have a partial failure while removing a disk, how does the software behave? Will running the command again pick up where it left off and proceed?
>
> I believe there's a scenario with the current code in which the OSD can be removed from the Ceph cluster, but fail to get removed from the database. I don't think that the code allows for the OSD to still be removed from the database in this scenario, as commands like reweightOSD, outDownOSD, etc. will return an ENOENT when running the ceph commands.
>
> Personally, I think there should be a "point of no return", past which there may still be errors, but from which the code cannot pick up. In the case of the DB failing to remove the entry, that should only be possible if something is really wrong (like filesystem corruption on the DB files). In case of such an error, I think it would be crucial to let the user know that the OSD is out of the cluster, but that manual intervention may be needed.

> What do you propose as the point of no return? There are several steps involved, and I think almost all of them can reasonably check, when performed, whether they still need to be performed. They are:

> scaleDownFailureDomain reweightOSD (safetyCheckStop) downOSD stopOSD (safetyCheckDestroy) purgeOSD wipeDevice removeOSD deleteFromDatabase

> Once you start the purge of the OSD, you will have challenges continuing, though the necessary information should still be in the dqlite database. But if purge, wipe, or remove fail (which is more likely in the event of a disk failure), then you won't be able to delete the OSD from the database without running database commands directly.

I'd argue the last step - the DB removal - should be the point of no return, because any other failure can be worked around via command flags (i.e. the 'force' flag). If they can't be worked around, I'd argue it's a design defect. Operating on the DB is the place where I would draw the line, because it requires not just operator-level knowledge about Ceph, but also about Microceph's internals.
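
One way to express that split, as a sketch only: failures before the database step can be bypassed with a hypothetical --force flag, while a failure of the database removal itself is reported as terminal. The flag name and error wording here are assumptions, not the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// classifyStepError decides what a step failure means for the overall
// removal. The step names match the list earlier in the thread.
func classifyStepError(step string, err error, force bool) error {
	if err == nil {
		return nil
	}
	if step == "deleteFromDatabase" {
		// Point of no return: the OSD is already gone from Ceph, so tell the
		// operator that only the database entry is left to clean up.
		return fmt.Errorf("osd removed from the cluster but not from the database, manual intervention needed: %w", err)
	}
	if force {
		// The operator explicitly chose to push past a recoverable failure.
		log.Printf("ignoring failure in %s because --force was given: %v", step, err)
		return nil
	}
	return fmt.Errorf("%s: %w", step, err)
}

func main() {
	fmt.Println(classifyStepError("wipeDevice", errors.New("input/output error"), true))
	fmt.Println(classifyStepError("deleteFromDatabase", errors.New("constraint failed"), true))
}
```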

@wolsen (Collaborator) commented Aug 29, 2023

> Once you start the purge of the OSD, you will have challenges continuing, though the necessary information should still be in the dqlite database. But if purge, wipe, or remove fail (which is more likely in the event of a disk failure), then you won't be able to delete the OSD from the database without running database commands directly.
>
> I'd argue the last step - the DB removal - should be the point of no return, because any other failure can be worked around via command flags (i.e. the 'force' flag). If they can't be worked around, I'd argue it's a design defect. Operating on the DB is the place where I would draw the line, because it requires not just operator-level knowledge about Ceph, but also about Microceph's internals.

I agree with this particular point. I think that a failure of the database removal is a fatal error and requires fixing the database.

However, my primary point is about what happens if the removal of the device fails due to, say, an I/O error or timeout at the wipe phase. At that point, the remove-disk command cannot be run again, because actions like reweighting the OSD will return an ENOENT error, which will cause the command to fail; the user is then unable to proceed further, much less remove the device from the database.
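
A sketch of the kind of tolerance that would make the command re-runnable: treat "OSD not found" as success for steps that have already completed, so a second run can fall through to the remaining cleanup. How the ENOENT actually surfaces from the ceph CLI wrapper is an assumption here; real code would have to inspect what the wrapper returns.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// ignoreMissingOSD makes a step idempotent: if the OSD is already gone,
// treat the step as done instead of failing the whole command.
func ignoreMissingOSD(err error) error {
	if err == nil || errors.Is(err, syscall.ENOENT) {
		return nil
	}
	return err
}

func main() {
	// Already-removed OSD: the wrapped ENOENT is swallowed, prints <nil>.
	fmt.Println(ignoreMissingOSD(fmt.Errorf("reweight osd.3: %w", syscall.ENOENT)))
	// Any other failure still propagates.
	fmt.Println(ignoreMissingOSD(errors.New("ceph command timed out")))
}
```

Steps such as reweightOSD or downOSD could be wrapped this way, while the database deletion keeps failing loudly.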

@UtkarshBhatthere (Contributor) commented

#28

@UtkarshBhatthere linked an issue Aug 30, 2023 that may be closed by this pull request
@sabaini force-pushed the feature/remove-osd branch 2 times, most recently from bca0149 to af82acd on August 30, 2023 17:28
@sabaini (Collaborator, Author) commented Aug 30, 2023

Thanks for all the valuable input. I've pushed an update, hopefully addressing the comments.

With regard to failure modes and robustness, I've updated the semantics in the following way:

  • add a guard against operating on OSDs that are not actually present in the cluster
  • change wiping so that a device fault is not fatal anymore

Let me know what you think
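
Roughly what those two changes could look like, as a sketch only; haveOSD and the other helpers below are stand-ins for illustration, not the actual microceph functions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// removeWithGuard sketches the updated semantics: refuse to operate on an OSD
// the cluster does not know about, and downgrade a wipe failure from fatal to
// a warning so a faulty disk cannot block the rest of the removal.
func removeWithGuard(ctx context.Context, osd int64) error {
	present, err := haveOSD(ctx, osd)
	if err != nil {
		return fmt.Errorf("checking osd.%d: %w", osd, err)
	}
	if !present {
		// Guard: never start tearing down an OSD the cluster has no record of.
		return fmt.Errorf("osd.%d is not present in the cluster", osd)
	}
	if err := purgeOSD(ctx, osd); err != nil { // reweight/stop/purge elided
		return err
	}
	if err := wipeDevice(ctx, osd); err != nil {
		log.Printf("warning: wiping device for osd.%d failed: %v", osd, err)
	}
	return deleteFromDatabase(ctx, osd)
}

// Trivial stand-ins so the sketch compiles.
func haveOSD(ctx context.Context, osd int64) (bool, error)    { return osd == 1, nil }
func purgeOSD(ctx context.Context, osd int64) error           { return nil }
func wipeDevice(ctx context.Context, osd int64) error         { return errors.New("input/output error") }
func deleteFromDatabase(ctx context.Context, osd int64) error { return nil }

func main() {
	fmt.Println(removeWithGuard(context.Background(), 1)) // logs wipe warning, then prints <nil>
	fmt.Println(removeWithGuard(context.Background(), 2)) // guard error
}
```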

Inline review threads on microceph/ceph/osd.go (resolved)
Implement `microceph disk remove` command. Update auto failure domain
behaviour to also allow downgrading.

Some updates for flaky single node CI tests.

Update go modules.

Signed-off-by: Peter Sabaini <peter.sabaini@canonical.com>
@sabaini (Collaborator, Author) commented Sep 1, 2023

Hola @lmlg, I've pushed an update and answered some comments, please take a look :-)

@sabaini requested a review from lmlg on September 5, 2023 11:13
@lmlg (Contributor) left a comment

LGTM, just a few questions overall.

Inline review threads on microceph/ceph/osd.go (resolved)
@lmlg (Contributor) left a comment

LGTM!

@sabaini merged commit d451ee6 into canonical:main on Sep 6, 2023
7 checks passed
@UtkarshBhatthere linked an issue Sep 13, 2023 that may be closed by this pull request

Successfully merging this pull request may close these issues.

  • Error adding disk after expanding cluster
  • Implement microceph disk remove
4 participants