OSD removal #186

Merged: 1 commit merged into canonical:main on Sep 6, 2023

Conversation

@sabaini (Collaborator) commented Aug 18, 2023

Implement the `microceph disk remove` command. Update the auto failure domain behaviour to also allow downgrading.

Some updates for flaky single node CI tests.
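
For context, here is a rough sketch of how a `disk remove` subcommand could be wired up with cobra; the command name matches the PR description, but the argument, flag, and wiring below are illustrative assumptions, not the actual microceph CLI.

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

// newDiskRemoveCmd sketches a "disk remove" subcommand. The OSD argument and
// the --bypass-safety-checks flag are assumptions for illustration only.
func newDiskRemoveCmd() *cobra.Command {
	var bypassSafety bool

	cmd := &cobra.Command{
		Use:   "remove <osd-id>",
		Short: "Remove an OSD from the cluster",
		Args:  cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			// The real command would call the removal API here.
			fmt.Printf("removing %s (bypass safety checks: %v)\n", args[0], bypassSafety)
			return nil
		},
	}
	cmd.Flags().BoolVar(&bypassSafety, "bypass-safety-checks", false,
		"skip the stop/destroy safety checks")
	return cmd
}

func main() {
	disk := &cobra.Command{Use: "disk"}
	disk.AddCommand(newDiskRemoveCmd())
	_ = disk.Execute()
}
```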

@wolsen (Collaborator) left a comment

Some inline comments throughout. One thing that remains unclear to me: if you have a partial failure while removing a disk, how does the software behave? Will running the command again pick up where it left off and proceed?

I believe there's a scenario with the current code in which the OSD can be removed from the Ceph cluster, but fail to get removed from the database. I don't think that the code allows for the OSD to still be removed from the database in this scenario, as commands like reweightOSD, outDownOSD, etc. will return an ENOENT when running the ceph commands.

Inline review threads (resolved): microceph/cmd/microceph/disk_remove.go, microceph/database/disk_extras.go, microceph/ceph/crush.go, microceph/ceph/osd.go, microceph/api/disks.go
@lmlg (Contributor) commented Aug 18, 2023

> Some inline comments throughout. One thing that remains unclear to me: if you have a partial failure while removing a disk, how does the software behave? Will running the command again pick up where it left off and proceed?
>
> I believe there's a scenario with the current code in which the OSD can be removed from the Ceph cluster, but fail to get removed from the database. I don't think that the code allows for the OSD to still be removed from the database in this scenario, as commands like reweightOSD, outDownOSD, etc. will return an ENOENT when running the ceph commands.

Personally, I think there should be a "point of no return", past which there may still be errors, but from which the code cannot pick up. In the case of the DB failing to remove the entry, that should only be possible if something is really wrong (like filesystem corruption on the DB files). In case of such an error, I think it would be crucial to let the user know that the OSD is out of the cluster, but that manual intervention may be needed.

@wolsen (Collaborator) commented Aug 22, 2023

> Some inline comments throughout. One thing that remains unclear to me: if you have a partial failure while removing a disk, how does the software behave? Will running the command again pick up where it left off and proceed?
>
> I believe there's a scenario with the current code in which the OSD can be removed from the Ceph cluster, but fail to get removed from the database. I don't think that the code allows for the OSD to still be removed from the database in this scenario, as commands like reweightOSD, outDownOSD, etc. will return an ENOENT when running the ceph commands.
>
> Personally, I think there should be a "point of no return", past which there may still be errors, but from which the code cannot pick up. In the case of the DB failing to remove the entry, that should only be possible if something is really wrong (like filesystem corruption on the DB files). In case of such an error, I think it would be crucial to let the user know that the OSD is out of the cluster, but that manual intervention may be needed.

What do you propose as the point of no return? There are several steps involved, and I think almost all of them can reasonably check, when performed, whether they still need to be performed. They are:

scaleDownFailureDomain
reweightOSD
(safetyCheckStop)
downOSD
stopOSD
(safetyCheckDestroy)
purgeOSD
wipeDevice
removeOSD
deleteFromDatabase

Once you start the purge of the OSD, you will have challenges continuing, though the necessary information should still be in the dqlite database. But if purge, wipe, or remove fail (which is more likely in the event of a disk failure), then you won't be able to delete the OSD from the database without running database commands directly.
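
To make the ordering concrete, here is a rough Go sketch of the sequence above; the step names come from the list in this comment, but the signatures and placeholder bodies are assumptions, not the actual microceph code.

```go
package main

import (
	"context"
	"fmt"
)

// step pairs a human-readable name with an action, so a failure report can
// say exactly which stage broke. Bodies are placeholders for the real
// ceph/database calls.
type step struct {
	name string
	run  func(ctx context.Context, osd int64) error
}

func removalSteps() []step {
	placeholder := func(ctx context.Context, osd int64) error { return nil }
	return []step{
		{"scaleDownFailureDomain", placeholder}, // possibly downgrade the failure domain
		{"reweightOSD", placeholder},            // drain data off the OSD
		{"safetyCheckStop", placeholder},        // refuse to continue if stopping is unsafe
		{"downOSD", placeholder},
		{"stopOSD", placeholder},
		{"safetyCheckDestroy", placeholder},     // refuse to continue if destroying is unsafe
		{"purgeOSD", placeholder},               // past here, re-running gets harder
		{"wipeDevice", placeholder},
		{"removeOSD", placeholder},
		{"deleteFromDatabase", placeholder},     // the proposed "point of no return"
	}
}

// runRemoval executes the steps in order and stops at the first failure,
// reporting which step failed so the operator knows where things stand.
func runRemoval(ctx context.Context, osd int64) error {
	for _, s := range removalSteps() {
		if err := s.run(ctx, osd); err != nil {
			return fmt.Errorf("%s failed for osd.%d: %w", s.name, osd, err)
		}
	}
	return nil
}

func main() {
	if err := runRemoval(context.Background(), 3); err != nil {
		fmt.Println(err)
	}
}
```

The open question in the thread is what should happen inside that loop when a step fails partway through.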

@lmlg (Contributor) commented Aug 22, 2023

> Some inline comments throughout. One thing that remains unclear to me: if you have a partial failure while removing a disk, how does the software behave? Will running the command again pick up where it left off and proceed?
>
> I believe there's a scenario with the current code in which the OSD can be removed from the Ceph cluster, but fail to get removed from the database. I don't think that the code allows for the OSD to still be removed from the database in this scenario, as commands like reweightOSD, outDownOSD, etc. will return an ENOENT when running the ceph commands.
>
> Personally, I think there should be a "point of no return", past which there may still be errors, but from which the code cannot pick up. In the case of the DB failing to remove the entry, that should only be possible if something is really wrong (like filesystem corruption on the DB files). In case of such an error, I think it would be crucial to let the user know that the OSD is out of the cluster, but that manual intervention may be needed.

> What do you propose as the point of no return? There are several steps involved, and I think almost all of them can reasonably check, when performed, whether they still need to be performed. They are:

> scaleDownFailureDomain reweightOSD (safetyCheckStop) downOSD stopOSD (safetyCheckDestroy) purgeOSD wipeDevice removeOSD deleteFromDatabase

> Once you start the purge of the OSD, you will have challenges continuing, though the necessary information should still be in the dqlite database. But if purge, wipe, or remove fail (which is more likely in the event of a disk failure), then you won't be able to delete the OSD from the database without running database commands directly.

I'd argue the last step - the DB removal - should be the point of no return, because any other failure can be worked around via command flags (i.e. the 'force' flag). If they can't be worked around, I'd argue it's a design defect. Operating on the DB is the place where I would draw the line, because it requires not just operator-level knowledge about Ceph, but also about Microceph's internals.
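
One way to express that split, as a sketch only: failures before the database step can be bypassed with a hypothetical --force flag, while a failure of the database removal itself is reported as terminal. The flag name and error wording here are assumptions, not the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// classifyStepError decides what a step failure means for the overall
// removal. The step names match the list earlier in the thread.
func classifyStepError(step string, err error, force bool) error {
	if err == nil {
		return nil
	}
	if step == "deleteFromDatabase" {
		// Point of no return: the OSD is already gone from Ceph, so tell the
		// operator that only the database entry is left to clean up.
		return fmt.Errorf("osd removed from the cluster but not from the database, manual intervention needed: %w", err)
	}
	if force {
		// The operator explicitly chose to push past a recoverable failure.
		log.Printf("ignoring failure in %s because --force was given: %v", step, err)
		return nil
	}
	return fmt.Errorf("%s: %w", step, err)
}

func main() {
	fmt.Println(classifyStepError("wipeDevice", errors.New("input/output error"), true))
	fmt.Println(classifyStepError("deleteFromDatabase", errors.New("constraint failed"), true))
}
```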

@wolsen (Collaborator) commented Aug 29, 2023

> Once you start the purge of the OSD, you will have challenges continuing, though the necessary information should still be in the dqlite database. But if purge, wipe, or remove fail (which is more likely in the event of a disk failure), then you won't be able to delete the OSD from the database without running database commands directly.
>
> I'd argue the last step - the DB removal - should be the point of no return, because any other failure can be worked around via command flags (i.e. the 'force' flag). If they can't be worked around, I'd argue it's a design defect. Operating on the DB is the place where I would draw the line, because it requires not just operator-level knowledge about Ceph, but also about Microceph's internals.

I agree with this particular point. I think that a failure of the database removal is a fatal error and requires fixing the database.

However, my primary point is about what happens if the removal of the device fails due to, say, an I/O error or timeout at the wipe phase. At that point, the remove-disk command cannot be run again, because actions like reweighting the OSD will return an ENOENT error, which will cause the command to fail; the user is then unable to proceed further, much less remove the device from the database.
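
A sketch of the kind of tolerance that would make the command re-runnable: treat "OSD not found" as success for steps that have already completed, so a second run can fall through to the remaining cleanup. How the ENOENT actually surfaces from the ceph CLI wrapper is an assumption here; real code would have to inspect what the wrapper returns.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// ignoreMissingOSD makes a step idempotent: if the OSD is already gone,
// treat the step as done instead of failing the whole command.
func ignoreMissingOSD(err error) error {
	if err == nil || errors.Is(err, syscall.ENOENT) {
		return nil
	}
	return err
}

func main() {
	// Already-removed OSD: the wrapped ENOENT is swallowed, prints <nil>.
	fmt.Println(ignoreMissingOSD(fmt.Errorf("reweight osd.3: %w", syscall.ENOENT)))
	// Any other failure still propagates.
	fmt.Println(ignoreMissingOSD(errors.New("ceph command timed out")))
}
```

Steps such as reweightOSD or downOSD could be wrapped this way, while the database deletion keeps failing loudly.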

@UtkarshBhatthere (Contributor) commented

#28

@UtkarshBhatthere linked an issue Aug 30, 2023 that may be closed by this pull request
@sabaini force-pushed the feature/remove-osd branch 2 times, most recently from bca0149 to af82acd on August 30, 2023 17:28
@sabaini (Collaborator, Author) commented Aug 30, 2023

Thanks for all the valuable input. I've pushed an update, hopefully addressing the comments.

With regard to failure modes and robustness, I've updated the semantics in the following way:

  • add a guard against operating on OSDs that are not actually present in the cluster
  • change wiping so that a device fault is not fatal anymore

Let me know what you think
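
Roughly what those two changes could look like, as a sketch only; haveOSD and the other helpers below are stand-ins for illustration, not the actual microceph functions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// removeWithGuard sketches the updated semantics: refuse to operate on an OSD
// the cluster does not know about, and downgrade a wipe failure from fatal to
// a warning so a faulty disk cannot block the rest of the removal.
func removeWithGuard(ctx context.Context, osd int64) error {
	present, err := haveOSD(ctx, osd)
	if err != nil {
		return fmt.Errorf("checking osd.%d: %w", osd, err)
	}
	if !present {
		// Guard: never start tearing down an OSD the cluster has no record of.
		return fmt.Errorf("osd.%d is not present in the cluster", osd)
	}
	if err := purgeOSD(ctx, osd); err != nil { // reweight/stop/purge elided
		return err
	}
	if err := wipeDevice(ctx, osd); err != nil {
		log.Printf("warning: wiping device for osd.%d failed: %v", osd, err)
	}
	return deleteFromDatabase(ctx, osd)
}

// Trivial stand-ins so the sketch compiles.
func haveOSD(ctx context.Context, osd int64) (bool, error)    { return osd == 1, nil }
func purgeOSD(ctx context.Context, osd int64) error           { return nil }
func wipeDevice(ctx context.Context, osd int64) error         { return errors.New("input/output error") }
func deleteFromDatabase(ctx context.Context, osd int64) error { return nil }

func main() {
	fmt.Println(removeWithGuard(context.Background(), 1)) // logs wipe warning, then prints <nil>
	fmt.Println(removeWithGuard(context.Background(), 2)) // guard error
}
```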

Inline review threads on microceph/ceph/osd.go (resolved)
Implement `microceph disk remove` command. Update auto failure domain
behaviour to also allow downgrading.

Some updates for flaky single node CI tests.

Update go modules.

Signed-off-by: Peter Sabaini <peter.sabaini@canonical.com>
@sabaini (Collaborator, Author) commented Sep 1, 2023

Hola @lmlg, I've pushed an update and answered some comments, please take a look :-)

@sabaini requested a review from lmlg on September 5, 2023 11:13
@lmlg (Contributor) left a comment

LGTM, just a few questions overall.

Inline review threads on microceph/ceph/osd.go (resolved)
@lmlg (Contributor) left a comment

LGTM!

@sabaini merged commit d451ee6 into canonical:main on Sep 6, 2023
7 checks passed
@UtkarshBhatthere linked an issue Sep 13, 2023 that may be closed by this pull request

Successfully merging this pull request may close these issues.

  • Error adding disk after expanding cluster
  • Implement microceph disk remove
4 participants