
Move waiting for tasks to separate phase to unblock process #1262

Merged
merged 1 commit into kubev2v:main on Dec 12, 2024

Conversation

mnecas
Member

@mnecas mnecas commented Dec 12, 2024

Issue:
When Forklift creates a snapshot we need to wait for the task to finish. Right now we use task.WaitForResult, which makes the whole process wait for the snapshot creation and blocks other VM migrations. We have the same problem with snapshot removal: if the ESXi host is busy when we start the removal, the task can take longer than the reconcile cycle (3s) and the migration can fail because of it.

Fix:
Instead of using task.WaitForResult, Forklift will query the latest tasks per VM (10 tasks by default). This querying is done in a separate phase from the creation/deletion, so we will have a WaitFor phase for each of the object manipulations. We find the specific creation/deletion task and check its status. This has the advantage that we get not only the status of the task but also its result, so we can propagate it to the user if the creation/deletion fails (see the sketch below).

Ref:

Proper fix of: #1253
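
A minimal govmomi sketch of what this non-blocking check could look like. This is an illustration under assumptions, not the code in this PR: the function name, the descriptionId values, and the error handling are hypothetical; only the general idea (reading the VM's recent tasks and inspecting TaskInfo instead of blocking in task.WaitForResult) follows the description above.

```go
package vsphere

import (
	"context"
	"fmt"

	"github.com/vmware/govmomi/property"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/mo"
	"github.com/vmware/govmomi/vim25/types"
)

// checkSnapshotTask scans the VM's recent tasks (vSphere keeps roughly the
// last 10 per entity) for the snapshot create/remove task and reports its
// state without blocking the reconcile loop.
func checkSnapshotTask(ctx context.Context, c *vim25.Client, vmRef types.ManagedObjectReference, descriptionId string) (done bool, err error) {
	pc := property.DefaultCollector(c)

	// Read only the recentTask property of the VM.
	var vm mo.VirtualMachine
	if err = pc.RetrieveOne(ctx, vmRef, []string{"recentTask"}, &vm); err != nil {
		return
	}
	if len(vm.RecentTask) == 0 {
		return
	}

	// Fetch TaskInfo for all recent tasks in a single call.
	var tasks []mo.Task
	if err = pc.Retrieve(ctx, vm.RecentTask, []string{"info"}, &tasks); err != nil {
		return
	}

	// Match the task we care about, e.g. "VirtualMachine.createSnapshot"
	// or "VirtualMachine.removeSnapshot" (illustrative descriptionIds).
	for _, t := range tasks {
		if t.Info.DescriptionId != descriptionId {
			continue
		}
		switch t.Info.State {
		case types.TaskInfoStateSuccess:
			done = true
		case types.TaskInfoStateError:
			// Surface the task's own error instead of a generic timeout.
			msg := "unknown error"
			if t.Info.Error != nil {
				msg = t.Info.Error.LocalizedMessage
			}
			err = fmt.Errorf("task %s failed: %s", t.Info.Key, msg)
		}
		return
	}
	return // not found yet; check again on the next reconcile
}
```

Because the check is a couple of property-collector round trips, it fits inside the reconcile cycle instead of blocking it.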

@mnecas mnecas added this to the 2.7.7 milestone Dec 12, 2024
@mnecas mnecas requested a review from yaacov as a code owner December 12, 2024 13:02
@codecov-commenter

codecov-commenter commented Dec 12, 2024


Codecov Report

Attention: Patch coverage is 0% with 78 lines in your changes missing coverage. Please review.

Project coverage is 15.50%. Comparing base (ea38999) to head (d26472a).

Files with missing lines                          Patch %   Lines
pkg/controller/plan/adapter/vsphere/client.go       0.00%   57 Missing ⚠️
pkg/controller/plan/migration.go                    0.00%   12 Missing ⚠️
pkg/controller/plan/adapter/openstack/client.go     0.00%    3 Missing ⚠️
pkg/controller/plan/adapter/ova/client.go           0.00%    3 Missing ⚠️
pkg/controller/plan/adapter/ovirt/client.go         0.00%    3 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1262      +/-   ##
==========================================
- Coverage   15.57%   15.50%   -0.08%     
==========================================
  Files         112      112              
  Lines       23255    23300      +45     
==========================================
- Hits         3623     3612      -11     
- Misses      19345    19403      +58     
+ Partials      287      285       -2     
Flag        Coverage Δ
unittests   15.50% <0.00%> (-0.08%) ⬇️


Issue:
When Forklift creates a snapshot we need to wait for the task to
finish. Right now we use task.WaitForResult, which makes the whole
process wait for the snapshot creation and blocks other VM migrations.
We have the same problem with snapshot removal: if the ESXi host is
busy when we start the removal, the task can take longer than the
reconcile cycle (3s) and the migration can fail because of it.

Fix:
Instead of using task.WaitForResult, Forklift will query the latest
tasks per VM (10 tasks by default). This querying is done in a separate
phase from the creation/deletion, so we will have a WaitFor phase for
each of the object manipulations (a sketch of this phase pattern
follows the commit message). We find the specific creation/deletion
task and check its status. This has the advantage that we get not only
the status of the task but also its result, so we can propagate it to
the user if the creation/deletion fails.

Ref:
- https://issues.redhat.com/browse/MTV-1753
- https://issues.redhat.com/browse/MTV-1775

Signed-off-by: Martin Necas <mnecas@redhat.com>
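
The separate WaitFor phase described in the Fix above boils down to a non-blocking check that either advances the phase, asks to be requeued, or fails the step. A simplified sketch of that pattern, with illustrative names (TaskChecker, Result, waitForTask) that are not Forklift's actual API:

```go
package phases

import "time"

// TaskChecker reports whether the tracked vSphere task has finished; err is
// non-nil when the task ended in error.
type TaskChecker interface {
	Done() (bool, error)
}

// Result tells the reconciler what to do after a single, non-blocking check.
type Result struct {
	NextPhase    string        // phase to advance to once the task is done
	RequeueAfter time.Duration // re-check delay while the task is still running
	Error        error         // terminal task error to surface to the user
}

// waitForTask is the body of a WaitFor* phase: it never blocks on the task,
// it only inspects its current state and reports back.
func waitForTask(checker TaskChecker, nextPhase string) Result {
	done, err := checker.Done()
	switch {
	case err != nil:
		return Result{Error: err}
	case !done:
		return Result{RequeueAfter: 3 * time.Second} // matches the reconcile cycle
	default:
		return Result{NextPhase: nextPhase}
	}
}
```

Returning a requeue interval instead of blocking keeps the controller free to reconcile other VMs while the snapshot task runs on the host.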
@mnecas mnecas merged commit 4208890 into kubev2v:main Dec 12, 2024
18 of 19 checks passed
mnecas added a commit to mnecas/forklift that referenced this pull request Dec 13, 2024
Add a wait phase for snapshot tasks

Issue:
The main problem in MTV-1753 and MTV-1775 is that we are either not
waiting for the VMware task to finish, or we are waiting and halting
the whole controller process. This causes either performance issues or
even migration failures. So we need a mechanism to wait for the tasks
without halting the whole process.

Fix:
My first attempt was PR kubev2v#1262, which used the event manager. On
the surface this was an easy approach that did not require any
additional changes to the CR. The problem was that some of the tasks
were not reported to the taskManager. These tasks have the haTask
prefix. After some investigation, I found out that these tasks live
directly on the ESXi host and are not sent to vSphere, so we can't use
the taskManager.

This PR adds the taskIds to the status CR so additional wait phases can
monitor the tasks. The main controller gets the ESXi client and creates
a property collector to request the specific task from the host (see
the sketch below this commit message).

Ref:
- https://issues.redhat.com/browse/MTV-1753
- https://issues.redhat.com/browse/MTV-1775
- kubev2v#1262
- kubev2v#1265

Signed-off-by: Martin Necas <mnecas@redhat.com>
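
A hedged sketch of that lookup, assuming the task id (an haTask-prefixed id in the case above) has already been recorded in the plan status and that we have a govmomi client connected to the ESXi host; the function name and parameters are illustrative:

```go
package vsphere

import (
	"context"

	"github.com/vmware/govmomi/property"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/mo"
	"github.com/vmware/govmomi/vim25/types"
)

// getTaskInfo looks up a task by the id recorded in the status CR (for
// host-local tasks this is an id with the haTask prefix) using a property
// collector built from the client that owns the task, i.e. the ESXi host
// client rather than the vCenter client.
func getTaskInfo(ctx context.Context, hostClient *vim25.Client, taskId string) (*types.TaskInfo, error) {
	ref := types.ManagedObjectReference{
		Type:  "Task",
		Value: taskId,
	}
	var task mo.Task
	pc := property.DefaultCollector(hostClient)
	if err := pc.RetrieveOne(ctx, ref, []string{"info"}, &task); err != nil {
		return nil, err
	}
	return &task.Info, nil
}
```

Because haTask ids exist only on the host itself and are not reported to vSphere, the property collector has to be created from the ESXi client, which is why the controller fetches the ESXi client first.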
mnecas added a commit to mnecas/forklift that referenced this pull request Dec 13, 2024
mnecas added a commit to mnecas/forklift that referenced this pull request Dec 13, 2024
mnecas added a commit that referenced this pull request Dec 13, 2024
mnecas added a commit to mnecas/forklift that referenced this pull request Dec 13, 2024
@mnecas mnecas mentioned this pull request Dec 13, 2024
mnecas added a commit that referenced this pull request Dec 13, 2024