
[BUG] two fsx instances with two lustre-integration modules fails #292

Open
swirkert1 opened this issue Sep 18, 2024 · 4 comments
Labels
bug Something isn't working
@swirkert1

Describe the bug
Create an EKS cluster and two FSx volumes.

Now use path: git::https://github.com/awslabs/idf-modules.git//modules/integration/fsx-lustre-on-eks?ref=release/1.11.0&depth=1

twice, to connect the FSx volumes to the cluster. This fails: the first time because the EksHandlerRoleArn (which is not documented but required) already exists, and the second time because the set_permissions_job already exists.

To Reproduce

  1. create an EKS cluster
  2. create an FSx volume
  3. create another FSx volume
  4. add two fsx-lustre-on-eks integration modules to connect the FSx volumes to the EKS cluster

Expected behavior
Resources are created.

Screenshots
addf-llpdrsw-integration-lustre-on-eks-1a | 10/13 | 2:06:49 PM | CREATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | addf-llpdrsw-integration-lustr-eks-cluster/manifest-SetPermissionsJob/Resource/Default (addfllpdrswintegrationlustreksclustermanifestSetPermissionsJob14F57FB1) Received response status [FAILED] from custom resource. Message returned: Error: b'Error from server (AlreadyExists): error when creating "/tmp/manifest.yaml": jobs.batch "set-permissions-job" already exists\n'

@swirkert1 swirkert1 added the bug Something isn't working label Sep 18, 2024
@swirkert1
Author

I think a workaround is to let them run sequentially in different groups, as the bug seems to be connected to running them in parallel.

@swirkert1
Author

Unfortunately no. While the state says "SUCCEEDED", the PVC of the previous integration module was deleted. Also, when I tried this with a third FSx volume and integration module, it failed again. It all seems somewhat random.

@swirkert1
Author

I think for the bug with the set-permissions-job we need to give it a unique name here:
"metadata": {"name": "set-permissions-job", "namespace": eks_namespace},
For the rest, I don't know.
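The suggested fix could look roughly like the following. This is a hypothetical sketch, not the actual fsx-lustre-on-eks module code: the `fsx_volume_name` parameter and the helper function name are assumptions, and only the metadata portion of the Job manifest is shown.

```python
# Hypothetical sketch: derive a unique Job name per FSx integration so that
# two fsx-lustre-on-eks module instances deployed against the same EKS
# cluster do not collide on a single "set-permissions-job" batch Job.
# `fsx_volume_name` is an assumed identifier; the real module may expose
# a different parameter for this.

def set_permissions_job_manifest(eks_namespace: str, fsx_volume_name: str) -> dict:
    """Build the metadata portion of a batch/v1 Job manifest with a unique name."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            # Suffixing the fixed name with the volume identifier avoids the
            # AlreadyExists error when two modules deploy in parallel.
            "name": f"set-permissions-job-{fsx_volume_name}",
            "namespace": eks_namespace,
        },
        # spec omitted: only the naming change matters for this bug
    }
```

Kubernetes requires Job names to be unique within a namespace, so any per-volume suffix that is stable across deployments would do.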

@malachi-constant malachi-constant self-assigned this Sep 18, 2024
@swirkert1
Author

swirkert1 commented Sep 18, 2024

I gave each set-permissions job a unique name and removed the PV and PVC's dependency on the namespace. Now it works after the second make deploy.
