
Added serverless support to spark fixture #91


Merged
5 commits merged into main from serverless_support on Jan 31, 2025

Conversation

@mwojtyczka (Contributor) commented on Jan 30, 2025

Extend the spark fixture to support Serverless compute.
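A minimal usage sketch of what this enables (hedged: the spark fixture comes from this plugin; the serverless opt-in via DATABRICKS_SERVERLESS_COMPUTE_ID=auto follows the Databricks Connect docs linked later in this thread, and the test body is illustrative):

```python
# Hedged sketch: with DATABRICKS_SERVERLESS_COMPUTE_ID=auto set in the
# environment (or serverless_compute_id = auto in the config profile),
# the `spark` fixture should yield a serverless-backed Databricks Connect
# session instead of one attached to a named cluster.
def test_runs_on_serverless(spark):
    row = spark.sql("SELECT 1 AS one").collect()[0]
    assert row.one == 1
```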

Linked issues

Resolves #90

Tests

  • [x] manually tested
  • [ ] added unit tests
  • [x] added integration tests
  • [x] verified on staging environment (screenshot attached)

@mwojtyczka mwojtyczka requested a review from nfx as a code owner January 30, 2025 13:11
@mwojtyczka mwojtyczka requested review from JCZuurmond and alexott and removed request for nfx January 30, 2025 13:12

github-actions bot commented Jan 30, 2025

✅ 41/41 passed, 5 skipped, 5m42s total

Running from acceptance #169

@JCZuurmond (Contributor) left a comment


Hi @mwojtyczka, thanks for opening this PR. This "auto" serverless cluster id is new to me. What happens if that variable is set to something other than "auto"? Does that make sense for serverless?

@mwojtyczka (Contributor, Author) commented on Jan 30, 2025

> Hi @mwojtyczka, thanks for opening this PR. This "auto" serverless cluster id is new to me. What happens if that variable is set to something other than "auto"? Does that make sense for serverless?

You can set it to an actual cluster id, but that does not make sense for serverless, since most of the time the cluster is active only for a short period (due to autotermination). This is also the reason the documentation only mentions "auto":
https://docs.databricks.com/en/dev-tools/databricks-connect/cluster-config.html#configure-a-connection-to-serverless-compute

It works the same way in notebooks, for example: you cannot select a specific cluster, you just select Serverless.
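For reference, a minimal sketch of the serverless opt-in described on the linked page, assuming databricks-connect 15.1+ (where the serverless builder option and the DATABRICKS_SERVERLESS_COMPUTE_ID environment variable are supported):

```python
from databricks.connect import DatabricksSession

# Equivalent opt-ins, per the linked docs:
#   - config profile:       serverless_compute_id = auto
#   - environment variable: DATABRICKS_SERVERLESS_COMPUTE_ID=auto
#   - explicit builder flag, as below
spark = DatabricksSession.builder.serverless(True).getOrCreate()

# No cluster id is involved; the session runs on serverless compute.
spark.sql("SELECT 1").show()
```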

@JCZuurmond (Contributor) commented:
> Hi @mwojtyczka, thanks for opening this PR. This "auto" serverless cluster id is new to me. What happens if that variable is set to something other than "auto"? Does that make sense for serverless?

> You can set it to an actual cluster id, but that does not make sense for serverless, since most of the time the cluster is active only for a short period (due to autotermination). This is also the reason the documentation only mentions "auto": https://docs.databricks.com/en/dev-tools/databricks-connect/cluster-config.html#configure-a-connection-to-serverless-compute
>
> It works the same way in notebooks, for example: you cannot select a specific cluster, you just select Serverless.

Yeah, it makes sense to have no cluster id for serverless, kind of by definition. It confuses me that the workspace configuration has an attribute called serverless_compute_id. Please add the documentation link to the README and the docstring.

Also, should we add a check that explicitly tests that the value is "auto", and otherwise fail with a ValueError saying we only accept "auto"?

@mwojtyczka (Contributor, Author) commented on Jan 30, 2025

> Yeah, it makes sense to have no cluster id for serverless, kind of by definition. It confuses me that the workspace configuration has an attribute called serverless_compute_id. Please add the documentation link to the README and the docstring.
>
> Also, should we add a check that explicitly tests that the value is "auto", and otherwise fail with a ValueError saying we only accept "auto"?

Will do, but I'm not sure we need to check for "auto". I tested it, and it works if you give it the id of a running cluster. Maybe, if a specific cluster id is given, we should check that it exists and fail otherwise.
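A hedged sketch of what that existence check could look like (the helper name is hypothetical; ws.clusters.get and NotFound are from the Databricks SDK for Python):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound


def validate_serverless_compute_id(ws: WorkspaceClient, compute_id: str) -> None:
    """Accept "auto", or any cluster id that actually exists in the workspace."""
    if compute_id == "auto":
        return
    try:
        ws.clusters.get(compute_id)  # raises NotFound for an unknown cluster id
    except NotFound as e:
        raise ValueError(
            f"serverless_compute_id must be 'auto' or an existing cluster id, got {compute_id!r}"
        ) from e
```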

@mwojtyczka (Contributor, Author) commented on Jan 30, 2025

@JCZuurmond I had to update the SDK dependency because UCX is using a lower version of the SDK, and that was causing downstream jobs to fail. I don't really understand why the projects are coupled this way; it does not make much sense, since UCX and Remorph can pin to a specific version of pytester.

The project was also missing the databricks-connect dependency, so test_databricks_connect was previously always skipped. After adding it and enabling the test, it appears that a cluster incompatible with Databricks Connect is used to run the tests:

test_databricks_connect: pyspark.errors.exceptions.connect.SparkConnectGrpcException: BAD_REQUEST: SingleClusterComputeMode(DATABRICKS_CLUSTER_ID) is not Shared or Single User Cluster. (requestId=b64b9411-74d5-4e27-9566-5d1ed1fa72ce) (1.537s)

Can you please update the cluster ids DATABRICKS_CLUSTER_ID and TEST_DEFAULT_CLUSTER_ID in the vault_uri: ${{ secrets.VAULT_URI }} to the shared cluster 1114-152544-29g1w07e? The one that is currently used is neither a Shared nor a Single User cluster, and we cannot use databricks-connect with it.
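For context, Databricks Connect only works against clusters whose access mode is Shared or Single User; a hedged sketch of checking this with the Databricks SDK before pointing the tests at a cluster (the env var name is taken from the comment above):

```python
import os

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import DataSecurityMode

ws = WorkspaceClient()
cluster = ws.clusters.get(os.environ["DATABRICKS_CLUSTER_ID"])

# Shared maps to USER_ISOLATION in the SDK; Single User maps to SINGLE_USER.
if cluster.data_security_mode not in (DataSecurityMode.USER_ISOLATION, DataSecurityMode.SINGLE_USER):
    raise RuntimeError(
        f"cluster {cluster.cluster_id} has access mode {cluster.data_security_mode}; "
        "databricks-connect needs a Shared or Single User cluster"
    )
```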

@JCZuurmond (Contributor) left a comment


@mwojtyczka: Approving as it LGTM. Please address the last two suggestions before merging.

@JCZuurmond (Contributor) commented:

> I don't really understand why the projects are coupled this way; it does not make much sense, since UCX and Remorph can pin to a specific version of pytester.
>
> Can you please update the cluster ids DATABRICKS_CLUSTER_ID and TEST_DEFAULT_CLUSTER_ID in the vault_uri: ${{ secrets.VAULT_URI }} to the shared cluster 1114-152544-29g1w07e?

@mwojtyczka: I cannot help you here. @FastLee has access to our testing infrastructure; I expect that he can update that value.

On the design pattern: I suspect it is intended as a sanity check by @nfx, a backstop if you will.

@mwojtyczka mwojtyczka removed the request for review from alexott January 31, 2025 10:44
@mwojtyczka mwojtyczka merged commit cefd79e into main Jan 31, 2025
7 checks passed
@mwojtyczka mwojtyczka deleted the serverless_support branch January 31, 2025 11:23
Development

Successfully merging this pull request may close these issues:
  • Support serverless clusters in spark fixture (#90)