[PLUGIN-1715] Implement retry feature for BigQuery Execute Plugin #1334

psainics
Contributor

@psainics psainics commented Nov 14, 2023

Implement retry feature for BigQuery Execute Plugin

Jira : PLUGIN-1715

  • In general, internal errors should be retried.
  • Do an exponential back-off retry on errors with reason jobBackendError or jobInternalError.
  • Back-off requirements: exponential from 1 to 32 seconds with a multiplier of 2.
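
To make the back-off requirement concrete, here is a minimal sketch of the delay arithmetic, using only the JDK. The class and method names (BackoffSchedule, delays) are hypothetical and not part of the plugin, and the sketch does not claim to reproduce how dev.failsafe computes its delays internally.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the plugin: lists the wait before each retry. */
final class BackoffSchedule {

  private BackoffSchedule() { }

  /**
   * Returns the delay before retry 1, 2, ..., maxRetries. Each delay is the
   * previous one times the multiplier, capped at maxDelay.
   */
  static List<Duration> delays(Duration initialDelay, Duration maxDelay, int multiplier, int maxRetries) {
    List<Duration> result = new ArrayList<>();
    Duration next = initialDelay;
    for (int i = 0; i < maxRetries; i++) {
      result.add(next);
      Duration grown = next.multipliedBy(multiplier);
      // Cap the growth so no single wait exceeds maxDelay.
      next = grown.compareTo(maxDelay) > 0 ? maxDelay : grown;
    }
    return result;
  }
}
```

Under these assumptions, a 1 second initial delay, a 32 second cap, a multiplier of 2 and 5 retries give waits of 1, 2, 4, 8 and 16 seconds. A sixth retry would wait the full 32 seconds, and any further retries stay at the cap.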

Code Changes

  • Add a new set that contains all the reasons we should retry on.
  • Add a new custom exception that is raised when the error should be retried.
  • Add a new function that generates the exponential back-off retry policy.
  • Add a new dependency, dev.failsafe, which handles the retry logic based on the exception raised.
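
The pieces listed above fit together roughly as follows. This is a hand-rolled sketch using only the JDK, not the PR's code: the PR delegates the loop to dev.failsafe, and the names RetryableJobException, RetrySketch, failForReason and callWithRetry are hypothetical. The sleeper parameter is also an assumption, added so that a test can pass a no-op and avoid real waits.

```java
import java.util.Set;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

/** Hypothetical stand-in for the PR's custom "retry me" exception. */
final class RetryableJobException extends RuntimeException {
  RetryableJobException(String message) {
    super(message);
  }
}

final class RetrySketch {
  // The two reasons named in the PR description.
  static final Set<String> RETRYABLE_REASONS = Set.of("jobBackendError", "jobInternalError");

  private RetrySketch() { }

  /** Throws the retryable exception only for reasons in the set; other reasons fail fast. */
  static void failForReason(String reason) {
    if (RETRYABLE_REASONS.contains(reason)) {
      throw new RetryableJobException("Retryable failure: " + reason);
    }
    throw new IllegalStateException("Non-retryable failure: " + reason);
  }

  /**
   * Runs the task, retrying up to maxRetries times on RetryableJobException only.
   * The sleeper receives each delay in milliseconds; the delay doubles up to maxDelayMs.
   */
  static <T> T callWithRetry(Supplier<T> task, int maxRetries, long initialDelayMs,
                             long maxDelayMs, LongConsumer sleeper) {
    long delayMs = initialDelayMs;
    for (int attempt = 0; ; attempt++) {
      try {
        return task.get();
      } catch (RetryableJobException e) {
        if (attempt >= maxRetries) {
          throw e; // retries exhausted: surface the last failure unchanged
        }
        sleeper.accept(delayMs);
        delayMs = Math.min(delayMs * 2, maxDelayMs);
      }
    }
  }
}
```

Any exception other than RetryableJobException leaves the loop on the first attempt, which is the point of classifying failures by reason before deciding whether to retry.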

Unit Tests

  • Use mocks to test the behavior of the retry.

@psainics psainics changed the title Added BQ Retry [PLUGIN-1715] Implement retry feature for BigQuery Execute Plugin Nov 14, 2023
@sau42shri sau42shri requested review from itsankit-google and fernst and removed request for itsankit-google November 15, 2023 13:48
@itsankit-google itsankit-google added the build Trigger unit test build label Nov 16, 2023
@psainics psainics force-pushed the patch-retry-bigQuery-execute branch from 57ae3ae to b74a689 Compare November 16, 2023 08:34
@sau42shri sau42shri requested a review from tivv November 20, 2023 16:44
import io.cdap.plugin.gcp.bigquery.sink.BigQuerySinkUtils;
import io.cdap.plugin.gcp.bigquery.util.BigQueryUtil;
import io.cdap.plugin.gcp.common.CmekUtils;
import io.cdap.plugin.gcp.common.GCPUtils;
import org.jetbrains.annotations.TestOnly;
Contributor

Don't use jetbrains annotations in production code. This may bring in unnecessary dependencies.

Contributor

If you want to mark a method as visible only for the purposes of testing, use @VisibleForTesting which is already used elsewhere in the codebase.

Contributor Author

Removed the jetbrains annotation; now using @VisibleForTesting.


private Config config;
@TestOnly
public BigQueryExecute(Config config) {
Contributor

Usually package-only access is enough for testing

Contributor

Adding an explicit config would effectively remove a no-argument constructor. Which constructor will be used for the production run? If this one, it should not be marked as TestOnly

Contributor Author

Made this package only.

@@ -69,8 +76,13 @@ public final class BigQueryExecute extends AbstractBigQueryAction {
private static final Logger LOG = LoggerFactory.getLogger(BigQueryExecute.class);
public static final String NAME = "BigQueryExecute";
private static final String RECORDS_PROCESSED = "records.processed";
private final Config config;
private static final Set<String> retryOnReason = ImmutableSet.of("jobBackendError", "jobInternalError");
Contributor

All constants should be uppercase.

Contributor

Also the strings here should be in their own constants.

Contributor Author

Updated !
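
For illustration, the two naming suggestions in this thread could look like the sketch below. The holder class and constant names are hypothetical, and Set.of (Java 9+) replaces the Guava ImmutableSet.of used in the PR only so that the sketch has no external dependency.

```java
import java.util.Set;

/** Hypothetical holder class; the constant names below are illustrative. */
final class RetryReasons {
  // Each reason string gets its own named constant.
  static final String JOB_BACKEND_ERROR = "jobBackendError";
  static final String JOB_INTERNAL_ERROR = "jobInternalError";

  // The set itself is a constant too, so it is written in upper snake case.
  static final Set<String> RETRY_ON_REASON = Set.of(JOB_BACKEND_ERROR, JOB_INTERNAL_ERROR);

  private RetryReasons() { }
}
```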

Duration initialRetryDuration = Duration.ofSeconds(1);
Duration maxRetryDuration = Duration.ofSeconds(32);
int multiplier = 2;
int maxRetryCount = 5;
Contributor

All of this should be configurable. It must be possible to disable retries.

Contributor

Agree. Please add a Retry Configuration section in the UI and allow the users to fine-tune this behavior.

Contributor

@vikasrathee-cs vikasrathee-cs Nov 21, 2023

@tivv @fernst It was also mentioned previously on the bug ticket that the BigQuery SLA is for 32 seconds, which is why we added these fixed properties to go up to that limit. The same was confirmed in the design doc.
Can you please confirm in the design doc as well? It has been shared with you.
Please also have a look at this: https://cloud.google.com/bigquery/sla#:~:text=%22-,Back%2Doff%20Requirements,-%22%C2%A0means%2C%20when

Contributor

A few points here:

  • The customer may not wish to wait that long. Yes, BQ may recover, but the wait may also be too long for a given customer.
  • You are changing the behaviour. We must have an option to disable it.
  • SLAs are statistical. They may still be violated.
  • SLAs are point in time. They may change, and we need to be able to adjust the behaviour without a code change.

Contributor Author

I have added a new UI field to turn this behavior off (the default) and to let the user configure the parameters.

image
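
As a rough sketch of what "configurable and disableable" could mean in code, the value object below carries an on/off flag next to the four tuning values from the diff. The class name, field names and validation rules are assumptions made for this example. The plugin's real config class is not shown in this thread.

```java
import java.time.Duration;

/** Hypothetical value object; field names and validation rules are assumptions. */
final class RetrySettings {
  final boolean enabled;
  final Duration initialDelay;
  final Duration maxDelay;
  final int multiplier;
  final int maxRetries;

  RetrySettings(boolean enabled, Duration initialDelay, Duration maxDelay,
                int multiplier, int maxRetries) {
    if (enabled) {
      // Only validate the tuning values when they will actually be used.
      if (initialDelay.isZero() || initialDelay.isNegative()) {
        throw new IllegalArgumentException("initialDelay must be positive");
      }
      if (maxDelay.compareTo(initialDelay) < 0) {
        throw new IllegalArgumentException("maxDelay must be >= initialDelay");
      }
      if (multiplier < 1) {
        throw new IllegalArgumentException("multiplier must be >= 1");
      }
      if (maxRetries < 0) {
        throw new IllegalArgumentException("maxRetries must be >= 0");
      }
    }
    this.enabled = enabled;
    this.initialDelay = initialDelay;
    this.maxDelay = maxDelay;
    this.multiplier = multiplier;
    this.maxRetries = maxRetries;
  }

  /** Retries switched off: the query is attempted exactly once. */
  static RetrySettings disabled() {
    return new RetrySettings(false, Duration.ZERO, Duration.ZERO, 1, 0);
  }

  /** The values that were hard-coded in the diff under review. */
  static RetrySettings firstRevisionValues() {
    return new RetrySettings(true, Duration.ofSeconds(1), Duration.ofSeconds(32), 2, 5);
  }

  /** Number of retries actually performed: zero when retries are disabled. */
  int effectiveMaxRetries() {
    return enabled ? maxRetries : 0;
  }
}
```

Keeping the flag separate from maxRetries lets a user switch retries off without losing the tuned values, which is one way to address the "adjust without a code change" point above.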

}

@Test
public void testExecuteQueryWithExponentialBackoffSuccess() throws InterruptedException {
Contributor

You probably want immediate retries; otherwise this test would wait unnecessarily.

Contributor Author

Fixed !

bq.executeQueryWithExponentialBackoff(bigQuery, queryJobConfiguration, context);
}

@Test(expected = java.lang.RuntimeException.class)
Contributor

Why are you expecting just a RuntimeException here? I think it should go through the retries and give some reasonable error.

Contributor Author

Updated !

Contributor

@fernst fernst left a comment

Agree with Vitalii's suggestions here. Please address.

@psainics psainics requested review from tivv and fernst December 1, 2023 17:50
Failsafe.with(getRetryPolicy()).run(() -> executeQuery(bigQuery, queryConfig, context));
} catch (RuntimeException e) {
LOG.error("Failed to execute BigQuery job. Error: {}", e.getMessage());
throw new BigQueryJobExecutionException("Failed to execute BigQuery job. Reason: " + e.getMessage());
Contributor

This way you are losing all the additional information from the original BigQueryException stack trace. There are only a few cases where you would want to hide all details of the underlying exception by using e.getMessage() instead of attaching a root cause. I don't see why we would want that here. The same applies to the catch clause in executeQuery.

Contributor Author

Updated !
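
The difference the reviewer describes can be seen in a small JDK-only sketch. JobFailedException and CauseSketch are hypothetical names used for illustration and are not the plugin's BigQueryJobExecutionException.

```java
/** Hypothetical exception type used only for this sketch. */
final class JobFailedException extends RuntimeException {
  JobFailedException(String message, Throwable cause) {
    super(message, cause);
  }
}

final class CauseSketch {

  private CauseSketch() { }

  /** Message-only wrapping: the original exception type and stack trace are dropped. */
  static RuntimeException wrapMessageOnly(Exception original) {
    return new RuntimeException("Failed to execute BigQuery job. Reason: " + original.getMessage());
  }

  /** Wrapping with a cause: the original exception stays reachable through getCause(). */
  static RuntimeException wrapWithCause(Exception original) {
    return new JobFailedException("Failed to execute BigQuery job.", original);
  }
}
```

The same idea applies to the log statement: with SLF4J, passing the exception itself as the last argument, rather than formatting e.getMessage() into a placeholder, keeps the stack trace in the log output.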

@psainics psainics force-pushed the patch-retry-bigQuery-execute branch from 186fa05 to 5475c5b Compare December 6, 2023 04:34
@psainics psainics requested a review from tivv December 6, 2023 04:44
@psainics psainics force-pushed the patch-retry-bigQuery-execute branch 3 times, most recently from c759d01 to 35987f2 Compare December 7, 2023 13:46
@psainics
Contributor Author

psainics commented Dec 7, 2023

Changes Squashed !

@psainics psainics force-pushed the patch-retry-bigQuery-execute branch from b859ad2 to b8d5c96 Compare December 7, 2023 19:31
@psainics psainics requested a review from tivv December 7, 2023 19:32
Contributor

@fernst fernst left a comment

LGTM

bq.executeQueryWithExponentialBackoff(bigQuery, queryJobConfiguration, context);
});
String actualMessage = exception.getMessage();
Assert.assertEquals("Failed to execute BigQuery job. Reason: " + mockErrorMessageNoRetry, actualMessage);
Contributor

So, basically we say "Failed to execute BigQuery job. Reason: Job execution failed with error: $error". This is what I want to avoid, as it says the same thing twice.

Contributor

Maybe in the top-level catch we should just find the correct cause and rethrow it instead of wrapping it again.

Contributor Author

Sure, I have removed the wrapping of the error message.

Contributor

But this way you are losing the underlying error. You should either:

  • do it vice versa and surface only the underlying error message "Job execution failed with error: $error", or
  • prepend it with some text that adds value, e.g. "Job retries failed after X tries. Last try failure: Job execution failed with error: $error".

Contributor Author

Yup, I am doing the first one. By removing the wrapping I meant removing the extra message string that was present before.

Contributor

But you are hiding the underlying "$error", so the only thing we can see is that the retries were exhausted, without much information about the underlying error.

Contributor Author

As I already have logging when the nth retry is made or the policy is out of retries, should I just pop the error and throw that?

Something like this:

try {
  Failsafe.with(getRetryPolicy()).run(() -> executeQuery(bigQuery, queryConfig, context));
} catch (FailsafeException e) {
  throw e.getCause();
}

Also, I could drop the catch (RuntimeException e).

Contributor

Sure, that's option #1 in the list above

Contributor

I would check that e.getCause() is not null, as generally not every failure is an exception.

Contributor Author

🤔

if (e.getCause() != null) {
  throw e.getCause();
}
throw e;
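
One detail the snippet above glosses over: getCause() returns a Throwable, so a bare `throw e.getCause();` compiles only inside a method that declares `throws Throwable`. The sketch below shows one way to combine the null check with a type check. WrapperException stands in for FailsafeException so that the example needs no dependency, and both class names are hypothetical.

```java
/** Hypothetical stand-in for a wrapper such as FailsafeException. */
final class WrapperException extends RuntimeException {
  WrapperException(Throwable cause) {
    super(cause);
  }
}

final class Unwrap {

  private Unwrap() { }

  /**
   * Picks the exception to rethrow: the cause when it is an unchecked exception,
   * otherwise the wrapper itself (which still carries any checked cause).
   */
  static RuntimeException toRethrow(RuntimeException wrapper) {
    Throwable cause = wrapper.getCause();
    if (cause instanceof RuntimeException) {
      return (RuntimeException) cause; // instanceof is false for null, so this covers the null check
    }
    return wrapper;
  }
}
```

A caller would then write `throw Unwrap.toRethrow(e);` in the catch block, which needs no extra throws clause.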

@psainics psainics requested a review from tivv December 7, 2023 22:58
@psainics psainics force-pushed the patch-retry-bigQuery-execute branch 5 times, most recently from b76ef45 to 34a7113 Compare December 11, 2023 19:56
@psainics
Contributor Author

Thanks @tivv, I have made the changes and squashed them. Hopefully you can approve this now.
I will get this QA tested before merging 👍

@psainics psainics force-pushed the patch-retry-bigQuery-execute branch from 34a7113 to ddadce4 Compare December 11, 2023 20:01
"value": "false",
"label": "NO"
},
"default": "false"
Member

Shouldn't this be set to true by default? @tivv @fernst, what do you think?
I thought the aim is to add some resiliency to the plugin to reduce intermittent failures, but if the behavior is false by default, users will need to enable it explicitly and modify every pipeline.

Contributor Author

Your point makes a lot of sense. Setting this behavior to true by default aligns with the goal of enhancing plugin resilience and reducing intermittent failures.

Contributor

@tivv tivv Dec 12, 2023

So, we usually have two things:

  • Default for UI. This is what we want the default experience to be. In this case it should be "true".
  • Default for backend (when the option is not set). This is usually how things worked before, to ensure backward compatibility, unless we have an explicit OK from product to break backward compatibility.
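
The distinction between the two defaults can be sketched as follows. The class name, constants and resolve method are hypothetical, and the values only illustrate the backward-compatible pattern described above. They are not the values the PR finally ships, which are settled further down in this thread.

```java
/** Hypothetical sketch of a UI default versus a backend default for one flag. */
final class RetryFlag {
  // What the widget pre-fills for a newly created pipeline.
  static final boolean UI_DEFAULT = true;

  // Assumption for this sketch: existing pipelines never set the property,
  // so "not set" maps to the pre-feature behaviour (no retries).
  static final boolean BACKEND_DEFAULT = false;

  private RetryFlag() { }

  /** Resolves the effective value; null means the property was absent from the pipeline config. */
  static boolean resolve(Boolean configured) {
    return configured == null ? BACKEND_DEFAULT : configured;
  }
}
```

A nullable Boolean is what makes the two cases distinguishable: a pipeline saved before the feature existed resolves through the backend default, while one created in the UI carries an explicit value.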

Contributor Author

@itsankit-google we had a discussion with deepinder: the UI fields should be hidden and true by default!

@psainics psainics force-pushed the patch-retry-bigQuery-execute branch from ddadce4 to 7a56dad Compare December 13, 2023 10:27
@psainics
Contributor Author

New changes include

  • Reverted changes in docs/BigQueryExecute-action.md as the UI fields are hidden.
  • Removed the filter on retryOnBackendError with the condition retryOnBackendError == true, as it is no longer required.
  • Made all the UI fields hidden.
  • Retry On Backend Error is TRUE by default, on both the widget and the backend.

@psainics psainics force-pushed the patch-retry-bigQuery-execute branch from 7a56dad to 2793ac7 Compare December 14, 2023 05:12
@psainics
Contributor Author

Resolved conflicts caused by PR: #1344 in bigquery/action/BigQueryExecute.java

@vikasrathee-cs vikasrathee-cs force-pushed the patch-retry-bigQuery-execute branch from 2793ac7 to a5e3e7b Compare December 21, 2023 08:24
@vikasrathee-cs vikasrathee-cs merged commit 6ab069c into data-integrations:develop Dec 21, 2023
@psainics psainics deleted the patch-retry-bigQuery-execute branch February 13, 2025 05:48
Labels
build Trigger unit test build
5 participants