
BadRequest when performing merge after write in the same Spark session #964

Closed
khaledh opened this issue May 5, 2023 · 8 comments
@khaledh

khaledh commented May 5, 2023

I have a use case where we need to use MERGE INTO, but because the connector doesn't support it natively (there's an issue for it: #575), we work around this by writing the delta dataframe to a temp table and then using the python-bigquery library to execute a MERGE SQL query:

import logging

from google.cloud import bigquery
from pyspark.sql import SparkSession

# Spark session and logger setup (assumed; not shown in the original snippet)
spark = SparkSession.builder.getOrCreate()
logger = logging.getLogger(__name__)


def write_then_merge():
    full_df = spark.createDataFrame([
        (1, 'a'),
        (2, 'b'),
        (3, 'c'),
    ], ['id', 'letter'])

    logger.info('writing full_df')
    full_df.write.format('bigquery').option('writeMethod', 'direct').save(
        'myproject.scratch.overwrite_then_merge'
    )

    delta_df = spark.createDataFrame([
        (1, 'aa'),
        (2, 'bb'),
    ], ['id', 'letter'])

    logger.info('writing delta_df')
    delta_df.write.format('bigquery').option('writeMethod', 'direct').save(
        'myproject.scratch.overwrite_then_merge_tmp'
    )

    logger.info('merging delta_df into full_df...')

    query = """
        MERGE INTO myproject.scratch.overwrite_then_merge AS target
        USING myproject.scratch.overwrite_then_merge_tmp AS source
        ON target.id = source.id
        WHEN MATCHED THEN UPDATE SET
          target.letter = source.letter
        WHEN NOT MATCHED THEN INSERT ROW"""

    client = bigquery.Client(project='myproject')
    job = client.query(query)
    job.result()

    logger.info('done')

However, this results in the following error:

BadRequest: 400 UPDATE or DELETE statement over table myproject.scratch.overwrite_then_merge would affect rows in the streaming buffer, which is not supported

Relevant part of the stack trace:

    job.result()
  File "/tmp/tmp3lb1gxhh/installed_wheels/.../google_cloud_bigquery-3.10.0-py2.py3-none-any.whl/google/cloud/bigquery/job/query.py", line 1520, in result
    do_get_result()
  File "/tmp/tmp3lb1gxhh/installed_wheels/.../google_api_core-2.11.0-py3-none-any.whl/google/api_core/retry.py", line 349, in retry_wrapped_func
    return retry_target(
  File "/tmp/tmp3lb1gxhh/installed_wheels/.../google_api_core-2.11.0-py3-none-any.whl/google/api_core/retry.py", line 191, in retry_target
    return target()
  File "/tmp/tmp3lb1gxhh/installed_wheels/.../google_cloud_bigquery-3.10.0-py2.py3-none-any.whl/google/cloud/bigquery/job/query.py", line 1510, in do_get_result
    super(QueryJob, self).result(retry=retry, timeout=timeout)
  File "/tmp/tmp3lb1gxhh/installed_wheels/.../google_cloud_bigquery-3.10.0-py2.py3-none-any.whl/google/cloud/bigquery/job/base.py", line 911, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
  File "/tmp/tmp3lb1gxhh/installed_wheels/.../google_api_core-2.11.0-py3-none-any.whl/google/api_core/future/polling.py", line 261, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 UPDATE or DELETE statement over table myproject.scratch.overwrite_then_merge would affect rows in the streaming buffer, which is not supported

I found some relevant docs about the availability of data after streaming into BigQuery, which state that it may take up to 90 minutes for the data to become available. So I put the above code in a retry loop that retried for 2 hours, and it still hit the same error. Also, if I run the first write and the second merge step in two separate runs back to back, it works fine. So I don't think that documentation page is relevant to this issue.

@khaledh
Author

khaledh commented May 5, 2023

More context: If I switch the writing of full_df to the indirect write method (using a temporary GCS bucket), this works flawlessly.
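
For reference, a minimal sketch of the indirect write that avoids the streaming buffer, assuming an existing staging bucket (the bucket name below is a placeholder):

# Indirect write: the connector stages the data as files in GCS and loads them
# into BigQuery with a load job, so the rows never sit in the streaming buffer.
# 'my-temp-bucket' is a placeholder for an existing GCS bucket.
full_df.write.format('bigquery') \
    .option('writeMethod', 'indirect') \
    .option('temporaryGcsBucket', 'my-temp-bucket') \
    .save('myproject.scratch.overwrite_then_merge')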

@suryasoma
Contributor

Which Spark version and connector version are you using?

@suryasoma suryasoma self-assigned this May 10, 2023
@davidrabinowitz
Member

When using the DIRECT write method, it may take a few seconds for the data to appear in the table. Have you tried the INDIRECT write method?

@khaledh
Author

khaledh commented Jun 9, 2023

When using the DIRECT write method, it may take a few seconds for the data to appear in the table.

The data does appear if I query it, but I still get the error above if I try to MERGE into the table after the first write.

Have you tried the INDIRECT write method?

Yes, that's what I mentioned in my second comment above. If I create/write the table using the indirect method, then I don't get the error with the subsequent MERGE.

@khaledh
Author

khaledh commented Jun 9, 2023

Which Spark version and connector version are you using?

Spark 3.3 and connector 0.30.0.
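(For anyone trying to reproduce this: the connector is usually attached with --packages; the coordinate below assumes the Scala 2.12 build, and repro.py is a placeholder script name.)

spark-submit \
    --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.30.0 \
    repro.py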

@davidrabinowitz
Member

@yirutang can you please have a look?

@davidrabinowitz davidrabinowitz assigned yirutang and unassigned suryasoma Jun 9, 2023
@yirutang
Collaborator

yirutang commented Jun 9, 2023

We are working on having committed data land in final storage right after commit, but that is still a work in progress. In the meantime, commit triggers a conversion, and the conversion time is shorter than the publicly documented streaming delay: our measurements show 99% of conversions finish within 2 minutes, and the longest tail we have seen is 25 minutes.

@davidrabinowitz
Member

Given that, I suggest switching to the INDIRECT mode for the time being, or adding retry logic.
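
A minimal sketch of such retry logic, assuming we only want to retry the specific streaming-buffer error (the helper name merge_with_retry is illustrative, not part of any API):

import time

from google.api_core.exceptions import BadRequest
from google.cloud import bigquery


def merge_with_retry(client: bigquery.Client, query: str,
                     max_wait_secs: int = 1800, interval_secs: int = 60) -> None:
    """Run the MERGE, retrying while rows are still in the streaming buffer."""
    deadline = time.monotonic() + max_wait_secs
    while True:
        try:
            client.query(query).result()
            return
        except BadRequest as exc:
            # Retry only the streaming-buffer error; re-raise anything else,
            # or give up once the deadline has passed.
            if 'streaming buffer' not in str(exc) or time.monotonic() > deadline:
                raise
            time.sleep(interval_secs)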
