Adds python script for incremental partition insertion. #76

Conversation

prashastia (Collaborator):

insert_dynamic_partitions.py: A Python script to insert partitions incrementally into the partitioned table. It is used by the e2e unbounded source testing script.

/gcbrun

This module is similar to the BigQueryExample, with a few changes to count the number of records and log them.
This test reads a simpleTable.
Shell script and python script to check the number of records read.
comments CODECOV_TOKEN usage.
…ds to different tables required for the e2e tests.
parser.add_argument(
    '--refresh_interval',
    dest='refresh_interval',
    help='.',
Reviewer:

Please add a proper description.

Author (@prashastia):

Added.
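For illustration, a filled-in version of the argument definition might look like the sketch below; the help text here is hypothetical, not the wording committed in the PR:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--refresh_interval',
        dest='refresh_interval',
        type=int,
        # The description below is hypothetical, not the one used in the PR.
        help='Interval (in minutes) between split discoveries of the '
             'unbounded source.',
    )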

Comment on lines 26 to 27
    '--now_timestamp',
    dest='now_timestamp',
Reviewer:

Not a great name. Something like execution_timestamp makes more sense. Btw, why do we need this argument?

Author (@prashastia):

Fixed the name. We need this argument to make sure we insert the partitions for the current date. If we provide a fixed date, they will not be read (as those partitions have already been marked as closed).

Reviewer:

The current date can be found during execution using Python's time or date libraries. Why do we need to take it as input?
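For illustration, a minimal sketch of computing the current UTC timestamp inside the script instead of passing it in (standard library only):

    import datetime

    # Computed at execution time; no command-line argument needed.
    execution_timestamp = datetime.datetime.now(tz=datetime.timezone.utc)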

Comment on lines 67 to 70
now_timestamp = args.now_timestamp
now_timestamp = datetime.datetime.strptime(
    now_timestamp, '%Y-%m-%d'
).astimezone(datetime.timezone.utc)
Reviewer:

Either use different variables, or replace with:

    now_timestamp = datetime.datetime.strptime(
        args.now_timestamp, '%Y-%m-%d'
    ).astimezone(datetime.timezone.utc)

Author (@prashastia):

Fixed.

Comment on lines 83 to 88
simple_avro_schema_string = (
    '{"namespace": "project.dataset","type": "record","name":'
    ' "table","doc": "Avro Schema for project.dataset.table",'
    + simple_avro_schema_fields_string
    + '}'
)
Reviewer:

Replace with:

simple_avro_schema_string = (
    '{"namespace": "project.dataset","type": "record","name":'
    ' "table","doc": "Avro Schema for project.dataset.table",'
    f'{simple_avro_schema_fields_string}'
    '}'
)

Author (@prashastia):

Fixed.

avro_file_local_identifier = avro_file_local.replace(
    '.', '_' + str(thread_number) + '.'
)
x = threading.Thread(
Reviewer:

Please use a proper name. x is not sufficient.
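A sketch of the kind of rename being asked for; the target function and its arguments below are placeholders, not the actual ones in the script:

    import threading

    def upload_avro_to_bq(thread_number, avro_file_local_identifier):
        """Placeholder for the per-thread work done in the script."""
        pass

    # A name that says what the thread does, instead of `x`.
    avro_upload_thread = threading.Thread(
        target=upload_avro_to_bq,
        args=(0, 'rows_0.avro'),
    )
    avro_upload_thread.start()
    avro_upload_thread.join()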

thread.join()

time_elapsed = time.time() - start_time
prev_partitions_offset += number_of_partitions
Reviewer:

Why is prev_partitions_offset being incremented multiple times in the same iteration?

Author (@prashastia):

Within the same iteration we are adding rows spread across one or more partitions, so that on a new read we make sure multiple partitions are being read from.

Reviewer:

Let's take an example.

First iteration:

    for number_of_partitions in partitions:   # number_of_partitions is 2
        ...
        prev_partitions_offset += 1   # prev_partitions_offset is 1
        ...
        # called avro_to_bq_with_cleanup with partition_number as 1
        # called avro_to_bq_with_cleanup with partition_number as 2
        ...
        prev_partitions_offset += number_of_partitions   # prev_partitions_offset is 3
        ...

Second iteration:

    for number_of_partitions in partitions:   # number_of_partitions is 1
        ...
        prev_partitions_offset += 1   # prev_partitions_offset is 4
        ...
        # called avro_to_bq_with_cleanup with partition_number as 4
        ...
        prev_partitions_offset += number_of_partitions   # prev_partitions_offset is 5
        ...

Third iteration:

    for number_of_partitions in partitions:   # number_of_partitions is 2
        ...
        prev_partitions_offset += 1   # prev_partitions_offset is 6
        ...
        # called avro_to_bq_with_cleanup with partition_number as 6
        # called avro_to_bq_with_cleanup with partition_number as 7
        ...
        prev_partitions_offset += number_of_partitions   # prev_partitions_offset is 8
        ...

So, we've skipped partition offsets 3 and 5. If that is intentional, then why?

Author (@prashastia):

Yeah, this is correct.
To maintain time consistency, I took the time in UTC, which would be 18:30 hrs.
So (18:30 + 2 = 20:30) to (18:30 + 3 = 21:30) will generate values in the 20 hrs and 21 hrs partitions.

Then, if the next iteration generated values for 18:30 + 3 to 18:30 + 4, the partitions would clash.

I think this is getting too confusing.
I'll fix this.

Author (@prashastia):

This is fixed now.

dataset_name = args.dataset_name
table_name = args.table_name

execution_timestamp = datetime.datetime.now(tz=datetime.timezone.utc).replace(hour=0,
Author (@prashastia), Jan 2, 2024:

Now generating a midnight timestamp, so we can insert partitions incrementally irrespective of the time of day the script is executed.
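A minimal sketch of the midnight-UTC timestamp described here; it completes the truncated line above for illustration only, assuming the remaining keyword arguments zero out the smaller time fields:

    import datetime

    # Midnight (00:00 UTC) of the day the script is executed.
    execution_timestamp = datetime.datetime.now(
        tz=datetime.timezone.utc
    ).replace(hour=0, minute=0, second=0, microsecond=0)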

Comment on lines 93 to 95
number_of_rows_per_thread = int(
    number_of_rows_per_partition / number_of_threads
)
Reviewer:

Check out floor division: number_of_rows_per_partition // number_of_threads.

Author (@prashastia):

Fixed.

Comment on lines 19 to 21
# This is a buffer time to allow the read streams to be formed.
# Allows conflicting conditions by making sure that
# new (to be generated) partitions are not part of the current read stream.
Reviewer:

Buffer time to ensure that new partitions are created after previous read session and before next split discovery.

Author (@prashastia):

Fixed.

table_id,
)

# Insert in phases.
Reviewer:

# Insert iteratively.
Iteration is more appropriate than phase. Please apply that to other comments in this loop.

Author (@prashastia):

Fixed.

from utils import utils


def wait():
Reviewer:

It's unclear to the caller how long the wait is. This function can be named sleepForMinutes or sleepForSeconds with an argument.
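A sketch of the suggested rename with the duration made explicit (the snake_case name is illustrative, following Python convention rather than the literal sleepForSeconds):

    import time

    def sleep_for_seconds(duration_seconds):
        """Block the caller for the given number of seconds."""
        time.sleep(duration_seconds)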

partitions = [2, 1, 2]
# Insert 10000 - 30000 rows per partition.
# So, in a read up to 60000 new rows are read.
number_of_rows_per_partition = random.randint(1, 3) * 10000
Reviewer:

What is the reason for randomizing this? Will an optional argument with a default value serve the purpose?

Author (@prashastia), Jan 3, 2024:

The idea behind randomization was to introduce the possibility of a different number of rows being inserted on every execution.
But yes, we could hardcode a fixed value or even take it as an argument.
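A sketch of taking the row count as an optional argument with a default, as discussed; the argument name and default value below are illustrative:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--number_of_rows_per_partition',
        dest='number_of_rows_per_partition',
        type=int,
        default=10000,  # illustrative default
        help='Number of rows inserted into each new partition.',
    )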

prev_partitions_offset += number_of_partitions
# We wait for the refresh to happen
# so that the data just created can be read.
while time_elapsed < float(60 * 2 * refresh_interval):
Reviewer:

It's more efficient to sleep here for the required amount of time.

Author (@prashastia):

The step before this includes the generation and insertion of rows into the BQ table. The time taken by this insertion is variable. To make sure the next insertion does not take place before the next read stream is formed, we explicitly wait for the stipulated amount of time to pass.
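If the variable insertion time is the concern, a sketch of the reviewer's suggestion that still accounts for it: measure how much of the interval has already elapsed and sleep once for the remainder instead of looping (names follow the snippet quoted above; the refresh_interval value is illustrative):

    import time

    refresh_interval = 30  # illustrative value
    start_time = time.time()

    # ... generate and insert rows here; duration is variable ...

    # Sleep only for whatever is left of the interval, instead of busy-waiting.
    time_elapsed = time.time() - start_time
    remaining_seconds = float(60 * 2 * refresh_interval) - time_elapsed
    if remaining_seconds > 0:
        time.sleep(remaining_seconds)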

jayehwhyehentee merged commit ff65e15 into GoogleCloudDataproc:main on Jan 8, 2024. 4 checks passed.