Add time range parameters to sync script #483

fhenneke · 2025-01-07T14:30:59Z

This PR adds start and end times to the sync data script.

The command line arguments start_time and end_time are added to the sync_data script. The script is called as, e.g.,

python -m src.data_sync.sync_data --sync-table order_data --start-time 2024-12-30 --end-time 2025-01-07

Only data between start_time and end_time is computed. This data is then upserted into the table of the corresponding month: rows that are not yet in the table are inserted, and recomputed rows replace the existing rows. Old data which was not recomputed stays as is.
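The upsert behavior described above can be illustrated with a minimal sketch. It uses an in-memory SQLite database only to stay self-contained; the table layout, the column names (`order_uid`, `block_number`, `value`), and the helper `upsert_rows` are hypothetical and are not taken from this PR.

```python
import sqlite3


def upsert_rows(conn: sqlite3.Connection, rows: list[tuple[str, int, float]]) -> None:
    """Insert new rows and overwrite rows whose key already exists."""
    conn.executemany(
        """
        INSERT INTO order_data (order_uid, block_number, value)
        VALUES (?, ?, ?)
        ON CONFLICT(order_uid) DO UPDATE SET
            block_number = excluded.block_number,
            value = excluded.value
        """,
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per order, keyed by order_uid.
conn.execute(
    "CREATE TABLE order_data ("
    "order_uid TEXT PRIMARY KEY, block_number INTEGER, value REAL)"
)

# First sync: two rows are inserted.
upsert_rows(conn, [("a", 1, 1.0), ("b", 2, 2.0)])
# Second sync with an overlapping range: "b" is replaced, "c" is new, "a" is untouched.
upsert_rows(conn, [("b", 2, 20.0), ("c", 3, 3.0)])

result = conn.execute(
    "SELECT order_uid, value FROM order_data ORDER BY order_uid"
).fetchall()
```

Rows that were wrongly written by an earlier run and are never recomputed would survive such an upsert, which is the motivation for the drop-table item in the TODO list.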

The code is structured as follows:

1. Arguments are parsed with appropriate default values.
2. The full time range is partitioned into monthly ranges.
3. Block ranges and months are computed from those time ranges.
4. Essentially the old code is used for computing data for those block ranges.
5. Data is written to the database.
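The partitioning into monthly ranges can be sketched as follows. This is a standard-library illustration, not the `compute_time_range` implementation from this PR (which uses `relativedelta`); the names `next_month_start` and `split_into_months` are hypothetical.

```python
from datetime import datetime


def next_month_start(t: datetime) -> datetime:
    """Return midnight on the first day of the month following t."""
    if t.month == 12:
        return datetime(t.year + 1, 1, 1, tzinfo=t.tzinfo)
    return datetime(t.year, t.month + 1, 1, tzinfo=t.tzinfo)


def split_into_months(
    start_time: datetime, end_time: datetime
) -> list[tuple[datetime, datetime]]:
    """Split [start_time, end_time) into consecutive ranges that each stay within one month."""
    if start_time >= end_time:
        raise ValueError("start_time must be strictly before end_time")
    ranges = []
    current = start_time
    while current < end_time:
        # Cut at the next month boundary, or at end_time if that comes first.
        boundary = min(next_month_start(current), end_time)
        ranges.append((current, boundary))
        current = boundary
    return ranges


monthly = split_into_months(datetime(2024, 12, 30), datetime(2025, 1, 7))
```

For the range from the example call, this yields one range ending at 2025-01-01 and one starting there, so each part maps to exactly one monthly table.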

The convention is that the start time is inclusive and the end time is exclusive. This way the two ranges (2024-12-30, 2025-01-02) and (2025-01-02, 2025-01-07) give the same result as the range (2024-12-30, 2025-01-07). Some overlap is still required in case end_time is beyond the last finalized block.
If no argument is supplied, start_time and end_time default to the start of the current month and the start of the next month, respectively, so that data is computed for the full month up to the last finalized block.
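A minimal sketch of how such defaults could be wired into argument parsing is shown below. Only the flags `--start-time` and `--end-time` come from this PR; the helper names, the `now` parameter, and the choice to interpret all times as UTC are assumptions made for illustration.

```python
import argparse
from datetime import datetime, timezone


def default_month_bounds(now: datetime) -> tuple[datetime, datetime]:
    """Start of the month containing `now` and start of the following month (UTC assumed)."""
    start = datetime(now.year, now.month, 1, tzinfo=timezone.utc)
    if now.month == 12:
        end = datetime(now.year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(now.year, now.month + 1, 1, tzinfo=timezone.utc)
    return start, end


def parse_time(value: str) -> datetime:
    """Parse an ISO date or datetime string; the result is treated as UTC (assumption)."""
    return datetime.fromisoformat(value).replace(tzinfo=timezone.utc)


def parse_args(argv: list[str], now: datetime) -> argparse.Namespace:
    """Parse the two time flags, falling back to the current month."""
    default_start, default_end = default_month_bounds(now)
    parser = argparse.ArgumentParser()
    parser.add_argument("--start-time", type=parse_time, default=default_start)
    parser.add_argument("--end-time", type=parse_time, default=default_end)
    return parser.parse_args(argv)


now = datetime(2025, 1, 7, 14, 30, tzinfo=timezone.utc)
defaults = parse_args([], now)
explicit = parse_args(["--start-time", "2024-12-30", "--end-time", "2025-01-07"], now)
```

Passing `now` explicitly keeps the default computation deterministic, which makes it easy to test the December to January rollover.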

The previous month is not automatically recomputed on the first of the next month. Instead, one can pass a time range that covers whatever window needs to be recomputed.

Potential TODOs

Change data-jobs workflows to have (overlapping) time ranges.
Add tests
Add documentation
Add functionality to drop a table instead of upserting. This will get rid of wrong data not corresponding to any valid order or batch.

src/data_sync/common.py

bh2smith · 2025-01-08T23:33:13Z

src/data_sync/common.py

 from src.logger import set_log

 log = set_log(__name__)


+def compute_time_range(
+    start_time: datetime, end_time: datetime
+) -> list[tuple[datetime, datetime]]:


Didn't there used to be a type AccountingPeriod? If that class has been abolished, it might not hurt to add a basic declaration:

AccountingPeriod = tuple[datetime, datetime]

that can be passed around the project for better readability.

Good point. I am still not sure how to reasonably merge the two code bases of solver-rewards and dune-sync (v1).

Accounting period seems to be tailored to rewards with defaulting to weeks and having pretty printing for dune tables. The accounting period also has some implicit constraints on having a time of 00:00:00 in some parts of the code.

Making the concept of accounting periods consistent throughout the code would definitely help.

Technically, block number ranges would be the most appropriate thing. This was the benefit of the old AccountingPeriod class; it provided a bijection between timestamps and block numbers so computations could be performed on the more natural type.

Anyway, I suppose this is just a side note. Feel free to ignore.

There is an ongoing internal debate on what the most natural concept for accounting periods is. Block ranges sound nice when considering how backend data is structured. But they become a bit unnatural if multiple chains are considered. For simplicity, we want to use time based periods everywhere, across chains and projects.

bh2smith · 2025-01-08T23:34:48Z

src/data_sync/sync_data.py


-async def sync_data_to_db(  # pylint: disable=too-many-arguments
+async def sync_data_to_db(  # pylint: disable=too-many-arguments, too-many-locals


Nit: Looks like this method has become quite overloaded...

I simplified the code (but not the responsibilities of the function) a bit. At least I do not need to exclude the pylint check anymore.

bram-vdberg

LGTM!

bram-vdberg · 2025-01-09T07:43:55Z

src/data_sync/common.py

+
+    start_block = find_block_with_timestamp(node, start_time.timestamp())
+    if latest_block_time < end_time:
+        end_block = int(latest_block["number"])


Should we add a log here, so that we know when the script is not actually using the end time because it is still in the future?

Do you have an opinion on where this processing for time ranges should take place?

On the one hand, most of the logging and checking can also happen during initialization in ScriptArgs. That might also remove the implicit changing of end times due to the restriction to finalized blocks.

On the other hand, a node would be required to do the checking for latest block. And it would be mixing abstractions. The accounting, on a high level, should be time based. That this is internally split into monthly ranges and blocks is a bit of an implementation detail.

I'm not sure to be honest. Since this requires actual implementation details and not checking whether or not the args are valid it feels cleaner to leave this here. For the sake of serializing the args any end time > start time would be valid, so it seems to me like this is the right place. But if moving this to the initialization makes it more maintainable then that could be a good argument to do this there instead.

For now I just added a log message.
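A sketch of what such a log message could look like when the end time is clamped to the latest finalized block. The function `resolve_end_block` and the injected `find_block` callable are hypothetical stand-ins (the PR calls `find_block_with_timestamp(node, ...)` directly), and the message wording is not the one from the PR. The sketch assumes `find_block` returns the first block at or after the given timestamp, so subtracting one makes the end exclusive.

```python
import logging
from datetime import datetime, timezone
from typing import Callable

log = logging.getLogger(__name__)


def resolve_end_block(
    end_time: datetime,
    latest_block_number: int,
    latest_block_time: datetime,
    find_block: Callable[[float], int],
) -> int:
    """Return the last block to process for an exclusive end_time."""
    if latest_block_time < end_time:
        # end_time lies beyond finalized data: fall back and say so explicitly.
        log.info(
            "End time %s is after the latest finalized block %d (%s); "
            "using the latest finalized block instead.",
            end_time,
            latest_block_number,
            latest_block_time,
        )
        return latest_block_number
    # end_time is exclusive, so stop one block before the block found for it.
    return find_block(end_time.timestamp()) - 1


latest_time = datetime(2025, 1, 9, tzinfo=timezone.utc)
clamped = resolve_end_block(
    datetime(2025, 2, 1, tzinfo=timezone.utc), 100, latest_time, lambda ts: 999
)
```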

socket-security · 2025-01-09T16:06:55Z

New and removed dependencies detected.

| Package | New capabilities | Transitives | Size | Publisher |
| --- | --- | --- | --- | --- |
| pypi/protobuf@5.27.4 | environment, unsafe | 0 | 1.67 MB | protobuf-packages |
| pypi/psycopg2-binary@2.9.9 | environment, eval, filesystem, network, shell, unsafe | 0 | 1.67 MB | piro |

🚮 Removed packages: pypi/async-timeout@4.0.3, pypi/exceptiongroup@1.2.2

fhenneke · 2025-01-09T17:04:24Z

I recompiled dependencies after adding python-dateutil. This removed some dependencies.

harisang · 2025-01-09T23:34:21Z

src/data_sync/common.py

+    ) + relativedelta(months=1):
+        return [(start_time, end_time)]
+
+    # if there are multiple month to consider


Suggested change

-    # if there are multiple month to consider
+    # if there are multiple months to consider

harisang · 2025-01-09T23:51:55Z

src/data_sync/common.py


 log = set_log(__name__)


+def compute_time_range(


The function name is a bit confusing, since its input is actually the time range, but tbh I don't have a good suggestion for the name.

harisang · 2025-01-10T00:25:51Z

src/data_sync/common.py

+    else:
+        end_block = find_block_with_timestamp(node, end_time.timestamp()) - 1
+
+    return BlockRange(block_from=start_block, block_to=end_block)


Although this could happen only if one is really trying to make things crash, in principle you can get start_block > end_block. Do you think that this can cause a problem?
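One way to make this case fail loudly is to validate the range at construction time. The class below is a hypothetical stand-in that only mirrors the `block_from` and `block_to` fields used in the diff; it is not the project's `BlockRange`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckedBlockRange:
    """Inclusive block range that rejects start > end on construction."""

    block_from: int
    block_to: int

    def __post_init__(self) -> None:
        if self.block_from > self.block_to:
            raise ValueError(
                f"empty block range: block_from={self.block_from} "
                f"> block_to={self.block_to}"
            )


valid = CheckedBlockRange(block_from=100, block_to=200)

# If start_time and end_time both fall before the same block, the lookup
# returns the same block for both, and subtracting one for the exclusive
# end gives block_to = block_from - 1.
try:
    CheckedBlockRange(block_from=100, block_to=99)
    rejected = False
except ValueError:
    rejected = True
```

Whether to raise or to skip such a range silently is a design choice; raising makes misuse visible, while skipping would let overlapping scheduled runs pass without errors.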

fhenneke added 7 commits January 7, 2025 15:12

add time cli arguments

8e9c478

use new functions to compute block ranges

786f8ed

removes old function for computing block ranges

ba8f745

add implementation for multiple month

a608e4b

lint fixes

0cd668c

add function to write to database

11c9414

lint fix

0f45be5

fhenneke changed the title ~~[Draft] Add time range parameters to sync script~~ Add time range parameters to sync script Jan 8, 2025

fhenneke requested review from harisang and bram-vdberg January 8, 2025 15:36

fhenneke marked this pull request as ready for review January 8, 2025 15:37

bh2smith reviewed Jan 8, 2025

bram-vdberg approved these changes Jan 9, 2025

fhenneke mentioned this pull request Jan 9, 2025

Add network info in config #485

Closed

fhenneke added 5 commits January 9, 2025 13:08

simplify main sync function a bit

08615b9

add additional logging

14901f1

add recreate-table command line flag

2b5fb06

adding more logging

c967b64

add dependency on python-dateutils

819f8cf

fhenneke added 2 commits January 9, 2025 17:08

add log message for using latest finalized block

58722bc

add documentation to README

1f01c20

harisang reviewed Jan 9, 2025

harisang reviewed Jan 10, 2025
