Add dynamodb retry config for throttling and other errors. Add exponential backoff and jitter for unprocessed keys. Fix edge case where we successfully process keys on our last attempt but still fail #1023
base: master
Conversation
KaspariK commented Jan 15, 2025 (edited)
- Use Boto3 retries (see Retries - Boto3; a sketch of the config is below)
- Backoff on getting unprocessed keys
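For reference, a minimal sketch of what wiring up the Boto3 retry configuration might look like (the region and max_attempts values here are illustrative, not taken from this PR):

import boto3
from botocore.config import Config

# "standard" retry mode retries throttling and other transient errors with
# exponential backoff; max_attempts is an illustrative value.
boto_config = Config(retries={"max_attempts": 5, "mode": "standard"})

# Hypothetical setup; the real client lives in tron/serialize/runstate/dynamodb_state_store.py.
client = boto3.client("dynamodb", region_name="us-west-1", config=boto_config)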
…ntial backoff and jitter for unprocessed keys. Fix edge case where we successfully process keys on our last attempt but still fail (b361235 to 3e74d75)
log.warning(
    f"Attempt {attempts}/{MAX_UNPROCESSED_KEYS_RETRIES} - Retrying {len(cand_keys_list)} unprocessed keys after {delay:.2f}s delay."
)
time.sleep(delay)
What to do about this lil guy?
!8ball we should use a restore thread
yea, we should probably try to figure out a non-blocking way to do this or have this run in a separate thread - if we get to the worst case of 5 attempts and this is running on the reactor thread, we'll essentially block all of tron from doing anything for 20s
although, actually - this is probably fine since we do all sorts of blocking stuff in restore and aren't expecting tron to be usable/do anything until we've restored everything
...so maybe this is fine?
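For context, one hedged way to do the "separate thread" version with Twisted (restore_off_reactor and the exact restore call are illustrative stand-ins, not code from this PR):

import logging

from twisted.internet import threads

log = logging.getLogger(__name__)

def restore_off_reactor(store, keys):
    # Run the blocking restore (including its time.sleep backoff) in Twisted's
    # thread pool so the reactor thread stays responsive.
    deferred = threads.deferToThread(store.restore, keys)
    deferred.addErrback(lambda failure: log.error("restore failed: %s", failure))
    return deferred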
@@ -294,7 +296,8 @@ def test_delete_item_with_json_partitions(self, store, small_object, large_object):
        vals = store.restore([key])
        assert key not in vals

    def test_retry_saving(self, store, small_object, large_object):
    @mock.patch("time.sleep", return_value=None)
my personal preference is usually to use the context manager way of mocking since that gives a little more control over where a mock is active, but not a blocker :)
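For example, the decorator above could become a with-block so the mock is only active where it's needed (the test body here is just a placeholder):

from unittest import mock

def test_retry_saving(self, store, small_object, large_object):
    # The patch is only active inside the with-block, which makes it clear
    # which part of the test actually relies on time.sleep being mocked.
    with mock.patch("time.sleep", return_value=None) as mock_sleep:
        ...  # exercise the save/retry path here
    # assertions on mock_sleep.call_args_list would go here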
Alright, not what you commented on at all, but upon closer inspection this test isn't really testing much. I'll rewrite this
# We also need to verify that sleep was called with expected delays
expected_delays = []
base_delay_seconds = 0.5
max_delay_seconds = 10
for attempt in range(1, MAX_UNPROCESSED_KEYS_RETRIES + 1):
    expected_delay = min(base_delay_seconds * (2 ** (attempt - 1)), max_delay_seconds)
    expected_delays.append(expected_delay)
actual_delays = [call.args[0] for call in mock_sleep.call_args_list]
assert_equal(actual_delays, expected_delays)
i'd maybe extract the exponential backoff logic in tron/serialize/runstate/dynamodb_state_store.py
to a function so that we can write a more targeted test for that and simplify this to checking if we called that function the right amount of times
(mostly 'cause I generally try to avoid for loops/calculations inside tests :p)
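A sketch of what that extraction and the targeted test could look like (names and the base delay are illustrative; the PR's helper uses a 1s base):

# tron/serialize/runstate/dynamodb_state_store.py (illustrative)
def calculate_backoff_delay(attempt: int, base_delay_seconds: float = 0.5, max_delay_seconds: float = 10.0) -> float:
    # Exponential backoff capped at max_delay_seconds.
    return min(base_delay_seconds * (2 ** (attempt - 1)), max_delay_seconds)

# targeted test (illustrative), leaving the restore test to assert only call counts
import pytest

@pytest.mark.parametrize(
    "attempt,expected",
    [(1, 0.5), (2, 1.0), (3, 2.0), (5, 8.0), (6, 10.0), (10, 10.0)],
)
def test_calculate_backoff_delay(attempt, expected):
    assert calculate_backoff_delay(attempt) == expected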
f"tron_dynamodb_restore_failure: failed to retrieve items with keys \n{failed_keys}\n from dynamodb\n{resp.result()}" | ||
) | ||
raise error | ||
result = resp.result() |
I wonder if we should also print the response when we get into the exception block to also have an idea on why we got unprocessed keys and why we exceeded the attempts
so maybe we add it here
except Exception as e:
    log.exception("Encountered issues retrieving data from DynamoDB")
    raise e
I was hesitant to dump the response because it can get pretty large. After a lot of reading I've landed on logging ResponseMetadata on ClientError. This should capture what we care about
See https://fluffy.yelpcorp.com/i/qWG1tRPrFt40M6pPr3lLkXnCSbJJBFhd.html
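For reference, a sketch of that shape (the surrounding restore logic is simplified; only the ResponseMetadata logging on ClientError is the point):

import logging

from botocore.exceptions import ClientError

log = logging.getLogger(__name__)

def get_result(resp):
    # resp is assumed to be a Future wrapping a batch_get_item call.
    try:
        return resp.result()
    except ClientError as e:
        # ResponseMetadata carries the HTTP status code, request id, and retry
        # count without dumping the (potentially large) full response body.
        log.exception(
            "ClientError while retrieving items from DynamoDB, ResponseMetadata: %s",
            e.response.get("ResponseMetadata", {}),
        )
        raise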
… for UnprocessedKeys retry
…tions and requeue of failed items. Remove backoff from test_retry_reading
…lculate_backoff_delay
generally looks good to me - i left some minor suggestions, but i don't think any of them are blocking atm
assert_equal(mock_failed_read.call_count, 11)
        ) as mock_batch_get_item, mock.patch("time.sleep") as mock_sleep, pytest.raises(Exception) as exec_info:
            store.restore(keys)
        assert "failed to retrieve items with keys" in str(exec_info.value)
we could probably remove this assert if we raised a less generic exception and rely on the pytest.raises(SomeMoreTargetedException)
proving that the right exception was raised
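A sketch of that shape, using a hypothetical DynamoDBRestoreError and the same fixtures as the quoted test (the batch_get_item failure setup is elided):

from unittest import mock

import pytest

# DynamoDBRestoreError is a hypothetical, more specific exception type.
with mock.patch("time.sleep") as mock_sleep, pytest.raises(DynamoDBRestoreError):
    store.restore(keys)
# The exception type itself proves the right failure path; no string matching needed.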
@@ -83,12 +102,19 @@ def chunk_keys(self, keys: Sequence[T]) -> List[Sequence[T]]:
            cand_keys_chunks.append(keys[i : min(len(keys), i + 100)])
        return cand_keys_chunks

    def _calculate_backoff_delay(self, attempt: int) -> int:
this technically doesn't need to be in the class since we're not accessing anything in it (i.e., we never use self)
def _calculate_backoff_delay(self, attempt: int) -> int:
    base_delay_seconds = 1
    max_delay_seconds = 10
    delay: int = min(base_delay_seconds * (2 ** (attempt - 1)), max_delay_seconds)
should we be a little defensive and set attempt to 1 if attempt is <= 0? (to protect against someone passing in a 0th (i.e., first) attempt value if they pass in a pre-increment attempt value)
i guess the attempt=0 case would really just lead to a .5s jitter...which is fine?
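If we did want the guard, it's a one-line clamp at the top of the helper (sketch, module-level per the earlier comment):

def calculate_backoff_delay(attempt: int, base_delay_seconds: float = 1.0, max_delay_seconds: float = 10.0) -> float:
    # Treat attempt <= 0 as the first attempt so a pre-increment value can't
    # shrink the delay below the base.
    attempt = max(attempt, 1)
    return min(base_delay_seconds * (2 ** (attempt - 1)), max_delay_seconds)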
error = Exception(
    f"tron_dynamodb_restore_failure: failed to retrieve items with keys \n{cand_keys_list}\n from dynamodb after {MAX_UNPROCESSED_KEYS_RETRIES} retries."
)
log.error(error)
raise error
i'd maybe do something like
-error = Exception(
-    f"tron_dynamodb_restore_failure: failed to retrieve items with keys \n{cand_keys_list}\n from dynamodb after {MAX_UNPROCESSED_KEYS_RETRIES} retries."
-)
-log.error(error)
-raise error
+msg = f"tron_dynamodb_restore_failure: failed to retrieve items with keys \n{cand_keys_list}\n from dynamodb after {MAX_UNPROCESSED_KEYS_RETRIES} retries."
+log.error(msg)
+raise KeyError(msg)
as it looks a little funky to log an Exception object, and a KeyError would give more info :)
(or we could have a custom exception defined for this)
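If the custom-exception route were taken, it could be as small as this (the name is illustrative; the message mirrors the quoted block):

class DynamoDBRestoreError(Exception):
    # Raised when items cannot be restored from DynamoDB after exhausting retries.
    pass

msg = (
    f"tron_dynamodb_restore_failure: failed to retrieve items with keys \n{cand_keys_list}\n "
    f"from dynamodb after {MAX_UNPROCESSED_KEYS_RETRIES} retries."
)
log.error(msg)
raise DynamoDBRestoreError(msg)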
name="tron.dynamodb.setitem", | ||
delta=time.time() - start, | ||
) | ||
log.error(f"Failed to save partition for key: {key}, error: {repr(e)}") |
i know this is old code being moved around, so you can leave this as-is but
log.error(f"Failed to save partition for key: {key}, error: {repr(e)}") | |
log.exception(f"Failed to save partition for key: {key}") |
would include the full traceback for us automatically :)
although: it looks like there's a behavior change here? might be worth adding a comment here (or in the docstring) that this function will not retry on its own and that it's the caller's responsibility to do so
        raise
if cand_keys_list:
    attempts += 1
    delay = self._calculate_backoff_delay(attempts)
fwiw, I think it's probably fine to rely on the built-in backoff from boto - there shouldn't be anything else touching these dynamo tables other than tron, so we don't really need any jitter to avoid a thundering herd scenario :)
There are two levels of backoff, basically. There is the built-in retry config that catches things like throttling, and then there is our own backoff based on UnprocessedKeys. This seems to be the suggested approach based on the warning in: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb/client/batch_get_item.html
It's not 100% clear to me that the retry config will handle UnprocessedKeys.
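A condensed sketch of how the two layers fit together, following the warning in those docs (the loop structure, table name handling, and delay constants are illustrative, not the PR's exact code):

import time

import boto3
from botocore.config import Config

MAX_UNPROCESSED_KEYS_RETRIES = 5

# Layer 1: boto's retry config transparently retries throttled/failed requests.
client = boto3.client("dynamodb", config=Config(retries={"max_attempts": 5, "mode": "standard"}))

def batch_get_with_retries(table_name, keys):
    remaining = {table_name: {"Keys": keys}}
    items = []
    for attempt in range(1, MAX_UNPROCESSED_KEYS_RETRIES + 1):
        resp = client.batch_get_item(RequestItems=remaining)
        items.extend(resp.get("Responses", {}).get(table_name, []))
        # Layer 2: a successful batch_get_item can still return UnprocessedKeys,
        # which the retry config does not resend for us, so we back off and retry.
        remaining = resp.get("UnprocessedKeys", {})
        if not remaining:
            return items  # covers the "succeeded on the last attempt" edge case
        if attempt < MAX_UNPROCESSED_KEYS_RETRIES:
            time.sleep(min(0.5 * (2 ** (attempt - 1)), 10))
    raise Exception(f"unprocessed keys remain after {MAX_UNPROCESSED_KEYS_RETRIES} retries: {list(remaining)}")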