Fetcher retries #1153

barbarahui · 2024-11-12T21:48:29Z

Implement retries in the base Fetcher class.
Move make_http_request() functionality into ucd_json_fetcher, since this is the only place it was being used.

amywieliczka · 2024-11-13T00:19:58Z

metadata_fetcher/fetchers/Fetcher.py

+        http = requests.Session()
+        retry_strategy = Retry(
+            total=3,
+            status_forcelist=[413, 429, 500, 502, 503, 504],


Why did you remove the backoff factor? That seems useful and relevant.

I also wonder if this makes sense as a global configuration - like in rikolti/utils.py?

Well, I didn't actually remove the backoff factor as there wasn't one configured at all here previously. I figured that we could go with the default backoff factor of 0 and tweak if necessary. But sure, it probably makes sense to set it to 2.

Yeah, we could make this a global configuration. This is actually a copy of configure_http_session() from the content harvester code (which doesn't have a backoff factor set).
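For context on the backoff factor discussed in this thread, here is a rough sketch of the delays it implies. It assumes urllib3's documented formula (no sleep before the first retry, then backoff_factor * 2 ** (consecutive errors - 1), capped at a maximum of 120 seconds) and ignores Retry-After headers and jitter. The helper name backoff_delays is hypothetical and not part of the codebase.

```python
def backoff_delays(total: int, backoff_factor: float, backoff_max: float = 120.0) -> list[float]:
    """Approximate the sleep before each retry, assuming urllib3's documented formula."""
    delays = []
    for consecutive_errors in range(1, total + 1):
        if consecutive_errors <= 1:
            # The first retry is assumed to happen without any delay.
            delays.append(0.0)
        else:
            raw = backoff_factor * 2 ** (consecutive_errors - 1)
            delays.append(float(min(backoff_max, raw)))
    return delays


# Default backoff factor of 0: every retry fires immediately.
print(backoff_delays(total=3, backoff_factor=0))  # [0.0, 0.0, 0.0]
# Backoff factor of 2: later retries wait progressively longer.
print(backoff_delays(total=3, backoff_factor=2))  # [0.0, 4.0, 8.0]
```

Under these assumptions, a factor of 0 retries three times in quick succession, which may not help much with a server that objects to rapid repeated requests.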

amywieliczka · 2024-11-13T00:22:29Z

metadata_fetcher/fetchers/ucd_json_fetcher.py

+        session = requests.Session()
+        retries = Retry(total=3, backoff_factor=2)
+        session.mount("https://", HTTPAdapter(max_retries=retries))
+        response = session.get(url=url)


If you added the backoff_factor to the global configuration for self.http_session, then you should be able to do:

Suggested change

-        response = session.get(url=url)
+        response = self.http_session.get(url=url)

And get rid of lines 93-95 here.

Sure, self.http_session adds some status codes to the forcelist and attaches the retry strategy to both http and https, but that seems fine?

Yeah, I thought about that too. I figured to introduce the least amount of change possible, but you're right, it probably would be fine.

Mmmm, that makes sense - what you did change (Fetcher.fetch_page()) is the function that is called like 95% of the time though, so it seemed confusing to me to change it most of the way but not all of the way. Could result in weird errors on strange edge cases where we've had to implement additional requests outside of the standard requests managed by the Fetcher base class.

amywieliczka · 2024-11-13T00:32:58Z

Looks like nuxeo_fetcher, oac_fetcher, and ucd_json_fetcher all make their own calls to requests.get (at least in the fetcher) so you might want to update those other instances in nuxeo_fetcher and oac_fetcher to use self.http_session.get instead.

Oh, yeah and a global search for Retry across the codebase is actually pretty demonstrative - the Retry configuration is the same across several mappers and a fetcher (islandora_mapper, contentdm_mapper, flickr_fetcher)

Since it does look like we use the same configuration everywhere (with the exception of this one discrepancy on backoff_factor and status_forcelist), I think adding it to a rikolti/utils.py makes a whole lot of sense, unless you can think of some reason why we would want different configuration in these different places?
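A minimal sketch of what such a shared helper might look like, assuming it lives in rikolti/utils.py and reuses the configure_http_session() name from the content harvester. The backoff_factor value, the simplified Fetcher stand-in, and the ExampleFetcher class and its fetch_extra method are assumptions for illustration, not the merged implementation.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def configure_http_session() -> requests.Session:
    """Return a Session that retries failed requests (hypothetical shared helper)."""
    http = requests.Session()
    retry_strategy = Retry(
        total=3,
        backoff_factor=2,  # assumed value, per the discussion above
        status_forcelist=[413, 429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    # Attach the same retry strategy to both schemes.
    http.mount("https://", adapter)
    http.mount("http://", adapter)
    return http


class Fetcher:
    """Simplified stand-in for the base Fetcher; the real class holds more state."""

    def __init__(self):
        self.http_session = configure_http_session()


class ExampleFetcher(Fetcher):
    """Hypothetical subclass that makes an additional request of its own."""

    def fetch_extra(self, url: str) -> requests.Response:
        # Go through the shared session so this request also gets retries.
        return self.http_session.get(url=url)
```

With something like this in place, subclasses that currently call requests.get directly could switch to self.http_session.get and pick up the same retry behavior as Fetcher.fetch_page().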

barbarahui · 2024-11-13T00:48:37Z

@amywieliczka Sure yeah, I think it makes sense to do this. I can do the work to put this in place. I don't think it'll take too long--I was just trying to do this quickly so that I could focus on the Nuxeo API issue.

amywieliczka · 2024-11-13T01:52:27Z

Figured I could do this quickly while @barbarahui's head was in Nuxeo API issues.
Resolved my own asks - @barbarahui I'll leave this open in case you want to take a look, but as far as I'm concerned it's good to go - rebase & merge when you get to it.

I did leave us using the default requests session for all requests to our own Registry API, as well as requests to the OpenSearch API in the record_indexer.

barbarahui · 2024-11-13T17:52:41Z

@amywieliczka this looks great, thank you so much!!!!

barbarahui added 2 commits November 12, 2024 13:37

Configure retries for base Fetcher

9493b17

Move requests session config into ucd json fetcher

e18a2d5

barbarahui requested a review from amywieliczka as a code owner November 12, 2024 21:48

amywieliczka requested changes Nov 13, 2024

amywieliczka added 3 commits November 12, 2024 17:32

Update metadata_fetcher to use retries

8d3eac0

Use global retry config for metadata_mapper requests

70183c9

Use global retry config for content_harvester requests

46730d6

barbarahui merged commit dfc396b into main Nov 13, 2024
2 checks passed

barbarahui deleted the fetcher-retries branch November 13, 2024 17:53

barbarahui mentioned this pull request Nov 13, 2024

TIND validate_by_mapper's throws 503 errors when fetching batch jobs; possibly does not like being hit in quick succession (running individual collections works fine) - implement retries globally for the fetcher #1150

Closed
