Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetcher retries #1153

Merged
merged 5 commits into from
Nov 13, 2024
Merged

Fetcher retries #1153

merged 5 commits into from
Nov 13, 2024

Conversation

barbarahui
Copy link
Collaborator

  • implement retries in base Fetcher class
  • move make_http_request() functionality into ucd_json_fetcher -- this is the only place this was being used

http = requests.Session()
retry_strategy = Retry(
total=3,
status_forcelist=[413, 429, 500, 502, 503, 504],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove the back off factor - that seems useful and relevant?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wonder if this makes sense as a global configuration - like in rikolti/utils.py?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I didn't actually remove the backoff factor as there wasn't one configured at all here previously. I figured that we could go with the default backoff factor of 0 and tweak if necessary. But sure, it probably makes sense to set it to 2.

Yeah, we could make this a global configuration. This is actually a copy of configure_http_session() from the content harvester code (which doesn't have a backoff factor set).

session = requests.Session()
retries = Retry(total=3, backoff_factor=2)
session.mount("https://", HTTPAdapter(max_retries=retries))
response = session.get(url=url)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you added the backoff_factor to the gloabl configuration for self.http_session, then you should be able to do:

Suggested change
response = session.get(url=url)
response = self.http_session.get(url=url)

And get rid of lines 93-95 here.

Sure, self.http_session adds some status codes to the forcelist and attached the retry strategy to both http and https, but that seems fine?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about that too. I figured to introduce the least amount of change possible, but you're right, it probably would be fine.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmmm, that makes sense - what you did change (Fetcher.fetch_page()) is the function that is called like 95% of the time though, so it seemed confusing to me to change it most of the way but not all of the way. Could result in weird errors on strange edge cases where we've had to implement additional requests outside of the standard requests managed by the Fetcher base class.

@amywieliczka
Copy link
Collaborator

amywieliczka commented Nov 13, 2024

Looks like nuxeo_fetcher, oac_fetcher, and ucd_json_fetcher all make their own calls to requests.get (at least in the fetcher) so you might want to update those other instances in nuxeo_fetcher and oac_fetcher to use self.http_session.get instead.

Oh, yeah and a global search for Retry across the codebase is actually pretty demonstrative - the Retry configuration is the same across several mappers and a fetcher (islandora_mapper, contentdm_mapper, flickr_fetcher)

Since it does look like we use the same configuration everywhere (with the exception of this one discrepancy on backoff_factor and status_forcelist), I think adding it to a rikolti/utils.py makes a whole lot of sense, unless you can think of some reason why we would want different configuration in these different places?

@barbarahui
Copy link
Collaborator Author

@amywieliczka Sure yeah, I think it makes sense to do this. I can do the work to put this in place. I don't think it'll take too long--I was just trying to do this quickly so that I could focus on the Nuxeo API issue.

@amywieliczka
Copy link
Collaborator

amywieliczka commented Nov 13, 2024

Figured I could do this quickly while @barbarahui's head was in Nuxeo API issues.
Resolved my own asks - @barbarahui I'll leave this open in case you want to take a look, but as far as I'm concerned it's good to go - rebase & merge when you get to it.

I did leave us using the default requests session for all requests to our own Registry API, as well as requests to the OpenSearch API in the record_indexer.

@barbarahui
Copy link
Collaborator Author

@amywieliczka this looks great, thank you so much!!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants