-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make User-Agent a default header #3828
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -124,7 +124,8 @@ | |||
|
|||
# User-Agent header for APIs that require it | |||
CONTACT_EMAIL = os.getenv("CONTACT_EMAIL") | |||
UA_STRING = f"Openverse/0.1 (https://wordpress.org/openverse; {CONTACT_EMAIL})" | |||
CANONICAL_ORIGIN = os.getenv("CANONICAL_ORIGIN", "https://openverse.org") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if we want to match the API's new variables, it should be an environment variable of CANONICAL_DOMAIN
with origin derived from it, but I don't know if that was a goal with this PR. It doesn't matter to me either way, as there's so little overlap between the variables for each I don't think we need to be concerned with having them match except when the names are this similar, it would be easier to make a quick mistake.
To clarify, not a suggestion for a change, just noting the slight difference, will at least help me keep it in mind during infra work to migrate Airflow's environment variables to the new approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I intended to make the change aligned with your on moving the domain :) I thought the CANONICAL_DOMAIN
was a variable we could skip here, but you're right. It would be better to match the API's variables in this case.
Done ✅
5412660
to
b8eccfb
Compare
self._DELAY = delay | ||
self.headers = headers or {} | ||
self.headers = {"User-Agent": prov.UA_STRING} | headers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we default here in addition to adding this as a default in the ProviderDataIngester?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the suggestion of the issue, and I went that route first, but it turns out it wasn't enough to set the default header. I leave it after updating the ProviderDataIngester
as it will be better if each request from the Calalog goes with this header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😮 That's so strange -- do you mean setting it as the default in DelayedRequester
did not cover all the existing cases? Are we making requests that don't use the requester somewhere?
If it can be added in only one place, I do agree about adding it in the requester instead of the ProviderDataIngester, since that gets complicated with subclasses that override __init__
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean setting it as the default in DelayedRequester did not cover all the existing cases?
Yes, the thing is the ProviderDataIngester
class passes each time its headers (previously an empty dictionary) to the DelayedRequester.get_response_json
method (for the possibility of overriding them), so it wasn't enough to add them to the DelayedRequester
only.
Are we making requests that don't use the requester somewhere?
It's a possibility, I haven't checked but this isn't the issue here, as I explained above. We should be safe with these changes!
b8eccfb
to
c4c7f39
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of questions but nothing blocking. LGTM and all DAGs work as expected 🚀
CANONICAL_DOMAIN: str = os.getenv("CANONICAL_DOMAIN", "openverse.org") | ||
|
||
_proto = "http" if "localhost" in CANONICAL_DOMAIN else "https" | ||
CANONICAL_ORIGIN: str = f"{_proto}://{CANONICAL_DOMAIN}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just for consistency with the api? Would we ever actually use a domain with "localhost" in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this arose from my conversation with Sara here. We opted for keeping consistency with the API in any case.
self._DELAY = delay | ||
self.headers = headers or {} | ||
self.headers = {"User-Agent": prov.UA_STRING} | headers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😮 That's so strange -- do you mean setting it as the default in DelayedRequester
did not cover all the existing cases? Are we making requests that don't use the requester somewhere?
If it can be added in only one place, I do agree about adding it in the requester instead of the ProviderDataIngester, since that gets complicated with subclasses that override __init__
etc.
Fixes
Fixes #1362 by @krysal
Related to #2037 (Airflow work)
Description
Resuming the work done previous to the monorepo. This PR adds the User-Agent as the default header for all requests in the
DelayedRequester
andProviderDataIngester
classes, making it unnecessary to repeat code in providers DAGs, so it was removed from museum_victoria, nappy, rawpixel, stocksnap, and wikimedia_commons.I added a new
CANONICAL_ORIGIN
reading it from the os environment to replace the old domain referenced in the UA string. Is it better to read it from an Airflow variable? CC @AetherUnbound.Testing Instructions
Run one of the above-mentioned DAGs, for example rawpixel, to confirm they continue to work. See the code and updated tests and confirm it all makes sense.
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin