Make User-Agent a default header #3828

Merged
krysal merged 6 commits into main from update/ua_string
Feb 28, 2024

@krysal krysal commented Feb 26, 2024

Fixes

Fixes #1362 by @krysal
Related to #2037 (Airflow work)

Description

Resuming work started before the move to the monorepo. This PR adds the User-Agent as a default header for all requests made through the DelayedRequester and ProviderDataIngester classes. That makes it unnecessary to repeat the header in individual provider DAGs, so it was removed from museum_victoria, nappy, rawpixel, stocksnap, and wikimedia_commons.
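
For readers skimming the thread, here is a minimal sketch of the default-header idea. It is not the actual Openverse code: the class is a simplified stand-in, the `UA_STRING` value is a placeholder, and the `headers or {}` guard is an assumption added so the merge also works when no headers are passed.

```python
from typing import Optional

# Placeholder value for illustration; the real string is built from settings.
UA_STRING = "Openverse/0.1 (https://openverse.org; contact@example.com)"


class DelayedRequester:
    """Simplified stand-in that only models header handling."""

    def __init__(self, delay: int = 0, headers: Optional[dict] = None):
        self._DELAY = delay
        # Start from the default User-Agent and merge caller headers on top.
        # With the dict union operator, keys from the right-hand side win.
        self.headers = {"User-Agent": UA_STRING} | (headers or {})


default = DelayedRequester()
custom = DelayedRequester(headers={"Accept": "application/json"})
overridden = DelayedRequester(headers={"User-Agent": "custom/1.0"})
```

Because the right-hand operand takes precedence, a provider that needs a different User-Agent can still supply its own.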

I added a new CANONICAL_ORIGIN setting, read from the OS environment, to replace the old domain referenced in the UA string. Would it be better to read it from an Airflow variable? CC @AetherUnbound.

Testing Instructions

Run one of the above-mentioned DAGs, for example rawpixel, to confirm they continue to work. See the code and updated tests and confirm it all makes sense.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@krysal krysal added the 🟩 priority: low, ✨ goal: improvement, 💻 aspect: code, and 🧱 stack: catalog labels Feb 26, 2024
@krysal krysal requested a review from a team as a code owner February 26, 2024 18:49
@krysal krysal requested review from obulat and stacimc February 26, 2024 18:49
@openverse-bot openverse-bot added the 🕹 aspect: interface Concerns end-users' experience with the software label Feb 26, 2024
Collaborator

@sarayourfriend sarayourfriend left a comment

LGTM!

@@ -124,7 +124,8 @@

# User-Agent header for APIs that require it
CONTACT_EMAIL = os.getenv("CONTACT_EMAIL")
UA_STRING = f"Openverse/0.1 (https://wordpress.org/openverse; {CONTACT_EMAIL})"
CANONICAL_ORIGIN = os.getenv("CANONICAL_ORIGIN", "https://openverse.org")
Collaborator

I guess if we want to match the API's new variables, it should be an environment variable named CANONICAL_DOMAIN, with the origin derived from it, but I don't know if that was a goal of this PR. It doesn't matter to me either way. There's so little overlap between the two sets of variables that I don't think we need to worry about them matching, except that when the names are this similar it is easy to make a quick mistake.

To clarify, this is not a suggestion for a change, just a note on the slight difference. It will at least help me keep it in mind during the infra work to migrate Airflow's environment variables to the new approach.

Member Author

Yes, I intended to align this change with your work on moving the domain :) I thought CANONICAL_DOMAIN was a variable we could skip here, but you're right. It would be better to match the API's variables in this case.

Done ✅

@krysal krysal force-pushed the update/ua_string branch 2 times, most recently from 5412660 to b8eccfb Compare February 26, 2024 21:05
self._DELAY = delay
self.headers = headers or {}
self.headers = {"User-Agent": prov.UA_STRING} | headers
Collaborator

Why do we default here in addition to adding this as a default in the ProviderDataIngester?

Member Author

That was the suggestion in the issue, and I went that route first, but it turned out that setting the default header there wasn't enough. I left it in place after updating the ProviderDataIngester, since it is better if every request from the Catalog carries this header.

Collaborator

😮 That's so strange -- do you mean setting it as the default in DelayedRequester did not cover all the existing cases? Are we making requests that don't use the requester somewhere?

If it can be added in only one place, I do agree about adding it in the requester instead of the ProviderDataIngester, since that gets complicated with subclasses that override __init__ etc.

Member Author

@krysal krysal Feb 26, 2024

> do you mean setting it as the default in DelayedRequester did not cover all the existing cases?

Yes. The ProviderDataIngester class passes its own headers (previously an empty dictionary) to the DelayedRequester.get_response_json method on every call, so that they can be overridden. Those per-call headers take the place of the requester's defaults, so adding the User-Agent to the DelayedRequester alone wasn't enough.

> Are we making requests that don't use the requester somewhere?

It's a possibility. I haven't checked, but that isn't the issue here, as I explained above. We should be safe with these changes!
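
The interaction described in this reply can be sketched roughly as follows. This is a simplified model, not the real classes: `Requester`, `Ingester`, `effective_headers`, `request_headers`, and the `default_ua` flag are hypothetical names used only to show how per-call headers can hide the requester's defaults.

```python
# Placeholder value for illustration only.
UA_STRING = "Openverse/0.1 (placeholder)"


class Requester:
    """Hypothetical stand-in for the requester's header handling."""

    def __init__(self, headers=None):
        self.headers = {"User-Agent": UA_STRING} | (headers or {})

    def effective_headers(self, **kwargs):
        # Per-call headers, when given, replace the instance defaults entirely.
        return kwargs.get("headers", self.headers)


class Ingester:
    """Hypothetical stand-in for an ingester that forwards its own headers."""

    def __init__(self, headers=None, default_ua=False):
        base = {"User-Agent": UA_STRING} if default_ua else {}
        self.headers = base | (headers or {})
        self.requester = Requester()

    def request_headers(self):
        # The ingester passes its own headers on every call.
        return self.requester.effective_headers(headers=self.headers)


before = Ingester().request_headers()
after = Ingester(default_ua=True).request_headers()
```

In this model, `before` is an empty dictionary even though the requester has a default User-Agent, which is why the default also has to exist on the ingester side.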

Collaborator

@stacimc stacimc left a comment

Couple of questions but nothing blocking. LGTM and all DAGs work as expected 🚀

CANONICAL_DOMAIN: str = os.getenv("CANONICAL_DOMAIN", "openverse.org")

_proto = "http" if "localhost" in CANONICAL_DOMAIN else "https"
CANONICAL_ORIGIN: str = f"{_proto}://{CANONICAL_DOMAIN}"
Collaborator

Is this just for consistency with the API? Would we ever actually use a domain containing "localhost" in this context?

Member Author

Yes, this arose from my conversation with Sara here. We opted for keeping consistency with the API in any case.
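
As a rough illustration of how the pieces discussed here could fit together, the sketch below wraps the protocol selection in a function and builds a UA string from the result. The helper names `canonical_origin` and `build_ua_string` are hypothetical, and the UA format is an assumption that reuses the old string's layout with the origin swapped in; the merged code may differ.

```python
import os


def canonical_origin(domain: str) -> str:
    """Derive an origin from a bare domain, using http only for localhost."""
    proto = "http" if "localhost" in domain else "https"
    return f"{proto}://{domain}"


def build_ua_string(origin: str, contact_email: str) -> str:
    """Assumed format: the old UA layout with the origin substituted."""
    return f"Openverse/0.1 ({origin}; {contact_email})"


# Read the same environment variables mentioned in the diff.
CANONICAL_DOMAIN = os.getenv("CANONICAL_DOMAIN", "openverse.org")
CONTACT_EMAIL = os.getenv("CONTACT_EMAIL")
UA_STRING = build_ua_string(canonical_origin(CANONICAL_DOMAIN), CONTACT_EMAIL)
```

Keeping the derivation in a small function makes the localhost branch easy to check in a unit test without touching the process environment.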

@krysal krysal merged commit 6636dcf into main Feb 28, 2024
43 checks passed
@krysal krysal deleted the update/ua_string branch February 28, 2024 14:01

Successfully merging this pull request may close these issues.

Include UA string on every request made by the DelayedRequester
4 participants