Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make User-Agent a default header #3828

Merged
merged 6 commits into from
Feb 28, 2024
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion catalog/dags/common/loader/provider_details.py
Original file line number Diff line number Diff line change
Expand Up @@ -124,7 +124,8 @@

# User-Agent header for APIs that require it
CONTACT_EMAIL = os.getenv("CONTACT_EMAIL")
UA_STRING = f"Openverse/0.1 (https://wordpress.org/openverse; {CONTACT_EMAIL})"
CANONICAL_ORIGIN = os.getenv("CANONICAL_ORIGIN", "https://openverse.org")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess if we want to match the API's new variables, it should be an environment variable of CANONICAL_DOMAIN with origin derived from it, but I don't know if that was a goal with this PR. It doesn't matter to me either way, as there's so little overlap between the variables for each I don't think we need to be concerned with having them match except when the names are this similar, it would be easier to make a quick mistake.

To clarify, not a suggestion for a change, just noting the slight difference, will at least help me keep it in mind during infra work to migrate Airflow's environment variables to the new approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I intended to make the change aligned with your on moving the domain :) I thought the CANONICAL_DOMAIN was a variable we could skip here, but you're right. It would be better to match the API's variables in this case.

Done ✅

UA_STRING = f"Openverse/0.1 ({CANONICAL_ORIGIN}; {CONTACT_EMAIL})"


# Available Image Categories for API
Expand Down
8 changes: 5 additions & 3 deletions catalog/dags/common/requester.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
from requests.exceptions import JSONDecodeError

import oauth2
from common.loader import provider_details as prov


# pytest_socket will not be available in production, so we must create a shim for
Expand Down Expand Up @@ -42,12 +43,13 @@ class DelayedRequester:
delay: an integer giving the minimum number of seconds to wait
between consecutive requests via the `get` method.
headers: a dict that will be passed in all requests, unless overridden
by kwargs in specific calls to the get method
by kwargs in specific calls to the `get` method
"""

def __init__(self, delay=0, headers=None):
def __init__(self, delay: int = 0, headers: dict | None = None):
headers = {} if headers is None else headers
self._DELAY = delay
self.headers = headers or {}
self.headers = {"User-Agent": prov.UA_STRING} | headers
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we default here in addition to adding this as a default in the ProviderDataIngester?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the suggestion of the issue, and I went that route first, but it turns out it wasn't enough to set the default header. I leave it after updating the ProviderDataIngester as it will be better if each request from the Calalog goes with this header.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😮 That's so strange -- do you mean setting it as the default in DelayedRequester did not cover all the existing cases? Are we making requests that don't use the requester somewhere?

If it can be added in only one place, I do agree about adding it in the requester instead of the ProviderDataIngester, since that gets complicated with subclasses that override __init__ etc.

Copy link
Member Author

@krysal krysal Feb 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean setting it as the default in DelayedRequester did not cover all the existing cases?

Yes, the thing is the ProviderDataIngester class passes each time its headers (previously an empty dictionary) to the DelayedRequester.get_response_json method (for the possibility of overriding them), so it wasn't enough to add them to the DelayedRequester only.

Are we making requests that don't use the requester somewhere?

It's a possibility, I haven't checked but this isn't the issue here, as I explained above. We should be safe with these changes!

self._last_request = 0
self.session = requests.Session()

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ class ImageDetails(TypedDict, total=False):
class VictoriaDataIngester(ProviderDataIngester):
providers = {"image": prov.VICTORIA_DEFAULT_PROVIDER}
endpoint = "https://collections.museumsvictoria.com.au/api/search"
headers = {"User-Agent": prov.UA_STRING, "Accept": "application/json"}
headers = {"Accept": "application/json"}
batch_limit = 100
delay = 5
LANDING_PAGE = "https://collections.museumsvictoria.com.au/"
Expand Down
2 changes: 1 addition & 1 deletion catalog/dags/providers/provider_api_scripts/nappy.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@
class NappyDataIngester(ProviderDataIngester):
providers = {constants.IMAGE: prov.NAPPY_DEFAULT_PROVIDER}
endpoint = "https://api.nappy.co/v1/openverse/images"
headers = {"User-Agent": prov.UA_STRING, "Accept": "application/json"}
headers = {"Accept": "application/json"}

# Hardcoded to CC0, the only license Nappy.co uses
license_info = get_license_info(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
from airflow.exceptions import AirflowException
from airflow.models import Variable

from common.loader import provider_details as prov
from common.requester import DelayedRequester
from common.storage.media import MediaStore
from common.storage.util import get_media_store_class
Expand Down Expand Up @@ -145,6 +146,9 @@ def __init__(
# Keep track of number of records ingested
self.record_count = 0

# Set default headers
self.headers = {"User-Agent": prov.UA_STRING} | self.headers

# Initialize the DelayedRequester and all necessary Media Stores.
self.delayed_requester = DelayedRequester(
delay=self.delay, headers=self.headers
Expand Down
1 change: 0 additions & 1 deletion catalog/dags/providers/provider_api_scripts/rawpixel.py
Original file line number Diff line number Diff line change
Expand Up @@ -89,7 +89,6 @@ class RawpixelDataIngester(ProviderDataIngester):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.api_key: str = Variable.get("API_KEY_RAWPIXEL")
self.headers = {"User-Agent": prov.UA_STRING}

def get_media_type(self, record: dict) -> str:
return constants.IMAGE
Expand Down
5 changes: 1 addition & 4 deletions catalog/dags/providers/provider_api_scripts/stocksnap.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,10 +31,7 @@ class StockSnapDataIngester(ProviderDataIngester):
providers = {"image": prov.STOCKSNAP_DEFAULT_PROVIDER}
batch_limit = 1000
delay = 1 # in seconds
headers = {
"Accept": "application/json",
"User-Agent": prov.UA_STRING,
}
headers = {"Accept": "application/json"}
license_url = "https://creativecommons.org/publicdomain/zero/1.0/"
license_info = get_license_info(license_url=license_url)

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -130,7 +130,6 @@ class WikimediaCommonsDataIngester(ProviderDataIngester):
}
host = "commons.wikimedia.org"
endpoint = f"https://{host}/w/api.php"
headers = {"User-Agent": prov.UA_STRING}

# The 10000 is a bit arbitrary, but needs to be larger than the mean
# number of uses per file (globally) in the response_json, or we will
Expand Down
12 changes: 10 additions & 2 deletions catalog/tests/dags/common/test_requester.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,10 @@

from catalog.tests.dags.conftest import FAKE_OAUTH_PROVIDER_NAME
from common import requester
from common.loader import provider_details as prov


USER_AGENT = {"User-Agent": prov.UA_STRING}


@patch("common.requester.time")
Expand Down Expand Up @@ -155,8 +159,12 @@ def test_oauth_requester_initializes_correctly(oauth_provider_var_mock):
@pytest.mark.parametrize(
"init_headers, request_kwargs, expected_request_kwargs",
[
(None, None, {"headers": {}}),
({"init_header": "test"}, None, {"headers": {"init_header": "test"}}),
(None, None, {"headers": USER_AGENT}),
(
{"init_header": "test"},
None,
{"headers": {"init_header": "test"} | USER_AGENT},
),
(
None,
{"headers": {"h1": "test1"}, "other_kwarg": "test"},
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
MockImageOnlyProviderDataIngester,
MockProviderDataIngester,
)
from common.loader import provider_details as prov
from common.requester import RetriesExceeded
from common.storage.audio import AudioStore, MockAudioStore
from common.storage.image import ImageStore, MockImageStore
Expand Down Expand Up @@ -92,6 +93,15 @@ def test_get_response_json(endpoint, expected):
assert actual_endpoint == expected


def test_passes_user_agent_header():
ingester = MockProviderDataIngester()
with patch.object(ingester.delayed_requester, "get_response_json") as mock_get:
ingester.get_response_json({})
actual_headers = mock_get.call_args.kwargs["headers"]
assert "User-Agent" in actual_headers
assert actual_headers["User-Agent"] == prov.UA_STRING


def test_batch_limit_is_capped_to_ingestion_limit():
with patch(
"providers.provider_api_scripts.provider_data_ingester.Variable"
Expand Down
Loading