Europeana images may change their direct URL, which causes broken images to be displayed in Openverse #3772

Open
krysal opened this issue Feb 8, 2024 · 9 comments
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API 🧱 stack: catalog Related to the catalog and Airflow DAGs ⛔ status: blocked Blocked & therefore, not ready for work 🐛 tooling: sentry Sentry issue

Comments

@krysal
Member

krysal commented Feb 8, 2024

Description

Apparently, some Europeana images can change their direct link while remaining available through their landing page. This is a problem for us because it seems the Data Refresh process is not updating this value (I haven't confirmed it).

Observe this image for example: https://api.openverse.engineering/v1/images/f8c86a20-eb9c-4ffc-9a06-3664151dbce6/

{
    "id": "f8c86a20-eb9c-4ffc-9a06-3664151dbce6",
    "title": "varrukad, hame, naiste",
    "indexed_on": "2022-11-20T17:08:56.418834Z",
    "foreign_landing_url": "https://www.muis.ee/museaalView/534165",
    "url": "https://www.muis.ee/digitaalhoidla/api/meedia/originaal?id=7c0829e9-1731-4ad1-894f-7980bb09f3c7",
    "creator": null,
    "creator_url": null,
    "license": "cc0",
    "license_version": "1.0",
    "license_url": "https://creativecommons.org/publicdomain/zero/1.0/",
    "provider": "europeana",
    "source": "europeana",
    "category": null,
    "filesize": null,
    "filetype": null,
    "tags": [],
    "attribution": "\"varrukad, hame, naiste\" is marked with CC0 1.0. To view the terms, visit https://creativecommons.org/publicdomain/zero/1.0/.",
    "fields_matched": [],
    "mature": false,
    "height": null,
    "width": null,
    "thumbnail": "https://api.openverse.engineering/v1/images/f8c86a20-eb9c-4ffc-9a06-3664151dbce6/thumb/",
    "detail_url": "https://api.openverse.engineering/v1/images/f8c86a20-eb9c-4ffc-9a06-3664151dbce6/",
    "related_url": "https://api.openverse.engineering/v1/images/f8c86a20-eb9c-4ffc-9a06-3664151dbce6/related/",
    "unstable__sensitivity": []
}
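
A minimal sketch of how a record like this might be flagged automatically. It assumes (this is not confirmed in the thread) that a stale direct link either fails outright, serves a non-image content type, or redirects to a placeholder whose path contains `not_found`. The function name and all three heuristics are hypothetical; the inputs would come from whatever HTTP client requests the record's `url`.

```python
from urllib.parse import urlsplit


def looks_broken(final_url: str, status_code: int, content_type: str) -> bool:
    """Heuristically decide whether a direct image URL is dead.

    All three checks are assumptions made for illustration, not
    behaviour confirmed for muis.ee or Europeana.
    """
    # A non-200 final status is treated as a dead link.
    if status_code != 200:
        return True
    # A response that is not an image cannot be rendered as one.
    if not content_type.lower().startswith("image/"):
        return True
    # Assumed placeholder naming: a redirect target containing "not_found".
    return "not_found" in urlsplit(final_url).path.lower()
```

The function takes the final URL after redirects rather than the requested one, since a placeholder image could otherwise pass the status and content-type checks.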

Reproduction

  1. Go to https://openverse.org/search/image?q=varrukad,%20hame,%20naiste
  2. See all images found have broken thumbnails.

Screenshots

[Screenshot: CleanShot 2024-02-08 at 17 17 39@2x]

Additional context

Sentry issue.

@krysal krysal added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🐛 tooling: sentry Sentry issue 🧱 stack: api Related to the Django API 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Feb 8, 2024
@openverse-bot openverse-bot moved this to 📋 Backlog in Openverse Backlog Feb 8, 2024
@AetherUnbound
Collaborator

AetherUnbound commented Feb 9, 2024

Since Europeana is an aggregator, I suspect that all of the images from this particular source might have been affected (given they're all producing the same not_found.txt thumbnail: https://www.muis.ee/digitaalhoidla/api/meedia/originaal?id=7c0829e9-1731-4ad1-894f-7980bb09f3c7).

I've run the following to see how pervasive this issue is:

deploy@localhost:openledger> select count(*) from image where provider='europeana' and STARTS_WITH(url, 'https://www.muis.ee');
+--------+
| count  |
|--------|
| 143600 |
+--------+
SELECT 1
Time: 313.911s (5 minutes 13 seconds), executed in: 313.905s (5 minutes 13 seconds)

This seems like something that could be addressed in a batched update, if we could figure out how to correct the URLs!
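
For a smaller export of URLs, the same kind of count can be sketched in Python by grouping on host name. This is only an illustration: `count_by_host` is a hypothetical helper, and the `example.org` entry is made up to show a second host.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_host(urls):
    """Count how many URLs point at each host name."""
    return Counter(urlsplit(url).hostname for url in urls)


sample = [
    "https://www.muis.ee/digitaalhoidla/api/meedia/originaal?id=7c0829e9-1731-4ad1-894f-7980bb09f3c7",
    "https://www.muis.ee/museaalView/534165",
    "https://example.org/image.jpg",  # hypothetical record from another host
]
counts = count_by_host(sample)
```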

@AetherUnbound
Collaborator

Diving into the result above, it looks like all of the related URLs differ now:

Because these are all unique UUIDs, it doesn't look like we can derive those values in a way that could be updated using the batched update 😞 Maybe the best option would be to use the additional_query_parameters added in #3648 to select only images from this domain (Estonian National Museum) and reingest those specifically to get the new URLs? What do you think @WordPress/openverse-catalog?

@stacimc
Collaborator

stacimc commented Feb 15, 2024

Maybe the best option would be to use the additional_query_parameters added in #3648 to select only images from this domain (Estonian National Museum) and reingest those specifically to get the new URLs?

Following up in this thread from an in-person conversation: I think this sounds good, but noting that because Europeana does not have a traditional reingestion DAG we'd want to look into whether there's a reasonable range of dates we could re-run the DAG for to cover all images from this domain.

@AetherUnbound
Collaborator

I believe I've found a suitable additional_query_parameters value that will allow us to select only the Estonian National Museum data! Currently the dated portion of the DAG configuration goes directly into the query field, which is exactly the field that we can override with additional_query_parameters. That means it doesn't really matter for us that the DAG is dated in this case 😄 I tested an API call with the following and it seemed to work. I'm now running a locally triggered DAG with these values and will share whether that works.

additional_query_parameters override: {'query': 'DATA_PROVIDER:("Estonian National Museum")'}
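
A minimal sketch of why the override sidesteps the date range, assuming the extra parameters are merged over the defaults so that same-named keys are replaced. `build_query_params` is a hypothetical helper, not the catalog's actual code, and the base `query` value is a placeholder rather than the real dated query.

```python
def build_query_params(base_params, additional_query_parameters=None):
    """Return a new dict where overrides replace same-named base keys."""
    return {**base_params, **(additional_query_parameters or {})}


base = {
    # Hypothetical stand-in for the dated query a scheduled run would build.
    "query": "<date-range query for the DAG run>",
    "rows": "100",
}
override = {"query": 'DATA_PROVIDER:("Estonian National Museum")'}
params = build_query_params(base, override)
```

Because the date filter lives only in `query`, replacing that one key drops the date restriction while leaving every other parameter untouched.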

@AetherUnbound
Collaborator

Confirmed that that should work! I ran this locally and ingested 250 records, all of which were from the Estonian National Museum. We should be able to run this triggered DAG next week!

openledger> select count(*) from image where provider='europeana';
+-------+
| count |
|-------|
| 250   |
+-------+
SELECT 1
Time: 0.023s
openledger> select identifier, meta_data from image where provider='europeana' limit 10;
+--------------------------------------+-----------+
| identifier                           | meta_data |
|--------------------------------------+-----------|
| ce8a15ad-712f-4434-8b9c-d97b89b8f7a8 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 1697095d-74b9-46c9-92d7-1f7443a87b90 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| a136209d-eb0d-44ea-9e68-7e27143e1581 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 196db965-be4c-4944-bcea-61c47592b4a4 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 07186a32-4e56-49c1-bc1c-c69cbcb03448 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| ec680b1a-29af-4138-a05e-6e5e3eb1ce55 | {"country": ["Estonia"], "description": "sündmuse kommentaar: Eesti Apostliku Õigeusu kiriku Tartu Püha Aleksandri kogudus", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 4b2cca42-80f4-412e-b355-1c3efa06aa3b | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 5fe5cf4d-30cc-46f4-85dd-6420ac7b04c2 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 84d18904-f90c-41d9-ace6-279b9a4e946e | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| ce2ebde1-c4fd-418e-b343-d191ea984b14 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
+--------------------------------------+-----------+
SELECT 10

@AetherUnbound AetherUnbound self-assigned this Feb 26, 2024
@openverse-bot openverse-bot moved this from 📋 Backlog to 📅 To Do in Openverse Backlog Feb 26, 2024
@AetherUnbound
Collaborator

The ingestion completed (DAG run link), but only ingested 21,020 records 😕 We did get a data refresh after that, but even with the updated record the primary URL is still showing the "not found" thumbnail 😖 https://api.openverse.engineering/v1/images/f8c86a20-eb9c-4ffc-9a06-3664151dbce6/

I'm going to try and see if I can isolate this exact result in a query, and see if Europeana is giving us incorrect URLs.

@AetherUnbound
Collaborator

I've narrowed down a set of query parameters that reflects the images affected above:

{'wskey': '[redacted]',
 'profile': 'rich',
 'reusability': ['open', 'restricted'],
 'sort': ['europeana_id+desc', 'timestamp_created+desc'],
 'rows': '100',
 'media': 'true',
 'start': 1,
 'qf': ['TYPE:IMAGE',
  'provider_aggregation_edm_isShownBy:*',
  'DATA_PROVIDER:("Estonian National Museum")'],
 'query': 'varrukad, hame, naiste',
 'cursor': '*'}

This returns 7 results, one of which is the result shared above. Here are the full contents of that result from the response body:

{'completeness': 5,
 'country': ['Estonia'],
 'dataProvider': ['Estonian National Museum'],
 'dcCreator': ['Danilova, Marfa (valmistaja)'],
 'dcCreatorLangAware': {'et': ['Danilova, Marfa (valmistaja)']},
 'dcSubjectLangAware': {'def': ['http://data.europeana.eu/concept/2585'],
  'et': ['särk']},
 'dcTitleLangAware': {'en': ['sleeves, hame, women'],
  'et': ['varrukad, hame, naiste']},
 'dcTypeLangAware': {'def': ['http://data.europeana.eu/concept/2585'],
  'et': ['särk']},
 'edmConcept': ['http://data.europeana.eu/concept/2585'],
 'edmConceptLabel': [{'def': 'Hemd'},
  {'def': 'Рубашка'},
  {'def': 'Paita'},
  {'def': 'Camisa'},
  {'def': 'Риза'},
  {'def': 'Marškiniai'},
  {'def': 'Krekls'},
  {'def': 'Košulja'},
  {'def': 'Chemise'},
  {'def': 'Ing'},
  {'def': 'Košeľa'},
  {'def': 'Léine'},
  {'def': 'Camisa'},
  {'def': 'Skjorta'},
  {'def': 'Πουκάμισο'},
  {'def': 'Shirt'},
  {'def': 'Camicia'},
  {'def': 'Camisa'},
  {'def': 'Särk'},
  {'def': 'Alkandora'},
  {'def': 'Košile'},
  {'def': 'Koszula'},
  {'def': 'Cămașă'},
  {'def': 'Skjorte'},
  {'def': 'Overhemd'}],
 'edmConceptPrefLabelLangAware': {'de': ['Hemd'],
  'ru': ['Рубашка'],
  'fi': ['Paita'],
  'pt': ['Camisa'],
  'bg': ['Риза'],
  'lt': ['Marškiniai'],
  'lv': ['Krekls'],
  'hr': ['Košulja'],
  'fr': ['Chemise'],
  'hu': ['Ing'],
  'sk': ['Košeľa'],
  'ga': ['Léine'],
  'ca': ['Camisa'],
  'sv': ['Skjorta'],
  'el': ['Πουκάμισο'],
  'en': ['Shirt'],
  'it': ['Camicia'],
  'es': ['Camisa'],
  'et': ['Särk'],
  'eu': ['Alkandora'],
  'cs': ['Košile'],
  'pl': ['Koszula'],
  'ro': ['Cămașă'],
  'da': ['Skjorte'],
  'nl': ['Overhemd']},
 'edmDatasetName': ['401_Muuseumid'],
 'edmIsShownAt': ['https://www.muis.ee/museaalView/534165'],
 'edmIsShownBy': ['https://www.muis.ee/digitaalhoidla/api/meedia/originaal?id=7c0829e9-1731-4ad1-894f-7980bb09f3c7'],
 'edmPreview': ['https://api.europeana.eu/thumbnail/v2/url.json?uri=https%3A%2F%2Fwww.muis.ee%2Fdigitaalhoidla%2Fapi%2Fmeedia%2Foriginaal%3Fid%3D7c0829e9-1731-4ad1-894f-7980bb09f3c7&type=IMAGE'],
 'europeanaCollectionName': ['401_Muuseumid'],
 'europeanaCompleteness': 5,
 'guid': 'https://www.europeana.eu/item/401/item_O55A2YTA2TMLVLCDBPCUIPSNMBJEJTRT?utm_source=api&utm_medium=api&utm_campaign=dialialika',
 'id': '/401/item_O55A2YTA2TMLVLCDBPCUIPSNMBJEJTRT',
 'index': 0,
 'language': ['et'],
 'link': 'https://api.europeana.eu/record/401/item_O55A2YTA2TMLVLCDBPCUIPSNMBJEJTRT.json?wskey=dialialika',
 'organizations': ['http://data.europeana.eu/organization/1482250000000435049',
  'http://data.europeana.eu/organization/1482250000026719048'],
 'previewNoDistribute': False,
 'provider': ['Estonian e-Repository and Conservation of Collections'],
 'rights': ['http://creativecommons.org/publicdomain/zero/1.0/'],
 'score': 246.3116,
 'timestamp': 1688490887425,
 'timestamp_created': '2022-05-10T08:10:51.546Z',
 'timestamp_created_epoch': 1652170251546,
 'timestamp_update': '2022-05-10T08:10:51.546Z',
 'timestamp_update_epoch': 1652170251546,
 'title': ['varrukad, hame, naiste', 'sleeves, hame, women'],
 'type': 'IMAGE',
 'ugc': [False]}

We use the edmIsShownBy value for our URL, and the value Europeana returns does indeed redirect to the "not found" image. @Hobbesball, would you happen to have any insight on this?
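
A minimal sketch of that field mapping, for reference. Only the use of `edmIsShownBy` for the URL is stated above; pairing `edmIsShownAt` with `foreign_landing_url` is inferred from the matching values in the API record, and both helper names are hypothetical rather than taken from the ingestion code.

```python
def first_value(values):
    """Return the first entry of a list-valued field, or None if absent."""
    return values[0] if values else None


def extract_urls(item):
    """Pick the direct and landing URLs out of one search result item."""
    return {
        "url": first_value(item.get("edmIsShownBy")),
        # Assumed mapping, inferred from the record shown earlier.
        "foreign_landing_url": first_value(item.get("edmIsShownAt")),
    }


item = {
    "edmIsShownAt": ["https://www.muis.ee/museaalView/534165"],
    "edmIsShownBy": [
        "https://www.muis.ee/digitaalhoidla/api/meedia/originaal?id=7c0829e9-1731-4ad1-894f-7980bb09f3c7"
    ],
}
record = extract_urls(item)
```

Since the extracted `url` is exactly what Europeana returns, reingestion alone cannot repair the link while the upstream value stays stale.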

@AetherUnbound AetherUnbound moved this from 📅 To Do to 🏗 In Progress in Openverse Backlog Mar 11, 2024
@AetherUnbound
Collaborator

I've emailed the folks at Europeana directly to ask them about this issue.

@AetherUnbound AetherUnbound added the ⛔ status: blocked Blocked & therefore, not ready for work label Apr 23, 2024
@openverse-bot openverse-bot moved this from 🏗 In Progress to ⛔ Blocked in Openverse Backlog Apr 23, 2024
@sarayourfriend
Collaborator

@AetherUnbound did you ever hear back from Europeana? If not, should we recheck the response to see whether the problem still exists and, if so, reach back out to Europeana and potentially the museum itself? Maybe the data issue originates upstream from Europeana, in which case the institution would be able to help (and may be more responsive).

Projects
Status: ⛔ Blocked
4 participants