Handle duplication of records between auckland_museum and wikimedia #3659
Comments
Just adding a note that it would be great to do this in a way that lets us also identify other big sources within Wikimedia Commons, like the NGA (#3167) or others in the spreadsheet I linked in the comment Staci quoted: https://docs.google.com/spreadsheets/d/1WPS-KJptUJ-o8SXtg00llcxq0IKJu8eO6Ege_GrLaNc/edit#gid=1216556120

Specifically, that would mean getting the "Collection" metadata from Wikimedia Commons and storing it on the record. This would be in contrast to not storing that metadata and just reading it to exclude these records.
We may also need to identify duplicated records uploaded from Flickr (https://www.flickr.org/introducing-flickypedia/).
Hi, I'd like to work on this one!
I've been poking at this and wanted to share what I could find. Unfortunately, even though "Collection" shows up as a section on the page (example), I cannot seem to find it anywhere in the metadata the API returns, and the documentation for the available properties doesn't appear to surface it either.
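For reference, here is a minimal sketch of checking this directly (assuming the public Commons API; the file title is the example used later in this thread, and everything else is illustrative). It dumps the extmetadata keys for one file so you can check by eye whether anything resembling the on-page "Collection" section comes back:

```python
import requests

# Dump the extmetadata keys for a single Commons file to check whether
# anything like the on-page "Collection" section is exposed via the API.
API_URL = "https://commons.wikimedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "imageinfo",
    "iiprop": "extmetadata",
    "titles": "File:(Figure_sketches)_PD-1952-2-34.jpg",
}

response = requests.get(API_URL, params=params, timeout=30)
pages = response.json()["query"]["pages"]

for page in pages.values():
    extmetadata = page.get("imageinfo", [{}])[0].get("extmetadata", {})
    for key, field in sorted(extmetadata.items()):
        # Each field is a dict with "value"/"source"; print a short preview.
        print(key, "->", str(field.get("value"))[:80])
```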
Hi @AetherUnbound, I have two ideas, one stupid and the other a bit less so. Both of these just came up when I looked at this, so I don't expect you to take them seriously.

The stupid idea: would it be possible, when ingesting the images, to keep a hash of their data that we could then compare against previously ingested images? This would mean we would need a hash of all of the ingested images, which is probably not feasible, but it would allow for checking any and all duplicates in the database in the future. Generating the hashes could be a job that runs in the background so that the hash store slowly fills up (a rough sketch of this follows below).

The less stupid idea: maybe use a crawler and get the collection data from the page itself? We could use the descriptionshorturl property to get the URL and then collect the collections from the HTML. This would mean that any HTML changes would break the ingestion, so maybe it would have to be put behind an option for stricter checks or something.
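For illustration, here is a rough sketch of the hashing idea above. It is not anything Openverse does today; the function names, the in-memory known_hashes store, and the choice of plain SHA-256 over the file bytes are all assumptions for the example (a perceptual hash would be needed to catch re-encoded copies):

```python
import hashlib

import requests


def file_sha256(url: str) -> str:
    """Download an image and return a SHA-256 digest of its raw bytes."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return hashlib.sha256(response.content).hexdigest()


# Stand-in for whatever store would actually hold the hashes; filling it
# could be the slow background job described above.
known_hashes: dict[str, str] = {}


def is_duplicate(record_id: str, image_url: str) -> bool:
    """Hash one image and report whether an identical file was seen before."""
    digest = file_sha256(image_url)
    duplicate = digest in set(known_hashes.values())
    known_hashes[record_id] = digest
    return duplicate
```

Note that exact-bytes hashing only catches byte-identical files; the same work re-uploaded at a different resolution or encoding would slip through.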
On further investigation, it looks like the "Collection" section may use the Institutions template, which is something we can match on (using the templates prop in the query).

I was thinking we could have an alert similar to the Flickr subprovider auditor which might try to monitor the list of "Institutions" in Wikimedia. That'd probably be a massive number though, and I'm not sure about the viability of that approach. At least for the Auckland Museum, there is an existing institution template we could match against.
And we'd also want to filter down to just that template namespace by adding tlnamespace=106 to the query.
Here's what that response looks like when added: https://commons.wikimedia.org/w/api.php?action=query&prop=info|templates&inprop=displaytitle&tlnamespace=106&titles=File:(Figure_sketches)_PD-1952-2-34.jpg

The assumption would be that a piece of media is from one institution, but on a technical level this approach might allow more than one institution to be present on the page, since we're only checking whether or not the institution's template was used. It seems like we might want to store all institution values in a list on the record. @stacimc and @sarayourfriend, I'm interested to hear your thoughts on all this!

@szymon-polaczy, to respond to your questions (thanks for asking!):

Hashing the image (using either similarity methods like perceptual hashing or more standard cryptographic/file-contents hashing) would indeed help us identify duplicates. As you point out though, for our dataset this would be a massive undertaking and likely a full project to execute. Not that that makes it impossible, but we'd have to scope and prioritize it among all our other projects 🙂

On the crawling front: the Wikimedia ingestion process gathers nearly all of the data it needs using the "Allimages" generator. This means that we're parsing through pages of images at a given time and ingesting them all at once, rather than combing through one result at a time (you can read more about this in the DAG documentation we have). This mechanism already pushes up against API/throughput limits (while trying to be respectful of Wikimedia's API request rate), and it would be slowed down significantly by querying individual images instead of performing that querying in bulk. That's part of the motivation behind trying to find the right set of properties above that can be paired with the Allimages generator.

All that to say, I'm frustrated that the information isn't more easily available, when Wikimedia is clearly able to find it somewhere for the record so that it knows there are collections associated with an image 😕
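For anyone following along, the query above translates to roughly the following script. The endpoint, properties, and file title are taken straight from that URL, while the list handling is just illustrative:

```python
import requests

# Ask for the templates transcluded by a file page, restricted to the
# Institution namespace (tlnamespace=106), and collect their titles.
API_URL = "https://commons.wikimedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "info|templates",
    "inprop": "displaytitle",
    "tlnamespace": 106,
    "titles": "File:(Figure_sketches)_PD-1952-2-34.jpg",
}

response = requests.get(API_URL, params=params, timeout=30)
pages = response.json()["query"]["pages"]

institutions = []
for page in pages.values():
    for template in page.get("templates", []):
        # Titles come back fully prefixed, e.g. "Institution:<name>".
        institutions.append(template["title"])

# Keeping every matching title supports pages that happen to transclude
# more than one institution template.
print(institutions)
```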
Thank you for the detailed explanation.

In the response I got, the Wikipedia link for the National Gallery of Art didn't get caught, but it was mentioned as the Artist and the Credit in the metadata. So maybe it would be possible to also combine other props and create some mapping to see if an element exists somewhere else? @AetherUnbound, when I find more time I'll look through the last message to see if I can help with anything there, but I wanted to send this through as maybe another option.

edit: I quickly read through the explanation for your idea and I think that what I sent over might just be a worse / roundabout version of what you found.
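A sketch of what that kind of mapping could look like (the institution names and source keys here are made-up examples, and extmetadata is assumed to be the dict returned by the imageinfo query):

```python
# Check the Artist and Credit fields from extmetadata against a
# hand-maintained mapping of institution names to provider source keys.
KNOWN_INSTITUTIONS = {
    "national gallery of art": "nga",
    "auckland war memorial museum": "auckland_museum",
}


def matching_institutions(extmetadata: dict) -> set[str]:
    """Return the source keys whose institution name appears in Artist or Credit."""
    matches = set()
    for field in ("Artist", "Credit"):
        value = str(extmetadata.get(field, {}).get("value", "")).lower()
        for name, source in KNOWN_INSTITUTIONS.items():
            if name in value:
                matches.add(source)
    return matches
```

The obvious downside is that the Artist/Credit values are free-form, so a name list like this would need ongoing maintenance.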
Heh, thanks Szymon! I was also looking at
@AetherUnbound using the templates to determine this sounds like the right way to go, based on my very small understanding of MediaWiki's data model. Would we need to reingest records for this solution?
Templates seem like the best approach as far as I can see, although I've spent much less time working with this API than you, @AetherUnbound :)

We'd need some kind of backfill regardless, because there's no way to update the Wikimedia DAG to delete existing records even once we've identified them as duplicates. The simple way we'd do it with a smaller provider would be to update the DAG to add a new
I wonder if they should be deleted or if they should be marked as duplicates of a provider-specific DAG's works? It would be good if we could exclude them from search (and I guess other analysis) without chucking the data or skipping them during ingestion.

I wonder if the Wikimedia API makes it possible to filter based on the template parameter value. In which case, we could do a targeted backfill of the Auckland Museum works in Wikimedia to suppress/exclude them. If that's possible, there might be some crossover with the targeted reingestion work we've discussed in #4452.
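On the "filter based on the template" question: MediaWiki's embeddedin list returns the pages that transclude a given title, which could drive exactly that kind of targeted backfill. Below is a sketch; the institution template title is an assumption about how the Auckland Museum page is named on Commons:

```python
import requests

# Walk every File: page that transcludes a given institution template.
# Each returned title is a candidate duplicate of an auckland_museum record.
API_URL = "https://commons.wikimedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "list": "embeddedin",
    "eititle": "Institution:Auckland War Memorial Museum",  # assumed title
    "einamespace": 6,  # File: namespace only
    "eilimit": "max",
}

while True:
    data = requests.get(API_URL, params=params, timeout=30).json()
    for page in data["query"]["embeddedin"]:
        print(page["title"])
    if "continue" not in data:
        break
    # Standard MediaWiki continuation: feed the continue params back in.
    params.update(data["continue"])
```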
Totally agree, although I was intentionally vague when I said "otherwise preserve the duplicates" because we don't have an "official" way of doing that yet as far as I know 😄 FWIW the simplest thing to do is probably move them to the
I have seen this and will respond to this discussion when I have time!
Problem
As noted by @sarayourfriend in this comment, many records from the Auckland Museum's collection are already in Openverse due to their inclusion in Wikimedia Commons. If we run both DAGs and do nothing to address this, these records will be duplicated in Openverse.
Description
Suggestion taken directly from Sara's comment:
Additional context
The auckland_museum DAG is currently blocked on other issues (see the DAG Status page), but this issue should not necessarily prevent us from turning the DAG on. However, we should not add the provider as a source in the API until this has been resolved.