-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decode and deduplicate tags in the catalog with a TargetedReingestionDAG #4452
Comments
I tried to run the DAG again today, and it turns out RDS does not support the PL/Pythonu extension 😞 RDS does support plperl, plrust, plpgsql, and plv8. Of those, plv8 (v8 being the same JS engine as Chrome), might be the most proximate for this use case, but I'll see. It might be that the reingestion of these records is the easiest, most reliable way to do it. |
@stacimc can you chime in on whether you think it's okay to go ahead and start thinking about how we would reingest these specific records identified as have potential unicode escape literals? I think that's the safest approach anyway, because even an escaped literal unicode sequence could be valid metadata from the source. So even escaped literal sequences might be wrong to assume were caused by some previous bad transformation on our end. Most of them certainly were, we know there's some historical occurrence of that, but if even one isn't, then we've just mangled that data with no easy way to get it back (other than reingesting it anyway). And the only way to be sure will be to reingest those records. Can you give any hints or tips of things you think need to be looked out for when thinking about this? The general idea I shared in the initial PR is here, for reference: #4475 (comment) Basically: I don't think a DAG like the one I merged in that PR is actually safe. This isn't a safe or reliable thing to infer based on our data alone. Therefore, to fix this, we need to reingest the records. A temporary fix for instances like https://openverse.org/image/3e3bc924-a7c0-4a6b-bcea-4a17160355d2 is to only apply the frontend's automated fix for works with indexation dates before a certain date (lets say 2021-01-01, based on Olga's query in this comment). |
@sarayourfriend if the only way to safely determine what the tags should be is to reingest, then I think that's fine. The Flickr reingestion DAG is absolutely not reliable for fixing this for a variety of reasons, but I do think we could implement something like you've described using a lot of existing code from the ProviderDataIngester and the loader tasks. Actually I would look into extending the existing FlickDataIngester. You can add The only thing that would be an issue is that the |
That's a good point about potentially overwriting the tags rather than merging them! I wonder though, if it might actually be better to store all of the tags but (similar to |
Madison, in your version, would we basically update the existing tags where I don't mind that personally, we should add dates to new tags if we're doing that, I'd think, for provenance's sake. It's certainly the least destructive approach, which is great! I suppose the main difficulty there is that if we set How about instead of keeping those "former" tags in We could also have a separate table for this, like I'm on board to take it slow on these considerations for this issue. I've downgraded the priority to medium from high, I think our current stop-gap is OK, and this kind of data-quality and revision work is worth making sure we get right and don't make destructive mistakes. |
I'm not opposed, although that raises some questions for me (for example, should
I also feel like this is a little different from, for example, the machine-generated tags below a certain accuracy threshold. In that case those are all perfectly valid tags, and the issue is that we're making decisions about which ones we want to use/display. This case could be thought of as correcting data that has been invalidated upstream (in the way that when the title or description of a record changes at the source, we simply update that in the catalog). But... that is certainly a nuanced distinction! I certainly have no objection to keeping the data (beyond noting that it'll make the upsert a little more complicated, which is totally fine). Those questions about deleted tags in display/search can also be answered later. There's no regression from current behavior :) And the great thing about keeping the data is you can always change your mind later, which is obviously not the case if we delete it. As far as a |
Agreed that this is scope enough for a project. My assumption is that yes we should keep that data, if we can, but I'm also wary of the fact that I actually don't have a good reason why we should, and that without a good reason, we risk spending a considerable amount of money storing data we don't currently have a known use for. You're absolutely right to point out that it's very different in that respect from the machine generated tags where the argument to keep all of them is easier to make. We also don't need to start a full historical record for all works right away. We could, for example, develop an idea for a generic approach but only use it for this targeted reingestion, and see how it goes, and then implement it more broadly if we decide it suits our purposes (and probably more freely iterate with a smaller amount of pre-existing data than if we started doing this for everything right away). I suppose a big question is about non-dated provider DAGs? Would we diff every field for every work to see if it needs a new history entry? |
I think we could use triggers here to insert into some It would be very interesting to see if we can come up with any use case for storing this information. Honestly, I have some reservation about storing historical user-generated data from third party sites -- for example, I wonder if a Flickr user who changed the title/tags/etc on an image would be surprised that Openverse is retaining the old data, and if that could be cause for concern. If we scope a project for this I'd like to see that addressed in the proposal. Circling back to the decode/deduplicate tags issue, do you think it should block on this? We're potentially talking about a pretty big and as yet unprioritized/unplanned project. One thing we could do, using the approach I outlined in this comment, is to add another preingestion task just to the new DAG that drops at least the tags with potential invalid characters. You could argue that tags that have gotten into an invalid state, can be dropped and replaced with current data from the provider. Then the regular load steps will merge in the current tags just fine. Alternatively the preingestion task could add a |
Just wanted to mention that I've implemented this postgres trigger history logging pattern in a project before. It was for a Gutenberg-esque CMS which stored page revision history. Postgres triggers are given access to some useful variables:
My project had a generic CREATE OR REPLACE FUNCTION log_history() RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'UPDATE' THEN
INSERT INTO history (id, old, new, operation_type, change_timestamp)
VALUES (OLD.id, row_to_json(OLD), row_to_json(NEW), 'UPDATE', current_timestamp);
ELSIF TG_OP = 'DELETE' THEN
INSERT INTO history (id, old, new, operation_type, change_timestamp)
VALUES (OLD.id, row_to_json(OLD), NULL, 'DELETE', current_timestamp);
ELSIF TG_OP = 'INSERT' THEN
INSERT INTO history (id, old, new, operation_type, change_timestamp)
VALUES (NEW.id, NULL, row_to_json(NEW), 'INSERT', current_timestamp);
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql; It worked well! The only thing we'll need to test is performance with large batch updates. Agreed all around that this would be a standalone "data preservation/archiving" project, though. |
This occurred to me as well. There's a certain extent to which we should describe works the way the upstream provider describes them. Retaining changed information for search sounds bad, especially if extrapolated to a GLAM institution where descriptions are much more intentional and changes might be reflective of changing institutional policies that we would not want to misrepresent.
The issue with this, to me, is that there is no reliable way to detect tags in an invalid state. At the very least, there's no way to determine whether something is in that state because of a mistake we made with the data in the past, or whether it's something intentional from the upstream provider. Even when scoped to just ones we "suspect" to have the issue, I don't think there's a way to reliably compare across the new set of tags from the provider and our own to determine which are ones we should drop vs keep. Dropping all the old tags (from search) and just keeping the new ones is the only 100% reliable way to make sure we match the upstream. Your point about it being a big blocker though, is important. We don't need to block on this so long as we store the old tags in such a way that we could backfill any future broader history-tracking solution we decide on. For example, storing the old tags in S3 with minimal metadata stored in the key (identifier + date?) is a fine temporary solution that would allow us to unblock this specific issue, whilst preserving the ability to backfill whatever history-tracking method we try. If anything, choosing the most generic/abstract version of this (e.g., one that is explicitly unoptimised for anything other than storage) is probably the best first choice given we don't have a precise use case for this data now, and unbinds us from needing to turn this into a big project. So long as whatever we choose is minimally viable (date, identifier, explanatory note/event marker, plus the json blob, I think), we can defer an actual design for this kind of thing to later on. Would that be alright? If we did that, we'd implement the changes to the DAG you suggested, Staci, in addition to loading the "old" row as JSON into S3 before replacing it altogether in the catalog DB. Create the table based on the query as well as upload the "old" rows in
This seems like the biggest blocker to me, it isn't deferrable like tracking the old data, right? Can we introduce a new strategy that does what you've suggested? It should be possible to filter out tags from jsonb_path_query_array(
old.tags,
format('$ ? (@.provider != %s)', old.provider)
) |
These are all really good points, and I think asking the question of what do we hope to get from this data is always a good place to start before we invest time into building out a way to store/track/record it. I'm on board with the approach of making a minimally viable, storage-optimized approach for unblocking this specific issue. That's similar to the plan we've had with #4610 and the data normalization efforts thus-far. As long as we're preserving the old values somewhere, I think that's best here! |
Okay, I'll wait for @stacimc to reply regarding the tag merging issue she brought up, and then we can start to think about concrete implementation details for the first-pass just to get these initial works reingested. |
@sarayourfriend is my understanding correct that you're suggesting we use the approach I outlined in this comment, but also update the
? No objections from me if so, particularly because our load steps are not a performance bottleneck so I'm not particularly concerned about additional complexity there. This is the best minimal solution that keeps the tag history, IMO. Although I agree we should retain data even if we are initially skeptical of its usefulness (deleted/modified tags were likely low quality tags, as you point out), the issue to me remains privacy/ethical concerns with deliberately retaining this historical data that is intentionally changed at source. I can't think of a use case for this data that would avoid those issues. That being said I have no blocking objection to storing it in s3, and this solution is even a huge step forward, since it will bring tags in our catalog/API up to date. 👍 |
Yes, for the provider. Something like the expression: new_provider_tags + [old_tag for old_tag in old_tags if old_tag["provider"] == provider] The SQL I shared above should be (or is intended to be) equivalent to the right hand side of the expression, but I'm not sure I 100% understand the column strategies. In particular, I don't really understand the
Your point about the ethics is good. Beyond privacy, we have a responsibility to represent works accurately as informed by the upstream data. Keeping removed or changed tags (as we do now by merging old and new data) already creates a problem there. Storing them in S3 doesn't fix that, and I don't know what would be useful about it other than in the case of needing to recover from a mistake. However, I think the right way to do that would be to track the rows we changed, when and why we changed them, and what method we used to change them, in some sort of provenance log. And if we needed to recover corrupted or incorrect data after this reingestion, we could just reingest them. We have database snapshots anyway, if things were really dire. We aren't an archive, we do not currently have any mandate (whether internal or external) to archive the history of works from upstream providers. Good thing to keep in mind, and the ethics you've raised are a good prompt for that, for sure. How about this: we export the table we create of identifiers of works to reingest to S3 (and document it appropriately), such that we could easily track down the rows that were modified by this process? Maybe that's something that should go into the batched update anyway, as a way of tracking the fact of changes (if not the specific changes in every case) to rows from that DAG. |
I need to hand this work off, so pinging @krysal as I guess this is part of the data normalisation work now? I've worked with @stacimc to create an outline of how to approach targeted ingestion (and therefore reingestion) of works for a provider. The term "targeted" refers to the idea of "targeting a specific set of works" for a given provider. The overall plan is to adapt existing Each batch will be pulled in an override of
The targeted data ingester will adapt The targeted ingester will use I have started on some of this in the following commit: 59d2e4c#diff-20a97d97a930c3375c7d432419db3838b806c384842e41ee9c031debe3953391 The code there is just a sketch, and is messy, and doesn't follow some of the specific things above, as I wrote it to show Staci and work from in our discussion. Staci's additionally advised that for Flickr in particular, due to the regular ingester being time delineated, it will be easier to understand the provider if we extract the record building into a record-builder class similar to For additional clarification:
@krysal I'm unable to continue working on this myself because it's much more significant work than we originally planned, and I need to focus on some infrastructure things, including the #3336 IP, and some pending security improvements we need to make, and infrastructure planning. In other words, I'm rather spread thin, and as this is on the edge of my typical responsibilities and work on the project, it would be better for me to hand it off to someone, whether that is you or someone else on @WordPress/openverse-catalog. @stacimc please do correct and/or clarify anything I've shared above as you see fit. |
Oh, I would also add, it might be a good idea to split this into a couple PRs, and maybe use a simpler provider than Flickr to proof-of-concept the idea, before getting too bogged down with Flickr details, if necessary. On the other hand, the complexity of Flickr might be the best way to prove the concept in the first place, but multiple PRs will probably be easier to understand and review nonetheless! |
@sarayourfriend I wouldn't mind working on this, myself, given that I've already been a big part of the discussion! I'm wrapping up the last big (non-test) ingestion server removal PR this week, so as long as it's not critical that work continue on this immediately I'm actually already invested/very interested in this issue. Happy to answer questions if @krysal really wants to take it though! |
I have seen this and will respond to this discussion when I have time! |
@stacimc I've assigned you to this issue based on your expressed interest! There is no urgency to begin the work until you're able 😄 I've moved it to the backlog as well until you're ready to prioritize it. |
batched_update
DAG
Title edited to reflect the new intention for this work, using targeted reingestion. |
Problem
In #4143, @obulat proposed to add new cleaning steps to the fix tags in the Catalog, but the option of including them in the Ingestion Server was declined in favor of using the
batched_update
DAG.Description
We want to take the functions that were planned to include in said PR and translate them into parameters for this DAG. Given the complexity of the decoding transformation it might require some advanced functions of PostgreSQL, like a combination of pattern matching and PL/Python Functions.
In #1566, duplicated tags were previously removed, so we will apply the same solution given the decoding may cause new duplicates.
Additional context
Related to #4125 and #4199 (similar issue).
The text was updated successfully, but these errors were encountered: