-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DAG to decode and deduplicate image tags with escaped literal unicode sequences #4475
Conversation
These were ingested before the transfer of the project from Creative Commons to WordPress. @zackkrida will correct me if I'm wrong, but I think we received all the catalog database data as parquet files that we then inserted into the catalog database. So, there are no TSVs for the items that were ingested before the transfer.
I've tried investigating, and it seems that the incorrectly encoded tags were saved mostly before 2021. I selected some items that have an accented character in the title, "é". It seems that the titles encoding was fixed, but the tags (and
Here are some of the results:
The tags extracted, without the clarifai tags:
These examples show that we do have many items with incorrectly encoded tags that do not have a correct duplicate. Thank you for sharing the example of an item that has an error due to us hot-fixing the encoding errors incorrectly. What if we run this DAG in 3 stages:
|
@sarayourfriend and @obulat I analyzed the Sentry-reported records for this issue and found that:
This is correct, yes. No original TSVs, just parquet files produced from the then-postgres catalog DB. lil' ts/bun script I used
|
Your plan sounds good to me, Olga. The first two steps I think can go into one, right? The regexes are combined anyway, and those transformations we're pretty sure are safe, as far as I know. Thanks for the further analysis, Zack. It is good to know the issue has some scope smaller than all of unicode... but there could still be overlaps with genuine sequences. I can't think of an actual example, but on Flickr especially people often include tags of camera models, or references to dated events with hashtags and things. Maybe some of these are already considered suboptimal records, but guessing at them feels really wrong to me, especially if doing so makes an actually safe fix (namely, reingesting suspect records from Flickr) more difficult. Even if we narrow the range of unescaped unicode sequences, I still feel that this is a very broad, naive approach for a rather narrow issue. I like Olga's plan because it pushes off the decision, and allows us to fix some of these. However, it's clear from Olga's query that this is a significant subset of the problematic records which will need addressing no matter what. Is there a particular hang up with reingesting these records from upstream, considering there aren't TSVs for us to use for these? I know it would require writing a new method of ingestion, but the risk of permanent damage is pretty big here, and the chances of discovering damage are basically zero (needle in a haystack, finding the small number of works we mangled in a review after the fact would be really hard, especially if the text is already obscure, even if intentional). Say we decided to apply this naive fix now, but save the identifiers of all affected rows. What's the next step? I can't imagine any way of reviewing the changes1, so wouldn't then the only way to know we hadn't caused a problem just be to compare to upstream again? At that point, we're one step away from reingestion, right? To my mind, why not just go straight for the real, safe fix? Unless the intention is to apply this and not care whether false positives are incorrectly mangled. But then why fix this subset of cases at all? There's of course a difference between mangling what looks like an obscure string of characters (u01f0) and fixing a problem where I'm not personally comfortable with the solution for unescaped sequences, even if the problem space were reduced to a single unicode sequence. If that were the case, and if the number of records was small, we could manually review and fix them maybe. And even then, we'd probably have to consult with upstream to be sure anyway. I can imagine other use cases for the ability to reingest records (e.g., verifying dead links aren't just changed upstream metadata), so I don't think work to do so for records suspected of having unescaped unicode sequences would In any case, I'll update this PR to remove handling unescaped sequences altogether. It significantly simplifies the code to do so anyway, so hooray! Footnotes
|
7a5a75d
to
7eccb9a
Compare
I've got the changes in to only work on tags with escaped sequences (actually, just removing the unescaped handling). However, I'm working through a bug where the transformation was being applied to strings that were already unicode encoded, which causes them to get mangled. |
@sarayourfriend Thanks for raising this concern, I was having similar thought but didn't have the time to tip into the solution (or alternatives).
Specifically with Flickr, there is in fact a re-ingestion DAG but it's paused given the many problems it were bringing before (rate limits/unstable results returned). I think given the size of Flickr a paginated API might not be an adequate tool, and old items like this can be even more problematic. I don't know if there is alternativate to re-ingest these records, @stacimc could know better since she worked a lot on Flickr (re)ingestion DAGs. |
I'd assume reingestion of a known subset of records (rather than date ranges) would be able to utilise a single-item API, specifically: https://www.flickr.com/services/api/flickr.photos.getInfo.html That route requires only the foreign ID of the work, which we have, and returns tags, description, title, license, etc. everything we'd need to fully reingest the record (from a data perspective, never mind architecting the process of getting it back into the catalog, whether that would use an intermediate TSV, etc). I'm not talking about a backfill, which is what I think you're referencing by reingestion. I mean targeting the specific works we're worried have mangled data and pulling them individually from upstream. Irrespective of the amount of time that would take due to rate limits (and being generally respectful), it is a stable and predictable way to retrieve corrected data for each work. We could do something like the following, based on the batched update DAG's method:
That allows us to review the pulled data and decide how we'd like to upstream it into the catalog. Can we use the batched update DAG ( Regardless, it seems like a solvable problem, right? And one that would be useful beyond this? And also the only solution I can think of that has zero risk of causing further damage to our data? |
As you say, that would be literally a reingestion of a subset of image rows. I agree that, in theory, it sounds doable (regardless of how long it may take) and the safest solution to the problem.
I was thinking more of directly assigning the new tags to the suspicious items, provided they are correctly decoded, of course. I understood the whole point of doing a custom re-ingestion one-off DAG is to avoid having to review the entries manually. Perhaps I'm missing some process we could/should(?) perform in the middle. I don't see the need to touch on S3. If it can be narrowed to the realm of Aiflow and Postgres, I believe that would be preferable. |
I mean review more as, just make sure the data makes sense generally, not that the individual tags are fixed or expected. If we're going to overwrite records that we don't even have TSVs for, I think it'd be nice to not smash them directly into the database where the only recovery option is then restoring them from a database snapshot. Considering how close we've come recently to (1) completely destroying Clarifai tags, (2) destorying correctly encoded unicode strings (in this PR), I think moving in stages where we can take a step back and say "do we still think we're going in the right direction with this" behooves us. |
@sarayourfriend's suggestion in this comment sounds good to me. I also initially assumed you meant using the existing Flickr provider dags to perform reingestion. It sounds like that's not what's being discussed here, but I want to add for additional context for why the proposed solution does differ from literal reingestion (and that's a good thing): it is a known problem that when a record gets reingested, old tags are not deleted or updated. Eg if a record is initially ingested with a tag value of "foo", and then that tag is deleted and replaced with the tag "bar" at its source, when it is reingested the record in our catalog will have tags "foo" and "bar". (@AetherUnbound and I have discussed this but I can't find an issue or that discussion at the moment, maybe she remembers?) There are also many reliability problems with the Flickr DAG which mean we can never be certain that it has actually ingested all records for a particular time frame. That being said, using the single image endpoint and a batched update with custom SQL to overwrite old tags can certainly avoid all these problems 👍 I think it would probably be easiest to use the MediaStore to validate/enrich tags in the initial one-off DAG and avoid trying to upload to S3 and load, although I might be misunderstanding what you were suggesting there. We want to use the batched update rather than the Flickr DAG to load, because the loader steps will not replace the old tags. |
1857169
to
8993c42
Compare
Full-stack documentation: https://docs.openverse.org/_preview/4475 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. Changed files 🔄: |
I do remember discussing the tags issue @stacimc, but I also don't recall where that notion might be captured. I think it'd be a great thing to add to our field documentation for the catalog, I'll make an issue for it! +1 to Sara's idea of a targeted backfill for correcting the tags not covered by this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works great for me! Thank you for including the new sample data.
That is a beast of a SQL update query, nicely done 😄
task_id="trigger_batched_update", | ||
trigger_dag_id=BATCHED_UPDATE_DAG_ID, | ||
wait_for_completion=True, | ||
execution_timeout=timedelta(hours=5), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you confident in this timeout? I don't think the triggered dagrun will be timed out if this task times out (eg, I think we'd get a timeout of this DAG but the batched update would continue running just fine) but I'm not actually 100% on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not at all confident in this timeout 😅 And the difference between the task timing out and the triggered DAG timing out did not occur to me.
I guess I should be using the update_timeout
etc on the batched update DAG conf
instead of the task timeout. That's a very interesting difference between the two!
It will affect the trim and deduplication DAG as well.
I am actually super curious if the straight SQL is more performant than Sort of... knowing how to judge where the cut off is, basically, aside from "which tool is easiest" which is a valid singular approach so long as the performance difference isn't an issue. But anyway, I was proud I sorted it out in SQL 🙊 😊 |
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @dhruvkb Excluding weekend1 days, this PR was ready for review 4 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2. @sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I ran some examples for the ov_unistr
function locally and it seems to work perfectly. Thanks!
I just need to push a small change to fix the timeout settings as pointed out by @stacimc, and then will repush and merge. |
8993c42
to
b2c2b14
Compare
Fixes
Fixes to #4452 by @krysal
Description
Introduce a new temporary DAG,
decode_and_deduplicate_image_tags
, to find tags with escaped literal Unicode sequences, and to process them into actual unicode strings.I've introduced a function using PL/Python3u
ov_unistr
. See the documentation string on the task that creates it for motivation and details. It essentially implements part of the fix from #4143, just the part that works on escaped unicode sequences.See the previous versions of this PR description (and the discussion below) for information on why we are not touching unescaped sequences. The short version is: it's entirely unsafe to guess what is unescaped unicode and what is just regular text.
Testing Instructions
To test this locally:
./ov just down -v && ./ov just catalog/init
../ov just catalog/pgcli
and executeselect jsonb_array_elements(tags) from image where identifier = 'aeba0547-61da-42ee-b561-27c8fc817d5a';
to observe the testing data. Note the following:muséo
already exists in processed unicode for two providersbatched_update
DAG and thedecode_and_deduplicate_image_tags
DAG. Run the latter and wait for it to complete.muséo
only exists once per providerChecklist
Update index.md
).main
) or a parent feature branch.just catalog/generate-docs
for catalogPRs) or the media properties generator (
just catalog/generate-docs media-props
for the catalog or
just api/generate-docs
for the API) where applicable.Developer Certificate of Origin
Developer Certificate of Origin