-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove duplicated tags #1566
Comments
Here's another example: https://openverse.org/image/9ab8329c-8037-4fd4-8d55-8a52eb0dca90 (with a lot more duplicates). |
@AetherUnbound, the catalog database for the item you shared contains duplicate tags:
This should not happen with items that have been recently re-ingested because the tags column removes non-unique tags when upserting new values into the table: openverse/catalog/dags/common/storage/columns.py Lines 688 to 690 in ff1b5d5
The item in question, however, was not updated since 2020-10-01, and I'm not sure whether the code to remove duplicate tags was ran at ingestion at that time. What would the best fix for such items be in the catalog? Would running a batched update that would select items that have not been updated since some date in the past, and run something like |
That sounds like it could work! The only issue is the number of records that might be affected by that update could be significant 🤔 Do you think we could get a count of the records that have duplicates in them before trying to run any batched update? That might give us a sense of how pervasive the issue is. |
The only way I can think of getting this information is by adding a new step to the data cleanup in the ingestion server. But this would make the data refresh run longer. I wonder how many items were not updated since 2020? Maybe the non-updated set of items is smaller? |
After some fiddling around, it looks like we are able to query for this!! deploy@localhost:openledger>
SELECT *
FROM (
SELECT identifier, provider, updated_on,
tags || '[]'::jsonb AS tags,
(SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM image
) u
WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags) limit 1;
-[ RECORD 1 ]-------------------------
identifier | 95aeace6-f70d-420c-a08e-84271ba6cad3
provider | animaldiversity
updated_on | 2019-02-15 14:47:33.392219+00
tags | [{"name": "Reproduction", "provider": "animaldiversity"}, {"name": "Body Parts", "provider": "animaldiversity"}, {"name": "Photo", "provider": "animaldiversity"}, {"name": "Live Animal", "provider": "animaldiversity"}, {"name": "Female", "provider": "animaldiversity"}, {"name": "Female", "provider": "animaldiversity"}, {"name": "Adult/Sexually Mature", "provider": "animaldiversity"}, {"name": "Reproductive Organ", "provider": "animaldiversity"}, {"name": "insect", "accuracy": "0.99904", "provider": "clarifai"}, {"name": "dragonfly", "accuracy": "0.99758", "provider": "clarifai"}, {"name": "nature", "accuracy": "0.99558", "provider": "clarifai"}, {"name": "fly", "accuracy": "0.99301", "provider": "clarifai"}, {"name": "animal", "accuracy": "0.99085", "provider": "clarifai"}, {"name": "wildlife", "accuracy": "0.98856", "provider": "clarifai"}, {"name": "pest", "accuracy": "0.98646", "provider": "clarifai"}, {"name": "invertebrate", "accuracy": "0.9839", "provider": "clarifai"}, {"name": "leaf", "accuracy": "0.97928", "provider": "clarifai"}, {"name": "damselfly", "accuracy": "0.96964", "provider": "clarifai"}, {"name": "outdoors", "accuracy": "0.96327", "provider": "clarifai"}, {"name": "little", "accuracy": "0.9622", "provider": "clarifai"}, {"name": "wing", "accuracy": "0.95252", "provider": "clarifai"}, {"name": "close", "accuracy": "0.95173", "provider": "clarifai"}, {"name": "spider", "accuracy": "0.94385", "provider": "clarifai"}, {"name": "wild", "accuracy": "0.94351", "provider": "clarifai"}, {"name": "closeup", "accuracy": "0.94158", "provider": "clarifai"}, {"name": "garden", "accuracy": "0.93974", "provider": "clarifai"}, {"name": "entomology", "accuracy": "0.93532", "provider": "clarifai"}]
unique_tags | [{"name": "Adult/Sexually Mature", "provider": "animaldiversity"}, {"name": "Body Parts", "provider": "animaldiversity"}, {"name": "Female", "provider": "animaldiversity"}, {"name": "Live Animal", "provider": "animaldiversity"}, {"name": "Photo", "provider": "animaldiversity"}, {"name": "Reproduction", "provider": "animaldiversity"}, {"name": "Reproductive Organ", "provider": "animaldiversity"}, {"name": "animal", "accuracy": "0.99085", "provider": "clarifai"}, {"name": "close", "accuracy": "0.95173", "provider": "clarifai"}, {"name": "closeup", "accuracy": "0.94158", "provider": "clarifai"}, {"name": "damselfly", "accuracy": "0.96964", "provider": "clarifai"}, {"name": "dragonfly", "accuracy": "0.99758", "provider": "clarifai"}, {"name": "entomology", "accuracy": "0.93532", "provider": "clarifai"}, {"name": "fly", "accuracy": "0.99301", "provider": "clarifai"}, {"name": "garden", "accuracy": "0.93974", "provider": "clarifai"}, {"name": "insect", "accuracy": "0.99904", "provider": "clarifai"}, {"name": "invertebrate", "accuracy": "0.9839", "provider": "clarifai"}, {"name": "leaf", "accuracy": "0.97928", "provider": "clarifai"}, {"name": "little", "accuracy": "0.9622", "provider": "clarifai"}, {"name": "nature", "accuracy": "0.99558", "provider": "clarifai"}, {"name": "outdoors", "accuracy": "0.96327", "provider": "clarifai"}, {"name": "pest", "accuracy": "0.98646", "provider": "clarifai"}, {"name": "spider", "accuracy": "0.94385", "provider": "clarifai"}, {"name": "wild", "accuracy": "0.94351", "provider": "clarifai"}, {"name": "wildlife", "accuracy": "0.98856", "provider": "clarifai"}, {"name": "wing", "accuracy": "0.95252", "provider": "clarifai"}]
SELECT 1
Time: 0.085s (Note that a duplicate I'm going to run a count for this to see how many we have 😄 |
@stacimc was able to run this for me: deploy@localhost:openledger> select count(*) from (select identifier, provider, updated_on, tags || '[]'::jsonb as tags, (sele
ct jsonb_agg(distinct x) from jsonb_array_elements(tags || '[]'::jsonb) t(x)) as unique_tags from image) u where jsonb_array
_length(u.tags) > jsonb_array_length(u.unique_tags);
+--------+
| count |
|--------|
| 206799 |
+--------+
SELECT 1
Time: 7422.500s (2 hours 3 minutes 42 seconds), executed in: 7422.493s (2 hours 3 minutes 42 seconds) This feels like something we can correct for just those records! What do you think @obulat? |
Before reading the replies, I added this to the DN milestone and thought it had more to do with modifying the Should we create a new issue to filter out possible duplicates from provider scripts? I don't understand why
|
I don't know that I have the full context on That said though, if we could 1) ensure that the |
@AetherUnbound Glad we're on the same page! I created an issue for the code changes in the catalog: #3926. Do any of you want to take charge of this? 😄 Since you all already discussed it and saw a potential and relatively quick solution. CC @obulat @stacimc |
Preparing for once the code in the catalog is updated, I see two options to solve this:
-- Remove tags duplicates from the audio table
WITH tags_temp AS (
SELECT identifier, tags || '[]'::jsonb AS tags,
(SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags
FROM audio
)
UPDATE audio SET tags = tags_temp.unique_tags FROM tags_temp
WHERE tags_temp.identifier = audio.identifier
AND jsonb_array_length(tags_temp.tags) > jsonb_array_length(tags_temp.unique_tags); You can try this locally, as there are some duplicates in the sample data. -- Get the number of duplicates
SELECT COUNT(*)
FROM (
SELECT identifier, provider, updated_on,
tags || '[]'::jsonb AS tags,
(SELECT jsonb_agg(DISTINCT x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags
FROM audio
) u
WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags);
-- and/or see one specifically, e.g.:
SELECT jsonb_array_elements(tags) FROM audio WHERE identifier='209937e6-5a7f-4c31-bdc2-c6e39c2b7cb3'; It seems feasible, especially for the smaller audio table, but given the amount of duplicates shown for images, it could also work in a reasonable time.
{
"query_id": "with_test",
"table_name": "audio",
"with_query": "WITH wt AS (SELECT identifier AS wt_identifier, tags || '[]'::jsonb AS current_tags, (SELECT jsonb_agg(distinct x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM audio)",
"select_query": "JOIN wt ON identifier = wt_identifier WHERE jsonb_array_length(wt.current_tags) > jsonb_array_length(wt.unique_tags)",
"update_query": "SET tags = (SELECT jsonb_agg(distinct x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x))",
"batch_size": 100,
"update_timeout": 3600,
"resume_update": false,
"dry_run": false
} So folks, do you prefer how to solve this or see any other alternative here? |
It's definitely nice to run the update through a DAG so we have a little more visibility, although not strictly necessary. I don't see the branch you mentioned so I haven't looked at your updates to the WHERE identifier IN (
SELECT identifier FROM (
SELECT
identifier,
tags || '[]'::jsonb AS tags,
(select jsonb_agg(distinct x) from jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags
FROM image
) u
WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags)
); It's a bit more nested than I'd like but that worked for me locally to update the 362 images with duplicate tags in about 14 seconds. When I ran just the SELECT in production for Either way we'll want to wait to run this until after confirming with @sarayourfriend that the catalog deployments are complete and DAGs can be unpaused. |
@stacimc That's awesome! Better if we don't require modifications to the DAG, glad that you found a query that works! I'll mark this as blocked until the other work is finished. |
Started the DagRun for audio with the following configuration that worked locally for me: {
"query_id": "rm_duplicate_tags_audio",
"table_name": "audio",
"select_query": "WHERE identifier IN ( SELECT identifier FROM ( SELECT identifier, tags || '[]'::jsonb AS tags, (select jsonb_agg(distinct x) from jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM audio ) u WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags) )",
"update_query": "SET tags = (SELECT jsonb_agg(distinct x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x))",
"batch_size": 10000,
"update_timeout": 3600,
"resume_update": false,
"dry_run": false
} |
It was super quick given there was only 106 rows with duplicate tags in the audio table, so I kicked off the run for the image table. The config: {
"query_id": "rm_duplicate_tags_image",
"table_name": "image",
"select_query": "WHERE identifier IN ( SELECT identifier FROM ( SELECT identifier, tags || '[]'::jsonb AS tags, (select jsonb_agg(distinct x) from jsonb_array_elements(tags || '[]'::jsonb) t(x)) AS unique_tags FROM image ) u WHERE jsonb_array_length(u.tags) > jsonb_array_length(u.unique_tags) )",
"update_query": "SET tags = (SELECT jsonb_agg(distinct x) FROM jsonb_array_elements(tags || '[]'::jsonb) t(x))",
"batch_size": 10000,
"update_timeout": 3600,
"resume_update": false,
"dry_run": false
} |
This is done ✅ |
Problem
Here's an example on the frontend of an image with duplicated tags: https://search-staging.openverse.engineering/image/9fc28cae-ec9b-437a-a960-98f48db2cac8 (
coffee
,lol
,lol
).Description
We should remove duplicated tags as part of the cleaning steps when loading ingested data.
For backfilling: Non-dated DAGs consume all the data on each run and ought to update old records on the next DAG run. For dated DAGs, we may need to write a backfill query.
Implementation
The text was updated successfully, but these errors were encountered: