Add DAG to decode and deduplicate image tags with escaped literal unicode sequences #4475

Merged
merged 3 commits into from
Jun 26, 2024

Conversation

sarayourfriend
Collaborator

@sarayourfriend sarayourfriend commented Jun 12, 2024

Fixes

Fixes to #4452 by @krysal

Description

Introduce a new temporary DAG, decode_and_deduplicate_image_tags, to find tags with escaped literal Unicode sequences, and to process them into actual unicode strings.

I've introduced a PL/Python3u function, ov_unistr. See the documentation string on the task that creates it for motivation and details. It essentially implements part of the fix from #4143: only the part that works on escaped unicode sequences.

See the previous versions of this PR description (and the discussion below) for information on why we are not touching unescaped sequences. The short version is: it's entirely unsafe to guess what is unescaped unicode and what is just regular text.
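For illustration only, decoding escaped literal sequences can be sketched in plain Python as follows. This is not the PR's ov_unistr: the name `decode_escaped` and the regex are assumptions made for the example. The sketch rewrites only sequences that carry a literal backslash and leaves bare text such as `u00e9` alone.

```python
import re

# A literal backslash followed by either uXXXX or xXX (hex digits).
ESCAPED_SEQUENCE = re.compile(r"\\(?:u([0-9a-fA-F]{4})|x([0-9a-fA-F]{2}))")


def decode_escaped(text: str) -> str:
    """Replace escaped literal sequences with the characters they name."""

    def _to_char(match: re.Match) -> str:
        # Exactly one of the two groups is set, depending on the prefix.
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return ESCAPED_SEQUENCE.sub(_to_char, text)


print(decode_escaped("mus\\u00e9o"))    # escaped \u form is decoded
print(decode_escaped("di\\xf3cesis"))   # escaped \x form is decoded
print(decode_escaped("diu00f3cesis"))   # bare form is returned unchanged
```

Surrogate pairs are not handled in this sketch; a production version would need to decide what to do with them.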

Testing Instructions

To test this locally:

  1. Initialise the upstream DB with the new sample data, ./ov just down -v && ./ov just catalog/init.
  2. Run ./ov just catalog/pgcli and execute select jsonb_array_elements(tags) from image where identifier = 'aeba0547-61da-42ee-b561-27c8fc817d5a'; to observe the testing data. Note the following:
  • muséo already exists in processed unicode for two providers
  • The same string also exists as escaped unicode as well as unescaped unicode, for the flickr provider.
  3. Now open Airflow (localhost:9090) and log in with airflow/airflow as username and password. Enable the batched_update DAG and the decode_and_deduplicate_image_tags DAG. Run the latter and wait for it to complete.
  4. Run the select from step 2 again, and confirm the following:
  • The unescaped unicode strings are unmodified and still present
  • muséo only exists once per provider
  • There are no escaped literal unicode sequences remaining in the tags
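The "only exists once per provider" check can be pictured with a small sketch. It assumes tags are dictionaries with `name` and `provider` keys, as in the sample data, and that two tags are duplicates when both values match. The function name and the second provider are hypothetical, and the DAG's real deduplication may differ.

```python
def deduplicate_tags(tags: list[dict]) -> list[dict]:
    """Keep the first tag for each (name, provider) pair, preserving order."""
    seen = set()
    unique = []
    for tag in tags:
        key = (tag["name"], tag.get("provider"))
        if key not in seen:
            seen.add(key)
            unique.append(tag)
    return unique


# Tags as they might look after decoding: the decoded flickr tag now
# collides with the flickr tag that was already correct.
tags = [
    {"name": "muséo", "provider": "flickr"},
    {"name": "muséo", "provider": "flickr"},
    {"name": "muséo", "provider": "hypothetical_provider"},
]
print(deduplicate_tags(tags))
```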

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (just catalog/generate-docs for catalog
    PRs) or the media properties generator (just catalog/generate-docs media-props
    for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 🧱 stack: catalog Related to the catalog and Airflow DAGs 🗄️ aspect: data Concerns the data in our catalog and/or databases labels Jun 12, 2024
@sarayourfriend sarayourfriend requested review from a team as code owners June 12, 2024 07:37
@obulat
Contributor

obulat commented Jun 12, 2024

Do y'all know the provenance of these broken tags? Are these tags this way (incorrectly escaped, etc) in the provider TSVs?

If not, if this error happened after the TSVs, then can we reingest these specific records from the TSVs?

These were ingested before the transfer of the project from Creative Commons to WordPress. @zackkrida will correct me if I'm wrong, but I think we received all the catalog database data as parquet files that we then inserted into the catalog database. So, there are no TSVs for the items that were ingested before the transfer.

If so, then can we use the select query to build a temporary table of works that need to be reingested from upstream altogether? Rather than try to fix these tags in place, can we instead pull these results from upstream fresh? That way, we would completely avoid these kinds of shenanigans and causing potentially catastrophic harm to even more of our data.

I've tried investigating, and it seems that the incorrectly encoded tags were saved mostly before 2021. I selected some items that have an accented character, "é", in the title. It seems that the titles' encoding was fixed, but the tags (and meta_data.description) weren't.

SELECT *
FROM image
WHERE created_on > '2021-01-01'
  AND provider='flickr'
  AND tags IS NOT NULL
  AND title LIKE '%é%'
LIMIT 1;

Here are some of the results:

identifier	created_on	updated_on	ingestion_type	provider	source	foreign_identifier	foreign_landing_url	url	title	meta_data	tags	last_synced_with_source
543adf1b-b485-4e37-b4a1-17afd489abd5	2020-04-10 19:09:23.968806 +00:00	2024-04-24 23:18:35.776853 +00:00	provider_api	flickr	flickr	10451976383	https://www.flickr.com/photos/12950131@N06/10451976383	https://live.staticflickr.com/7398/10451976383_d64455cbc0_b.jpg	Parroquia San Pedro y San Pablo Apóstoles,Calimaya,Estado de México	{"views": "2115", "pub_date": "1382586863", "date_taken": "2013-10-17 12:54:49", "description": "Parroquia San Pedro y San Pablo Apu00c3u00b3stoles,Calimaya,Estado de Mu00c3u00a9xico Parroquia San Pedro y San Pablo Apu00c3u00b3stoles Pbro: R.R.P.P. Zona Pastoral: Decanato: Direcciu00c3u00b3n: Juu00c3u00a1rez S/N Colonia: Calimaya Ciudad : Calimaya Municipio: Calimaya de CP: 52200 Tel. 01 717 171 50 60 / 2 12 51 51 Fax: E-Mail:[calimaya@diocesistoluca.org.mx](mailto:calimaya@diocesistoluca.org.mx) Visita la Pagina Facebook y da clik en me gusta [www.facebook.com/catedralesiglesias](http://www.facebook.com/catedralesiglesias) u00c2u00a9 u00c3lbum 1842 By Catedrales e Iglesias By Cathedrals and Churches By Catedrais e Igrejas Par Cathu00c3u00a9drales et Eglises Diu00c3u00b3cesis de Toluca [www.catedraleseiglesias.com](http://www.catedraleseiglesias.com/) Los franciscanos inician en 1561 la construcciu00c3u00b3n del convento, uno de los mu00c3u00a1s grandes que se construyeron en la zona. Anexo a el se levantaron las capillas abiertas, al gran u00c3u00a1trio, las cruces atriales, las capillas posas; muestra de elementos arquitectu00c3u00b3nicos novohispu00c3u00a1nicos de la arquitectura conventual del siglo XVI. En un extremo se levantu00c3u00b3 en la misma u00c3u00a9poca la Capilla de la Tercera Orden. 
Ejemplos representativos del barroco popular dentro del municipio de Calimaya son la portada de la Iglesia de Santa Maru00c3u00ada Nativitas que es un bello ejemplar del barroco en argamasa, la capilla de San Andru00c3u00a9s Ocotlu00c3u00a1n, cuya portada ostenta elementos decorativos como las columnas salomu00c3u00b3nicas; posiblemente su reconstrucciu00c3u00b3n se realizu00c3u00b3 a principios del siglo XVIII; la capilla de San Juan Bautista y la de Nuestra Seu00c3u00b1ora de los u00c3ngeles; la iglesia de Nuestra Seu00c3u00b1ora de la Concepciu00c3u00b3n Coatipac, que tiene uno de los retablos populares mu00c3u00a1s valiosos de la entidad; su portada es sencilla. La capilla de San Bartolito, en el pueblo del mismo nombre, cuya portada es rica en ornamentaciu00c3u00b3n. El 29 de julio, tres mayordomos organizan la fiesta mayor, dedicada a San Pedro y San Pablo, patronos de nuestra parroquia. Esta fiesta es la mu00c3u00a1s lucida de todas, aunque su magnitud depende de la ayuda que se obtiene de las familias y el ayuntamiento; nunca faltan los castillos y cohetes, cuando se puede se organizan", "license_url": "https://creativecommons.org/licenses/by/2.0/"}	[{"name": "diu00f3cesisdetoluca", "provider": "flickr"}, {"name": "architecture", "accuracy": 0.99384, "provider": "clarifai"}, {"name": "building", "accuracy": 0.97936, "provider": "clarifai"}, {"name": "castle", "accuracy": 0.8881, "provider": "clarifai"}, {"name": "church", "accuracy": 0.96808, "provider": "clarifai"}, {"name": "city", "accuracy": 0.95849, "provider": "clarifai"}, {"name": "daylight", "accuracy": 0.92505, "provider": "clarifai"}, {"name": "house", "accuracy": 0.88541, "provider": "clarifai"}, {"name": "landmark", "accuracy": 0.8769, "provider": "clarifai"}, {"name": "no person", "accuracy": 0.96392, "provider": "clarifai"}, {"name": "old", "accuracy": 0.89039, "provider": "clarifai"}, {"name": "outdoors", "accuracy": 0.95769, "provider": "clarifai"}, {"name": "park", "accuracy": 0.91706, 
"provider": "clarifai"}, {"name": "religion", "accuracy": 0.96332, "provider": "clarifai"}, {"name": "sky", "accuracy": 0.96682, "provider": "clarifai"}, {"name": "temple", "accuracy": 0.88566, "provider": "clarifai"}, {"name": "tourism", "accuracy": 0.9304, "provider": "clarifai"}, {"name": "town", "accuracy": 0.90139, "provider": "clarifai"}, {"name": "travel", "accuracy": 0.98298, "provider": "clarifai"}, {"name": "tree", "accuracy": 0.89195, "provider": "clarifai"}, {"name": "turning point", "accuracy": 0.88772, "provider": "clarifai"}]	2020-10-19 08:37:57.820842 +00:00
3b62ec44-5450-4668-90d8-0fa539b2478b	2020-04-16 03:47:30.037693 +00:00	2024-04-24 23:18:08.158779 +00:00	provider_api	flickr	flickr	5586856485	https://www.flickr.com/photos/12950131@N06/5586856485	https://live.staticflickr.com/5187/5586856485_082284d3d4_b.jpg	Capilla de San Judas Tadeo (Queretaro) Estado de Queretaro,México	{"views": "768", "pub_date": "1301879873", "date_taken": "2011-03-23 15:38:35", "description": "By Catedrales e Iglesias Album 2528 u00a9 [CatedraleseIglesias.com](http://catedraleseiglesias.com/) u00c1lbum 2528 Diu00f3cesis de Queru00e9taro [www.catedraleseiglesias.com/](http://www.catedraleseiglesias.com/) Capilla de San Judas Tadeo Carretera Federal Km 200 Santiago de Queru00e9taro Estado de Queru00e9taro,Mu00e9xico", "license_url": "https://creativecommons.org/licenses/by/2.0/"}	[{"name": "diu00f3cesisdequeretaro", "provider": "flickr"}, {"name": "di\\xf3cesisdequeretaro", "provider": "flickr"}]	2020-09-15 12:29:41.503524 +00:00
19b31c6e-1265-40b2-a35c-06cd71caa9e0	2020-03-31 02:43:33.570393 +00:00	2024-04-26 10:47:17.965752 +00:00	provider_api	flickr	flickr	22254457204	https://www.flickr.com/photos/72031802@N00/22254457204	https://live.staticflickr.com/5759/22254457204_c30c5062c9_b.jpg	Cérémonie des récompenses	{"views": "287", "pub_date": "1447006585", "date_taken": "2015-11-06 21:37:10", "license_url": "https://creativecommons.org/licenses/by-nd/2.0/"}	[{"name": "2015", "provider": "flickr"}, {"name": "cérémonie", "provider": "flickr"}, {"name": "challenge", "provider": "flickr"}, {"name": "cu00e9ru00e9monie", "provider": "flickr"}, {"name": "provence", "provider": "flickr"}, {"name": "récompenses", "provider": "flickr"}, {"name": "ru00e9compenses", "provider": "flickr"}, {"name": "trails", "provider": "flickr"}]	2020-08-15 08:38:42.931822 +00:00
8c043a74-467e-4593-b7a1-9a2013021d2b	2020-04-12 07:35:29.829762 +00:00	2024-04-24 23:18:35.776853 +00:00	provider_api	flickr	flickr	7208190412	https://www.flickr.com/photos/73632227@N02/7208190412	https://live.staticflickr.com/8167/7208190412_6b1fccf41e_b.jpg	Jean Auguste Dominique Ingres, <i>Napoléon Ier sur le trône impérial</i>	{"views": "936", "pub_date": "1337154725", "date_taken": "2012-03-30 16:48:53", "description": "1806, huile sur toile, 260 x 163 cm, Paris, musu00e9e de l'Armu00e9e.", "license_url": "https://creativecommons.org/licenses/by/2.0/"}	[{"name": "chu00e9zy", "provider": "flickr"}, {"name": "lavieetlesartsu00e0parisdepuisnapolu00e9onier", "provider": "flickr"}, {"name": "plumesetpinceaux", "provider": "flickr"}, {"name": "reproductionsdetableaux", "provider": "flickr"}, {"name": "ancient", "accuracy": 0.97437, "provider": "clarifai"}, {"name": "armor", "accuracy": 0.99245, "provider": "clarifai"}, {"name": "art", "accuracy": 0.97399, "provider": "clarifai"}, {"name": "costume", "accuracy": 0.93348, "provider": "clarifai"}, {"name": "culture", "accuracy": 0.92622, "provider": "clarifai"}, {"name": "god", "accuracy": 0.97131, "provider": "clarifai"}, {"name": "gold", "accuracy": 0.9361, "provider": "clarifai"}, {"name": "Gothic", "accuracy": 0.97152, "provider": "clarifai"}, {"name": "helmet", "accuracy": 0.9616, "provider": "clarifai"}, {"name": "historic", "accuracy": 0.94033, "provider": "clarifai"}, {"name": "knight", "accuracy": 0.99223, "provider": "clarifai"}, {"name": "man", "accuracy": 0.93431, "provider": "clarifai"}, {"name": "old", "accuracy": 0.94737, "provider": "clarifai"}, {"name": "people", "accuracy": 0.9516, "provider": "clarifai"}, {"name": "religion", "accuracy": 0.98319, "provider": "clarifai"}, {"name": "sword", "accuracy": 0.98784, "provider": "clarifai"}, {"name": "symbol", "accuracy": 0.94008, "provider": "clarifai"}, {"name": "traditional", "accuracy": 0.95411, "provider": "clarifai"}, {"name": "warrior", "accuracy": 
0.96902, "provider": "clarifai"}, {"name": "weapon", "accuracy": 0.9558, "provider": "clarifai"}]	2020-08-04 06:23:38.926710 +00:00
3e59636c-6b39-4c85-b83b-eb141daa4ba1	2020-04-02 21:49:22.269745 +00:00	2024-04-24 23:19:51.737553 +00:00	provider_api	flickr	flickr	12602374323	https://www.flickr.com/photos/12950131@N06/12602374323	https://live.staticflickr.com/3809/12602374323_3949c081ef_b.jpg	Parroquia San Antonio de Padua,Morelos,Estado de Zacatecas,México	{"views": "519", "pub_date": "1392688202", "date_taken": "2014-01-30 10:57:36", "description": "Parroquia San Antonio de Padua San Antonio No. 61 Apartado Postal 15 C.P.98100 Tel. (492) 931-0169 Morelos, Zac. Visita la Pagina Facebook y da clik en me gusta [www.facebook.com/catedralesiglesias](http://www.facebook.com/catedralesiglesias) u00a9 u00c1lbum 3241 By Catedrales e Iglesias By Cathedrals and Churches Par Cathu00e9drales et Eglises By catedrals i esglu00e9sies Diu00f3cesis de Zacatecas [www.catedraleseiglesias.com](http://www.catedraleseiglesias.com/) la parroquia del Sr. San Antonio de Padua, su construcciu00f3n se iniciu00f3 en 1888, tiene un excelente altar de mu00e1rmol, por su estilo y belleza, u00fanico en la regiu00f3n. Capilla a la Virgen de Guadalupe construida en 1927 y reconstruida en 1979. El monumento a la bandera construido en 1963. El palacio municipal construido en 1902 y reconstruido en 1985. El jardu00edn del centro de la poblaciu00f3n construido en 1905, entonces recibu00eda el nombre de Jardu00edn Porfirio Du00edaz, posteriormente ha sido remode lado y embellecido varias veces. Son dignas de admirar las bellas imu00e1genes dedicadas a Cristo, a Maru00eda Santu00edsima y a algunos Santos que se encuentran en el templo parroquial desde el au00f1o de 1895.", "license_url": "https://creativecommons.org/licenses/by/2.0/"}	[{"name": "diócesisdezacatecas", "provider": "flickr"}, {"name": "diu00f3cesisdezacatecas", "provider": "flickr"}]	2020-08-17 07:11:06.045230 +00:00
2cc5a2c5-bd0e-451f-8eee-9d3411e3d3af	2020-04-25 14:57:31.530511 +00:00	2024-04-30 06:39:35.373781 +00:00	provider_api	flickr	flickr	3945670983	https://www.flickr.com/photos/22155587@N02/3945670983	https://live.staticflickr.com/2660/3945670983_d618602ff3_b.jpg	Árvore no topo do prédio	{"views": "394", "pub_date": "1253667383", "date_taken": "2009-06-04 22:20:50", "description": "Esse pru00e9dio fica na avenida do Estado, pru00f3ximo a Estau00e7u00e3o Ana Neri (o antigo Fura-Fila) Foto Daniel Lescano", "license_url": "https://creativecommons.org/licenses/by/2.0/"}	[{"name": "avenidadoestado", "provider": "flickr"}, {"name": "su00e3opaulo", "provider": "flickr"}, {"name": "u00e1rvore", "provider": "flickr"}, {"name": "architecture", "accuracy": 0.9331, "provider": "clarifai"}, {"name": "building", "accuracy": 0.9591, "provider": "clarifai"}, {"name": "city", "accuracy": 0.87329, "provider": "clarifai"}, {"name": "construction", "accuracy": 0.90054, "provider": "clarifai"}, {"name": "daylight", "accuracy": 0.81923, "provider": "clarifai"}, {"name": "expression", "accuracy": 0.88874, "provider": "clarifai"}, {"name": "fence", "accuracy": 0.85442, "provider": "clarifai"}, {"name": "glass items", "accuracy": 0.81614, "provider": "clarifai"}, {"name": "house", "accuracy": 0.82026, "provider": "clarifai"}, {"name": "landscape", "accuracy": 0.84593, "provider": "clarifai"}, {"name": "museum", "accuracy": 0.84428, "provider": "clarifai"}, {"name": "no person", "accuracy": 0.96234, "provider": "clarifai"}, {"name": "outdoors", "accuracy": 0.91014, "provider": "clarifai"}, {"name": "roof", "accuracy": 0.91687, "provider": "clarifai"}, {"name": "sky", "accuracy": 0.978, "provider": "clarifai"}, {"name": "steel", "accuracy": 0.86769, "provider": "clarifai"}, {"name": "sun", "accuracy": 0.81948, "provider": "clarifai"}, {"name": "travel", "accuracy": 0.84344, "provider": "clarifai"}, {"name": "urban", "accuracy": 0.83056, "provider": "clarifai"}, {"name": "web", "accuracy": 
0.86616, "provider": "clarifai"}]	2020-08-28 12:46:01.123546 +00:00
ac151548-5981-426c-90f4-62045daeca4b	2020-04-06 05:42:29.051510 +00:00	2020-12-09 09:07:44.332781 +00:00	provider_api	flickr	flickr	15792631340	https://www.flickr.com/photos/12950131@N06/15792631340	https://live.staticflickr.com/7519/15792631340_fdca83a522_b.jpg	Catedral de Tulancingo 'San Juan Bautista' Tulancingo,Estado de Hidalgo,México	{"views": "432", "pub_date": "1418089122", "date_taken": "2014-12-03 14:36:54", "description": "Catedral de Tulancingo San Juan Bautista, El Sagrario de Catedral Erec. Noviembre 16 de 1754 Reg. Const.: SGAR/694:55/94 Dom. Plaza de la Constitución s/n Col. Centro, Tulancingo Estado de Hidalgo,México C. P. 43600, A. P. 29 Tel. 01 (775) 75 3 11 31 Fax: 01 (775) 75 3 00 91 La Catedral de Tulancingo, dedicada a San Juan Bautista, es una obra de la arquitectura religiosa del México colonial construida a partir de 1528 por la Orden Franciscana. Imponente y a la vez sencilla destaca en el centro Histórico de Tulancingo, Hidalgo, frente a la plaza principal La Floresta. El edificio originalmente fue de menores proporciones, edificado por los franciscanos, quienes evangelizaron en la zona. Fue remozado y ampliado en el año de 1788 por el arquitecto José Damián Ortiz de Castro, quién también colaboró en la planeación y terminación de la Catedral de México. Es comprensible la modificación del estilo por este arquitecto, debido al México que en aquel entonces pasaba por la transición del barroco, a la sencillez del Neoclásico Interior Catedral. De cantera gris, sobrio y elegante. Muestra en la portada un frontón de estilo neoclásico, de forma triángular, sostenido por dos colmnas y pilastras que son de estilo jonico (dos a cada lado de la entrada), alcanzando los 17 metros de altura. Éstas enmarcan el acceso principal al templo. cuenta con dos pequeñas torres de un solo cuerpo, de sencillas proporciones. La cúpula es de forma octagonal y cuenta con una pequeña linternilla. 
En el interior, también renovado al estilo neoclásico, destaca el altar principal, la pila bautismal labrada en piedra y un púlpito de madera con decoración en relieve, así como una cruz atrial y reloj de sol en sus patios. © [CatedraleseIglesias.com](http://catedraleseiglesias.com/) Álbum 0636 Arquidiócesis de Tulancingo [www.catedraleseiglesias.com/](http://www.catedraleseiglesias.com/)", "license_url": "https://creativecommons.org/licenses/by/2.0/"}	[{"name": "arquidiócesisdetulancingo", "provider": "flickr"}, {"name": "arquidiu00f3cesisdetulancingo", "provider": "flickr"}]	2020-12-09 09:07:44.332781 +00:00

The tags extracted, without the clarifai tags:

[{"name": "diu00f3cesisdetoluca", "provider": "flickr"}]
[{"name": "diu00f3cesisdequeretaro", "provider": "flickr"}, {"name": "di\\xf3cesisdequeretaro", "provider": "flickr"}]
[{"name": "2015", "provider": "flickr"}, {"name": "cérémonie", "provider": "flickr"}, {"name": "challenge", "provider": "flickr"}, {"name": "cu00e9ru00e9monie", "provider": "flickr"}, {"name": "provence", "provider": "flickr"}, {"name": "récompenses", "provider": "flickr"}, {"name": "ru00e9compenses", "provider": "flickr"}, {"name": "trails", "provider": "flickr"}]
[{"name": "chu00e9zy", "provider": "flickr"}, {"name": "lavieetlesartsu00e0parisdepuisnapolu00e9onier", "provider": "flickr"}, {"name": "plumesetpinceaux", "provider": "flickr"}, {"name": "reproductionsdetableaux", "provider": "flickr"}]
[{"name": "arquidiócesisdetulancingo", "provider": "flickr"}, {"name": "arquidiu00f3cesisdetulancingo", "provider": "flickr"}]

These examples show that we do have many items with incorrectly encoded tags that do not have a correct duplicate.
I did not see any double-backslash-u-encoded characters in the data, but that probably means that I didn't look closely enough.

Thank you for sharing the example of an item that has an error due to us hot-fixing the encoding errors incorrectly.

What if we run this DAG in 3 stages:

  1. Replace and deduplicate \\x-escaped characters.
  2. Replace and deduplicate \\u-escaped characters.
  3. Select all items with u-escaped characters (a bare u with no backslash).

This will allow us to see how long this kind of DAG runs, and will not cause any data corruption in the first 2 steps. As for the 3rd step, selecting data (or doing a dry-run and saving the selected data) will allow us to assess the shape of the data and help us find a way of fixing the encoding without corrupting the data.
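A dry run over the three categories could look roughly like the sketch below. The regexes and category names are assumptions for illustration, not the DAG's actual patterns. The tag names come from the query results above, plus the muséo test tag in its escaped form.

```python
import re
from collections import Counter

BACKSLASH_X = re.compile(r"\\x[0-9a-fA-F]{2}")
BACKSLASH_U = re.compile(r"\\u[0-9a-fA-F]{4}")
# Bare "uXXXX" with no preceding backslash; this may be genuine text.
BARE_U = re.compile(r"(?<!\\)u[0-9a-fA-F]{4}")


def classify(name: str) -> str:
    """Return which stage of the plan, if any, a tag name falls under."""
    if BACKSLASH_X.search(name):
        return "backslash_x"
    if BACKSLASH_U.search(name):
        return "backslash_u"
    if BARE_U.search(name):
        return "bare_u"
    return "clean"


names = [
    "di\\xf3cesisdequeretaro",
    "mus\\u00e9o",
    "diu00f3cesisdequeretaro",
    "cu00e9ru00e9monie",
    "cérémonie",
    "challenge",
]
# Count only; nothing is modified, so the selection is safe to run.
print(Counter(classify(name) for name in names))
```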

@zackkrida
Member

zackkrida commented Jun 12, 2024

@sarayourfriend and @obulat I analyzed the Sentry-reported records for this issue and found that:

  • All reported errors came from Flickr tags
  • All tags with issues were Latin-alphabet characters with diacritical marks. These are all in the unicode range of \u00C0 to \u01FF, which covers the letters of "Latin-1 Supplement", all of "Latin Extended-A", and the start of "Latin Extended-B".

@zackkrida will correct me if I'm wrong, but I think we received all the catalog database data as parquet files that we then inserted into the catalog database. So, there are no TSVs for the items that were ingested before the transfer.

This is correct, yes. No original TSVs, just parquet files produced from the then-postgres catalog DB.

lil' ts/bun script I used
const identifiers = [
  "e5e02148-242e-479f-a15c-ee3324070b62",
  "1a59fe28-d2e3-4968-bda0-1ea63ddf9a96",
  "8b336a65-df09-4d8d-bd09-c1a3444a38b3",
  "b643f8d8-2426-4b65-97cb-e2780850cc8e",
  "b903b523-fcfc-46d7-a3f8-4d98e19ed6ed",
  "b903b523-fcfc-46d7-a3f8-4d98e19ed6ed",
  "b903b523-fcfc-46d7-a3f8-4d98e19ed6ed",
  "90035343-9492-404c-95a3-7352c6865355",
  "90035343-9492-404c-95a3-7352c6865355",
  "6bbf97c0-8b98-458b-a8d4-702423208d33",
  "9d1500f6-134c-4fcc-aca4-7eeded50ad3b",
  "61fa36d9-2a2d-49cd-840c-32ab6a24d4be",
  "cbf25f37-f9e7-4893-a9da-d45022ede7e0",
  "ea94d7bc-d5c8-4a96-8bc4-c8fd06720182",
  "9ed87347-a979-448a-8a44-e86264bc1211",
  "a9bd5d5e-d11f-4bf5-8d96-d2d1fbccff2b",
  "de85d5a0-7948-45e9-8008-79501be5d072",
  "6ca87636-2d6f-4deb-a940-1e23bcc82fda",
  "20136057-3f09-46bb-8de8-0f3494d226fc",
  "1ea36e4b-a2b9-4501-bd71-2d2abbf1630e",
  "92cad4b9-8777-4d91-af2b-4d464a5d0daf",
  "1f3918c5-a00e-4479-8e69-806ed0c33fd5",
  "ea3ae437-52bc-4560-8614-b96c70357262",
  "1b5fa3b5-5f0f-4ca3-a4b9-c5d605787779",
  "32e1a3ef-eb4d-49eb-a8d2-27bf8cd9c93a",
  "9875b486-9685-4ef9-aff2-80db4121fb2f",
  "8efa2282-4af5-46bf-8580-82031cc8e24b",
  "cf5a5518-8476-4ae6-8030-9e95c0cb3398",
  "8d25bd2e-6424-42f0-b42f-58fef9518847",
  "7c0908ee-1f10-4cef-bfd1-1d84625b396e",
  "964ebd07-5dc1-4801-a26b-d073cb3083b7",
  "6cff3b3f-3857-488d-9e8a-59448de2a142",
  "e730a650-accb-45a6-bfd1-8dada00b2b27",
  "893e1578-37aa-4227-9d38-ac1ef9497577",
  "893e1578-37aa-4227-9d38-ac1ef9497577",
  "c93e9d56-190c-4a63-9cfa-4a370da4ca92",
  "076a0436-1d9e-449a-a224-01301f4d8c8e",
  "9d1500f6-134c-4fcc-aca4-7eeded50ad3b",
  "1d4cbf20-a9a0-410e-94cc-4c629d332bb9",
  "0be56341-d9fc-4ca0-bf00-42556bb005a5",
  "0be56341-d9fc-4ca0-bf00-42556bb005a5",
  "0be56341-d9fc-4ca0-bf00-42556bb005a5"
]

interface Tag {
  name: string
}
type Tags = Tag[]


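(async function () {
  // Collect unique tag names across all reported records.
  const tags = new Set<string>();

  // De-duplicate identifiers so each record is fetched only once.
  for (const id of new Set(identifiers)) {
    const result = await fetch(`https://api.openverse.engineering/v1/images/${id}?format=json`);
    if (!result.ok) {
      // Skip records the API can't return instead of failing on `json.tags`.
      console.warn(`Skipping ${id}: HTTP ${result.status}`);
      continue;
    }
    const json = await result.json();
    for (const tag of (json.tags ?? []) as Tags) {
      tags.add(tag.name);
    }
  }

  console.info([...tags]);
})();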

@sarayourfriend
Collaborator Author

Your plan sounds good to me, Olga. The first two steps I think can go into one, right? The regexes are combined anyway, and those transformations we're pretty sure are safe, as far as I know.

Thanks for the further analysis, Zack. It is good to know the issue has some scope smaller than all of unicode... but there could still be overlaps with genuine sequences. I can't think of an actual example, but on Flickr especially people often include tags of camera models, or references to dated events with hashtags and things. Maybe some of these are already considered suboptimal records, but guessing at them feels really wrong to me, especially if doing so makes an actually safe fix (namely, reingesting suspect records from Flickr) more difficult.

Even if we narrow the range of unescaped unicode sequences, I still feel that this is a very broad, naive approach for a rather narrow issue.
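To make the overlap concern concrete, here is a sketch of the naive approach restricted to the \u00C0 to \u01FF range. The function name and the tags "itemu01f0" and "menu2019" are hypothetical. The point is only that the range restriction protects some genuine strings but not others.

```python
import re

# Bare "uXXXX" with no leading backslash.
BARE_SEQUENCE = re.compile(r"(?<!\\)u([0-9a-fA-F]{4})")


def naive_decode(text: str, low: int = 0x00C0, high: int = 0x01FF) -> str:
    """Decode bare sequences whose code point falls inside [low, high]."""

    def _to_char(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        if low <= code_point <= high:
            return chr(code_point)
        return match.group(0)  # out of range: leave the text untouched

    return BARE_SEQUENCE.sub(_to_char, text)


print(naive_decode("musu00e9o"))   # the intended fix
print(naive_decode("menu2019"))    # out of range, left unchanged
print(naive_decode("itemu01f0"))   # in range, rewritten even if it was genuine
```

Nothing in the string itself distinguishes the first case from the third, which is why only a comparison with upstream can settle it.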

I like Olga's plan because it pushes off the decision, and allows us to fix some of these. However, it's clear from Olga's query that this is a significant subset of the problematic records which will need addressing no matter what. Is there a particular hang-up with reingesting these records from upstream, considering there aren't TSVs for us to use for these? I know it would require writing a new method of ingestion, but the risk of permanent damage is pretty big here, and the chances of discovering damage are basically zero (needle in a haystack: finding the small number of works we mangled in a review after the fact would be really hard, especially if the text is already obscure, even if intentional). Say we decided to apply this naive fix now, but save the identifiers of all affected rows. What's the next step? I can't imagine any way of reviewing the changes [1], so wouldn't the only way to know we hadn't caused a problem then be to compare to upstream again? At that point, we're one step away from reingestion, right?

To my mind, why not just go straight for the real, safe fix? Unless the intention is to apply this and not care whether false positives are incorrectly mangled. But then why fix this subset of cases at all? There's of course a difference between mangling what looks like an obscure string of characters (u01f0) and fixing a problem where muséo is mangled. But maybe that obscure string is meaningful to someone [2].

I'm not personally comfortable with the solution for unescaped sequences, even if the problem space were reduced to a single unicode sequence. If that were the case, and if the number of records was small, we could manually review and fix them maybe. And even then, we'd probably have to consult with upstream to be sure anyway.

I can imagine other use cases for the ability to reingest records (e.g., verifying dead links aren't just changed upstream metadata), so I don't think work to do so for records suspected of having unescaped unicode sequences would


In any case, I'll update this PR to remove handling unescaped sequences altogether. It significantly simplifies the code to do so anyway, so hooray!

Footnotes

  1. Unless the problem space is genuinely so small that a human can review it, in which case, we can just manually write the fixed rows or at least review the rows that would be affected... though that needs to somehow be "frozen"...

  2. I could actually really easily see that being part of a GLAM institution's accession number (just occurred to me as another example to the camera model or dated event hashtags). Check out this Flickr result as an example of something that looks just as obscure as an unescaped unicode sequence but is apparently meaningful: https://openverse.org/image/ca4c9d9d-c22b-45ad-a375-e8431f8d5cec/. In that case, it's the name that has the obscure string, but a different institution could just as easily have put them into the tags.

@sarayourfriend sarayourfriend marked this pull request as draft June 12, 2024 21:34
@sarayourfriend sarayourfriend force-pushed the add/decode-dedup-tags branch from 7a5a75d to 7eccb9a Compare June 13, 2024 21:38
@sarayourfriend
Collaborator Author

I've got the changes in to only work on tags with escaped sequences (actually, just removing the unescaped handling). However, I'm working through a bug where the transformation was being applied to strings that were already unicode encoded, which causes them to get mangled. muséo turned into muséo. No good! Right now I'm trying to handle that in SQL by only passing strings that need the fix through ov_unistr, but it occurs to me now that it might be easier (a lot easier, in fact) to just check in ov_unistr whether the string is already unicode... but then we're really just duplicating the regex check that found these tags to begin with (so you can see why I'd like to do this in SQL, juggling the escaped regex between PG strings and a Python template is tedious!).
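One way this kind of mangling can arise in Python, shown as a sketch: decoding the whole string with the `unicode_escape` codec reinterprets the UTF-8 bytes of already-decoded characters as Latin-1. Whether that is the exact cause here is an assumption, and the function names are hypothetical. The guarded version substitutes only the matched escapes, so text that is already correct passes through untouched.

```python
import re

ESCAPED_SEQUENCE = re.compile(r"\\(?:u[0-9a-fA-F]{4}|x[0-9a-fA-F]{2})")


def blanket_decode(text: str) -> str:
    # Decodes the whole string; non-ASCII characters are corrupted because
    # their UTF-8 bytes are read back one byte at a time as Latin-1.
    return text.encode("utf-8").decode("unicode_escape")


def guarded_decode(text: str) -> str:
    # Replace only the escaped sequences; everything else is left as-is.
    # m.group(0)[2:] drops the backslash and the u/x prefix, leaving hex.
    return ESCAPED_SEQUENCE.sub(lambda m: chr(int(m.group(0)[2:], 16)), text)


print(blanket_decode("mus\\u00e9o"))  # fine: escaped input
print(blanket_decode("muséo"))        # mangled: already-decoded input
print(guarded_decode("muséo"))        # unchanged
print(guarded_decode("mus\\u00e9o"))  # decoded
```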

@krysal
Member

krysal commented Jun 14, 2024

@sarayourfriend Thanks for raising this concern. I was having similar thoughts but didn't have the time to dig into the solution (or alternatives).

Is there a particular hang up with reingesting these records from upstream, considering there aren't TSVs for us to use for these?

Specifically with Flickr, there is in fact a re-ingestion DAG, but it's paused given the many problems it was causing before (rate limits/unstable results returned). I think given the size of Flickr a paginated API might not be an adequate tool, and old items like this can be even more problematic. I don't know if there is an alternative way to re-ingest these records; @stacimc might know better since she worked a lot on Flickr (re)ingestion DAGs.

@sarayourfriend
Collaborator Author

I'd assume reingestion of a known subset of records (rather than date ranges) would be able to utilise a single-item API, specifically: https://www.flickr.com/services/api/flickr.photos.getInfo.html

That route requires only the foreign ID of the work, which we have, and returns tags, description, title, license, etc. everything we'd need to fully reingest the record (from a data perspective, never mind architecting the process of getting it back into the catalog, whether that would use an intermediate TSV, etc).

I'm not talking about a backfill, which is what I think you're referencing by reingestion. I mean targeting the specific works we're worried have mangled data and pulling them individually from upstream. Irrespective of the amount of time that would take due to rate limits (and being generally respectful), it is a stable and predictable way to retrieve corrected data for each work.
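As a rough sketch of what a single-item pull could look like: the snippet below only builds the request URL and reads tags out of a response dictionary, without making any network call. The REST endpoint and parameter names follow Flickr's public API documentation as I understand it; `build_get_info_url`, `extract_raw_tags`, and the trimmed sample response are hypothetical and for illustration only.

```python
from urllib.parse import urlencode

FLICKR_REST_ENDPOINT = "https://api.flickr.com/services/rest/"


def build_get_info_url(foreign_id: str, api_key: str) -> str:
    """Hypothetical helper: URL for one flickr.photos.getInfo request."""
    params = {
        "method": "flickr.photos.getInfo",
        "api_key": api_key,
        "photo_id": foreign_id,
        "format": "json",
        "nojsoncallback": 1,
    }
    return f"{FLICKR_REST_ENDPOINT}?{urlencode(params)}"


def extract_raw_tags(response: dict) -> list[str]:
    """Hypothetical helper: the tags as the uploader typed them."""
    tags = response.get("photo", {}).get("tags", {}).get("tag", [])
    return [tag["raw"] for tag in tags]


# Trimmed, made-up response in the shape the JSON format is assumed to use.
sample_response = {
    "photo": {
        "id": "123",
        "tags": {"tag": [{"raw": "mus\u00e9o", "_content": "museo"}]},
    },
    "stat": "ok",
}

url = build_get_info_url("123", api_key="PLACEHOLDER_KEY")
raw_tags = extract_raw_tags(sample_response)
```

Keeping the URL building and the parsing separate from the HTTP call would make it straightforward to wrap the call itself in whatever rate limiting we settle on.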

We could do something like the following, based on the batched update DAG's method:

  1. In a one-off DAG, create a temporary table of the works in question (if it does not already exist), matching the schema of the batched update DAG's temporary table, but with an additional unindexed jsonb column retrieved_data (or something like that) and an unindexed foreign_id column. I suppose we'd also need to include provider should this issue exist for more than just Flickr (like Wikimedia).
  2. In a scheduled 5-minutely DAG, process batches from the temporary table, sized appropriately to be within a respectful rate limit for 5 minutes, and that can actually finish within 5 minutes. The DAG allows at most 1 simultaneous run. The DAG will identify the batch it should work on by querying the temporary table for rows WHERE retrieved_data IS NULL, ordered by row ID, LIMIT _X_ (might be a way to optimise this query if needed?). Pull each work and put the response in retrieved_data. Maybe also put the response in S3 where a dated DAG would be able to reliably retrieve it based on execution date?
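The batch selection in step 2 could be sketched roughly as below. This is an illustration under several assumptions: SQLite stands in for Postgres so the snippet runs on its own (so `retrieved_data` is TEXT rather than jsonb), and the table name `rows_to_refetch`, its columns, and the helper names are all hypothetical.

```python
import json
import sqlite3

BATCH_SIZE = 2  # hypothetical; would be sized to fit the rate limit

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE rows_to_refetch (
        row_id INTEGER PRIMARY KEY,
        identifier TEXT NOT NULL,
        provider TEXT NOT NULL,
        foreign_id TEXT NOT NULL,
        retrieved_data TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO rows_to_refetch (identifier, provider, foreign_id) VALUES (?, ?, ?)",
    [(f"uuid-{n}", "flickr", str(1000 + n)) for n in range(5)],
)


def next_batch(conn: sqlite3.Connection, size: int) -> list[tuple]:
    """Rows not yet retrieved, oldest first."""
    return conn.execute(
        "SELECT row_id, foreign_id FROM rows_to_refetch "
        "WHERE retrieved_data IS NULL ORDER BY row_id LIMIT ?",
        (size,),
    ).fetchall()


def store_response(conn: sqlite3.Connection, row_id: int, response: dict) -> None:
    """Record the upstream response so the row is skipped by later runs."""
    conn.execute(
        "UPDATE rows_to_refetch SET retrieved_data = ? WHERE row_id = ?",
        (json.dumps(response), row_id),
    )


first = next_batch(conn, BATCH_SIZE)
for row_id, foreign_id in first:
    # A real run would call the upstream API here; this stores a stub instead.
    store_response(conn, row_id, {"photo": {"id": foreign_id}})
second = next_batch(conn, BATCH_SIZE)
```

Because each run only looks at rows where `retrieved_data` is still NULL, a run that fails partway simply leaves its remaining rows for the next scheduled run.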

That allows us to review the pulled data and decide how we'd like to upstream it into the catalog. Can we use the batched update DAG (SET tags = ov_format_tags(SELECT ndt.tags FROM new_data_table ndt WHERE ndt.identifier = image.identifier)) ? Presumably the new data should pass through the image store?

Regardless, it seems like a solvable problem, right? And one that would be useful beyond this? And also the only solution I can think of that has zero risk of causing further damage to our data?

@krysal
Member

krysal commented Jun 14, 2024

As you say, that would be literally a reingestion of a subset of image rows. I agree that, in theory, it sounds doable (regardless of how long it may take) and the safest solution to the problem.

That allows us to review the pulled data and decide how we'd like to upstream it into the catalog. Can we use the batched update DAG (SET tags = ov_format_tags(SELECT ndt.tags FROM new_data_table ndt WHERE ndt.identifier = image.identifier)) ? Presumably the new data should pass through the image store?

I was thinking more of directly assigning the new tags to the suspicious items, provided they are correctly decoded, of course. I understood that the whole point of a custom one-off re-ingestion DAG was to avoid having to review the entries manually. Perhaps I'm missing some process we could/should(?) perform in the middle. I don't see the need to touch S3. If it can be narrowed to the realm of Airflow and Postgres, I believe that would be preferable.

@sarayourfriend
Collaborator Author

I mean review more as, just make sure the data makes sense generally, not that the individual tags are fixed or expected. If we're going to overwrite records that we don't even have TSVs for, I think it'd be nice to not smash them directly into the database where the only recovery option is then restoring them from a database snapshot. Considering how close we've come recently to (1) completely destroying Clarifai tags and (2) destroying correctly encoded unicode strings (in this PR), I think moving in stages where we can take a step back and say "do we still think we're going in the right direction with this" behooves us.

@stacimc
Collaborator

stacimc commented Jun 14, 2024

@sarayourfriend's suggestion in this comment sounds good to me.

I also initially assumed you meant using the existing Flickr provider DAGs to perform reingestion. It sounds like that's not what's being discussed here, but I want to add some context on why the proposed solution differs from literal reingestion (and that's a good thing): it is a known problem that when a record gets reingested, old tags are not deleted or updated. E.g., if a record is initially ingested with a tag value of "foo", and then that tag is deleted and replaced with the tag "bar" at its source, when it is reingested the record in our catalog will have tags "foo" and "bar". (@AetherUnbound and I have discussed this but I can't find an issue or that discussion at the moment, maybe she remembers?)

There are also many reliability problems with the Flickr DAG which mean we can never be certain that it has actually ingested all records for a particular time frame. That being said, using the single image endpoint and a batched update with custom SQL to overwrite old tags can certainly avoid all these problems 👍
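To make the difference concrete, here is a toy illustration of the two behaviours. It is not the actual loader or batched update logic; `merge_tags`, `overwrite_tags`, and the simplified tag shape are hypothetical.

```python
def merge_tags(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Hypothetical stand-in for reingestion: tags only ever accumulate."""
    seen = {(tag["name"], tag["provider"]) for tag in existing}
    return existing + [
        tag for tag in incoming if (tag["name"], tag["provider"]) not in seen
    ]


def overwrite_tags(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Hypothetical stand-in for a targeted update: upstream tags win."""
    return list(incoming)


stored = [{"name": "foo", "provider": "flickr"}]    # tag from the first ingestion
upstream = [{"name": "bar", "provider": "flickr"}]  # "foo" was replaced at the source

merged = merge_tags(stored, upstream)
replaced = overwrite_tags(stored, upstream)

print([tag["name"] for tag in merged])    # ['foo', 'bar']
print([tag["name"] for tag in replaced])  # ['bar']
```

With merging, a mangled tag would sit next to its corrected version forever, which is why overwriting is the behaviour we want for this fix.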

I think it would probably be easiest to use the MediaStore to validate/enrich tags in the initial one-off DAG and avoid trying to upload to S3 and load, although I might be misunderstanding what you were suggesting there. We want to use the batched update rather than the Flickr DAG to load, because the loader steps will not replace the old tags.

@sarayourfriend sarayourfriend force-pushed the add/decode-dedup-tags branch 2 times, most recently from 1857169 to 8993c42 Compare June 17, 2024 01:10
@sarayourfriend sarayourfriend marked this pull request as ready for review June 17, 2024 01:10
@sarayourfriend sarayourfriend requested a review from a team as a code owner June 17, 2024 01:10
@sarayourfriend sarayourfriend changed the title DO NOT MERGE: Add (unusable) DAG to decode and deduplicate tags Add DAG to decode and deduplicate image tags with escaped literal unicode sequences Jun 17, 2024
@openverse-bot openverse-bot added the 🧱 stack: documentation Related to Sphinx documentation label Jun 17, 2024
Full-stack documentation: https://docs.openverse.org/_preview/4475

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@AetherUnbound
Collaborator

I do remember discussing the tags issue @stacimc, but I also don't recall where that notion might be captured. I think it'd be a great thing to add to our field documentation for the catalog, I'll make an issue for it!

+1 to Sara's idea of a targeted backfill for correcting the tags not covered by this PR.

Collaborator

@stacimc stacimc left a comment

This works great for me! Thank you for including the new sample data.

That is a beast of a SQL update query, nicely done 😄

task_id="trigger_batched_update",
trigger_dag_id=BATCHED_UPDATE_DAG_ID,
wait_for_completion=True,
execution_timeout=timedelta(hours=5),
Collaborator

Are you confident in this timeout? I don't think the triggered dagrun will be timed out if this task times out (eg, I think we'd get a timeout of this DAG but the batched update would continue running just fine) but I'm not actually 100% on that.

Collaborator Author

@sarayourfriend sarayourfriend Jun 20, 2024

I am not at all confident in this timeout 😅 And the difference between the task timing out and the triggered DAG timing out did not occur to me.

I guess I should be using the update_timeout etc on the batched update DAG conf instead of the task timeout. That's a very interesting difference between the two!

It will affect the trim and deduplication DAG as well.
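A rough sketch of the idea, without importing Airflow so it runs on its own: the timeout moves out of the triggering task's arguments and into the conf handed to the triggered run. `build_trigger_kwargs` is a hypothetical helper, and the assumption that `update_timeout` expects whole seconds is not confirmed anywhere in this thread.

```python
from datetime import timedelta

BATCHED_UPDATE_DAG_ID = "batched_update"  # DAG name as given in the PR description


def build_trigger_kwargs(update_timeout: timedelta) -> dict:
    """Hypothetical helper: keyword arguments for the triggering task."""
    return {
        "task_id": "trigger_batched_update",
        "trigger_dag_id": BATCHED_UPDATE_DAG_ID,
        "wait_for_completion": True,
        # The timeout travels in the triggered run's conf, so it is meant to be
        # enforced by the batched update DAG rather than by the waiting task.
        "conf": {
            # Assumption: the conf key expects whole seconds.
            "update_timeout": int(update_timeout.total_seconds()),
        },
    }


kwargs = build_trigger_kwargs(timedelta(hours=5))
```

The resulting dictionary no longer carries `execution_timeout`, so the waiting task would not time out on its own while the triggered run keeps going.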

@sarayourfriend
Collaborator Author

sarayourfriend commented Jun 20, 2024

That is a beast of a SQL update query, nicely done

I am actually super curious if the straight SQL is more performant than tags = ov_some_python_function(tags). I'd be interested to come up with some kind of benchmark we could run to compare things like this and know whether it's worth spending time on a pure SQL expression of something that might be much simpler for most of us to reason about in plain Python.

Sort of... knowing how to judge where the cut off is, basically, aside from "which tool is easiest" which is a valid singular approach so long as the performance difference isn't an issue.

But anyway, I was proud I sorted it out in SQL 🙊 😊

@openverse-bot
Collaborator

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@dhruvkb
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was ready for review 4 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Member

@dhruvkb dhruvkb left a comment

LGTM. I ran some examples for the ov_unistr function locally and it seems to work perfectly. Thanks!

@sarayourfriend
Collaborator Author

I just need to push a small change to fix the timeout settings as pointed out by @stacimc, and then will repush and merge.

@sarayourfriend sarayourfriend force-pushed the add/decode-dedup-tags branch from 8993c42 to b2c2b14 Compare June 26, 2024 01:19
@sarayourfriend sarayourfriend merged commit aef5a1b into main Jun 26, 2024
41 checks passed
@sarayourfriend sarayourfriend deleted the add/decode-dedup-tags branch June 26, 2024 01:39
@krysal krysal mentioned this pull request Jun 28, 2024
Labels
🗄️ aspect: data Concerns the data in our catalog and/or databases 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation
8 participants