Some images have duplicate incorrectly decoded unicode tags #1303
Labels
💻 aspect: code
Concerns the software code in the repository
🛠 goal: fix
Bug fix
🟨 priority: medium
Not blocking but should be addressed soon
🧱 stack: catalog
Related to the catalog and Airflow DAGs
Description
Some media with non-ASCII characters in tags that were ingested a long time ago have duplicate tags: one with the correct UTF-8 character and one with an incorrectly escaped sequence.
Reproduction
An affected image has both "arapça", with a "ç" (c with a cedilla), and "arapu00e7a", where that character was replaced with an incorrectly escaped "ç" as "u00e7" (this is the Unicode code point for this letter, without the "\" control character):
{"name": "arapça", "provider": "flickr"},
{"name": "arapu00e7a", "provider": "flickr"},
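To make the relationship between the two tag names concrete, here is a minimal Python sketch. It only illustrates how the mangled name corresponds to the JSON-escaped form of the correct one with the backslash dropped; it is not a claim about which step of the ingestion actually produced the bad data.

```python
import json

correct = "arapça"

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# so the "ç" is written as a backslash followed by "u00e7".
escaped = json.dumps(correct).strip('"')

# Dropping the backslash leaves the mangled tag name seen in the data.
mangled = escaped.replace("\\", "")

print(escaped)
print(mangled)
```

The two printed strings differ only by the backslash, which is why the mangled tag reads as "arapu00e7a" rather than as an escape sequence.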
Screenshots
Tags displayed on the frontend:
Additional context
I think we also had the same problem for other details such as title and description, but most of them were fixed when re-ingested. When we upsert the tags, we add all the tags that are different from the ones already saved. And since the new tag appears different than the mangled one, both were saved.
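The upsert behaviour described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the catalog's actual upsert code: the function name `merge_tags` and the choice to compare tags by name and provider are assumptions made for the example.

```python
def merge_tags(existing, incoming):
    """Append incoming tags that are not already saved.

    Tags are compared by (name, provider), which is an assumption
    made for this sketch.
    """
    seen = {(tag["name"], tag["provider"]) for tag in existing}
    merged = list(existing)
    for tag in incoming:
        key = (tag["name"], tag["provider"])
        if key not in seen:
            seen.add(key)
            merged.append(tag)
    return merged


# The mangled tag saved long ago and the correct tag from a re-ingestion.
existing = [{"name": "arapu00e7a", "provider": "flickr"}]
incoming = [{"name": "arapça", "provider": "flickr"}]

merged = merge_tags(existing, incoming)
print(merged)
```

Because the two names are different strings, the comparison treats them as different tags and the result contains both.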
This item has a non-mangled title and mangled and non-mangled tags, which suggests that the titles were fixed, and the tags were simply added to:
https://api.openverse.engineering/v1/images/829eb0a7-3ce8-44ca-8194-4a78757a88aa/
There is also an error of over-correction of the Unicode decoding error. Instead of removing the backslash before "u", the backslash is escaped by another backslash, so "arapu00e7a" becomes "arap\\u00e7a".
On the frontend, we compensate for this problem for title, creator and tag name in decode-string: https://github.com/WordPress/openverse-frontend/blob/26fb744449cbe4c25b895c75fad57ab2646b1737/src/utils/decode-data.ts
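A similar compensation could be applied on the catalog side before tags are compared. The following is a minimal sketch in Python, not a port of the frontend utility: the name `decode_mangled` and the regular expression are assumptions for illustration. It treats "u" followed by four hex digits, with any number of leading backslashes, as one escaped character.

```python
import re

# Zero or more backslashes, then "u" and exactly four hex digits.
_ESCAPED = re.compile(r"\\*u([0-9a-fA-F]{4})")


def decode_mangled(text):
    """Replace mangled or over-escaped uXXXX sequences with the character."""
    return _ESCAPED.sub(lambda match: chr(int(match.group(1), 16)), text)


variants = [
    "arapu00e7a",      # backslash removed
    "arap\\u00e7a",    # single literal backslash
    "arap\\\\u00e7a",  # backslash escaped by another backslash
    "arapça",          # already correct
]
decoded = {decode_mangled(name) for name in variants}
print(decoded)
```

A pattern this loose can misfire on legitimate text that happens to contain "u" followed by four hex characters, and it does not combine surrogate pairs, so a real cleanup would likely need a stricter heuristic or a manual review step.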