Commit 9e3d2cb

Apply editorial suggestions

Co-authored-by: Madison Swain-Bowden <bowdenm@spu.edu>
Co-authored-by: Olga Bulat <obulat@gmail.com>

- Fix and add links
- Add suggested extra issue and adjust the Expected Outcomes
- Include smart_open in the Tools & packages section
- Apply editorial suggestions

1 parent e39a2cf

1 file changed: +81 −59 lines

documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
@@ -19,122 +19,140 @@ the project are clear, as defined in the project thread. In doubt, check the

## Overview

This document describes a mechanism for rectifying incorrect data in the catalog
database (DB) that currently has to be cleaned up every time a data refresh is
run. This one-time fix is an effort to avoid wasting resources and data refresh
runtime.

## Background

One of the steps of the [data refresh process for images][img-data-refresh] is
cleaning the data that is not fit for production. This process is triggered
weekly by an Airflow DAG, which then runs in the Ingestion Server, taking
just over **20 hours** to complete, according to an inspection of recent
executions as of the time of drafting this document. The cleaned data is only
saved to the API database, which is replaced each time during the same data
refresh, meaning this process has to be repeated each time to make the _same_
corrections.

This cleaning process was designed this way to optimize writes to the API
database, since the most important factor was to provide the correct data to
users via the API. Most of the rows affected were added prior to the creation of
the `MediaStore` class in the Catalog (possibly by the discontinued CommonCrawl
ingestion), which is nowadays responsible for validating the provider data prior
to upserting the records into the upstream database. However, the current
approach wastes resources both in time, which continues to increase, and in the
machines (CPU) it uses, which could easily be avoided by making the changes
permanent, saving them in the upstream database.

[img-data-refresh]: ./../../../catalog/reference/DAGs.md#image_data_refresh

## Expected Outcomes

<!-- List any succinct expected products from this implementation plan. -->

- The catalog database (upstream) contains the cleaned data outputs of the
  current Ingestion Server's cleaning steps
- The image Data Refresh process is simplified by significantly reducing
  cleaning times.

<!-- removing the cleaning steps from the Ingestion Server. -->

## Step-by-step plan

The cleaning functions that the Ingestion Server applies (see the
[cleanup][ing_server_cleanup] file) are already implemented in the Catalog in
the `MediaStore` class: see its [`_tag_blacklisted` method][tag_blacklisted]
(which probably should be renamed) and the [url utilities][url_utils] file. The
only part that is not there and can't be ported is the filtering of
low-confidence tags, since provider scripts don't save an "accuracy" by tag (a
sketch of this kind of filter follows the step list below).

With this, the plan starts in the Ingestion Server with the following steps:

1. [Save TSV files of cleaned data to AWS S3](#save-tsv-files-of-cleaned-data-to-aws-s3)
1. [Make and run a batched update DAG for one-time cleanup](#make-and-run-a-batched-update-dag-for-one-time-cleanups)
1. [Run an image Data Refresh to confirm cleaning time is reduced](#run-an-image-data-refresh-to-confirm-cleaning-time-is-reduced)

[ing_server_cleanup]:
  https://github.com/WordPress/openverse/blob/f8971fdbea36fe0eaf5b7d022b56e4edfc03bebd/ingestion_server/ingestion_server/cleanup.py#L79-L168
[tag_blacklisted]:
  https://github.com/WordPress/openverse/blob/f8971fdbea36fe0eaf5b7d022b56e4edfc03bebd/catalog/dags/common/storage/media.py#L245-L259
[url_utils]:
  https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/catalog/dags/common/urls.py

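For context, the kind of low-confidence tag filtering mentioned above could look
roughly like the minimal sketch below. The threshold value, function name, and
tag shape are illustrative assumptions, not the exact Ingestion Server code.

```python
# Hypothetical sketch of low-confidence tag filtering; the threshold, function
# name, and tag structure are assumptions for illustration only.
LOW_CONFIDENCE_THRESHOLD = 0.9


def filter_low_confidence_tags(tags: list[dict]) -> list[dict]:
    """Drop machine-generated tags whose accuracy is below the threshold.

    Tags without an "accuracy" value (e.g. creator-supplied tags) are kept,
    since there is nothing to compare against.
    """
    return [
        tag
        for tag in tags
        if tag.get("accuracy") is None
        or tag["accuracy"] >= LOW_CONFIDENCE_THRESHOLD
    ]
```
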
## Step details

### Save TSV files of cleaned data to AWS S3

In a previous exploration, the Ingestion Server was set to [store TSV files of
the cleaned data][pr-saving-tsv] in the form of `<identifier> <cleaned_field>`,
which can be used later to perform the updates efficiently in the catalog DB,
which only had indexes for the `identifier` field. These files are saved to the
disk of the Ingestion Server EC2 instances, and worked fine for files with URL
corrections since this type of field is relatively short, but became a problem
when trying to save tags, as the file grew too large and filled up the disk,
causing issues for the data refresh execution.

[aws_mpu]:
  https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html

To put some numbers on the problem we are dealing with, the following table
shows the number of records cleaned by field for the last runs at the moment of
writing this IP, except for tags, for which we don't have accurate figures since
file saving was disabled.

| timestamp (UTC)     | 'url' | 'creator_url' | 'foreign_landing_url' | 'tags' |
| ------------------- | :---: | :-----------: | :-------------------: | :----: |
| 2024-02-27 04:05:26 | 22156 |    9035458    |        8809213        |   0    |
| 2024-02-20 04:06:56 | 22157 |    9035456    |        8809209        |   0    |
| 2024-02-13 04:41:22 | 22155 |    9035451    |        8809204        |   0    |

The alternative is to upload TSV files to the Amazon Simple Storage Service
(S3), creating a new bucket or using a subfolder within `openverse-catalog`. The
benefit of using S3 buckets is that they have streaming capabilities and will
allow us to read the files in chunks later if necessary for performance. The
downside is that objects in S3 don't allow appending natively, so it may require
uploading files with different part numbers or evaluating whether the
[multipart upload process][aws_mpu] or, more easily, the
[`smart_open`][smart_open] package could serve us here.

[smart_open]: https://github.com/piskvorky/smart_open

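To make the `smart_open` option concrete, below is a minimal sketch of how the
Ingestion Server could stream a `<identifier>\t<cleaned_field>` TSV directly to
S3. The bucket name, key layout, and the `cleaned_rows` iterable are assumptions
for illustration; `smart_open` performs an S3 multipart upload under the hood
for streaming writes, so the file never needs to fit on the instance's disk.

```python
# Minimal sketch; the bucket, key layout, and `cleaned_rows` are placeholders.
from typing import Iterable

import boto3
from smart_open import open as smart_open_open


def upload_cleaned_tsv(
    cleaned_rows: Iterable[tuple[str, str]],
    field_name: str,
    bucket: str = "openverse-catalog",
) -> None:
    """Stream (identifier, cleaned_value) pairs to a TSV object in S3."""
    client = boto3.client("s3")
    key = f"data-refresh-cleaned-data/{field_name}.tsv"
    # smart_open buffers and issues a multipart upload under the hood, so rows
    # are written as they are produced instead of being saved to disk first.
    with smart_open_open(
        f"s3://{bucket}/{key}", "w", transport_params={"client": client}
    ) as fout:
        for identifier, cleaned_value in cleaned_rows:
            fout.write(f"{identifier}\t{cleaned_value}\n")
```

The same files can later be read back in chunks with `smart_open` from the DAG
that applies the updates.
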
### Make and run a batched update DAG for one-time cleanups

A batched catalog cleaner DAG (or potentially a `batched_update_from_file`)
should take the files of the previous step to perform a batched update on the
catalog's image table, while handling deadlocking and timeout concerns, similar
to the [batched_update][batched_update]. This table is constantly in use by
other DAGs, such as those for provider ingestion or the data refresh process,
and ideally can't be blocked by any single DAG.

[batched_update]: ./../../../catalog/reference/DAGs.md#batched_update

A [proof of concept PR](https://github.com/WordPress/openverse/pull/3601)
consisted of uploading each file to temporary `UNLOGGED` DB tables (which
provide huge gains in write performance, while their disadvantages are not
relevant to us since they won't be permanent) and including a `row_id` serial
number used later to query them in batches. The following must also be included
(a sketch of the resulting queries follows this list):

- An index on the `identifier` column of the temporary table, added after
  filling it up, to improve query performance
- An adaptation to handle the column type of tags (`jsonb`) and to modify the
  `metadata`
- A DAG task reporting the number of rows affected per column to Slack

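As referenced in the list above, here is a rough sketch of the kind of SQL the
one-time cleanup DAG could run. The temporary table name, batch size, and the
column being updated (`url`) are illustrative assumptions; the real DAG would
wrap these statements in Airflow tasks with the same deadlock and timeout
handling as `batched_update`, and for tags the `SET` clause would need a `jsonb`
cast (e.g. `tags = tmp.tags::jsonb`).

```python
# Hypothetical sketch; `temp_cleaned_url`, the batch size, and the updated
# column are placeholders for illustration.
CREATE_TEMP_TABLE = """
    CREATE UNLOGGED TABLE temp_cleaned_url (
        row_id SERIAL,
        identifier uuid,
        url TEXT
    );
"""

# Added after the TSV from S3 has been COPYed in, so the per-batch join
# against the image table is fast.
CREATE_INDEX = "CREATE INDEX ON temp_cleaned_url (identifier);"

UPDATE_BATCH = """
    UPDATE image
    SET url = tmp.url
    FROM temp_cleaned_url AS tmp
    WHERE image.identifier = tmp.identifier
      AND tmp.row_id > %(batch_start)s
      AND tmp.row_id <= %(batch_end)s;
"""


def run_batched_update(cursor, total_rows: int, batch_size: int = 10_000) -> None:
    """Apply the cleaned values in small batches keyed on row_id."""
    for batch_start in range(0, total_rows, batch_size):
        cursor.execute(
            UPDATE_BATCH,
            {"batch_start": batch_start, "batch_end": batch_start + batch_size},
        )
        # Committing per batch keeps locks short-lived, so provider DAGs and
        # the data refresh are not blocked for long stretches.
        cursor.connection.commit()
```
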
### Run an image data refresh to confirm cleaning time is reduced

Finally, after the previous steps are done, running a data refresh will confirm
there are no more updates applied at ingestion. If the time isn't significantly
reduced, it will be necessary to check what was missing in the previous steps.
Looking at the files generated in
[step 1](#save-tsv-files-of-cleaned-data-to-aws-s3) may yield clues.

If it is confirmed that the time is reduced to zero, the cleaning steps can
optionally be removed, or left in place in case we want to perform a similar
cleaning effort later, e.g. see the
[Other projects or work](#other-projects-or-work) section.

## Dependencies

@@ -143,10 +161,11 @@ later.

No changes needed. The Ingestion Server already has the credentials required to
[connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).

### Tools & packages

<!-- Describe any tools or packages which this work might be dependent on. If multiple options are available, try to list as many as are reasonable with your own recommendation. -->

Requires installing and becoming familiar with the [smart_open][smart_open]
package.

### Other projects or work

@@ -157,6 +176,8 @@ related issues are:

- [Some images have duplicate incorrectly decoded unicode tags #1303](https://github.com/WordPress/openverse/issues/1303)
- [Provider scripts may include html tags in record titles #1441](https://github.com/WordPress/openverse/issues/1441)
- [Fix Wikimedia image titles #1728](https://github.com/WordPress/openverse/issues/1728)
- [Add filetype to all images in the catalog DB #1560](https://github.com/WordPress/openverse/issues/1560),
  for when the file type can be derived from the URL.

This will also open up space for more structural changes to the Openverse DB
schemas in a [second phase](https://github.com/WordPress/openverse/issues/244)

@@ -192,7 +213,8 @@ What risks are we taking with this solution? Are there risks that once taken can

- Previous attempt from cc-archive: [Clean preexisting data using ImageStore
  #517][mathemancer_pr]
- @obulat's PR to [add logging and save cleaned up data in the Ingestion
  Server][pr-saving-tsv]

[pr-saving-tsv]: https://github.com/WordPress/openverse/pull/904
[mathemancer_pr]: https://github.com/cc-archive/cccatalog/pull/517
