Co-authored-by: Madison Swain-Bowden <bowdenm@spu.edu>
Co-authored-by: Olga Bulat <obulat@gmail.com>
Fix and add links
Add suggested extra issue and adjust the Expected Outcomes
Include smart_open in the Tools & packages section
Apply editorial suggestions
 A [proof of concept PR](https://github.com/WordPress/openverse/pull/3601)
 consisted of uploading each file to temporary `UNLOGGED` DB tables (which
 provides huge gains in writing performance while their disadvantages are not
-relevant since they won't be permanent), and including a `row_id` serial number
-used later to query it in batches. Adding an index in this last column after
-filling up the table could improve the query performance. An adaptation will be
-needed to handle the column type of tags (`jsonb`).
+relevant to us, since they won't be permanent), and include a `row_id` serial number
+used later to query it in batches. The following must be included:
+
+- Add an index for the `identifier` column in the temporary table after filling
+  it up, to improve the query performance
+- An adaptation to handle the column type of tags (`jsonb`) and modify the
+  `metadata`
+- Include a DAG task for reporting the number of rows affected by column to
+  Slack
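
The temporary-table approach described in this hunk could be sketched roughly as follows. This is a minimal illustration, not the code from the proof of concept PR: the `image` target table, the `temp_cleaned_<column>` naming, the `uuid` type of `identifier`, and every function name below are assumptions made for the example. The functions only build SQL strings and batch boundaries, so they can be inspected without a database.

```python
from typing import Iterator


def temp_table_name(column: str) -> str:
    """Hypothetical naming scheme: one temporary table per cleaned column."""
    return f"temp_cleaned_{column}"


def create_table_sql(column: str) -> str:
    """UNLOGGED skips the write-ahead log, trading crash safety for write speed."""
    # Tags are stored as jsonb; the other cleaned columns are assumed to be text.
    value_type = "jsonb" if column == "tags" else "text"
    return (
        f"CREATE UNLOGGED TABLE {temp_table_name(column)} "
        f"(row_id SERIAL, identifier uuid NOT NULL, {column} {value_type});"
    )


def create_index_sql(column: str) -> str:
    """Run this only after the table is filled, so the bulk load stays fast."""
    return f"CREATE INDEX ON {temp_table_name(column)} (identifier);"


def batch_ranges(total_rows: int, batch_size: int) -> Iterator[tuple[int, int]]:
    """Yield inclusive (start, end) row_id ranges; SERIAL values start at 1."""
    for start in range(1, total_rows + 1, batch_size):
        yield start, min(start + batch_size - 1, total_rows)


def update_batch_sql(column: str, start: int, end: int) -> str:
    """Copy one batch of cleaned values into the (assumed) `image` table."""
    tmp = temp_table_name(column)
    return (
        f"UPDATE image SET {column} = tmp.{column} "
        f"FROM {tmp} AS tmp "
        f"WHERE image.identifier = tmp.identifier "
        f"AND tmp.row_id BETWEEN {start} AND {end};"
    )
```

For instance, `batch_ranges(25, 10)` yields `(1, 10)`, `(11, 20)` and `(21, 25)`, and each pair can be passed to `update_batch_sql`. The number of rows affected by each `UPDATE` (as reported by the database driver) could then be summed per column for the Slack report.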

 ### Run an image data refresh to confirm cleaning time is reduced

 Finally, after the previous steps are done, running a data refresh will confirm
 there are no more updates applied at ingestion. If time isn't significantly
 reduced then it will be necessary to check what was missing in the previous
-steps.
+steps. Looking at the files generated in
+[step 1](#save-tsv-files-of-cleaned-data-to-aws-s3) may yield clues.

 If it is confirmed that the time is reduced to zero, the cleaning steps can
 optionally be removed, or left in place in case we want to perform a similar cleaning effort
-later.
+later, e.g. see the [Other projects or work](#other-projects-or-work) section.

 ## Dependencies

@@ -143,10 +161,11 @@ later.
 No changes needed. The Ingestion Server already has the credentials required to
 [connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).

-<!--
 ### Tools & packages

-Describe any tools or packages which this work might be dependent on. If multiple options are available, try to list as many as are reasonable with your own recommendation. -->
+<!-- Describe any tools or packages which this work might be dependent on. If multiple options are available, try to list as many as are reasonable with your own recommendation. -->
+
+Requires installing the [smart_open][smart_open] utility and becoming familiar with it.

 ### Other projects or work

@@ -157,6 +176,8 @@ related issues are:
 - [Some images have duplicate incorrectly decoded unicode tags #1303](https://github.com/WordPress/openverse/issues/1303)
 - [Provider scripts may include html tags in record titles #1441](https://github.com/WordPress/openverse/issues/1441)