Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize data models #244

Open
4 of 29 tasks
obulat opened this issue May 20, 2022 · 1 comment
Open
4 of 29 tasks

Normalize data models #244

obulat opened this issue May 20, 2022 · 1 comment
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: frontend Related to the Nuxt frontend

Comments

@obulat
Copy link
Contributor

obulat commented May 20, 2022

Problem

This is a meta issue to track all the data model normalization work across all the repositories.
All open issues from this meta issue. You can also track the progress using the GitHub Project view.
Some data we have in the database was ingested a long time ago when we had a different set of required fields. This makes consuming the data difficult because the pieces that are set as required can be unavailable in the database.
We need to make sure that we have up-to-date data models across the stack, and that our data in the database confirms to them.

Description

To establish trust in our data, we need to make sure that we clearly describe what data we have, and to check that the database actually has all the data outlined. Also, we should remove the duplication of data classification/data cleaning between the Catalog and the API layers.

Here are the specific fields we should normalize:

All media

These fields are common for all media, however some fields only have NULL values in images, not in audio.

URL

License URL

Watermarked

Last synced with source

Mature (new column)

Description (new column)

Image

Thumbnail

Filetype

563 004 660 images

Category

563 622 992 images

Width & height

12 571 694 images

Filesize

561 894 897 images

Tags


Improvements


More investigation needed

Additional context

Updates that can be done with existing data:

  • tags
  • license_url
  • watermarked
  • last_synced_with

Updates that will require additional fetching from providers:

  • filetype
  • filesize
  • width
  • height

Message from @AetherUnbound with details from the database (from the Public Slack discussion):
Here are a count of NULL values for all fields that don't have a NOT NULL constraint. Unfortunately this doesn't give us information on license_url, if that's supposed to come from the meta_data field

deploy@localhost:openledger> SELECT
 COUNT(*) as total,
 COUNT(*) FILTER (WHERE ingestion_type IS NULL) as ingestion_type,
 COUNT(*) FILTER (WHERE provider IS NULL) as provider,
 COUNT(*) FILTER (WHERE source IS NULL) as source,
 COUNT(*) FILTER (WHERE foreign_identifier IS NULL) as foreign_identifier,
 COUNT(*) FILTER (WHERE foreign_landing_url IS NULL) as foreign_landing_url,
 COUNT(*) FILTER (WHERE thumbnail IS NULL) as thumbnail,
 COUNT(*) FILTER (WHERE filetype IS NULL) as filetype,
 COUNT(*) FILTER (WHERE duration IS NULL) as duration,
 COUNT(*) FILTER (WHERE bit_rate IS NULL) as bit_rate,
 COUNT(*) FILTER (WHERE sample_rate IS NULL) as sample_rate,
 COUNT(*) FILTER (WHERE category IS NULL) as category,
 COUNT(*) FILTER (WHERE genres IS NULL) as genres,
 COUNT(*) FILTER (WHERE audio_set IS NULL) as audio_set,
 COUNT(*) FILTER (WHERE set_position IS NULL) as set_position,
 COUNT(*) FILTER (WHERE alt_files IS NULL) as alt_files,
 COUNT(*) FILTER (WHERE filesize IS NULL) as filesize,
 COUNT(*) FILTER (WHERE license_version IS NULL) as license_version,
 COUNT(*) FILTER (WHERE creator IS NULL) as creator,
 COUNT(*) FILTER (WHERE creator_url IS NULL) as creator_url,
 COUNT(*) FILTER (WHERE title IS NULL) as title,
 COUNT(*) FILTER (WHERE meta_data IS NULL) as meta_data,
 COUNT(*) FILTER (WHERE tags IS NULL) as tags,
 COUNT(*) FILTER (WHERE watermarked IS NULL) as watermarked,
 COUNT(*) FILTER (WHERE last_synced_with_source IS NULL) as last_synced_with_source
 FROM audio;
-[ RECORD 1 ]-------------------------
total                   | 175858
ingestion_type          | 0
provider                | 0
source                  | 0
foreign_identifier      | 0
foreign_landing_url     | 0
thumbnail               | 86720
filetype                | 0
duration                | 0
bit_rate                | 89223
sample_rate             | 149241
category                | 13844
genres                  | 86720
audio_set               | 34914
set_position            | 86720
alt_files               | 115789
filesize                | 89138
license_version         | 0
creator                 | 10
creator_url             | 118
title                   | 0
meta_data               | 0
tags                    | 30092
watermarked             | 0
last_synced_with_source | 0
SELECT 1
Time: 0.149s


deploy@localhost:openledger> SELECT
 COUNT(*) as total,
 COUNT(*) FILTER (WHERE ingestion_type IS NULL) as ingestion_type,
 COUNT(*) FILTER (WHERE provider IS NULL) as provider,
 COUNT(*) FILTER (WHERE source IS NULL) as source,
 COUNT(*) FILTER (WHERE foreign_identifier IS NULL) as foreign_identifier,
 COUNT(*) FILTER (WHERE foreign_landing_url IS NULL) as foreign_landing_url,
 COUNT(*) FILTER (WHERE thumbnail IS NULL) as thumbnail,
 COUNT(*) FILTER (WHERE width IS NULL) as width,
 COUNT(*) FILTER (WHERE height IS NULL) as height,
 COUNT(*) FILTER (WHERE filesize IS NULL) as filesize,
 COUNT(*) FILTER (WHERE license_version IS NULL) as license_version,
 COUNT(*) FILTER (WHERE creator IS NULL) as creator,
 COUNT(*) FILTER (WHERE creator_url IS NULL) as creator_url,
 COUNT(*) FILTER (WHERE title IS NULL) as title,
 COUNT(*) FILTER (WHERE meta_data IS NULL) as meta_data,
 COUNT(*) FILTER (WHERE tags IS NULL) as tags,
 COUNT(*) FILTER (WHERE watermarked IS NULL) as watermarked,
 COUNT(*) FILTER (WHERE last_synced_with_source IS NULL) as last_synced_with_source,
 COUNT(*) FILTER (WHERE filetype IS NULL) as filetype,
 COUNT(*) FILTER (WHERE category IS NULL) as category
 FROM image;
-[ RECORD 1 ]-------------------------
total                   | 563667181
ingestion_type          | 0
provider                | 0
source                  | 0
foreign_identifier      | 0
foreign_landing_url     | 1
thumbnail               | 57584529
width                   | 12571694
height                  | 12571694
filesize                | 561894897
license_version         | 0
creator                 | 4459805
creator_url             | 22751618
title                   | 1096025
meta_data               | 366974
tags                    | 243751835
watermarked             | 1105608
last_synced_with_source | 554237
filetype                | 563004660
category                | 563622992
SELECT 1
Time: 2480.183s (41 minutes 20 seconds), executed in: 2480.182s (41 minutes 20 seconds)
@obulat obulat added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature labels May 20, 2022
@obulat obulat added the 💻 aspect: code Concerns the software code in the repository label May 20, 2022
@obulat obulat added data normalization 🟨 priority: medium Not blocking but should be addressed soon and removed 🟧 priority: high Stalls work on the project or its dependents labels May 20, 2022
@AetherUnbound
Copy link
Collaborator

A note on data refreshes & normalization that @obulat brought up: We should continue performing full data refreshes in dev until we are confident in our data normalization. Until we get everything normalized, we may continue to find issues in production that can't be replicated in staging unless we refresh the catalog in its entirety.

I've also made https://github.com/WordPress/openverse-infrastructure/issues/120 to track this

@obulat obulat moved this to In Progress in Openverse Data Normalization Jun 15, 2022
@obulat obulat self-assigned this Aug 24, 2022
This was referenced Feb 16, 2023
dhruvkb pushed a commit that referenced this issue Feb 20, 2023
Make most of the Vuex store modules namespaced
dhruvkb added a commit that referenced this issue Feb 20, 2023
Organise and document `justfile`
@obulat obulat added the 🧱 stack: frontend Related to the Nuxt frontend label Feb 22, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Feb 23, 2023
@dhruvkb dhruvkb added this to the Data normalization milestone Dec 2, 2023
@obulat obulat removed their assignment Mar 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: frontend Related to the Nuxt frontend
Projects
Status: 📋 Backlog
Status: In Progress
Development

No branches or pull requests

3 participants