Replace materialized views with incremental roll ups#45
Replace materialized views with incremental roll ups#45ramyaragupathy wants to merge 4 commits intoAmericanRedCross:masterfrom
Conversation
| @@ -1,5 +1,4 @@ | |||
| -- view with a schema that matches the legacy changesets table | |||
| CREATE VIEW changesets AS | |||
| CREATE TABLE changesets AS | |||
There was a problem hiding this comment.
Why this change? As-is, this will duplicate data in raw_changesets.
The historical reason for raw_changesets (table) vs changesets (view) was that I needed new table names for replacement data when I was updating v1. The API (etc.) assume the existence of relations (tables or views) without the raw_ prefix, so creating views was the easiest way to preserve compatibility.
raw_* should probably go away in favor of tables without the prefix.
There was a problem hiding this comment.
Are you planning on updating the parts of this project that write raw changeset stats to write to these tables (retiring raw_changesets)? If so, the API project may also need to be updated to refer to the correct tables.
src/housekeeping.js
Outdated
| JOIN raw_changesets c ON c.id = ch.changeset_id | ||
| JOIN raw_hashtags h ON h.id = ch.hashtag_id | ||
| GROUP BY hashtag | ||
| ON CONFLICT do nothing;` |
There was a problem hiding this comment.
I see where you're going with this..nice.
The key to the incremental updates is the ON CONFLICT clause as well as tracking which augmented diffs / changesets have already had their stats contributed.
| -- view with a schema that matches the legacy changesets table | ||
| CREATE VIEW changesets AS | ||
| CREATE TABLE changesets AS | ||
| SELECT |
There was a problem hiding this comment.
Rather than SELECTing (and subsequently truncating, if all you wanted was the structure), CREATE TABLE changesets LIKE raw_changesets INCLUDING ALL will replicate the structure and indices without any of the data.
There was a problem hiding this comment.
(Though in this case, some of the column names changed, so yeah... ;-)
| @@ -1,5 +1,4 @@ | |||
| -- view with a schema that matches the legacy changesets table | |||
| CREATE VIEW changesets AS | |||
| CREATE TABLE changesets AS | |||
There was a problem hiding this comment.
Are you planning on updating the parts of this project that write raw changeset stats to write to these tables (retiring raw_changesets)? If so, the API project may also need to be updated to refer to the correct tables.
| JOIN raw_changesets c ON c.id = ch.changeset_id | ||
| JOIN raw_hashtags h ON h.id = ch.hashtag_id | ||
| GROUP BY hashtag | ||
| ON CONFLICT do update;` |
There was a problem hiding this comment.
This looks incomplete.
The other aggregated tables also need to be updated incrementally.
More importantly, this doesn't actually update incrementally; it's effectively the same as what REFRESH MATERIALIZED VIEW does, rewriting each of the rows in the aggregated table (it'll take just as long on a fully-populated table). Instead, it should detect which changesets rows changed since the last run and INSERT new aggregated values / UPDATE existing aggregated values by adding to them / updating counts / updating max values.
From AmericanRedCross/osm-stats#56, started with replacing the materialized views approach. Here's what has been done so far:
On running housekeeping script with this modification I could see new updates in hashtag_stats
Next Actions:
cc @dakotabenjamin @smit1678