868 miovision find gaps bugs #869

Draft
wants to merge 4 commits into
base: master
from

Conversation

gabrielwol
Collaborator

@gabrielwol gabrielwol commented Feb 12, 2024

What this pull request accomplishes:

Fixes undesired behaviours in find_gaps:

  1. Remove the 15 minute lookback (23:45 to 00:00 + 1 day) previously used to identify small (10 to 15 minute) gaps that overlap midnight. This doesn't work well with our daily pipeline: if we identify a gap from 23:45 to 00:00, we don't follow up by re-aggregating the previous day, which leaves a discrepancy between unacceptable_gaps and volumes_15min_mvt. This was the source of error for 3 of the gaps identified in the original issue.
  2. Stop gaps from extending over multiple days (this happens when backfilling with a range longer than 1 day). Not technically wrong, but different in style from our daily airflow gaps. Solution: use generate_series instead of just values(start_date, end_date) so that gaps get broken up by day. To fix the existing records, we can just edit the existing multi-day gaps.
  3. Fix behaviour where gaps were not identified for days with no data, by changing from daily_intersections to a cross join against the full list of intersections. Issue: we currently have full intersection x days of just zeros in volumes_15min_mvt when they should be nulls.
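As a rough illustration of points 2 and 3 (not the actual find_gaps code), the sketch below splits a backfill range into one-day windows with generate_series and cross joins them against a full intersection list, so that an intersection with no data on a given day still produces a row. The CTE names `params`, `intersections` and `volumes`, the sample values, and the choice to treat `end_date` as exclusive are all assumptions made for the example.

```sql
-- Hypothetical inputs; the real function's parameters and tables may differ.
WITH params(start_date, end_date) AS (
    VALUES ('2024-01-01'::date, '2024-01-03'::date)
),

intersections(intersection_uid) AS (
    VALUES (1), (2)
),

-- Intersection 2 has no data on 2024-01-02.
volumes(intersection_uid, datetime_bin) AS (
    VALUES
        (1, '2024-01-01 08:00'::timestamp),
        (1, '2024-01-02 08:00'::timestamp),
        (2, '2024-01-01 08:00'::timestamp)
)

SELECT
    i.intersection_uid,
    days.day_start::date AS dt,
    COUNT(v.datetime_bin) AS n_records
FROM params
CROSS JOIN intersections AS i
-- One row per day in the range (end_date treated as exclusive here).
CROSS JOIN LATERAL generate_series(
    params.start_date::timestamp,
    params.end_date::timestamp - interval '1 day',
    interval '1 day'
) AS days(day_start)
-- LEFT JOIN keeps intersection x day combinations that have no volume rows.
LEFT JOIN volumes AS v
    ON v.intersection_uid = i.intersection_uid
    AND v.datetime_bin >= days.day_start
    AND v.datetime_bin < days.day_start + interval '1 day'
GROUP BY i.intersection_uid, days.day_start
ORDER BY i.intersection_uid, dt;
```

With these sample values the query returns one row per intersection per day, and intersection 2 on 2024-01-02 appears with n_records = 0. That is the kind of full-day outage which an inner join against only the intersections reporting data that day would miss.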

Issue(s) this solves:

What, in particular, needs to be reviewed:

What needs to be done by a sysadmin after this PR is merged

  • There are a bunch of cases we will need to re-run aggregation for. I think we can do all of these more surgically instead of updating all of unacceptable_gaps, which is very slow.

  • Case 1: '15 minute lookback'. 60 gaps need their aggregation re-run to remove the midnight overlap. Based on my testing (7s to aggregate 1 day for 1 intersection), it should be fast enough to re-run all of these:

SELECT '--agg --intersection ' || intersection_uid || ' --start_date ' || datetime_bin::date || ' --end_date ' || (datetime_bin::date + interval '2 day')::date
FROM miovision_api.unacceptable_gaps AS un
WHERE gap_start < dt
AND datetime_bin = gap_start
  • Case 2: 'multi day gaps'. 60 gaps affected. Split these up by day. volumes_15min_mvt is unaffected.
SELECT DISTINCT dt, intersection_uid, gap_start, gap_end FROM miovision_api.unacceptable_gaps WHERE gap_end > gap_start + interval '1 day'

Update: ✅

--split old gaps longer than 1 day into max 1 day.
UPDATE miovision_api.unacceptable_gaps AS old_
SET dt = new_.datetime_bin::date,
    gap_start = date_trunc('day', new_.datetime_bin),
    gap_end = date_trunc('day', new_.datetime_bin) + interval '1 day' - interval '1 minute'
FROM miovision_api.unacceptable_gaps AS new_
WHERE
    new_.gap_end > new_.gap_start + interval '1 day'
    AND new_.intersection_uid = old_.intersection_uid
    AND new_.datetime_bin = old_.datetime_bin
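Before running an update like the one above, its effect can be previewed as a plain SELECT. The sketch below applies the same expressions to a few made-up rows; `sample_gaps` and its values are hypothetical stand-ins for miovision_api.unacceptable_gaps.

```sql
-- Hypothetical sample rows: one 32-hour gap (two bins shown) and one short gap.
WITH sample_gaps(intersection_uid, datetime_bin, gap_start, gap_end) AS (
    VALUES
        (1, '2024-01-01 22:00'::timestamp, '2024-01-01 22:00'::timestamp, '2024-01-03 06:00'::timestamp),
        (1, '2024-01-02 12:00'::timestamp, '2024-01-01 22:00'::timestamp, '2024-01-03 06:00'::timestamp),
        (2, '2024-01-05 09:00'::timestamp, '2024-01-05 09:00'::timestamp, '2024-01-05 09:30'::timestamp)
)

SELECT
    intersection_uid,
    datetime_bin,
    gap_start AS old_gap_start,
    gap_end AS old_gap_end,
    -- Same expressions as the SET clause of the UPDATE.
    datetime_bin::date AS new_dt,
    date_trunc('day', datetime_bin) AS new_gap_start,
    date_trunc('day', datetime_bin) + interval '1 day' - interval '1 minute' AS new_gap_end
FROM sample_gaps
-- Same filter as the UPDATE: only gaps longer than one day are rewritten.
WHERE gap_end > gap_start + interval '1 day'
ORDER BY intersection_uid, datetime_bin;
```

Only the two rows for intersection 1 are selected; the 30-minute gap at intersection 2 is left alone. Each selected row is clipped to the calendar day of its own datetime_bin, so in this sample the first day's gap_start becomes midnight rather than the original 22:00.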
  • Case 3: There are about 8M records in volumes_15min_mvt corresponding to full-day outages where we should have nulls instead of zeros. Set both v15 tables to null and add the days to unacceptable_gaps. volumes_daily is unaffected because it sums from the volumes table directly.
  • We will want to run this update again for the 2024 partitions only, because the old find_gaps has continued to run for a few days since we ran the update below.
--sum = 7905762
SELECT SUM(count) FROM (
    SELECT intersection_uid, datetime_bin::date, COUNT(*)
    FROM miovision_api.volumes_15min_mvt
    GROUP BY 1, 2
    HAVING SUM(volume) = 0
) AS agg

Updated ✅ UPDATE 5875075:

--update to nulls and add unacceptable_gaps
WITH zero_days AS (
    SELECT intersection_uid, datetime_bin::date AS dt, COUNT(*)
    FROM miovision_api.volumes_15min_mvt
    GROUP BY 1, 2
    HAVING SUM(volume) = 0
),

add_gaps AS (
    INSERT INTO miovision_api.unacceptable_gaps (intersection_uid, dt, gap_start, gap_end, datetime_bin, gap_minutes_total, gap_minutes_15min)
    SELECT
        intersection_uid,
        dt,
        dt::timestamp AS gap_start,
        dt + interval '1 day' - interval '1 minute' AS gap_end,
        bins.datetime_bin, 
        24 * 60 - 1 AS gap_minutes_total,
        15 AS gap_minutes_15min
    FROM zero_days, 
    generate_series(
            dt,
            dt + interval '1 day' - interval '15 minutes',
            interval '15 minutes'
    ) AS bins(datetime_bin)
    --there are a few which already exist in unacceptable_gaps but aren't properly reflected as nulls in v15
    ON CONFLICT (intersection_uid, datetime_bin) DO NOTHING
),

update_v15_mvt AS (
    UPDATE miovision_api.volumes_15min_mvt AS v15
    SET volume = null
    FROM zero_days
    WHERE
        v15.intersection_uid = zero_days.intersection_uid
        AND v15.datetime_bin >= zero_days.dt
        AND v15.datetime_bin < zero_days.dt + interval '1 day'
)

UPDATE miovision_api.volumes_15min AS v15
SET volume = null
FROM zero_days
WHERE
    v15.intersection_uid = zero_days.intersection_uid
    AND v15.datetime_bin >= zero_days.dt
    AND v15.datetime_bin < zero_days.dt + interval '1 day'
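Because the update will be re-run for the 2024 partitions, it helps to know how the `HAVING SUM(volume) = 0` filter treats days that were already set to null. The sketch below runs the same filter over made-up rows; `sample_v15` and its values are hypothetical.

```sql
-- Hypothetical sample: three intersections on the same day.
WITH sample_v15(intersection_uid, datetime_bin, volume) AS (
    VALUES
        (1, '2024-01-01 00:00'::timestamp, 0),    -- all zeros: not yet fixed
        (1, '2024-01-01 00:15'::timestamp, 0),
        (2, '2024-01-01 00:00'::timestamp, NULL), -- all nulls: already fixed
        (2, '2024-01-01 00:15'::timestamp, NULL),
        (3, '2024-01-01 00:00'::timestamp, 0),    -- real data with one zero bin
        (3, '2024-01-01 00:15'::timestamp, 4)
)

SELECT
    intersection_uid,
    datetime_bin::date AS dt,
    COUNT(*) AS n_bins
FROM sample_v15
GROUP BY 1, 2
-- SUM over only-null values is NULL, and NULL = 0 is not true,
-- so days that were already nulled are not selected again.
HAVING SUM(volume) = 0;
```

Only intersection 1 is returned, which suggests that re-running the update should skip days it has already converted. One caveat: SUM ignores nulls, so a day holding a mix of zeros and nulls still sums to 0 and would be picked up.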

remove 15 minute buffer + change to adding artificial point at each day
@chmnata

This comment was marked as outdated.

@gabrielwol gabrielwol added this to the Miovision pipeline updates milestone Jul 4, 2024
@gabrielwol gabrielwol marked this pull request as draft August 1, 2024 18:03
@gabrielwol
Collaborator Author

Close #1024 before this PR.

Development

Successfully merging this pull request may close these issues.

Miovision: discrepencies between volumes_15min_mvt and unacceptable_gaps
2 participants