Job Scheduler: Performance Improvements #5204

hugoghx · 2024-10-25T13:41:42Z

This PR improves the performance of the job scheduler's database operations:

Remove job_run DB entries as soon as their jobs complete successfully.
Update the RunJobs query so it no longer uses the job_jobs_to_run view, whose definition was causing database contention on the job_run table.

internal/db/schema/migrations/oss/postgres/93/01_job_run_clean.up.sql

louisruch

Looks good, left a couple of comments.

internal/scheduler/job/repository_run_test.go

internal/scheduler/job/run.go

internal/scheduler/additional_verification_test.go

internal/db/schema/migrations/oss/postgres/93/01_job_run_clean.up.sql

louisruch · 2024-10-26T00:10:56Z

Mike merged a fairly extensive PR related to jobs so this is now also conflicting

hugoghx · 2024-11-04T13:58:29Z

> Mike merged a fairly extensive PR related to jobs so this is now also conflicting

Yep - I've resolved the conflicts locally and will update this PR once everyone has reviewed the fixup changes

louisruch

One style comment; otherwise this looks great.

internal/scheduler/job/query.go

Before this change, when a job run completed, the scheduler updated its job_run entry with the completed status and the number of objects affected by the run. A separate cleaner job was then responsible for deleting these completed entries from job_run.

This commit drops the cleaner job and rolls its functionality directly into the scheduler: when a job run completes, the scheduler now deletes its entry from the job_run table itself. In addition, completed has been removed as a valid status, since it is no longer set.

The reasoning is that, because the cleaner job was already deleting the data in these job_run entries, updating an entry's state on completion only to have it deleted is inefficient. Additionally, no use case for this data was surfaced.
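The behavior described in this commit message can be illustrated with a minimal sketch. It is not Boundary's implementation: the runStore type, the finishRun method, and errBoom are hypothetical names, and a Go map stands in for the job_run table. It assumes, based on the updated table comment in the schema diff, that failed runs keep their row while successful runs are deleted.

```go
package main

import (
	"errors"
	"fmt"
)

// Status mirrors the values still allowed by job_run_status_enm after this
// change; "completed" is no longer among them.
type Status string

const (
	Running     Status = "running"
	Failed      Status = "failed"
	Interrupted Status = "interrupted"
)

// errBoom is a hypothetical error used to simulate a failed run.
var errBoom = errors.New("boom")

// runStore is a hypothetical in-memory stand-in for the job_run table,
// keyed by run ID.
type runStore map[string]Status

// finishRun sketches the new completion path: a successful run is deleted
// outright instead of being updated to "completed" and cleaned up later.
// Keeping the row for a failed run is an assumption drawn from the table
// comment, not from the actual code.
func (s runStore) finishRun(id string, runErr error) {
	if runErr != nil {
		s[id] = Failed
		return
	}
	delete(s, id)
}

func main() {
	s := runStore{"run_1": Running, "run_2": Running}
	s.finishRun("run_1", nil)     // success: row removed
	s.finishRun("run_2", errBoom) // failure: row kept as "failed"
	fmt.Println(len(s), s["run_2"])
}
```

The point of the design is that a successful run now costs one delete, rather than an update followed by a later delete from a separate cleaner job.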

louisruch

LGTM, @tmessi should be back tomorrow so might be good to wait for him to 👍 since he did have some comments

tmessi

I think this looks good, can you move it out of draft so the schema-diff workflow runs? It would be nice to review the schema diff just to be sure.

hugoghx · 2024-11-06T15:08:29Z

> I think this looks good, can you move it out of draft so the schema-diff workflow runs? It would be nice to review the schema diff just to be sure.

Yep, done!

I will also add the complete query change to this PR, likely later today

This commit changes the logic for selecting jobs to run. Previously, Boundary used a view, "job_jobs_to_run", which collected all jobs that were scheduled to run and not already running. When jobs were requested to run, Boundary would query this view, take a certain number of jobs, and insert them into the "job_run" table. Importantly, while the number of jobs to schedule at once defaulted to 1 in the scheduler package, the controller package overrode it to -1 (no limit; run every job the query returns).

With this commit, Boundary now queries the "job" table directly for jobs to run, regardless of job run status. This improves DB performance, as the view (which queried all "job_run" rows) was causing significant DB contention. Additionally, the option to limit job runs has been removed, since it was already hardcoded to -1 and was not user-configurable anyway.

Should the need arise to implement this limit again, it will have to be done differently: the new query does not account for running jobs, so limiting the rows returned after jobs are already running would yield a no-op (the new query inserts nothing into "job_run" if a given job is already running).
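The selection logic described in this commit message can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the actual query: jobKey, job, demoJobs, and runJobs are hypothetical names, a slice stands in for the "job" table, and a map stands in for the running rows of "job_run". Jobs are keyed by plugin ID and name, as in the removed view.

```go
package main

import (
	"fmt"
	"time"
)

// jobKey identifies a job the way the removed view did: plugin ID plus name.
type jobKey struct {
	pluginID string
	name     string
}

// job is a hypothetical stand-in for a row of the "job" table.
type job struct {
	key              jobKey
	nextScheduledRun time.Time
}

// now is a fixed clock so the example is deterministic.
var now = time.Date(2024, 11, 6, 16, 30, 0, 0, time.UTC)

// demoJobs returns two due jobs and one job scheduled in the future.
func demoJobs() []job {
	return []job{
		{jobKey{"pl_1", "cleanup"}, now.Add(-time.Minute)},
		{jobKey{"pl_1", "refresh"}, now.Add(-time.Hour)},
		{jobKey{"pl_1", "rotate"}, now.Add(time.Hour)},
	}
}

// runJobs scans the jobs directly, with no limit, and starts every job that
// is due. Starting a job that already has a running row inserts nothing.
func runJobs(jobs []job, running map[jobKey]bool, at time.Time) []jobKey {
	var started []jobKey
	for _, j := range jobs {
		if j.nextScheduledRun.After(at) {
			continue // not due yet
		}
		if running[j.key] {
			continue // already running: the insert is skipped
		}
		running[j.key] = true
		started = append(started, j.key)
	}
	return started
}

func main() {
	// "refresh" is due but already running, so only "cleanup" is started.
	running := map[jobKey]bool{{"pl_1", "refresh"}: true}
	for _, k := range runJobs(demoJobs(), running, now) {
		fmt.Println("started", k.pluginID, k.name)
	}
}
```

The sketch also shows why a row limit would be awkward here: if the limit selected only jobs that are already running, the call would start nothing, which is the no-op the commit message warns about.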

hugoghx · 2024-11-06T16:30:14Z

@tmessi @louisruch Pushed up the changes to the job scheduling logic: f227020

github-actions · 2024-11-06T16:32:59Z

Database schema diff between main and hugo-jobrun-slowqueries @ f227020

To understand how these diffs are generated and some limitations see the
documentation of the script.

Functions

Unchanged

Tables

diff --git a/.schema-diff/tables_62c41191ffce6ab68fbed810559dd857fa8a2528/job_run.sql b/.schema-diff/tables_ad03e96532d85de63832a78057630a11189b99d2/job_run.sql
index 75158503f..17356053f 100644
--- a/.schema-diff/tables_62c41191ffce6ab68fbed810559dd857fa8a2528/job_run.sql
+++ b/.schema-diff/tables_ad03e96532d85de63832a78057630a11189b99d2/job_run.sql
@@ -47,7 +47,7 @@ create table public.job_run (
 -- name: table job_run; type: comment; schema: public; owner: -
 --
 
-comment on table public.job_run is 'job_run is a table where each row represents an instance of a job run that is either actively running or has already completed.';
+comment on table public.job_run is 'job_run is a table where each row represents an instance of a job run that is either actively running or has failed in some way.';
 
 
 --
diff --git a/.schema-diff/tables_62c41191ffce6ab68fbed810559dd857fa8a2528/job_run_status_enm.sql b/.schema-diff/tables_ad03e96532d85de63832a78057630a11189b99d2/job_run_status_enm.sql
index 613a5e666..49bc07ece 100644
--- a/.schema-diff/tables_62c41191ffce6ab68fbed810559dd857fa8a2528/job_run_status_enm.sql
+++ b/.schema-diff/tables_ad03e96532d85de63832a78057630a11189b99d2/job_run_status_enm.sql
@@ -26,7 +26,7 @@ set default_table_access_method = heap;
 
 create table public.job_run_status_enm (
     name text not null,
-    constraint only_predefined_job_status_allowed check ((name = any (array['running'::text, 'completed'::text, 'failed'::text, 'interrupted'::text])))
+    constraint only_predefined_job_status_allowed check ((name = any (array['running'::text, 'failed'::text, 'interrupted'::text])))
 );

Views

diff --git a/.schema-diff/views_62c41191ffce6ab68fbed810559dd857fa8a2528/job_jobs_to_run.sql b/.schema-diff/views_62c41191ffce6ab68fbed810559dd857fa8a2528/job_jobs_to_run.sql
deleted file mode 100644
index 56f6ed5e6..000000000
--- a/.schema-diff/views_62c41191ffce6ab68fbed810559dd857fa8a2528/job_jobs_to_run.sql
+++ /dev/null
@@ -1,47 +0,0 @@
---
--- postgresql database dump
---
-
--- dumped from database version 13.16
--- dumped by pg_dump version 14.13 (ubuntu 14.13-1.pgdg22.04+1)
-
-set statement_timeout = 0;
-set lock_timeout = 0;
-set idle_in_transaction_session_timeout = 0;
-set client_encoding = 'utf8';
-set standard_conforming_strings = on;
-select pg_catalog.set_config('search_path', '', false);
-set check_function_bodies = false;
-set xmloption = content;
-set client_min_messages = warning;
-set row_security = off;
-
---
--- name: job_jobs_to_run; type: view; schema: public; owner: -
---
-
-create view public.job_jobs_to_run as
- with running_jobs(job_plugin_id, job_name) as (
-         select job_run.job_plugin_id,
-            job_run.job_name
-           from public.job_run
-          where (job_run.status = 'running'::text)
-        ), final(job_plugin_id, job_name, next_scheduled_run) as (
-         select j.plugin_id,
-            j.name,
-            j.next_scheduled_run
-           from public.job j
-          where ((j.next_scheduled_run <= current_timestamp) and (not (exists ( select
-                   from running_jobs
-                  where (((running_jobs.job_plugin_id)::text = (j.plugin_id)::text) and ((running_jobs.job_name)::text = (j.name)::text))))))
-        )
- select final.job_plugin_id,
-    final.job_name,
-    final.next_scheduled_run
-   from final;
-
-
---
--- postgresql database dump complete
---
-

Triggers

Unchanged

Indexes

diff --git a/.schema-diff/indexes_62c41191ffce6ab68fbed810559dd857fa8a2528/job_run_status_ix.sql b/.schema-diff/indexes_ad03e96532d85de63832a78057630a11189b99d2/job_run_status_ix.sql
index 09ab0c10d..0ae4fc847 100644
--- a/.schema-diff/indexes_62c41191ffce6ab68fbed810559dd857fa8a2528/job_run_status_ix.sql
+++ b/.schema-diff/indexes_ad03e96532d85de63832a78057630a11189b99d2/job_run_status_ix.sql
@@ -29,7 +29,7 @@ create index job_run_status_ix on public.job_run using btree (status);
 -- name: index job_run_status_ix; type: comment; schema: public; owner: -
 --
 
-comment on index public.job_run_status_ix is 'the job_run_status_ix is used by the job run cleaner job';
+comment on index public.job_run_status_ix is 'the job_run_status_ix indexes the commonly-used status field';
 
 
 --

Constraints

Unchanged

Foreign Key Constraints

Unchanged

tmessi

Nice work!

hugoghx self-assigned this Oct 25, 2024

hugoghx requested review from louisruch and tmessi October 25, 2024 13:41

github-actions bot added core/db core core/proto core/sql core/daemon labels Oct 25, 2024

hugoghx added this to the 0.19.x milestone Oct 25, 2024

tmessi reviewed Oct 25, 2024

internal/db/schema/migrations/oss/postgres/93/01_job_run_clean.up.sql

louisruch requested changes Oct 25, 2024

hugoghx requested review from tmessi and louisruch November 4, 2024 13:50

hugoghx force-pushed the hugo-jobrun-slowqueries branch from 9028a1e to cbcd514 Compare November 5, 2024 12:43

louisruch reviewed Nov 5, 2024

internal/scheduler/job/query.go

hugoghx force-pushed the hugo-jobrun-slowqueries branch from 77bec14 to 27c432a Compare November 5, 2024 17:10

louisruch approved these changes Nov 6, 2024

tmessi reviewed Nov 6, 2024

hugoghx marked this pull request as ready for review November 6, 2024 15:08

hugoghx requested review from louisruch and tmessi November 6, 2024 16:30

tmessi approved these changes Nov 6, 2024

louisruch approved these changes Nov 6, 2024

hugoghx merged commit f227020 into main Nov 7, 2024
62 of 63 checks passed

hugoghx deleted the hugo-jobrun-slowqueries branch November 7, 2024 14:01
