Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Job Scheduler: Performance Improvements #5204

Merged
merged 2 commits into from
Nov 7, 2024
Merged

Conversation

hugoghx
Copy link
Collaborator

@hugoghx hugoghx commented Oct 25, 2024

This PR contains work pertaining to improving database operation performance on the job scheduler:

  1. Remove job_run db entries as soon as the jobs successfully complete;
  2. Update the RunJobs query to not use the job_jobs_to_run view as its definition was causing database contention in the job_run table

Copy link
Collaborator

@louisruch louisruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good left a couple comments

internal/scheduler/job/repository_run_test.go Show resolved Hide resolved
internal/scheduler/job/run.go Outdated Show resolved Hide resolved
internal/scheduler/additional_verification_test.go Outdated Show resolved Hide resolved
@louisruch
Copy link
Collaborator

Mike merged a fairly extensive PR related to jobs so this is now also conflicting

@hugoghx
Copy link
Collaborator Author

hugoghx commented Nov 4, 2024

Mike merged a fairly extensive PR related to jobs so this is now also conflicting

Yep - I've resolved the conflicts locally and will update this PR once everyone has reviewed the fixup changes

Copy link
Collaborator

@louisruch louisruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One style comment otherwise this looks great

internal/scheduler/job/query.go Outdated Show resolved Hide resolved
Before this change, upon the completion of a job run, the scheduler
would update a job_run entry with the completed status as well as the
number of objects affected by the run. Additionally, we then had a
cleaner job whose scope was deleting these completed entries from
job_run.

This commit changes this design to drop the cleaner job the and rolls
its functionality directly into the scheduler. Now, when any given job
run is complete, the scheduler will automatically delete it from the
job_run table itself.

In addition, the completed status has been removed as a valid status as
it is not set anymore.

This change follows a decision that given the data in these job_run
entries is already being deleted by the cleaner job, updating its state
when a job completes just to have it deleted is not efficient.
Additionally, no use-case for this data was surfaced.
Copy link
Collaborator

@louisruch louisruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @tmessi should be back tomorrow so might be good to wait for him to 👍 since he did have some comments

Copy link
Member

@tmessi tmessi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good, can you move it out of draft so the schema-diff workflow runs? It would be nice to review the schema diff just to be sure.

@hugoghx hugoghx marked this pull request as ready for review November 6, 2024 15:08
@hugoghx
Copy link
Collaborator Author

hugoghx commented Nov 6, 2024

I think this looks good, can you move it out of draft so the schema-diff workflow runs? It would be nice to review the schema diff just to be sure.

Yep, done!

I will also add the complete query change to this PR, likely later today

This comment has been minimized.

This commit changes the logic around selecting jobs to run. Before this,
Boundary used a view "job_jobs_to_run" which was a collation of all jobs
that were not running and scheduled to run. When jobs were requested to
run, Boundary would look at this view, grab a certain amount of jobs and
insert them into the "job_run" table. Importantly, while the amount of
jobs to be scheduled at once defaulted to 1 in the scheduler package,
this was entirely overriden in the controller package to -1 (no limit,
run all jobs that the query would return).

With this commit, Boundary now looks at the "job" table directly for
jobs to run, regardless of job run status. This was done to improve DB
performance as the view (which queried all "job_run" rows) was causing
significant DB contention.

Additionally, the option to limit job runs is now removed since it was
already hardcoded to -1 and not user-configurable anyways.

Should the need arise to implement this limit again, this will have to
be done in a different way since the new query does not account for
running jobs and limiting the rows returned after jobs are already
running would yield a no-op (the new query inserts nothing into
"job_run" if any given job is already running).
@hugoghx
Copy link
Collaborator Author

hugoghx commented Nov 6, 2024

@tmessi @louisruch Pushed up the changes to the job scheduling logic: f227020

Copy link

github-actions bot commented Nov 6, 2024

Database schema diff between main and hugo-jobrun-slowqueries @ f227020

To understand how these diffs are generated and some limitations see the
documentation of the script.

Functions

Unchanged

Tables

diff --git a/.schema-diff/tables_62c41191ffce6ab68fbed810559dd857fa8a2528/job_run.sql b/.schema-diff/tables_ad03e96532d85de63832a78057630a11189b99d2/job_run.sql
index 75158503f..17356053f 100644
--- a/.schema-diff/tables_62c41191ffce6ab68fbed810559dd857fa8a2528/job_run.sql
+++ b/.schema-diff/tables_ad03e96532d85de63832a78057630a11189b99d2/job_run.sql
@@ -47,7 +47,7 @@ create table public.job_run (
 -- name: table job_run; type: comment; schema: public; owner: -
 --
 
-comment on table public.job_run is 'job_run is a table where each row represents an instance of a job run that is either actively running or has already completed.';
+comment on table public.job_run is 'job_run is a table where each row represents an instance of a job run that is either actively running or has failed in some way.';
 
 
 --
diff --git a/.schema-diff/tables_62c41191ffce6ab68fbed810559dd857fa8a2528/job_run_status_enm.sql b/.schema-diff/tables_ad03e96532d85de63832a78057630a11189b99d2/job_run_status_enm.sql
index 613a5e666..49bc07ece 100644
--- a/.schema-diff/tables_62c41191ffce6ab68fbed810559dd857fa8a2528/job_run_status_enm.sql
+++ b/.schema-diff/tables_ad03e96532d85de63832a78057630a11189b99d2/job_run_status_enm.sql
@@ -26,7 +26,7 @@ set default_table_access_method = heap;
 
 create table public.job_run_status_enm (
     name text not null,
-    constraint only_predefined_job_status_allowed check ((name = any (array['running'::text, 'completed'::text, 'failed'::text, 'interrupted'::text])))
+    constraint only_predefined_job_status_allowed check ((name = any (array['running'::text, 'failed'::text, 'interrupted'::text])))
 );
 
 

Views

diff --git a/.schema-diff/views_62c41191ffce6ab68fbed810559dd857fa8a2528/job_jobs_to_run.sql b/.schema-diff/views_62c41191ffce6ab68fbed810559dd857fa8a2528/job_jobs_to_run.sql
deleted file mode 100644
index 56f6ed5e6..000000000
--- a/.schema-diff/views_62c41191ffce6ab68fbed810559dd857fa8a2528/job_jobs_to_run.sql
+++ /dev/null
@@ -1,47 +0,0 @@
---
--- postgresql database dump
---
-
--- dumped from database version 13.16
--- dumped by pg_dump version 14.13 (ubuntu 14.13-1.pgdg22.04+1)
-
-set statement_timeout = 0;
-set lock_timeout = 0;
-set idle_in_transaction_session_timeout = 0;
-set client_encoding = 'utf8';
-set standard_conforming_strings = on;
-select pg_catalog.set_config('search_path', '', false);
-set check_function_bodies = false;
-set xmloption = content;
-set client_min_messages = warning;
-set row_security = off;
-
---
--- name: job_jobs_to_run; type: view; schema: public; owner: -
---
-
-create view public.job_jobs_to_run as
- with running_jobs(job_plugin_id, job_name) as (
-         select job_run.job_plugin_id,
-            job_run.job_name
-           from public.job_run
-          where (job_run.status = 'running'::text)
-        ), final(job_plugin_id, job_name, next_scheduled_run) as (
-         select j.plugin_id,
-            j.name,
-            j.next_scheduled_run
-           from public.job j
-          where ((j.next_scheduled_run <= current_timestamp) and (not (exists ( select
-                   from running_jobs
-                  where (((running_jobs.job_plugin_id)::text = (j.plugin_id)::text) and ((running_jobs.job_name)::text = (j.name)::text))))))
-        )
- select final.job_plugin_id,
-    final.job_name,
-    final.next_scheduled_run
-   from final;
-
-
---
--- postgresql database dump complete
---
-

Triggers

Unchanged

Indexes

diff --git a/.schema-diff/indexes_62c41191ffce6ab68fbed810559dd857fa8a2528/job_run_status_ix.sql b/.schema-diff/indexes_ad03e96532d85de63832a78057630a11189b99d2/job_run_status_ix.sql
index 09ab0c10d..0ae4fc847 100644
--- a/.schema-diff/indexes_62c41191ffce6ab68fbed810559dd857fa8a2528/job_run_status_ix.sql
+++ b/.schema-diff/indexes_ad03e96532d85de63832a78057630a11189b99d2/job_run_status_ix.sql
@@ -29,7 +29,7 @@ create index job_run_status_ix on public.job_run using btree (status);
 -- name: index job_run_status_ix; type: comment; schema: public; owner: -
 --
 
-comment on index public.job_run_status_ix is 'the job_run_status_ix is used by the job run cleaner job';
+comment on index public.job_run_status_ix is 'the job_run_status_ix indexes the commonly-used status field';
 
 
 --

Constraints

Unchanged

Foreign Key Constraints

Unchanged

Copy link
Member

@tmessi tmessi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Copy link
Collaborator

@louisruch louisruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@hugoghx hugoghx merged commit f227020 into main Nov 7, 2024
62 of 63 checks passed
@hugoghx hugoghx deleted the hugo-jobrun-slowqueries branch November 7, 2024 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants