
Implement offline DB update #25

Merged: 24 commits, Jul 21, 2022

Conversation

@micaeljtoliveira (Contributor) commented Jun 15, 2022

Closes #24

This implements the offline update of the DB roughly as discussed in #24. The different steps are implemented in a Jenkins pipeline. The Jenkins configuration file is now part of this repository.

The update procedure goes like this (a shell sketch of the full flow follows the lists below):

  1. Create a time-stamped copy of the database from the database symlink, which should point to the most up-to-date version.
  2. Try to update the database copy.
  3. If no errors occurred during the update, change the symlink to point to the new file. Otherwise, delete the failed update attempt.

Some design decisions worth mentioning:

  • Only one update can be done per day.
  • The time-stamped versions of the database are stored in a sub-directory named daily (as the update is done daily).
  • Database files older than 2 weeks are pruned (only if the database update was successful, so as to be sure that there's always at least one copy left).
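
To illustrate, here is a rough shell sketch of that flow. The paths, the update_database command and the exact pruning rule are placeholders, not the literal contents of the Jenkinsfile:

    #!/usr/bin/env bash
    # Sketch only: DB_PATH, DB_LINK and update_database are hypothetical
    # placeholders standing in for the real pipeline configuration.
    set -euo pipefail

    DB_PATH=/g/data/ik11/databases              # assumed base directory
    DB_LINK=${DB_PATH}/cosima_master.db         # symlink seen by users
    DB=${DB_PATH}/daily/cosima_master_$(date +%Y-%m-%d).db

    # Only one update per day: abort if today's copy already exists.
    if [ -e "${DB}" ]; then
        echo "Database for today already exists: ${DB}" >&2
        exit 1
    fi

    # 1. Time-stamped copy of the current database (the symlink target).
    cp "$(readlink -f "${DB_LINK}")" "${DB}"

    # 2. Try to update the copy (placeholder for the real indexing command).
    if update_database "${DB}"; then
        # 3a. Success: publish the new version, then prune copies older than 2 weeks.
        ln -sf "${DB}" "${DB_LINK}"
        find "${DB_PATH}/daily" -type f -atime +14 -name "cosima_master_????-??-??.db" -delete
    else
        # 3b. Failure: discard the failed attempt and leave the symlink untouched.
        rm -f "${DB}"
    fi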

@micaeljtoliveira marked this pull request as draft June 15, 2022 01:47
@aidanheerdegen (Contributor)

Awesome!

I would recommend creating the databases in a dedicated sub-directory with a symbolic link to the current DB at /g/data/ik11/databases/cosima_master.db (as discussed).

/g/data/ik11/databases/ is a bit messy as others have also made one-off DBs and put them there. So I think having a single link named cosima_master.db makes it more obvious which is the correct one to use.

In which case it would make sense to clone this repo at that location, e.g. DB_PATH=/g/data/ik11/databases/master_index.

Andrew's suggestions are good IMO. So something like this:

  1. Have a (private) DB file (e.g. PRIVATE_DB=$DB_PATH/current.db) without read permissions (guaranteeing no one accidentally accesses it). If it doesn't exist, create it and index from scratch.
  2. Change permissions on $PRIVATE_DB to allow only user rw
  3. Update $PRIVATE_DB (index new files)
  4. Copy $PRIVATE_DB to time-stamped copy
  5. Remove rw permissions from $PRIVATE_DB
  6. Update link (/g/data/ik11/databases/cosima_master.db) to point to latest time-stamped copy

Having a current DB ($PRIVATE_DB) that is never accessed saves having to copy a DB which might be currently being accessed. Changing permissions on $PRIVATE_DB is a "belt and braces" approach, but I like it. Also has the benefit of not allowing the owner of $PRIVATE_DB to accidentally overwrite/delete it.
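
Roughly, in shell terms (just a sketch, untested; the paths and the indexing commands are placeholders, not real code from this repository):

    DB_PATH=/g/data/ik11/databases/master_index
    PRIVATE_DB=${DB_PATH}/current.db
    DB_LINK=/g/data/ik11/databases/cosima_master.db

    # 1. Create and index from scratch if the private DB doesn't exist yet
    #    (index_from_scratch is a placeholder for the real command).
    [ -e "${PRIVATE_DB}" ] || index_from_scratch "${PRIVATE_DB}"

    chmod 600 "${PRIVATE_DB}"          # 2. allow only user rw
    index_new_files "${PRIVATE_DB}"    # 3. index new files (placeholder command)
    SNAPSHOT=${DB_PATH}/cosima_master_$(date +%Y-%m-%d).db
    cp "${PRIVATE_DB}" "${SNAPSHOT}"   # 4. time-stamped copy
    chmod a-rw "${PRIVATE_DB}"         # 5. remove rw permissions again
    ln -sf "${SNAPSHOT}" "${DB_LINK}"  # 6. point the public link at the new copy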

This whole process could be run as-is without the symlink step. Set it running and let it update a few times, then add in the symlink step and it is done.

I plan to make /g/data/ik11/databases/cosima_master.db a symlink before this goes live so the change-over can be seamless.

Did you also want to include keeping any snapshots at all? I'm not fussed either way, but we did discuss it in person, so I thought it should be documented as being on the table and explicitly not added at this stage if that was the decision.

@micaeljtoliveira (Contributor, Author)

I would recommend creating the databases in a dedicated sub-directory with a symbolic link to the current DB at /g/data/ik11/databases/cosima_master.db (as discussed).

/g/data/ik11/databases/ is a bit messy as others have also made one-off DBs and put them there. So I think having a single link named cosima_master.db makes it more obvious which is the correct one to use.

Yes, that's a good idea. I'll change the script accordingly.

In which case it would make sense to clone this repo at that location, e.g. DB_PATH=/g/data/ik11/databases/master_index.

The way this is set up, there's no need to clone this repo. Jenkins will take care of doing that and the update script will be submitted from the Jenkins workspace.

Andrew's suggestions are good IMO. So something like this:

  1. Have a (private) DB file (e.g. PRIVATE_DB=$DB_PATH/current.db) without read permissions (guaranteeing no one accidentally accesses it). If it doesn't exist, create it and index from scratch.
  2. Change permissions on $PRIVATE_DB to allow only user rw
  3. Update $PRIVATE_DB (index new files)
  4. Copy $PRIVATE_DB to time-stamped copy
  5. Remove rw permissions from $PRIVATE_DB
  6. Update link (/g/data/ik11/databases/cosima_master.db) to point to latest time-stamped copy

Having a current DB ($PRIVATE_DB) that is never accessed saves having to copy a DB which might be currently being accessed. Changing permissions on $PRIVATE_DB is a "belt and braces" approach, but I like it. Also has the benefit of not allowing the owner of $PRIVATE_DB to accidentally overwrite/delete it.

I think we should avoid always doing the update on the same file. If for some reason the update takes too long and a new update is started before the old one has finished, we might get some weird stuff happening. Instead, we can achieve the same level of protection of the live database by keeping the workflow that I implemented, but changing the permissions of the time-stamped files. Assuming all the database files are read-only, this would look like this:

  1. Copy the latest version of the database (the one pointed at by the link) to new time-stamped file
  2. Change permissions of new file to rw
  3. Update the database
  4. Change permissions of new file back to read-only
  5. Update symlink

Currently steps 2 and 4 are missing, but are trivial to add.
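
Concretely, the two missing steps are just a couple of chmod calls around the update (sketch only; DB stands for the new time-stamped file and update_database is a placeholder for the real indexing command):

    chmod u+w "${DB}"        # step 2: make the fresh copy writable for the update
    update_database "${DB}"  # step 3: index new files
    chmod a-w "${DB}"        # step 4: back to read-only before updating the symlink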

I plan to make /g/data/ik11/databases/cosima_master.db a symlink before this goes live so the change-over can be seamless.

Sounds good.

Did you also want to include keeping any snapshots at all? I'm not fussed either way, but we did discuss it in person, so I thought it should be documented as being on the table and explicitly not added at this stage if that was the decision.

The way I implemented it, the step that updates the database does not delete any files, so all time stamped files are kept. Pruning of old versions is done at the last stage of the pipeline. Currently this deletes the files older than 2 weeks, as Andrew suggested.

@aidanheerdegen (Contributor)

The way this is set up, there's no need to clone this repo. Jenkins will take care of doing that and the update script will be submitted from the Jenkins workspace.

Good point. Scratch that.

I think we should avoid always doing the update on the same file. If for some reason the update takes too long and a new update is started before the old one has finished, we might get some weird stuff happening. Instead, we can achieve the same level of protection of the live database by keeping the workflow that I implemented, but changing the permissions of the time-stamped files.

Agreed. The latest DB file can be open for read-only access, but this shouldn't preclude a copy being made safely.

Assuming all the database files are read-only, this would look like this:

  1. Copy the latest version of the database (the one pointed at by the link) to new time-stamped file
  2. Change permissions of new file to rw
  3. Update the database
  4. Change permissions of new file back to read-only
  5. Update symlink

However, doesn't your point above also apply to this workflow? If you have a DB update in progress, and another one starts, won't it try to copy the currently open (and being written to) DB file as it is the one with the most recent timestamp? If so that is an unsafe operation. The copied DB cannot be guaranteed to be in a safe state.

However, if the Jenkins machinery won't allow another scheduled job to start until the previous one has finished that should not be an issue: we wouldn't want more than one update to occur simultaneously in any case. You can't safely copy the currently updating file, so the next update would just end up re-indexing all the same files the current update is struggling to do.

@micaeljtoliveira (Contributor, Author) commented Jun 15, 2022

However, doesn't your point above also apply to this workflow? If you have a DB update in progress, and another one starts, won't it try to copy the currently open (and being written to) DB file as it is the one with the most recent timestamp? If so that is an unsafe operation. The copied DB cannot be guaranteed to be in a safe state.

No, this should not be an issue. The copy is done from the symlink and the symlink is only updated after the update has been successful. So one never copies a database that has only been partially updated. The other possible issue is if we start two updates the same day, as in that case both updates would try to create the same file, as I'm only using the day/month/year to time-stamp the files. In principle this should not happen, as the Jenkins job is scheduled to run once a day, but just in case, I've added a check for this and the job will abort if the file it's trying to create already exists.

However, if the Jenkins machinery won't allow another scheduled job to start until the previous one has finished that should not be an issue: we wouldn't want more than one update to occur simultaneously in any case. You can't safely copy the currently updating file, so the next update would just end up re-indexing all the same files the current update is struggling to do.

With the above scheme, it's fine if a Jenkins job starts before the previous one is finished, as long as it starts the next day. Of course, it might not be good to let a job run for too long, but we do have a timeout in place, as the PBS script sets a walltime.

@micaeljtoliveira (Contributor, Author)

Just to summarize a bit.

The way I implemented this, the following should never occur:

  1. An update is started from an unfinished update
  2. The symlink points to an unfinished update

This should shield the users from any issues in the update. The problematic case that could still occur is if the indexing takes so long that the PBS job terminates before the update is finished. I think this case requires some form of human intervention, i.e., someone needs to see what is going on and decide an appropriate way of fixing the problem.

Does this sound reasonable?

@aidanheerdegen (Contributor) commented Jun 15, 2022

Thanks for clarifying.

Does this sound reasonable?

Yes. But I don't think there is any use-case where it would make sense to have one index job run before the previous one has finished, because the second indexing job will just replicate the work of the currently running one, as they will both be starting from the same DB state.

The problematic case that could still occur is if the indexing takes so long that the PBS job terminates before the update is finished. I think this case requires some form of human intervention, i.e., someone needs to see what is going on and decide an appropriate way of fixing the problem.

Once COSIMA/cosima-cookbook#292 is merged it will lessen the impact of a terminated job, because the indexing up to that point should be saved in the DB. So a subsequent indexing job will not be certain to fail, at least if the "failure" was caused by running over some time limit.

P.S. Assuming that a "failed" job (one that ran out of time) will update the DB. Might have to differentiate between failure modes: running out of time is fine, failing and leaving the DB corrupted, not so much.

@micaeljtoliveira (Contributor, Author)

Thanks for clarifying.

Does this sound reasonable?

Yes. But I don't think there is any use-case where it would make sense to have one index job run before the previous one has finished, because the second indexing job will just replicate the work of the currently running one, as they will both be starting from the same DB state.

Makes sense. I've just changed the pipeline configuration to forbid concurrent builds.

Once COSIMA/cosima-cookbook#292 is merged it will lessen the impact of a terminated job, because the indexing up to that point should be saved in the DB. So a subsequent indexing job will not be certain to fail, at least if the "failure" was caused by running over some time limit.

P.S. Assuming that a "failed" job (one that ran out of time) will update the DB. Might have to differentiate between failure modes: running out of time is fine, failing and leaving the DB corrupted, not so much.

In that case I need to find a way to determine why a job failed within Jenkins. Maybe the information could be recovered from the PBS logs. I need to check this.

…ate does not exist yet. If no concurrent Jenkins runs are allowed, then it should be safe to restart an indexing job from an unfinished one.
@aidanheerdegen (Contributor)

The indexing is only done in a PBS job because memory and time blew out. Memory might not be such an issue with COSIMA/cosima-cookbook#292, but there are some time limits on interactive jobs, or there used to be.

I'll run a test overnight to update the copy of the DB I've been regenerating to see how long it will run. If it can be done interactively that might solve your issue. It can't time out, and you can check the returned error code.

@aidanheerdegen (Contributor)

So a single process ran overnight without a problem on a login node, so based on that it should be possible to run this interactively on a login node. The only thing to consider then is to make sure you have a multi-node execution agent so this doesn't block any other tasks you have, as it might run for multiple days.

@aidanheerdegen (Contributor)

I should say that one advantage of not using PBS jobs is that the indexing job isn't then dependent on there being enough SUs remaining in the project to run. It is fairly common for projects to exhaust their quota a day or two before the end of a quarter, so the jobs fail until the quota is refreshed.

@micaeljtoliveira (Contributor, Author)

@aidanheerdegen I've been having some fun today with Jenkins and PBS and I'm starting to somewhat dislike Jenkins :(

Anyway, it looks like a PBS job that reaches the walltime does not return an error. On the other hand, if an error occurs during the script's execution, then it does return an error. So it's very easy to tell those two cases apart.

So we could keep using PBS jobs without changing the Jenkins pipeline logic. If a job reaches the walltime, we can still make the resulting database available to users, so that they can access any new files indexed overnight, and keep indexing the missing files the next time the job runs. In any case, I wouldn't run the update without a timeout of some sort.
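
For example, if we ended up running the update without PBS, something like this could impose the time limit and tell a timeout apart from a real failure (sketch only, assuming GNU coreutils timeout; index_new_files is a placeholder for the real indexing command and the 8-hour limit is arbitrary):

    # Run the indexing step with a time limit.
    timeout 8h index_new_files "${DB}"
    status=$?
    if [ "${status}" -eq 124 ]; then
        # GNU timeout exits with 124 when the command was killed by the time limit:
        # the database is usable, just not fully up to date, so it could still be published.
        echo "Update timed out; publishing a partially updated database" >&2
    elif [ "${status}" -ne 0 ]; then
        # Any other non-zero status is a genuine failure: discard the attempt.
        rm -f "${DB}"
        exit "${status}"
    fi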

…ral steps to run at the end of the pipeline, depending on the result of the update.

Change the name of the PBS logs, so that we can easily archive them with Jenkins.
@aidanheerdegen (Contributor) commented Jun 16, 2022

Sure. I just wanted to make it clear I only started using PBS jobs because of resource (memory) limits. If those are solved then it might be simpler to revert to running on the login node. Totally up to you of course.

I have an updated script that I was using in my testing, I'll put it in a PR (#26) and you can use it or not as you like.

@micaeljtoliveira (Contributor, Author)

I just had a look at #26 and I like your changes. I'll give it a try with the Jenkins pipeline running the update interactively and see how it goes.

In the meantime I realized that I was misled by Jenkins (bad Jenkins!) about the exit status of a PBS job that exceeds the walltime, so the script would have to be a bit more complicated.

aidanheerdegen and others added 7 commits June 20, 2022 11:16
Use parallel to allow simultaneous indexing and shuffling inputs.

Set file chunks to 1000.
…o pass the database to be updated through the command line.
…hout using PBS. We are still setting a time limit for the update, so that updates that run for longer than a reasonable amount of time don't go unnoticed. An update that fails to complete because of the time limit will mark the build as unstable.
- Do not allow to update the database twice during the same day.
- Remove the new database file if an update fails.
- Remove the new database file if the run is aborted by a user.
@aidanheerdegen (Contributor)

At the risk of overloading this PR, I wonder if there is some merit in taking into account #28

@micaeljtoliveira (Contributor, Author)

At the risk of overloading this PR, I wonder if there is some merit in taking into account #28

I definitely see the merit of #28, but I would prefer to tackle it in a different PR. This one already has a reasonable amount of changes that would be nice to merge.

@micaeljtoliveira changed the title from "WIP: Implement offline DB update" to "Implement offline DB update" on Jul 15, 2022
@micaeljtoliveira marked this pull request as ready for review July 15, 2022 00:01
@aidanheerdegen (Contributor) previously approved these changes Jul 15, 2022 and left a comment:

Take whatever you like from those comments. Happy for you to merge regardless.

@aidanheerdegen (Contributor)

Looks good! Just need @aekiss to approve and good to merge.

@micaeljtoliveira added the "enhancement" (New feature or request) label Jul 15, 2022
Jenkinsfile (outdated):
// Update was successful, so we can now update the symlink to point to this new version of the database
sh "ln -sf ${DB} ${DB_LINK}"
// Prune old versions of the database that have not been accessed in the last 14 days.
sh 'find ${DB_PATH}/daily -type f -atime +14 -name "cosima_master_????-??-??.db" -exec rm -fv {} \\;'
Contributor:

should this also ensure that at least the last N>1 databases are maintained, in case the Jenkins update is paused for a couple of weeks?

Contributor (Author):

That's a good question. I agree that having a situation where only one copy of the database is left can be risky. Would N=7 be acceptable?

Contributor:

sounds good, that should be plenty

Contributor (Author):

Done.

Jenkinsfile (outdated):
successulDBUpdate()
}
unstable {
successulDBUpdate()
Contributor:

do we really want to prune old DBs and link to an unstable DB? I'd have thought we should keep trying to update it but not make it available to users until successfully updated without error

Contributor (Author):

Here unstable simply means that the update timed out, but without any other errors. The reason I'm treating it differently is so that the people responsible for the update (I guess that's the two of us) are aware that the database is not fully up-to-date and can check what is going on in case it keeps taking too long to update. Otherwise I assumed that there was no problem in making a more up-to-date database available to users, even if it's not fully up-to-date. But as I wrote above, it's up to you to decide what is best to make available to users.

Contributor:

Yep, agreed as above

@aekiss (Contributor) left a comment:

6cfe5c2 won't guarantee that 7 files are retained - if there are >7 files but they're all >14d old, they'll all be deleted.

Also on 2nd thoughts it would be better to retain the last 14 (not 7) files, for consistency with the date-based criterion which should keep 14 files when updates are happening daily.

How about something like this? (NB: untested!)

    sh '''shopt -s failglob
          for f in $(ls -1t ${DB_PATH}/daily/cosima_master_????-??-??.db | tail -n +15); do
               find ${f} -type f -atime +14 -name "cosima_master_????-??-??.db" -exec rm -fv {} \\;
          done'''

@aidanheerdegen (Contributor)

If wanting to keep the last n files irrespective of date don't bother with the atime criteria, just make a list of all the matching DB files and delete the n+1, n+2, ... files. They may span a greater date range than 14 days, but no big deal.

If there is a need to keep any DB that has been accessed within 14 days in case someone is using it (which they shouldn't, this isn't supported), then loop over the list of files to delete and only do so if atime is > 14 days.

@aekiss (Contributor) commented Jul 20, 2022

We want to do the latter, ie keep every file accessed in the last 14 days, plus the most recent 14 files.

If there is a need to keep any DB that has been accessed within 14 days in case someone is using it (which they shouldn't, this isn't supported), then loop over the list of files to delete and only do so if atime is > 14 days.

Unless I've misunderstood I think my code implements what you suggest for this case.

@aidanheerdegen (Contributor)

My apologies, yes it does. And it seems find is the best way to do it, so I'd suggest collapsing it to something like this:

find $(ls ${DB_PATH}/daily/cosima_master_????-??-??.db | tail -n 15) -atime +14 -delete

but maybe others find that less readable?

@aekiss (Contributor) commented Jul 20, 2022

Ah that's neater, and I think more readable. It would need to be this though (with ls -1t and tail -n +15):

find $(ls -1t ${DB_PATH}/daily/cosima_master_????-??-??.db | tail -n +15) -atime +14 -delete

@micaeljtoliveira (Contributor, Author)

Oops, that was a silly mistake. I like the collapsed form better, so I think I'll go for that.

@micaeljtoliveira (Contributor, Author)

find $(ls -1t ${DB_PATH}/daily/cosima_master_????-??-??.db | tail -n +15) -atime +14 -delete

Unfortunately this does not work if there are 14 or fewer database files, as the subshell will return an empty string, so find will default to searching the current directory and try to delete everything there that has not been accessed in the last 14 days.

@micaeljtoliveira (Contributor, Author) commented Jul 20, 2022

Here is my take on this:

if [ $(find ${DB_PATH}/daily -type f -atime -14 -name "cosima_master_????-??-??.db" | wc -l) -gt 7 ]; then
    find ${DB_PATH}/daily -type f -atime +14 -name "cosima_master_????-??-??.db" -delete
fi

It has the advantage of avoiding the "for" loop from @aekiss's original suggestion, which I find confusing, as it calls find on a single file.

@aidanheerdegen (Contributor)

I love bash (I don't love bash)

@aekiss (Contributor) commented Jul 21, 2022

Looks good - I like that your version will stop pruning if the daily update has stalled.

@aekiss (Contributor) left a comment:

looks good to me, thanks @micaeljtoliveira

@micaeljtoliveira merged commit e30acbb into COSIMA:master Jul 21, 2022
@micaeljtoliveira deleted the new_pipeline branch July 26, 2022 05:08
Labels: enhancement (New feature or request)
Successfully merging this pull request may close these issues: Offline DB updates
3 participants