
Implement offline DB update #25

Merged: 24 commits, Jul 21, 2022

Conversation

@micaeljtoliveira (Contributor) commented Jun 15, 2022

Closes #24

This implements the offline update of the DB roughly as discussed in #24. The different steps are implemented in a Jenkins pipeline. The Jenkins configuration file is now part of this repository.

The update procedure goes like this (a shell sketch of the full flow follows the lists below):

  1. Create a time-stamped copy of the database from the database symlink, which should point to the most up-to-date version.
  2. Try to update the database copy.
  3. If no errors occurred during the update, change the symlink to point to the new file. Otherwise, delete the failed update attempt.

Some design decisions worth mentioning:

  • Only one update can be done per day.
  • The time-stamped versions of the database are stored in a sub-directory named daily (as the update is done daily).
  • Database files older than 2 weeks are pruned (only if the database update was successful, so as to be sure that there's always at least one copy left).
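
To illustrate, here is a rough shell sketch of that flow. The paths, the update_database command and the exact pruning rule are placeholders, not the literal contents of the Jenkinsfile:

    #!/usr/bin/env bash
    # Sketch only: DB_PATH, DB_LINK and update_database are hypothetical
    # placeholders standing in for the real pipeline configuration.
    set -euo pipefail

    DB_PATH=/g/data/ik11/databases              # assumed base directory
    DB_LINK=${DB_PATH}/cosima_master.db         # symlink seen by users
    DB=${DB_PATH}/daily/cosima_master_$(date +%Y-%m-%d).db

    # Only one update per day: abort if today's copy already exists.
    if [ -e "${DB}" ]; then
        echo "Database for today already exists: ${DB}" >&2
        exit 1
    fi

    # 1. Time-stamped copy of the current database (the symlink target).
    cp "$(readlink -f "${DB_LINK}")" "${DB}"

    # 2. Try to update the copy (placeholder for the real indexing command).
    if update_database "${DB}"; then
        # 3a. Success: publish the new version, then prune copies older than 2 weeks.
        ln -sf "${DB}" "${DB_LINK}"
        find "${DB_PATH}/daily" -type f -atime +14 -name "cosima_master_????-??-??.db" -delete
    else
        # 3b. Failure: discard the failed attempt and leave the symlink untouched.
        rm -f "${DB}"
    fi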

@micaeljtoliveira marked this pull request as draft June 15, 2022 01:47
@aidanheerdegen (Contributor)

Awesome!

I would recommend creating the databases in a dedicated sub-directory with a symbolic link to the current DB at /g/data/ik11/databases/cosima_master.db (as discussed).

/g/data/ik11/databases/ is a bit messy as others have also made one-off DBs and put them there. So I think having a single link named cosima_master.db makes it more obvious which is the correct one to use.

In which case it would make sense to clone this repo at that location, e.g. DB_PATH=/g/data/ik11/databases/master_index.

Andrew's suggestions are good IMO. So something like this:

  1. Have a (private) DB file (e.g. PRIVATE_DB=$DB_PATH/current.db) without read permissions (guaranteeing no one accidentally accesses it). If it doesn't exist, create it and index from scratch.
  2. Change permissions on $PRIVATE_DB to allow only user rw
  3. Update $PRIVATE_DB (index new files)
  4. Copy $PRIVATE_DB to time-stamped copy
  5. Remove rw permissions from $PRIVATE_DB
  6. Update link (/g/data/ik11/databases/cosima_master.db) to point to latest time-stamped copy

Having a current DB ($PRIVATE_DB) that is never accessed saves having to copy a DB which might be currently being accessed. Changing permissions on $PRIVATE_DB is a "belt and braces" approach, but I like it. Also has the benefit of not allowing the owner of $PRIVATE_DB to accidentally overwrite/delete it.
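
Roughly, in shell terms (just a sketch, untested; the paths and the indexing commands are placeholders, not real code from this repository):

    DB_PATH=/g/data/ik11/databases/master_index
    PRIVATE_DB=${DB_PATH}/current.db
    DB_LINK=/g/data/ik11/databases/cosima_master.db

    # 1. Create and index from scratch if the private DB doesn't exist yet
    #    (index_from_scratch is a placeholder for the real command).
    [ -e "${PRIVATE_DB}" ] || index_from_scratch "${PRIVATE_DB}"

    chmod 600 "${PRIVATE_DB}"          # 2. allow only user rw
    index_new_files "${PRIVATE_DB}"    # 3. index new files (placeholder command)
    SNAPSHOT=${DB_PATH}/cosima_master_$(date +%Y-%m-%d).db
    cp "${PRIVATE_DB}" "${SNAPSHOT}"   # 4. time-stamped copy
    chmod a-rw "${PRIVATE_DB}"         # 5. remove rw permissions again
    ln -sf "${SNAPSHOT}" "${DB_LINK}"  # 6. point the public link at the new copy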

This whole process could be run as-is without the symlink step. Set it running and let it update a few times, then add in the symlink step and it is done.

I plan to make /g/data/ik11/databases/cosima_master.db a symlink before this goes live so the change-over can be seamless.

Did you also want to include keeping any snapshots at all? I'm not fussed either way, but we did discuss it in person, so I thought it should be documented as being on the table and explicitly not added at this stage if that was the decision.

@micaeljtoliveira (Contributor, Author)

I would recommend creating the databases in a dedicated sub-directory with a symbolic link to the current DB at /g/data/ik11/databases/cosima_master.db (as discussed).

/g/data/ik11/databases/ is a bit messy as others have also made one-off DBs and put them there. So I think having a single link named cosima_master.db makes it more obvious which is the correct one to use.

Yes, that's a good idea. I'll change the script accordingly.

In which case it would make sense to clone this repo at that location, e.g. DB_PATH=/g/data/ik11/databases/master_index.

The way this is set up, there's no need to clone this repo. Jenkins will take care of doing that and the update script will be submitted from the Jenkins workspace.

Andrew's suggestions are good IMO. So something like this:

  1. Have a (private) DB file (e.g. PRIVATE_DB=$DB_PATH/current.db) without read permissions (guaranteeing no one accidentally accesses it). If it doesn't exist, create it and index from scratch.
  2. Change permissions on $PRIVATE_DB to allow only user rw
  3. Update $PRIVATE_DB (index new files)
  4. Copy $PRIVATE_DB to time-stamped copy
  5. Remove rw permissions from $PRIVATE_DB
  6. Update link (/g/data/ik11/databases/cosima_master.db) to point to latest time-stamped copy

Having a current DB ($PRIVATE_DB) that is never accessed saves having to copy a DB which might be currently being accessed. Changing permissions on $PRIVATE_DB is a "belt and braces" approach, but I like it. Also has the benefit of not allowing the owner of $PRIVATE_DB to accidentally overwrite/delete it.

I think we should avoid always doing the update on the same file. If for some reason the update takes too long and a new update is started before the old one has finished, we might get some weird stuff happening. Instead, we can achieve the same level of protection of the live database by keeping the workflow that I implemented, but changing the permissions of the time-stamped files. Assuming all the database files are read-only, this would look like this:

  1. Copy the latest version of the database (the one pointed at by the link) to new time-stamped file
  2. Change permissions of new file to rw
  3. Update the database
  4. Change permissions of new file back to read-only
  5. Update symlink

Currently steps 2 and 4 are missing, but are trivial to add.
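
Concretely, the two missing steps are just a couple of chmod calls around the update (sketch only; DB stands for the new time-stamped file and update_database is a placeholder for the real indexing command):

    chmod u+w "${DB}"        # step 2: make the fresh copy writable for the update
    update_database "${DB}"  # step 3: index new files
    chmod a-w "${DB}"        # step 4: back to read-only before updating the symlink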

I plan to make /g/data/ik11/databases/cosima_master.db a symlink before this goes live so the change-over can be seamless.

Sounds good.

Did you also want to include keeping any snapshots at all? I'm not fussed either way, but we did discuss it in person, so I thought it should be documented as being on the table and explicitly not added at this stage if that was the decision.

The way I implemented it, the step that updates the database does not delete any files, so all time stamped files are kept. Pruning of old versions is done at the last stage of the pipeline. Currently this deletes the files older than 2 weeks, as Andrew suggested.

@aidanheerdegen (Contributor)

The way this is set up, there's no need to clone this repo. Jenkins will take care of doing that and the update script will be submitted from the Jenkins workspace.

Good point. Scratch that.

I think we should avoid always doing the update on the same file. If for some reason the update takes too long and a new update is started before the old one has finished, we might get some weird stuff happening. Instead, we can achieve the same level of protection of the live database by keeping the workflow that I implemented, but changing the permissions of the time-stamped files.

Agreed. The latest DB file can be open for read-only access, but this shouldn't preclude a copy being made safely.

Assuming all the database files are read-only, this would look like this:

  1. Copy the latest version of the database (the one pointed at by the link) to new time-stamped file
  2. Change permissions of new file to rw
  3. Update the database
  4. Change permissions of new file back to read-only
  5. Update symlink

However, doesn't your point above also apply to this workflow? If you have a DB update in progress, and another one starts, won't it try to copy the currently open (and being written to) DB file as it is the one with the most recent timestamp? If so that is an unsafe operation. The copied DB cannot be guaranteed to be in a safe state.

However, if the Jenkins machinery won't allow another scheduled job to start until the previous one has finished that should not be an issue: we wouldn't want more than one update to occur simultaneously in any case. You can't safely copy the currently updating file, so the next update would just end up re-indexing all the same files the current update is struggling to do.

@micaeljtoliveira (Contributor, Author) commented Jun 15, 2022

However, doesn't your point above also apply to this workflow? If you have a DB update in progress, and another one starts, won't it try to copy the currently open (and being written to) DB file as it is the one with the most recent timestamp? If so that is an unsafe operation. The copied DB cannot be guaranteed to be in a safe state.

No, this should not be an issue. The copy is done from the symlink and the symlink is only updated after the update has been successful. So one never copies a database that has only been partially updated. The other possible issue is if we start two updates the same day, as in that case both updates would try to create the same file, as I'm only using the day/month/year to time-stamp the files. In principle this should not happen, as the Jenkins job is scheduled to run once a day, but just in case, I've added a check for this and the job will abort if the file it's trying to create already exists.

However, if the Jenkins machinery won't allow another scheduled job to start until the previous one has finished that should not be an issue: we wouldn't want more than one update to occur simultaneously in any case. You can't safely copy the currently updating file, so the next update would just end up re-indexing all the same files the current update is struggling to do.

With the above scheme, it's fine if a Jenkins job starts before the previous one is finished, as long as it starts the next day. Of course, it might not be good to let a job run for too long, but we do have a timeout in place, as the PBS script sets a walltime.

@micaeljtoliveira (Contributor, Author)

Just to summarize a bit.

The way I implemented this, the following should never occur:

  1. An update is started from an unfinished update
  2. The symlink points to an unfinished update

This should shield the users from any issues in the update. The problematic case that could still occur is if the indexing takes so long that the PBS job terminates before the update is finished. I think this case requires some form of human intervention, i.e., someone needs to see what is going on and decide an appropriate way of fixing the problem.

Does this sound reasonable?

@aidanheerdegen (Contributor) commented Jun 15, 2022

Thanks for clarifying.

Does this sound reasonable?

Yes. But I don't think there is any use-case where it would make sense to have one index job run before the previous one has finished, because the second indexing job will just replicate the work of the currently running one, as they will both be starting from the same DB state.

The problematic case that could still occur is if the indexing takes so long that the PBS job terminates before the update is finished. I think this case requires some form of human intervention, i.e., someone needs to see what is going on and decide an appropriate way of fixing the problem.

Once COSIMA/cosima-cookbook#292 is merged it will lessen the impact of a terminated job, because the indexing up to that point should be saved in the DB. So a subsequent indexing job will not be certain to fail, at least if the "failure" was caused by running over some time limit.

P.S. Assuming that a "failed" job (one that ran out of time) will update the DB. Might have to differentiate between failure modes: running out of time is fine, failing and leaving the DB corrupted, not so much.

@micaeljtoliveira (Contributor, Author)

Thanks for clarifying.

Does this sound reasonable?

Yes. But I don't think there is any use-case where it would make sense to have one index job run before the previous one has finished, because the second indexing job will just replicate the work of the currently running one, as they will both be starting from the same DB state.

Makes sense. I've just changed the pipeline configuration to forbid concurrent builds.

Once COSIMA/cosima-cookbook#292 is merged it will lessen the impact of a terminated job, because the indexing up to that point should be saved in the DB. So a subsequent indexing job will not be certain to fail, at least if the "failure" was caused by running over some time limit.

P.S. Assuming that a "failed" job (one that ran out of time) will update the DB. Might have to differentiate between failure modes: running out of time is fine, failing and leaving the DB corrupted, not so much.

In that case I need to find a way to determine why a job failed within Jenkins. Maybe the information could be recovered from the PBS logs. I need to check this.

…ate does not exist yet. If no concurrent Jenkins runs are allowed, then it should be safe to restart an indexing job from an unfinished one.
@aidanheerdegen (Contributor)

The indexing is only done in a PBS job because memory and time blew out. Memory might not be such an issue with COSIMA/cosima-cookbook#292, but there are some time limits on interactive jobs, or there used to be.

I'll run a test overnight to update the copy of the DB I've been regenerating to see how long it will run. If it can be done interactively that might solve your issue. It can't time out, and you can check the returned error code.

@aidanheerdegen (Contributor)

So a single process ran overnight without a problem on a login node, so based on that it should be possible to run this interactively on a login node. The only thing to consider then is to make sure you have a multi-node execution agent so this doesn't block any other tasks you have, as it might run for multiple days.

@aidanheerdegen (Contributor)

I should say that one advantage of not using PBS jobs is that the indexing job isn't then dependent on there being enough SUs remaining in the project to run. It is fairly common for projects to exhaust their quota a day or two before the end of a quarter, so the jobs fail until the quota is refreshed.

@micaeljtoliveira (Contributor, Author)

@aidanheerdegen I've been having some fun today with Jenkins and PBS and I'm starting to somewhat dislike Jenkins :(

Anyway, it looks like a PBS job that reaches the walltime does not return an error. On the other hand, if an error occurs during the script's execution, then it does return an error. So it's very easy to tell those two cases apart.

So we could keep using PBS jobs without changing the Jenkins pipeline logic. If a job reaches the walltime, we can still make the resulting database available to users, so that they can access any new files indexed overnight, and keep indexing the missing files the next time the job runs. In any case, I wouldn't run the update without a timeout of some sort.
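
For example, if we ended up running the update without PBS, something like this could impose the time limit and tell a timeout apart from a real failure (sketch only, assuming GNU coreutils timeout; index_new_files is a placeholder for the real indexing command and the 8-hour limit is arbitrary):

    # Run the indexing step with a time limit.
    timeout 8h index_new_files "${DB}"
    status=$?
    if [ "${status}" -eq 124 ]; then
        # GNU timeout exits with 124 when the command was killed by the time limit:
        # the database is usable, just not fully up to date, so it could still be published.
        echo "Update timed out; publishing a partially updated database" >&2
    elif [ "${status}" -ne 0 ]; then
        # Any other non-zero status is a genuine failure: discard the attempt.
        rm -f "${DB}"
        exit "${status}"
    fi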

…ral steps to run at the end of the pipeline, depending on the result of the update.

Change the name of the PBS logs, so that we can easily archive them with Jenkins.
@aidanheerdegen (Contributor) commented Jun 16, 2022

Sure. I just wanted to make it clear I only started using PBS jobs because of resource (memory) limits. If those are solved then it might be simpler to revert to running on the login node. Totally up to you of course.

I have an updated script that I was using in my testing, I'll put it in a PR (#26) and you can use it or not as you like.

@micaeljtoliveira (Contributor, Author)

I just had a look at #26 and I like your changes. I'll give it a try with the Jenkins pipeline running the update interactively and see how it goes.

In the meantime I realized that I was misled by Jenkins (bad Jenkins!) about the exit status of a PBS job that exceeds the walltime, so the script would have to be a bit more complicated.

aidanheerdegen and others added 7 commits June 20, 2022 11:16
Use parallel to allow simultaneous indexing and shuffling inputs.

Set file chunks to 1000.
…o pass the database to be updated through the command line.
…hout using PBS. We are still setting a time limit for the update, so that updates that run for longer than a reasonable amount of time don't go unnoticed. An update that fails to complete because of the time limit will mark the build as unstable.
- Do not allow to update the database twice during the same day.
- Remove the new database file if an update fails.
- Remove the new database file if the run is aborted by a user.
@aidanheerdegen (Contributor)

At the risk of overloading this PR, I wonder if there is some merit in taking into account #28

@micaeljtoliveira (Contributor, Author)

At the risk of overloading this PR, I wonder if there is some merit in taking into account #28

I definitely see the merit of #28, but I would prefer to tackle it in a different PR. This one already has a reasonable amount of changes that would be nice to merge.

@micaeljtoliveira changed the title from "WIP: Implement offline DB update" to "Implement offline DB update" on Jul 15, 2022
@micaeljtoliveira marked this pull request as ready for review July 15, 2022 00:01
@aidanheerdegen (Contributor) previously approved these changes Jul 15, 2022 and left a comment:

Take whatever you like from those comments. Happy for you to merge regardless.

@aidanheerdegen (Contributor)

Looks good! Just need @aekiss to approve and good to merge.

@micaeljtoliveira added the "enhancement" (New feature or request) label Jul 15, 2022
Jenkinsfile (outdated):
// Update was successful, so we can now update the symlink to point to this new version of the database
sh "ln -sf ${DB} ${DB_LINK}"
// Prune old versions of the database that have not been accessed in the last 14 days.
sh 'find ${DB_PATH}/daily -type f -atime +14 -name "cosima_master_????-??-??.db" -exec rm -fv {} \\;'
Contributor:

should this also ensure that at least the last N>1 databases are maintained, in case the Jenkins update is paused for a couple of weeks?

Contributor (Author):

That's a good question. I agree that having a situation where only one copy of the database is left can be risky. Would N=7 be acceptable?

Contributor:

sounds good, that should be plenty

Contributor (Author):

Done.

Jenkinsfile (outdated):
successulDBUpdate()
}
unstable {
successulDBUpdate()
Contributor:

do we really want to prune old DBs and link to an unstable DB? I'd have thought we should keep trying to update it but not make it available to users until successfully updated without error

Contributor (Author):

Here unstable simply means that the update timed out, but without any other errors. The reason I'm treating it differently is so that the people responsible for the update (I guess that's the two of us) are aware that the database is not fully up-to-date and can check what is going on in case it keeps taking too long to update. Otherwise I assumed that there was no problem in making a more up-to-date database available to users, even if it's not fully up-to-date. But as I wrote above, it's up to you to decide what is best to make available to users.

Contributor:

Yep, agreed as above

@aekiss (Contributor) left a comment:

6cfe5c2 won't guarantee that 7 files are retained - if there are >7 files but they're all >14d old, they'll all be deleted.

Also on 2nd thoughts it would be better to retain the last 14 (not 7) files, for consistency with the date-based criterion which should keep 14 files when updates are happening daily.

How about something like this? (NB: untested!)

    sh '''shopt -s failglob
          for f in $(ls -1t ${DB_PATH}/daily/cosima_master_????-??-??.db | tail -n +15); do
               find ${f} -type f -atime +14 -name "cosima_master_????-??-??.db" -exec rm -fv {} \\;
          done'''

@aidanheerdegen (Contributor)

If wanting to keep the last n files irrespective of date don't bother with the atime criteria, just make a list of all the matching DB files and delete the n+1, n+2, ... files. They may span a greater date range than 14 days, but no big deal.

If there is a need to keep any DB that has been accessed within 14 days in case someone is using it (which they shouldn't, this isn't supported), then loop over the list of files to delete and only do so if atime is > 14 days.

@aekiss (Contributor) commented Jul 20, 2022

We want to do the latter, ie keep every file accessed in the last 14 days, plus the most recent 14 files.

If there is a need to keep any DB that has been accessed within 14 days in case someone is using it (which they shouldn't, this isn't supported), then loop over the list of files to delete and only do so if atime is > 14 days.

Unless I've misunderstood I think my code implements what you suggest for this case.

@aidanheerdegen (Contributor)

My apologies, yes it does. And it seems find is the best way to do it, so I'd suggest collapsing it to something like this:

find $(ls ${DB_PATH}/daily/cosima_master_????-??-??.db | tail -n 15) -atime +14 -delete

but maybe others find that less readable?

@aekiss (Contributor) commented Jul 20, 2022

Ah that's neater, and I think more readable. It would need to be this though (with ls -1t and tail -n +15):

find $(ls -1t ${DB_PATH}/daily/cosima_master_????-??-??.db | tail -n +15) -atime +14 -delete

@micaeljtoliveira (Contributor, Author)

Oops, that was a silly mistake. I like the collapsed form better, so I think I'll go for that.

@micaeljtoliveira (Contributor, Author)

find $(ls -1t ${DB_PATH}/daily/cosima_master_????-??-??.db | tail -n +15) -atime +14 -delete

Unfortunately this does not work if there are 14 or fewer database files, as the subshell will return an empty string, so find will default to searching the current directory and try to delete everything there that has not been accessed in the last 14 days.

@micaeljtoliveira (Contributor, Author) commented Jul 20, 2022

Here is my take on this:

if [ $(find ${DB_PATH}/daily -type f -atime -14 -name "cosima_master_????-??-??.db" | wc -l) -gt 7 ]; then
    find ${DB_PATH}/daily -type f -atime +14 -name "cosima_master_????-??-??.db" -delete
fi

It has the advantage of avoiding the "for" loop from @aekiss's original suggestion, which I find confusing, as it calls find on a single file.

@aidanheerdegen (Contributor)

I love bash (I don't love bash)

@aekiss (Contributor) commented Jul 21, 2022

Looks good - I like that your version will stop pruning if the daily update has stalled.

@aekiss (Contributor) left a comment:

looks good to me, thanks @micaeljtoliveira

@micaeljtoliveira merged commit e30acbb into COSIMA:master Jul 21, 2022
@micaeljtoliveira deleted the new_pipeline branch July 26, 2022 05:08
Labels: enhancement (New feature or request)
Successfully merging this pull request may close these issues: Offline DB updates
3 participants