Implement offline DB update #25
Conversation
Awesome! I would recommend creating the databases in a dedicated sub-directory with a symbolic link to the current DB at
In which case it would make sense to clone this repo at that location.
Andrew's suggestions are good IMO. So something like this:
Having a current DB (i.e. a symlink to the latest version) means this whole process could be run as-is without the symlink step. Set it running and let it update a few times, then add in the symlink step and it is done. I plan to make …
Did you also want to include keeping any snapshots at all? I'm not fussed either way, but we did discuss it in person, so I thought it should be documented as being on the table and explicitly not added at this stage if that was the decision.
Yes, that's a good idea. I'll change the script accordingly.
The way this is set up, there's no need to clone this repo. Jenkins will take care of that, and the update script will be submitted from the Jenkins workspace.
I think we should avoid always doing the update on the same file. If for some reason an update takes too long and a new one is started before the old one has finished, we might get some weird stuff happening. Instead, we can achieve the same level of protection for the live database by keeping the workflow I implemented, but changing the permissions of the time-stamped files. Assuming all the database files are read-only, this would look like this:
Currently steps 2 and 4 are missing, but are trivial to add.
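As a minimal sketch (not the author's actual script), the full sequence might look roughly like this in shell, assuming the DB_PATH and DB_LINK names that appear in the Jenkinsfile snippets later in this thread; run_update is a hypothetical stand-in for the indexing step, and steps 2 and 4 are the chmod calls referred to above:

DB="${DB_PATH}/daily/cosima_master_$(date +%Y-%m-%d).db"
cp "${DB_LINK}" "${DB}"      # 1. copy the current DB (cp follows the symlink)
chmod u+w "${DB}"            # 2. make the new copy writable
run_update "${DB}"           # 3. run the indexing update (hypothetical helper)
chmod a-w "${DB}"            # 4. make it read-only again
ln -sf "${DB}" "${DB_LINK}"  # 5. only now point the symlink at the new version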
Sounds good.
The way I implemented it, the step that updates the database does not delete any files, so all time-stamped files are kept. Pruning of old versions is done at the last stage of the pipeline. Currently this deletes files older than 2 weeks, as Andrew suggested.
…ission script now uses the environment variables set by pipeline to set the database name and path.
Good point. Scratch that.
Agreed. The latest DB file can be open for read-only access, but this shouldn't preclude a copy being made safely.
However, doesn't your point above also apply to this workflow? If you have a DB update in progress and another one starts, won't it try to copy the currently open (and being written to) DB file, as it is the one with the most recent timestamp? If so, that is an unsafe operation: the copied DB cannot be guaranteed to be in a safe state. However, if the Jenkins machinery won't allow another scheduled job to start until the previous one has finished, that should not be an issue; we wouldn't want more than one update to occur simultaneously in any case. You can't safely copy the currently updating file, so the next update would just end up re-indexing the same files the current update is struggling to get through.
No, this should not be an issue. The copy is done from the symlink, and the symlink is only updated after the update has been successful, so one never copies a database that has only been partially updated. The other possible issue is if we start two updates on the same day, as in that case both updates would try to create the same file, since I'm only using the day/month/year to time-stamp the files. In principle this should not happen, as the Jenkins job is scheduled to run once a day, but just in case, I've added a check for this and the job will abort if the file it's trying to create already exists.
With the above scheme, it's fine if a Jenkins job starts before the previous one has finished, as long as it starts the next day. Of course, it might not be good to let a job run for too long, but we do have a timeout in place, as the PBS script sets a walltime.
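A hedged sketch of that same-day check (the DB_PATH name and the cosima_master file naming are assumptions taken from the later snippets in this thread):

DB="${DB_PATH}/daily/cosima_master_$(date +%Y-%m-%d).db"
# Abort if today's time-stamped file already exists, so two runs started
# on the same day can never write to the same database file
if [ -e "${DB}" ]; then
    echo "${DB} already exists, aborting update" >&2
    exit 1
fi
cp "${DB_LINK}" "${DB}"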
Just to summarize a bit. The way I implemented this, the following should never occur:
This should shield the users from any issues in the update. The problematic case that could still occur is if the indexing takes so long that the PBS job terminates before the update is finished. I think this case requires some form of human intervention, i.e., someone needs to see what is going on and decide an appropriate way of fixing the problem. Does this sound reasonable?
Thanks for clarifying.
Yes. But I don't think there is any use-case where it would make sense to have one indexing job run before the previous one has finished, because the second job would replicate the same indexing as the currently running one, since they would both be starting from the same DB state.
Once COSIMA/cosima-cookbook#292 is merged it will lessen the impact of a terminated job, because the indexing up to that point should be saved in the DB. So a subsequent indexing job will not be certain to fail, at least if the "failure" was caused by running over some time limit. P.S. This assumes that a "failed" job (one that ran out of time) will still update the DB. We might have to differentiate between failure modes: running out of time is fine; failing and leaving the DB corrupted, not so much.
Makes sense. I've just changed the pipeline configuration to forbid concurrent builds.
In that case I need to find a way to determine why a job failed within Jenkins. Maybe the information could be recovered from the PBS logs. I need to check this.
…ate does not exist yet. If non concurrent Jenkins runs are allowed, then it should be safe to restart an indexing job from an unfinished one.
The indexing is only done in a PBS job because memory and time blew out. Memory might not be such an issue with COSIMA/cosima-cookbook#292, but there are some time limits on interactive jobs, or there used to be. I'll run a test overnight to update the copy of the DB I've been regenerating, to see how long it will run. If it can be done interactively that might solve your issue: it can't time out, and you can check the returned error code.
So a single process ran overnight without a problem on a login node; based on that, it should be possible to run this interactively on a login node. The only thing to consider then is making sure you have a multi-node execution agent, so this doesn't block any other tasks you have, as it might run for multiple days.
I should say that one advantage of not using PBS jobs is that the indexing job isn't then dependent on there being enough SUs remaining in the project to run. It is fairly common for projects to exhaust their quota a day or two before the end of a quarter, so jobs fail until the quota is refreshed.
@aidanheerdegen I've been having some fun today with Jenkins and PBS, and I'm starting to somehow dislike Jenkins :( Anyway, it looks like a PBS job that reaches the walltime does not return an error. On the other hand, if an error occurs during the script's execution, then it does return an error, so it's very easy to tell those two cases apart. This means we could keep using PBS jobs without changing the Jenkins pipeline logic. If a job reaches the walltime, we can still make the resulting database available to users, so that they can access any new files indexed overnight, and keep indexing the missing files the next time the job runs. In any case, I wouldn't run the update without a timeout of some sort.
…ral steps to run at the end of the pipeline, depending on the result of the update. Change the name of the PBS logs, so that we can easily archive them with Jenkins.
Sure. I just wanted to make it clear I only started using PBS jobs because of resource (memory) limits. If those are solved then it might be simpler to revert to running on the login node. Totally up to you of course. I have an updated script that I was using in my testing; I'll put it in a PR (#26) and you can use it or not as you like.
I just had a look at #26 and I like your changes. I'll give it a try with the Jenkins pipeline running the update interactively and see how it goes. In the meantime I realized that I was misled by Jenkins (bad Jenkins!) about the exit status of a PBS job that exceeds the walltime, so the script would have to be a bit more complicated.
Use parallel to allow simultaneous indexing and shuffling inputs. Set file chunks to 1000.
…o pass the database to be updated through the command line.
…hout using PBS. We are still setting a time limit for the update, so that updates that run for longer than a reasonable amount of time don't go unnoticed. An update that fails to complete because of the time limit will mark the build as unstable.
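As a hedged illustration of that time limit (update_db.sh is a hypothetical name for the update script, and 6h an assumed limit), GNU timeout makes the two outcomes easy to tell apart:

timeout 6h ./update_db.sh "${DB}"
status=$?
if [ "${status}" -eq 124 ]; then
    # timeout(1) exits with 124 when the time limit is hit: the database is
    # usable but not fully up to date, so flag the build as unstable
    echo "update timed out" >&2
elif [ "${status}" -ne 0 ]; then
    # any other non-zero status is a genuine error: fail the build
    exit "${status}"
fi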
- Do not allow the database to be updated twice during the same day.
- Remove the new database file if an update fails.
- Remove the new database file if the run is aborted by a user.
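One hedged way to implement the last two points in shell (not necessarily how it is done in the pipeline) is a trap that removes the partially written file when the script errors out or is interrupted:

# DB as in the sketches above; the ERR condition requires bash
trap 'rm -f "${DB}"' ERR INT TERM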
At the risk of over-loading this PR, I wonder if there is some merit in taking into account #28
Take whatever you like from those comments. Happy for you to merge regardless.
Looks good! Just need @aekiss to approve and good to merge.
Jenkinsfile
// Update was successful, so we can now update the symlink to point to this new version of the database
sh "ln -sf ${DB} ${DB_LINK}"
// Prune old versions of the database that have not been accessed in the last 14 days.
sh 'find ${DB_PATH}/daily -type f -atime +14 -name "cosima_master_????-??-??.db" -exec rm -fv {} \\;'
should this also ensure that at least the last N>1 databases are maintained, in case the jenkins update is paused for a couple of weeks?
That's a good question. I agree that having a situation where only one copy of the database is left can be risky. Would N=7 be acceptable?
sounds good, that should be plenty
Done.
Jenkinsfile
    successulDBUpdate()
}
unstable {
    successulDBUpdate()
do we really want to prune old DBs and link to an unstable DB? I'd have thought we should keep trying to update it but not make it available to users until successfully updated without error
Here unstable simply means that the update timed out, but without any other errors. The reason I'm treating it differently is so that the people responsible for the update (I guess that's the two of us) are aware that the database is not fully up to date and can check what is going on in case it keeps taking too long to update. Otherwise, I assumed there was no problem in making a more up-to-date database available to users, even if it's not fully up to date. But as I wrote above, it's up to you to decide what is best to make available to users.
Yep, agreed as above
6cfe5c2 won't guarantee that 7 files are retained - if there are >7 files but they're all >14d old, they'll all be deleted.
Also on 2nd thoughts it would be better to retain the last 14 (not 7) files, for consistency with the date-based criterion which should keep 14 files when updates are happening daily.
How about something like this? (NB: untested!)
sh '''shopt -s failglob
for f in $(ls -1t ${DB_PATH}/daily/cosima_master_????-??-??.db | tail -n +15); do
    find ${f} -type f -atime +14 -name "cosima_master_????-??-??.db" -exec rm -fv {} \\;
done'''
If wanting to keep the last 14 files regardless of age, the tail -n +15 above takes care of that. If there is a need to keep any DB that has been accessed within 14 days in case someone is using it (which they shouldn't, as this isn't supported), then loop over the list of files to delete and only do so if the file has not been accessed in the last 14 days.
We want to do the latter, i.e. keep every file accessed in the last 14 days, plus the most recent 14 files.
Unless I've misunderstood, I think my code implements what you suggest for this case.
My apologies, yes it does. And it seems it could all be collapsed into
find $(ls ${DB_PATH}/daily/cosima_master_????-??-??.db | tail -n 15) -atime +14 -delete
but maybe others find that less readable?
Ah that's neater, and I think more readable. It would need to be this though (with ls -t to list newest first and tail -n +15 to skip the newest 14 files):
find $(ls -t ${DB_PATH}/daily/cosima_master_????-??-??.db | tail -n +15) -atime +14 -delete
Oops, that was a silly mistake. I like the collapsed form better, so I think I'll go for that.
Unfortunately this does not work if there are 14 or fewer database files: in that case the subshell returns an empty string, and the find command will then try to delete all files under the current directory that have not been accessed in the last 14 days.
Here is my take on this:
It has the advantage of avoiding the "for" loop from @aekiss's original suggestion, which I find confusing, as it calls find on each file individually.
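The command itself was not captured above, but a hedged reconstruction with these properties (keep the newest 14 files, prune the rest only once unused for 14 days, and do nothing when there are 14 or fewer files) might look like this:

old_dbs=$(ls -t ${DB_PATH}/daily/cosima_master_????-??-??.db | tail -n +15)
# tail -n +15 yields an empty list when there are 14 or fewer files; the guard
# below stops find from falling back to the current directory in that case
if [ -n "${old_dbs}" ]; then
    find ${old_dbs} -atime +14 -delete
fi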
I love bash (I don't love bash)
Looks good - I like that your version will stop pruning if the daily update has stalled.
looks good to me, thanks @micaeljtoliveira
Closes #24
This implements the offline update of the DB roughly as discussed in #24. The different steps are implemented in a Jenkins pipeline. The Jenkins configuration file is now part of this repository.
The update procedure goes like this:
Some design decisions worth mentioning:
- The time-stamped database files are kept in a sub-directory named daily (as the update is done daily).