Skip to content

Bugfix Deadlock when getting the lock #31

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Nov 18, 2024

Conversation

mberacochea
Copy link
Member

This is to prevent a deadlock if the code can't get the lock file.

Example:
ERROR 2024/11/14 03:55:11 PM - lifetime has expired, breaking
ERROR 2024/11/14 03:55:11 PM - lockfile exists but isn't safe to break: /hps/nobackup/rdf/metagenomics/service-team/production/automated_jobs/ERP1432/ERP143252/download.lock

If this happens the process hangs there indefinitely. With the timeout the code will fail and exit with a Timeout error at least.

I think we need to review this lock files... we can probably remove them (or at least add unit tests to check the code is doing what we expect)

…-exists

The system is retrying even if the file already exists.
This is to prevent a deadlock if the code can't get the lock file.

Example:
ERROR 2024/11/14 03:55:11 PM - lifetime has expired, breaking
ERROR 2024/11/14 03:55:11 PM - lockfile exists but isn't safe to break: /hps/nobackup/rdf/metagenomics/service-team/production/automated_jobs/ERP1432/ERP143252/download.lock

If this happens the process hangs there indefinitely. With the timeout the code will fail and exit with a Timeout error at least
@mberacochea mberacochea self-assigned this Nov 14, 2024
Copy link
Member

@SandyRogers SandyRogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mberacochea
Strange that this is happening on HPS as well as NFS. We could also try using portalocker instead (in theory it works on NFS, although in practice for EMG API concurrent log handler locking, we tell it to use a non-NFS directory for the lock files). Anyway, this looks good to make things fail explicitly.

@mberacochea
Copy link
Member Author

Thanks @mberacochea Strange that this is happening on HPS as well as NFS. We could also try using portalocker instead (in theory it works on NFS, although in practice for EMG API concurrent log handler locking, we tell it to use a non-NFS directory for the lock files). Anyway, this looks good to make things fail explicitly.

Yeah, it is weird. The library that we are using is supposed to handle NFS like filesystems. We will have to investigate at some point.

@mberacochea mberacochea merged commit a7981fa into develop Nov 18, 2024
6 checks passed
@mberacochea mberacochea deleted the bugfix/deadlock-when-getting-the-lock branch November 18, 2024 12:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants