add timed check for expired files #358
Conversation
This looks like a good start, but I don't think it handles how chunks are downloaded into directories, and not just all dumped into files in the root of the `DOWNLOAD_DATA_PATH`.
We also don't have anything that handles the case where something is expiring at the same time as it is being used. The chunk download code assumes that if it sees the chunk files downloaded at the beginning of the request, they will still be there when it goes to open them and will not have been removed in the intervening time. If we have a delete job that can run in the background, that won't necessarily be true anymore.
We might need a way to mark a chunk as being used by a request (maybe with `touch` and an assumption that no request can last longer than the expiration time?) so it isn't deleted during a request that uses it, or we need to make accesses to chunk data handle the case where the file goes away and go get it again (which is harder).
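For the `touch` approach, something like this could work. This is just a sketch: `markChunkInUse`, and the idea of calling it at the start of request handling, are assumptions rather than existing code.

```js
import fs from "fs";

// Hypothetical helper: bump a chunk directory's timestamps (like `touch`) so
// the background cleanup job sees it as fresh. Relies on the assumption that
// no request lasts longer than config.fileExpirationTime.
function markChunkInUse(chunkDir) {
  const now = new Date();
  // utimesSync sets both the access and modification times
  fs.utimesSync(chunkDir, now, now);
}
```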
console.log("cron scheduled check"); | ||
const currentTime = new Date().getTime(); | ||
// loop through these specified directories | ||
for (const dir of [DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH]) { |
Should we maybe warn people near where these are configured that the tube map server needs to use them exclusively, and will delete anything else in there?
src/server.mjs (Outdated)
```js
const currentTime = new Date().getTime();
// loop through these specified directories
for (const dir of [DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH]) {
  fs.readdir(dir, (err, files) => {
```
This only handles one flat directory, right? Don't we download files to different subdirectories?
We have an `mkdirSync()` here (sequenceTubeMap/src/server.mjs, lines 1285 to 1288 in 0280401):

```js
// TODO: check if this chunk has been downloaded before, to prevent duplicate fetches
if (!fs.existsSync(chunkDir)) {
  fs.mkdirSync(chunkDir, { recursive: true });
}
```
But this cleanup code has nothing to loop through the subdirectories recursively, and no `rmdir` to remove the emptied directories.
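A minimal sketch of what that recursive cleanup might look like, assuming expiration is still decided per file and that `config.fileExpirationTime` is in scope as in the existing code:

```js
import fs from "fs";
import path from "path";

function deleteExpiredFilesRecursively(dir, currentTime) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const entryPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // clean out the subdirectory first, then remove it if it ended up empty
      deleteExpiredFilesRecursively(entryPath, currentTime);
      if (fs.readdirSync(entryPath).length === 0) {
        fs.rmdirSync(entryPath);
      }
    } else {
      const stats = fs.statSync(entryPath);
      if (currentTime - stats.birthtime.getTime() >= config.fileExpirationTime) {
        fs.unlinkSync(entryPath);
      }
    }
  }
}
```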
src/server.mjs (Outdated)
```js
// get file statistics
fs.stat(filePath, (statErr, stats) => {
  const creationTime = stats.birthtime.getTime();
  if (currentTime - creationTime >= config.fileExpirationTime) {
```
When we properly support subdirectories, a parent directory might be expired before all its contents are, because the parent directory was created before the files in it were created. But it's not possible to delete the parent directory while there are things inside it. So we'd need to handle that somehow.
We might want to handle expiration at the chunk directory/individual file upload level, and remove whole chunk directories as a unit.
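One way to do that, sketched under the assumption that each immediate child of the data directory is a chunk directory (or an individual uploaded file) that should live or die as a unit:

```js
import fs from "fs";
import path from "path";

// Find the newest modification time anywhere under a directory, so a chunk
// directory only expires once everything inside it has expired too.
function newestMtimeMs(dir) {
  let newest = fs.statSync(dir).mtimeMs;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const entryPath = path.join(dir, entry.name);
    newest = entry.isDirectory()
      ? Math.max(newest, newestMtimeMs(entryPath))
      : Math.max(newest, fs.statSync(entryPath).mtimeMs);
  }
  return newest;
}

function deleteExpiredEntries(root, currentTime) {
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    const entryPath = path.join(root, entry.name);
    const newest = entry.isDirectory()
      ? newestMtimeMs(entryPath)
      : fs.statSync(entryPath).mtimeMs;
    if (currentTime - newest >= config.fileExpirationTime) {
      // removes whole chunk directories as a unit
      fs.rmSync(entryPath, { recursive: true, force: true });
    }
  }
}
```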
There's a missing `await` in the `api.post("/getChunkedData")` `lockDirectories` callback, which is going to make the lock be released before the request has been fully handled.
Other than that, it looks like this should work. I've pointed out a couple places where we don't explicitly state guarantees that we later rely on to avoid deadlocks or to keep critical sections protected by the locks. It might make sense to document those sections better.
```js
if (fs.statSync(filePath).isFile()) {
  // check to see if file needs to be deleted
  const lastAccessedTime = fs.statSync(filePath).atime;
```
I don't think we can safely rely on the stored file access times maintained by the filesystem, because it is very common (or even the default behavior) for the file access times to be turned off or tracked very approximately, to improve performance.
See, for example, the section on the `relatime` mount option in the `mount` man page. By default, our demo server has the `relatime` option set on the filesystem where we run the tube map, which means that the access time on a file gets updated if it is older than the creation or modification times, but otherwise is left alone when the file is accessed.
But if we protect this all with the right locks, using the access times shouldn't be any worse than using the creation times; we'll just expire based on upload time and not based on usage.
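If we do want to keep using access times despite `relatime`, one hedge is to treat the newest of the recorded timestamps as the last-use time; a sketch:

```js
import fs from "fs";

// Under relatime the stored atime can lag behind real accesses, but it is
// never newer than the last real access; taking the max of all three
// timestamps gives a conservative "last touched" time to compare against
// the expiration window.
function lastTouchedMs(filePath) {
  const stats = fs.statSync(filePath);
  return Math.max(stats.atimeMs, stats.mtimeMs, stats.ctimeMs);
}
```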
```js
// attempt to acquire a lock for the next directory, and call lockDirectories on the remaining directories
const currDirectory = directoryPaths.pop();
return lockDirectory(currDirectory, lockType, async function() {
  lockDirectories(directoryPaths, lockType, func);
})
```
Unless everyone is guaranteed to acquire locks in the same order, this pattern of "acquire each lock and do the thing when all are held" is a classic way to get deadlocks.
If `lockDirectories(["dir1", "dir2"])` and `lockDirectories(["dir2", "dir1"])` are running at the same time, you can have the first one lock dir1, the second one lock dir2, and then each one wait for the other to release its locks, with no progress being made, forever.
It looks like we do actually follow the required rule here: if locking both `DOWNLOAD_DATA_PATH` and `UPLOAD_DATA_PATH` at the same time, we always lock `DOWNLOAD_DATA_PATH` first. But we should probably officially make that a rule we know we have to follow (or maybe just sort the inputs in this function and demand that you always use this function when locking multiple directories at once?).
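Sorting could look something like this; a sketch assuming the `lockDirectories(directoryPaths, lockType, func)` signature shown above:

```js
// Sort a copy of the paths so every caller acquires multiple locks in the
// same global order, regardless of the order it passed them in.
function lockDirectoriesInOrder(directoryPaths, lockType, func) {
  const ordered = [...directoryPaths].sort();
  return lockDirectories(ordered, lockType, func);
}
```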
```js
// attempt to acquire a write lock on each directory before attempting to delete files
for (const dir of [DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH]) {
  lockDirectory(dir, lockTypes.WRITE_LOCK, async function() {
    deleteExpiredFiles(dir);
```
This only works without an `await` because we know `deleteExpiredFiles` is totally synchronous and won't return until the files are gone. Maybe we should note that it makes that guarantee in its documentation comment? Otherwise it might seem safe to make it asynchronous later, when it isn't.
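That documentation comment might look something like this (a sketch; the real signature is whatever `deleteExpiredFiles` has in server.mjs):

```js
/**
 * Delete all expired files under the given directory.
 *
 * Guaranteed to be fully synchronous: by the time this returns, every expired
 * file is gone. The cron job relies on this when it releases the write lock
 * without awaiting anything, so don't make this asynchronous without
 * revisiting its callers.
 */
function deleteExpiredFiles(dir) { /* ... */ }
```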
```js
lockDirectories([DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH], lockTypes.READ_LOCK, async function() {
  let promise = getChunkedData(req, res, next);
  promise.catch(next);
```
There's no `await` or `return` happening with `promise`, so this `async` function will have its result promise resolve (and will release the lock) before `promise` is finished.

Either we need to `await` the promise that `getChunkedData` returns, or we need to return it (or a promise derived from it) to avoid releasing the lock before it has actually finished.
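The fix could be as small as returning the chained promise, so the `async` callback (and therefore the lock) stays held until `getChunkedData` settles; a sketch:

```js
lockDirectories([DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH], lockTypes.READ_LOCK, async function() {
  // returning the promise keeps the callback's result pending until
  // getChunkedData has actually finished (errors still go to next)
  return getChunkedData(req, res, next).catch(next);
});
```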
package.json (Outdated)

```diff
@@ -29,8 +29,10 @@
   "express": "^4.18.2",
   "fs-extra": "^10.1.0",
   "gh-pages": "^4.0.0",
+  "lockfile": "^1.0.4",
```
We no longer need to add this dependency. See #352.