
add timed check for expired files #358

Merged: 7 commits into vgteam:master on Nov 13, 2023

Conversation

@ducku (Collaborator) commented Oct 23, 2023

#352

  • Deletion of files past a certain expiration time
  • Read/write locks for cached files

@adamnovak (Member) left a comment:

This looks like a good start, but I don't think it handles the fact that chunks are downloaded into directories, rather than all being dumped as files in the root of DOWNLOAD_DATA_PATH.

We also don't have anything that handles the case where something is expiring at the same time as it is being used. The chunk download code assumes that if it sees the chunk files downloaded at the beginning of the request, they will still be there when it goes to open them and will not have been removed in the intervening time. If we have a delete job that can run in the background, that won't necessarily be true anymore.

We might need a way to mark a chunk as being used by a request (maybe with touch, plus an assumption that no request can last longer than the expiration time?) so it isn't deleted during a request that uses it. Or we need to make accesses to chunk data handle the case where the file goes away, and go get it again (which is harder).
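
A minimal sketch of the touch-based option: markChunkInUse is a hypothetical helper, and it assumes the expiration sweep compares against file timestamps and that no request outlives config.fileExpirationTime.

import fs from "fs";
import path from "path";

// Hypothetical helper: bump the timestamps on a chunk's files at the start of
// a request, so the expiration sweep sees the chunk as fresh while it is in use.
function markChunkInUse(chunkDir) {
  const now = new Date();
  for (const name of fs.readdirSync(chunkDir)) {
    fs.utimesSync(path.join(chunkDir, name), now, now);
  }
  fs.utimesSync(chunkDir, now, now);
}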

console.log("cron scheduled check");
const currentTime = new Date().getTime();
// loop through these specified directories
for (const dir of [DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH]) {
@adamnovak (Member) commented:

Should we maybe warn people near where these are configured that the tube map server needs to use them exclusively, and will delete anything else in there?

src/server.mjs Outdated
const currentTime = new Date().getTime();
// loop through these specified directories
for (const dir of [DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH]) {
  fs.readdir(dir, (err, files) => {
@adamnovak (Member) commented:

This only handles one flat directory, right? Don't we download files to different subdirectories?

We have an mkdirSync() here (sequenceTubeMap/src/server.mjs, lines 1285 to 1288 in 0280401):

// TODO: check if this chunk has been downloaded before, to prevent duplicate fetches
if (!fs.existsSync(chunkDir)) {
  fs.mkdirSync(chunkDir, { recursive: true });
}

But this cleanup code has nothing to loop through the subdirectories recursively, and no rmdir to remove the emptied directories.
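
A rough sketch of what such a recursive sweep could look like, assuming expiration is still judged by birthtime as in this PR; deleteExpiredRecursively is illustrative, not the PR's code.

import fs from "fs";
import path from "path";

function deleteExpiredRecursively(dir, cutoffMs) {
  for (const name of fs.readdirSync(dir)) {
    const entryPath = path.join(dir, name);
    const stats = fs.statSync(entryPath);
    if (stats.isDirectory()) {
      deleteExpiredRecursively(entryPath, cutoffMs);
      // rmdir the directory once the sweep has emptied it.
      if (fs.readdirSync(entryPath).length === 0) {
        fs.rmdirSync(entryPath);
      }
    } else if (Date.now() - stats.birthtime.getTime() >= cutoffMs) {
      fs.unlinkSync(entryPath);
    }
  }
}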

src/server.mjs Outdated
// get file statistics
fs.stat(filePath, (statErr, stats) => {
  const creationTime = stats.birthtime.getTime();
  if (currentTime - creationTime >= config.fileExpirationTime) {
@adamnovak (Member) commented:

When we properly support subdirectories, a parent directory might be expired before all its contents are, because the parent directory was created before the files in it were created. But it's not possible to delete the parent directory while there are things inside it. So we'd need to handle that somehow.

We might want to handle expiration at the chunk directory/individual file upload level, and remove whole chunk directories as a unit.
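
A sketch of that chunk-as-a-unit approach, with illustrative names, assuming each immediate subdirectory of the data path is one chunk. Expiring on the newest timestamp inside the chunk also sidesteps the parent-older-than-children problem.

import fs from "fs";
import path from "path";

function deleteExpiredChunks(rootDir, cutoffMs) {
  for (const name of fs.readdirSync(rootDir)) {
    const chunkDir = path.join(rootDir, name);
    if (!fs.statSync(chunkDir).isDirectory()) continue;
    // Judge expiration by the newest mtime anywhere in the chunk,
    // not by the directory's own (older) creation time.
    const newest = Math.max(
      fs.statSync(chunkDir).mtimeMs,
      ...fs.readdirSync(chunkDir).map(
        (f) => fs.statSync(path.join(chunkDir, f)).mtimeMs
      )
    );
    if (Date.now() - newest >= cutoffMs) {
      // Remove the whole chunk directory and its contents as a unit.
      fs.rmSync(chunkDir, { recursive: true, force: true });
    }
  }
}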

@adamnovak (Member) left a comment:

There's a missing await in the api.post("/getChunkedData") lockDirectories callback that is going to cause the lock to be released before the request has been fully handled.

Other than that, it looks like this should work. I've pointed out a couple of places where we don't explicitly state guarantees that we later rely on to avoid deadlocks or to keep critical sections protected by the locks. It might make sense to document those sections better.


if (fs.statSync(filePath).isFile()) {
  // check to see if file needs to be deleted
  const lastAccessedTime = fs.statSync(filePath).atime;
@adamnovak (Member) commented:

I don't think we can safely rely on the stored file access times maintained by the filesystem, because it is very common (or even the default behavior) for the file access times to be turned off or tracked very approximately, to improve performance.

See for example the section on the relatime mount option in the mount man page. By default, our demo server has the relatime option set on the filesystem where we run the tube map, which means that the access time on a file gets updated only if it is older than the creation or modification times, and is otherwise left alone when the file is accessed.

But if we protect this all with the right locks, using the access times shouldn't be any worse than using the creation times; we'll just expire based on upload time and not based on usage.

Comment on lines +198 to +202
// attempt to acquire a lock for the next directory, and call lockDirectories on the remaining directories
const currDirectory = directoryPaths.pop();
return lockDirectory(currDirectory, lockType, async function() {
  lockDirectories(directoryPaths, lockType, func);
})
@adamnovak (Member) commented:

Unless everyone is guaranteed to acquire locks in the same order, this pattern of "acquire each lock and do the thing when all are held" is a classic way to get deadlocks.

If lockDirectories(["dir1", "dir2"]) and lockDirectories(["dir2", "dir1"]) are running at the same time, you can have the first one lock dir1, the second one lock dir2, and then each one wait for the other to release its locks, with no progress being made, forever.

@adamnovak (Member) commented:

It looks like we do actually follow the required rule here: if locking both DOWNLOAD_DATA_PATH and UPLOAD_DATA_PATH at the same time, we always lock DOWNLOAD_DATA_PATH first. But we should probably officially make that a rule we know we have to follow (or maybe just sort the inputs in this function and demand that you always use this function when locking multiple directories at once?).
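
The sort-the-inputs option could be as small as this wrapper, sketched under the assumption that lockDirectories is the PR's helper:

// Hypothetical wrapper: sorting the paths enforces a single global lock order,
// which rules out the lock-ordering deadlock described above.
function lockDirectoriesOrdered(directoryPaths, lockType, func) {
  return lockDirectories([...directoryPaths].sort(), lockType, func);
}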

// attempt to acquire a write lock on each directory before attempting to delete files
for (const dir of [DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH]) {
  lockDirectory(dir, lockTypes.WRITE_LOCK, async function() {
    deleteExpiredFiles(dir);
@adamnovak (Member) commented:

This only works without an await because we know deleteExpiredFiles is totally synchronous and won't return until the files are gone. Maybe we should note that it makes that guarantee in its documentation comment? Otherwise it might seem safe to make it asynchronous later, when it isn't.
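
For example, the guarantee could be spelled out in a documentation comment along these lines (the wording is a suggestion, not the PR's text):

/**
 * Delete all expired files in dir. Fully synchronous: every deletion has
 * completed by the time this function returns, so callers may release the
 * directory's write lock immediately afterwards. Do not make this async
 * without auditing every caller that relies on that guarantee.
 */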

promise.catch(next);
lockDirectories([DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH], lockTypes.READ_LOCK, async function() {
  let promise = getChunkedData(req, res, next);
  promise.catch(next);
@adamnovak (Member) commented:

There's no await or return happening with promise, so this async function will have its result promise resolve (and will release the lock) before promise is finished.

Either we need to await the promise that getChunkedData returns, or we need to return it (or a promise derived from it) to avoid releasing the lock before it has actually finished.
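
Sketched against the quoted code, the await variant would look like this, keeping the locks held until the request handler actually finishes:

lockDirectories([DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH], lockTypes.READ_LOCK, async function() {
  // Awaiting keeps the read locks held for the full duration of the request.
  await getChunkedData(req, res, next).catch(next);
});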

package.json Outdated
@@ -29,8 +29,10 @@
"express": "^4.18.2",
"fs-extra": "^10.1.0",
"gh-pages": "^4.0.0",
"lockfile": "^1.0.4",
@adamnovak (Member) commented:

We no longer need to add this dependency.

@adamnovak merged commit 1f60e75 into vgteam:master on Nov 13, 2023
1 check passed