
add timed check for expired files #358

Merged: 7 commits into vgteam:master on Nov 13, 2023

Conversation

@ducku (Collaborator) commented Oct 23, 2023

#352

  • Deletion of files past a certain expiration time
  • Read/write locks for cached files

@adamnovak (Member) left a comment:

This looks like a good start, but I don't think it handles the fact that chunks are downloaded into directories, rather than all being dumped as files in the root of DOWNLOAD_DATA_PATH.

We also don't have anything that handles the case where something is expiring at the same time as it is being used. The chunk download code assumes that if it sees the chunk files downloaded at the beginning of the request, they will still be there when it goes to open them and will not have been removed in the intervening time. If we have a delete job that can run in the background, that won't necessarily be true anymore.

We might need a way to mark a chunk as being used by a request (maybe with touch, plus an assumption that no request can last longer than the expiration time?) so it isn't deleted during a request that uses it. Or we need to make accesses to chunk data handle the case where the file goes away, and go get it again (which is harder).
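
A minimal sketch of the touch-based option: markChunkInUse is a hypothetical helper, and it assumes the expiration sweep compares against file timestamps and that no request outlives config.fileExpirationTime.

import fs from "fs";
import path from "path";

// Hypothetical helper: bump the timestamps on a chunk's files at the start of
// a request, so the expiration sweep sees the chunk as fresh while it is in use.
function markChunkInUse(chunkDir) {
  const now = new Date();
  for (const name of fs.readdirSync(chunkDir)) {
    fs.utimesSync(path.join(chunkDir, name), now, now);
  }
  fs.utimesSync(chunkDir, now, now);
}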

console.log("cron scheduled check");
const currentTime = new Date().getTime();
// loop through these specified directories
for (const dir of [DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH]) {
@adamnovak (Member) commented:

Should we maybe warn people near where these are configured that the tube map server needs to use them exclusively, and will delete anything else in there?

src/server.mjs Outdated
const currentTime = new Date().getTime();
// loop through these specified directories
for (const dir of [DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH]) {
  fs.readdir(dir, (err, files) => {
@adamnovak (Member) commented:

This only handles one flat directory, right? Don't we download files to different subdirectories?

We have an mkdirSync() here (sequenceTubeMap/src/server.mjs, lines 1285 to 1288 in 0280401):

// TODO: check if this chunk has been downloaded before, to prevent duplicate fetches
if (!fs.existsSync(chunkDir)) {
  fs.mkdirSync(chunkDir, { recursive: true });
}

But this cleanup code has nothing to loop through the subdirectories recursively, and no rmdir to remove the emptied directories.
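
A rough sketch of what such a recursive sweep could look like, assuming expiration is still judged by birthtime as in this PR; deleteExpiredRecursively is illustrative, not the PR's code.

import fs from "fs";
import path from "path";

function deleteExpiredRecursively(dir, cutoffMs) {
  for (const name of fs.readdirSync(dir)) {
    const entryPath = path.join(dir, name);
    const stats = fs.statSync(entryPath);
    if (stats.isDirectory()) {
      deleteExpiredRecursively(entryPath, cutoffMs);
      // rmdir the directory once the sweep has emptied it.
      if (fs.readdirSync(entryPath).length === 0) {
        fs.rmdirSync(entryPath);
      }
    } else if (Date.now() - stats.birthtime.getTime() >= cutoffMs) {
      fs.unlinkSync(entryPath);
    }
  }
}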

src/server.mjs Outdated
// get file statistics
fs.stat(filePath, (statErr, stats) => {
  const creationTime = stats.birthtime.getTime();
  if (currentTime - creationTime >= config.fileExpirationTime) {
@adamnovak (Member) commented:

When we properly support subdirectories, a parent directory might be expired before all its contents are, because the parent directory was created before the files in it were created. But it's not possible to delete the parent directory while there are things inside it. So we'd need to handle that somehow.

We might want to handle expiration at the chunk directory/individual file upload level, and remove whole chunk directories as a unit.
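
A sketch of that chunk-as-a-unit approach, with illustrative names, assuming each immediate subdirectory of the data path is one chunk. Expiring on the newest timestamp inside the chunk also sidesteps the parent-older-than-children problem.

import fs from "fs";
import path from "path";

function deleteExpiredChunks(rootDir, cutoffMs) {
  for (const name of fs.readdirSync(rootDir)) {
    const chunkDir = path.join(rootDir, name);
    if (!fs.statSync(chunkDir).isDirectory()) continue;
    // Judge expiration by the newest mtime anywhere in the chunk,
    // not by the directory's own (older) creation time.
    const newest = Math.max(
      fs.statSync(chunkDir).mtimeMs,
      ...fs.readdirSync(chunkDir).map(
        (f) => fs.statSync(path.join(chunkDir, f)).mtimeMs
      )
    );
    if (Date.now() - newest >= cutoffMs) {
      // Remove the whole chunk directory and its contents as a unit.
      fs.rmSync(chunkDir, { recursive: true, force: true });
    }
  }
}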

@adamnovak (Member) left a comment:

There's a missing await in the api.post("/getChunkedData") lockDirectories callback that is going to cause the lock to be released before the request has been fully handled.

Other than that, it looks like this should work. I've pointed out a couple of places where we don't explicitly state guarantees that we later rely on to avoid deadlocks or to keep critical sections protected by the locks. It might make sense to document those sections better.


if (fs.statSync(filePath).isFile()) {
  // check to see if file needs to be deleted
  const lastAccessedTime = fs.statSync(filePath).atime;
@adamnovak (Member) commented:

I don't think we can safely rely on the stored file access times maintained by the filesystem, because it is very common (or even the default behavior) for the file access times to be turned off or tracked very approximately, to improve performance.

See for example the section on the relatime mount option in the mount man page. By default, our demo server has the relatime option set on the filesystem where we run the tube map, which means that the access time on a file gets updated only if it is older than the creation or modification times, and is otherwise left alone when the file is accessed.

But if we protect this all with the right locks, using the access times shouldn't be any worse than using the creation times; we'll just expire based on upload time and not based on usage.

Comment on lines +198 to +202
// attempt to acquire a lock for the next directory, and call lockDirectories on the remaining directories
const currDirectory = directoryPaths.pop();
return lockDirectory(currDirectory, lockType, async function() {
  lockDirectories(directoryPaths, lockType, func);
})
@adamnovak (Member) commented:

Unless everyone is guaranteed to acquire locks in the same order, this pattern of "acquire each lock and do the thing when all are held" is a classic way to get deadlocks.

If lockDirectories(["dir1", "dir2"]) and lockDirectories(["dir2", "dir1"]) are running at the same time, you can have the first one lock dir1, the second one lock dir2, and then each one wait for the other to release its locks, with no progress being made, forever.

@adamnovak (Member) commented:

It looks like we do actually follow the required rule here: if locking both DOWNLOAD_DATA_PATH and UPLOAD_DATA_PATH at the same time, we always lock DOWNLOAD_DATA_PATH first. But we should probably officially make that a rule we know we have to follow (or maybe just sort the inputs in this function and demand that you always use this function when locking multiple directories at once?).
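
The sort-the-inputs option could be as small as this wrapper, sketched under the assumption that lockDirectories is the PR's helper:

// Hypothetical wrapper: sorting the paths enforces a single global lock order,
// which rules out the lock-ordering deadlock described above.
function lockDirectoriesOrdered(directoryPaths, lockType, func) {
  return lockDirectories([...directoryPaths].sort(), lockType, func);
}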

// attempt to acquire a write lock on each directory before attempting to delete files
for (const dir of [DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH]) {
  lockDirectory(dir, lockTypes.WRITE_LOCK, async function() {
    deleteExpiredFiles(dir);
@adamnovak (Member) commented:

This only works without an await because we know deleteExpiredFiles is totally synchronous and won't return until the files are gone. Maybe we should note that it makes that guarantee in its documentation comment? Otherwise it might seem safe to make it asynchronous later, when it isn't.
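
For example, the guarantee could be spelled out in a documentation comment along these lines (the wording is a suggestion, not the PR's text):

/**
 * Delete all expired files in dir. Fully synchronous: every deletion has
 * completed by the time this function returns, so callers may release the
 * directory's write lock immediately afterwards. Do not make this async
 * without auditing every caller that relies on that guarantee.
 */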

promise.catch(next);
lockDirectories([DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH], lockTypes.READ_LOCK, async function() {
  let promise = getChunkedData(req, res, next);
  promise.catch(next);
@adamnovak (Member) commented:

There's no await or return happening with promise, so this async function will have its result promise resolve (and will release the lock) before promise is finished.

Either we need to await the promise that getChunkedData returns, or we need to return it (or a promise derived from it) to avoid releasing the lock before it has actually finished.
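
Sketched against the quoted code, the await variant would look like this, keeping the locks held until the request handler actually finishes:

lockDirectories([DOWNLOAD_DATA_PATH, UPLOAD_DATA_PATH], lockTypes.READ_LOCK, async function() {
  // Awaiting keeps the read locks held for the full duration of the request.
  await getChunkedData(req, res, next).catch(next);
});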

package.json Outdated
@@ -29,8 +29,10 @@
"express": "^4.18.2",
"fs-extra": "^10.1.0",
"gh-pages": "^4.0.0",
"lockfile": "^1.0.4",
@adamnovak (Member) commented:

We no longer need to add this dependency.

@adamnovak merged commit 1f60e75 into vgteam:master on Nov 13, 2023
1 check passed