Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature-57: deleteObject Refactor #58

Merged
merged 288 commits into from
Aug 16, 2024

Conversation

doulikecookiedough
Copy link
Contributor

@doulikecookiedough doulikecookiedough commented Jan 31, 2024

Summary of Changes:

  • Refactored existing deleteObject(String pid) to delete an object, its reference files and all associated metadata documents
  • Refactored existing deleteMetadata(String pid) to delete all associated metadata documents for the given pid
  • Refactored storeMetadata to store metadata for a given pid and formatId in a directory formed by calculating the hash of the given pid, with the document name being the hash of the formatId.
  • Refactored 'ObjectMetadata' class to include a pid in its constructor and updated storeObject methods with a pid
  • Added new overload method deleteObject(String idType, String id) to enable deletion of object, references files and metadata documents - or just the object itself
  • Added and revised junit tests
  • Updated HashStore interface javadocs and README
  • Cleaned up code to improve clarity

README.md Outdated
identifier (pid)). This process produces reference files, which allow objects to be found and
retrieved with a given identifier.
- Note 1: An identifier can only be used once
- Note 2: Objects are stored once and only once using its content identifier (a checksum generated
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent plural/singular: Objects are [...] using its

Suggest: Each object is stored once and only once using its content identifier

README.md Outdated
// Validate object, if the parameters do not match, the data object associated with the objInfo
// supplied will be deleted
- deleteInvalidObject(objInfo, checksum, checksumAlgorithn, objSize)
deleteInvalidObject(objInfo, checksum, checksumAlgorithn, objSize);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleteIfInvalidObject

Comment on lines 23 to 24
* disk using a given InputStream. Upon successful storage, the method returns a
* (ObjectMetadata) object containing relevant file information, such as the file's id
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*   [...] returns a
* (ObjectMetadata) object [...]

suggest:

returns an {@code ObjectMetadata} object

* The {@code storeObject} method is responsible for the atomic storage of objects to
* disk using a given InputStream. Upon successful storage, the method returns a
* (ObjectMetadata) object containing relevant file information, such as the file's id
* (which can be used to locate the object on disk), the file's size, and a hex digest
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(which can be used to locate the object on disk)

(which can be used by a system administrator -- but not by an API client -- to locate the object on disk)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOVE seeing a whole file be collapsed down to one line 🤣

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me too, it's so clean now!

objectInfo.getHexDigests(), "objectInfo.getHexDigests()", "deleteInvalidObject");
if (objectInfo.getHexDigests().isEmpty()) {
objectInfo.hexDigests(), "objectInfo.getHexDigests()", "deleteInvalidObject");
if (objectInfo.hexDigests().isEmpty()) {
throw new MissingHexDigestsException("Missing hexDigests in supplied ObjectMetadata");
}
FileHashStoreUtility.ensureNotNull(checksum, "checksum", "deleteInvalidObject");
FileHashStoreUtility.ensureNotNull(checksumAlgorithm, "checksumAlgorithm", "deleteInvalidObject");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleteIfInvalidObject

objectInfo.getHexDigests(), "objectInfo.getHexDigests()", "deleteInvalidObject");
if (objectInfo.getHexDigests().isEmpty()) {
objectInfo.hexDigests(), "objectInfo.getHexDigests()", "deleteInvalidObject");
if (objectInfo.hexDigests().isEmpty()) {
throw new MissingHexDigestsException("Missing hexDigests in supplied ObjectMetadata");
}
FileHashStoreUtility.ensureNotNull(checksum, "checksum", "deleteInvalidObject");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleteIfInvalidObject

@@ -521,12 +542,10 @@ public ObjectMetadata storeObject(InputStream object) throws NoSuchAlgorithmExce
// call 'deleteInvalidObject' (optional) to check that the object is valid, and then
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleteIfInvalidObject

Comment on lines 544 to 545
synchronizeObjectLockedCids(cid);
synchronizeReferenceLockedPids(pid);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a downside to moving it, but do see plenty of upside - unless I'm missing something. Your calling code would be simplified to this:

        try {
            storeHashStoreRefsFiles(pid, cid);

        } catch (HashStoreRefsAlreadyExistException hsrfae) {
            // * * * cid and pid already released * * *
            // This exception is thrown when the pid and cid are already tagged appropriately
            String errMsg =
                "HashStore refs files already exist for pid " + pid + " and cid: " + cid;
            throw new HashStoreRefsAlreadyExistException(errMsg);

        } catch (PidRefsFileExistsException prfe) {
            // * * * cid and pid already released * * *
            String errMsg = "pid: " + pid + " already references another cid."
                + " A pid can only reference one cid.";
            throw new PidRefsFileExistsException(errMsg);

        } catch (Exception e) {
            // * * * cid and pid already released * * *
            // Revert the process for all other exceptions
            unTagObject(pid, cid);
            throw e;
        }

...which I believe would work in exactly the same way. The only real difference is that if storeHashStoreRefsFiles() throws an exception, it will release the locks before your catch blocks, above, instead of after (assuming storeHashStoreRefsFiles() has a finally block, of course).

Looking at the code, it looks like that re-ordering of calls should not be a problem, but I LMK if I'm missing anything important

public HashStoreRunnable(HashStore hashstore, int publicAPIMethod, InputStream objStream,
String pid) {
FileHashStoreUtility.ensureNotNull(hashstore, "hashstore",
"HashStoreServiceRequestConstructor");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see (and have flagged) more inaccuracies. You're setting your future self up for a lifelong game of whack-a-mole :-)

Try this - I think it solves all your problems, without needing to pass the method name:

// example excerpt, from checkForNotEmptyAndValidString()
//
if (string.isBlank()) {
    StackTraceElement[] stackTraceElements = Thread.currentThread().getStackTrace();
    String msg = "Calling Method: " + stackTraceElements[2].getMethodName()
        + "(): argument cannot be empty, etc, etc...";
    throw new IllegalArgumentException(msg);
}

link to original thread, since GH does its best to confuse us

@doulikecookiedough
Copy link
Contributor Author

Thank you again @artntek for reviewing my PR. I believe I have addressed all your latest feedback. Regarding moving of the synchronization code... GitHub does want to confuse me indeed. I have made the change, thank you!

Please let me know if you have any other feedback, otherwise I will apply some auto-formatting and get this merged!

@artntek
Copy link

artntek commented Aug 15, 2024

Thank you again @artntek for reviewing my PR. I believe I have addressed all your latest feedback. Regarding moving of the synchronization code... GitHub does want to confuse me indeed. I have made the change, thank you!

Thanks - this looks a lot cleaner.

However, I would urge you to think through the process carefully one more time...

This execution path matches the one from your original code, i.e.:

  1. you released the pid lock
  2. then you call unTagObject()
  3. unTagObject() applies a new pid lock
    ...like this:
  } catch (Exception e) {
      // Revert the process for all other exceptions
      // We must first release the cid and pid since 'unTagObject' is synchronized
      // If not, we will run into a deadlock.
      releaseObjectLockedCids(cid);
      releaseReferenceLockedPids(pid);
      unTagObject(pid, cid);
      throw e;
  } finally {
      ...etc

My question:

Assuming a different thread could hijack the lock on this object between you releasing it (step 1, above) and re-applying it in unTagObject() (step 3, above) -- are there any scenarios where this could pose a problem? Could you end up untagging a valid tag written by the other thread, for example? or?

* (which can be used to locate the object on disk), the file's size, and a hex digest
* dict of algorithms and checksums. Storing an object with {@code store_object} also
* tags an object (creating references) which allow the object to be discoverable.
* (@Code ObjectMetadata) object containing relevant file information, such as the file's
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still need to change a to an ObjectMetadata

storeHashStoreRefsFiles(pid, cid);

} catch (HashStoreRefsAlreadyExistException hsrfae) {
// *** cid and pid already released ***
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove unless you really like it (I just included it in the example to provide a bit of explanation. It seems superfluous in the actual code)

// This exception is thrown when the pid and cid are already tagged appropriately
String errMsg =
"HashStore refs files already exist for pid " + pid + " and cid: " + cid;
throw new HashStoreRefsAlreadyExistException(errMsg);

} catch (PidRefsFileExistsException prfe) {
// *** cid and pid already released ***
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove unless you really like it (I just included it in the example to provide a bit of explanation. It seems superfluous in the actual code)

String errMsg = "pid: " + pid + " already references another cid."
+ " A pid can only reference one cid.";
throw new PidRefsFileExistsException(errMsg);

} catch (Exception e) {
// *** cid and pid already released ***
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove unless you really like it (I just included it in the example to provide a bit of explanation. It seems superfluous in the actual code)

@doulikecookiedough
Copy link
Contributor Author

doulikecookiedough commented Aug 15, 2024

Thank you @artntek! My thoughts on your question posed above:

My question:
Assuming a different thread could hijack the lock on this object between you releasing it (step 1, above) and re-applying it in unTagObject() (step 3, above) -- are there any scenarios where this could pose a problem? Could you end up untagging a valid tag written by the other thread, for example? or?

The purpose of unTagObject is to undo the tagging process if any unexpected exception occurs - if it is called, it must be executed no matter what. The only time unTagObject is called at this moment is through tagObject itself.

All calls to tagObject are synchronized (thread-safe) based on the cid and pid combination, but this doesn't mean the order in which they execute can be guaranteed if two threads are competing for the same lock.

Since we are using a combination here, beginning with the cid, then the pid, ultimately every action on a cid or pid will be executed. Whether we tagObject or unTagObject, we will always perform each action based on the state of the reference files (which are accounted for).

  • Note 1: tagObject also does not store duplicate pid references in the cid reference file, so there should never be a situation in which a pid appears twice.

Should an unexpected exception occur, even if two threads are competing for the lock, synchronization ensures that only one thread can proceed at a time with operations on the same cid and pid combination. Therefore, unTagObject will safely undo the process without affecting the state of HashStore.

  • Note 2: the most important reference file of all is the cid reference file, and access to this tracking document is always synchronized.

The worst case scenario is that clients will have to re-upload their data object, but if there is an unexpected exception occurring where we aren't storing as expected - we probably want to stop and take a look at that before proceeding any further.

What do you think?

@artntek
Copy link

artntek commented Aug 15, 2024

The worst case scenario is that clients will have to re-upload their data object, but if there is an unexpected exception occurring where we aren't storing as expected - we probably want to stop and take a look at that before proceeding any further.

What do you think?

Here's a scenario I was thinking of:

Thread A                    Thread B
========                    ========
   :                            :
uploads object X            uploads object X
   :                            :
(gets lock)                 (awaiting lock)
   :                            :
tagging obj X,                  :
FAILS ❌                        :
(returns lock)                  :
   :                        (gets lock)
(awaiting lock)                 :
   :                        TAGS obj X ✅
   :                        (returns lock)
(gets lock)                     :
UNTAGS obj X ✅                 :
(returns lock)                  :
   :                            :
   :                            :
               DONE
* * * Obj X is orphaned/not tagged ?? * * *

LMK if this is not possible/not an issue

try {
// If no exceptions are thrown, we proceed to synchronization based on the `cid`
synchronizeObjectLockedCids(cid);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch :-)

@doulikecookiedough
Copy link
Contributor Author

@artntek Your example is accurate - an orphaned data object can be produced. I have created a new issue to discuss and address it here. I added a potential solution and additional context - when you have a chance, can you please take a look and let me know what you think?

Many thanks again for your feedback and review comments! I am going to merge this feature into develop.

@doulikecookiedough doulikecookiedough merged commit 21bcd62 into develop Aug 16, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants