Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor deleteObject for new requirements #57

Closed
doulikecookiedough opened this issue Jan 23, 2024 · 6 comments
Closed

Refactor deleteObject for new requirements #57

doulikecookiedough opened this issue Jan 23, 2024 · 6 comments
Assignees
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@doulikecookiedough
Copy link
Contributor

doulikecookiedough commented Jan 23, 2024

Currently, we have two deleteObject methods:

  • deleteObject(String pid)
    • This is the original Public API method
  • deleteObject(String cid, boolean deleteCid)
    • This is the new overload method meant for directly deleting an object (and nothing else)

These methods need to be refactored to minimize confusion and account for the following requirements below:

  1. We must be able to delete an object given a cid (content identifier)
    • Note, this should only delete the object if the cid reference file does not exist
  2. We must be able to delete an object given a pid (persistent identifier)
    • Note, this should remove the pid reference file, the pid from the cid reference file and the object itself if there are no more tags in the cid reference file
  3. We must be able to delete an object and all of its associated data (refs file, metadata docs) given a pid (persistent identifier)

To Do Items:

  1. Refactor deleteObject(String pid) to deleteObject(enum idType, String pid)
    public enum HashStoreDeletionTypes {

       cid("cid"), pid("pid");

       final String idType;

       HashStoreDeletionTypes(String idType) {
           identifierType = idType;
       }

       public String getType() {
           return idType;
       }
   }
  • If idType is pid, the method will handle requirement (1)
  • If idType is cid, the method will handle requirement (2)
  1. Add new Public API method deleteObjectAll(String pid) which handles the last requirement (3) (deleting object, relevant reference files and all related sysmeta to the given pid)
  • Note, we must now track all metadata format_id types utilized
  1. Update documentation
@doulikecookiedough doulikecookiedough self-assigned this Jan 23, 2024
@doulikecookiedough doulikecookiedough changed the title Refactor 'deleteObject' for new requirements Refactor deleteObject for new requirements Jan 23, 2024
@doulikecookiedough doulikecookiedough added documentation Improvements or additions to documentation enhancement New feature or request labels Jan 23, 2024
@doulikecookiedough
Copy link
Contributor Author

doulikecookiedough commented Jan 26, 2024

@mbjones Hoping to get some feedback regarding the following requirement on this issue:

We must be able to delete an object and all of its associated data (refs file, metadata docs) given a pid (persistent identifier)

Add new Public API method deleteObjectAll(String pid) which handles the last requirement (3) (deleting object, relevant reference files and all related sysmeta to the given pid)

Context:

In order to remove all associated metadata files, we need to be able to identify all the metadata namespaces (formatIds) used in HashStore. While I can track all formatIds and create methods to modify this list, the maintenance of this list based on the actual usage seems like a path that will become quite onerous to implement (whether it's done by HashStore or the client).

If I ignore the tracking of the actual usage of a formatId, and delegate this to the client to manage within HashStore (client would manage this list) - I would not have to worry about this list growing too large. But if the client fails to maintain this list, it could slow down the deletion process due to the need to check for the existence of many metadata documents each time. I do not think we should go down this path as it feels like we'll get some surprises of the unpleasant kind down the road.

After discussing with @taojing2002 - we thought that changing the way we store metadata can possibly help address and simplify this issue.

Proposal:

  • Modify the way we store metadata:
Current

// Calculate sha256 hash of the pid + formatId
sha256(pid + formatId) = 323e0799524cec4c7e14d31289cefd884b563b5c052f154a066de5ec1e477da7

// Location of metadata after following the HashStore config
.../metadata/32/3e/07/99524cec4c7e14d31289cefd884b563b5c052f154a066de5ec1e477da7
Proposed

// Metadata directory path
// Get the sha256 hash of the pid
sha256(pid) = 158d7e55c36a810d7c14479c952a4d0b370f2b844808f2ea2b20d7df66768b04

// Object name
// Get the md5 hash of the format_id (to produce a shorter string)
md5(format_id) = f4ea2d07db950873462a064937197b0f

// Location of metadata after following the HashStore config
// Structure without real values: .../metadata/sha256(pid)/md5(pid)
.../metadata/15/8d/7e/55c36a810d7c14479c952a4d0b370f2b844808f2ea2b20d7df66768b04/f4ea2d07db950873462a064937197b0f

Currently, a metadata document is stored at a permanent address that is calculated by getting the sha256 hash of the pid + formatId.

  • If we change this so that each pid has its own metadata directory, ex. /metadata/15/8d/7e/55c36a810d7c14479c952a4d0b370f2b844808f2ea2b20d7df66768b04/, then the actual metadata documents related to a pid can all be stored here
    • Each metadata document can then be named by calculating the hash of the formatId (ex. md5(formatId) = f4ea2d07db950873462a064937197b0f). I chose md5 here in this example because it will produce a shorter string than sha256.
  • To delete all metadata files associated with a given pid, we could delete all documents found in the metadata directory for the given pid. This means we wouldn't have to keep track of a metadata namespace list at all - at the cost a couple really long folder and filenames.

May we please get your thoughts here? I don't quite like how long the folder and object name can be, but as of writing this I am unable to come up with any simpler paths.

@mbjones
Copy link
Member

mbjones commented Jan 30, 2024

@doulikecookiedough Let's discuss during the backend meeting tomorrow please. Overall, I like the idea of storing the metadata for a PID under the hash of that PID -- I never really liked the hash(pid + formatid) because it makes it hard to navigate. that said, I would probably stick to sha256 throughout to simplify coding -- why mix it up over a few characters in a filename? But we can discuss. One other consideration would be whether to include a directory tree to break up the PID hashes so its not a flat structure (given that we've got 45M data files, we'd have 45M PIDs to hash, which is a very large single directory).

@mbjones
Copy link
Member

mbjones commented Jan 30, 2024

Oh, also... can we please reconsider the method name deleteObjectAll()? I don't see why it needs to be different from deleteObject(), so maybe we could discuss tomorrow?

@doulikecookiedough
Copy link
Contributor Author

doulikecookiedough commented Jan 30, 2024

@mbjones Sounds good RE: the naming of the delete all method (leave it as deleteObject). I will update this issue after our discussion this morning.

@doulikecookiedough
Copy link
Contributor Author

doulikecookiedough commented Feb 1, 2024

Update/Summary:

  • Metadata documents for a given pid are stored under a directory determined by calculating the hash of the given pid. The metadata document itself will be named by calculating the hash of the respective pid + formatId.
  • We will maintain the existing deleteObject method to reduce confusion
    • Deleting objects given a pid not only attempts to removes the object and the relevant refs files, but it also will attempt to remove all associated metadata.
  • The ObjectMetadata class will be amended to also have a pid attribute to assist in streamlining the client's store process.
    • By default, pid is null but methods are available to get/set the pid.
  • The delete process can possibly produce orphaned metadata documents, refs files and objects should any unexpected/uncontrollable error occur. Therefore, deleteObject needs to be refactored so that deletion happens altogether (all or none, or as atomic as possible).

@doulikecookiedough
Copy link
Contributor Author

This has been completed via Feature-57: deleteObject Refactor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants