Refactor `deleteObject` for new requirements #57

doulikecookiedough · 2024-01-23T22:16:16Z

Currently, we have two deleteObject methods:

deleteObject(String pid)
- This is the original Public API method
deleteObject(String cid, boolean deleteCid)
- This is the new overload method meant for directly deleting an object (and nothing else)

These methods need to be refactored to minimize confusion and account for the following requirements below:

We must be able to delete an object given a cid (content identifier)
- Note, this should only delete the object if the cid reference file does not exist
We must be able to delete an object given a pid (persistent identifier)
- Note, this should remove the pid reference file, the pid from the cid reference file and the object itself if there are no more tags in the cid reference file
We must be able to delete an object and all of its associated data (refs file, metadata docs) given a pid (persistent identifier)

To Do Items:

Refactor deleteObject(String pid) to deleteObject(enum idType, String pid)

    public enum HashStoreDeletionTypes {

       cid("cid"), pid("pid");

       final String idType;

       HashStoreDeletionTypes(String idType) {
           identifierType = idType;
       }

       public String getType() {
           return idType;
       }
   }

If idType is pid, the method will handle requirement (1)
If idType is cid, the method will handle requirement (2)

Add new Public API method deleteObjectAll(String pid) which handles the last requirement (3) (deleting object, relevant reference files and all related sysmeta to the given pid)

Note, we must now track all metadata format_id types utilized

Update documentation

The text was updated successfully, but these errors were encountered:

doulikecookiedough · 2024-01-26T00:00:23Z

@mbjones Hoping to get some feedback regarding the following requirement on this issue:

We must be able to delete an object and all of its associated data (refs file, metadata docs) given a pid (persistent identifier)

Add new Public API method deleteObjectAll(String pid) which handles the last requirement (3) (deleting object, relevant reference files and all related sysmeta to the given pid)

Context:

In order to remove all associated metadata files, we need to be able to identify all the metadata namespaces (formatIds) used in HashStore. While I can track all formatIds and create methods to modify this list, the maintenance of this list based on the actual usage seems like a path that will become quite onerous to implement (whether it's done by HashStore or the client).

If I ignore the tracking of the actual usage of a formatId, and delegate this to the client to manage within HashStore (client would manage this list) - I would not have to worry about this list growing too large. But if the client fails to maintain this list, it could slow down the deletion process due to the need to check for the existence of many metadata documents each time. I do not think we should go down this path as it feels like we'll get some surprises of the unpleasant kind down the road.

After discussing with @taojing2002 - we thought that changing the way we store metadata can possibly help address and simplify this issue.

Proposal:

Modify the way we store metadata:

Current

// Calculate sha256 hash of the pid + formatId
sha256(pid + formatId) = 323e0799524cec4c7e14d31289cefd884b563b5c052f154a066de5ec1e477da7

// Location of metadata after following the HashStore config
.../metadata/32/3e/07/99524cec4c7e14d31289cefd884b563b5c052f154a066de5ec1e477da7

Proposed

// Metadata directory path
// Get the sha256 hash of the pid
sha256(pid) = 158d7e55c36a810d7c14479c952a4d0b370f2b844808f2ea2b20d7df66768b04

// Object name
// Get the md5 hash of the format_id (to produce a shorter string)
md5(format_id) = f4ea2d07db950873462a064937197b0f

// Location of metadata after following the HashStore config
// Structure without real values: .../metadata/sha256(pid)/md5(pid)
.../metadata/15/8d/7e/55c36a810d7c14479c952a4d0b370f2b844808f2ea2b20d7df66768b04/f4ea2d07db950873462a064937197b0f

Currently, a metadata document is stored at a permanent address that is calculated by getting the sha256 hash of the pid + formatId.

If we change this so that each pid has its own metadata directory, ex. /metadata/15/8d/7e/55c36a810d7c14479c952a4d0b370f2b844808f2ea2b20d7df66768b04/, then the actual metadata documents related to a pid can all be stored here
- Each metadata document can then be named by calculating the hash of the formatId (ex. md5(formatId) = f4ea2d07db950873462a064937197b0f). I chose md5 here in this example because it will produce a shorter string than sha256.
To delete all metadata files associated with a given pid, we could delete all documents found in the metadata directory for the given pid. This means we wouldn't have to keep track of a metadata namespace list at all - at the cost a couple really long folder and filenames.

May we please get your thoughts here? I don't quite like how long the folder and object name can be, but as of writing this I am unable to come up with any simpler paths.

mbjones · 2024-01-30T02:20:23Z

@doulikecookiedough Let's discuss during the backend meeting tomorrow please. Overall, I like the idea of storing the metadata for a PID under the hash of that PID -- I never really liked the hash(pid + formatid) because it makes it hard to navigate. that said, I would probably stick to sha256 throughout to simplify coding -- why mix it up over a few characters in a filename? But we can discuss. One other consideration would be whether to include a directory tree to break up the PID hashes so its not a flat structure (given that we've got 45M data files, we'd have 45M PIDs to hash, which is a very large single directory).

mbjones · 2024-01-30T02:32:26Z

Oh, also... can we please reconsider the method name deleteObjectAll()? I don't see why it needs to be different from deleteObject(), so maybe we could discuss tomorrow?

doulikecookiedough · 2024-01-30T17:04:01Z

@mbjones Sounds good RE: the naming of the delete all method (leave it as deleteObject). I will update this issue after our discussion this morning.

doulikecookiedough · 2024-02-01T20:26:29Z

Update/Summary:

Metadata documents for a given pid are stored under a directory determined by calculating the hash of the given pid. The metadata document itself will be named by calculating the hash of the respective pid + formatId.
We will maintain the existing deleteObject method to reduce confusion
- Deleting objects given a pid not only attempts to removes the object and the relevant refs files, but it also will attempt to remove all associated metadata.
The ObjectMetadata class will be amended to also have a pid attribute to assist in streamlining the client's store process.
- By default, pid is null but methods are available to get/set the pid.
The delete process can possibly produce orphaned metadata documents, refs files and objects should any unexpected/uncontrollable error occur. Therefore, deleteObject needs to be refactored so that deletion happens altogether (all or none, or as atomic as possible).

doulikecookiedough · 2024-08-19T20:20:40Z

This has been completed via Feature-57: deleteObject Refactor

doulikecookiedough self-assigned this Jan 23, 2024

doulikecookiedough changed the title ~~Refactor 'deleteObject' for new requirements~~ Refactor deleteObject for new requirements Jan 23, 2024

doulikecookiedough added documentation Improvements or additions to documentation enhancement New feature or request labels Jan 23, 2024

This was referenced Jan 24, 2024

HashStore-java 1.1.0 Release #48

Open

HashStore 1.1.0 Release DataONEorg/hashstore#84

Merged

doulikecookiedough mentioned this issue Jan 26, 2024

Refactor delete_object DataONEorg/hashstore#87

Closed

doulikecookiedough closed this as completed Aug 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor `deleteObject` for new requirements #57

Refactor `deleteObject` for new requirements #57

doulikecookiedough commented Jan 23, 2024 •

edited

Loading

doulikecookiedough commented Jan 26, 2024 •

edited

Loading

mbjones commented Jan 30, 2024

mbjones commented Jan 30, 2024

doulikecookiedough commented Jan 30, 2024 •

edited

Loading

doulikecookiedough commented Feb 1, 2024 •

edited

Loading

doulikecookiedough commented Aug 19, 2024

Refactor deleteObject for new requirements #57

Refactor deleteObject for new requirements #57

Comments

doulikecookiedough commented Jan 23, 2024 • edited Loading

doulikecookiedough commented Jan 26, 2024 • edited Loading

mbjones commented Jan 30, 2024

mbjones commented Jan 30, 2024

doulikecookiedough commented Jan 30, 2024 • edited Loading

doulikecookiedough commented Feb 1, 2024 • edited Loading

doulikecookiedough commented Aug 19, 2024

Refactor `deleteObject` for new requirements #57

Refactor `deleteObject` for new requirements #57

doulikecookiedough commented Jan 23, 2024 •

edited

Loading

doulikecookiedough commented Jan 26, 2024 •

edited

Loading

doulikecookiedough commented Jan 30, 2024 •

edited

Loading

doulikecookiedough commented Feb 1, 2024 •

edited

Loading