Feature-55: storeObject and Reference Files Refactor #56

Merged
merged 79 commits into from
Jan 24, 2024
87b961f
Refactor 'storeObject' to store objects with their content identifier…
doulikecookiedough Dec 16, 2023
430a3ab
Add 'tagObject', 'verifyObject' and 'findObject' to HashStore interfa…
doulikecookiedough Dec 16, 2023
fdd0abf
Add new synchronization ArrayList 'referenceLockedCids' and skeleton …
doulikecookiedough Dec 16, 2023
1360477
Refactor FileHashStore constructor for 'refs' directories and add rev…
doulikecookiedough Dec 17, 2023
e221fb2
Add all related code for 'tag_object' to throw new PidRefsFileExistsE…
doulikecookiedough Dec 17, 2023
f489370
Update 'tag_object' with new method 'writePidRefsFile' and add junit …
doulikecookiedough Dec 17, 2023
2c78315
Update 'tag_object' with new method 'writeCidRefsFile' and add new ju…
doulikecookiedough Dec 17, 2023
c4a2d3e
Update 'tag_object' to move tmp refs files to their permanent locatio…
doulikecookiedough Dec 17, 2023
5bab09f
Update 'tag_object' to verify tagging process with new method 'verify…
doulikecookiedough Dec 17, 2023
43f0123
Update 'tagObject' javadoc
doulikecookiedough Dec 17, 2023
3e93551
Change 'verifyHashStoreRefFiles' method name to 'verifyHashStoreRefsF…
doulikecookiedough Dec 18, 2023
9b2d597
Fix logging statement inaccuracies
doulikecookiedough Dec 18, 2023
f94f1ae
Add new custom exception class 'PidExistsInCidRefsFileException'
doulikecookiedough Dec 18, 2023
0c0d4ec
Finalize 'tagObject' method, update javadocs, add new method 'updateC…
doulikecookiedough Dec 18, 2023
a562bac
Fix bug in 'tagObject' where wrong synchronization variable is refere…
doulikecookiedough Dec 18, 2023
a389e67
Refactor 'tagObject', clean up code and update junit tests
doulikecookiedough Dec 18, 2023
162d35f
Implement 'findObject' method, update HashStore interface and add new…
doulikecookiedough Dec 18, 2023
6f40982
Refactor 'syncPutObject' to call 'tagObject', refactor 'getHexDigest'…
doulikecookiedough Dec 19, 2023
823fece
Update HashStore interface 'storeObject' javadoc and add new override…
doulikecookiedough Dec 19, 2023
20ed145
Remove unintended print statements
doulikecookiedough Dec 19, 2023
a7c53fb
Refactor 'putObject' to remove input validation, which is already don…
doulikecookiedough Dec 19, 2023
252aa2f
Implement 'storeObject' method with just an InputStream and add/refac…
doulikecookiedough Dec 19, 2023
3b9b028
Refactor 'getRealPath' and update all affected code and junit tests
doulikecookiedough Dec 19, 2023
cee2aab
Add new method 'deletePidRefsFile' and new junit tests
doulikecookiedough Dec 19, 2023
fd0ef31
Add new methods 'deleteCidRefsPid' and 'deleteCidRefsFile' and add ne…
doulikecookiedough Dec 19, 2023
053457d
Clean up code, revise logging levels and statements
doulikecookiedough Dec 19, 2023
9c880a9
Refactor 'deleteObject' to also remove the relevant reference files
doulikecookiedough Dec 20, 2023
3a484b4
Fix redundant variable names and add missing logging statement to 'fi…
doulikecookiedough Dec 20, 2023
5cb2815
Implement 'verifyObject' method, refactor 'validateTmpObject' and upd…
doulikecookiedough Dec 20, 2023
f2b8913
Add new 'verifyObject' junit test for mismatched object size
doulikecookiedough Dec 20, 2023
8d197ba
Finalize 'HashAddress' rename to 'ObjectMetadata' by updating the Obj…
doulikecookiedough Dec 20, 2023
a762502
Clean up code and fix minor bugs
doulikecookiedough Dec 20, 2023
b04d3c1
Update java version from 1.8 to 17
doulikecookiedough Dec 20, 2023
a1b675c
Refactor 'deleteObject' to share synchronization with 'tagObject' on …
doulikecookiedough Dec 20, 2023
8dfb5c0
Clean up code
doulikecookiedough Dec 20, 2023
cb7fef2
Update HashStore interface javadoc for accuracy
doulikecookiedough Dec 20, 2023
57025be
Update README.md
doulikecookiedough Dec 21, 2023
da118dd
Update README.md with missing comments regarding deleting objects
doulikecookiedough Dec 21, 2023
eb408bf
Refactor 'tagObject' update condition and revise junit tests
doulikecookiedough Dec 21, 2023
92701cb
Add missing 'Override' declarations and update HashStore interface
doulikecookiedough Dec 21, 2023
38ae659
Move 'getHierarchicalPathString' method to FileHashStoreUtility class…
doulikecookiedough Dec 21, 2023
2e8649a
Move 'generateTmpFile' method to FileHashStoreUtility class and refac…
doulikecookiedough Dec 21, 2023
ba15322
Clean up code, update javadocs and comments
doulikecookiedough Dec 21, 2023
2654f30
Refactor updating and deleting pid references in a cid refs file to b…
doulikecookiedough Dec 21, 2023
3c5ee0b
Clean up code
doulikecookiedough Dec 21, 2023
3404d7f
Update README.md with overview of HashStore, missing section on refer…
doulikecookiedough Dec 22, 2023
8a47dba
Clean up code
doulikecookiedough Dec 22, 2023
e50f55c
Update README.md with new section on reference files and revise wording
doulikecookiedough Dec 22, 2023
e4b553b
Fix bug in HashStoreClient with incorrect boolean type for hasArg for…
doulikecookiedough Dec 27, 2023
81dedae
Fix bug in 'deleteObject' where an object is deleted even if its cid …
doulikecookiedough Jan 2, 2024
2eae407
Refactor 'store_object' to only delete the associated tmp file when s…
doulikecookiedough Jan 9, 2024
4efd1fb
Reorganize methods in FileHashStore
doulikecookiedough Jan 11, 2024
79c18c8
Remove redundant addition of new lines when writing refs related files
doulikecookiedough Jan 11, 2024
87e15de
Clean up code and comments
doulikecookiedough Jan 11, 2024
5c8aa02
Refactor 'deleteObject' to streamline the process, revise junit tests…
doulikecookiedough Jan 17, 2024
ce85fed
Delete redundant exception class 'PidObjectExistsException' and revis…
doulikecookiedough Jan 17, 2024
48a26eb
Add 'findobject' api option to HashStoreClient and new junit test
doulikecookiedough Jan 17, 2024
a431e72
Add option 'gbskip' in HashStoreClient when testing in knbvm to contr…
doulikecookiedough Jan 17, 2024
449f0b9
Update README.md
doulikecookiedough Jan 17, 2024
1f88f68
Clean up logging statements
doulikecookiedough Jan 17, 2024
be0f4ff
Revise if statement in 'HashStoreClient' to determine whether an obje…
doulikecookiedough Jan 17, 2024
6b651d5
Remove redundant method 'deleteCidRefsFile' and clean up code
doulikecookiedough Jan 18, 2024
0be692c
Move 'getPidHexDigest' method to FileHashStoreUtility class and clean…
doulikecookiedough Jan 18, 2024
913e90a
Update javadocs
doulikecookiedough Jan 18, 2024
87bfa4f
Add '.close()' statement to finally block on given stream in 'writeTo…
doulikecookiedough Jan 18, 2024
762cfa1
Clean up/revise test classes '...InterfaceTest', '...ProtectedTest', …
doulikecookiedough Jan 18, 2024
e5a3701
Fix bug in HashStoreClient where code incorrectly calls '.getInteger(…
doulikecookiedough Jan 18, 2024
074dd9d
Update HashStore interface with missing access modifiers and add todo…
doulikecookiedough Jan 22, 2024
75865c7
Optimize updating/removing pids from cid refs files and add todo items
doulikecookiedough Jan 22, 2024
63e46b0
Rename 'ObjectMetadata' class's 'id' to 'cid' and update all affected…
doulikecookiedough Jan 22, 2024
da0d245
Refactor 'findObject' to throw custom exception classes and revise/ad…
doulikecookiedough Jan 23, 2024
119a4b8
Refactor 'deleteObject' to handle orphaned pid refs files, revise 'fi…
doulikecookiedough Jan 23, 2024
dac8a34
Swallow unnecessary exceptions in 'move' and 'deleteMetadata' methods…
doulikecookiedough Jan 23, 2024
16fa960
Refactor 'validateTmpObject' to return true when object has validated…
doulikecookiedough Jan 23, 2024
9e87484
Add and implement new 'storeObject' overload method for checksum, che…
doulikecookiedough Jan 23, 2024
ed4da4a
Add new 'deleteObject(String, boolean)' overload method for deleting …
doulikecookiedough Jan 23, 2024
2e51201
Refactor 'writeTo...Checksums' method to ensure we do not calculate r…
doulikecookiedough Jan 23, 2024
4400215
Update 'ObjectMetadata' class javadocs
doulikecookiedough Jan 24, 2024
a075d2d
Refactor 'verify_object' to directly compare values and revert change…
doulikecookiedough Jan 24, 2024
139 changes: 139 additions & 0 deletions README.md
@@ -15,6 +15,142 @@ DataONE in general, and HashStore in particular, are open source, community proj

Documentation is a work in progress, and can be found on the [Metacat repository](https://github.com/NCEAS/metacat/blob/feature-1436-storage-and-indexing/docs/user/metacat/source/storage-subsystem.rst#physical-file-layout) as part of the storage redesign planning. Future updates will include documentation here as the package matures.

## HashStore Overview

HashStore is a content-addressable file management system that uses the content identifier of an object to address files. The system stores objects, references (refs) and metadata in their respective directories and provides an API for interacting with the store. HashStore storage classes (like `FileHashStore`) must implement the HashStore interface to ensure the expected usage of HashStore.

###### Public API Methods
- storeObject
- verifyObject
- tagObject
- findObject
- storeMetadata
- retrieveObject
- retrieveMetadata
- deleteObject
- deleteMetadata
- getHexDigest

For details, please see the HashStore interface (`HashStore.java`).


###### How do I create a HashStore?

To create or interact with a HashStore, instantiate a HashStore object with the following set of properties:
- storePath
- storeDepth
- storeWidth
- storeAlgorithm
- storeMetadataNamespace

```java
String classPackage = "org.dataone.hashstore.filehashstore.FileHashStore";
Path rootDirectory = tempFolder.resolve("metacat");

Properties storeProperties = new Properties();
storeProperties.setProperty("storePath", rootDirectory.toString());
storeProperties.setProperty("storeDepth", "3");
storeProperties.setProperty("storeWidth", "2");
storeProperties.setProperty("storeAlgorithm", "SHA-256");
storeProperties.setProperty(
    "storeMetadataNamespace", "http://ns.dataone.org/service/types/v2.0"
);

// Instantiate a HashStore
HashStore hashStore = HashStoreFactory.getHashStore(classPackage, storeProperties);

// Store an object ('stream' is an InputStream and 'pid' a String, both supplied by the caller)
hashStore.storeObject(stream, pid);
// ...
```


###### Working with objects (store, retrieve, delete)

In HashStore, objects are first saved as temporary files while their content identifiers are calculated. Once the hashes for the default list of hash algorithms have been generated, each object is moved to its permanent location, which is derived from the hash value for the store's algorithm together with the store depth and the store width. Lastly, reference files are created for the object so that it can be found and retrieved given an identifier (e.g. a persistent identifier (pid)). Note: each object is stored once and only once.
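
As a rough illustration of how a hash value, a store depth and a store width can combine into a permanent location, here is a minimal sketch. The class `ShardedPathSketch` and its method `toRelativePath` are hypothetical names used only for this example and are not part of the HashStore API; the splitting rule is inferred from the example layout further down (depth 3, width 2).

```java
class ShardedPathSketch {
    /**
     * Splits a hex digest into `depth` directory tokens of `width` characters each,
     * followed by the remainder of the digest as the file name.
     * Hypothetical helper for illustration only.
     */
    static String toRelativePath(String hexDigest, int depth, int width) {
        if (depth < 0 || width <= 0 || hexDigest.length() <= depth * width) {
            throw new IllegalArgumentException("Digest is too short for the given depth and width");
        }
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            // Copy the next `width` characters and close the directory token
            path.append(hexDigest, i * width, (i + 1) * width).append('/');
        }
        // Whatever is left of the digest becomes the file name
        return path.append(hexDigest.substring(depth * width)).toString();
    }

    public static void main(String[] args) {
        // SHA-256 digest of the three bytes "abc"
        String cid = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad";
        // prints ba/78/16/bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
        System.out.println(toRelativePath(cid, 3, 2));
    }
}
```

Sharding the digest this way keeps any single directory from accumulating an unbounded number of entries, which is presumably why the store exposes depth and width as configuration.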

By calling the appropriate `storeObject` overload, the calling app/client can validate, store and tag an object in a single call if the relevant data is available. In the absence of an identifier (e.g. a persistent identifier (pid)), `storeObject` can be called solely to store an object. The client is then expected to call `verifyObject` when the relevant metadata is available to confirm that the object has been stored as expected. To finalize the process and make the object discoverable, the client calls `tagObject`. In summary, there are two expected paths to store an object:
```java
// All-in-one process which stores, validates and tags an object
ObjectMetadata objInfo = storeObject(InputStream, pid, additionalAlgorithm, checksum, checksumAlgorithm, objSize);

// Manual Process
// Store object
ObjectMetadata objInfo = storeObject(InputStream);
// Validate object, throws exceptions if there is a mismatch and deletes the associated file
verifyObject(objInfo, checksum, checksumAlgorithm, objSize);
// Tag object, makes the object discoverable (find, retrieve, delete)
tagObject(pid, cid);
```
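
To make the validation step more concrete, the following sketch compares a stream's checksum and size against expected values in a single pass. It is not the `verifyObject` implementation: `VerifySketch.matches` is a hypothetical helper that returns a boolean, whereas the README states that `verifyObject` throws exceptions on a mismatch and deletes the associated file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class VerifySketch {
    /** Reads the stream once and returns true only if both the size and the checksum match. */
    static boolean matches(InputStream data, String expectedHex, String algorithm, long expectedSize) {
        try (DigestInputStream in = new DigestInputStream(data, MessageDigest.getInstance(algorithm))) {
            byte[] buffer = new byte[8192];
            long size = 0;
            int read;
            // The digest is updated as a side effect of reading through the stream
            while ((read = in.read(buffer)) != -1) {
                size += read;
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : in.getMessageDigest().digest()) {
                hex.append(String.format("%02x", b));
            }
            return size == expectedSize && hex.toString().equalsIgnoreCase(expectedHex);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported algorithm: " + algorithm, e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        String sha256 = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad";
        System.out.println(matches(new ByteArrayInputStream(content), sha256, "SHA-256", 3)); // prints true
        System.out.println(matches(new ByteArrayInputStream(content), sha256, "SHA-256", 4)); // prints false
    }
}
```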

**How do I retrieve an object if I have the pid?**
- To retrieve an object, call the Public API method `retrieveObject` which opens a stream to the object if it exists.

**How do I find an object or check that it exists if I have the pid?**
- To find the location of the object, call the Public API method `findObject` which will return the content identifier (cid) of the object.
- This cid can then be used to locate the object on disk by following HashStore's store configuration.

**How do I delete an object if I have the pid?**
- To delete an object, call the Public API method `deleteObject` which will delete the object and its associated references and reference files where relevant.
- Note, `deleteObject` and `tagObject` calls are synchronized on their content identifier values so that the shared reference files are not unintentionally modified concurrently. An object that is in the process of being deleted should not be tagged, and vice versa. These calls have been implemented to occur sequentially to improve clarity in the event of an unexpected conflict or issue.
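
One way to picture synchronizing calls on a content identifier is a shared collection of "locked" cids guarded by a monitor. The sketch below is an assumption about the general technique, not the code in `FileHashStore`; `CidLockSketch`, `runExclusive` and `demo` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

class CidLockSketch {
    // Cids that currently have a tag or delete operation in progress
    private final Set<String> lockedCids = new HashSet<>();

    /** Runs the action while holding an exclusive claim on the given cid. */
    void runExclusive(String cid, Runnable action) {
        synchronized (lockedCids) {
            // Wait until no other thread is working on this cid
            while (lockedCids.contains(cid)) {
                try {
                    lockedCids.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for cid " + cid, e);
                }
            }
            lockedCids.add(cid);
        }
        try {
            action.run();
        } finally {
            // Always release the cid and wake up any waiting threads
            synchronized (lockedCids) {
                lockedCids.remove(cid);
                lockedCids.notifyAll();
            }
        }
    }

    /** Two threads update an otherwise unsynchronized counter under the same cid. */
    static int demo(int iterationsPerThread) {
        CidLockSketch locks = new CidLockSketch();
        int[] counter = {0};
        Runnable worker = () -> {
            for (int i = 0; i < iterationsPerThread; i++) {
                locks.runExclusive("example-cid", () -> counter[0]++);
            }
        };
        Thread tagger = new Thread(worker);
        Thread deleter = new Thread(worker);
        tagger.start();
        deleter.start();
        try {
            tagger.join();
            deleter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter[0];
    }
}
```

Because operations on different cids never wait on each other in this scheme, only calls that touch the same shared reference file are serialized.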


###### Working with metadata (store, retrieve, delete)

HashStore's '/metadata' directory holds all metadata for objects stored in HashStore. To differentiate between metadata documents for a given object, HashStore includes the 'formatId' (format or namespace of the metadata) when generating the address of the metadata document to store (the hash of the 'pid' + 'formatId'). By default, calling `storeMetadata` will use HashStore's default metadata namespace as the 'formatId' when storing metadata. Should the calling app wish to store multiple metadata files about an object, the client app is expected to provide a 'formatId' that represents an object format for the metadata type (ex. `storeMetadata(stream, pid, formatId)`).
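
The addressing rule described above can be sketched as follows. This assumes the hash is SHA-256 over the plain concatenation of the two strings encoded as UTF-8; `MetadataAddressSketch.address` is a hypothetical helper, and the actual algorithm and encoding used by HashStore may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class MetadataAddressSketch {
    /** Returns the hex digest of pid + formatId (assumed: SHA-256 over UTF-8 bytes). */
    static String address(String pid, String formatId) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest((pid + formatId).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String pid = "testpid1";
        String defaultFormat = address(pid, "http://ns.dataone.org/service/types/v2.0");
        String otherFormat = address(pid, "https://example.org/other-format"); // made-up formatId
        // The same pid yields a different address for each formatId
        System.out.println(defaultFormat.equals(otherFormat)); // prints false
    }
}
```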

**How do I retrieve a metadata file?**
- To retrieve a metadata object, call the Public API method `retrieveMetadata`, which returns a stream to the metadata file that has been stored with the default metadata namespace, if it exists.
- If there are multiple metadata objects, a 'formatId' must be specified when calling `retrieveMetadata` (ex. `retrieveMetadata(pid, formatId)`)

**How do I delete a metadata file?**
- Like `retrieveMetadata`, call the Public API method `deleteMetadata` which will delete the metadata object associated with the given pid.
- If there are multiple metadata objects, a 'formatId' must be specified when calling `deleteMetadata` to ensure the expected metadata object is deleted.


###### What are HashStore reference files?

HashStore assumes that every object to store has a respective identifier. This identifier is then used when storing, retrieving and deleting an object. In order to facilitate this process, we create two types of reference files:
- pid (persistent identifier) reference files
- cid (content identifier) reference files

These reference files are implemented in HashStore under the hood, with no expectation of modification by the calling app/client. The one and only exception to this process is when the calling client/app does not have an identifier and solely stores an object's raw bytes in HashStore (calling `storeObject(InputStream)`).

**'pid' Reference Files**
- Pid (persistent identifier) reference files are created when storing an object with an identifier.
- Pid reference files are located in HashStore's '/refs/pid' directory
- If an identifier is not available at the time of storing an object, the calling app/client must create this association between a pid and the object it represents by calling `tagObject` separately.
- Each pid reference file contains a string that represents the content identifier of the object it references
- Like how objects are stored once and only once, there is also only one pid reference file for each object.

**'cid' Reference Files**
- Cid (content identifier) reference files are created at the same time as pid reference files when storing an object with an identifier.
- Cid reference files are located in HashStore's '/refs/cid' directory
- A cid reference file is a list of all the pids that reference a cid, delimited by a new line ("\n") character
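
The two file formats can be illustrated with a small, simplified sketch (Java 11 or later). Everything here is an assumption made for illustration: `RefsFilesSketch` and its methods are hypothetical, reference files are placed directly under `refs/pid` and `refs/cid` rather than in sharded paths, and pid file names are URL-encoded pids rather than whatever naming HashStore actually uses.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class RefsFilesSketch {
    private final Path pidRefs;
    private final Path cidRefs;

    RefsFilesSketch(Path storeRoot) {
        pidRefs = createDir(storeRoot.resolve("refs").resolve("pid"));
        cidRefs = createDir(storeRoot.resolve("refs").resolve("cid"));
    }

    /** Convenience factory that places the sketch store in a fresh temporary directory. */
    static RefsFilesSketch inTempDir() {
        try {
            return new RefsFilesSketch(Files.createTempDirectory("hashstore-sketch"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the pid refs file (a single cid) and appends the pid to the cid refs file. */
    void tag(String pid, String cid) {
        try {
            Files.writeString(pidRefs.resolve(fileNameFor(pid)), cid);
            Files.writeString(cidRefs.resolve(cid), pid + "\n",
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Looks up the cid recorded in the pid's refs file. */
    String findCid(String pid) {
        try {
            return Files.readString(pidRefs.resolve(fileNameFor(pid))).trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lists every pid recorded in the cid's refs file, one per line. */
    List<String> pidsFor(String cid) {
        try {
            return Files.readAllLines(cidRefs.resolve(cid));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Simplification: a file-system-safe name derived from the pid
    private static String fileNameFor(String pid) {
        return URLEncoder.encode(pid, StandardCharsets.UTF_8);
    }

    private static Path createDir(Path dir) {
        try {
            return Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this sketch, tagging two pids against the same cid leaves two pid refs files that each contain the cid, and one cid refs file listing both pids on separate lines.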


###### What does HashStore look like?

```
# Example layout in HashStore with a single file stored along with its metadata and reference files.
# This uses a store depth of 3, with a width of 2 and "SHA-256" as its default store algorithm
## Notes:
## - Objects are stored using their content identifier as the file address
## - The reference file for each pid contains a single cid
## - The reference file for each cid contains multiple pids each on its own line

.../metacat/hashstore/
   └─ objects
       └─ /d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
   └─ metadata
       └─ /15/8d/7e/55c36a810d7c14479c9...b20d7df66768b04
   └─ refs
       └─ pid/0d/55/5e/d77052d7e166017f779...7230bcf7abcef65e
       └─ cid/d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
   hashstore.yaml
```


## Development build

HashStore is a Java package, and built using the [Maven](https://maven.apache.org/) build tool.
@@ -44,6 +180,9 @@ $ java -cp ./target/hashstore-1.0-SNAPSHOT.jar org.dataone.hashstore.HashStoreCl
# Get the checksum of a data object
$ java -cp ./target/hashstore-1.0-SNAPSHOT.jar org.dataone.hashstore.HashStoreClient -store /path/to/store -getchecksum -pid testpid1 -algo SHA-256

# Find an object in HashStore (returns its content identifier if it exists)
$ java -cp ./target/hashstore-1.0-SNAPSHOT.jar org.dataone.hashstore.HashStoreClient -store /path/to/store -findobject -pid testpid1

# Store a data object
$ java -cp ./target/hashstore-1.0-SNAPSHOT.jar org.dataone.hashstore.HashStoreClient -store /path/to/store -storeobject -path /path/to/data.ext -pid testpid1

4 changes: 2 additions & 2 deletions pom.xml
@@ -14,8 +14,8 @@

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
- <maven.compiler.source>1.8</maven.compiler.source>
- <maven.compiler.target>1.8</maven.compiler.target>
+ <maven.compiler.source>17</maven.compiler.source>
+ <maven.compiler.target>17</maven.compiler.target>
</properties>

<dependencies>