Commit
Update README.md for clarity
doulikecookiedough committed Aug 9, 2024
1 parent 41375e0 commit 1da579c
Showing 1 changed file with 62 additions and 56 deletions.
118 changes: 62 additions & 56 deletions README.md
@@ -37,7 +37,7 @@ respective directories and utilizes an identifier-based API for interacting with
HashStore storage classes (like `FileHashStore`) must implement the HashStore interface to ensure
the expected usage of HashStore.

###### Public API Methods
### Public API Methods

- storeObject
- tagObject
@@ -49,9 +49,9 @@ the expected usage of HashStore.
- deleteMetadata
- getHexDigest

For details, please see the HashStore interface (HashStore.java)
For details, please see the HashStore interface [HashStore.java](https://github.com/DataONEorg/hashstore-java/blob/main/src/main/java/org/dataone/hashstore/HashStore.java)

###### How do I create a HashStore?
### How do I create a HashStore?

To create or interact with a HashStore, instantiate a HashStore object with the following set of
properties:
@@ -62,7 +62,7 @@ properties:
- storeAlgorithm
- storeMetadataNamespace

```
```java
String classPackage = "org.dataone.hashstore.filehashstore.FileHashStore";
Path rootDirectory = tempFolder.resolve("metacat");

@@ -79,18 +79,62 @@ storeProperties.setProperty(
HashStore hashStore = HashStoreFactory.getHashStore(classPackage, storeProperties);

// Store an object
hashStore.storeObject(stream, pid)
hashStore.storeObject(stream, pid);
// ...
```

###### Working with objects (store, retrieve, delete)
### What does HashStore look like?

```sh
# Example layout in HashStore with a single file stored along with its metadata and reference files.
# This uses a store depth of 3 (number of nested levels/directories - e.g. '/4d/19/81/' within
# 'objects', see below), with a width of 2 (number of characters used in directory name - e.g. "4d",
# "19" etc.) and "SHA-256" as its default store algorithm
## Notes:
## - Objects are stored using their content identifier as the file address
## - The reference file for each pid contains a single cid
## - The reference file for each cid contains multiple pids each on its own line
## - There are two metadata docs under the metadata directory for the pid (sysmeta, annotations)

.../metacat/hashstore
├── hashstore.yaml
└── objects
| └── 4d
| └── 19
| └── 81
| └── 71eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c
└── metadata
| └── 0d
| └── 55
| └── 55
| └── 5ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e
| └── 323e0799524cec4c7e14d31289cefd884b563b5c052f154a066de5ec1e477da7
| └── sha256(pid+formatId_annotations)
└── refs
├── cids
| └── 4d
| └── 19
| └── 81
| └── 71eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c
└── pids
└── 0d
└── 55
└── 55
└── 5ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e
```
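
The depth and width settings in the layout above can be illustrated with a small sketch. This is not FileHashStore's actual implementation: the class and method names (`ShardSketch`, `shard`) are hypothetical, and the code only shows how a hex digest could be split into the nested directories seen in the tree.

```java
import java.util.ArrayList;
import java.util.List;

class ShardSketch {
    // Hypothetical helper (not part of the HashStore API): split a hex digest
    // into `depth` directory names of `width` characters each, followed by
    // the remainder of the digest as the final path segment.
    static String shard(int depth, int width, String hexDigest) {
        if (hexDigest.length() <= depth * width) {
            throw new IllegalArgumentException("digest too short for depth * width");
        }
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < depth; i++) {
            parts.add(hexDigest.substring(i * width, (i + 1) * width));
        }
        parts.add(hexDigest.substring(depth * width));
        return String.join("/", parts);
    }

    public static void main(String[] args) {
        // The content identifier from the example layout, with depth 3 and width 2
        String cid = "4d198171eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c";
        System.out.println("objects/" + shard(3, 2, cid));
    }
}
```

With the digest from the example layout, this prints `objects/4d/19/81/71eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c`, matching the tree.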

### Working with objects (store, retrieve, delete)

In HashStore, objects are first saved as temporary files while their content identifiers are
calculated. Once the default hash algorithm list and their hashes are generated, objects are stored
in their permanent location using the store's algorithm's corresponding hash value, the store depth
and the store width. Lastly, reference files are created for the object so that they can be found
and retrieved given an identifier (ex. persistent identifier (pid)). Note: Objects are also stored
once and only once.
and the store width. Lastly, objects are 'tagged' with a given identifier (ex. persistent
identifier (pid)). This process produces reference files, which allow objects to be found and
retrieved with a given identifier.
- Note 1: An identifier can only be used once
- Note 2: Each object is stored once and only once, addressed by its content identifier (a checksum
  generated by a hashing algorithm). Clients that attempt to store a duplicate object will receive
  the expected ObjectMetadata, with HashStore handling de-duplication under the hood.
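
To make the "stored once and only once" behaviour concrete, here is a minimal in-memory sketch of content addressing. It assumes SHA-256 as the store algorithm, and `DedupSketch`, `store` and `sha256Hex` are hypothetical names for illustration only. HashStore itself works with files on disk and returns ObjectMetadata rather than a plain string.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class DedupSketch {
    // Hypothetical in-memory stand-in for the '/objects' directory, keyed by cid
    private final Map<String, byte[]> objects = new HashMap<>();

    // Compute the lowercase hex SHA-256 digest of the given bytes
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Store the bytes under their content identifier; a duplicate is not
    // written again, but the caller still receives the same cid.
    String store(byte[] data) {
        String cid = sha256Hex(data);
        objects.putIfAbsent(cid, data.clone());
        return cid;
    }

    int size() {
        return objects.size();
    }

    public static void main(String[] args) {
        DedupSketch sketch = new DedupSketch();
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        String first = sketch.store(content);
        String second = sketch.store(content);
        System.out.println(first.equals(second) + " " + sketch.size());
    }
}
```

Because the address is derived from the content alone, the second call finds the object already present and nothing is duplicated.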

By calling the various interface methods for `storeObject`, the calling app/client can validate,
store and tag an object simultaneously if the relevant data is available. In the absence of an
@@ -100,18 +144,18 @@ confirm that the object is what is expected. And to finalize the process (to mak
discoverable), the client calls `tagObject`. In summary, there are two expected paths to store an
object:

```
```java
// All-in-one process which stores, validates and tags an object
objectMetadata objInfo = storeObject(InputStream, pid, additionalAlgorithm, checksum, checksumAlgorithm, objSize)
ObjectMetadata objInfo = storeObject(InputStream, pid, additionalAlgorithm, checksum, checksumAlgorithm, objSize);

// Manual Process
// Store object
objectMetadata objInfo = storeObject(InputStream)
ObjectMetadata objInfo = storeObject(InputStream);
// Validate object, if the parameters do not match, the data object associated with the objInfo
// supplied will be deleted
- deleteInvalidObject(objInfo, checksum, checksumAlgorithn, objSize)
deleteInvalidObject(objInfo, checksum, checksumAlgorithm, objSize);
// Tag object, makes the object discoverable (find, retrieve, delete)
tagObject(pid, cid)
tagObject(pid, cid);
```

**How do I retrieve an object if I have the pid?**
@@ -132,7 +176,7 @@ tagObject(pid, cid)
implemented to occur sequentially to improve clarity in the event of an unexpected conflict or
issue.

###### Working with metadata (store, retrieve, delete)
### Working with metadata (store, retrieve, delete)

HashStore's '/metadata' directory holds all metadata for objects stored in HashStore. All metadata
documents related to a 'pid' are stored in a directory determined by calculating the hash of the
@@ -155,7 +199,7 @@ that represents an object format for the metadata type (ex. `storeMetadata(strea
which will delete the metadata object associated with the given pid.
- To delete all metadata objects related to a given 'pid', call `deleteMetadata(String pid)`
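
As a rough illustration of where a metadata document might live, the sketch below follows the layout example in this README: a directory derived from the hash of the pid, containing a document named by the hash of the pid plus the formatId. It assumes SHA-256, a depth of 3 and a width of 2. `MetadataPathSketch` and `metadataPath` are hypothetical names, and the real implementation may differ in its details.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class MetadataPathSketch {
    // Compute the lowercase hex SHA-256 digest of a string (UTF-8 encoded)
    static String sha256Hex(String value) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(value.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Assumed layout: metadata/<sharded sha256(pid)>/<sha256(pid + formatId)>,
    // using a fixed depth of 3 and width of 2 for the directory part.
    static String metadataPath(String pid, String formatId) {
        String dir = sha256Hex(pid);
        String doc = sha256Hex(pid + formatId);
        return "metadata/" + dir.substring(0, 2) + "/" + dir.substring(2, 4) + "/"
            + dir.substring(4, 6) + "/" + dir.substring(6) + "/" + doc;
    }

    public static void main(String[] args) {
        // The pid and formatId values here are made up for illustration
        System.out.println(metadataPath("example.pid.1", "example_format_id"));
    }
}
```

Under this assumption, all metadata documents for one pid share the same directory, and only the final file name changes with the formatId.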

###### What are HashStore reference files?
### What are HashStore reference files?

HashStore assumes that every object to store has a respective identifier. This identifier is then
used when storing, retrieving and deleting an object. In order to facilitate this process, we create
@@ -186,46 +230,8 @@ HashStore (calling `storeObject(InputStream)`).
- Cid (content identifier) reference files are created at the same time as pid reference files when
storing an object with an identifier.
- Cid reference files are located in HashStore's '/refs/cids' directory
- A cid reference file is a list of all the pids that reference a cid, delimited by a new line ("
\n") character

###### What does HashStore look like?

```sh
# Example layout in HashStore with a single file stored along with its metadata and reference files.
# This uses a store depth of 3, with a width of 2 and "SHA-256" as its default store algorithm
## Notes:
## - Objects are stored using their content identifier as the file address
## - The reference file for each pid contains a single cid
## - The reference file for each cid contains multiple pids each on its own line
## - There are two metadata docs under the metadata directory for the pid (sysmeta, annotations)

.../metacat/hashstore
├── hashstore.yaml
└── objects
| └── 4d
| └── 19
| └── 81
| └── 71eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c
└── metadata
| └── 0d
| └── 55
| └── 55
| └── 5ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e
| └── 323e0799524cec4c7e14d31289cefd884b563b5c052f154a066de5ec1e477da7
| └── sha256(pid+formatId_annotations)
└── refs
├── cids
| └── 4d
| └── 19
| └── 81
| └── 71eef969d553d4c9537b1811a7b078f9a3804fc978a761bc014c05972c
└── pids
└── 0d
└── 55
└── 55
└── 5ed77052d7e166017f779cbc193357c3a5006ee8b8457230bcf7abcef65e
```
- A cid reference file is a list of all the pids that reference a cid, delimited by a new line ("\n")
character
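
The newline-delimited format of a cid reference file can be sketched as follows. The helper names (`CidRefsSketch`, `parsePids`, `addPid`) are hypothetical and operate on the file's contents as a string. This only illustrates the format described above, not how HashStore reads or updates the files on disk.

```java
import java.util.ArrayList;
import java.util.List;

class CidRefsSketch {
    // Parse the contents of a cid reference file: one pid per line
    static List<String> parsePids(String content) {
        List<String> pids = new ArrayList<>();
        for (String line : content.split("\n")) {
            if (!line.isEmpty()) {
                pids.add(line);
            }
        }
        return pids;
    }

    // Return the contents with the pid appended on its own line,
    // leaving the contents unchanged if the pid is already listed
    static String addPid(String content, String pid) {
        if (parsePids(content).contains(pid)) {
            return content;
        }
        String base = content.isEmpty() || content.endsWith("\n") ? content : content + "\n";
        return base + pid + "\n";
    }

    public static void main(String[] args) {
        // Made-up pids for illustration
        String refs = addPid(addPid("", "example.pid.1"), "example.pid.2");
        System.out.println(parsePids(refs));
    }
}
```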

## Development Build
