DIP-273 Content hash semantics and format normalization #273

Closed
wesbiggs opened this issue Mar 2, 2024 · 20 comments

@wesbiggs (Member) commented Mar 2, 2024

Abstract

We propose treating the url field in DSNP announcements as a hint only, allowing consumers to treat as valid any content that matches the hash field.

We normalize usage of hashes throughout the specification to use the base32-encoded multihash string format.

Motivation

DSNP announcements reference off-chain content using a URL and hash. Current instructions for content consumers are to retrieve a content file using its URL, and then verify its hash. These instructions tie the content to a particular retrieval location via HTTPS. If that location (hostname or IP address) is unavailable, temporarily or permanently, there is no sanctioned means of retrieving the content.

We want to make DSNP more robust by making it possible for consumers to find the relevant data on other filesystems where data can be cached, replicated and distributed. This allows service providers to optimize for higher content availability and (potentially) lower latency, and crucially, for users to self-host their own content as an alternative or backup to hosting by external service providers.

Specification Pull Request

#280

Rationale

The url still provides a useful function and will often be the original and quickest way to retrieve the desired content, so it remains.

Backwards Compatibility

url usage can remain the same, but applications can now treat url as a suggestion or hint as to where to find the content matching the hash.

The ability to update the url of an announcement (for those announcement types that support Update Announcements) is unaffected, because the DSNP Content URI of an announcement is based on the hash field.

Reference Implementation and/or Tests

TBD

Security Considerations

Merely retrieving a file by its content address (CID) does not guarantee that its hash matches. Consumers should ensure that the actual CID matches, e.g. by recalculating it from the retrieved data (though this is often done by lower-level libraries).
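
As a minimal sketch of this kind of check, assuming the JavaScript multiformats package and a single-block raw/sha2-256 CID (the function name is illustrative, not part of the spec):

```ts
// Sketch: verify that retrieved bytes actually match an expected CIDv1.
// Assumes the "multiformats" npm package; names here are illustrative only.
import { CID } from "multiformats/cid";
import { sha256 } from "multiformats/hashes/sha2";
import * as raw from "multiformats/codecs/raw";

export async function bytesMatchCid(bytes: Uint8Array, expected: string): Promise<boolean> {
  const expectedCid = CID.parse(expected);
  // This sketch only handles single-block (raw, sha2-256) CIDs; chunked
  // dag-pb content would require re-running the original chunking process.
  if (expectedCid.code !== raw.code || expectedCid.multihash.code !== sha256.code) {
    return false;
  }
  const digest = await sha256.digest(bytes);
  return CID.create(1, raw.code, digest).equals(expectedCid);
}
```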

Dependencies

None

References

Copyright

Copyright and related rights waived via CC0.

@wesbiggs (Member Author) commented Apr 6, 2024

We cannot expect consumers to try all possible options for CIDs, which can have various bases, hashes, and codecs, not to mention chunk sizes.

Minimally we should allow for the defaults currently generated by ipfs add --cid-version 1, which is base32 sha2-256 dag-pb for chunked files > 256*1024 bytes, and base32 sha2-256 raw for files that fit in a single 256kb chunk.

I suggest we standardize on:

  • Bases: base32 or base58btc (though it is trivial with the multiformats library to support others)
  • Hashes: sha2-256 only (this is the only option implemented in the @ipld/dag-pb javascript library at present)
  • Codecs: dag-pb or raw (with raw only useful when file size is <= 256kb).

Edited to add: We also have to make assumptions about chunk size. The simplest approach is to try with 256kb chunks, but it's worth noting that this is merely a default in common IPFS utilities. Do these "impedance mismatches" make the whole idea of treating CIDs as hashes untenable? IMO enough of the content files DSNP will be addressing will fit in one chunk to make it useful for comparisons, even if we can't categorically expect to be able to prove that a given byte array is equivalent to a given CID (i.e. there is a chance of false negatives). However, the pragmatic goal is to be able to use CIDs in announcements and have them be directly useful for locating content on IPFS. To accomplish this, it seems reasonable to mandate a particular set of parameters for CID creation.
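
For illustration, here is a sketch of deriving the raw sha2-256 CIDv1 for a single-chunk file, assuming the multiformats package; whether it matches the output of ipfs add --cid-version 1 depends on the default chunking behavior discussed above, which is an assumption rather than a guarantee:

```ts
// Sketch: derive the raw sha2-256 CIDv1 for a small (single-chunk) file.
// Uses the "multiformats" npm package; the 256 KiB limit mirrors the common
// IPFS chunker default discussed above and is an assumption, not a guarantee.
import { CID } from "multiformats/cid";
import { sha256 } from "multiformats/hashes/sha2";
import * as raw from "multiformats/codecs/raw";

const SINGLE_CHUNK_LIMIT = 256 * 1024;

export async function rawCidV1(bytes: Uint8Array): Promise<CID> {
  if (bytes.length > SINGLE_CHUNK_LIMIT) {
    // Larger files get chunked into a dag-pb tree by ipfs add, so a simple
    // hash of the full byte array will not reproduce that CID.
    throw new Error("File exceeds a single 256 KiB chunk; raw CID will not match ipfs add output");
  }
  const digest = await sha256.digest(bytes);
  return CID.create(1, raw.code, digest); // cid.toString() defaults to base32 for v1
}
```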

@wilwade (Member) commented May 2, 2024

DSNP Spec Community Call Notes for 2024-05-02

  • Is it worthwhile to keep the url field? When would the url be used?
    • Before the content is widely available on IPFS
    • If the content is not available on IPFS
    • What if the CID and url have different data? CID wins
  • There must always be a way to verify the content retrieved via the url or via the network

@wesbiggs (Member Author) commented:

Thanks for the notes. Agree with the commentary on url usage. Note that nothing in the proposed spec requires the use of IPFS. The presence of a CID does not imply that a document with that CID is or will be available on IPFS.

There must always be a way to verify the content retrieved via the url or via the network

Per my musings earlier in the thread, "always" is tricky with CIDs. We can say that if the CID was generated using common conventions for parameters, we should be able to verify it. But it is possible for the original CID creator to chunk a file in unexpected ways that would make regenerating the same CID virtually impossible without knowledge of the exact chunking parameters. We need to decide if this possible ambiguity is acceptable.

@gobengo commented Jul 1, 2024

Minimally we should allow for the defaults currently generated by ipfs add --cid-version 1, which is base32 sha2-256 dag-pb for chunked files > 256*1024 bytes, and base32 sha2-256 raw for files that fit in a single 256kb chunk.

I strongly recommend using a hash function with an internal merkle tree, rather than hashing the root created by a separate chunking-and-merkle-encoding process. For example, what do you think of using blake3 only, regardless of the file contents or size?

My understanding is that anything encoded with dag-pb is going to lack the ability to deterministically encode a file to a checksum. This is because dag-pb encodes something that is already in the IPLD data model, i.e. it is an encoding for DAGs. But we're not talking about DAGs; we're talking about files. Thinking in terms of an encoding for DAGs (as dag-pb is) begs the question: "How do you encode the file as a DAG?" and, ideally, "How do you deterministically encode the file as a DAG?" Just saying "dag-pb it" is not enough to deterministically produce a checksum unless you also explain how to deterministically encode a bytestream as a DAG, and neither dag-pb nor any other IPFS spec says how to do that. blake3 does: unlike dag-pb, it specifies how to deterministically chunk the input bytestream into the leaves of a merkle tree.

IMO if you avoid dag-pb and IPLD entirely and just blake3-hash the file bytes, you will avoid a lot of ambiguity and get much better deduplication: instead of any given bytestream having a CID for each possible way of chunking it into the leaves of a DAG, there will be a clear deterministic process to derive a single CID for each bytestream ("just blake3 hash it").
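
As a sketch of what "just blake3 hash it" could look like in practice, assuming the @noble/hashes and multiformats packages and 0x1e as the blake3 multicodec code (names are illustrative):

```ts
// Sketch: hash the full file bytes with blake3 and wrap the digest as a
// multihash / raw CIDv1. Assumes the "@noble/hashes" and "multiformats"
// packages; 0x1e is the registered multicodec code for blake3.
import { blake3 } from "@noble/hashes/blake3";
import { CID } from "multiformats/cid";
import * as Digest from "multiformats/hashes/digest";
import * as raw from "multiformats/codecs/raw";

const BLAKE3_CODE = 0x1e;

export function blake3RawCid(bytes: Uint8Array): CID {
  // blake3 handles its own internal chunking; no IPLD DAG encoding is needed.
  const digest = Digest.create(BLAKE3_CODE, blake3(bytes)); // 32-byte default output
  return CID.create(1, raw.code, digest);
}
```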

@wesbiggs (Member Author) commented Jul 1, 2024

Oh, I like that a lot. I wasn't aware blake3 raw as a CID would handle its own chunking. I will run some tests.

There was previously some concern about mainstream library support for blake3, but that may have improved since I last discussed it.

Do you know of any other projects with similar requirements moving this direction?

@gobengo commented Jul 1, 2024

Do you know of any other projects with similar requirements moving this direction?

Here are some projects using exclusively blake3 for hashing

Here is an IETF-formatted specification https://github.com/BLAKE3-team/BLAKE3-IETF/blob/main/rfc-b3/draft-aumasson-blake3-00.txt

@wesbiggs (Member Author) commented Jul 3, 2024

It's happening rather soon, but I'll encourage discussion of this on our monthly DSNP spec Zoom call tomorrow July 3 at 9am Pacific/12pm Eastern. Link and previous meeting recordings here: https://vimeo.com/showcase/dsnp-public-spec-meeting, if you'd like to join.

I think we can distill this feedback into a couple of proposals:

  1. Don't explicitly support CIDs; add Blake3 as a supported hash algorithm, let consumers build CIDs from hashes (@gobengo)
  2. Support CIDs in an even more opinionated way; consider Lucid's narrow spec i.e. raw + Blake3 only (@bumblefudge)

I'll extrapolate two underlying themes: (1) that using dag-pb (and quite possibly dag-cbor) starts to pull in and require a lot of dependencies from the IPLD world beyond raw hash functionality (as I found when experimenting with https://github.com/LibertyDSNP/dsnp-hash-util); (2) that there is a lot of love for Blake3 in the interdwebs (and why not). :-)

To consider: If DSNP-announced content is constrained to less than 256Kb, DAG issues can be avoided, and simple hashes can be transformed to CIDs (and vice versa) easily. This is a pretty reasonable constraint to impose on Activity Content Note and Profile content, though less so on media attachments (which are, however, not directly announced and can incorporate multiple hashes into the format).

Also worth noting that we still have the fallback of Update Announcements, which can be used to notify the network to change the anchor for a post from (say) a hash to a CID, or from one hash algorithm to another.

A synthesized proposal (possibly one that makes no one happy, but worth a shot):

  • add CIDv1 as proposed, but constrained to the "raw" codec (and hence <256kb if IPFS is used); for the embedded hash, allow only sha2-256 (Kubo default) and blake3 (Iroh, Lucid).
  • also add blake3 as a supported hash algorithm (alongside existing sha2-256 and blake2b multihashes)

Corollary question -- do we need to support blake2b, or do we act fast and go for blake3 instead?

@gobengo commented Jul 4, 2024 via email

@wesbiggs (Member Author) commented:

Summary of Discussion from 2024-07-03 Public DSNP Spec Meeting

  • Video here: https://vimeo.com/showcase/11090945/video/980924182 (discussion begins at about 19:54)

  • Overview of current DIP pull request diff

  • Discussion of proposals noted above in this thread

  • Discussion of usage trends of blake3 in the IPFS community

  • Discussion of DSNP design goals and needs:

    • data compression for on-chain data
    • content integrity
    • DSNP does not necessarily need to depend on a specific hash algorithm or legacy IPFS conventions
  • Discussion of "mainnet" IPFS's underlying 2MB maximum chunk size (though default produced with many tools is 256KB)

    • Up to 2MB, CIDs using raw and multihashes can be transformed into each other without difficulty and hosted on IPFS
    • Above 2MB, files on IPFS must use chunking codecs like dag-pb and dag-cbor. The hash used in the CID becomes the root of a tree of hashes for the chunks.
  • Additional points raised:

    • CIDs are also used for on-chain reference to Parquet (batch) files, which would be larger than Activity Content files, but may have an enforceable maximum size below 2MB
    • Future possibilities for inclusion of ZKPs might require larger files

@wesbiggs (Member Author) commented:

As a summary, we are trying to:

  1. Provide a cryptographically strong permanent content integrity hash for a resource
  2. Enable a resource to be found on a distributed file system (DFS) without binding the content to a specific DNS name or IP address
  3. Avoid endorsing (at the protocol level) a particular DFS
  4. Allow DSNP-implementing systems to define which DFSs they use; in particular, enable Frequency to continue to work with IPFS "mainnet", but with an eye toward forward compatibility with next-generation DFSs.
  5. Keep URLs from being written to the consensus system (in particular, for profileResources and Parquet batches)

From the discussions I have come to a somewhat reluctant conclusion:

  • Specifying the possibility of dag-pb brings a number of otherwise unneeded dependencies into DSNP-supporting clients
  • A blake3 raw CIDv1 adds no descriptive value over a blake3 multihash, other than signaling that the file is available on a DFS that uses CIDs.
  • In addition, it takes more bytes per resource (minimally, the CIDv1 indicator and the raw indicator).
  • If a client wants to look on IPFS for a resource matching a hash, the transform from multihash to raw CIDv1 is straightforward (see the sketch at the end of this comment).

So, I am going to update my proposal to:

  • Remove blake2b-256 and add blake3 in Supported Hashing Algorithms (sha2-256 will be retained)
  • Endorse the capability for consuming applications to use the hash to inform a DFS lookup
  • Alter ProfileResource to have only a multihash, not a CID, and mandate that the system (e.g. DSNP over Frequency) must define how and where to resolve a hash to fetch a resource
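
To illustrate the multihash-to-CID transform mentioned above, a minimal sketch assuming the multiformats package (illustrative only, and only useful for content small enough to be a single raw block on the target network):

```ts
// Sketch: turn a DSNP multihash (sha2-256 or blake3 over the file bytes)
// into a raw CIDv1 suitable for an IPFS lookup. Assumes the "multiformats"
// package; valid only for content that fits in a single IPFS block.
import { CID } from "multiformats/cid";
import * as Digest from "multiformats/hashes/digest";
import * as raw from "multiformats/codecs/raw";

export function multihashToRawCid(multihashBytes: Uint8Array): string {
  const digest = Digest.decode(multihashBytes);
  return CID.create(1, raw.code, digest).toString(); // base32 by default for v1
}
```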

@mfosterio commented:

Have you considered digging into https://DID-DHT.com? It's based on the BitTorrent Kademlia Mainline DHT and PKARR (Public Key Addressable Resource Records, i.e. sovereign TLDs) https://github.com/Pubky/pkarr. DID-DHT has an indexed type registry https://did-dht.com/registry/#indexed-types whose entries are not only private and discoverable but also based on open linked data vocabularies that can be queried by SPARQL/RDF tools like Comunica https://bit.ly/json-ld-query and other JavaScript libraries https://rdf.js.org. When a DID-DHT resolves to a DWN, you can host and control your own data, and DWNs can message one another as well. You can run a DWN server anywhere and they can sync your data. https://codesandbox.io/p/github/TBD54566975/dwn-server/main Since DIDs are part of the Verifiable Credentials spec, they can also be verified, which in turn could create a verifiable WebTorrent network much like PeerTube. There are a ton of features packed into this little DID spec.

@mfosterio commented:

Here is the GitHub repository; I'd be interested in your feedback: https://github.com/TBD54566975/did-dht

@wesbiggs (Member Author) commented:

Hi @mfosterio that looks interesting. I'm coming in cold so apologies for any misunderstandings. Am I right to think of PKARR as providing similar functionality for Mainline DHT as the IPNS over DNS approach does for IPFS? So then did:dht is a way of wrapping PKARR in DIDs so standard resolvers can be built?

How do you think this could intersect with this DSNP issue/question which at this point is how can we use a hash (or perhaps some other non-human-readable identifier) to find a particular resource across various distributed file systems?

@mfosterio commented Jul 25, 2024

Yes, its goal is to be a peer-to-peer distributed addressing system on the BitTorrent Mainline DHT. TBD is working on a web ecosystem where it will plug right into a DIF DWN (Decentralized Web Node) as a resource that can be spun up anywhere and synced.

https://identity.foundation/decentralized-web-node/spec/

https://github.com/TBD54566975/dwn-server

https://codesandbox.io/p/github/TBD54566975/dwn-server/main

They are aiming to build a developer framework that allows anyone to implement any protocol to solve a problem in the DWeb space. It's worth looking into to see if it helps any of your technical goals.

@mfosterio commented:

A lot of your previous technical discussion around CIDs and IPLD is handled with DAG-CBOR and discussed in the messaging section of the DWN Spec https://identity.foundation/decentralized-web-node/spec/#messages, and DWNs can message one another via did:dht, did:key, did:web, or any other DID method. This project implementation is evolving at TBD, so some of it may change with implementation feedback.

The goal is to address resources in several different ways so that if one resource goes down, the others retain the messaging. A DWN references all the methods it can be messaged by, along with its permissions.

A desirable property of a DID is that its addressing is retained in a peer-to-peer system. Mainline DHT records are not retained indefinitely, so they are currently working on a retention challenge https://did-dht.com/#term:retention-challenges to maintain the retention set https://did-dht.com/#term:retention-set and republish the resource address bindings https://did-dht.com/#republishing-data.

@mfosterio commented:

One thing to note is that it's not a good idea to publicly broadcast an IPFS CID or "pin" it to be distributed on peers you don't know on a social network. If a user posts something by accident and wants to remove it, the IPFS network makes it challenging to reel unwanted content back in.

A good quality of the Mainline DHT and DWNs is that records in the DHT have a limited duration (approximately 2 hours). This is a better scenario for dereferencing and removing content in distributed systems. You can republish the references you are sure you want to keep https://did-dht.com/#republishing-data, back them up to IPFS, and reference them in your DWN with read-only permissions for yourself; this allows you to rebuild the content as long as you have the CID references to the IPFS files backed up locally. These are options I'm outlining in my LDUX concept.

Good qualities about Mainline DHT are the following:

  • It has a proven track record of 15 years.
  • It is the biggest DHT in existence, with an estimated 10 million servers.
  • It retains data for multiple hours at no cost.
  • It has been implemented in most languages and is stable.

@wesbiggs (Member Author) commented:

Hi @mfosterio, thanks for the additional info.

In most cases we do want semi-permanent pinning with DSNP. However, it's definitely worth looking at mutable addressing options (IPNS variants or BEP46); for something like a user profile document, they provide a great way to reduce consensus system transactions. I'm hesitant to suggest that the spec mandate specific distributed file systems that must be supported in order to work with DSNP, though. Similarly, key management at social media user scale becomes a challenge. Overall, though, I think it would be beneficial to understand whether DWNs can play a role in the architecture.

@wesbiggs (Member Author) commented:

I updated @dsnp/hash-util to align with the current proposed change (sha256 and blake3 hashes, base32 only) and minimized its dependencies so it doesn't require IPLD or indeed any multiformats libraries. (The code is fairly trivial, in fact, but it's nice to have a reference implementation.)
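
For illustration only (this is not the @dsnp/hash-util source), here is a sketch of how small a dependency-free base32 multihash encoder can be, assuming 32-byte digests and the single-byte multicodec codes for sha2-256 (0x12) and blake3 (0x1e):

```ts
// Sketch of a dependency-light base32 multihash encoder (illustrative, not
// the actual @dsnp/hash-util source). A multihash for a 32-byte digest is
// [algorithm code, 0x20, ...digest]; both codes here fit in a single varint byte.
// The multibase base32 form prepends "b" to the unpadded RFC 4648 lowercase encoding.
const BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

function base32(bytes: Uint8Array): string {
  let bits = 0, value = 0, out = "";
  for (const byte of bytes) {
    value = (value << 8) | byte;
    bits += 8;
    while (bits >= 5) {
      out += BASE32_ALPHABET[(value >>> (bits - 5)) & 31];
      bits -= 5;
    }
  }
  if (bits > 0) out += BASE32_ALPHABET[(value << (5 - bits)) & 31];
  return out;
}

export function toBase32Multihash(algorithmCode: 0x12 | 0x1e, digest: Uint8Array): string {
  const multihash = new Uint8Array([algorithmCode, digest.length, ...digest]);
  return "b" + base32(multihash);
}
```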

I think we need to decide whether a hash-only solution (no URLs) is sufficient for profileResources and batch announcements. (As it stands in the current proposal, individual announcements still have URLs and content hashes.) There is concern that a mere hash does not provide enough information about where a client can go to find the content, and that even a CID assumed to be on IPFS may not do so in a way that meets expectations around latency.

The options as I see it:

  1. Leave proposal as is, consumers are responsible for figuring out how to find content matching hashes.
  2. Add a URL component to these data structures to make the preferred low-latency lookup location explicit.
  3. Something like 2, but consider certain restrictions that help prevent on-chain URLs from being directly human readable, such as compressing the URL, encoding the URL in a different base, or using a decentralized link shortener.
  4. Let providers define their own gateway URL: that is, a template URL that the hash can be inserted into. For profileResources, we would need to add a provider identifier (and consider options for non-delegated usage); batch announcements retain sender info.
  5. Like 4, but instead of defining at the provider level, define gateway URLs at DSNP spec level. For instance, popular IPFS pinning services could be registered and referenced.

Other ideas?

@wesbiggs (Member Author) commented Aug 1, 2024

Notes from DSNP Spec Community Call 2024-08-01

Revised proposal presented:

  • "contentHash" field in Announcements:
    • sha2-256 or blake3 multihash as a byte array (change: blake2b→blake3)
    • does not use CIDv1 (no change)
    • serialized in 0x{hex} form when forming a DSNP Content URI (no change)
  • "hash" extension in Activity Content:
    • sha2-256 and/or blake3 multihash, encoded using base32 only; other multihashes allowed but not required (change: blake2b→blake3, narrow to base32)
  • ProfileResource User Data Avro object (DIP-267 also in review):
    • CIDv1 using the raw codec and a sha2-256 or blake3 hash, serialized as base32
    • maximum file size per resource type; Activity Content Profile max of 256Kb
  • Other:
    • specification wording adds clarity that the URL is a suggested, but not the only possible, location
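
For illustration, a sketch (assuming the multiformats package; names are illustrative) showing the same sha2-256 multihash rendered in the two announcement-side forms above: the 0x{hex} DSNP Content URI form and the base32 multibase form used in the Activity Content hash field.

```ts
// Sketch: render one sha2-256 multihash as hex (announcement / Content URI)
// and as a "b"-prefixed base32 multibase string (Activity Content "hash").
// Assumes the "multiformats" npm package; values are illustrative.
import { sha256 } from "multiformats/hashes/sha2";
import { base32 } from "multiformats/bases/base32";

export async function contentHashForms(content: Uint8Array) {
  const multihash = (await sha256.digest(content)).bytes; // full multihash bytes (code + length + digest)
  const hex = Array.from(multihash, (b) => b.toString(16).padStart(2, "0")).join("");
  return {
    contentUriForm: "0x" + hex,                    // used when forming a DSNP Content URI
    activityContentForm: base32.encode(multihash), // base32 multibase string
  };
}
```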

Discussion points

  • Using CIDv1 for ProfileResource may not align with IETF standardization efforts, which currently do include multihash.
  • It was noted that ProfileResource is designed to have additional types evolve, which could accommodate a change in format for the address field.
  • However, to do so, we should consider naming the address field within the Avro object more generically, and specify the format alongside the resource type enum so that consuming applications can correctly parse not only CIDs but other formats that may be specified in the future.

@gobengo commented Aug 1, 2024

However, to do so, we should consider naming the address field within the Avro object more generically, and specify the format alongside the resource type enum so that consuming applications can correctly parse not only CIDs but other formats that may be specified in the future.

Alternatively, require the value of the field to be a URI, not some other binary whose semantics must be known a priori.

Then you can use CIDs with an ipfs: or dweb: URI scheme, but you can also use the URI mechanism to evolve to other schemes like RFC 6920 (ni:), did:, etc.
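
A sketch of what that could look like; the scheme choices here are illustrative, not part of the proposal:

```ts
// Sketch: wrapping a CID in URI schemes so the field can later evolve to
// other schemes (RFC 6920 ni:, did:, ...) without changing its type.
// Assumes the "multiformats" npm package.
import { CID } from "multiformats/cid";

export function asUris(cid: CID): string[] {
  const c = cid.toString(); // e.g. "bafkrei..."
  return [
    `ipfs://${c}`,     // common gateway-resolvable form
    `dweb:/ipfs/${c}`, // dweb: addressing form
  ];
}
```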

wesbiggs added a commit that referenced this issue Aug 13, 2024
Problem
=======
DSNP should provide affordances for finding content that do not rely on
a single DNS-based point of failure for content hosting.
More in the original discussion: #273 

Solution
========
Enhance the specification to treat URLs in Announcements as suggestions
rather than canonical locations for content.
Provide a simple and well-specified set of hashes and encodings that can
be used consistently throughout the protocol.
Use IPFS CIDv1 specifically for locating profiles.

Change summary:
---------------
* Broaden ProfileResource definition so that different types of profile
resources can use different, possibly distributed, file systems via a
generic `contentAddress` field
* Simplify multihashes to use base32 encoding only and sha2-256 or
blake3 as the hashing algorithm
* Update various example hashes in line with this
* Update to pre-1.3.0 versioning and sync prerelease changelogs for
other recent additions to the spec
* Change spec language to expand on how to treat content hash + URL
pairs.
* Announcements use the same base32 encoded multihash for the
`contentHash` and `targetContentHash` fields.

---------

Co-authored-by: Wes Biggs <wes.biggs@amplica.io>
@wesbiggs wesbiggs changed the title DIP-273 Content-addressed file retrieval affordances DIP-273 Content hash usage Sep 25, 2024
@wesbiggs wesbiggs changed the title DIP-273 Content hash usage DIP-273 Content hash semantics and format normalization Sep 25, 2024