completely rework the persistence article
For #1798

My English should be reworked.
The article is currently written as a one-character conversation, which sounds weird; I think it would read better if it were less conversational.
There are lots of opportunities to shorten my writing.
Jorropo committed Feb 3, 2024
1 parent 9bff70c commit d4155c2
Showing 1 changed file with 70 additions and 52 deletions.
122 changes: 70 additions & 52 deletions docs/concepts/persistence.md
@@ -1,84 +1,102 @@
---
title: Persistence
description: Learn about how IPFS treats persistence and permanence on the web and how pinning can help keep data from being discarded.
description: Where is IPFS's place in the persistence story?
---

# Persistence, permanence, and pinning
# The Status Quo

Understand the concepts behind IPFS pinning, along with the differences between persistence, permanence, and pinning.
One goal of IPFS is to preserve humanity's history by letting users store data while minimizing the risk of that data being lost or accidentally deleted. This is often referred to as permanence. But what does permanence _actually_ mean, and why does it matter?

## Persistence versus permanence
A 2011 study found that the [average lifespan of a web page is 100 days](https://blogs.loc.gov/thesignal/2011/11/the-average-lifespan-of-a-webpage/) before it's gone forever. It's not good enough for the primary medium of our era to be this fragile.

Solving this is a hard economics problem.
Most data storage media ([Flash](https://en.wikipedia.org/wiki/Flash_memory), [Hard Drives](https://en.wikipedia.org/wiki/Hard_disk_drive)) degrade over time.
Others degrade extremely slowly or not at all, but are poor at making the data accessible ([Magnetic Tape](https://en.wikipedia.org/wiki/Magnetic-tape_data_storage)); how useful is an archive if it can't be viewed?

This means an economic solution is needed to pay for long-term maintenance costs.

Multiple projects try to solve this problem. For example, the [Internet Archive](https://archive.org/) is a non-profit that has successfully backed up 863 billion web pages since 1996.

As another example, random FTP servers are a great resource for finding 30+ year old hardware drivers.

However, the current status quo has big flaws:
- Lack of verifiability.
- No good migration story.
- Technical differentiation between live copies and backups.

### Lack of verifiability

The Internet Archive is a great project, but when browsing its [first snapshot of wikipedia.org](https://web.archive.org/web/20010727112808/http://www.wikipedia.org/), how can you know the snapshot is accurate?

The answer is that you can't: someone at the Internet Archive, or a government exercising power over the Internet Archive, could tamper with the snapshots without you being aware.

This is better than not having a backup, but it limits how safely these backups can be used.
Note that the Internet Archive is a best-case scenario: it is a well-known and trusted entity in the field.
Do you want to run 30-year-old code you got from an FTP server that doesn't even have a hostname?

### No good migration story

What if the Internet Archive had funding issues and could not keep its Wikipedia snapshots online?

Maybe a new non-profit would decide to start hosting the snapshots. However, either everyone currently using the Internet Archive would need to switch to the new service, or someone at the Internet Archive would have to set up a redirect to it.

This forces humans into the loop, which limits how easily we can cooperate on the project.

### Technical differentiation between live copies and backups

Humanity creates various copies of the same data for various reasons: the Internet Archive snapshots the web, your browser caches pages so they load faster, and so on. However, each solution is ad hoc and incompatible with the others.

If you try to connect to wikipedia.org and it is offline for some reason, your browser won't automatically try archive.org or download the page from someone nearby who viewed it a couple of minutes ago and still has it cached.

A 2011 study found that the [average lifespan of a web page is 100 days](https://blogs.loc.gov/thesignal/2011/11/the-average-lifespan-of-a-webpage/) before it's gone forever. It's not good enough for the primary medium of our era to be this fragile. IPFS can keep every version of your file you wish to store, and make it simple to set up resilient networks for mirroring data.
# How does IPFS improve the situation?

Nodes on the IPFS network can automatically cache resources they download, and keep those resources available for other nodes. This system depends on nodes being willing and able to cache and share resources with the network. Storage is finite, so nodes need to clear out some of their previously cached resources to make room for new resources. This process is called _garbage collection_.
IPFS does not try to improve on the economic part of the problem.
How we are going to store data and make it available over thousands of years is not something IPFS is trying to solve.

To ensure that data _persists_ on IPFS, and is not deleted during garbage collection, [data can be pinned](../how-to/pin-files.md) to one or more IPFS nodes. Pinning gives you control over disk space and data retention. As such, you should use that control to pin any content you wish to keep on IPFS indefinitely.
IPFS solves the other part: decoupling the act of storing data from vetting its authenticity.
To do so, IPFS implements content addressing: links in IPFS point to the content itself, not to a server that might be online and might serve you what you actually want.

## Garbage collection
This makes it possible to change where and how the data is stored and transferred without breaking the links.
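
As a minimal sketch of what content addressing looks like in practice, assuming a local [Kubo](https://github.com/ipfs/kubo) node (the file name and CID below are purely illustrative):

```shell
# Add a file to the local node; the CID printed is derived from the content itself
ipfs add article.html
# added QmExampleCid article.html    (CID shown here is a placeholder)

# Anyone holding that CID can retrieve the exact same bytes from any node that has them,
# no matter which machine originally added the file or where it is hosted today
ipfs cat QmExampleCid > article-copy.html
```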

[Garbage collection](<https://en.wikipedia.org/wiki/Garbage_collection_(computer_science)>) is a form of automatic resource management widely used in software development. The garbage collector attempts to reclaim memory occupied by objects that are no longer in use. IPFS uses garbage collection to free disk space on your IPFS node by deleting data that it thinks is no longer needed.
### Verifiability with IPFS

## Pinning in context
IPFS links (also called CIDs) contain [cryptographic hashes](https://en.wikipedia.org/wiki/Cryptographic_hash_function) arranged in [Merkle trees](https://en.wikipedia.org/wiki/Merkle_tree).

An IPFS node can protect data from garbage collection based on different kinds of user events:
The hashes let you verify the data you download: if the received data has been modified, the hash changes and the data is discarded. This means it is not absolutely critical to fetch from someone you trust.

- The universal way is by adding a low-level [local pin](../how-to/pin-files.md). This works for all data types and can be done manually, but if you add a file using the CLI command [`ipfs add`](../reference/kubo/cli.md#ipfs-add), your IPFS node will automatically pin that file for you.
- When working with files and directories, a better way may be to add them to the local [Mutable File System (MFS)](glossary.md#mfs). This protects the data from garbage collection in the same way as local pinning but is somewhat easier to manage.
Merkle trees are an improvement on top of plain hashes: they allow incremental verification and data streaming. This is important for P2P features and enables complex use cases like video streaming.
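
As a rough illustration with the Kubo CLI (the file name is illustrative; `--only-hash` computes a CID without storing anything), you can recompute the CID of data you received and compare it against the CID in the link:

```shell
# Compute the CID of the downloaded copy without adding it to the local store
ipfs add --only-hash downloaded-copy.html

# If the printed CID matches the CID you asked for, the copy is untampered;
# a single changed byte would produce a completely different CID.
```

IPFS implementations perform this check automatically on every block they download; the command above only makes the idea visible.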

::: tip
If you want to learn more about how pinning fits into the overall lifecycle of data in IPFS, check out the course from [IPFS Camp _The Lifecycle of Data in DWeb_](https://www.youtube.com/watch?v=fLUq0RkiTBA).
:::
### Migrations with IPFS

## Pinning services
We will probably want to improve this technology in the future.

To ensure that your important data is retained, you may want to use a pinning service. These services run lots of IPFS nodes and allow users to pin data on those nodes for a fee. Some services offer a free storage allowance for new users. Pinning services are handy when:
Sadly, current limitations put a hard floor on what can be changed and by whom.

- You don't have a lot of disk space, but you want to ensure your data sticks around.
- Your computer is a laptop, phone, or tablet that will have intermittent connectivity to the network. Still, you want to be able to access your data on IPFS from anywhere at any time, even when the device you added it from is offline.
- You want a backup that ensures your data is always available from another computer on the network if you accidentally delete or garbage-collect your data on your own computer.
A link like `https://wikipedia.org/` couples your request to some kind of name resolution, because a name is all you have.
It turns out domain names expire: they require paying fees to someone who pays fees to someone else who pays fees to [ICANN](https://en.wikipedia.org/wiki/Internet_Corporation_for_Assigned_Names_and_Numbers).

Some available pinning service providers are:
Solving the economic problem of backups is hard. Multiple groups and projects propose various solutions, and it would be much easier if we were able to join forces on the problem.
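
As a hedged sketch of what a hand-off could look like with content addressing (the CID is a placeholder, and this assumes the new host simply runs an IPFS node):

```shell
# On the new host: fetch and pin the existing snapshot by its CID
ipfs pin add QmSnapshotCid

# Existing links that reference QmSnapshotCid keep working unchanged;
# clients discover the new provider instead of the old one, with no humans
# coordinating a redirect.
ipfs cat QmSnapshotCid
```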

:::warning
Some of the pinning services listed below are operated by third party companies. There is no guarantee that these third party companies will continue to maintain their pinning service. It is strongly recommended that you thoroughly research a pinning service before using it to host your data.
:::

- [4EVERLAND Bucket](https://www.4everland.org/bucket/)
- [Estuary](https://estuary.tech/)
- [Filebase](https://filebase.com/)
- [Infura](https://infura.io/)
- [Kriptonio](https://kriptonio.com/)
- [NFT.Storage](https://nft.storage/)
- [Pinata](https://pinata.cloud/)
- [Scaleway](https://labs.scaleway.com/en/ipfs-pinning/)
- [Spheron](https://spheron.network/)
- [Web3.Storage](https://web3.storage/)
### No differentiation with IPFS

See how to [work with remote pinning services](../how-to/work-with-pinning-services.md).
Thanks to the two points above, there is no particular reason why you need a different link to access a website directly versus a backup made by a third party like the Internet Archive.

## Long-term storage
If you have an awesome idea for tackling the economic problem, maybe you can convince lots of people to install some collaborative archival software that uses 1% of their disk drive to back up public content. Without IPFS, you would also need end users to change their habits and use your software to fetch content.

Storing data using a personal IPFS node is easy, but it can be inconvenient since you have to manage your own hardware. This problem gave rise to _pinning services_, paid services that allow you to upload your data to a remotely hosted IPFS node and retrieve it whenever you want. However, while paying a pinning service to store data is a convenient workaround, it still requires someone to bear the cost of storing that data. If that one sponsor stops paying for that pinning, the content may be lost entirely. While IPFS guarantees that any content on the network is discoverable, it doesn't guarantee that any content is persistently available. This is where [Filecoin](https://filecoin.io) comes in.
Depending on how it is configured, a good IPFS implementation will search in multiple places (so-called content routing): a directly provided link, a [DHT](./dht.md) query, or an [IPNI](./ipni.md) query. This can allow for faster downloads, since you are not limited by a single remote server and can download from closer peers, or from multiple peers at once.

### Storing data with Filecoin
This content routing introduces places that maintain an updatable list of whom you should contact. In other words, if the original source node goes down but someone else has a copy, a good IPFS implementation will *just work*; unlike HTTP, you don't need to manually find a backup somewhere else.
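
For example, Kubo exposes the content-routing lookup directly (the CID is a placeholder; older Kubo versions offer the same query as `ipfs dht findprovs`):

```shell
# List peers currently advertising that they can provide this CID
ipfs routing findprovs QmExampleCid

# Each line of output is a peer ID; any of those peers can serve the content,
# so retrieval does not depend on one particular server staying online.
```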

[Filecoin](https://filecoin.io) is a decentralized storage network in which storage providers rent their storage space to clients. The client and the storage provider agree on how much data will be stored, for how long, and at what cost. This agreement is called a _deal_. Once both parties agree to a deal, the client sends the data to the storage provider, who periodically verifies that they are correctly storing the data. When the client wants the data back, they send a request to the storage provider, who initiates the data transfer back to the client. For more information on how Filecoin works, head over to the [official Filecoin documentation →](https://docs.filecoin.io/about/basics/how-filecoin-works/)
### Joining forces

Filecoin provides users with a dependable, long-term storage solution. However, there are some limitations to consider. The retrieval process is not always as fast as an IPFS pinning service, and the minimum file size accepted by a Filecoin storage provider can be several GiB. Also, the process for creating a storage deal may seem complicated to new users who aren't familiar with blockchain transactions or simply aren't comfortable working within a command line.
It turns out content addressing is not only useful for persistence.

### IPFS + Filecoin solutions
Today's status quo is that most projects don't implement content addressing.

Fortunately, there is a growing community of tools and service providers that help simplify the process of making content available over IPFS while also persisting the data via Filecoin. These solutions make it simple to store data using decentralized protocols by acting both as IPFS pinning services and Filecoin storage platforms. Combining the two means that when you upload a file, that file is immediately available for download. Additionally, combined IPFS + Filecoin solutions will periodically bundle data and create a deal with a reputable Filecoin storage provider, ensuring that the data is available in long-term storage. Many solutions include API client libraries for developers to integrate into their apps and services, as well as web interfaces for quickly managing and inspecting files.
The rare projects that do implement it (like Docker, for caching expensive layer transfers) create snowflake solutions that only work for their particular problem.

Options in this category include:
If you deploy a Docker caching solution, you can save money and get faster transfers of Docker images, but you can't use it to cache npm packages or Go modules, even though it is the exact same problem.
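
As a hedged sketch (file names are illustrative): with content addressing, the artifact type stops mattering, so the same caching machinery could serve a container layer, an npm package, and a Go module alike:

```shell
# All of these become plain content-addressed blobs with stable CIDs
ipfs add layer.tar.gz          # a container image layer
ipfs add left-pad-1.3.0.tgz    # an npm package tarball
ipfs add module-v1.2.3.zip     # a Go module archive

# One cache, one retrieval path, one verification scheme for all of them
```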

- [Web3.Storage](https://Web3.Storage)
- [NFT.storage](https://nft.storage/)
- [Estuary](https://estuary.tech)
- [Powergate](https://github.com/textileio/powergate)
- [ChainSafe Storage](https://storage.chainsafe.io)
- [Fleek Storage](https://fleek.co/storage)
- [Spheron](https://spheron.network)
With IPFS, by having a minimal set of specifications for describing data and keeping everything else modular, we can apply the same solutions and improvements to:
- Persistence
- Local-first networking
- Faster performance (P2P)
- Lower costs (easy caching)
