This repository was archived by the owner on May 24, 2024. It is now read-only.

Whitepaper questions #15

Open
@gronostajo

I'm interested in how Duplicati stores backups, so I'm reading the whitepaper and I have some questions that you could hopefully answer.

Q1.

However, there are two complications relating to this setup. One problem is that the list of hashes can become very large. For a setup with a block size of 100KiB using an SHA-256 hash, a 1TiB file will fill up more than 300MiB hash data, which would need to be stored in the filelist. That was the reason, to store the hash lists as binary blobs. As each SHA-256 hash is 32 bytes, a list of hashes is simply the 32 bytes written, then the next 32 bytes, etc. The hash lists are stored as blocks in the same way that the actual data is stored. This reduces the size of the hash list with a factor of blocksize / hashsize, or for the same 1TiB, the hash list in the filelist becomes around 100KiB. This introduces a level of indirection, but does not add further file or data types. If a single byte is changed in the 1 TiB file, the block itself needs to be updated, as well as the binary blob that contains the block’s hash, resulting in two blocks being updated.
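To check that I follow the indirection described here, this is roughly how I picture the two levels (a Python sketch of my reading, not Duplicati's actual code; all names are mine):

```python
import hashlib

BLOCK_SIZE = 100 * 1024  # 100 KiB block size, as assumed in the paper
HASH_SIZE = 32           # SHA-256 digest length in bytes

def block_hashes(data: bytes):
    """Level 0: hash every 100 KiB block of the file."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(data), BLOCK_SIZE)]

def hash_list_blocks(hashes):
    """Concatenate the 32-byte hashes into a binary blob and cut it into
    100 KiB blocks; these hash-list blocks are stored like ordinary data blocks."""
    blob = b"".join(hashes)
    return [blob[i:i + BLOCK_SIZE] for i in range(0, len(blob), BLOCK_SIZE)]

def filelist_entry(data: bytes):
    """Level 1: what ends up in the filelist - one hash per hash-list block."""
    lists = hash_list_blocks(block_hashes(data))
    return [hashlib.sha256(chunk).digest() for chunk in lists]
```

If that's right, a single changed byte touches one data block plus the one hash-list block that contains its hash, which matches the "two blocks being updated" above.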

I think I understand where the 300 MiB number comes from:

1 TiB / 100 KiB/hash = 10,737,418.2 hashes
10,737,418.2 hashes * 32 bytes/hash = 327.68 MiB of hashes

This calculation assumes hashes are stored in binary, so this number can't be optimized any further, right? Do I understand correctly that the 100 KiB number mentioned in the paper is just the total size of the hashes of the binary hash-list blocks? That would be ~3,355 hash-list blocks of 3,200 hashes each (100 KiB / 32 bytes per hash), so the filelist stores 3,355 hash-list hashes, or roughly 105 KiB, and the ~328 MiB of hash lists themselves must still be stored in blocks; they just don't make the filelists huge.
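Spelling my numbers out (just arithmetic, nothing Duplicati-specific):

```python
TiB = 2 ** 40
KiB = 2 ** 10
block_size = 100 * KiB
hash_size = 32

blocks = TiB / block_size                  # ~10,737,418 data blocks, one hash each
raw_hashes = blocks * hash_size            # ~327.68 MiB of hashes in total
hashes_per_list = block_size // hash_size  # 3,200 hashes fit in one 100 KiB hash-list block
lists = blocks / hashes_per_list           # ~3,355 hash-list blocks
filelist_bytes = lists * hash_size         # ~105 KiB of hash-list hashes in the filelist

print(raw_hashes / 2 ** 20, lists, filelist_bytes / KiB)
```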

Q2.

The other complication is, that over time, the volumes that contain the blocks will start to contain unused data.

I think what's implied here is that deleting backups can drop reference counts of some blocks to 0, thus making volumes that contain them wasteful. But then I can't figure out this part:

It is possible to do some strategic compacting that will exploit the fact that old blocks are less likely to become obsolete than new blocks.

Intuitively I'd say that old blocks are more likely to become obsolete, as old backups are more likely to be deleted than recent ones. That's also the assumption you seem to be making earlier:

When a backup grows over time, users usually want to delete the oldest backups first which is not possible with the chain of dependant patches.

Am I missing something here?
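For context, this is the model I have in mind when reading that section (a toy sketch of my understanding, not how Duplicati actually decides what to compact; the threshold is my own invention):

```python
from dataclasses import dataclass, field

# Toy model: each remote volume holds blocks; deleting a backup decrements
# reference counts, and blocks at refcount 0 become dead weight in their volume.
@dataclass
class Volume:
    blocks: dict = field(default_factory=dict)  # block hash -> (size, refcount)

    def waste_ratio(self) -> float:
        total = sum(size for size, _ in self.blocks.values())
        dead = sum(size for size, refs in self.blocks.values() if refs == 0)
        return dead / total if total else 0.0

def volumes_to_compact(volumes, threshold=0.25):
    """Pick volumes whose dead-block share exceeds a threshold
    (an assumed heuristic, not a documented Duplicati parameter)."""
    return [v for v in volumes if v.waste_ratio() > threshold]
```

With that model I'd expect the volumes written for older backups to accumulate dead blocks faster, hence the question.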

Q3.

Is a filelist a nested structure, or a flat one like git's trees? In other words, does a single filelist represent the entire root of a backup source, or does it only enumerate the direct children of one directory, in particular including references to other filelists for child directories? If it's the latter, are filelists grouped into volumes too?
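To make the question concrete, here are the two shapes I'm imagining (illustrative only; the field names are made up, not Duplicati's format):

```python
# Option A: one filelist covers the whole backup source with full paths
flat_filelist = [
    {"path": "/home/user/a.txt",      "hashlist": "<hash of hash-list block>"},
    {"path": "/home/user/docs/b.pdf", "hashlist": "<hash of hash-list block>"},
]

# Option B: per-directory filelists, like git trees - each one lists only
# direct children and points at the filelists of its subdirectories
root_filelist = [
    {"name": "a.txt", "type": "file", "hashlist": "<hash>"},
    {"name": "docs",  "type": "dir",  "filelist": "<hash of docs filelist>"},
]
```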

Q4.

The whitepaper doesn't state clearly at which level the encryption takes place. I guess volumes are encrypted as a whole? What about filelists? As a malicious actor, how much knowledge could I gain by looking at the encrypted backups?

Q5.

Are you applying any kind of compression to uploaded files?


By the way, your idea looks pretty solid. Unfortunately my first attempt at using Duplicati failed, but I may give it another try.
