
Commit 507b2dc

Author: ID Bot
Commit message: Script updating archive at 2024-11-19T01:45:38Z. [ci skip]
1 parent: f2070cf

File tree: 1 file changed (+17 -3 lines)


archive.json

Lines changed: 17 additions & 3 deletions
@@ -1,6 +1,6 @@
 {
   "magic": "E!vIA5L86J2I",
-  "timestamp": "2024-11-17T01:49:57.573600+00:00",
+  "timestamp": "2024-11-19T01:45:36.344907+00:00",
   "repo": "samuel-lucas6/draft-lucas-bkdf",
   "labels": [
     {
@@ -710,7 +710,7 @@
       "labels": [],
"body": "An alternative to the current KDF approach mentioned by Henry is to take bytes from the final buffer. If `length` is less than the buffer size, you take from the final buffer (e.g., midway through to the end). If it's larger than the buffer size, you would additionally do further mixing of the buffer to retrieve the remaining bytes. Because of `parallelism`, you would then XOR these (potentially larger than a block) outputs to produce the derived key material for the user.\r\n\r\nThis apparently has better memory hardness because you have to keep more data in memory until the end. It also has better performance for outputs smaller than the buffer size. However, there's a performance penalty for larger outputs since it requires more mixing, which is more expensive than a regular KDF. It also sounds more annoying to implement due to the potential extra mixing.",
       "createdAt": "2024-09-22T11:34:28Z",
-      "updatedAt": "2024-11-16T19:34:06Z",
+      "updatedAt": "2024-11-17T12:28:20Z",
       "closedAt": null,
       "comments": [
         {
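The issue body in this hunk describes an alternative way to produce output: read bytes from the tail of each lane's final buffer and XOR the lanes together, with extra mixing only when the requested length exceeds the buffer size. A minimal sketch of that idea, assuming hypothetical helper names and using BLAKE2b purely as a stand-in for the extra mixing (none of this is specified by the draft):

```python
import hashlib

def mix_buffer(buffer: bytes, length: int) -> bytes:
    # Hypothetical extra-mixing step for outputs larger than the buffer,
    # written with BLAKE2b only so the sketch is self-contained.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.blake2b(buffer + counter.to_bytes(8, "little")).digest()
        counter += 1
    return out[:length]

def extract_output(lane_buffers: list[bytes], length: int) -> bytes:
    # XOR one output chunk per parallelism lane into the derived key material.
    output = bytearray(length)
    for buffer in lane_buffers:
        if length <= len(buffer):
            chunk = buffer[-length:]      # e.g. take the tail of the final buffer
        else:
            chunk = mix_buffer(buffer, length)
        for i, byte in enumerate(chunk):
            output[i] ^= byte
    return bytes(output)
```

As the comment notes, the extra-mixing path makes outputs larger than the buffer more expensive than a plain KDF call, which is the trade-off being weighed.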
@@ -733,6 +733,13 @@
"body": ">> the final output is the hash of the entire memory (i.e. the blocks) with the chosen hash function;\r\n>\r\n> This would be very expensive and is unnecessary.\r\n\r\nExpensive compared to what?\r\n\r\nIs cost should be less than 1/4 the cost of an entire round. (In a round all the blocks are sequencially hashed, plus for each block we hash three others, thus for each round, we hash 4 times the entire memory, although in different steps.)\r\n\r\nUnnecessary, perhaps, but I don't think it can hurt the overall security...\r\n",
          "createdAt": "2024-11-16T19:34:05Z",
          "updatedAt": "2024-11-16T19:34:05Z"
+        },
+        {
+          "author": "samuel-lucas6",
+          "authorAssociation": "OWNER",
+          "body": "> Expensive compared to what?\r\n\r\nYou'd end up hashing MiBs to GiBs and multiple times if parallelism is used. By worsening performance, you harm security because the parameters have to be reduced for the same delay.\r\n\r\nThe design is inspired by existing designs I've looked at, and using a KDF at the end is done by Argon2/scrypt and some other schemes in the PHC. It's intuitive and relatively cheap.",
+          "createdAt": "2024-11-17T12:28:19Z",
+          "updatedAt": "2024-11-17T12:28:19Z"
         }
       ]
     },
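To put the "MiBs to GiBs" point from the reply added in this hunk in perspective, here is a back-of-envelope comparison of the data hashed by a KDF-style finalisation versus hashing the entire memory. The parameters are illustrative, not values from the draft, and the assumption that the KDF finalisation hashes roughly one block per lane is mine:

```python
# Illustrative parameters, not values taken from the draft.
BLOCK_SIZE = 32            # bytes per block (one 256-bit hash output)
space_cost = 2**20         # blocks per buffer -> 32 MiB per parallelism lane
parallelism = 4

# Assume the KDF-style finalisation hashes roughly one block per lane,
# while a whole-memory hash has to process every block of every lane.
kdf_input = parallelism * BLOCK_SIZE
full_memory_input = parallelism * space_cost * BLOCK_SIZE

print(f"final KDF input:  {kdf_input} bytes")
print(f"full-memory hash: {full_memory_input // 2**20} MiB")
```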
@@ -795,7 +802,7 @@
       "labels": [],
"body": "If one looks at the original paper, and how the `hash` function (`PRF` in our case) is used:\r\n* `buf[0] = hash(cnt++, passwd, salt)`\r\n* `buf[m] = hash(cnt++, buf[m-1])`\r\n* `buf[m] = hash(cnt++, prev, buf[m])`\r\n* `int other = to_int(hash(cnt++, salt, idx_block)) mod s_cost` (where `block_t idx_block = ints_to_block(t, m, i)`)\r\n* `buf[m] = hash(cnt++, buf[m], buf[other])`\r\n\r\nOne (with the exception of the first usage, and perhaps partially the second one), could conclude that the signature of `hash` is similar to `block_t hash (counter:usize, block_1:block_t, block_2:block_t)`, i.e. it always takes a counter and two blocks, and compresses them into a single block.\r\n\r\nIn fact, the paper states:\r\n> Since H maps blocks of 2k bits down to blocks of k bits, we sometimes refer to H as a cryptographic compression function.\r\n\r\n(And personally, I find the simplicity of the initial Balloon algorithm, and the simplicity of its building blocks, namely the single one `hash`, quite appealing.)\r\n\r\n----\r\n\r\nHowever, the draft has the following usages of `PRF`:\r\n* `key = PRF(key, password || salt || personalization || associatedData || LE32(pepper.Length) || LE32(password.Length) || LE32(salt.Length) || LE32(personalization.Length) || LE32(associatedData.Length))`\r\n* `previous = PRF(key, previous || UTF8(\"bkdf\") || LE32(counter++))`\r\n* `pseudorandom = pseudorandom || PRF(emptyKey, LE32(VERSION) || personalization || LE32(spaceCost) || LE32(timeCost) || LE32(parallelism) || LE32(iteration) || LE64(counter++))`\r\n* `buffer[0] = PRF(key, LE32(VERSION) || LE32(spaceCost) || LE32(timeCost) || LE32(parallelism) || LE32(iteration) || LE64(counter++))`\r\n* `buffer[m] = PRF(key, buffer[m - 1] || LE64(counter++))`\r\n* `buffer[m] = PRF(key, previous || buffer[m] || buffer[other1] || buffer[other2] || buffer[other3] || LE64(counter++))`\r\n\r\nNamely:\r\n* it introduces the concept of `key`, which doesn't exist in the initial algorithm;\r\n* sometimes there is a counter, sometimes there isn't one;\r\n* sometimes the `VERSION` is included, sometimes not;\r\n* sometimes the order of included parameters is one way, sometimes it is not;\r\n\r\n----\r\n\r\nDoesn't it seem that the draft drifts quite a bit from the original paper?\r\n\r\nI'm not saying it's wrong to have a different take, but I do see two problems:\r\n* the major one being that the paper gives some proofs for what was specified in the paper, meanwhile the draft makes some changes that I'm not sure they are equivalent;\r\n* the way `PRF` is used (with a lot of inline canonicalization) is error prone to this types of problems;\r\n* the way the inputs of the `PRF` are computed, might have performance impacts (as compared to the initial version, more in a different paragraph);\r\n\r\n----\r\n\r\nMany hash functions, especially the \"modern\" ones (I don't know about Blake, but for example Xxh3, which although is not a cryptographic hash function) might have optimized assembly implementations when they work on fixed blocks.\r\n\r\nThus, from this point of view, perhaps the original paper might yield a performance boost (because it uses fixed inputs) that the current draft `PRF` usage that has to concatenate a lot of data.\r\n",
       "createdAt": "2024-11-16T16:22:06Z",
-      "updatedAt": "2024-11-16T19:29:59Z",
+      "updatedAt": "2024-11-17T12:50:40Z",
       "closedAt": null,
       "comments": [
         {
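The body in this hunk contrasts the paper's fixed-shape compression function with the draft's keyed, variable-length PRF calls. A minimal sketch of the two call shapes, with BLAKE2b standing in for the hash/PRF purely for illustration (the function names are mine):

```python
import hashlib

BLOCK = 32  # bytes; the paper's H compresses two blocks down to one

def hash_block(counter: int, block_1: bytes, block_2: bytes = b"") -> bytes:
    # Fixed-shape call as in the original Balloon paper: a counter plus
    # (up to) two blocks, compressed into a single block.
    data = counter.to_bytes(8, "little") + block_1 + block_2
    return hashlib.blake2b(data, digest_size=BLOCK).digest()

def prf(key: bytes, message: bytes) -> bytes:
    # Keyed, variable-length call as in the draft: the caller concatenates
    # whatever context it needs (version, costs, counters, ...) into `message`.
    return hashlib.blake2b(message, key=key, digest_size=BLOCK).digest()
```

The fixed-arity form always compresses a counter plus two blocks into one block, whereas the draft-style call leaves it to the caller to concatenate and canonicalize its context, which is exactly the difference the comment is questioning.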
@@ -811,6 +818,13 @@
"body": "Citing all the places where `PRF` is used:\r\n~~~~\r\nkey = PRF(key, password || salt || personalization || associatedData || LE32(pepper.Length) || LE32(password.Length) || LE32(salt.Length) || LE32(personalization.Length) || LE32(associatedData.Length))\r\n## uses \"parameters\" (i.e. salt, persolalization, associated data, etc.)\r\n## misses the VERSION, timeCost, parallelism, etc.\r\n\r\npseudorandom = pseudorandom || PRF(emptyKey, LE32(VERSION) || personalization || LE32(spaceCost) || LE32(timeCost) || LE32(parallelism) || LE32(iteration) || LE64(counter++))\r\n## uses VERSION\r\n## misses cost, parallelism, etc.\r\n\r\nbuffer[0] = PRF(key, LE32(VERSION) || LE32(spaceCost) || LE32(timeCost) || LE32(parallelism) || LE32(iteration) || LE64(counter++))\r\n## uses VERSION, uses cost\r\n## misses personalization\r\n~~~~\r\n\r\n>> sometimes the VERSION is included, sometimes not;\r\n>\r\n>> sometimes the order of included parameters is one way, sometimes it is not;\r\n\r\nPersonally, I would just take all possible context that is either public (personalization, version, etc.) or mostly constant (costs, parallelism, etc.) and just just mix them into a single value that is to be used wherever \"parameters\" (or part of the parameters) might be needed.\r\n\r\nFor example, rewriting the above:\r\n~~~~\r\ncontext = PRF(emptyKey,\r\n salt || LE32(salt.Length) ## let's assume it's public\r\n || pepper || LE32(pepper.Length) ## let's also assume it's public\r\n || associatedData || LE32(associatedData.Length)\r\n || personalization || LE32(personalization.Length)\r\n || LE32(VERSION)\r\n || LE32(parallelism)\r\n || LE32(timeCost)\r\n || LE32(spaceCost)\r\n)\r\n \r\n\r\nkey = PRF(key, context || password)\r\n\r\npseudorandom = pseudorandom\r\n || PRF(emptyKey, context || LE32(iteration) || LE64(counter++))\r\nbuffer[0] = PRF(key , context || LE32(iteration) || LE64(counter++))\r\n~~~~\r\n\r\nSee how \"nice\" the PRF usage looks?\r\n* there is some \"key\";\r\n* the first input is some \"context\";\r\n* then follows the data;\r\n\r\n----\r\n\r\n\r\n\r\n>> the way `PRF` is used (with a lot of inline canonicalization) is error prone to this types of problems;\r\n>\r\n> What type of problem? This type of canonicalization is common with KDFs/MACs. If something is fixed length, you don't need to encode the length. Encoding the length of fixed length parameters is kind of like encoding the length of the length encodings.\r\n\r\nIndeed canonicalization is common, however the usual flavour of canonicalization is either:\r\n* `length(data_x) || data_x || ...`\r\n* `data_x || length(data_x) || ...`\r\n* (this was even standardized by NIST as `TupleHash` in <https://csrc.nist.gov/pubs/sp/800/185/final>;)\r\n\r\nMeanwhile the current draft puts the lengths at the end, thus it's error prone to implement correctly. 
(Perhaps it doesn't break the cryptography, but it does offer enough opportunities to mix things and have broken outputs that don't match the test vectors.)\r\n\r\n----\r\n\r\n> I would think more calls to the hash functions is slower, but I haven't implemented this version of the draft to do benchmarks.\r\n\r\nLooking in the Blake3 paper there is the following graph:\r\n![image](https://github.com/user-attachments/assets/8e1f5181-e98c-4f72-a868-531fef136f23)\r\n\r\nAs you can see, for small hashes (although here is the full hash computation, but perhaps it also extends to discrete individual `update(some_data)` calls), there is a smal cost for 64 bytes, then it goes up in between 64-128, then goes down for 128, up again until 256 bytes, and then there is a plateou in between 256 and 2K (perhaps some CPU pipelining or branch-prediction kicks-in).\r\n\r\nThus, assuming one uses Blake3 or another algorithm that has a similar behaviour (strangely enough also SHA2 follows a similar pattern), if the PRF is used as I've proposed above (plus the other two usages), i.e.:\r\n~~~~\r\nPRF(key, context || LE32(iteration) || LE64(counter++))\r\n## let's assume we pad iteration and counter to fit in 16 bytes each\r\n=> PRF(key, 32 bytes || 32 bytes)\r\n=> PRF(key, 64 bytes)\r\n\r\nbuffer[m] = PRF(key, buffer[m - 1] || LE64(counter++))\r\n## let's assume we pad counter to 16 bytes, and we imagine we have an iteration of 0\r\n=> PRF(key, 32 bytes || 32 bytes)\r\n=> PRF(key, 64 bytes)\r\n\r\nbuffer[m] = PRF(key, previous || buffer[m] || buffer[other1] || buffer[other2] || buffer[other3] || LE64(counter++))\r\n## let's also pad counter to 16 bytes, plus iteration of 0\r\n=> PRF(key, 32 bytes || x6)\r\n=> PRF(key, 192 bytes)\r\n~~~~\r\n\r\nWe get almost the optimal behaviour for Blake3.\r\n",
          "createdAt": "2024-11-16T19:29:58Z",
          "updatedAt": "2024-11-16T19:29:58Z"
+        },
+        {
+          "author": "samuel-lucas6",
+          "authorAssociation": "OWNER",
+          "body": "> misses the VERSION, timeCost, parallelism, etc.\r\n\r\nThe parallelism loop iteration can't be included in the key derivation since the key is static.\r\n\r\n> misses cost, parallelism, etc.\r\n\r\nThose are included when computing the pseudorandom bytes.\r\n\r\n> misses personalization\r\n\r\nThis is unnecessary if it's in the key.\r\n\r\n> Personally, I would just take all possible context that is either public (personalization, version, etc.) or mostly constant (costs, parallelism, etc.) and just just mix them into a single value that is to be used wherever \"parameters\" (or part of the parameters) might be needed.\r\n\r\nIt's unfortunately not that simple because you don't want things like the salt/pepper/associated data in the pseudorandom bytes derivation. There's also no need to process those multiple times.\r\n\r\nYou could do it for the VERSION, personalization, parallelism, timeCost, spaceCost, and parallelism iteration, but it's pretty ugly because you need to use an empty key. You also end up potentially hashing more data because a hash is larger than those encoded parameters.\r\n\r\nBut what you're talking about is what I was thinking about with prehashing the personalization so the pseudorandom bytes input fits into a block (that isn't guaranteed if the personalization length is variable).\r\n\r\n> Meanwhile the current draft puts the lengths at the end, thus it's error prone to implement correctly. (Perhaps it doesn't break the cryptography, but it does offer enough opportunities to mix things and have broken outputs that don't match the test vectors.)\r\n\r\nIt's also normal to put the lengths at the end. For example, see the ChaCha20-Poly1305 [RFC](https://www.rfc-editor.org/rfc/rfc8439#section-2.8.1). It allows you to process inputs without knowing their length in advance, which admittedly isn't relevant here. If this is done incorrectly, the test vectors shouldn't pass.\r\n\r\n> Looking in the Blake3 paper there is the following graph\r\n\r\nYes, I've seen that graph. The question is how do repeated small calls do vs one slightly longer call.\r\n\r\nI definitely don't want to include padding because it depends on the algorithm and that gets messy.",
+          "createdAt": "2024-11-17T12:50:39Z",
+          "updatedAt": "2024-11-17T12:50:39Z"
         }
       ]
     }
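One recurring point in this hunk is where the length encodings go. A simplified sketch of the two canonicalization styles being compared, length-prefixed fields in the TupleHash style versus fields followed by all of their lengths as in the draft (and in the ChaCha20-Poly1305 RFC's MAC input). The helper names are mine and this is not the draft's exact encoding:

```python
import struct

def le32(n: int) -> bytes:
    return struct.pack("<I", n)

def encode_length_prefixed(*fields: bytes) -> bytes:
    # TupleHash-flavoured: every field is preceded by its own length.
    return b"".join(le32(len(f)) + f for f in fields)

def encode_length_suffixed(*fields: bytes) -> bytes:
    # Draft-flavoured (and RFC 8439-flavoured): all fields first,
    # then all of their lengths appended at the end.
    return b"".join(fields) + b"".join(le32(len(f)) for f in fields)
```

Both encodings are unambiguous when the number and order of fields is fixed; the disagreement in the thread is over which is harder to get wrong in an implementation.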

0 commit comments
