Instruction Maybe-Wishlist for Recursion and Consensus #259
It turns out that the op stack table height is the bottleneck in the verifier, not the clock cycle count (which is recorded by the Processor Table). So counting the number of clock cycles that a new instruction would shave off might not be the relevant metric. I was considering a …
This new …
I guess my view has changed a bit. The `recurse_or_return` that I think would work best is one that returns if `ST[5] == 1`, and recurses otherwise. That could be used in combination with `merkle_step` to reduce the loop body to 2 instructions. A loop that walks up a Merkle tree would then be:
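A minimal sketch of such a loop, assuming `merkle_step` keeps the Merkle node index in `ST[5]` and halves it on every application, and assuming the proposed semantics where `recurse_or_return` returns once `ST[5] == 1`:

```
// walk up a Merkle tree, two instructions per layer
ascend:
    merkle_step          // divine sibling, hash the pair, halve node index in ST[5]
    recurse_or_return    // node index 1 means the root is reached: return; else recurse
```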
I understand that this creates a problem for Merkle trees of height 0, with only one node. But I think that's not a practical problem. Edit: but let's see where we stand after having added the `dot_step`s and the `merkle_step`.
I am observing that the following pattern occurs quite often:
Maybe this can be shrunk to one instruction, …
Also the following:
How about a …
Another frequent pattern, this one at the end of a loop iteration to do index bookkeeping:
The general pattern is …
And of course:
could be reduced with …
Also, at the start of a loop to test the termination condition:
I realize this is addressed by …
There is a cost in proving time for each new instruction that you add or make more general. The instruction … In general, I would say that if the loop body is more than 40 clock cycles, then using an efficient check for the loop end-condition, like …
This suggestion anticipates the miner's task of updating the mutator set's SWBFI MMR accumulator. He must use the same authentication path twice: once to prove that the old leaf was correct, and once to prove that the new root is correct. It would be a shame to read this authentication path from memory twice, especially since the authenticity of the paths provided in the transaction can be established efficiently by the transaction initiator. To this end I propose … (I'm sure one of you proposed this instruction before I did; I'm just writing it here for the record.)
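Following the logic of this comment, the proposed instruction plausibly divines each sibling digest once and applies it to two running Merkle-path computations at the same time, one for the old leaf and one for the new root, so the path never needs to be read twice. The instruction name `merkle_step_pair` and the stack layout below are hypothetical illustrations, not taken from this thread:

```
// hypothetical merkle_step_pair:
// before: _ node_idx [old_digest; 5] [new_digest; 5]
// after:  _ (node_idx // 2) [old_digest'; 5] [new_digest'; 5]
// one divined sibling is hashed into both running digests
update_both_paths:
    merkle_step_pair     // hypothetical; see stack signature above
    recurse_or_return    // schematic loop control: return at the root
```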
I did some analysis on which instruction patterns appear the most often when executing the recursive verifier. Unsurprisingly, the sequences … The patterns occurring the next most often are:¹
The total number of instructions executed is around 285 000. Therefore, even the most common sequence of 2 instructions hardly seems worth optimizing for. To be honest, this data does not show me a clear path as to which instruction patterns to prioritize. I also can't confirm any of the patterns @aszepieniec identified last week. (This does not include …)

Footnotes
This confirms my sense that we've more or less hit a wall when it comes to verifier optimizations. It would probably be possible to get the verifier for expansion factor 4 below the next power-of-two threshold, but it would require a great deal of work, including the trick that halves the number of memory table rows at the cost of more columns. Cf. the latest verifier benchmark for alpha-6:

```json
{
  "name": "tasmlib_verifier_stark_verify_inner_padded_height_524288_fri_exp_4",
  "benchmark_result": {
    "clock_cycle_count": 285183,
    "hash_table_height": 234025,
    "u32_table_height": 211197,
    "op_stack_table_height": 235124,
    "ram_table_height": 281599
  },
  "case": "CommonCase"
}
```

Getting …
Although this is a great analysis, it will not find all candidates for new instructions, since the current code is optimized for the current instruction set. An alternative approach would be to analyze the mathematical primitives of the verifier and add new instructions tailor-made for those. I guess …
This proposal targets a more efficient inner-product-and-row-hashing step, which takes place in the recursive verifier.
Currently, the hashing part of these steps accounts for 17% of the RAM Table and 8% of the Processor Table, and the inner product part accounts for 56% of the RAM Table and 16% of the Processor Table. While the Processor Table dominates at 285185 rows (according to @Sword-Smith's most recent profile for version alpha-6 with expansion factor 4), the RAM Table is not far behind with 281599 rows. This proposal is chiefly intended to shrink the RAM Table row count. Without matching drops in the Processor Table row count, this proposal is of little use; in other words, it should be considered in combination with another proposal to shrink the Processor Table row count. The proposal involves modifying the STARK architecture and introducing some new instructions.

**STARK architecture.** Use univariate batching instead of linear batching of execution trace columns. Basically, this means using powers of one and the same weight instead of independent weights for every column. In fact, this modification alone already generates a speedup for the prover and, with Horner …, counterpart to ….
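In formulas (my notation, not from this comment): with linear batching the verifier supplies an independent weight for every column, whereas with univariate batching all weights are powers of a single challenge $\alpha$. For trace columns $f_0, \dots, f_{n-1}$:

$$
\underbrace{\sum_{i=0}^{n-1} \alpha_i \, f_i(X)}_{\text{linear batching}}
\qquad \text{vs.} \qquad
\underbrace{\sum_{i=0}^{n-1} \alpha^{i} \, f_i(X)}_{\text{univariate batching}}
$$

The univariate form is exactly a polynomial in $\alpha$ whose coefficients are the row's column values, which is why a Horner-style instruction can fold one column value into a running accumulator per step.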
**Instruction ….** Sends stack … to …. If the top 10 elements …, the …. This instruction's only effect is to modify the stack as described. As this map is linear, it generates constraints of degree 1. (And, if I'm not mistaken, only one constraint.)

**Instruction ….** Does the same as ….

**Instruction ….** Sends stack … to …, where, stretching notation, ….

**Instruction ….** Note that ….

**Instruction ….** … to …, where, stretching notation, ….

**Analysis.** This proposal drops all RAM accesses involved in the inner-product-and-row-hashing steps and promises to reduce the RAM Table row count by 67%, to some 93000. Furthermore, it stands to reduce the Processor Table's row count as well, but less dramatically. Currently:
With the proposed changes:
We have roughly 350 base field columns and 78 extension field columns, so rounding upwards to compute a performance factor one gets:
Applying that improvement to the current total row count across both hashing and inner product steps for both main and auxiliary tables, one gets 0.54 * (20000 (hashing) + 47000 (inner product)) = 36180. This difference would put the Processor Table row count (currently at 285185) below the next power of two: 285185 − (67000 − 36180) = 254365 < 262144 = 2^18. A lot of complexity comes from …
This is a tracking issue. We add imagined instructions that could make recursion (or consensus programs) faster.
| instruction | status | stack before | stack after | notes |
| --- | --- | --- | --- | --- |
| `read_mem_forward` | ❌ | `_ *ptr` | `_ (*ptr+n) [element]` | Loop body of `HashVarlen` becomes 3 instructions. However, `hash_var_len` is only performance critical when the number of to-be-hashed elements is known a priori, making other approaches feasible; `dot_step` (see below) eliminates the second next important use for `read_mem_forward`. |
| `dot_step` | ✅ | `_ acc2 acc1 acc0 *lhs *rhs` | `_ acc2' acc1' acc0' (*lhs+3) (*rhs+3)` | Loop body of `InnerProductOfThreeRowsWithWeights` becomes 1 instruction. Stands to reduce ~1M cycles to ~65000 cycles. |
| `merkle_step` | ✅ | `_ merkle_node_idx [Digest; 5]` | `_ (merkle_node_idx // 2) [Digest'; 5]` | Merges `divine_sibling` and `hash` into one instruction. Replaces `divine_sibling`; instruction `hash` remains available as-is. Unlike `divine_sibling` and `hash`, which change the height of the stack by 5 elements each, it leaves the stack height unchanged. |
| `get_colinear_y` | | `_ ax [ay] bx [by] [cx]` (possibly different order) | `_ [cy]` | Shrinks `compute_c_values` from 74 instructions to 49. Total cycle count reduction: ~26000. |
| `recurse_or_return` | ✅ | `_ a b` | `_ a b` | `recurse`s if …, `return`s otherwise. Replaces `dup <m> dup <n> eq skiz return <loop_body> recurse`, reducing the op stack delta of loop maintenance (see the sketch below this table). `recurse_or_return`: #288. |
| `absorb_from_mem` | ✅ | `_ mem_ptr [garbage; 3]` | `_ (mem_ptr - 10) [garbage; 3]` | Replaces `read_mem 5 read_mem 5 sponge_absorb`, albeit not a drop-in replacement due to the `[garbage; 3]`, which is needed for arithmetization reasons. |
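To make the `recurse_or_return` row concrete, here is a minimal before/after sketch of the loop-maintenance pattern it replaces. The positions `<m>` and `<n>` and the placeholder `<loop_body>` come from the pattern quoted above; which registers `recurse_or_return` compares is left abstract here:

```
// before: explicit end-condition check in every iteration
my_loop:
    dup <m> dup <n>      // copy loop counter and its target to the top
    eq skiz return       // leave the loop once they coincide
    <loop_body>
    recurse

// after: the comparison is folded into the control-flow instruction,
// so no stack traffic is needed for loop maintenance
my_loop:
    <loop_body>
    recurse_or_return
```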