I've noticed some block concurrency issues in resource-limited environments (i.e. without unlimited horizontal scaling), which can have a non-negligible overall performance impact. It may be due to #600.
I proved a payload of 20 contiguous blocks on a t2d-60 instance running 12 simulated workers (in-memory runtime). Because of the limit on parallel block proving, a block must be fully proven before a new one can be added to the queue. And because each block proof relies on the previous one, this part becomes purely sequential: block 1 has to finish before we can even kick off segment proofs for block 17, and so on.
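To make the pattern concrete, here is a minimal, self-contained Rust sketch of that scheduling behaviour. This is *not* the actual zero-bin scheduler; `POOL_SIZE`, `SEGMENT_TIME` and `AGG_TIME` are made-up stand-ins. The point it illustrates: a slot is only released once a block is fully proven, and since the block proof depends on the previous block's proof, segment proving for block `POOL_SIZE + 1` is gated on block 1 finishing.

```rust
// Hypothetical stand-in for the scheduling described above, NOT the actual
// zero-bin code. POOL_SIZE and the sleep durations are made up.
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

const POOL_SIZE: usize = 16; // max blocks in flight at once
const NUM_BLOCKS: usize = 20;
const SEGMENT_TIME: Duration = Duration::from_millis(50); // stand-in for minutes of segment proofs
const AGG_TIME: Duration = Duration::from_millis(5); // aggregation is much cheaper

fn main() {
    let start = Instant::now();

    // Channel used as a counting semaphore for block slots.
    let (slot_tx, slot_rx) = mpsc::channel::<()>();
    for _ in 0..POOL_SIZE {
        slot_tx.send(()).unwrap();
    }

    // Receiver signalling that the previous block's final proof is available.
    let mut prev_done: Option<mpsc::Receiver<()>> = None;
    let mut handles = Vec::new();

    for block in 1..=NUM_BLOCKS {
        // A new block only enters the queue once a slot is free.
        slot_rx.recv().unwrap();
        println!("[{:?}] block {block}: segment proving starts", start.elapsed());

        let (done_tx, done_rx) = mpsc::channel::<()>();
        let prev = prev_done.take();
        let slot_tx = slot_tx.clone();
        handles.push(thread::spawn(move || {
            thread::sleep(SEGMENT_TIME); // all segment proofs for this block
            if let Some(prev) = prev {
                prev.recv().unwrap(); // the block proof needs the previous block's proof
            }
            thread::sleep(AGG_TIME); // aggregation + block proof
            println!("[{:?}] block {block}: fully proven, slot released", start.elapsed());
            done_tx.send(()).unwrap();
            // The slot is only freed *after* the full block proof, so block
            // `i + POOL_SIZE` cannot start its segments before block `i` is done.
            slot_tx.send(()).unwrap();
        }));
        prev_done = Some(done_rx);
    }

    for h in handles {
        h.join().unwrap();
    }
}
```

Running it, block 17's segment proving only starts once block 1 is fully proven (segments *and* aggregation), which is exactly the gap visible in the logs below.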
I've attached the logs (b1_20.log) below. If we grep everything related to block 1:
```
$ cat b1_20.log | grep 'id="b1 '
2024-09-19T13:34:57.225907Z INFO p_gen: zero::prover_state: using on demand circuit ProverStateManager { circuit_config: CircuitConfig { circuits: [16..21,8..21,8..21,4..20,8..17,4..21,17..24,16..23,7..23] }, persistence: Disk(OnDemand) } id="b1 - 0_0 (0)"
2024-09-19T13:34:57.276424Z INFO p_gen: evm_arithmetization::generation::state: CPU halted after 11092 cycles id="b1 - 0_0 (0)"
2024-09-19T13:34:57.276735Z INFO p_gen: evm_arithmetization::generation: CPU trace padded to 16384 cycles id="b1 - 0_0 (0)"
2024-09-19T13:35:00.468043Z INFO p_gen: evm_arithmetization::witness::traces: Trace lengths (before padding): TraceCheckpoint { arithmetic_len: 1705, byte_packing_len: 136, cpu_len: 16384, keccak_len: 696, keccak_sponge_len: 29, logic_len: 161, memory_len: 97894 }, mem_before_len: 66049, mem_after_len: 0 id="b1 - 0_0 (0)"
2024-09-19T13:36:05.206126Z INFO p_gen: zero::prover_state: using on demand circuit ProverStateManager { circuit_config: CircuitConfig { circuits: [16..21,8..21,8..21,4..20,8..17,4..21,17..24,16..23,7..23] }, persistence: Disk(OnDemand) } id="b1 - 0_0 (0)"
2024-09-19T13:36:05.224921Z INFO p_gen: evm_arithmetization::generation::state: CPU halted after 13466 cycles id="b1 - 0_0 (0)"
2024-09-19T13:36:05.225135Z INFO p_gen: evm_arithmetization::generation: CPU trace padded to 16384 cycles id="b1 - 0_0 (0)"
2024-09-19T13:36:05.638000Z INFO p_gen: evm_arithmetization::witness::traces: Trace lengths (before padding): TraceCheckpoint { arithmetic_len: 2051, byte_packing_len: 169, cpu_len: 16384, keccak_len: 744, keccak_sponge_len: 31, logic_len: 175, memory_len: 103964 }, mem_before_len: 66049, mem_after_len: 0 id="b1 - 0_0 (0)"
2024-09-19T13:36:25.986594Z INFO p_gen: zero::ops: segment proof (Dummy) took 88.760709993s id="b1 - 0_0 (0)"
2024-09-19T13:36:59.691013Z INFO p_gen: zero::ops: segment proof (Dummy) took 54.484886601s id="b1 - 0_0 (0)"
```
And later, when the proof for block 1 is complete and block 17 is added to the queue:
```
2024-09-19T13:39:34.592868Z INFO zero::prover: Successfully proved block 1
2024-09-19T13:39:34.592938Z INFO zero::prover: Proving block 17
```
There is a delta of 2 min 35 s (13:36:59 → 13:39:34) during which only aggregation proofs have been performed for this block, and those are orders of magnitude faster than segment proofs.
An easy way to deal with this is simply to increase the pool size (#644 makes that possible), but that goes against the initial purpose of this feature. Ideally we'd want to free a queue slot as soon as all of a block's segments have been proven, but this seems a bit hacky.
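For illustration only, in the toy sketch above that early release would amount to freeing the slot right after the segment proofs, before waiting on the previous block proof. This is a hypothetical fragment replacing the worker closure body of the earlier sketch, not actual scheduler code:

```rust
// Variation of the worker closure from the sketch above: free the slot as
// soon as the segment proofs are done, before aggregation. (Hypothetical.)
thread::sleep(SEGMENT_TIME); // all segment proofs for this block
slot_tx.send(()).unwrap(); // slot freed here: later blocks can start their segments
if let Some(prev) = prev {
    prev.recv().unwrap(); // the block proof still waits on the previous block's proof
}
thread::sleep(AGG_TIME); // aggregation + block proof
done_tx.send(()).unwrap();
```

With that change, only the cheap aggregation chain stays sequential while segment proving for later blocks can proceed, at the cost of allowing more than POOL_SIZE blocks to be partially in flight.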
I think we'll soon need some actual benchmarking criteria in CI, because the default parameters here are negatively impacting performance, and ideally we wouldn't introduce any noticeable regression in develop / main.
@atanmarko Just as a heads-up: if we can't address this before the next release, I'll bump the default pool size to a larger value to prevent any perf regression.