investigate chunking bug #1806

Trivo25 · 2024-08-26T17:02:40Z

Problem: Chunking is implemented in Kimchi and Pickles. However, some specific circuit configuration seem to cause a bug in Pickles. A prior RFC talks about these specific issues, which seem to be part of the wrap circuit (which needs to implement chunking as well)

RFC: https://github.com/o1-labs/rfcs/pull/45/files#diff-e7efc350e1922294d1200d0040edd95d68ca5eb09d648a1fe0ab3a572b6a7db7R70 (github comments for context as well)
- The wrap domain size is capped -- and we don't do chunking there -- so there's a limit on the number of chunks we can use until / unless we also chunk the wrap proof
How chunking is supposed to work in Pickles: https://github.com/MinaProtocol/mina/blob/develop/src/lib/pickles/test/chunked_circuits/test_chunked_circuits.ml

Goal: Figure out what would need to be done in order to fully support chunking in o1js. Is it a relatively small engineering task or does it require further research?

querolita · 2024-09-17T17:48:24Z

Observations after preliminary exploration of chunking bug in o1js with potential relations with Pickles code.

TODO's throughout Pickles code

Original PR#14284
In src/lib/pickles/compile.ml:
In src/lib/pickles/pickles.ml:
- L1306
- L1797
- L1830
In src/lib/pickles/wrap_domains.ml
- L32
- Discussion about chunking types
In src/lib/pickles/wrap_verifier.ml
- L257
In src/lib/pickles/proof.ml
- L120

Domain size of wrap circuit

According to Matthew, the wrap circuit should as well be chunked:
- chunking in o1js rfcs#45 (comment)
- chunking in o1js rfcs#45 (comment)
The large_wrap_domain test uses 2^19 rows (16 chunks) and wrap domain N2 is not sufficient.

    This circuit was compiled for proofs using the wrap domain of size 15, but the actual wrap domain size for the circuit has size 16. You should pass the ~override_wrap_domain argument to set the correct domain size

We might need to extend common.ml, which only considers 2^13, 2^14, and 2^15 for log_2_domain_size.
Also being able to pass >N2 to override_wrap_domain thus extending Mina_wire_types.Pickles_base.Proofs_verified.V1.t.
Maybe the operations performed inside the wrap also require more chunks than just 1, but fewer than the previous step?
If the wrap circuit needs such a large domain, then it needs to be chunked accordingly as in the step circuit.

Verifier

The invalid_proof test trying to prove 2^17 rows, with N1 wrap domain size, and 4 chunks, fails to verify returning (Pickles.verify dlog_check). I suppose it needs 4 chunks and not just 2 in order to fit additional step operations in the circuit.

ZK rows

the number of ZK rows should be given by the operation:

   zk_rows := (16 * num_chunks + 5) / 7

when num_chunks = 1 then zk_rows = 3.
when num_chunks = 2 then zk_rows = 4 but the above operation should give 5.
running an MVP of chunking with 2 chunks in o1js yields the following error message:
there are not enough random rows to achieve zero-knowledge (expected: 3, got: 2)
this error comes from the Kimchi prover
there, index.cs.zk_rows is 2 but zero_knowledge_limit equals 3
the zk_rows of the constraint system are created inside the build() function of the Builder trait.
no values for num_chunks gives a 2 for zk_rows.
some Kimchi bindings set a fixed 3
- inside pasta_fp_plonk_verifier_index inf caml_pasta_fp_plonk_verifier_index_dummy
- inside pasta_fq_plonk_verifier_index.rs in caml_pasta_fq_plonk_verifier_index_dummy
- inside plonk_verifier_index.rs in impl_verification_key; and in of_wasm
  - when defining w in a call to zk_w(), which is defined as
```
/// Returns the end of the circuit, which is used for introducing zero-knowledge in the permutation polynomial
pub fn zk_w<F: FftField>(domain: D<F>, zk_rows: u64) -> F {
    domain.group_gen.pow([domain.size - zk_rows])
}
```
  - and when calling permutation_vanishing_polynomial defined as (x - w^{n - zk_rows}) * (x - w^{n - zk_rows + 1}) * (x - w^{n - 1}).

Exceptions

When running a program with chunking on o1js (4 chunks, N1 wrap domain), it fails with
an uncaught error.

/workspace_root/src/bindings/ocaml/overrides.js:45
    if (err instanceof Error) throw err;

RuntimeError: unreachable
    at __rg_oom (wasm://wasm/011a91aa:wasm-function[4848]:0x3e27c2)
    at __rust_alloc_error_handler (wasm://wasm/011a91aa:wasm-function[4998]:0x3e3251)
    at alloc::alloc::handle_alloc_error::rt_error::h6c30faefac1fefed (wasm://wasm/011a91aa:wasm-function[5056]:0x3e355f)
    at alloc::alloc::handle_alloc_error::h6289c616c1280e38 (wasm://wasm/011a91aa:wasm-function[5055]:0x3e3554)
    at alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle::h41bc9aef765cae77 (wasm://wasm/011a91aa:wasm-function[3755]:0x3c27a7)
    at <o1_utils::serialization::SerdeAs as serde_with::ser::SerializeAs<T>>::serialize_as::h1da92fe6ae712792 (wasm://wasm/011a91aa:wasm-function[2483]:0x366fca)
    at serde_with::ser::const_arrays::<impl serde_with::ser::SerializeAs<[T; N]> for [As; N]>::serialize_as::h1233a3b3efbeac4b (wasm://wasm/011a91aa:wasm-function[1484]:0x2f0683)
    at kimchi::circuits::constraints::_::<impl serde::ser::Serialize for kimchi::circuits::constraints::ColumnEvaluations<F>>::serialize::h2627bc77795152ac (wasm://wasm/011a91aa:wasm-function[1126]:0x2ad0c6)
    at kimchi::prover_index::_::<impl serde::ser::Serialize for kimchi::prover_index::ProverIndex<G,OpeningProof>>::serialize::hf0dd2bc7ae7d95f7 (wasm://wasm/011a91aa:wasm-function[1589]:0x300c01)
    at caml_pasta_fp_plonk_index_encode (wasm://wasm/011a91aa:wasm-function[1909]:0x32b28c)
    at module.exports.caml_pasta_fp_plonk_index_encode (o1js/dist/node/bindings/compiled/_node_bindings/plonk_wasm.cjs:1475:14)
    at encodeProverKey (o1js/src/lib/proof-system/prover-keys.ts:105:26)
    at write_ (o1js/src/lib/proof-system/zkprogram.ts:1017:48)
    at write$0 (src/bindings/ocaml/lib/pickles_bindings.ml:470:7)
    at spec (src/bindings/ocaml/lib/pickles_bindings.ml:421:40)
    at filter_map$1 (ocaml/base/list.ml:812:14)
    at write$1 (src/bindings/ocaml/lib/pickles_bindings.ml:417:9)
    at write$0 (src/mina/src/lib/key_cache/key_cache.ml:140:5)
    at _iCZ_ (src/mina/src/lib/pickles/cache.ml:114:18)
    at <anonymous> (src/mina/src/lib/promise/js/promise.js:25:37)

According to @mitschabaude , it could be a memory issue with storing the proving key.
When running with let { verificationKey } = await MyProgram.compile({ cache: Cache.None }); to skip writing the prover key, it complaints again about a mismatch in the expected number of zero knowledge rows.

/workspace_root/src/bindings/ocaml/jsoo_exports/overrides.js:45
    if (err instanceof Error) throw err;
                              ^
Error: there are not enough random rows to achieve zero-knowledge (expected: 8, got: 3)
    at module.exports.__wbindgen_error_new (o1js/dist/node/bindings/compiled/_node_bindings/plonk_wasm.cjs:9723:17)
    at <wasm_bindgen::JsError as core::convert::From<E>>::from::he54f4fac6a715911 (wasm://wasm/01276de2:wasm-function[4435]:0x4041cf)
    at caml_pasta_fp_plonk_proof_create (wasm://wasm/01276de2:wasm-function[660]:0x2496fd)
    at module.exports.caml_pasta_fp_plonk_proof_create (o1js/dist/node/bindings/compiled/_node_bindings/plonk_wasm.cjs:2937:14)
    at caml_pasta_fp_plonk_proof_create (src/mina/src/lib/crypto/kimchi_bindings/js/bindings.js:789:15)
    at <anonymous> (src/mina/src/lib/crypto/kimchi_backend/pasta/vesta_based_plonk.ml:120:15)

This suggests a problem in the way we use the memory to write the proving key, and reaffirms the issue with the bindings not using the right number of zk rows (always 3 instead).

Hardcoded

The codebase is full of places where a hardcoded 3 is used to refer to the number of rows for zero knowledge and 1 for the number of chunks:

~~#L1570, #L184 and #L1841 in /src/lib/pickles/pickles.ml~~
~~#L457 in /src/lib/pickles/step.ml~~
~~#L63 and #L120 in /src/lib/pickles/verify.ml~~
~~#L383, #L882, #L883, #L937 and #L938 in src/lib/pickles/compile.ml~~
~~#L90 in /src/lib/pickles/unfinalized.ml~~
~~#L41 in src/lib/pickles/wrap_proof.ml~~
~~#L1566 in src/lib/pickles/wrap_verifier.ml~~
~~#L200 in /src/lib/pickles/verification_key.ml~~
~~#L196 and #L197 in /src/lib/pickles/step_branch_data.ml~~
~~#L232 in /src/lib/pickles/side_loaded_verification_key.ml~~
~~#L242 and #L251 in /src/lib/pickles/plonk_checks/plonk_checks.ml~~
#L282 in /src/lib/crypto/kimchi_bindings/stubs/src/pasta_fp_plonk_verifier_index.rs
#L281 in /src/lib/crypto/kimchi_bindings/stubs/src/pasta_fq_plonk_verifier_index.rs

Proposals

PR16057

in kimchi_bindings/wasm/src/plonk_verifier_index.rs:
- res.set(zk_w(domain, index.zk_rows)).unwrap();
- res.set(permutation_vanishing_polynomial(domain, index.zk_rows)).unwrap();
- but this has no effect on o1js test
in src/lib/pickles/compile.ml
- ((2 * (permuts + 1) * num_chunks) - 2 + permuts) / permuts instead of -1
in src/lib/pickles/step_branch_data.ml
- ((2 * (permuts + 1) * num_chunks) - 2 + permuts) / permuts instead of -1

The above fixes `invalid_proof.ml`

PR16092

create a default number of ZK rows for 1 chunk as a constant to avoid plenty of locations with the hardcoded value 3.
- less error prone, easier to keep track of.

PR2590

fix mismatch in the error message in Kimchi ProverError::NotZeroKnowledge
- the number of rows needs to be at least (16*chunks-2)/7 + 1 but the error message suggests (16*chunks-2)/7 is sufficient.
- the error must be triggered if the number of rows is <= than the strict lower bound, not just <.

I couldn't find any more locations where hardcoded values are introduced, thus I believe there must be some calls to the functions with default values.
- Number of chunks are correctly obtained at o1js/src/bindings/ocaml/lib/pickles_bindings.ml.
- Number of chunks are correctly passed onto mina/src/lib/pickles/step_branch_data.ml.
- The Kimchi error NoZeroKnowledge is raised after the step call to create_async
- The Kimchi error NoZeroKnowledge is raised before the wrap call to create_async.
- Raised before prev_evals in f function.
- Fails in Backend.Tick.Proof.create_async inside create_proof
- Fails in Backend.create_async
- Fails in Kimchi_bindings.Protocol.Proof.Fp.create which is caml_pasta_fp_plonk_proof_create

## Wrap circuit

As chunks increase, the number of commitments and verifier complexity also grows. That means the step circuit grows to fit the verifier computation. Then, the wrap circuit derived from it also grows. We would need to measure how much it grows to automatically overwrite the wrap circuit size. And when that circuit exceeds 2^16, chunk accordingly. This needs further engineering work, to replicate the same behavior in step circuits. Worst case, I would start by setting the number of chunks required in the wrap circuit to be the number of chunks of the step circuit. But it should be smaller (plus, I wonder if it would be a problem to set the chunks higher than actually needed, leaving "empty" chunks).

mitschabaude · 2024-09-18T05:48:11Z

Nice investigation @querolita!!

When running a program with chunking on o1js (4 chunks, N1 wrap domain), it fails with
an uncaught error

Judging from the trace, this is an out of memory error ("__rg_oom") that happens when writing the prover key to a file.
You could compile with zkprogram.compile({ cache: Cache.None }) to skip writing the prover key (but probably the OOM will be somewhere later then.)

We might have to investigate and improve memory usage in Kimchi/Wasm!

Trivo25 · 2024-10-16T08:52:34Z

As originally suspected, the process runs out of memory before it can finish proving a circuit with even 2 chunks (or 2^17 constraints), as a result, we need to decide what approach we want to take next:

wait until the problem resolves itself :)
investigate if we can optimise our wasm code somehow that will enable us to prove larger circuits
(part fix) enable native prover access, this means we will be able to prove large circuits (likely way larger than 2^17 constraints too) but we wont be able to do this in the browser
investigate in what state Memory64 is and if or when we can utilize it

The current wasm memory limit is 4gb, which the process reaches quickly when trying to compile and prove a circuit of size 2^17 constraints (even slightly above 2^16 constraints). I estimate we would need at least 6 to 8gb for 2^17 constraints

mitschabaude · 2024-10-16T09:21:35Z

4. investigate in what state Memory64 is and if or when we can utilize it

Memory64 can be used in Node.js with a flag, also in some nightly browsers behind a flag but you can't expect users to set those. Not sure about the state of the Rust compilation pipeline to wasm64.

https://webassembly.org/features/

Trivo25 · 2024-10-16T10:46:21Z

4. investigate in what state Memory64 is and if or when we can utilize it

Memory64 can be used in Node.js with a flag, also in some nightly browsers behind a flag but you can't expect users to set those. Not sure about the state of the Rust compilation pipeline to wasm64.

https://webassembly.org/features/

Rust can target wasm64 (memory64) as a compile target, but I am unsure if the rest of our stack is compatible with it. Either way, it would still be worth to investigate further. I could even see a world where we release chunking in an experimental state that only works with the correct node flags set, until memory64 lands in an official capacity

Trivo25 · 2025-03-10T06:14:11Z

blocked on #1879

o1-labs deleted a comment Aug 26, 2024

querolita self-assigned this Aug 29, 2024

querolita mentioned this issue Sep 4, 2024

Improved comments for chunked Lagrange evaluations o1-labs/proof-systems#2521

Merged

querolita mentioned this issue Sep 16, 2024

Small fix of zk rows for chunking MinaProtocol/mina#16057

Merged

7 tasks

This was referenced Sep 19, 2024

Remove hardcoded values for zk_rows and num_chunks in Pickles code MinaProtocol/mina#16092

Merged

[easy] Fix lower bound of zk rows for chunking o1-labs/proof-systems#2590

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

investigate chunking bug #1806

investigate chunking bug #1806

Trivo25 commented Aug 26, 2024 •

edited

Loading

querolita commented Sep 17, 2024 •

edited

Loading

mitschabaude commented Sep 18, 2024 •

edited

Loading

Trivo25 commented Oct 16, 2024

mitschabaude commented Oct 16, 2024

Trivo25 commented Oct 16, 2024

Trivo25 commented Mar 10, 2025

investigate chunking bug #1806

investigate chunking bug #1806

Comments

Trivo25 commented Aug 26, 2024 • edited Loading

querolita commented Sep 17, 2024 • edited Loading

TODO's throughout Pickles code

Domain size of wrap circuit

Verifier

ZK rows

Exceptions

Hardcoded

Proposals

PR16057

PR16092

PR2590

mitschabaude commented Sep 18, 2024 • edited Loading

Trivo25 commented Oct 16, 2024

mitschabaude commented Oct 16, 2024

Trivo25 commented Oct 16, 2024

Trivo25 commented Mar 10, 2025

Trivo25 commented Aug 26, 2024 •

edited

Loading

querolita commented Sep 17, 2024 •

edited

Loading

mitschabaude commented Sep 18, 2024 •

edited

Loading