Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove unnecessary host functions allocator usage #4

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open
Changes from 9 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
210 changes: 210 additions & 0 deletions text/0004-remove-unnecessary-allocator-usage.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,210 @@
# RFC-0004: Remove unnecessary host functions allocator usage

| | |
| --------------- | ------------------------------------------------------------------------------------------- |
| **Start Date** | 2023-07-04 |
| **Description** | Add alternatives to host functions that make use of the allocator when unnecessary |
| **Authors** | Pierre Krieger |

## Summary

Add new versions of host functions in order to avoid using the allocator when the buffer being allocated is of a size known at compilation time.

## Motivation

The heap allocation of the runtime is currently controlled by the host using a memory allocator on the host side.

The API of many host functions consists in allocating a buffer. For example, when calling `ext_hashing_twox_256_version_1`, the host allocates a 32 bytes buffer using the host allocator, and returns a pointer to this buffer to the runtime. The runtime later has to call `ext_allocator_free_version_1` on this pointer in order to free the buffer.

Even though no benchmark has been done, it is pretty obvious that this design is very inefficient. To continue with the example of `ext_hashing_twox_256_version_1`, it would be more efficient to instead write the output hash to a buffer that was allocated by the runtime on its stack and passed by pointer to the function. Allocating a buffer on the stack in the worst case scenario simply consists in decreasing a number, and in the best case scenario is free. Doing so would save many Wasm memory reads and writes by the allocator, and would save a function call to `ext_allocator_free_version_1`.

After this RFC, only the following functions have no exact equivalent that doesn't use the allocator:

- `ext_storage_get`
- `ext_default_child_storage_get`
- `ext_storage_next_key`
- `ext_default_child_storage_next_key`
- `ext_crypto_ed25519_public_keys`
- `ext_crypto_sr25519_public_keys`
- `ext_crypto_ecdsa_public_keys`
- `ext_offchain_network_state`
- `ext_offchain_local_storage_get`
- `ext_offchain_http_response_wait`
- `ext_offchain_http_response_headers`
- `ext_offchain_http_response_read_body`

## Stakeholders

No attempt was made at convincing stakeholders. The writer of this RFC believes that these changes are pretty non-controversial.

## Explanation

This RFC proposes to introduce the following new host functions:

```wat
(func $ext_storage_read_version_2
(param $key i64) (param $value_out i64) (param $offset i32) (result i64))
Copy link

@koute koute Feb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result could be an i32 here; even with an i32 a return value of -1 is not ambiguous as you'll physically never be able to actually allocate a buffer of such size.

So, let me see if I got this right, is the idea here that:

  1. the guest prepares a buffer of n bytes,
  2. it calls this host function and passes a pointer to this buffer and its size,
  3. the host reads the storage and writes m bytes into the buffer, where m <= n and returns m,
  4. if m == n (so either the buffer was exactly the right size or we still have some data left) then the guest enlarges the buffer and calls the host function again with a bigger offset and repeats this until the whole value is read (and if the guest guesses wrong then this could be called many times),

correct?

If so maybe instead we could do this: return an i64 which would be a packed (i32, i32) (or a i64 with its upper bit reserved) where the first value is going to be whether the read succeeded, and the second value is going to be the actual size of the storage value, and have the algorithm go like this:

  1. the guest prepares a buffer of n bytes,
  2. it calls this host function and passes a pointer to this buffer and its size,
  3. let m be the size of the storage value; if n <= m then write it to the buffer and return (1, m)
  4. if n > m then return (0, m)
  5. on the guest size if a 0 was returned then enlarge the buffer to the exact size needed and call the function again (which is guaranteed to succeed now), otherwise we're done

And also do not require that "if the pointer to the buffer is invalid then runtime execution stops even if length is 0".

This way we gain the following benefits:

  1. if the buffer is exactly the right size we will only call this function once (as opposed to twice with the current proposal),
  2. if the buffer is not big enough then we will only call this function twice (as opposed to twice or more with the current proposal),
  3. there's only going to be one host-guest memcpy (because if the buffer is not big enough then nothing will be copied) which is going to be faster/more efficient,
  4. the guest can choose whether it wants to try its luck guessing the initial size of the buffer (risk having to reallocate the buffer in the worst case but read the value in only one call in the best case), or it can call the host function with a size of "0" to ask for the right size and prepare a buffer which is of exactly the right size (always read the value in two calls, but won't ever need to reallocate the buffer)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result could be an i32 here; even with an i32 a return value of -1 is not ambiguous as you'll physically never be able to actually allocate a buffer of such size.

I went with i64 because I'm not really a fan of "-1i32 is not ambiguous". If we ever switch to a 64bits runtime, then it's no longer impossible for 4 GiB to be allocated. Sure, a 4 GiB storage item would be unusually big, but it's also not in the real of completely absurd.

correct?

Yes, but I think that your explanation builds upon a wrong assumption.

if m == n (so either the buffer was exactly the right size or we still have some data left) then the guest enlarges the buffer and calls the host function again with a bigger offset and repeats this until the whole value is read (and if the guest guesses wrong then this could be called many times),

The idea when I designed this function is that the runtime is supposed to know what the size of the storage item is. When you're for example reading a u32 from storage, you know it's always 4 bytes. When you're reading a Vec<something>, you can first read the length prefix in front of the Vec (which realistically shouldn't be more than 5 bytes) and then allocate a Vec of exactly the correct size.

Allocating a buffer whose size is equal to the size of the runtime storage item builds upon the assumption that you can't decode SCALE-encoded items in a streaming way. This assumption is true in practice now, but in the absolute it isn't.

To give an example, if you have a storage item whose format is Vec<Vec<u8>>, right now you have to allocate a big buffer that contains the entire storage item, then when decoding you copy the data within each inner Vec. It seems more optimal to me (but I could be wrong) to call storage_read once to get the size of the outer Vec, then twice for each inner Vec, and read the data of each inner Vec<u8>s directly its their destination buffer.

(func $ext_default_child_storage_read_version_2
(param $child_storage_key i64) (param $key i64) (param $value_out i64)
(param $offset i32) (result i64))
```

The signature and behaviour of `ext_storage_read_version_2` and `ext_default_child_storage_read_version_2` is identical to their version 1 equivalent, but the return value has a different meaning.
The new functions directly return the number of bytes that were written in the `value_out` buffer. If the entry doesn't exist, a value of `-1` is returned. Given that the host must never write more bytes than the size of the buffer in `value_out`, and that the size of this buffer is expressed as a 32 bits number, a 64bits value of `-1` is not ambiguous.

```wat
(func $ext_hashing_keccak_256_version_2
(param $data i64) (param $out i32))
(func $ext_hashing_keccak_512_version_2
(param $data i64) (param $out i32))
(func $ext_hashing_sha2_256_version_2
(param $data i64) (param $out i32))
(func $ext_hashing_blake2_128_version_2
(param $data i64) (param $out i32))
(func $ext_hashing_blake2_256_version_2
(param $data i64) (param $out i32))
(func $ext_hashing_twox_64_version_2
(param $data i64) (param $out i32))
(func $ext_hashing_twox_128_version_2
(param $data i64) (param $out i32))
(func $ext_hashing_twox_256_version_2
(param $data i64) (param $out i32))
Copy link

@burdges burdges Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the twox hashes still being used anywhere? I'd thought we deprecated them since they're footguns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they are still used for hashing the key of a storage value which is a static key.

Copy link
Contributor Author

@tomaka tomaka Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are used in order to know where to store stuff. For example the pallet Foo stores stuff under the key xxhash64("Foo").
While this is out of scope of this RFC, this is also very inefficient (literally every single storage access calculates the xxhashes of the pallet and the item) and should be moved to compile time.

Copy link

@burdges burdges Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We deprecated twox entirely though, no? twox_64 is xxhash64. twox_128 is simply two copies of xxhash64: https://github.com/paritytech/polkadot-sdk/blob/cfa19c37e60f094a2954ffa73b1da261c050f61f/substrate/primitives/core/hashing/src/lib.rs#L77

It's therefore easy to find collisions for both twox flavors, which yields two storage locations which your code believes to be different, but actually wind up being the same. It technically facilitates supply chain attacks against parachains, even if used only at compile time.

Cyan4973/xxHash#165 (comment)
Cyan4973/xxHash#54 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't quote me on that, but it is likely that nowadays the only input to the twox hash functions are strings hardcoded in the source code (pallet names and storage key names).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't quote me on that, but it is likely that nowadays the only input to the twox hash functions are strings hardcoded in the source code (pallet names and storage key names).

Yes this. I mean it is up to the user, but for maps the default is blake2.

should be moved to compile time.

This is also already done.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll avoid any pallet dependencies with bizarre hard coded storage names then I guess. lol

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also already done.

We do not need host calls for twox hash then I guess.

(func $ext_trie_blake2_256_root_version_3
(param $data i64) (param $version i32) (param $out i32))
(func $ext_trie_blake2_256_ordered_root_version_3
(param $data i64) (param $version i32) (param $out i32))
(func $ext_trie_keccak_256_root_version_3
(param $data i64) (param $version i32) (param $out i32))
(func $ext_trie_keccak_256_ordered_root_version_3
(param $data i64) (param $version i32) (param $out i32))
(func $ext_default_child_storage_root_version_3
(param $child_storage_key i64) (param $out i32))
(func $ext_crypto_ed25519_generate_version_2
(param $key_type_id i32) (param $seed i64) (param $out i32))
(func $ext_crypto_sr25519_generate_version_2
(param $key_type_id i32) (param $seed i64) (param $out i32) (return i32))
(func $ext_crypto_ecdsa_generate_version_2
(param $key_type_id i32) (param $seed i64) (param $out i32) (return i32))
```

The behaviour of these functions is identical to their version 1 or version 2 equivalent. Instead of allocating a buffer, writing the output to it, and returning a pointer to it, the new version of these functions accepts an `out` parameter containing the memory location where the host writes the output. The output is always of a size known at compilation time.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected behavior (trap?) for an out-of-bounds memory access ought to be specified.

Copy link
Contributor Author

@tomaka tomaka Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're actually (I guess semi-unintentionally) raising a good point.

If we take for example ext_crypto_ed25519_sign_version_2, on success the function returns 0 and writes to out, and on failure the function returns 1 and doesn't write anything. Should a trap happen if out is out of range even in the situation where 1 would be returned? The answer is important for determinism.

I've added some text mentioning that yes, the range of out must always be checked. This is a bit arbitrary, but I think that passing a completely invalid input has "higher priority" than a failure in what is being demanded from the function.


```wat
(func $ext_default_child_storage_root_version_3
(param $child_storage_key i64) (param $out i32))
(func $ext_storage_root_version_3
(param $out i32))
```

The behaviour of these functions is identical to their version 1 and version 2 equivalents. Instead of allocating a buffer, writing the output to it, and returning a pointer to it, the new versions of these functions accepts an `out` parameter containing the memory location where the host writes the output. The output is always of a size known at compilation time.

I have taken the liberty to take the version 1 of these functions as a base rather than the version 2, as a PPP deprecating the version 2 of these functions has previously been accepted: <https://github.com/w3f/PPPs/pull/6>.

```wat
(func $ext_storage_clear_prefix_version_3
(param $prefix i64) (param $limit i64) (param $removed_count_out i32)
(return i32))
(func $ext_default_child_storage_clear_prefix_version_3
(param $child_storage_key i64) (param $prefix i64)
(param $limit i64) (param $removed_count_out i32) (return i32))
(func $ext_default_child_storage_kill_version_4
(param $child_storage_key i64) (param $limit i64)
(param $removed_count_out i32) (return i32))
```

The behaviour of these functions is identical to their version 2 and 3 equivalent. Instead of allocating a buffer, writing the output to it, and returning a pointer to it, the version 3 and 4 of these functions accepts a `removed_count_out` parameter containing the memory location to a 8 bytes buffer where the host writes the number of keys that were removed in little endian. The functions return 1 to indicate that there are keys remaining, and 0 to indicate that all keys have been removed.

Note that there is an alternative proposal to add new host functions with the same names: <https://github.com/w3f/PPPs/pull/7>. This alternative doesn't conflict with this one except for the version number. One proposal or the other will have to use versions 4 and 5 rather than 3 and 4.

```wat
(func $ext_crypto_ed25519_sign_version_2
(param $key_type_id i32) (param $key i32) (param $msg i64) (param $out i32) (return i32))
(func $ext_crypto_sr25519_sign_version_2
(param $key_type_id i32) (param $key i32) (param $msg i64) (param $out i32) (return i32))
func $ext_crypto_ecdsa_sign_version_2
(param $key_type_id i32) (param $key i32) (param $msg i64) (param $out i32) (return i32))
(func $ext_crypto_ecdsa_sign_prehashed_version_2
(param $key_type_id i32) (param $key i32) (param $msg i64) (param $out i32) (return i64))
```

The behaviour of these functions is identical to their version 1 equivalents. The new versions of these functions accept an `out` parameter containing the memory location where the host writes the signature. The signatures are always of a size known at compilation time. On success, these functions return `0`. If the public key can't be found in the keystore, these functions return `1` and do not write anything to `out`.

Note that the return value is 0 on success and 1 on failure, while the previous version of these functions write 1 on success (as it represents a SCALE-encoded `Some`) and 0 on failure (as it represents a SCALE-encoded `None`). Returning 0 on success and non-zero on failure is consistent with common practices in the C programming language and is less surprising than the opposite.

```wat
(func $ext_crypto_secp256k1_ecdsa_recover_version_3
(param $sig i32) (param $msg i32) (param $out i32) (return i64))
(func $ext_crypto_secp256k1_ecdsa_recover_compressed_version_3
(param $sig i32) (param $msg i32) (param $out i32) (return i64))
```

The behaviour of these functions is identical to their version 2 equivalents. The new versions of these functions accept an `out` parameter containing the memory location where the host writes the signature. The signatures are always of a size known at compilation time. On success, these functions return `0`. On failure, these functions return a non-zero value and do not write anything to `out`.

The non-zero value written on failure is:

- 1: incorrect value of R or S
- 2: incorrect value of V
- 3: invalid signature

These values are equal to the values returned on error by the version 2 (see <https://spec.polkadot.network/chap-host-api#defn-ecdsa-verify-error>), but incremented by 1 in order to reserve 0 for success.

```wat
(func $ext_offchain_http_request_start_version_2
(param $method i64) (param $uri i64) (param $meta i64) (result i32))
```

The behaviour of this function is identical to its version 1 equivalent. Instead of allocating a buffer, writing the request identifier in it, and returning a pointer to it, the version 2 of this function simply returns the newly-assigned identifier to the HTTP request. On failure, this function returns 0. An identifier of 0 is invalid and is reserved to indicate failure.

Host implementers should be aware that, because a zero identifier value was previously valid, this might require slightly more changes to the host than just adding the new function.

```wat
(func $ext_offchain_http_request_write_body_version_2
(param $method i64) (param $uri i64) (param $meta i64) (result i32))
```

The behaviour of this function is identical to its version 1 equivalent. Instead of allocating a buffer, writing two bytes in it, and returning a pointer to it, the new version of this function simply indicates what happened:

- 0 on success.
- 1 if the deadline was reached.
- 2 if there was an I/O error while processing the request.
- 3 if the identifier of the request is invalid.

These values are equal to the values returned on error by the version 1 (see <https://spec.polkadot.network/chap-host-api#defn-http-error>), but incremented by 1 in order to reserve 0 for success.

```wat
(func $ext_offchain_submit_transaction_version_2
(param $data i64) (return i32))
(func $ext_offchain_http_request_add_header_version_2
(param $request_id i32) (param $name i64) (param $value i64) (result i32))
```

The behaviour of these functions is identical to their version 1 equivalent. Instead of allocating a buffer, writing `1` or `0` in it, and returning a pointer to it, the version 2 of these functions simply return 1 or 0.

## Drawbacks

- This RFC might be difficult to implement in Substrate due to the internal code design. It is not clear to the author of this RFC how difficult it would be.

- In some situations, the runtime might still have to call `ext_allocator_malloc_version_1` then call one of the new functions, which is slightly less performant than when the two operations are combined into one call. The author of this RFC believes that this negligible, and that the performance saved by not allocating a buffer in most situations is worth the trade-off.

## Prior Art

The API of these new functions was heavily inspired by API used by the C programming language.

## Unresolved Questions

No unresolved questions.

## Future Possibilities

The existence of the host-side allocator has become questionable over time. It is implemented in a very naive way, and for determinism and backwards compatibility reasons it needs to be implemented exactly identically in every client implementation. Furthermore, runtimes make substantial use of heap memory allocations, and each allocation needs to go twice through the runtime <-> host boundary (once for allocating and once for freeing). Moving the allocator to the runtime side, while it would increase the size of the runtime, would be a good idea. But before the host-side allocator can be deprecated, all the host functions that make use of it need to be updated to not use it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you move the allocator to the runtime side?

Do you just give it a big block of memory and let it split it up as desired?

Copy link
Contributor Author

@tomaka tomaka Jul 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really see the problem? The runtime could just do whatever it wants with its memory.

If the question is how to make sure that the memory usage is bounded, my personal opinion would be that at initialization the runtime does storage_get(":heappages") and puts the value in its memory allocator, and the allocator crashes if it reaches the limit.

An alternative way is to rely on the fact that the wasm memory needs to be grown by the wasm code with a memory.grow opcode, so the host can detect when memory usage is over a certain threshold.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an issue, I'm just not familiar enough with Wasm and was curious about how this would be done in practice

Copy link
Contributor Author

@tomaka tomaka Jul 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the allocator back into the runtime would make the runtime more similar to a normal program (as normal programs manage their memory themselves, with the exception of "the sbrk mechanism" which is mimicked by memory.grow in wasm).
The fact that the allocator is outside of the runtime is what requires additional code/hacks. So the answer on how to move the allocator back to the runtime is simply "remove the hacks".


When it comes to removing the allocator usage from the functions that still use it after this RFC, it can in my opinion be done like this:

- The `ext_crypto_*_public_keys`, `ext_offchain_network_state`, and `ext_offchain_http_*` host functions can be updated to no longer use the allocator by splitting them in two: one function to query the number of items and one function to query an individual item.
- The `ext_offchain_local_storage_get` host functions should be deprecated in favor of a new `ext_offchain_local_storage_read` host function.
- The `ext_storage_next_key` and `ext_default_child_storage_next_key` host functions should also be updated to get a new `read`-like API. Because storage keys are organized by concatenating hashes, their length is always known ahead of time, and as such a buffer containing the next key can be allocated ahead of time.
- The `ext_storage_get` and `ext_default_child_storage_get` host functions should be deprecated in favor of `ext_storage_read` and `ext_default_child_storage_read`. Most of the time (numbers, cryptographic public keys, etc.), the runtime knows ahead of the time the size of the data that it is going to read. Using `read` instead of `get` can be problematic in situations where the size of the data is not known ahead of time, which is `Vec`s and `String`s, as it might require reallocations and mutiple calls to `read` instead of just one to `get`. While `Vec`s and `String`s are pretty uncommon, it is difficult to know ahead of time whether this is an actual problem, and this would need to be benchmarked. A theoretical alternative approach could be to provide an API similar to `mmap`, where the runtime maps a storage value in its memory, but this theoretical approach might not realistically be implementable.

Furthermore, the input data provided to the runtime is also allocated using the allocator. New host functions that allow reading the input from the runtime would have to be added.

Because all these changes might be controversial and might require benchmarking, and that the removal of the allocator might also be controversial, I have decided to not include them as part of this RFC.