Skip to content

Conversation

@phlip9
Copy link
Contributor

@phlip9 phlip9 commented Apr 19, 2025

I might have gotten a bit over excited and started implementing ed25519 support (#73) 😅. I'm pretty excited about graviola -- it would seriously simplify the build issues and reduce the compile overhead we experience with aws-lc-rs and (patched) ring while yielding aws-lc-level perf. Our internal CA/TLS uses ed25519 certs exclusively, so ed25519 support is a requirement.

My plan is to spend the next several weekends implementing ed25519 support in graviola, if you'll let me. If you're not comfortable with that, I totally understand. Hopefully this PR can at least get things started.


This PR adds Rust codegen for all s2n-bignum assembly subroutines required to support ed25519 gen/sign/verify:

  • edwards25519_decode: decodes a 32B compressed edwards curve point. also tests for canonicality.
    • verify: deserialize the pubkey
  • edwards25519_scalarmulbase: computes P := [n]B, where n is a scalar and B is the basepoint.
    • gen: compute our pubkey A := [s]B
    • sign: compute R := [r]B
  • edwards25519_scalarmuldouble: computes P' := [n]P + [m]B, where n and m are scalars, B is the basepoint, and P is an edwards curve point.
    • verify: compute R' := [S]B + [k](-A')
  • bignum_mod_n25519: modular reduction mod L, the order of the edwards25519 basepoint.
    • sign+verify: mod reduce r and k scalars
  • bignum_madd_n25519: computes z := x * y + c (mod L)
    • sign: compute S := r + k * s (mod L)
  • bignum_neg_p25519: computes z := -x (mod 2^255-19)
    • verify: compute -A' := (-(A'_X), A'_Y)

You can see how aws-lc uses these in curve25519.c and curve25519_s2n_bignum_asm.c.

There's also edwards25519_encode (compress+serialize an edwards point), but s2n-bignum's arm impl is fairly pessimistic so as to be endian independent: (https://github.com/awslabs/s2n-bignum/blob/main/arm/curve25519/edwards25519_encode.S#L61-L123). I wrote this one in Rust to avoid that.

Testing

There's a basic sanity check test for edwards25519_decode to at least convince myself that the codegen works, but it's probably higher leverage to test the mid level ed25519 impl, so I've left further testing for later.

Generated assembly

I took a a pass auditing the codegen'd inline assembly. As far as I can tell, everything looks like it translated over correctly -- no missing clobbers, instructions, hoisting looks ok, etc.

The only funny thing I spotted is the somewhat wasteful 16 KiB page aligned constant tables in the generated aarch64 code. Guess that's something to optimize later :)

@ctz
Copy link
Owner

ctz commented Apr 19, 2025

The only funny thing I spotted is the somewhat wasteful 16 KiB page aligned constant tables in the generated aarch64 code. Guess that's something to optimize later :)

AIUI this alignment needs to be the highest in common use for a given architecture. As it currently stands 16KB pages are needed on macos, but theoretically the architecture supports 64KB pages. But it was kind of unclear to me whether adrp actually respects the page size, or is a remnant of the every-page-is-4KB history.

@phlip9
Copy link
Contributor Author

phlip9 commented Apr 20, 2025

AIUI this alignment needs to be the highest in common use for a given architecture. As it currently stands 16KB pages are needed on macos, but theoretically the architecture supports 64KB pages. But it was kind of unclear to me whether adrp actually respects the page size, or is a remnant of the every-page-is-4KB history.

I did a little experimental validation and can confirm adrp is not special on Apple silicon and works with 4KiB alignment. Apple Silicon CPU Optimization Guide indirectly corroborates this:

2.8.3 Long Address Generation
The Arm ISA requires a two-step instruction sequence to generate long distance PC-
relative addresses. ADRP adds a 21-bit signed immediate value to the page address
(sometimes referred to as page number) of the PC while clearing the bottom 12 bits.
This allows an address to be generated +/- 4 GiB from top of the current PC page.
ADRP is often followed by an ADD instruction with a page offset as an immediate value to
generate a full address. In many cases, the assembly will include a name instead of the
immediate values, which will be converted to immediate values by the linker.

000000000000132c ADRP X0, #0x84 ; 0x84000 + 0x001000 = 0x85000
0000000000001330 ADD X0, X0, #0xa00 ; 0x85000 + 0x000a00 = 0x85a00

So looks like just ADRP + 4KiB-aligned tables should work.

There's also ADRP+ADD which doesn't require any alignment:

; aarch64-apple-darwin
adrp     x0, .my_label@PAGE
add      x0, x0, .my_label@PAGEOFF
; aarch64-unknown-linux-gnu
adrp     x0, .my_label
add      x0, x0, :lo12:.my_label

@ctz
Copy link
Owner

ctz commented Apr 23, 2025

Ah, that's good to know - thanks for looking into that. I know that s2n-bignum upstream is working on obviating all pain in this area, so hopefully that ends up in a situation where we don't need to rewrite any assembly soon.

I will keep tabs on this PR!

@codspeed-hq
Copy link

codspeed-hq bot commented Apr 23, 2025

CodSpeed Performance Report

Merging #74 will not alter performance

Comparing phlip9:phlip9/ed25519-low-codegen (c4a7cd8) with main (a0f733b)

Summary

✅ 134 untouched benchmarks

@phlip9
Copy link
Contributor Author

phlip9 commented Apr 23, 2025

Hmm looks like some low::aarch64::edwards25519_decode label references didn't get wired up correctly. I can at least manually edit to fix compilation. Will need to take a look at the driver later

diff --git a/graviola/src/low/aarch64/edwards25519_decode.rs b/graviola/src/low/aarch64/edwards25519_decode.rs
index 5f2fbd1f..0dc9bc81 100644
--- a/graviola/src/low/aarch64/edwards25519_decode.rs
+++ b/graviola/src/low/aarch64/edwards25519_decode.rs
@@ -91,7 +91,7 @@ macro_rules! mulp {
         "add x0, " $dest ";\n"
         "add x1, " $src1 ";\n"
         "add x2, " $src2 ";\n"
-        "bl " Label!("edwards25519_decode_alt_mul_p25519", 0, Before)
+        "bl " Label!("edwards25519_decode_alt_mul_p25519", 3, After)
     )}
 }

@@ -100,7 +100,7 @@ macro_rules! nsqr {
         "add x0, " $dest ";\n"
         "mov x1, " $n ";\n"
         "add x2, " $src ";\n"
-        "bl " Label!("edwards25519_decode_alt_nsqr_p25519", 0, Before)
+        "bl " Label!("edwards25519_decode_alt_nsqr_p25519", 4, After)
     )}
 }

@ctz
Copy link
Owner

ctz commented Apr 24, 2025

Hmm looks like some low::aarch64::edwards25519_decode label references didn't get wired up correctly. I can at least manually edit to fix compilation. Will need to take a look at the driver later

Ah, there is a false assumption here that jumps only happen in the function body; not in a macro. Assuming there is one solution to resolving the label direction, we could fix this by finding that in a later pass and using it when outputting the macro. Alternatively we could expand macros that contain jumps.

@phlip9 phlip9 force-pushed the phlip9/ed25519-low-codegen branch from 8b1f482 to 4fb676b Compare May 11, 2025 01:23
@phlip9
Copy link
Contributor Author

phlip9 commented May 11, 2025

Rebased and added a "macro fixup" pass that runs on the generated Rust code if there are any macro definitions that reference labels.

@codecov
Copy link

codecov bot commented May 17, 2025

Codecov Report

❌ Patch coverage is 99.91327% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.46%. Comparing base (a0f733b) to head (c4a7cd8).
⚠️ Report is 59 commits behind head on main.

Files with missing lines Patch % Lines
graviola/src/mid/ed25519.rs 99.53% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #74      +/-   ##
==========================================
+ Coverage   99.31%   99.46%   +0.14%     
==========================================
  Files         168      181      +13     
  Lines       40269    51529   +11260     
==========================================
+ Hits        39994    51252   +11258     
- Misses        275      277       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

phlip9 added 16 commits May 18, 2025 18:44
Ex: `#define foo()   bl edwards25519_decode_alt_mul_p25519`

Before this change, the generated rust macro didn't wire up the local label
reference properly:

```rust
macro_rules! foo { () => { Q!(
    "bl " Label!("edwards25519_decode_alt_mul_p25519", 0, Before)
)} }
```

After this change:

```rust
macro_rules! foo { () => { Q!(
    "bl " Label!("edwards25519_decode_alt_mul_p25519", 4, After)
)} }
```

To support this, we now track which macros reference labels. We then track
which blocks call these macros. As long as all macro callsites are uniformly
before or after the labels they reference, we can safely replace the local label
id and search direction in the original macro definition.

It was challenging to track macro callsites in the main `RustFormatter` and also
difficult to "pre-locate" callsites while dealing with hoisting, so the actual
macro fixing happens on the generated Rust code from the `RustFormatter` pass.
@phlip9 phlip9 force-pushed the phlip9/ed25519-low-codegen branch from 4ec2dd5 to 06e69fd Compare May 19, 2025 10:03
@ctz ctz mentioned this pull request Nov 23, 2025
2 tasks
@phlip9
Copy link
Contributor Author

phlip9 commented Nov 25, 2025

Replaced by #120

@phlip9 phlip9 closed this Nov 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants