Commit 58bea0b

optimize neon loadu_128/storeu_128 (#384)
vld1q_u8 and vst1q_u8 have no alignment requirements. This improves performance on Oracle Cloud's VM.Standard.A1.Flex by 1.15% on a 16*1024 input, from 13920 nanoseconds down to 13800 nanoseconds (approx).
1 parent 5b9af1c commit 58bea0b

1 file changed: +2 -4 lines changed

c/blake3_neon.c

Lines changed: 2 additions & 4 deletions
@@ -10,14 +10,12 @@
 
 INLINE uint32x4_t loadu_128(const uint8_t src[16]) {
   // vld1q_u32 has alignment requirements. Don't use it.
-  uint32x4_t x;
-  memcpy(&x, src, 16);
-  return x;
+  return vreinterpretq_u32_u8(vld1q_u8(src));
 }
 
 INLINE void storeu_128(uint32x4_t src, uint8_t dest[16]) {
   // vst1q_u32 has alignment requirements. Don't use it.
-  memcpy(dest, &src, 16);
+  vst1q_u8(dest, vreinterpretq_u8_u32(src));
 }
 
 INLINE uint32x4_t add_128(uint32x4_t a, uint32x4_t b) {
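
A minimal sketch of how the byte-wise load/store pattern from this change could be sanity-checked at unaligned addresses. The names copy16_unaligned and roundtrip_ok are hypothetical and not part of blake3_neon.c. The memcpy fallback is an assumption added only so the sketch also compiles on non-NEON targets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

/* Hypothetical helper: move 16 bytes from src to dest with the same
   load/reinterpret/store sequence the commit uses. On targets without
   NEON it falls back to memcpy (an assumption of this sketch). */
static void copy16_unaligned(const uint8_t *src, uint8_t *dest) {
#if SKETCH_HAVE_NEON
  /* Byte-element load, then reinterpret the register as four u32 lanes. */
  uint32x4_t v = vreinterpretq_u32_u8(vld1q_u8(src));
  /* Reinterpret back to bytes and store with the byte-element store. */
  vst1q_u8(dest, vreinterpretq_u8_u32(v));
#else
  memcpy(dest, src, 16);
#endif
}

/* Hypothetical check: returns 1 if a 16-byte round trip is byte-exact at
   every offset 0..15 into the buffers, which covers misaligned pointers;
   returns 0 on the first mismatch. */
static int roundtrip_ok(void) {
  uint8_t in[32];
  uint8_t out[32];
  for (int i = 0; i < 32; i++) {
    in[i] = (uint8_t)(i * 7 + 1);
  }
  for (int off = 0; off < 16; off++) {
    memset(out, 0, sizeof out);
    copy16_unaligned(in + off, out + off);
    if (memcmp(in + off, out + off, 16) != 0) {
      return 0;
    }
  }
  return 1;
}
```

Calling roundtrip_ok() from a test harness should return 1 on any target. This only checks that the bytes survive the round trip and says nothing about the timing improvement reported in the commit message.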