
Remove useless IR when fusing instructions #234

Merged: 1 commit into sysprog21:master, Oct 3, 2023

Conversation

@qwe661234 (Collaborator)

Originally, memory usage increased sharply because basic blocks and IR arrays were frequently allocated and freed. To reduce memory usage, we introduce a block memory pool and an IR array memory pool.

As the analysis below shows, we lose about 10% of performance in exchange for the reduced memory usage; however, we now use KiB of memory instead of MiB.

Performance:

| Metric | Original | Proposed |
| --- | --- | --- |
| Dhrystone | 2359 DMIPS | 2085.5 DMIPS |
| CoreMark | 1967.382 iter/s | 1823.333 iter/s |

Memory usage:

| Metric | Original | Proposed |
| --- | --- | --- |
| Dhrystone | 99 MiB | 169 KiB |
| CoreMark | 2.4 MiB | 176 KiB |
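
For context, a minimal sketch of the pool idea (illustrative names only, not the actual mpool API in this PR): blocks and IR arrays are carved out of a preallocated arena and recycled through a free list, instead of going through malloc/free for every block.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size object pool (illustrative only, not the actual
 * mpool API): one backing arena plus an intrusive free list, so blocks and
 * IR arrays are recycled instead of being malloc'd/free'd individually. */
typedef struct free_node {
    struct free_node *next;
} free_node_t;

typedef struct {
    unsigned char *arena; /* single backing allocation */
    free_node_t *free;    /* recycled slots */
    size_t obj_size;
} pool_t;

static bool pool_init(pool_t *p, size_t obj_size, size_t capacity)
{
    if (obj_size < sizeof(free_node_t))
        obj_size = sizeof(free_node_t);
    p->arena = malloc(obj_size * capacity);
    if (!p->arena)
        return false;
    p->obj_size = obj_size;
    p->free = NULL;
    /* Pre-link every slot onto the free list (assumes obj_size keeps
     * slots suitably aligned). */
    for (size_t i = 0; i < capacity; i++) {
        free_node_t *n = (free_node_t *) (p->arena + i * obj_size);
        n->next = p->free;
        p->free = n;
    }
    return true;
}

static void *pool_alloc(pool_t *p)
{
    if (!p->free)
        return NULL; /* a real pool would grow (mpool_expand) here */
    free_node_t *n = p->free;
    p->free = n->next;
    return n;
}

static void pool_free(pool_t *p, void *obj)
{
    free_node_t *n = obj;
    n->next = p->free;
    p->free = n;
}
```

Because every object comes from the same arena and freed slots are simply pushed back onto the free list, peak memory stays bounded instead of growing with every block allocation.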

jserv requested a review from RinHizakura on September 27, 2023, 11:57
@qwe661234 (Collaborator, Author)

Compared to PR #232, introducing a memory pool can effectively reduce memory usage, since valgrind only records the memory you actually use. For example, we allocate 2^32 bytes of memory with mmap for the virtual data memory, but the recorded total memory usage is far less than 2^32 bytes.
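
To illustrate that point (a standalone sketch, not the emulator's actual memory setup): an anonymous mmap only reserves address space, physical pages are committed lazily when touched, and valgrind's heap summary does not count the reservation at all.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* Reserve 4 GiB of address space for guest data memory (assumes a
     * 64-bit host). No physical pages are committed until they are
     * touched, and valgrind's "total heap usage" ignores mmap entirely. */
    size_t size = (size_t) 1 << 32;
    uint8_t *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Touch only one page: resident usage stays tiny despite the 4 GiB map. */
    memset(mem, 0, 4096);
    printf("reserved %zu bytes, touched one page\n", size);
    munmap(mem, size);
    return 0;
}
```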

Here, I compare performance based on the "Benchmark / Performance regression check" and memory usage measured on the host machine.

Performance:

| Metric | PR #232 | This PR |
| --- | --- | --- |
| Dhrystone | 1058.12 DMIPS | 1084.62 DMIPS |
| CoreMark | 922.773 iter/s | 1010.928 iter/s |

Memory usage:

| Metric | Original | Proposed |
| --- | --- | --- |
| Dhrystone | 382,269 B | 169,197 B |
| CoreMark | 835,597 B | 176,605 B |

@RinHizakura (Collaborator)

I believe that the benefit of memory reduction mostly comes from mmap. If USE_MMAP is not defined, you can still see terrible memory usage. The memory pool itself can do less, since we only call mpool_free when block_map_clear happens.

However, if we decide that supporting platforms without mmap is not important, I would be fine with applying this.

@jserv (Contributor) commented Sep 27, 2023

> I believe that the benefit of memory reduction mostly comes from mmap. If USE_MMAP is not defined, you can still see terrible memory usage. The memory pool itself can do less, since we only call mpool_free when block_map_clear happens.
>
> However, if we decide that supporting platforms without mmap is not important, I would be fine with applying this.

Good point.

The WebAssembly System Interface (WASI) has no equivalent of mmap; an issue has been raised on the WASI GitHub repository highlighting the absence of mmap support in the WebAssembly environment. Consequently, if rv32emu were ported to WebAssembly, mmap would not be available.

@qwe661234, you should take into consideration the scenario where mmap is not usable on certain platforms.

@qwe661234 (Collaborator, Author)

Even if we cannot use mmap, the total memory usage for blocks and IR arrays is limited by the memory pool to sizeof(block) * block_capacity + sizeof(rv_insn_t) * block_capacity * 1024. Note that without mmap the total memory usage would exceed 2^32 bytes anyway, because of the virtual data memory. Therefore, I believe the memory usage of blocks is not a critical issue when total memory usage has already exceeded 4 GiB.
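
For a rough sense of that bound (the numbers below are purely illustrative stand-ins, not the real sizeof(block), sizeof(rv_insn_t), or block_capacity), the 1024-entry IR array reserved per block dominates the total, giving roughly 64 MiB with these values:

```c
#include <stddef.h>
#include <stdio.h>

int main(void)
{
    /* Purely illustrative values: the real sizeof(block), sizeof(rv_insn_t),
     * and block_capacity in rv32emu are likely different. */
    const size_t block_capacity = 1024;
    const size_t block_size = 64; /* stand-in for sizeof(block) */
    const size_t insn_size = 64;  /* stand-in for sizeof(rv_insn_t) */

    /* Bound from the comment above: one block struct plus a 1024-entry
     * IR array per block, both drawn from the memory pools. */
    size_t bound = block_size * block_capacity +
                   insn_size * block_capacity * 1024;
    printf("pool upper bound: %zu bytes (~%zu MiB)\n", bound, bound >> 20);
    return 0;
}
```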

I believe that using a linked list to implement the IR is still a more flexible approach if we want to remove IR nodes during instruction fusion. However, it consumes more memory than QEMU when running certain benchmarks.

src/riscv.h: review thread (outdated, resolved)
src/emulate.c: review thread (outdated, resolved)
@RinHizakura (Collaborator)

> Even if we cannot use mmap, the total memory usage for blocks and IR arrays is limited by the memory pool to sizeof(block) * block_capacity + sizeof(rv_insn_t) * block_capacity * 1024. Note that without mmap the total memory usage would exceed 2^32 bytes anyway, because of the virtual data memory. Therefore, I believe the memory usage of blocks is not a critical issue when total memory usage has already exceeded 4 GiB.

I'm not sure it is safe to say something like "although we are not good, at least we aren't that bad compared to others". In other words, the 2^32-byte requirement for data memory may itself be something we can improve, so I'm not sure we should be constrained by it.

> I believe that using a linked list to implement the IR is still a more flexible approach if we want to remove IR nodes during instruction fusion. However, it consumes more memory than QEMU when running certain benchmarks.

If mmap is the key to reducing memory consumption, it is worth noting that we can also apply mmap to #232, which may save even more memory. For a fair comparison, we should use mmap on #232 if QEMU also makes use of mmap.

@RinHizakura (Collaborator) commented Sep 28, 2023

Another idea here: what if we try to combine the advantages of both solutions? Specifically, we still use the memory pool, but allocate memory in single sizeof(rv_insn_t) units rather than 1 << 10 * sizeof(rv_insn_t) units. This lets us keep the benefit of the memory pool and also the benefit of mmap (if enabled). If mmap is unfortunately disabled on the platform, the smaller chunks still reduce the arena we have to prepare for mpool, since mpool_expand is needed less often.

A possible issue with this solution is the cost of mpool_alloc, since allocating smaller chunks means more frequent calls to this API. mpool_free should not be a big problem, because the allocation of rv_insn_t is special: all allocated objects are freed at once at block_map_clear, and this still has room to be optimized.
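
A rough sketch of this idea, reusing the hypothetical pool helpers from the earlier sketch (the real mpool_alloc interface and rv_insn_t layout may differ): the translator takes one node per decoded instruction instead of reserving a 1024-entry array per block.

```c
/* Sketch only: ir_node_t and translate_block() are placeholders, and
 * pool_t/pool_alloc() come from the pool sketch above; the real rv_insn_t
 * and mpool interfaces may differ. */
typedef struct ir_node {
    /* decoded instruction fields would live here */
    struct ir_node *next;
} ir_node_t;

static ir_node_t *translate_block(pool_t *insn_pool, int n_insn)
{
    ir_node_t *head = NULL, **tail = &head;
    for (int i = 0; i < n_insn; i++) {
        /* One small allocation per decoded instruction, instead of a
         * 1 << 10 entry array reserved for every block up front. */
        ir_node_t *ir = pool_alloc(insn_pool);
        if (!ir)
            break; /* a real implementation would expand the pool here */
        ir->next = NULL;
        *tail = ir;
        tail = &ir->next;
    }
    return head;
}
```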

@qwe661234 (Collaborator, Author) commented Sep 28, 2023

> Another idea here: what if we try to combine the advantages of both solutions? Specifically, we still use the memory pool, but allocate memory in single sizeof(rv_insn_t) units rather than 1 << 10 * sizeof(rv_insn_t) units. This lets us keep the benefit of the memory pool and also the benefit of mmap (if enabled). If mmap is unfortunately disabled on the platform, the smaller chunks still reduce the arena we have to prepare for mpool, since mpool_expand is needed less often.
>
> A possible issue with this solution is the cost of mpool_alloc, since allocating smaller chunks means more frequent calls to this API. mpool_free should not be a big problem, because the allocation of rv_insn_t is special: all allocated objects are freed at once at block_map_clear, and this still has room to be optimized.

To stay compact, we must reduce our memory usage below that of QEMU. If this approach achieves that goal while keeping acceptable performance, it would be preferable to the current approach in this PR.

@RinHizakura (Collaborator) commented Sep 28, 2023

> To stay compact, we must reduce our memory usage below that of QEMU. If this approach achieves that goal while keeping acceptable performance, it would be preferable to the current approach in this PR.

Sure, 196cffbe shows the implementation of my new idea.

196cffbe with mmap:

```
$ ./build/rv32emu ./build/dhrystone.elf
==11644==   total heap usage: 338 allocs, 338 frees, 15,204 bytes allocated
```

#234 with mmap:

```
valgrind ./build/rv32emu ./build/dhrystone.elf
==12367==   total heap usage: 336 allocs, 336 frees, 15,172 bytes allocated
```

196cffbe w/o mmap:

```
valgrind ./build/rv32emu ./build/dhrystone.elf
==13356==   total heap usage: 342 allocs, 342 frees, 252,772 bytes allocated
```

#234 w/o mmap:

```
valgrind ./build/rv32emu ./build/dhrystone.elf
==14116==   total heap usage: 338 allocs, 338 frees, 58,768,196 bytes allocated
```

I haven't carefully experimented with performance and the other benchmarks yet, but it looks like this new solution could bring some advantages.

@qwe661234 (Collaborator, Author)

> Sure, 196cffbe shows the implementation of my new idea.
>
> I haven't carefully experimented with performance and the other benchmarks yet, but it looks like this new solution could bring some advantages.

@jserv, I think this idea is better than using an array to store IRs. With this modification, we can simply remove a useless IR directly instead of skipping it.
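
To illustrate why the singly-linked form makes this easy (sketch only, reusing the hypothetical ir_node_t/pool_free names from the earlier sketches; the real rv_insn_t fields and mpool calls may differ):

```c
/* Sketch: fuse the node after `ir` into `ir` and drop it from the chain.
 * ir_node_t and pool_free() follow the earlier sketches; the real
 * rv_insn_t fields and mpool calls may differ. */
static void fuse_with_next(pool_t *insn_pool, ir_node_t *ir)
{
    ir_node_t *victim = ir->next;
    if (!victim)
        return;
    /* ... fold victim's effect into ir (the fused operation) ... */
    ir->next = victim->next;      /* unlink: no dummy/no-op IR left behind */
    pool_free(insn_pool, victim); /* slot goes straight back to the pool */
}
```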

@jserv (Contributor) commented Sep 29, 2023

> @jserv, I think @RinHizakura's proposal is better than using an array to store IRs. With this modification, we can simply remove a useless IR directly instead of skipping it.

Based on the feedback, I would like to merge pull request #232 first and then request improvements to IR manipulation that effectively remove unneeded instructions.
@qwe661234, please create new pull requests for the IR manipulation work and take action on this pull request, either by refining it or marking it as obsolete.

@RinHizakura (Collaborator)

The new implementation is merged into #232 now! We can discuss over there whether we have a consensus to use that solution.

jserv (comment marked as outdated)

src/emulate.c: review thread (outdated, resolved)
qwe661234 requested review from jserv and RinHizakura on October 3, 2023, 03:42
@jserv (Contributor) left a comment


Squash the git commits and refine the commit message to explain the benefit by use case.

Originally, we needed to keep and carefully handle useless IR when fusing
instructions. Now we can discard these useless IRs, because the IR data
structure has been changed from an array to a singly-linked list.
qwe661234 requested a review from jserv on October 3, 2023, 08:34
jserv changed the title from "Limit the memory usage of block and block IR" to "Remove useless IR when fusing instructions" on Oct 3, 2023
@jserv (Contributor) commented Oct 3, 2023

I defer to @RinHizakura for confirmation.

@RinHizakura (Collaborator) left a comment


@jserv Now it looks great to me!

jserv merged commit 8b59e9d into sysprog21:master on Oct 3, 2023
14 checks passed