Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase private segment limit for dynamic scratch kernels #80

Open
wants to merge 91 commits into
base: develop
Choose a base branch
from

Conversation

publixsubfan
Copy link

Check dynamic scratch allocations against the per-wave scratch limit in ROCR-Runtime, which is set to 8MB/warp. This is an increase in the original per-thread private segment limit from 16KiB to almost 128KiB.

A reproducer to observe this issue is here: https://gist.github.com/publixsubfan/f5fcfe9f3d826c45a80e23acf5d88de2

Details

The ROCR-Runtime repo defines the following limit for per-wave scratch memory:
https://github.com/ROCm/ROCR-Runtime/blob/master/src/core/runtime/amd_gpu_agent.cpp#L84

#define MAX_WAVE_SCRATCH 8387584  // See COMPUTE_TMPRING_SIZE.WAVESIZE

Likewise, LLVM specifies the following scratch limits in this chunk of code here: https://github.com/ROCm/llvm-project/blob/d0f9aa6415cde2f7b9bc6dbf385b5c77b700edec/llvm/lib/Target/AMDGPU/GCNSubtarget.h#L308-L320

  • (2^13-1) * 1 KiB for GFX10 and below, this matches the MAX_WAVE_SCRATCH value
  • (2^15-1) * 256 B for GFX11, this is a little larger than MAX_WAVE_SCRATCH
  • (2^18-1) * 256B for GFX12 and above, which is a little under 64MB/wave.

gandryey and others added 30 commits March 18, 2024 10:52
PAL optimized the logic for the barriers, which caused failures with CP DMA on Navi4x.
Change barrier's code to match the most recent PAL optimizations.

Change-Id: I55eeab20f51eb8e920bcbb4b55fbe3c7f77fd3fa
Change-Id: If1bcca45825c9899462bb95ed6f637f5af806cc8
This no longer does anything.

Change-Id: I0643198a46a534a76454a5b461d010ed1776a89a
Added call to hipDeviceSynchronize in __hipUnregisterFatBinary
to ensure that all HSA async signal handlers complete before removeFatBinary

Change-Id: I756fecca1c2a5eae092613d8079de266399e5685
… elementSize should be 1 as width is in bytes while capturing hipMemset2DAsync.

Change-Id: I8f9122a30cba0a07c097dfd7609432090caab142
This reverts commit 5f68a45.

Reason for revert: due to windows staging branch using Opencl-icd-loader master

Change-Id: I9cca7564a21de1733665a34da6f0322aa3b886e7
…function to be called at exit"

This reverts commit 5e294f8.

Change-Id: Ib9cb1cc0c3903bfba56c9a5d05ae8afe96be583a
Under ROCr physical allocations don't have initial VA and require extra
flag in ROCclr. Add an option to have a mempool of physical allocations.

Change-Id: I4d062fe0dd8113d4eaf6e8b51749ed56d8701d1e
Change-Id: I33d1359d5e4c871f63350d8300f726e039664d86
Change-Id: I7b58177c41dc0c6c59813977cb90e65a6cb3be72
Change-Id: Ica32017ef7b00326dfb6d1f604e126d40ad5b786
Change-Id: Ib479c744b90125b74d99cbf18b7f4b8cf765bf1c
- With https://gerrit-git.amd.com/c/lightning/ec/llvm-project/+/1002628 applied, at -O0 Kernel::dynamicParallelism() returns true but virtual queue isn't created
- This causes segfault inside VirtualGPU::submitKernelInternal() when getVQVirtualAddress() is called

Change-Id: Ia7af042adad2329e870c142caaac3e8fa886f8b8
- Create a vector to allow multiple TS to be stored in Command.
- This would mean we dont wait for entire batch in Accumulate command
to finish when we exhaust signals.
- Reduce the number of signals created at init to 64. This min value
may still need to be tuned but the KFD allows max of 4094 interrupt
signals per device.
- Store kernel names whenever they are available and not just when
profiling. If we dynamically enable profiling like for Torch, a crash
can happen if hipGraphInstantiate wasnt included in Torch profile scope
beacuse we previously entered kernel names only when profiler is
attached.

Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006
…ead is being destroied by app/test's unload libamdhip64.so call.

Change-Id: I8d4a8d8b6801d9f6eb745c45adf831597def0cb5
alwaysResident setting doesn't require per queue residency tracking.
Thus, the logic can be skipped to avoid the lock of queues.

Change-Id: Ib5cff5b79d3ecb8c2f2eb2565cf069f9a69438b0
The new logic has a lock for PAL call and doesn't require the lock for queues.

Change-Id: I61b67c3c4abd2ede44809de1d6beed756766032e
… during capture

=> hipDeviceSynchronize is not allowed during capture.
=> hipEventSynchronize during capture should return hipErrorCapturedEvent error
=> hipEventQuery during capture should return hipErrorCapturedEvent error
hipStreamSynchronize, hipEventSynchronize, hipStreamWaitEvent, hipStreamQuery
For Side Stream(Stream that is not currently under capture):
=> If current thread is capturing in relaxed mode, calls are allowed
=> If any stream in current/concurrent thread is capturing in global mode, calls are not allowed
=> If any stream in current thread is capturing in ThreadLocal mode, calls are not allowed
For Stream that is currently under capture
=> calls are not allowed
=> Any call that is not allowed during capture invalidates the capture sequence
=> It is invalid to call synchronous APIs during capture. Synchronous APIs,
such as hipMemcpy(), enqueue work to the legacy stream and synchronize it before returning.

Change-Id: I201c6e63e1a5d93fd416a3b520264c0fdbe31237
Change-Id: Id58f982edd4f17d675f7a0f61a9b4dea0baebd9b
Change-Id: Icbe67024297c92bf59139b6a2ccd2ba3674f60b1
Add cltrace compile definition for CL_TARGET_OPENCL_VERSION to OpenCL 2.2

Change-Id: Ie868ab0a6e86951afc6d07da58be942c3b736d15
When large bar is enabled, persistent memory leads to overallocation for 32 bit architecture.

Change-Id: Iae39359d8128588de02e42d77fe58e868b8e71fd
- Application is passing null for parameter stream in API hipStreamWaitEvent
- When event stream isn't capturing and event is not recorded, causes segfault because we are accessing deviceId() from waitStream

Change-Id: I8b87ffd6f234677f68b66dcb7ef44b2ff04a7c91
…hGenericAqlPacket

Dispatching multiple packets with ring the doorbell once is not supported by the lower layers

Change-Id: I7665a2dcdd4ef9e47dadfe410180fed64c5a4ee0
Change-Id: Iadcdadd734e7aeeb23742e426353defa972d3ad5
If we are using the mask returned by getLastUsedSdmaEngine() then we
need to apply the SDMA Read/Write mask to it before using with HSA
copy_on_engine API.

Change-Id: I6e5dc6c187eeb3c61ee159e9d2a0fa7b4737c06e
This also brings bfloat16 implementation closer to CUDA's.

Change-Id: I23f381141faacd6537923ae9b88ada4d661db496
Change-Id: Ic41b1ad1b64cca0e31986337a83a5146d52a7328
Add kernel arguments optimization into blit path.
Enabled by default on MI300.

Change-Id: I2694a81b90d48ad07d86dfe4c0c64fe187bada8e
…lti devices and chunk decommit gets delayed.

Change-Id: Ia4b0d5fbfa8f198776e52d14de8b22c6942f740d
gandryey and others added 25 commits April 25, 2024 11:24
Change-Id: Id23882286cb2a0d0472964ffc501ab27b7dc7f00
Signed-off-by: shadi <shadi.dashmiz@amd.com>
Change-Id: Iff3ee7dcbcd24836f227fdc9bd5ff4b554ac914f
Change-Id: Ie3f3c0bcea84368c1b0607fd52b4bc7cae41c512
Make sure graph mempool unmaps VA on release

Change-Id: Id3f1bd8d0115b533ae60aa5ba3676b8bf7e5b961
…a single virtual address.

Change-Id: Ie678e607a64f2e5c35a10b9083185f041c5527ac
The .bat file will not be removed from windows pacakge.
Also used cmake install(PROGRAMS …), which will set the correct permission rather than hard coding the permission

Change-Id: I8b57778b59f70e01de949be2ea353b67eec70d2f
Change-Id: Iedd6290a813d6e43a4350709484f78e05b08adc8
Change-Id: I1f8411eae9ed49632667e244a25f223fed92c720
Change-Id: I7fda94e61121f9d3a30f4ad185b8a97712922f3c
Change-Id: Ief73138faed0af70b90186db5bde6689e0a83f88
…ss the flag to HSA API.

Change-Id: I1bafeaa3096395c729723af958d609bc41e7845c
- Add LOG_TS mask for printing signal times
- Read raw ticks from signals

Change-Id: Ibdd0bf06c790729f6c65083a4784c97a3c3219e0
- Fix possible buffer overflow for long kernel names

Change-Id: I3c51669de7ff242d03f9210ee045b6d5e7ac274a
Change-Id: I820cacd75a36363d1387e2e881c96937281bc265
Change-Id: Id3cb4b58b9efd3aceec4377d4d1d3a053c50333e
Change-Id: I3601373f680aa4bff0075f4b3b9e885e54b4600d
1.Make runtime use comgr to unbundle code objects
2.Support compressed/uncompressed modes
3.Remove HIP_USE_RUNTIME_UNBUNDLER and
  HIPRTC_USE_RUNTIME_UNBUNDLER to simplify logics
4.Add comgr wrapper for
  amd_comgr_action_info_set_bundle_entry_ids()

Change-Id: Ic41b1ad1b64cca1e31986437983a5146d52a7329
The Readback and Avoid HDP Flush memory ordering workaround is
used as a fallback solution only when HDP flush register is invalid

Change-Id: Ic284eba1f95ed22b0270d3abeb904fb902015b1a
Unbundle compressed code objects needs comgr 2.8 or higher

Change-Id: I23942d2038e19b02c3ea5d3c9c1fe5367db87136
…ltiple translation unit.

Change-Id: I92179ad198abbdaf5aec9c3c4ba76eeb6b0cc761
Change-Id: I21abe109ddfabfe7640bf78a96c81a1317d31952
Change-Id: Iabb4071bb77201576bc2c0488a04f4fa188815df
Change-Id: I374ea7c3119b0c61f9846a862c4a448ddb179748
Switch commands creation to the new suballocator to avoid
frequent expensive OS calls

Change-Id: I3597c811820e577c15708bad8b8a41aa53acc400
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.