-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase private segment limit for dynamic scratch kernels #80
Open
publixsubfan
wants to merge
91
commits into
ROCm:develop
Choose a base branch
from
publixsubfan:develop
base: develop
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
PAL optimized the logic for the barriers, which caused failures with CP DMA on Navi4x. Change barrier's code to match the most recent PAL optimizations. Change-Id: I55eeab20f51eb8e920bcbb4b55fbe3c7f77fd3fa
Change-Id: If1bcca45825c9899462bb95ed6f637f5af806cc8
This no longer does anything. Change-Id: I0643198a46a534a76454a5b461d010ed1776a89a
Added call to hipDeviceSynchronize in __hipUnregisterFatBinary to ensure that all HSA async signal handlers complete before removeFatBinary Change-Id: I756fecca1c2a5eae092613d8079de266399e5685
… elementSize should be 1 as width is in bytes while capturing hipMemset2DAsync. Change-Id: I8f9122a30cba0a07c097dfd7609432090caab142
This reverts commit 5f68a45. Reason for revert: due to windows staging branch using Opencl-icd-loader master Change-Id: I9cca7564a21de1733665a34da6f0322aa3b886e7
…function to be called at exit" This reverts commit 5e294f8. Change-Id: Ib9cb1cc0c3903bfba56c9a5d05ae8afe96be583a
Under ROCr physical allocations don't have initial VA and require extra flag in ROCclr. Add an option to have a mempool of physical allocations. Change-Id: I4d062fe0dd8113d4eaf6e8b51749ed56d8701d1e
Change-Id: I33d1359d5e4c871f63350d8300f726e039664d86
Change-Id: I7b58177c41dc0c6c59813977cb90e65a6cb3be72
Change-Id: Ica32017ef7b00326dfb6d1f604e126d40ad5b786
Change-Id: Ib479c744b90125b74d99cbf18b7f4b8cf765bf1c
- With https://gerrit-git.amd.com/c/lightning/ec/llvm-project/+/1002628 applied, at -O0 Kernel::dynamicParallelism() returns true but virtual queue isn't created - This causes segfault inside VirtualGPU::submitKernelInternal() when getVQVirtualAddress() is called Change-Id: Ia7af042adad2329e870c142caaac3e8fa886f8b8
- Create a vector to allow multiple TS to be stored in Command. - This would mean we dont wait for entire batch in Accumulate command to finish when we exhaust signals. - Reduce the number of signals created at init to 64. This min value may still need to be tuned but the KFD allows max of 4094 interrupt signals per device. - Store kernel names whenever they are available and not just when profiling. If we dynamically enable profiling like for Torch, a crash can happen if hipGraphInstantiate wasnt included in Torch profile scope beacuse we previously entered kernel names only when profiler is attached. Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006
…ead is being destroied by app/test's unload libamdhip64.so call. Change-Id: I8d4a8d8b6801d9f6eb745c45adf831597def0cb5
alwaysResident setting doesn't require per queue residency tracking. Thus, the logic can be skipped to avoid the lock of queues. Change-Id: Ib5cff5b79d3ecb8c2f2eb2565cf069f9a69438b0
The new logic has a lock for PAL call and doesn't require the lock for queues. Change-Id: I61b67c3c4abd2ede44809de1d6beed756766032e
… during capture => hipDeviceSynchronize is not allowed during capture. => hipEventSynchronize during capture should return hipErrorCapturedEvent error => hipEventQuery during capture should return hipErrorCapturedEvent error hipStreamSynchronize, hipEventSynchronize, hipStreamWaitEvent, hipStreamQuery For Side Stream(Stream that is not currently under capture): => If current thread is capturing in relaxed mode, calls are allowed => If any stream in current/concurrent thread is capturing in global mode, calls are not allowed => If any stream in current thread is capturing in ThreadLocal mode, calls are not allowed For Stream that is currently under capture => calls are not allowed => Any call that is not allowed during capture invalidates the capture sequence => It is invalid to call synchronous APIs during capture. Synchronous APIs, such as hipMemcpy(), enqueue work to the legacy stream and synchronize it before returning. Change-Id: I201c6e63e1a5d93fd416a3b520264c0fdbe31237
Change-Id: Id58f982edd4f17d675f7a0f61a9b4dea0baebd9b
Change-Id: Icbe67024297c92bf59139b6a2ccd2ba3674f60b1
Add cltrace compile definition for CL_TARGET_OPENCL_VERSION to OpenCL 2.2 Change-Id: Ie868ab0a6e86951afc6d07da58be942c3b736d15
When large bar is enabled, persistent memory leads to overallocation for 32 bit architecture. Change-Id: Iae39359d8128588de02e42d77fe58e868b8e71fd
- Application is passing null for parameter stream in API hipStreamWaitEvent - When event stream isn't capturing and event is not recorded, causes segfault because we are accessing deviceId() from waitStream Change-Id: I8b87ffd6f234677f68b66dcb7ef44b2ff04a7c91
…hGenericAqlPacket Dispatching multiple packets with ring the doorbell once is not supported by the lower layers Change-Id: I7665a2dcdd4ef9e47dadfe410180fed64c5a4ee0
Change-Id: Iadcdadd734e7aeeb23742e426353defa972d3ad5
If we are using the mask returned by getLastUsedSdmaEngine() then we need to apply the SDMA Read/Write mask to it before using with HSA copy_on_engine API. Change-Id: I6e5dc6c187eeb3c61ee159e9d2a0fa7b4737c06e
This also brings bfloat16 implementation closer to CUDA's. Change-Id: I23f381141faacd6537923ae9b88ada4d661db496
Change-Id: Ic41b1ad1b64cca0e31986337a83a5146d52a7328
Add kernel arguments optimization into blit path. Enabled by default on MI300. Change-Id: I2694a81b90d48ad07d86dfe4c0c64fe187bada8e
…lti devices and chunk decommit gets delayed. Change-Id: Ia4b0d5fbfa8f198776e52d14de8b22c6942f740d
Change-Id: Id23882286cb2a0d0472964ffc501ab27b7dc7f00
Signed-off-by: shadi <shadi.dashmiz@amd.com> Change-Id: Iff3ee7dcbcd24836f227fdc9bd5ff4b554ac914f
Change-Id: Ie3f3c0bcea84368c1b0607fd52b4bc7cae41c512
Make sure graph mempool unmaps VA on release Change-Id: Id3f1bd8d0115b533ae60aa5ba3676b8bf7e5b961
…a single virtual address. Change-Id: Ie678e607a64f2e5c35a10b9083185f041c5527ac
The .bat file will not be removed from windows pacakge. Also used cmake install(PROGRAMS …), which will set the correct permission rather than hard coding the permission Change-Id: I8b57778b59f70e01de949be2ea353b67eec70d2f
Change-Id: Iedd6290a813d6e43a4350709484f78e05b08adc8
Change-Id: I1f8411eae9ed49632667e244a25f223fed92c720
Change-Id: I7fda94e61121f9d3a30f4ad185b8a97712922f3c
Change-Id: Ief73138faed0af70b90186db5bde6689e0a83f88
…ss the flag to HSA API. Change-Id: I1bafeaa3096395c729723af958d609bc41e7845c
- Add LOG_TS mask for printing signal times - Read raw ticks from signals Change-Id: Ibdd0bf06c790729f6c65083a4784c97a3c3219e0
- Fix possible buffer overflow for long kernel names Change-Id: I3c51669de7ff242d03f9210ee045b6d5e7ac274a
Change-Id: I820cacd75a36363d1387e2e881c96937281bc265
Change-Id: Id3cb4b58b9efd3aceec4377d4d1d3a053c50333e
Change-Id: I3601373f680aa4bff0075f4b3b9e885e54b4600d
1.Make runtime use comgr to unbundle code objects 2.Support compressed/uncompressed modes 3.Remove HIP_USE_RUNTIME_UNBUNDLER and HIPRTC_USE_RUNTIME_UNBUNDLER to simplify logics 4.Add comgr wrapper for amd_comgr_action_info_set_bundle_entry_ids() Change-Id: Ic41b1ad1b64cca1e31986437983a5146d52a7329
The Readback and Avoid HDP Flush memory ordering workaround is used as a fallback solution only when HDP flush register is invalid Change-Id: Ic284eba1f95ed22b0270d3abeb904fb902015b1a
Unbundle compressed code objects needs comgr 2.8 or higher Change-Id: I23942d2038e19b02c3ea5d3c9c1fe5367db87136
…ltiple translation unit. Change-Id: I92179ad198abbdaf5aec9c3c4ba76eeb6b0cc761
Change-Id: I21abe109ddfabfe7640bf78a96c81a1317d31952
Change-Id: Iabb4071bb77201576bc2c0488a04f4fa188815df
Change-Id: I374ea7c3119b0c61f9846a862c4a448ddb179748
Switch commands creation to the new suballocator to avoid frequent expensive OS calls Change-Id: I3597c811820e577c15708bad8b8a41aa53acc400
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Check dynamic scratch allocations against the per-wave scratch limit in ROCR-Runtime, which is set to 8MB/warp. This is an increase in the original per-thread private segment limit from 16KiB to almost 128KiB.
A reproducer to observe this issue is here: https://gist.github.com/publixsubfan/f5fcfe9f3d826c45a80e23acf5d88de2
Details
The ROCR-Runtime repo defines the following limit for per-wave scratch memory:
https://github.com/ROCm/ROCR-Runtime/blob/master/src/core/runtime/amd_gpu_agent.cpp#L84
Likewise, LLVM specifies the following scratch limits in this chunk of code here: https://github.com/ROCm/llvm-project/blob/d0f9aa6415cde2f7b9bc6dbf385b5c77b700edec/llvm/lib/Target/AMDGPU/GCNSubtarget.h#L308-L320
(2^13-1) * 1 KiB
for GFX10 and below, this matches theMAX_WAVE_SCRATCH
value(2^15-1) * 256 B
for GFX11, this is a little larger thanMAX_WAVE_SCRATCH
(2^18-1) * 256B
for GFX12 and above, which is a little under 64MB/wave.