KFDMemoryTests fail with 5.11rc7 and gfx1030 (hsa intermittently fails) #108

Open
powderluv opened this issue Feb 21, 2021 · 5 comments

@powderluv

powderluv commented Feb 21, 2021

On a 5950X machine running Linux 5.11.0-051100rc7 (5.11-rc7), I am seeing the following KFDMemoryTest failures with gfx1030:

[----------] 25 tests from KFDMemoryTest
[ RUN ] KFDMemoryTest.MMapLarge
[ ] Successfully registered and mapped 117GB system memory to gpu
[ OK ] KFDMemoryTest.MMapLarge (1243 ms)
[ RUN ] KFDMemoryTest.MapUnmapToNodes
[ ] Skipping test: At least two GPUs are required.
[ OK ] KFDMemoryTest.MapUnmapToNodes (30 ms)
[ RUN ] KFDMemoryTest.MapMemoryToGPU
[ OK ] KFDMemoryTest.MapMemoryToGPU (6 ms)
[ RUN ] KFDMemoryTest.InvalidMemoryPointerAlloc
[ OK ] KFDMemoryTest.InvalidMemoryPointerAlloc (5 ms)
[ RUN ] KFDMemoryTest.ZeroMemorySizeAlloc
[ OK ] KFDMemoryTest.ZeroMemorySizeAlloc (5 ms)
[ RUN ] KFDMemoryTest.MemoryAlloc
[ OK ] KFDMemoryTest.MemoryAlloc (5 ms)
[ RUN ] KFDMemoryTest.AccessPPRMem
[ ] Skipping test: Test requires APU.
[ OK ] KFDMemoryTest.AccessPPRMem (5 ms)
[ RUN ] KFDMemoryTest.MemoryRegister
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/Dispatch.cpp:95: Failure
Value of: (hsaKmtWaitOnEvent(m_pEop, timeout))
Actual: 31
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/BaseQueue.cpp:122: Failure
Value of: WaitOnValue(m_Resources.Queue_read_ptr, RptrWhenConsumed(), timeOut)
Actual: false
Expected: true
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/KFDMemoryTest.cpp:482: Failure
Value of: WaitOnValue(&stackData[sdmaOffset], 0x12345678)
Actual: false
Expected: true
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/Dispatch.cpp:95: Failure
Value of: (hsaKmtWaitOnEvent(m_pEop, timeout))
Actual: 31
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/BaseQueue.cpp:122: Failure
Value of: WaitOnValue(m_Resources.Queue_read_ptr, RptrWhenConsumed(), timeOut)
Actual: false
Expected: true
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/KFDMemoryTest.cpp:530: Failure
Value of: stackData[dstOffset]
Actual: 3735928559
Expected: 0xD00BED00
Which is: 3490442496
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/KFDMemoryTest.cpp:531: Failure
Value of: stackData[sdmaOffset]
Actual: 3735928559
Expected: 0xD0BED0BE
Which is: 3502166206
[ FAILED ] KFDMemoryTest.MemoryRegister (10379 ms)
[ RUN ] KFDMemoryTest.MemoryRegisterSamePtr
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/BaseQueue.cpp:122: Failure
Value of: WaitOnValue(m_Resources.Queue_read_ptr, RptrWhenConsumed(), timeOut)
Actual: false
Expected: true
/home/foo/github/roct-thunk-interface/tests/kfdtest/src/KFDMemoryTest.cpp:593: Failure
Value of: WaitOnValue((unsigned int *)(&mem[2]), 0xdeadbeef)
Actual: false
Expected: true
[ FAILED ] KFDMemoryTest.MemoryRegisterSamePtr (4256 ms)
[ RUN ] KFDMemoryTest.FlatScratchAccess
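
Editor's note on reading the log above: gtest prints the actual values in decimal, which hides a recognizable pattern. The sketch below just converts the numbers copied from the log to hex. The helper name `as_hex` is made up for illustration and is not part of kfdtest.

```python
# Decimal values copied verbatim from the gtest output above.
ACTUAL = 3735928559
EXPECTED = {
    "stackData[dstOffset]": 3490442496,
    "stackData[sdmaOffset]": 3502166206,
}

def as_hex(value: int) -> str:
    """Format a value as an 8-digit, 32-bit hex constant."""
    return f"0x{value & 0xFFFFFFFF:08X}"

print("actual:", as_hex(ACTUAL))
for name, value in EXPECTED.items():
    print(name, "expected:", as_hex(value))
```

Both failing locations read back as 0xDEADBEEF rather than the expected 0xD00BED00 and 0xD0BED0BE. One plausible reading, not confirmed anywhere in this thread, is that the buffer still holds a placeholder value and the GPU and SDMA writes never landed, which would be consistent with the wait timeouts reported just before.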

hsakmt is built from the rocm-4.0.x branch and compiled with the latest AOMP 13.x.

I first stumbled on this with my test program here:

https://github.com/powderluv/LLVM-AMDGPU-Assembler-Extra/blob/master/examples/asm-kernel/asm-kernel.s

It fails about 2 out of 3 times but sometimes runs successfully.

foo@5950x:/github/LLVM-AMDGPU-Assembler-Extra/b/examples/asm-kernel$ ./asm-kernel
Using agent: gfx1030

Success

foo@5950x:/github/LLVM-AMDGPU-Assembler-Extra/b/examples/asm-kernel$ ./asm-kernel
Queue at 0x7f3899753000 inactivated due to async error:
HSA_STATUS_ERROR_INVALID_ALLOCATION: The requested allocation is not valid.
^C
130 foo@5950x:/github/LLVM-AMDGPU-Assembler-Extra/b/examples/asm-kernel$ ./asm-kernel
Queue at 0x7f9dc3f5f000 inactivated due to async error:
HSA_STATUS_ERROR_INVALID_ALLOCATION: The requested allocation is not valid.
^[[A^C
130 foo@5950x:/github/LLVM-AMDGPU-Assembler-Extra/b/examples/asm-kernel$ ./asm-kernel
Queue at 0x7fa0543a9000 inactivated due to async error:
HSA_STATUS_ERROR_INVALID_ALLOCATION: The requested allocation is not valid.
^C
130 foo@5950x:/github/LLVM-AMDGPU-Assembler-Extra/b/examples/asm-kernel$ ./asm-kernel
Using agent: gfx1030

Success
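
Editor's note: since the failure is intermittent and the failing runs above had to be interrupted with ^C, it may help to quantify it with a small harness instead of rerunning by hand. This is a minimal sketch, assuming a good run exits with status 0 and prints "Success" on stdout as in the transcript above. The function name `run_trials` and the timeout value are illustrative choices, not something from the original report.

```python
import subprocess

def run_trials(cmd, trials=10, timeout_s=15.0):
    """Run cmd repeatedly and tally runs as ok, failed, or hung (timed out)."""
    tally = {"ok": 0, "failed": 0, "hung": 0}
    for _ in range(trials):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child on timeout, replacing the manual ^C.
            tally["hung"] += 1
            continue
        # Assumption: success means exit status 0 plus "Success" on stdout.
        if result.returncode == 0 and "Success" in result.stdout:
            tally["ok"] += 1
        else:
            tally["failed"] += 1
    return tally
```

For example, `run_trials(["./asm-kernel"], trials=20)` returns a dictionary of counts that can be compared across kernels or driver versions.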

Here are some related issues:
ROCm/HIP#2238
ROCm/aomp#187 (rocminfo output is attached to this bug)

@fxkamd
Contributor

fxkamd commented Feb 22, 2021

Can you provide a kernel log (dmesg) and the .config used to build your kernel?

Can you also try a supported kernel with the DKMS driver and firmware to help narrow down the problem?

@powderluv
Author

It is the default config file from the Ubuntu nightly builds. The image is from: https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.11-rc7/amd64/

I need something newer than 5.10 to address this bug (https://bugzilla.kernel.org/show_bug.cgi?id=210593) with Ryzen CPU support.

I can build from git with any debug options if required. The relevant pieces of dmesg are below; the full dmesg is also attached.

[ 541.246085] [drm] kiq ring mec 2 pipe 1 q 0
[ 541.271938] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 541.272151] [drm] JPEG decode initialized successfully.
[ 541.273686] kfd kfd: Allocated 3969056 bytes on gart
[ 541.273963] Virtual CRAT table created for GPU
[ 541.274494] amdgpu: Topology: Add dGPU node [0x73bf:0x1002]
[ 541.274497] kfd kfd: added device 1002:73bf
[ 541.274501] amdgpu 0000:31:00.0: amdgpu: SE 4, SH per SE 2, CU per SH 10, active_cu_number 80
[ 541.274630] amdgpu 0000:31:00.0: [drm] Cannot find any crtc or sizes
[ 541.274774] amdgpu 0000:31:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 541.274777] amdgpu 0000:31:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 541.274780] amdgpu 0000:31:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 541.274781] amdgpu 0000:31:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 541.274783] amdgpu 0000:31:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 541.274785] amdgpu 0000:31:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 541.274787] amdgpu 0000:31:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 541.274789] amdgpu 0000:31:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 541.274791] amdgpu 0000:31:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 541.274793] amdgpu 0000:31:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[ 541.274795] amdgpu 0000:31:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 541.274797] amdgpu 0000:31:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 541.274799] amdgpu 0000:31:00.0: amdgpu: ring sdma2 uses VM inv eng 14 on hub 0
[ 541.274801] amdgpu 0000:31:00.0: amdgpu: ring sdma3 uses VM inv eng 15 on hub 0
[ 541.274803] amdgpu 0000:31:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
[ 541.274805] amdgpu 0000:31:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
[ 541.274807] amdgpu 0000:31:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
[ 541.274810] amdgpu 0000:31:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 5 on hub 1
[ 541.274812] amdgpu 0000:31:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 6 on hub 1
[ 541.274814] amdgpu 0000:31:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 7 on hub 1
[ 541.274816] amdgpu 0000:31:00.0: amdgpu: ring jpeg_dec uses VM inv eng 8 on hub 1
[ 541.287958] [drm] Initialized amdgpu 3.40.0 20150101 for 0000:31:00.0 on minor 1
[ 607.112107] amdgpu: init_user_pages: failed to validate BO
[ 611.188995] amdgpu: init_user_pages: failed to validate BO
[ 613.265813] [drm:amdgpu_ttm_backend_bind [amdgpu]] ERROR failed to pin userptr
[ 613.265933] amdgpu: init_user_pages: failed to validate BO
...
[ 980.203828] amdgpu 0000:31:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0xe9ef75c39e0 flags=0x0000]
[ 980.204064] amdgpu 0000:31:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0xe9ef75c8010 flags=0x0020]
[ 980.204338] amdgpu 0000:31:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0xe9ef75c3a60 flags=0x0000]
[ 980.204550] amdgpu 0000:31:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0xe9ef75c8010 flags=0x0020]
[ 980.204758] AMD-Vi: Event logged [IO_PAGE_FAULT device=31:00.0 domain=0x0021 address=0xe9ef75c3d20 flags=0x0000]
[ 980.204972] AMD-Vi: Event logged [IO_PAGE_FAULT device=31:00.0 domain=0x0021 address=0xe9ef75c8010 flags=0x0020]
dmesg.log
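
Editor's note: the two symptoms that stand out in the excerpt above are the userptr failures (`init_user_pages: failed to validate BO`, `failed to pin userptr`) and the AMD-Vi `IO_PAGE_FAULT` events. A minimal sketch for pulling both out of a full dmesg dump follows. The function name `summarize_dmesg` is hypothetical, and the patterns are derived only from the lines quoted in this comment, so other kernels may format these messages differently.

```python
import re

# Matches both AMD-Vi variants quoted above (with and without "device=").
FAULT_RE = re.compile(
    r"IO_PAGE_FAULT .*?address=(0x[0-9a-fA-F]+) flags=(0x[0-9a-fA-F]+)"
)
USERPTR_MARKERS = (
    "init_user_pages: failed to validate BO",
    "failed to pin userptr",
)

def summarize_dmesg(text):
    """Return (userptr failure count, {fault address: occurrences})."""
    userptr_failures = 0
    fault_addresses = {}
    for line in text.splitlines():
        if any(marker in line for marker in USERPTR_MARKERS):
            userptr_failures += 1
        match = FAULT_RE.search(line)
        if match:
            address = int(match.group(1), 16)
            fault_addresses[address] = fault_addresses.get(address, 0) + 1
    return userptr_failures, fault_addresses
```

It could be applied to the attachment with something like `summarize_dmesg(open("dmesg.log").read())`. Repeated fault addresses, such as 0xe9ef75c8010 in the excerpt, then show up with a count greater than one.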

@fxkamd
Contributor

fxkamd commented Feb 22, 2021

Please check this:
grep CONFIG_DRM_AMDGPU_USERPTR /boot/config-$(uname -r)

This should report
CONFIG_DRM_AMDGPU_USERPTR=y

If that's not the case, I recommend you report a bug against the Ubuntu kernel. This is a required feature for KFD to work. I submitted a patch upstream recently to make KFD select this automatically during the kernel build process.
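
Editor's note: the same check can be scripted when several machines or kernels need comparing. The sketch below parses kernel config text and reports the value of each option of interest. `CONFIG_DRM_AMDGPU_USERPTR` is the option named above. Including `CONFIG_HSA_AMD` (the KFD option) is the editor's addition, and the function names are illustrative.

```python
import platform

# CONFIG_DRM_AMDGPU_USERPTR comes from the comment above;
# CONFIG_HSA_AMD is added here as an assumption about what else to check.
REQUIRED = ("CONFIG_DRM_AMDGPU_USERPTR", "CONFIG_HSA_AMD")

def check_config(config_text, options=REQUIRED):
    """Return {option: value, or None if unset} from kernel config text."""
    found = {option: None for option in options}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # "# CONFIG_FOO is not set" lines are treated as unset
        key, sep, value = line.partition("=")
        if sep and key in found:
            found[key] = value
    return found

def running_kernel_config_path():
    """Config path for the running kernel, like /boot/config-$(uname -r)."""
    return f"/boot/config-{platform.release()}"
```

On a distribution that installs the config under /boot, `check_config(open(running_kernel_config_path()).read())` should report `'y'` for `CONFIG_DRM_AMDGPU_USERPTR` if the feature is built in.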

@powderluv
Author

powderluv commented Feb 22, 2021

I don't think that is the issue, since it is enabled.

@5950x:~/github/roct-thunk-interface/tests/kfdtest/b$ grep CONFIG_DRM_AMDGPU_USERPTR /boot/config-5.11.0-051100rc7-generic
CONFIG_DRM_AMDGPU_USERPTR=y
@5950x:~/github/roct-thunk-interface/tests/kfdtest/b$ grep AMD /boot/config-5.11.0-051100rc7-generic
CONFIG_X86_AMD_PLATFORM_DEVICE=y
CONFIG_CPU_SUP_AMD=y
CONFIG_X86_MCE_AMD=y
# CONFIG_PERF_EVENTS_AMD_POWER is not set
CONFIG_MICROCODE_AMD=y
CONFIG_AMD_MEM_ENCRYPT=y
# CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set
CONFIG_AMD_NUMA=y
CONFIG_X86_AMD_FREQ_SENSITIVITY=m
CONFIG_AMD_NB=y
CONFIG_KVM_AMD=m
CONFIG_KVM_AMD_SEV=y
CONFIG_MTD_CFI_AMDSTD=m
CONFIG_MTD_AMD76XROM=m
CONFIG_PATA_AMD=m
CONFIG_NET_VENDOR_AMD=y
CONFIG_AMD8111_ETH=m
CONFIG_AMD_XGBE=m
CONFIG_AMD_XGBE_DCB=y
CONFIG_AMD_XGBE_HAVE_ECC=y
CONFIG_AMD_PHY=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_I2C_AMD756=m
CONFIG_I2C_AMD756_S4882=m
CONFIG_I2C_AMD8111=m
CONFIG_I2C_AMD_MP2=m
CONFIG_SPI_AMD=m
CONFIG_PINCTRL_AMD=y
CONFIG_GPIO_AMDPT=m
CONFIG_GPIO_AMD_FCH=m
CONFIG_GPIO_AMD8111=m
CONFIG_SENSORS_AMD_ENERGY=m
CONFIG_AGP_AMD64=y
CONFIG_DRM_AMDGPU=m
CONFIG_DRM_AMDGPU_SI=y
CONFIG_DRM_AMDGPU_CIK=y
CONFIG_DRM_AMDGPU_USERPTR=y
# CONFIG_DRM_AMDGPU_GART_DEBUGFS is not set
CONFIG_DRM_AMD_ACP=y
CONFIG_DRM_AMD_DC=y
CONFIG_DRM_AMD_DC_DCN=y
CONFIG_DRM_AMD_DC_HDCP=y
CONFIG_DRM_AMD_DC_SI=y
CONFIG_HSA_AMD=y
CONFIG_SND_SOC_AMD_ACP=m
CONFIG_SND_SOC_AMD_CZ_DA7219MX98357_MACH=m
CONFIG_SND_SOC_AMD_CZ_RT5645_MACH=m
CONFIG_SND_SOC_AMD_ACP3x=m
CONFIG_SND_SOC_AMD_RV_RT5682_MACH=m
CONFIG_SND_SOC_AMD_RENOIR=m
CONFIG_SND_SOC_AMD_RENOIR_MACH=m
# AMD SFH HID Support
CONFIG_AMD_SFH_HID=m
# end of AMD SFH HID Support
CONFIG_USB_AMD5536UDC=m
CONFIG_EDAC_AMD64=m
# CONFIG_EDAC_AMD64_ERROR_INJECTION is not set
CONFIG_AMD_PMC=m
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_V2=m
# CONFIG_NTB_AMD is not set
CONFIG_AMDTEE=m

@ppanchad-amd

@powderluv Can you please check whether the issue still exists with the latest ROCm 6.2? If not, please close the ticket. Thanks!
