plan about my dev v2 #37

Open
yuq opened this issue Mar 20, 2018 · 28 comments

Comments

@yuq
Owner

yuq commented Mar 20, 2018

The previous plan is done, so I'm starting a new one.

I've set up a mali450 board for mali450 dev and found that the kernel driver's HW ops are not stable: for example, the L2 cache and MMU reset commands time out. So I want to refine and fix the kernel driver, which may also help with some problems found during mali400 dev. After this, I can send an RFC to the kernel DRM driver mailing list for feedback.

@anarsoul
Contributor

Looking forward to seeing the lima driver mainlined :)

@yuq
Owner Author

yuq commented Apr 21, 2018

Progress update

  1. fixed the PP error irq and MMU fault caused by an insufficient PLB number
  2. full mali450 support, with DLBU and BCAST used for PP jobs

Next

  1. there are still some bugs to fix in the kernel
  2. try to use TTM as our MM if possible

@anarsoul
Contributor

anarsoul commented Apr 21, 2018

@yuq PP error irq isn't fixed for mali400. I still get this on some runs of glmark:

[  768.373615] lima 1c40000.gpu: pp error irq state=201 status=40
[  768.390193] lima 1c40000.gpu: pp error irq state=201 status=40

@anarsoul
Contributor

Also I get MMU faults in kmscube -M rgba:

[ 2045.982632] lima 1c40000.gpu: mmu page fault at 0xe9bf80 from bus id 0 of type read on ppmmu1
[ 2046.001755] lima 1c40000.gpu: mmu page fault at 0xe997c0 from bus id 0 of type read on ppmmu0
[ 2046.021156] lima 1c40000.gpu: mmu resume
[ 2046.035419] lima 1c40000.gpu: mmu resume

@yuq
Owner Author

yuq commented Apr 22, 2018

What's your screen resolution when this kind of error happens? I remember you said you have a 2536x1440 monitor?

I fixed this error for the case of 1920x1080 with the PLB number not set to max. But there's another dimension I haven't tried: the PLB size. The PLB size can be 128, 256, 512 or 1024. In dumps of the mali driver I always see it set to 512, and lima-ng does the same. But maybe at higher resolutions it should be increased to 1024.

@anarsoul
Contributor

My monitor resolution is 2560x1440.
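
For a sense of scale, here is a minimal sketch that compares the tile counts of the two resolutions discussed above. It assumes the 16x16 pixel tile size used by Mali-400/450 PPs; the helper name `tiles` is hypothetical, and how the tile count maps onto the PLB number or PLB size is not stated in this thread.

```python
TILE = 16  # assumption: Mali-400/450 PPs render in 16x16 pixel tiles


def tiles(width, height, tile=TILE):
    """Return (tiles across, tiles down, total tiles), rounding partial tiles up."""
    tiled_w = -(-width // tile)   # ceiling division
    tiled_h = -(-height // tile)
    return tiled_w, tiled_h, tiled_w * tiled_h


for w, h in ((1920, 1080), (2560, 1440)):
    print(f"{w}x{h}: {tiles(w, h)}")
```

Under that assumption, 2560x1440 has 14400 tiles against 8160 for 1920x1080, so any per-tile buffer has to cover noticeably more tiles at the higher resolution.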

@yuq
Owner Author

yuq commented Apr 22, 2018

So does increasing LIMA_CTX_PLB_BLK_SIZE to 1024 solve the error on your side?

@anarsoul
Contributor

No, with LIMA_CTX_PLB_BLK_SIZE = 1024 kmscube doesn't work at all - and I get this in dmesg:

[  288.667891] lima 1c40000.gpu: pp error irq state=200 status=41
[  288.683349] lima 1c40000.gpu: pp error irq state=200 status=41

@yuq
Owner Author

yuq commented Apr 22, 2018

OK, maybe there are other places that need to be configured for 1024 PLB, like the DLBU reg:
1c7700f#diff-15af9d78941ee5e81caea488e2910f77R1092

I just hard-coded 0x20000000 for 512 PLB; for 1024 PLB it should be 0x30000000. So there may be a similar field for mali400 that we haven't discovered. We can first dump the mali driver at 2560x1440 and see whether it uses the 1024 PLB size, and if so, where this field is.
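
The two values quoted above (0x20000000 for 512, 0x30000000 for 1024) are consistent with a 2-bit size index placed at bit 28. A minimal sketch of that reading follows. The values for 128 and 256 are an extrapolation from those two data points, not something confirmed in this thread, and `dlbu_plb_field` is a hypothetical helper name.

```python
PLB_BLOCK_SIZES = (128, 256, 512, 1024)  # the sizes listed earlier in the thread


def dlbu_plb_field(block_size):
    """Return the DLBU value for a PLB block size, assuming index = log2(size) - 7 at bit 28."""
    if block_size not in PLB_BLOCK_SIZES:
        raise ValueError(f"unsupported PLB block size: {block_size}")
    index = block_size.bit_length() - 8  # 128 -> 0, 256 -> 1, 512 -> 2, 1024 -> 3
    return index << 28


for size in PLB_BLOCK_SIZES:
    print(f"{size:5d} -> {dlbu_plb_field(size):#010x}")
```

Deriving the value from LIMA_CTX_PLB_BLK_SIZE in one place, rather than hard-coding it, would keep the register in step whenever the block size is changed.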

@anarsoul
Contributor

Here's the dump: https://drive.google.com/file/d/16WDMIvAeE6-wK4NYvepF8R0YEfJXEUHD/view?usp=sharing - I'm not really sure what to look for.

@yuq
Owner Author

yuq commented Apr 22, 2018

From your dump, although the gp stream mem is missing, I can see in the pp stream mem that it's still 512 PLB. But I also found that LIMA_CTX_PLB_BLK_SIZE is not used everywhere it should be in the code, so I fixed that with:
376b3c8

With this fix, 1024 PLB works. Could you try it again?

@anarsoul
Contributor

1024 PLB works now, but it's the same as 512 - I'm still getting mmu fault in 'kmscube -m rgba':

[  138.162217] lima 1c40000.gpu: mmu page fault at 0x1bd400 from bus id 0 of type read on ppmmu0
[  138.181672] lima 1c40000.gpu: mmu page fault at 0x1bd400 from bus id 0 of type read on ppmmu1
[  138.200989] lima 1c40000.gpu: mmu resume
[  138.215337] lima 1c40000.gpu: mmu resume

and pp error in glmark2-es2-drm -b build:

[  300.957500] lima 1c40000.gpu: pp error irq state=200 status=41
[  300.973596] lima 1c40000.gpu: pp error irq state=201 status=40

Btw, everything that uses textures stutters for me, e.g. the textured cube or 'glmark2-es-drm -b pulsar'
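
When comparing runs like the ones above, it can help to tally the dmesg lines instead of reading them by eye. Below is a minimal sketch that counts MMU page faults per (MMU unit, address) and PP error irqs per (state, status). The regular expressions are written against the log lines quoted in this thread only, and `summarize` is a hypothetical helper name. The state and status values are kept as strings because the thread does not say which radix they are printed in.

```python
import re
from collections import Counter

# Patterns match the lima dmesg lines quoted in this thread.
MMU_FAULT = re.compile(
    r"mmu page fault at (0x[0-9a-f]+) from bus id (\d+) of type (\w+) on (\w+)"
)
PP_ERROR = re.compile(r"pp error irq state=([0-9a-f]+) status=([0-9a-f]+)")


def summarize(log_text):
    """Return (fault counts keyed by (unit, address), pp error counts keyed by (state, status))."""
    faults = Counter()
    pp_errors = Counter()
    for line in log_text.splitlines():
        match = MMU_FAULT.search(line)
        if match:
            addr, _bus, _kind, unit = match.groups()
            faults[(unit, int(addr, 16))] += 1
            continue
        match = PP_ERROR.search(line)
        if match:
            pp_errors[match.groups()] += 1
    return faults, pp_errors
```

Usage: pass the output of `dmesg` as a string, for example `faults, pp_errors = summarize(open("dmesg.txt").read())`, where `dmesg.txt` is a placeholder file name.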

@yuq
Owner Author

yuq commented Apr 22, 2018

OK, then it seems not to be a PLB size problem. As for the textures, is it caused by the compiler:
https://www.mail-archive.com/mesa-dev@lists.freedesktop.org/msg189216.html

@anarsoul
Contributor

anarsoul commented Apr 23, 2018

Oh, I wasn't aware of this change in mesa-18.0. That explains stuttering.

As for the issue - I suspect it's something related to cache, since it works fine roughly 4 out of 5 times and fails on the 5th.

@yuq
Owner Author

yuq commented Apr 23, 2018

The compiler's scalar-back-to-vec problem will get worse with 18.1. But I want to focus on the kernel for now, so I left it with an incomplete workaround.

The issue may be a cache problem. Another possibility is the switch_delay: I found that on an Amlogic chip at high frequency (>500MHz) it has to be bigger than 0xff, otherwise the chip becomes unstable. Not sure if this affects your chip.

@anarsoul
Contributor

Setting switch-delay to 0xffff doesn't help for ppmmu error, but "pp error irq state=200" goes away. Looks like Mali400 in Allwinner A64 needs switch-delay 0xffff to work properly. Does it make sense to make switch-delay = 0xffff default value?

@anarsoul
Contributor

And I think I understand when "ppmmu error" happens - it always happens if I run some app that uses textures and press ctrl+c to interrupt it. I believe the driver tears down the MMU mapping while the PP is still running.

@yuq
Owner Author

yuq commented Apr 23, 2018

I don't know if it's proper to always set the switch delay to 0xffff, as some platforms set this value to 0xff and some set it to 0xffff in the mali driver; the value also depends on the clk freq. Does the proprietary A64 mali kernel driver set it to 0xffff or 0xff?

As for the ppmmu error, whether or not your guess is right, the kernel driver indeed has no mechanism to prevent this situation. If the user calls vm_unmap before the PP task is done, this result is expected. If the user is interrupted and the resources are freed because the dev file descriptor is closed, we may add some code to wait for the task to finish.
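
The "wait for the task to finish before freeing" idea can be sketched as a toy user-space model. This is not lima kernel code: the `Context` class, its methods, and the use of a sleeping thread to stand in for a PP job are all hypothetical, chosen only to show the ordering that a close path would need to enforce.

```python
import threading
import time


class Context:
    """Toy model of a per-file GPU context (hypothetical, not lima's real structures)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self.events = []  # ordered record of what happened

    def submit(self, name, duration):
        """Start a fake PP job that runs for `duration` seconds."""
        with self._cond:
            self._in_flight += 1
        worker = threading.Thread(target=self._run, args=(name, duration))
        worker.start()
        return worker

    def _run(self, name, duration):
        time.sleep(duration)  # stands in for the PP executing the job
        with self._cond:
            self.events.append(f"done {name}")
            self._in_flight -= 1
            self._cond.notify_all()

    def close(self):
        """Tear down the mapping only after every in-flight job has finished."""
        with self._cond:
            self._cond.wait_for(lambda: self._in_flight == 0)
            self.events.append("unmap")
```

With `ctx = Context(); ctx.submit("pp0", 0.05); ctx.close()`, the recorded order is `["done pp0", "unmap"]`: the unmap never precedes job completion, which is the property the ctrl+c case is missing.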

@anarsoul
Contributor

If I read this code correctly, it uses 0x0 as delay since there's no pmu_switch_delay in device tree:
https://github.com/mripard/sunxi-mali/blob/master/r6p2/src/devicedrv/mali/linux/mali_osk_mali.c#L244

What does 0x0 mean in this case? Highest possible delay?

@yuq
Owner Author

yuq commented Apr 23, 2018

Are you sure the switch delay reg is set to 0? According to the comment, that means the min delay or no delay.

@anarsoul
Contributor

I verified it, and it's setting it to 0.

@yuq
Owner Author

yuq commented Apr 23, 2018

Then if you set it to 0 in the lima kernel driver, does that fix your pp error too?

@yuq
Owner Author

yuq commented May 13, 2018

Progress update:

  1. The switch to TTM as MM is done, but I left buffer eviction and swap unimplemented because I don't know whether GP/PP support MMU fault recovery (the mali kernel driver doesn't implement it either); this needs reverse engineering. Otherwise we may implement it by pinning/unpinning buffers at task creation/deletion.
  2. Implemented EGL_ANDROID_native_fence_sync for atomic modesetting; "kmscube -A" is now supported.

I'll prepare an RFC for the kernel driver soon.

@anarsoul
Contributor

@yuq please CC me on your RFC patches

@yuq
Owner Author

yuq commented May 14, 2018

@anarsoul no problem.

@yuq
Owner Author

yuq commented May 21, 2018

The RFC has been sent:
https://lists.freedesktop.org/archives/dri-devel/2018-May/177314.html

@mirh

mirh commented May 26, 2018

Soo.. I noticed a guy noticed you are missing some Mali architectures there.
You can even add ARCH_U8500, ARCH_HISI, ARCH_MEDIATEK, ARCH_SPRD, ARCH_ZX and ARCH_TANGO

@yuq
Owner Author

yuq commented May 26, 2018

Oh, I didn't know there were so many ARCH options. I've now decided to just write it like this:
ARM || ARM64 || COMPILE_TEST

Thanks for pointing this out.
