configs: Consider adding CONFIG_UDMABUF=y #6706

Open
rmader opened this issue Mar 7, 2025 · 29 comments

Comments

@rmader
Contributor

rmader commented Mar 7, 2025

Describe the bug

From the docs:

udmabuf is a device driver which allows userspace to create dmabufs. The memory used for these dmabufs must be backed by memfd.

It is becoming increasingly popular for multimedia-related tasks and, from Fedora 41 and systemd 257 on, is available by default to users.
The RPi5 can notably benefit from it as it lacks hardware video decoders for most common codecs such as H264, VP9 and AV1 - udmabuf allows software decoders to allocate buffers that can be used by the powerful display engine.

Thus I'd like to kindly request adding CONFIG_UDMABUF=y to the configs :)

Some more context:
udmabuf was recently enabled by default in systemd and is used by the libcamera software ISP, mesa-llvmpipe and, hopefully soon, gstreamer. The latter MR is an experiment, allowing software decoded video to be displayed much more efficiently compared to other approaches.


See also:

Steps to reproduce the behaviour

Device (s)

Raspberry Pi 5

System

Logs

No response

Additional context

No response

@pelwell
Contributor

pelwell commented Mar 7, 2025

Any thoughts, @jc-kynesim, @cillian64, @naushir and @6by9?

@rmader
Contributor Author

rmader commented Mar 7, 2025

P.S. in case you wonder: CONFIG_UDMABUF is very similar to CONFIG_DMABUF_HEAPS_SYSTEM, however it doesn't suffer from memory accounting issues like the latter and was thus considered safe to be exposed to users by default (systemd/systemd#33738, as mentioned above already).

That makes it much more attractive for apps/toolkits/libraries to support.

@naushir
Contributor

naushir commented Mar 7, 2025

I'm not opposed to this change, but our libcamera based camera applications already do their own buffer allocations through the dma heap instead of relying on the kernel drivers to do this:
https://github.com/raspberrypi/rpicam-apps/blob/main/core/dma_heaps.cpp
https://github.com/raspberrypi/rpicam-apps/blob/74abee8f2de519fed9cb88d39473d1dd4cf50b62/core/rpicam_app.cpp#L1014

This provides all the efficiency (i.e. userland managed and cached) that we require. So I doubt we would change to using UDMABUF allocations for these applications.

@popcornmix
Collaborator

Kodi can support udmabuf in addition to dma heap. Currently we go through the dma heap path, but it looks like if udmabuf is available it will take precedence.

It's possible the cacheability may be different (and according to PR whether buffer is contiguous, although I would hope it would be if we are trying to display it), so I'd like to test whether anything goes wrong with udmabuf enabled.

@jc-kynesim
Contributor

jc-kynesim commented Mar 7, 2025 via email

@jc-kynesim
Contributor

If as I suspect this cannot do CMA we should only enable it on IOMMU capable Pis otherwise it will just be confusing.

@popcornmix
Collaborator

If as I suspect this cannot do CMA we should only enable it on IOMMU capable Pis otherwise it will just be confusing.

That's not straightforward. bcm2711_defconfig is used on 64-bit pi3, pi4 and pi5 (with 4k pagesize).
It could be enabled only for bcm2712_defconfig, but so far the only difference in 2711/2712 is the pagesize.

@jc-kynesim
Contributor

Which of 2711 / 2712 do we ship by default on Pi5? If it is 2712 then I'd be in favour of just adding it to 2712_defconfig but I can see there are other valid opinions here.

@popcornmix
Collaborator

popcornmix commented Mar 7, 2025

We ship both. 2712 (16k pagesize) is the default on Pi5 but there are some compatibility issues with software that assumes a 4k pagesize, so some users run with 2711 / 4k (kernel=kernel8.img).

@naushir
Contributor

naushir commented Mar 7, 2025

Cursory googling suggests UDMABUFS do allocate contiguous blocks.

@jc-kynesim
Contributor

jc-kynesim commented Mar 7, 2025

Cursory googling suggests UDMABUFS do allocate contiguous blocks.

You've found a different udmabuf interface! (same as I did first time)

@jc-kynesim
Contributor

I'm happy to leave the decision to someone else, but I do know that passing a non-CMA dma buffer to something expecting CMA produces very confusing errors, especially as it sometimes works as the memory just happens to be contiguous (this normally happens on the first run through then everything fails on subsequent runs).

@pelwell
Contributor

pelwell commented Mar 7, 2025

How do Kodi et al detect UDMABUF support, and is there scope for enabling via a cmdline.txt setting?

@rmader
Contributor Author

rmader commented Mar 7, 2025

FWIW, I was just able to confirm that my main use-case, improving performance for sw-decoded video, works as expected on the RPi5. With this Gstreamer MR I can fluently play 1080p AV1 (8bit) on a 4K screen as the udmabuf buffers are passed through from Gstreamer -> GTK4 -> Gnome-Shell/Mutter -> KMS:

cat /sys/kernel/debug/dri/1/framebuffer 
framebuffer[680]:
	allocated by = KMS thread
	refcount=2
	format=YU12 little-endian (0x32315559)
	modifier=0x0
	size=1920x1080
	layers:
		size[0]=1920x1080
		pitch[0]=2048
		offset[0]=0
		obj[0]:
			name=0
			refcount=4
			start=0010e304
			size=3317760
			imported=yes
			dma_addr=0x0000000a7b400000
			vaddr=0000000000000000
		size[1]=960x540
		pitch[1]=1024
		offset[1]=2211840
		obj[1]:
			name=0
			refcount=4
			start=0010e304
			size=3317760
			imported=yes
			dma_addr=0x0000000a7b400000
			vaddr=0000000000000000
		size[2]=960x540
		pitch[2]=1024
		offset[2]=2764800
		obj[2]:
			name=0
			refcount=4
			start=0010e304
			size=3317760
			imported=yes
			dma_addr=0x0000000a7b400000
			vaddr=0000000000000000
framebuffer[679]:
	allocated by = [fbcon]
	refcount=1
	format=RG16 little-endian (0x36314752)
	modifier=0x0
	size=3840x2160
	layers:
		size[0]=3840x2160
		pitch[0]=7680
		offset[0]=0
		obj[0]:
			name=0
			refcount=2
			start=00100000
			size=16588800
			imported=no
			dma_addr=0x0000000a7f000000
			vaddr=00000000f79cec9e

Playing 4K@30FPS unfortunately is still not smooth, but passthrough works as well.

So both GPU and display engine seem to handle the buffers pretty well already. Will quickly test the RPi4 - IIUC the display engine there shouldn't support the import, however the GPU should.
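
The plane offsets in the YU12 framebuffer dump above can be cross-checked with a little arithmetic. This is only a sketch: the helper name is made up for illustration, and it assumes the chroma pitch is half the luma pitch and that the three planes are packed back to back, which is what the dump suggests.

```python
def yuv420_planar_layout(height, luma_pitch):
    """Return (offsets, pitches, total_size) for a 3-plane YU12 buffer.

    Assumes chroma planes use half the luma pitch and half the height,
    and that planes follow each other without extra padding.
    """
    chroma_pitch = luma_pitch // 2
    chroma_height = (height + 1) // 2
    luma_size = luma_pitch * height
    chroma_size = chroma_pitch * chroma_height
    offsets = [0, luma_size, luma_size + chroma_size]
    pitches = [luma_pitch, chroma_pitch, chroma_pitch]
    return offsets, pitches, luma_size + 2 * chroma_size


# Values taken from the 1920x1080 framebuffer in the dump (pitch[0]=2048).
offsets, pitches, total_size = yuv420_planar_layout(1080, 2048)
print(offsets, pitches, total_size)
# → [0, 2211840, 2764800] [2048, 1024, 1024] 3317760
```

These match offset[1], offset[2] and size in the dump, so all three planes live in one 3317760-byte buffer object.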

@jc-kynesim
Contributor

FWIW I think this https://github.com/torvalds/linux/blob/master/include/uapi/linux/udmabuf.h is the interface under discussion
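
For readers unfamiliar with that header, here is a rough sketch of how a userspace program might create a dmabuf through it. Treat it as an illustration under assumptions rather than a reference implementation: the function name and memfd name are invented, the ioctl number is computed with the generic `_IOW` encoding used on arm64 and x86 (some other architectures encode it differently), and the struct layout is taken from `struct udmabuf_create` (memfd, flags, offset, size).

```python
import fcntl
import os
import struct

# struct udmabuf_create { __u32 memfd; __u32 flags; __u64 offset; __u64 size; }
UDMABUF_CREATE_FMT = "=IIQQ"
UDMABUF_FLAGS_CLOEXEC = 0x01


def _iow(type_char, nr, size):
    """Generic _IOW() encoding (assumption: arm64/x86 style direction bits)."""
    return (1 << 30) | (size << 16) | (ord(type_char) << 8) | nr


UDMABUF_CREATE = _iow("u", 0x42, struct.calcsize(UDMABUF_CREATE_FMT))


def create_udmabuf(size, device="/dev/udmabuf"):
    """Return a dmabuf fd backed by a fresh memfd, or None if the device is unusable."""
    try:
        dev_fd = os.open(device, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        return None
    try:
        page = os.sysconf("SC_PAGE_SIZE")
        size = -(-size // page) * page  # size must be a multiple of the page size
        memfd = os.memfd_create("udmabuf-sketch", os.MFD_ALLOW_SEALING)
        try:
            os.ftruncate(memfd, size)
            # The driver expects F_SEAL_SHRINK to be set on the memfd.
            fcntl.fcntl(memfd, fcntl.F_ADD_SEALS, fcntl.F_SEAL_SHRINK)
            arg = bytearray(struct.pack(UDMABUF_CREATE_FMT, memfd,
                                        UDMABUF_FLAGS_CLOEXEC, 0, size))
            # With a mutable buffer, fcntl.ioctl returns the ioctl's return
            # value, which here is the new dmabuf file descriptor.
            return fcntl.ioctl(dev_fd, UDMABUF_CREATE, arg)
        finally:
            os.close(memfd)
    finally:
        os.close(dev_fd)
```

On a kernel without CONFIG_UDMABUF, or without permission on the device node, `create_udmabuf(4096)` simply returns None.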

@jc-kynesim
Contributor

@rmader You might not want to do it this way but allocating dmabufs via /dev/dma_heap/vidbuf_cached (a sym-link to system or linux,cma depending on Pi variant) will work on all Pis and does work well for s/w decode buffers
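
As a rough illustration of that alternative, allocating from a DMA heap could look like the sketch below. It is not taken from rpicam-apps or Kodi: the function name is hypothetical, the ioctl number is computed with the generic `_IOWR` encoding (arm64/x86 style), and the struct layout follows `struct dma_heap_allocation_data` (len, fd, fd_flags, heap_flags) from the kernel's dma-heap UAPI header.

```python
import fcntl
import os
import struct

# struct dma_heap_allocation_data { __u64 len; __u32 fd; __u32 fd_flags; __u64 heap_flags; }
DMA_HEAP_ALLOC_FMT = "=QIIQ"


def _iowr(type_char, nr, size):
    """Generic _IOWR() encoding (assumption: arm64/x86 style direction bits)."""
    return (3 << 30) | (size << 16) | (ord(type_char) << 8) | nr


DMA_HEAP_IOCTL_ALLOC = _iowr("H", 0x0, struct.calcsize(DMA_HEAP_ALLOC_FMT))


def alloc_from_dma_heap(size, heap="/dev/dma_heap/vidbuf_cached"):
    """Return a dmabuf fd allocated from the given heap, or None if it cannot be opened."""
    try:
        heap_fd = os.open(heap, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        return None
    try:
        arg = bytearray(struct.pack(DMA_HEAP_ALLOC_FMT, size, 0,
                                    os.O_RDWR | os.O_CLOEXEC, 0))
        fcntl.ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, arg)
        # The kernel writes the new dmabuf fd into the struct's fd field.
        _len, buf_fd, _fd_flags, _heap_flags = struct.unpack(DMA_HEAP_ALLOC_FMT, bytes(arg))
        return buf_fd
    finally:
        os.close(heap_fd)
```

Once a dmabuf fd is in hand, the rest of the pipeline does not care which allocator produced it.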

@rmader
Contributor Author

rmader commented Mar 7, 2025

I do know that passing a non-CMA dma buffer to something expecting CMA produces very confusing errors

In the context of Wayland, GL/VK and KMS - shouldn't that usually just gracefully fail? In any case, the udmabuf we're talking about here uses virtual memory.

especially as it sometimes works as the memory just happens to be contiguous

Wow, that's wild^^

How do Kodi et al detect UDMABUF support, and is there scope for enabling via a cmdline.txt setting?

By checking whether we can open /dev/udmabuf with write permissions. So distros can easily remove those.
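
A minimal sketch of such a probe, assuming nothing beyond "try to open the node read-write" (the function name is made up, and real clients may do more than this):

```python
import os


def udmabuf_usable(device="/dev/udmabuf"):
    """Return True if the udmabuf device node can be opened read-write."""
    try:
        fd = os.open(device, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        # Missing node (driver not built or loaded) or insufficient permissions.
        return False
    os.close(fd)
    return True


print(udmabuf_usable())
```

Because the check is purely about the device node, changing its permissions or group ownership is enough to turn the feature off for regular users.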

@pelwell
Contributor

pelwell commented Mar 7, 2025

By checking whether we can open /dev/udmabuf with write permissions. So distros can easily remove those.

That suggests we could use a module parameter set from bootargs in Device Tree to enable it on model-specific basis.

@jc-kynesim
Contributor

I do know that passing a non-CMA dma buffer to something expecting CMA produces very confusing errors

In the context of Wayland, GL/VK and KMS - shouldn't that usually just gracefully fail? In any case, the udmabuf we're talking about here uses virtual memory.

I think that by the time you have a dmabuf handle in your hand you must have all the memory locked down - almost no bit of h/w is going to cope with paging requests. Failure tends to happen late in the process when a driver tries to actually get a h/w address from the handle with a call it is expecting to succeed.

especially as it sometimes works as the memory just happens to be contiguous

Wow, that's wild^^

Seen it happen

@jc-kynesim
Contributor

By checking whether we can open /dev/udmabuf with write permissions. So distros can easily remove those.

That suggests we could use a module parameter set from bootargs in Device Tree to enable it on model-specific basis.

That sounds like a solid idea

@rmader
Contributor Author

rmader commented Mar 7, 2025

@rmader You might not want to do it this way but allocating dmabufs via /dev/dma_heap/vidbuf_cached (a sym-link to system or linux,cma depending on Pi variant) will work on all Pis and does work well for s/w decode buffers

Thanks, that's great to know - for my current project it's kinda an anti-goal though :)

The context here is that we are exploring whether sw-decoding with udmabuf could be an almost universal baseline for video playback on semi-recent Linux devices. I previously tested the same GST patches on old Intel and AMD laptops, now I'm trying various ARM devices. In theory, and so far results are very promising, the approach should allow us to make software decoding quite a bit faster compared to previous generic approaches, by avoiding unnecessary copies in the graphics stack whenever at least the GPU has an MMU. That is because the buffers can just get passed through the same code paths that many apps already have for HW decoding, and a copy happens only once we need to composite, be it in the app, the system compositor or the display engine.

The RPi5 is an interesting case because it lacks HW decoders for common formats BUT has a powerful display engine, unlike older laptops that usually need to use the GPU for the final blit.

@rmader
Contributor Author

rmader commented Mar 7, 2025

Tested the RPi4 now. As expected the display engine doesn't support / accept udmabuf buffers - it seems to handle that perfectly though, failing in the test-only KMS commits, not in real ones (I didn't test long though).

The GPU in turn imports the buffers just fine as expected, meaning that clients can pass the buffers through, up to the Wayland compositor when possible, ensuring a minimal amount of copies.

Thus I suggest enabling udmabuf on the 4 as well.

P.S.: With the GST patches Showtime (the upcoming Gnome default player) outperformed mpv (upstream version) and just managed to play 1080p30fps-8bit AV1 (low quality) on a 2560x1440 screen smoothly. And the same should be possible in other apps/players - notably Firefox.

@jc-kynesim
Contributor

Beware that our Wayland now has passthrough for dmabufs direct to HVS if fullscreen. This may confuse that.
I mention the dma_heap on the Pi 'cos once you have the dmabuf allocated everything else will work exactly the same as in your udmabuf case.

@popcornmix
Collaborator

I tried enabling this and testing kodi. Initially I see in kodi log

2025-03-07 15:29:03.180 T:2645    debug <general>: CUDMABufferObject::Register - unable to open /dev/udmabuf: Permission denied

because user isn't in kvm group:

pi@pi500:~ $ ls -l /dev/udmabuf
crw-rw---- 1 root kvm 10, 125 Mar  7 14:55 /dev/udmabuf

Adding user to kvm group does mean kodi uses the /dev/udmabuf node, but h264 (which is software decode on Pi5) is very corrupt. Checking logs shows:

2025-03-07 15:20:47.051 T:2272    error <general>: CDMAHeapBufferObject::CreateBufferObject - ioctl DMA_HEAP_IOCTL_ALLOC failed, errno=Cannot allocate memory

I'm guessing this is trying to use cma (which is set quite low on Pi5, due to availability of iommus). Ah yes, dmesg shows (many of):

[ 1594.807163] cma: __cma_alloc: linux,cma: alloc failed, req-size: 151 pages, ret: -12
[ 1594.807172] cma: number of available pages: 16@272+96@416+126@642+126@898+126@1154+105@1687+105@1943+105@2199+105@2455+105@2711+105@2967+105@3223+105@3479+105@3735+105@3991=> 1540 free of 4096 total pages

and increasing cma does get the file to play. But this is suboptimal - Pi5 doesn't require cma for this use case (dma heap can allocate from non-contiguous system memory using /dev/dma_heap/vidbuf_cached).
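
The dmesg line above also hints at why the allocation fails even though plenty of CMA pages are free. The sketch below parses it, assuming each `count@start` entry describes a run of `count` contiguous free pages (the helper name is hypothetical):

```python
import re


def cma_free_runs(cma_line):
    """Return (largest_run, total_free) parsed from 'count@start' entries."""
    runs = [int(count) for count, _start in re.findall(r"(\d+)@(\d+)", cma_line)]
    return max(runs), sum(runs)


line = ("cma: number of available pages: 16@272+96@416+126@642+126@898+126@1154+"
        "105@1687+105@1943+105@2199+105@2455+105@2711+105@2967+105@3223+105@3479+"
        "105@3735+105@3991=> 1540 free of 4096 total pages")
largest, free = cma_free_runs(line)
print(largest, free, largest >= 151)
# → 126 1540 False
```

Under that reading, 1540 pages are free but the largest contiguous run is 126 pages, so a 151-page request cannot be satisfied: the CMA area is fragmented rather than exhausted.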

@jc-kynesim
Contributor

Those traces look like you are still using dma_heap but have fallen back to linux,cma not udmabuf at all

@popcornmix
Collaborator

Those traces look like you are still using dma_heap but have fallen back to linux,cma not udmabuf at all

how does fixing permissions on /dev/udmabuf make it use dma_heap?

I'm pretty sure this code is being used and the UDMABUF_CREATE ioctl is causing a cma allocation.

@jc-kynesim
Contributor

OK - but the debug you quote says "ioctl DMA_HEAP_IOCTL_ALLOC failed" not "ioctl UDMABUF_CREATE failed"

@cillian64
Contributor

Turning on udmabuf seems sensible. I agree that if there's a neat way to only enable it on Pi5 then that might avoid some confusion.

As an aside, does anyone know why udmabuf only works with memfd and not regular SHM allocations? Is it just so we can enforce appropriate seals (🦭🦭)? It would be really useful if we could use udmabuf to do DRM scanout on regular Wayland SHM buffers.

@6by9
Contributor

6by9 commented Mar 10, 2025

No objection from me if it works in a useful manner.

I remember @cillian64 looking at it and getting some benefit, but largely on Pi5 as HVS then has an IOMMU. On earlier boards it is less useful.

7 participants