Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vulkan loader/layer interface makes it difficult to query what extensions an ICD supports #866

Open
mfncharm opened this issue Mar 1, 2022 · 10 comments
Labels
enhancement New feature or request layers

Comments

@mfncharm
Copy link

mfncharm commented Mar 1, 2022

The Vulkan loader architecture and implementation shield Vulkan applications as well as the Vulkan ICDs from the complexity of having multiple drivers accessible via a single Vulkan API.

Unfortunately, Vulkan layers can find themselves handling some of this complexity themselves. In particular, a Vulkan layer wanting to implement a Vulkan extension has to intercept a number of instance and device commands. In some circumstances, it is not straightforward for the layer to decide whether it should handle the call itself or it should rather hand it over to the loader and the ICDs.

For example, if the layer implements an instance extension, then it cannot easily find out whether the same instance extension is supported by the ICDs in the system. In particular,

  • Unlike Vulkan applications, Vulkan layers cannot assume that a non-NULL pointer returned by vkGetInstanceProcAddr is callable. The Vulkan loader may enable instance trampolines for extensions that the layer implements, even when there are no ICDs that implement that extension. In that case, by calling vkGetInstanceProcAddr the layer obtains pointers to the loader trampolines. When called, these trampolines abort as they are unable to find any ICD implementation for the corresponding Vulkan commands.
  • The loader aggregates instance extensions and it is difficult for the layer to tell which extensions each ICD supports, especially when the extensions are also being implemented by the layer itself.

Another problem encountered by layers is enabling additional extensions that they need in order to operate. See for example #51. I understand, the conclusion is that enabling device extensions may be fine, but enabling instance extensions may be problematic. The boundary of what a layer is or is not able to do is not very well defined / documented. Also - I suspect - layers enabling additional extensions behind the application's back (e.g. by intercepting vkCreateDevice and adding more extensions to ppEnabledExtensionNames) should take care to intercept vkGet*ProcAddr and return NULL for commands that the applications is not expecting to be enabled. It'd help to have some of this complexity, that all layers have to deal with, moved to the loader.

See also: https://gitlab.freedesktop.org/mesa/vulkan-wsi-layer/-/issues/18 where this discussion started

@charles-lunarg
Copy link
Collaborator

I'm going to enumerate what I perceive to be the listed issues.

  • Layers have poor visibility into driver support for instance extensions.

    • Non-Null vkGetInstanceProcAddr isn't authoritative
    • vkEnumerateInstanceExtensions aggregates the list, no way to tell which driver implements a specific
  • Layers implementing instance extensions is poorly defined and documented.

I don't have any concrete solutions for those problems. Much of the issue is that the design did indeed try to keep layers from having to know/deal with the multiple drivers on the system. One idea might be to allow querying the instance extensions active on a physical device. Downside is this has to occur after an instance is created. But the alternative is to make layers aware of the Drivers on the system, something which isn't possible now. There is no VkDriverHandle to use at the moment, and would bring up some interesting questions.

Also - I suspect - layers enabling additional extensions behind the application's back (e.g. by intercepting vkCreateDevice and adding more extensions to ppEnabledExtensionNames) should take care to intercept vkGet*ProcAddr and return NULL for commands that the applications is not expecting to be enabled. It'd help to have some of this complexity, that all layers have to deal with, moved to the loader.

I'm not sure as to what complexity you are referring to here. Layers should handle unknown function pointers uniformly, in that they simply query the next layer down (save for vkGetPhysicalDeviceProcAddr...).

@mfncharm
Copy link
Author

I'm not sure as to what complexity you are referring to here. Layers should handle unknown function pointers uniformly, in that they simply query the next layer down (save for vkGetPhysicalDeviceProcAddr...).

I am referring to the case where a layer intercepts vkCreateDevice and adds some extra extensions that it needs to the list of extensions that the application provides in VkDeviceCreateInfo.ppEnabledExtensionNames. I believe that - by doing so - the Vulkan runtime will report as supported the entrypoints for the extensions enabled by the layer. This means that the layer must intercept vkGetDeviceProcAddr and hide all the entrypoints that the application does not expect to be usable, according to the Vulkan spec (see table 1 here)

Similarly, for extensions that alter the behaviour of some Vulkan entrypoints, the layer may have to intercept these entrypoints and undo the alterations introduced by the extensions that the layer enabled, but the application is not expecting to be enabled.

In an ideal world, the layer would instead tell to the Vulkan Loader which extensions it needs without having to hack the VkDeviceCreateInfo structure.

@charles-lunarg
Copy link
Collaborator

I'm not sure its specified what happens when a layer enables an extension. It would stand to reason that a layer should be able to enable an extension, after all, layers can make addition Vulkan API calls while intercepting a Vulkan call. If a layer wishes to use an extension that the application didn't specify, that should be allowed as it follows that a layer which needs to call an extension function must enable that extension first.

What isn't clear is all the behavior around that. I see the logic of a layer returning NULL for entrypoints that the application doesn't expect to be valid, as while the layer itself may load those entrypoints, the application didn't enable the extension. However, as far as the driver is concerned, the extension was enabled. There is no issue for the application to call those entrypoints. This distinction leads me to think that layers shouldn't be responsible for returning NULL here, the extension is being enabled regardless.

I don't see anywhere in the specification which describes the behavior. The Vulkan-Loader will emit a debug message for any extension not supported by the ICD, but the validation of the extensions happens before calling down the chain.

As for a specific solution to layers adding extensions, we could add a new struct to the pNext chain that holds the necessary info in it.

I think its not impossible to add a private loader-layer function which lets layers ask which physical devices support instance extensions. vkEnumeratePhysicalDeviceInstanceExtensions(VkPhysicalDevice physical_device, uint32_t* count, VkExtensionProperties* properties) is a possible prototype for this function.

@mfncharm
Copy link
Author

This distinction leads me to think that layers shouldn't be responsible for returning NULL here, the extension is being enabled regardless.

Implicit layers that intend to implement a Vulkan extension in a way that is transparent to the application would, however, end up introducing side effects that are "noticeable" to applications. This is why an implicit layer may want to intercept vkGet*ProcAddr and return NULL for the extra entrypoints. However, in general, "undoing" enabled extensions may not be as easy as returning NULL where appropriate in vkGet*ProcAddr. Required extensions may extend or alter the behaviour of core entrypoints. It is unclear to me how much of a problem this is in practice.

Returning to vkGet*ProcAddr, the Vulkan spec is clear on what the behavior of this entrypoint should be and doesn't mention that implicit layers or other third parties may enable extensions behind the application's back. In practice, whether it is fine or not for a layer to not "undo" the side effects caused by enabling extra extensions depends on what applications out there got used to expect/rely-on when calling vkGet*ProcAddr.

I think its not impossible to add a private loader-layer function which lets layers ask which physical devices support instance extensions.

I think this would be a good improvement that would benefit developers of Vulkan layers. In particular, it would make it easier to write layers that work properly a multiple-ICD setups.

@charles-lunarg
Copy link
Collaborator

A possible declaration for the function could be:

typedef bool (VKAPI_PTR *PFN_SupportInstanceExtension)(VkPhysicalDevice physicalDevice, const char* extensionName);

With vk_layerGetPhysicalDeviceProcAddr, it is possible to get the function through the create info chain or vkGetInstanceProcAddr. I believe for uniformity this new function should be added to both, though it'd be easier to only make it get-able through vkGetInstanceProcAddr.

Alternatively, if the vkGetPhysicalDeviceProcAddr function was added to the Vulkan API proper, then a private implementation wouldn't be necessary, as well as allow applications to query if a driver actually supports instance level commands rather than return NULL. This would close the information gap, in that apps can call any physical device level function (from instance extensions) but it wont necessarily do something useful.

@mfncharm
Copy link
Author

The prototype you propose seems good to me. It would definitely help the use case explained in the description of this issue: decide whether a layer can safely call down a Vulkan command that the layer itself implements.

How is this function meant to operate, though? Are you thinking that the function could bypass all the layers and directly query the ICD? Consider the case where there are two layers both implementing MyNewCommand:

MyNewCommand -> Layer A -> Layer B -> ICD

Layer A can always call down, while Layer B can call down only if the ICD supports MyNewCommand. It would be nice if the new SupportInstanceExtension function could also include all the commands that other layers may implement underneath it (from the layer's point of view what is important is whether it can call down rather than whether the ICD implements the command.) I am not sure how easy it is to achieve that, though. The implementation of GetProcAddr in layers calls down and therefore doesn't tell whether the command is implemented by the layer itself or whatever sits below it. Information on which commands a layer implements is present in the layer's manifest, but only for device extensions, AFAIK.

@charles-lunarg
Copy link
Collaborator

Those are very valid concerns.
Bypassing the other layers is convenient because then its a direct query between layer and ICD. However, it completely upends the architecture of Vulkan, so it is not consistent.

FYI the manifest containing a list of supported device extensions is something I would like to remove, or at least deprecate. It doesn't delineate what 'support' means, and many debugging layers 'intercept' all sorts of device extensions but don't declare it in that manifest. Additionally, support could be 'intercept', it could be 'implement', or it could be both implementing the behavior and calling down the chain.

@charles-lunarg
Copy link
Collaborator

In your hypothetical, for Instance & physical device functions, layers always call the loader's terminator for the function, not the driver directly. So Layer B would always have a valid function to call, the loader may just return immediately without calling into the ICD(s).

Yes, there is a very large lack of visibility to other layers who does & doesn't intercept a function. I could imagine a solution to this that mirrors the VK_EXT_tooling_info where each layer adds its own information to it, but I find that to be very annoying to write for a layer developer, as it requires everyone to get it right for the output to be useful.

@charles-lunarg
Copy link
Collaborator

Looking at vkEnumerateInstanceExtensionProperties, maybe it could be adapted for use here? A layer would need to enumerate the physical devices (otherwise there needs to be another mechanism to specify which ICD to use). Then, they can query the list of extensions directly supported by that ICD.

typedef bool (VKAPI_PTR *PFN_vk_layerEnumerateICDSupportedInstanceExtensionProperties)(VkPhysicalDevice physicalDevice, 
    uint32_t*                                   pPropertyCount,
    VkExtensionProperties*                      pProperties);

Something I realized while typing these replies out is whether or not a layer should care about individual ICD's supporting instance extensions. The way the loader is written, if 1 ICD doesn't support an instance extension that another does, the loader must handle this situation, and make sure that any calls which hit the terminator wont cause a problem. From my vantage point, it seems that layers trying to cater to a specific ICD are going about things backwards, as layers aren't supposed to care about individual ICD's. Could you explain the rationale for needing to know which ICD supports which instance extension?

I should add that with Vulkan 1.1/vkGetPhysicalDeviceProperties2, Device extensions were allowed to add functions which take VkPhysicalDevice as its dispatchable handle. This solves the issue of 'which ICD supports which things' because device extensions are per Physical Device. Surface creation obviously doesn't have such a carve out in the spec, but that is due to the aforementioned "instance should 'work' for every ICD" that the loader does.

@mfncharm
Copy link
Author

Could you explain the rationale for needing to know which ICD supports which instance extension?

This problem emerged while working on the layer at https://gitlab.freedesktop.org/mesa/vulkan-wsi-layer

In this layer we are implementing surface extensions like VK_KHR_wayland_surface and VK_EXT_headless_surface. One problem we have is to decide whether we should handle entrypoints ourselves or fall back to the ICD, in case they implement these extensions themselves. See (note) below.

For example, our layer implements VK_EXT_headless_surface, but also some ICDs might. We thought we'd handle these situations by preferring to use the ICD's implementations, when available. The problem is that from the layer's position in the stack it is not possible to know whether calling down to the ICD would work. Instance extensions returned by the Vulkan API include those advertised by the layer. So VK_EXT_headless_surface is always included. Also, calling vkGetInstanceProcAddr returns the loader's trampoline, as the loader sees VK_EXT_headless_surface in the list of enabled extensions and thus enables the trampoline for it.

I guess one solution could be to change the terminators to assume the layer may themselves implement the entrypoint. I suspect this may be problematic as some entrypoints are expected to do something when they are exposed via vkGetInstanceProcAddr. Let's take a concrete example. What should the loader do if in the terminator for vkGetPhysicalDeviceSurfaceSupportKHR it finds that no ICDs support VK_KHR_surface? Maybe return with VK_ERROR_SURFACE_LOST_KHR? That's odd in a way, as this enumeration is part of VK_KHR_surface. Also the surface isn't actually lost.

(note) VK_*_surface are somewhat odd extensions. They are partly implemented by the loader, but have consequences on the ICDs, as ultimately the physical devices advertise support for these for their queues. They also "leak" into other (device) extensions, such as VK_KHR_swapchain, which are necessarily tied to the windowing systems.

In your hypothetical, for Instance & physical device functions, layers always call the loader's terminator for the function, not the driver directly. So Layer B would always have a valid function to call, the loader may just return immediately without calling into the ICD(s).

What we found in practice is that the loader terminator function complains because the ICDs do not implement the entrypoint. Consider the example the case I discussed before: vkGetPhysicalDeviceSupportKHR. The terminator isn't really in a good place to do anything if the ICDs do not implement this function.

Looking at vkEnumerateInstanceExtensionProperties, maybe it could be adapted for use here? A layer would need to enumerate the physical devices (otherwise there needs to be another mechanism to specify which ICD to use). Then, they can query the list of extensions directly supported by that ICD.

The prototype you propose looks interesting. It'd be nice if this was an ordinary Vulkan entrypoint (or would be treated as such), so that layers can intercept it.This would allow a layer to know what instance extensions each ICD, or layers below, support.

Another solution I thought about was to change the Vulkan loader implementation of vkGetInstanceProcAddress to return NULL if the ICD does not support the function. Going back to the implementation of vkGetPhysicalDeviceSupportKHR, I see the following code in its terminator:

    if (NULL == icd_term->dispatch.GetPhysicalDeviceSurfaceSupportKHR) {
        // set pSupported to false as this driver doesn't support WSI functionality
        *pSupported = false;
        loader_log(loader_inst, VULKAN_LOADER_ERROR_BIT, 0,
                   "ICD for selected physical device does not export vkGetPhysicalDeviceSurfaceSupportKHR!\n");
        return VK_SUCCESS;
    }

However, this is how the loader implements vkGetInstanceProcAddr:

    if (!strcmp("vkGetPhysicalDeviceSurfaceSupportKHR", name)) {
        *addr = loader_inst->wsi_surface_enabled ? (void *)vkGetPhysicalDeviceSurfaceSupportKHR : NULL;
        return true;
    }

It could be argued the loader should not return a function pointer unless it is guaranteed to work. Therefore the condition loader_inst->wsi_surface_enabled is maybe not sufficient.

One option would be to also check that NULL != icd_term->dispatch.GetPhysicalDeviceSurfaceSupportKHR, like done in the terminator. The problem is that vkGetInstanceProcAddr doesn't take a physical device as input. What seems to be missing is a vkGetPhysicalDeviceProcAddr, to be honest. Such a function, could do the right thing and not return the address of a function that is destined to fail. Actually, the loader-layer interface defines a vk_layerGetPhysicalDeviceProcAddr, but that doesn't take a VkPhysicalDevice as input :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request layers
Projects
None yet
Development

No branches or pull requests

3 participants