Dual Nvidia GPU #81
Comments
Because the display GPU is not marked as an "internal GPU", the auto-detection of GPU devices fails. I'd suggest that you set the IDs of the graphics devices manually and choose the graphics devices by hand:
You can find out the necessary ids with vulkaninfo. Look for this section:
This would result in running applications with:
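As a rough aid for the vulkaninfo lookup mentioned above, here is a minimal Python sketch that pulls the vendorID, deviceID and deviceName fields out of vulkaninfo text. It assumes the usual `key = value` layout of those lines. The sample output and its device IDs and names are placeholders made up for illustration, not values from this system.

```python
import re

# Placeholder excerpt in the style of vulkaninfo output (IDs and names are made up).
SAMPLE_OUTPUT = """
GPU0:
    vendorID       = 0x10de
    deviceID       = 0x1111
    deviceType     = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName     = Placeholder display GPU
GPU1:
    vendorID       = 0x10de
    deviceID       = 0x2222
    deviceType     = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName     = Placeholder render GPU
"""

FIELD = re.compile(r"\s*(vendorID|deviceID|deviceName)\s*=\s*(.+?)\s*$")


def parse_vulkaninfo_ids(text):
    """Return a list of (vendor_id, device_id, device_name), one per device."""
    devices = []
    current = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        key, value = match.groups()
        current[key] = value
        # Once all three fields of a device have been seen, emit it and start over.
        if len(current) == 3:
            devices.append(
                (int(current["vendorID"], 16),
                 int(current["deviceID"], 16),
                 current["deviceName"])
            )
            current = {}
    return devices


if __name__ == "__main__":
    for vendor, device, name in parse_vulkaninfo_ids(SAMPLE_OUTPUT):
        print(f"{vendor:04x}:{device:04x}  {name}")
```

With two cards from the same vendor, only the deviceID part tells them apart, so that is the value to look at.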
Try The output you shared looks really promising, as it indicates that both devices are reported and primus just fails to identify which device to use for which role. |
But I got the IDs for
|
With I should note, my main |
There actually seems to be a bug in the device-selection code, if both devices are from the same vendor. Could you add a Line 129 in 0c63679
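To illustrate the kind of same-vendor problem described here, this is a hypothetical Python sketch, not primus_vk's actual code. It contrasts matching by vendor only (which picks the same device for both roles) with matching by vendor and device ID while excluding the device already chosen. All names and IDs are made up for the example.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Device(NamedTuple):
    vendor_id: int
    device_id: int
    name: str


def pick_by_vendor(devices: Sequence[Device], vendor_id: int) -> Optional[int]:
    """Naive selection: first device of the given vendor."""
    for index, dev in enumerate(devices):
        if dev.vendor_id == vendor_id:
            return index
    return None


def pick(devices: Sequence[Device], wanted: Tuple[int, int],
         exclude: Optional[int] = None) -> Optional[int]:
    """First device matching (vendor_id, device_id), skipping index `exclude`."""
    for index, dev in enumerate(devices):
        if index == exclude:
            continue
        if (dev.vendor_id, dev.device_id) == wanted:
            return index
    return None


def assign_roles(devices, display_id, render_id):
    """Return (display_index, render_index), never the same index twice."""
    display = pick(devices, display_id)
    render = pick(devices, render_id, exclude=display)
    if display is None or render is None:
        raise LookupError("could not find distinct display and render devices")
    return display, render


if __name__ == "__main__":
    gpus = [Device(0x10DE, 0x1111, "display (placeholder)"),
            Device(0x10DE, 0x2222, "render (placeholder)")]
    # Vendor-only matching cannot distinguish the two cards.
    print(pick_by_vendor(gpus, 0x10DE), pick_by_vendor(gpus, 0x10DE))
    print(assign_roles(gpus, (0x10DE, 0x1111), (0x10DE, 0x2222)))
```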
So let me just ask a few questions/state assumptions about your system setup:
The nvidia driver has the strange quirk of connecting to the (current) X-Server (for whatever reason), even before selecting graphics devices. I built |
Change prevents selection of same ID, but now the second one doesn't find anything, so it only finds either display or render (depending on which I put where) - and always the one set in
Correct. X-Server detects and uses the 710 - visible in
My
Correct. I can run non-Vulkan applications completely fine using
|
Wow, primus works! I wasn't expecting that. And to be honest, I don't really understand why. Probably because in the OpenGL context we always pass along the expected XDisplay explicitly. I've just discovered (by accident) that the nvidia driver misbehaves less if it cannot open an X-Display. Could you just change the value of |
That change together with
I actually never had to remove that file in the past (using AUR
No change from the solution of above code changes. |
Interesting. That seems to be closer to our goal. Could you show |
|
Hmm.... I think we need more debug output. Can you try |
Had to comment out regeneration of that file from the
|
Ok, so we know that the call to So run Check the loop over the ICDs. Make sure that the nvidia ICD is one of them and that there is only a single instance. Also take a look at the other ICDs in that list and tell me what they are. Set a breakpoint here: https://github.com/KhronosGroup/Vulkan-Loader/blob/fa696ca02c7fcd488602a0e0132e26b49cfaa836/loader/wsi.c#L322 |
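Independently of gdb, a quick way to see which ICDs the loader could pick up is to read the ICD manifest files. The following Python sketch assumes the standard JSON manifest layout (an "ICD" object with a "library_path" entry) and the common manifest directories /usr/share/vulkan/icd.d and /etc/vulkan/icd.d. Your distribution may use other locations, and VK_ICD_FILENAMES overrides them.

```python
import json
from collections import defaultdict
from pathlib import Path

# Common manifest locations; an assumption, adjust for your system.
DEFAULT_DIRS = ["/usr/share/vulkan/icd.d", "/etc/vulkan/icd.d"]


def icd_libraries(manifest_dirs=DEFAULT_DIRS):
    """Map each ICD library_path to the manifest files that reference it."""
    libraries = defaultdict(list)
    for directory in manifest_dirs:
        directory = Path(directory)
        if not directory.is_dir():
            continue
        for manifest in sorted(directory.glob("*.json")):
            try:
                data = json.loads(manifest.read_text())
            except (OSError, ValueError):
                continue  # unreadable or malformed manifest, skip it
            icd = data.get("ICD") if isinstance(data, dict) else None
            if isinstance(icd, dict) and icd.get("library_path"):
                libraries[icd["library_path"]].append(str(manifest))
    return dict(libraries)


if __name__ == "__main__":
    for library, manifests in icd_libraries().items():
        flag = "  <-- referenced more than once" if len(manifests) > 1 else ""
        print(f"{library}: {', '.join(manifests)}{flag}")
```

A library that shows up under more than one manifest would be a candidate for the "more than a single instance" case.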
There is an Nvidia ICD and only 1 instance. The other one is the PrimusVK wrapper.
// loop 1
(gdb) p *icd_term->scanned_icd
$8 = {
lib_name = 0xda1330 "libGLX_nvidia.so.0",
handle = 0x6874e0,
api_version = 4202638,
interface_version = 5,
GetInstanceProcAddr = 0x7ffff71607d0 <vk_icdGetInstanceProcAddr>,
GetPhysicalDeviceProcAddr = 0x7ffff7160760 <vk_icdGetPhysicalDeviceProcAddr>,
CreateInstance = 0x7ffff3b54c40,
EnumerateInstanceExtensionProperties = 0x7ffff3b54c30
}
// loop 2
(gdb) p *icd_term->scanned_icd
$9 = {
lib_name = 0xda11b0 "libnv_vulkan_wrapper.so.1",
handle = 0x647110,
api_version = 4198484,
interface_version = 5,
GetInstanceProcAddr = 0x7ffff78fd259 <vk_icdGetInstanceProcAddr(VkInstance, char const*)>,
GetPhysicalDeviceProcAddr = 0x7ffff78fd2ab <vk_icdGetPhysicalDeviceProcAddr(VkInstance, char const*)>,
CreateInstance = 0x7ffff3b54c40,
EnumerateInstanceExtensionProperties = 0x7ffff3b54c30
}
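The api_version fields in the two dumps above are packed Vulkan version numbers. A small Python sketch, using the standard Vulkan bit layout (major in bits 22 and up, minor in bits 12 to 21, patch in the low 12 bits), decodes them:

```python
def decode_vk_version(value):
    """Split a packed Vulkan version into (major, minor, patch)."""
    major = (value >> 22) & 0x7F   # 7 bits; the top 3 bits hold the variant
    minor = (value >> 12) & 0x3FF  # 10 bits
    patch = value & 0xFFF          # 12 bits
    return major, minor, patch


if __name__ == "__main__":
    print(decode_vk_version(4202638))  # libGLX_nvidia.so.0        -> (1, 2, 142)
    print(decode_vk_version(4198484))  # libnv_vulkan_wrapper.so.1 -> (1, 1, 84)
```

So the nvidia ICD reports 1.2.142 and the wrapper reports 1.1.84.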
Steps in, here is the ICD.
(gdb) p *phys_dev_term
$11 = {
disp = 0xd9ecf0,
this_icd_term = 0xf42180,
icd_index = 0 '\000',
phys_dev = 0xf47168
}
|
The primus_vk wrapper actually tries to behave identically to, and substitute for, the original nvidia_icd, so I'd have counted that as 2 instances. Ok, so it seems that the display device is currently obtained through the "directly" installed ICD, which could be bad (who really knows). Could you provide the result of these debugging steps, when I am not completely sure that we clarified this, but re-reading some previous posts made me realize that the nvidia driver fails you even if we only have the host GPU, right? So when you run just plain |
If I rebind the 1060 back to
However that's because I also set my 1060 to be used by Bumblebee by default in
Same error as above (with only the 710 active) after removing the Nvidia ICD. If I reactivate the 1060, I get the same problem, but it defaults the render device to the 1060 (as it should and as it was before). I re-added the Nvidia ICD just to clarify, and now I am getting 4 display GPUs found, until removing it again. I am unsure how that happened; it looks like it is loading both of them now.
Here are the results of the same debugging steps (without Nvidia ICD).
// this time just one
(gdb) p *icd_term->scanned_icd
$2 = {
lib_name = 0xda12b0 "libnv_vulkan_wrapper.so.1",
handle = 0x647110,
api_version = 4198484,
interface_version = 5,
GetInstanceProcAddr = 0x7ffff78fd259 <vk_icdGetInstanceProcAddr(VkInstance, char const*)>,
GetPhysicalDeviceProcAddr = 0x7ffff78fd2ab <vk_icdGetPhysicalDeviceProcAddr(VkInstance, char const*)>,
CreateInstance = 0x7ffff3b54c40,
EnumerateInstanceExtensionProperties = 0x7ffff3b54c30
}
(gdb) print *phys_dev_term
$3 = {
disp = 0xda2030,
this_icd_term = 0xe70b80,
icd_index = 0 '\000',
phys_dev = 0xf457d8
}
Not sure what you mean with that. I did leave |
I've asked a friend (thanks @janisstreib) and have gotten access to a system with 2 Nvidia GPUs.
With these patches Otherwise you'd have to help me to get the vfio + bumblebee configuration working. (You don't use
Regarding vfio: I have added the vfio modules to
and configured vfio:
However vfio doesn't seem to be able to get the device:
even when I manually unload the nvidia driver ( |
Sadly that didn't fix it. I am starting to think it's something with the way I configured the GPUs and drivers so will try to do that from scratch. Correct, no |
I disabled kernel binding the 1060 to Are you sure your display GPU is actually using the |
Hmm, I didn't remove
After a BIOS upgrade the vfio-binding now seems to work (still not having removed
However bumblebee fails to activate the secondary GPU:
This error persists, even if in
When I do not use
I can reproduce the problem of the segfaulting nvidia driver in
So from my experimenting I'd say that nvidia's Vulkan driver cannot be convinced to detect 2 GPUs correctly when they are only available on different X servers. I'd consider that a bug in the nvidia driver that we sadly can't work around.
If you accept binding both nvidia cards to the same X-Server (their names show up in Xorg.0.log), the nvidia driver will successfully detect both graphics cards and operate them correctly. Nvidia's "solution" to GPU offloading still does not work in this setup (when I tried it, the application said that no graphics queue could be found), but primus-vk (with the special branch I provided) will be able to render on the secondary GPU without the need to have a screen attached to it. You must not need to have
I was not able to reproduce the setup where one GPU is bound by vfio-pci and then later the nvidia driver takes over. I do not know how you would spawn a "second instance" of the nvidia module to handle the secondary graphics card, which only becomes available later. From what I observed, the nvidia driver only detects graphics cards once on startup, and I currently see no way to "add" the graphics card which is initially occupied by |
Most likely. I found the most sure way to know is to run
You do have to rebind the driver to
I will try that again tomorrow. Sadly binding to Nvidia at the start then fails to rebind into |
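For readers following along: the exact rebind commands used in this thread are not shown here, so the following is only a generic Python sketch of one common way to move a PCI device between drivers through sysfs (unbind, set driver_override, re-probe). The address 0000:09:00.0 is an assumption derived from the 1060's BusID quoted in the original post, and the writes need root.

```python
import os


def rebind_plan(pci_address, new_driver):
    """Return the (sysfs_path, value) writes to hand a device to new_driver."""
    device = f"/sys/bus/pci/devices/{pci_address}"
    return [
        (f"{device}/driver/unbind", pci_address),     # detach the current driver
        (f"{device}/driver_override", new_driver),    # pin the next driver
        ("/sys/bus/pci/drivers_probe", pci_address),  # ask the kernel to re-probe
    ]


def apply_plan(plan):
    """Perform the writes (root only). Skips unbind if no driver is bound."""
    for path, value in plan:
        if path.endswith("/driver/unbind") and not os.path.exists(path):
            continue
        with open(path, "w") as handle:
            handle.write(value)


if __name__ == "__main__":
    # Dry run: only print what would be written.
    for path, value in rebind_plan("0000:09:00.0", "vfio-pci"):
        print(f"echo {value} > {path}")
```

Swapping "vfio-pci" for "nvidia" gives the plan for the opposite direction. Whether the nvidia module then picks up the card is exactly the open question discussed above.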
First of all, thanks for explaining how you rebind the pci-devices between drivers. Works perfectly :-) After very much experimenting, I believe I've got it running with the 2 gpus on 2 different X-Servers:
I have pushed the necessary hacks on the branch
|
Nice! However, I am getting a segfault with that:
My details are identical to yours (with different IDs). |
Interesting, can you provide more details about the segfault? What is null? |
Ah yes, that was stupid of me to forget! Completely forgot about it due to the new commit.
Works wonders now, awesome work, it is rare to see such in-depth debug help! I will also test any additional changes, if you will make any, and when it reaches master. |
Ooh, I just noticed that the bogus display was actually my fault, as I just blindly rebased onto the "old" |
Haha, no I've seen it, just didn't think on it, so no worries. Yeah, let me know here and I'll help testing. :) |
I cleaned up the nv-fixes a few days ago and have been using them myself since then to get a feeling for how stable they are. Now I feel confident enough that they are fit for everyone. |
Tested |
I am closing this issue as we have resolved it. Feel free to reopen it or create a new one if you have other issues. |
I currently have 2 Nvidia GPUs installed:
After removing modprobe.d/bumblebee.conf (which disables Nvidia driver use), optirun/primusrun and pvkrun with OpenGL applications works well, renders on the 1060 and displays on the 710.
My xorg.conf has Driver "nvidia" set for BusID "PCI:10:0:0" which is the GT 710. PrimusVK is installed from AUR (also tested with primus_vk).
However, when trying a Vulkan application (pvkrun vkcube), I get the following:
Then I modified /etc/bumblebee/xorg.conf.nvidia with BusID "PCI:09:00:0" which is the GT 1060, and I get:
Progress, but host GPU still doesn't want to be the display one. I've also tried setting VK_ICD_FILENAMES but no success so far. I understand this is likely a problem with 2 Nvidia GPUs.
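One detail worth double-checking with BusID values like the ones above: as far as I know, xorg.conf reads the bus, device and function numbers as decimal, while lspci and sysfs print them in hexadecimal. The Python sketch below converts the simple "PCI:bus:device:function" form under that assumption. It ignores the optional domain syntax and assumes PCI domain 0000.

```python
def xorg_busid_to_sysfs(bus_id):
    """Convert 'PCI:bus:device:function' (decimal) to a sysfs-style address."""
    prefix, bus, device, function = bus_id.split(":")
    if prefix != "PCI":
        raise ValueError(f"unsupported BusID: {bus_id!r}")
    # Domain 0000 is assumed; xorg.conf values are parsed as decimal here.
    return f"0000:{int(bus):02x}:{int(device):02x}.{int(function):x}"


if __name__ == "__main__":
    print(xorg_busid_to_sysfs("PCI:09:00:0"))  # 0000:09:00.0
    print(xorg_busid_to_sysfs("PCI:10:0:0"))   # 0000:0a:00.0
```

If lspci lists the GT 710 at 0a:00.0, then "PCI:10:0:0" is the matching xorg.conf value. If lspci shows 10:00.0, the xorg.conf value would instead be "PCI:16:0:0".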