Hello, everyone. Welcome to the PCI endpoint subsystem open items discussion. Let me go straight to the slides.
Okay. So this is the agenda for today's discussion. First, I'm going to present the state of Virtio support in the PCI endpoint subsystem. Next, I'm going to propose using QEMU for testing the PCI endpoint subsystem, which is currently lacking. And the third one is proposing a way to send a doorbell from the host to the endpoint device, because the PCI spec hasn't defined that. I'll try to cover at least the first two things today.
So let me give a quick recap of what happened last year. Last year, I presented three proposals for adding Virtio support in the PCI endpoint subsystem, and out of the three, I got consensus to move forward with the proposal from Shunsuke.
So this was the proposal that was agreed on last year. On the left side, we have the host system, which is going to act as the Virtio front-end. On the right side, we have the endpoint device, which is a physical endpoint device, unlike the virtual PCI device in a virtualized environment, and it is going to act as the Virtio backend. The backend exposes the Virtio device to the front-end through the PCI transport. And as I said, the difference is that in this case the endpoint is a real physical endpoint device, unlike in the virtualized environment.
So, when I tried to implement that proposal on some of our Qualcomm endpoint devices, I actually hit a few showstopper bugs. Oops, not this one, actually. Even though it's a pretty good book, the complexity of the showstopper bugs I'm going to describe is nowhere near what is described in that book.
So I actually hit a few implementation issues in the agreed proposal. Luckily, those implementation issues were not related to Qualcomm devices, which is a surprise. In that proposal, there was no MSI or MSI-X support; it only had INTx. Don't ask me why. And that really affected the performance of the Virtio transport. Then, it only exposed legacy Virtio devices, because if you know Virtio, there are two versions of the spec, right? The legacy Virtio spec and the modern Virtio spec. And this one, unfortunately, supported only the legacy one. The third one was that there was no IOMMU support. Everything still worked as it was, because Shunsuke tested that proposal on, I think, a Renesas SoC, both host and endpoint, and on that SoC these were not an issue. But if you take any of the latest endpoint SoCs, these three things need to be addressed. And then I also identified a race between the Virtio device and the driver when I brought up the Virtio support on Qualcomm endpoint devices. So let me go through each issue, and I'll also propose how to fix them.
So, the first one is how to get MSI and MSI-X. The Virtio spec itself supports both INTx and MSI-X, but not MSI. I don't know why, even though MSI has its own issues. But the problem is that most of the low-end endpoint devices out there in the market don't support MSI-X; they only support MSI. I think somebody might be cursing the hardware designers in the crowd. So I actually submitted a proposal for adding MSI to the Virtio spec. It was pretty much straightforward: I submitted version one, got some constructive feedback, and then submitted version two. And there has been silence for six months or so. I also submitted the corresponding Linux kernel patch to enable MSI in the Virtio driver, and that also didn't get any response. But I think things are looking good; I don't see any blocker here. Sorry?
You need the specs before the patch goes in.
Well, I did submit the patch for adding it in the spec. And I also wanted to show what it looks like in the driver world. So I just submitted these two things.
Yeah, but again, submitting the kernel patch without the specs being finalized is kind of pointless.
Of course, I agree with that. But I was hoping that the spec patch at least gets merged first. I don't know.
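To make that interrupt fallback concrete before moving on: a minimal sketch, assuming a virtio-pci style transport with a struct pci_dev, of how a driver could prefer MSI-X, fall back to MSI on devices that only implement it, and finally fall back to INTx. This mirrors the intent of the posted (and still unmerged) kernel patch, not its exact code; the vector count is illustrative.

```c
#include <linux/pci.h>

/* Hypothetical helper: try MSI-X first, then MSI, then the shared
 * legacy interrupt. Returns the number of vectors obtained or a
 * negative errno. */
static int vp_request_vectors(struct pci_dev *pdev, unsigned int nvec)
{
	int ret;

	ret = pci_alloc_irq_vectors(pdev, 1, nvec,
				    PCI_IRQ_MSIX | PCI_IRQ_MSI);
	if (ret > 0)
		return ret;

	/* Last resort: a single shared INTx line
	 * (PCI_IRQ_LEGACY on older kernels). */
	return pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_INTX);
}
```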
So the next one was supporting the modern Virtio device, because I realized that exposing a legacy Virtio device is not a smart thing to do in 2024. But when I tried to implement the modern Virtio device on Qualcomm SoCs, I hit a few roadblocks. One of them is that it requires a configurable PCI vendor capability. The modern Virtio spec expects you to have these PCI vendor capability registers, which are designed to expose the offsets of the Virtio structures, like where the Virtio structures lie within a specific BAR region, et cetera. These offsets are used to discover the Virtio structures, which were hard-coded in the legacy Virtio spec. Unfortunately, these vendor capability registers are not configurable on production-ready devices. This is not a problem in a virtualized environment like QEMU, because there the hypervisor exposes the virtual PCI device and has complete control over it. But when it comes to a real PCI endpoint device, the hardware vendors won't allow us to configure the vendor capability registers; they have some fixed vendor capability, and we have to work with that. So I'm now working on a proposal to allow the Virtio driver to discover the Virtio structures without using that configurable vendor capability. I haven't submitted it yet, and I'm thinking about going with a fixed offset in the BAR, because I don't know how else we can expose this information without the vendor capability. The vendor capability seems to be a nice way, and that's why people stick to it. So, yeah.
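For reference, this is roughly the vendor-specific capability the modern transport relies on; the layout below follows the virtio 1.x spec (and the kernel's include/uapi/linux/virtio_pci.h). On many production endpoint controllers these fields are fixed in hardware, which is exactly the roadblock described above.

```c
#include <linux/types.h>

/* Virtio PCI vendor capability (virtio 1.x). The driver walks the PCI
 * capability list and uses cfg_type/bar/offset/length to locate the
 * common, notify, ISR and device-specific config structures. */
struct virtio_pci_cap {
	__u8  cap_vndr;		/* PCI_CAP_ID_VNDR */
	__u8  cap_next;		/* next capability pointer */
	__u8  cap_len;		/* length of this capability */
	__u8  cfg_type;		/* which virtio structure this points to */
	__u8  bar;		/* BAR that holds the structure */
	__u8  id;		/* distinguishes multiple caps of one type */
	__u8  padding[2];	/* pad to full dword */
	__le32 offset;		/* offset of the structure within the BAR */
	__le32 length;		/* length of the structure, in bytes */
};
```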
This was one such thing. The next one is IOMMU support. Again, most of the modern endpoint SoCs and host platforms have an IOMMU for PCI and other peripherals as well. If you have an IOMMU, the bus address, the PCI address, is not going to be the same as the physical address, right? It has to be translated. But the problem with the legacy Virtio spec is that it works only with physical addresses. So when the frontend exposes the addresses of the virtqueues to the backend, those addresses are physical addresses, and unfortunately, when the endpoint tries to access them and there is an IOMMU on the host, it will trigger an IOMMU fault, because you cannot access the physical address directly; you have to use the translated address. This issue is going to get solved when we migrate to the modern Virtio spec.
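A minimal sketch of the difference, assuming a host-side transport with a struct device and a virtqueue buffer vq_mem of size bytes (both names are illustrative): the legacy transport effectively hands out the raw physical address, whereas a DMA API mapping gives the endpoint a bus address the host IOMMU can actually translate.

```c
#include <linux/dma-mapping.h>
#include <linux/io.h>

static dma_addr_t vq_map_for_endpoint(struct device *dev, void *vq_mem,
				      size_t size)
{
	/* What the legacy transport effectively publishes: a raw physical
	 * address, which the endpoint cannot reach through the host IOMMU. */
	phys_addr_t legacy_addr = virt_to_phys(vq_mem);

	/* What the endpoint actually needs when an IOMMU is present: a bus
	 * address (IOVA) established through the DMA API. */
	dma_addr_t iova = dma_map_single(dev, vq_mem, size, DMA_BIDIRECTIONAL);

	if (dma_mapping_error(dev, iova))
		return DMA_MAPPING_ERROR;

	(void)legacy_addr;	/* shown only for contrast */
	return iova;
}
```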
And this is the other thing I discovered when I tried to implement that: I found a race between the Virtio device and the driver during the initialization phase. The reason for the race is that the Virtio spec uses some registers in such a way that, for example, it has a device feature register, which is a 32-bit register, but it uses that register to expose a 64-bit value. The way it achieves this is by using another register, a feature select register, I think. Whenever the host writes zero to the select register, the lower 32 bits are read from the device feature register, and whenever one is written, it reads the upper 32 bits. This is not a problem in a virtualized environment: let's take the example of QEMU. Whenever the guest writes to an endpoint BAR region, QEMU will trap the access and can then do whatever it wants as an action for that register write. But that kind of trapping is not really possible on real endpoint devices, because when you write to an endpoint BAR, the write just goes through; there's no way to trap it on a real PCI endpoint device. So we are actually seeing a race due to this behavior. For that, I'm going to propose adding some sync points between the Virtio device and the driver. Let's take the example of the device feature: whenever the host wants to read the upper 32-bit value, it has to wait for a sync between the host and the endpoint; only then can it go and read the upper 32-bit value. I don't know whether it's the best solution or not, but I'm thinking about proposing it. Let's see, I'm going to propose it in the coming days.
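This is roughly the access pattern the race is about: a sketch of the 64-bit device feature read done through the 32-bit window of the modern common config structure (field names as in struct virtio_pci_common_cfg). QEMU can trap the select write and update the window atomically; a real endpoint device only sees two independent MMIO writes and reads.

```c
#include <linux/io.h>
#include <linux/types.h>
#include <linux/virtio_pci.h>

/* Read the 64-bit device feature bits through the 32-bit window.
 * 'cfg' is an ioremapped pointer to the common configuration
 * structure inside the device BAR. */
static u64 vp_read_device_features(struct virtio_pci_common_cfg __iomem *cfg)
{
	u64 features;

	iowrite32(0, &cfg->device_feature_select);	/* select low word */
	features = ioread32(&cfg->device_feature);
	iowrite32(1, &cfg->device_feature_select);	/* select high word */
	features |= (u64)ioread32(&cfg->device_feature) << 32;

	return features;
}
```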
Sorry, I'm going really quick. The next thing is using QEMU for testing the PCI endpoint subsystem.
So the problem with the PCI endpoint subsystem is that whenever you want to test it, you need to have two systems: one is the host, and the other one is the endpoint. And it's not always feasible to carry these two things around. So I'm thinking about using a software model to test the PCI endpoint subsystem, and QEMU seems to be the natural choice.
So this is the setup we need to have on the host. We have the root complex that is emulated by QEMU, and that will be controlled by the controller driver in the Linux kernel, which is the guest. We also need to have some endpoint test device emulated in QEMU itself, and that will be controlled by the endpoint test driver in the kernel. The kernel part is already there, so there is pretty much no issue with the host side of QEMU. But when it comes to the endpoint side, what we need is an emulated endpoint controller in QEMU and a corresponding driver in the kernel, and that driver is going to talk to the EPF test driver.
And when we stitch these two things together, what we need is communication between the endpoint test device and the emulated endpoint controller device, because these two things need to talk to each other. They can be on the same host as two different guest operating systems, or they can be on different hosts; that's implementation defined.
For this solution, we have a proposal, again from Shunsuke. His proposal is implemented in such a way that it requires two guests on the same host, one acting as the host side and the other as the endpoint side, and these two communicate over a Unix domain socket. The communication between the endpoint controller and the endpoint test device happens over the Unix domain socket because we need a way to transfer the TLPs from the endpoint to the host in order to simulate the actual PCI communication. On the endpoint side, his proposal has the PCI endpoint controller implemented as a common PCI device, and that requires a new controller device driver in the Linux kernel. Unfortunately, I'm not in favor of that, because if you are going to emulate a PCI controller anyway, why can't you just emulate an existing PCI controller device, like DesignWare or something like that? Why would you need a new controller device? And then on the host side, his proposal has the QEMU EPF bridge device, which exposes the PCI endpoint device to the guest operating system while at the same time communicating with the endpoint-side guest operating system. The problem is that the EPF bridge device is not a real device; it's kind of acting like a bridge. In my opinion, it has to be something like what I described in the diagram: a real PCI device that, under the hood, communicates with the endpoint side and translates that for the existing guest operating system. But again, the patches have been submitted, not much review happened, and then it just went dormant. Somebody needs to revive this.
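Just to illustrate the transport idea (not the actual framing from Shunsuke's patches): a minimal sketch of one QEMU-side device connecting to its peer over a Unix domain socket to exchange TLP-like messages. The socket path, the helper name, and the message layout here are all hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical framing for carrying raw TLP bytes between the emulated
 * endpoint controller and the endpoint test device. */
struct tlp_msg {
	uint32_t len;		/* number of valid payload bytes */
	uint8_t  payload[512];	/* raw TLP */
};

/* Connect to the peer instance; 'path' (e.g. "/tmp/pci-epf.sock") is
 * purely illustrative. Returns the socket fd or -1 on error. */
static int epf_bridge_connect(const char *path)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```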
And then the third one is interesting, because the PCI spec has defined INTx, MSI, and MSI-X in order to trigger interrupts from the endpoint side to the host, but it hasn't defined a way to trigger interrupts on the endpoint from the host side. That's what we are trying to achieve, because whenever bidirectional communication has to happen, we need to have interrupts firing from both ends, right?
For achieving this, we were trying to repurpose the interrupt controller on the endpoint side. All interrupt controllers have an interrupt vector address, right? And you have to write some value to that address to trigger the interrupt. So what we are trying to do is expose that interrupt vector address to the host side and let the host know what value it has to write as the data, so that whenever the host wants to trigger a doorbell or interrupt to the endpoint device, it can just write that specific value to that address, and the interrupt gets triggered on the endpoint device. This was also submitted, by Frank, but it got feedback from Thomas to, you know, redesign it using IMS. I think IMS went in around 6.2, but I haven't really looked into how to achieve this using IMS. This looks like a much-needed proposal, though, because currently there is no spec-defined way to trigger doorbells on the endpoint side, and whenever we want to go with things like Virtio for the PCI endpoint subsystem, this is pretty much needed, otherwise the endpoint is...
Well, but no, you can't do that. I mean, you're going to break your host. It can be anything, anything that supports Virtio. And if you have to modify that thing to raise an interrupt for having your endpoint working correctly, you're just breaking the protocol. So you can't do that.
No, it's not breaking the protocol.
It is, because if you read the Virtio spec, putting a doorbell somewhere in some memory doesn't tell you that you have to raise an interrupt with it, not necessarily. It depends on the protocol.
Currently, it's not defined, but you cannot just poll the thing from a real endpoint device. It might work well in a hypervisor environment, but it won't work well for a real endpoint device, because it's just an endpoint device, and it can have other processes.
It does work, you just poll like crazy until you get something.
Well, I don't know, I mean.
It works. Yeah, sure, it's not pretty, that's for sure, but it works.
It works, but what I'm saying is that the endpoint device is going to be a general purpose Linux computer, right? And if you just poll some specific Virtio registers and the endpoint gets some higher-priority work to do, you are going to miss the notifications, right? And that's what we are seeing.
No, you're not.
Well, that's.
Well, if you're missing notifications, you have a bug. I mean, I've done the NVMe endpoint driver, it's coming, I'm late on that. It's coming, and that's what we're doing. We're just polling the doorbells for the SQs, because that's the only way we have to discover that new commands are coming in, since there's no interrupt. And it works just perfectly, we never miss anything. And the latency is actually pretty good, much faster than interrupts, because we don't...
That's interesting to see. So, let's see, yeah, we can discuss it offline, yeah. Yep, that's it. Great, thank you.
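Coming back to the doorbell proposal for a moment, here is a minimal host-side sketch, assuming the endpoint advertises the doorbell's BAR index, offset, and data value somewhere the host can read them; that layout is precisely what the unmerged proposal has to define, so the parameters below are assumptions for illustration only.

```c
#include <linux/pci.h>
#include <linux/io.h>

/* Ring a doorbell on the endpoint: a single MMIO write of the
 * advertised data value to the advertised offset inside a BAR backed
 * by the endpoint's interrupt controller vector address.
 * 'bar', 'offset' and 'data' are assumptions for illustration. */
static int ring_endpoint_doorbell(struct pci_dev *pdev, int bar,
				  unsigned long offset, u32 data)
{
	void __iomem *base = pci_iomap(pdev, bar, 0);

	if (!base)
		return -ENOMEM;

	writel(data, base + offset);	/* fires the interrupt on the endpoint */
	pci_iounmap(pdev, base);

	return 0;
}
```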
So, thank you very much. Before wrapping up, I want to mention that tomorrow, at 3 o'clock, there is a BoF on PCI authentication and SPDM, if I'm not mistaken. So please attend it if you're interested. And last but not least, on behalf of all the Microconference organizers, thank you very much for showing up and showing interest in Virtio, IOMMU, and PCI. Thank you very much and see you next year. Thank you.