Good afternoon, all. Thanks for joining this session. Today I'm going to talk about PCIe add-in card manageability, specifically out-of-band manageability. I'll be talking about the interfaces it uses, with some CXL-related examples. I'm Apparao. I work at Intel as a firmware development architect, and along with me there are two more people on this paper, Arun PM and Hema and Misha. They couldn't make it to this meeting, so I'll be presenting on behalf of everybody.
Okay, so I have a big agenda here. I'll touch upon the basics of OpenBMC and why we need something like OpenBMC, and then the ingredients we need for add-in card manageability, whether that is the transport or other information. Then we'll talk about the system as a whole. We have multiple components, multiple manageable devices, and multiple management controllers in the system, so how do we handle all of them, and why does manageability matter there? Then I'll touch upon PLDM and related items, and some of the CXL items. And then I'll talk about the Redfish interface for those add-in cards, basically how you can model them in Redfish.
Coming to the overview of OpenBMC. I think most people here have seen what OpenBMC is and what a BMC is. It's a baseboard management controller, which is used for managing servers, whether remotely or automatically, and for monitoring and controlling some of the items in the server. All of that is done by the BMC. I'm not going to deep dive into those things. Another important aspect in terms of OCP is why we need an open source solution. There are multiple reasons. We get shareable development effort through OCP, and with OpenBMC we get early adoption of components. If you look at the topmost figure on the right, we are talking about disaggregated platform manageability: how different components can come together to build a system without cross dependencies between them. OpenBMC is what creates that framework, so any vendor who is part of this disaggregated platform manageability, whether for NICs, PSUs, other components, or different controllers, can contribute to it. You have an infrastructure ready for those things. There are other advantages as well, which are listed there. Also, recently we moved to DC-SCM-based OpenBMC, which is also talked about in one of the sessions.
I'm from Intel, and for Intel, OpenBMC is the infrastructure used for multiple components. You can see here Intel fabric, Intel telemetry architecture, Intel SmartNICs, Intel accelerators and CXL cards, Intel 3D XPoint, all of them. We use OpenBMC so that we can reach the industry and get early adoption of those things, okay?
So, jumping to the next thing, the platform architecture, mainly for the add-in cards. As I said, there are multiple components in the platform. With disaggregated platform management, you will have multiple controllers connecting to the BMC, whether those are NICs, CXL cards, or any other add-in cards in the PCIe segment. Some of them will be managed devices and some of them will be management controllers. Then there is how we communicate with them and get that information. There are multiple protocols used internally, and I'll be talking about these in the next slide. Basically, we have been using IPMI, which has been there for quite some time, 20-plus years now. The industry is now moving towards MCTP, PLDM, and other things, okay? The other part is the external interfaces. Redfish is one of the external interfaces used for server manageability, whether you want to build something UI-based or use a CLI-based approach. You can use the REST APIs to manage the servers, including the internal components in the system, which includes the add-in cards.
So, moving on, this is the add-in card manageability. Take any of the NICs, for example. Here you can see that they are connected to the BMC over some transport layer, whether it is SMBus, PCIe, I3C, or whatever interface it is. MCTP is the protocol used on top of whichever transport layer we are using. Using that, you can control the add-in cards and discover the add-in cards. Discovery here means what kind of add-in card it is and its FRU information, FRU meaning field replaceable unit: what version it is and what information it exposes. You should be able to get all of that. All of this is part of the PLDM specs, so what we see here is control and discovery, FRU, and then monitoring and control of the add-in cards. For example, if errors are happening in the add-in card, how can you automatically recover the system? Or sometimes you are not able to recover, so how will RAS handle that, for example by collecting the error information? We should be able to do all of that here using the PLDM protocol. So basically, for add-in card manageability, we'll be using MCTP as well as PLDM.
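As an illustration of the PLDM functions named above (control and discovery, FRU, monitoring and control), here is a minimal Python sketch of how a PLDM request header might be assembled. The three-byte layout and the type and command codes reflect my reading of the DMTF PLDM base specification and should be checked against it; the function names are hypothetical and not part of OpenBMC or any library.

```python
from enum import IntEnum


class PldmType(IntEnum):
    # Type codes as I understand them from the DMTF PLDM ID/codes spec;
    # treat them as assumptions and verify before relying on them.
    MESSAGING_CONTROL_AND_DISCOVERY = 0x00
    PLATFORM_MONITORING_AND_CONTROL = 0x02
    FRU_DATA = 0x04
    FIRMWARE_UPDATE = 0x05


# Assumed command code for GetPLDMTypes within the base (type 0) command set.
GET_PLDM_TYPES = 0x04


def build_pldm_request_header(instance_id: int, pldm_type: PldmType, command: int) -> bytes:
    """Build a 3-byte PLDM request header (hypothetical helper)."""
    if not 0 <= instance_id <= 0x1F:
        raise ValueError("instance_id must fit in 5 bits")
    if not 0 <= command <= 0xFF:
        raise ValueError("command must fit in one byte")
    byte0 = 0x80 | instance_id          # Rq=1 (request), D=0, 5-bit instance ID
    byte1 = (0 << 6) | (pldm_type & 0x3F)  # header version 0, 6-bit PLDM type
    return bytes([byte0, byte1, command])


def parse_pldm_header(header: bytes) -> dict:
    """Decode the same 3-byte header back into its fields."""
    if len(header) < 3:
        raise ValueError("PLDM header needs at least 3 bytes")
    return {
        "is_request": bool(header[0] & 0x80),
        "instance_id": header[0] & 0x1F,
        "pldm_type": header[1] & 0x3F,
        "command": header[2],
    }


hdr = build_pldm_request_header(1, PldmType.MESSAGING_CONTROL_AND_DISCOVERY, GET_PLDM_TYPES)
print(hdr.hex(), parse_pldm_header(hdr))
```

A BMC would typically start discovery with base commands like this one before moving on to FRU and sensor queries; the sketch only shows the framing, not the transport.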
Okay, these are some of the links I put together for add-in card manageability, especially for the things we'll use here. As I said, MCTP is the base specification, which is DSP0236, and on top of that you have PLDM and other things. In the figure here, you can see that the physical layer of the transport binding could be anything. It could be SMBus, I3C, or PCIe. On top of that you have the transport layer, which is where MCTP sits. MCTP means Management Component Transport Protocol. MCTP has a base specification, which is a DMTF specification you can see here. Underneath that, you can see the different specifications called out for the different bindings: the PCIe binding, the SMBus binding, I3C, and so on. On top of MCTP you have PLDM. We use PLDM for add-in card manageability: for monitoring and control of the add-in cards, for FRU, and for firmware updates of the add-in cards, okay. On the other side, using MCTP, we should be able to send the CXL Type 3 CCI commands, meaning we should also be able to handle CXL-related cards.
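The layering described above (physical binding, then MCTP, then PLDM or CXL CCI as the message type) can be sketched in Python as follows. This is an illustrative sketch, not OpenBMC code: the header layout and the message type codes are my understanding of DSP0236 and the DMTF message type registry and should be verified there, and the endpoint IDs and helper names are made up for the example.

```python
# Message type codes as I recall them from the DMTF MCTP IDs and codes
# document; treat every value here as an assumption to verify.
MCTP_MESSAGE_TYPES = {
    0x00: "MCTP control",
    0x01: "PLDM",
    0x05: "SPDM",
    0x08: "CXL CCI",
}


def build_mctp_packet(dest_eid: int, src_eid: int, msg_type: int,
                      payload: bytes, tag: int = 0) -> bytes:
    """Wrap one upper-layer message in a single MCTP packet (hypothetical helper)."""
    for eid in (dest_eid, src_eid):
        if not 0 <= eid <= 0xFF:
            raise ValueError("endpoint IDs are one byte")
    # Single-packet message: SOM=1, EOM=1, packet sequence 0, tag owner=1.
    flags = (1 << 7) | (1 << 6) | (0 << 4) | (1 << 3) | (tag & 0x07)
    transport_header = bytes([0x01, dest_eid, src_eid, flags])  # header version 1
    # First body byte: integrity-check bit (0 here) plus 7-bit message type.
    return transport_header + bytes([msg_type & 0x7F]) + payload


def describe_packet(packet: bytes) -> str:
    """Name the upper-layer protocol carried by a packet built above."""
    if len(packet) < 5:
        raise ValueError("packet too short for MCTP header plus message type")
    msg_type = packet[4] & 0x7F
    name = MCTP_MESSAGE_TYPES.get(msg_type, "unknown")
    return f"EID {packet[2]:#04x} -> EID {packet[1]:#04x}: {name}"


# A 3-byte PLDM request header used as the payload (values are illustrative).
pldm_request = bytes([0x81, 0x00, 0x04])
packet = build_mctp_packet(dest_eid=0x09, src_eid=0x08, msg_type=0x01, payload=pldm_request)
print(packet.hex(), "|", describe_packet(packet))
```

The physical binding (SMBus, I3C, or PCIe VDM) would add its own framing around this packet; that part is deliberately left out because it differs per binding.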
I have a few more slides on the CXL-related things, so I'm just running through this, okay. I've talked about most of this already. For the add-in cards, we use PLDM for inventory and telemetry-related information. The cards are connected to the BMC via PLDM, and there you should be able to monitor the different types of sensors. They could be numeric sensors, they could be state sensors, or they could be effecters, both numeric and state effecters. Using the PLDM protocol, we should be able to get all that information and give those metrics to the orchestration firmware, so that, depending on the configuration, we can take some actions on it, okay. PLDM also supports platform events: if something happens asynchronously, you should be able to log it and take some action depending on the event, right? So you can use all of that information. And then in OpenBMC, on the BMC, you have the Redfish interface, with which the orchestration firmware can access those things externally and pull out all of this add-in card information.
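To make the numeric sensor path concrete, here is a small Python sketch of turning a raw PLDM numeric sensor reading into engineering units and classifying it against thresholds. The conversion formula is my understanding of how numeric sensor PDR fields (resolution, offset, unit modifier) are applied in DSP0248; the field names, the example sensor, and its thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericSensorPdr:
    """Simplified subset of a numeric sensor PDR (field names are assumptions)."""
    name: str
    resolution: float
    offset: float
    unit_modifier: int      # power-of-ten scaling applied to the result
    warning_high: float     # thresholds expressed in converted units
    critical_high: float


def to_units(pdr: NumericSensorPdr, raw: int) -> float:
    """Assumed conversion: (raw * resolution + offset) * 10^unit_modifier."""
    return (raw * pdr.resolution + pdr.offset) * 10 ** pdr.unit_modifier


def classify(pdr: NumericSensorPdr, raw: int) -> str:
    """Map a raw reading to a coarse state the BMC could act on or report."""
    value = to_units(pdr, raw)
    if value >= pdr.critical_high:
        return "critical"
    if value >= pdr.warning_high:
        return "warning"
    return "normal"


# Hypothetical add-in card temperature sensor reporting millidegrees Celsius.
card_temp = NumericSensorPdr("card_temp_c", resolution=1.0, offset=0.0,
                             unit_modifier=-3, warning_high=85.0, critical_high=95.0)

for raw_reading in (45500, 90000, 96000):
    print(raw_reading, to_units(card_temp, raw_reading), classify(card_temp, raw_reading))
```

In a real BMC the classification result would feed event logging or a Redfish sensor resource rather than a print statement.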
Similarly, there is PLDM firmware update. All the add-in cards on the system are discovered by the BMC using PLDM, FRU, and other things. Once you discover them, you should be able to get the version number and other information. So software inventory is one item, and you should be able to get the complete add-in card inventory information. That includes hardware inventory, meaning hardware information such as the model and serial number. Similarly, you should be able to get the software inventory, which is which version of the software is running on those cards, right? So using this, you can get the complete inventory over PLDM.
Moving on to CXL. CXL is similar to what you saw for PLDM add-in card manageability. You can see that MCTP or PCIe is what will be used for CXL manageability. With CXL manageability, you have information and status related commands. Then there are events, logs, and maintenance, and you should be able to get all of those from the CXL devices. As in the previous case, we should be able to expose that over Redfish here. Similarly, whenever there are RAS-related operations, for example PPR maintenance, sPPR or hPPR, we should be able to get that and configure it using the BMC, through out-of-band management.
So, these are the CXL component commands. The information and status commands are used for identifying what kind of CXL card it is and what the capabilities of that card are. Depending on the transport layer we are using, SMBus or PCIe or I3C, you should also be able to set some limits there. Events means that whenever something happens on a CXL card, you should be able to get those events and log them in the BMC. Firmware update is another one that will be supported. With timestamp, you can get and set the timestamp of the CXL add-in card, and then there are the logs. Features tells you what kind of features the card supports. We know that some new features were added in CXL 3.1, related to PBR, port-based routing, and hierarchy-based routing. There is one more thing I didn't add here, which is the device management and control component APIs, and fabric management is another one that will be part of these commands. We should be able to get all of them to the BMC using the internal interfaces from the previous slides.
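The command groups listed above can be sketched as a lookup from a CCI opcode to its command set, since in CXL the upper byte of the 16-bit opcode selects the set. The set numbers and the example opcodes below are my recollection of the CXL specification's generic component command tables and should be treated as assumptions to verify; the helper names are hypothetical.

```python
# Command-set codes (upper byte of the opcode), as I recall them from the
# CXL specification; verify against the spec revision you target.
CCI_COMMAND_SETS = {
    0x00: "Information and Status",
    0x01: "Events",
    0x02: "Firmware Update",
    0x03: "Timestamp",
    0x04: "Logs",
    0x05: "Features",
    0x06: "Maintenance",
}

# A few example opcodes, also assumptions based on my reading of the spec.
EXAMPLE_OPCODES = {
    "Identify": 0x0001,
    "Set Response Message Limit": 0x0004,
    "Get Event Records": 0x0100,
    "Get FW Info": 0x0200,
    "Get Timestamp": 0x0300,
    "Set Timestamp": 0x0301,
    "Get Supported Logs": 0x0400,
    "Get Supported Features": 0x0500,
}


def command_set(opcode: int) -> str:
    """Return the command-set name for a 16-bit CCI opcode."""
    if not 0 <= opcode <= 0xFFFF:
        raise ValueError("CCI opcodes are 16 bits")
    return CCI_COMMAND_SETS.get(opcode >> 8, "unknown")


def group_by_set(opcodes: dict) -> dict:
    """Group named opcodes by command set, e.g. to build a capability summary."""
    grouped: dict = {}
    for name, opcode in opcodes.items():
        grouped.setdefault(command_set(opcode), []).append(name)
    return grouped


for set_name, commands in group_by_set(EXAMPLE_OPCODES).items():
    print(f"{set_name}: {', '.join(commands)}")
```

The transport-dependent limit mentioned in the talk would correspond to the response message limit commands in the first set, if my reading of the spec is right.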
Okay, so the final slide on this. This is the Redfish modeling we can do, depending on the relationship we have with the CXL card. The CXL card could be under Chassis, or it could be under Systems. PCIe devices are already part of the model. Under the PCIe devices, you will see the CXL logical device collection, which could be under Systems or Chassis. Then on top of that you have the CXL device information. Under that you have everything I showed on the previous slide: logs, collected logs, events, and so on. You should be able to map all of them using this tree, okay. Similarly, CXL card firmware updates can also be done, and we should be able to get the firmware inventory using that interface. And if there are add-in card sensors, you will be able to map them accordingly.
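The Redfish tree described above can be illustrated with a short Python sketch that builds the resource paths for a CXL card modeled under either Chassis or Systems. The URI patterns follow my understanding of the published Redfish schemas (PCIeDevice, CXLLogicalDevice collection, and the UpdateService firmware inventory); the resource IDs and function names are hypothetical, and a given BMC may expose only a subset of these paths.

```python
REDFISH_ROOT = "/redfish/v1"


def pcie_device_uri(parent: str, parent_id: str, pcie_device_id: str) -> str:
    """Path of a PCIe device resource under Chassis or Systems."""
    if parent not in ("Chassis", "Systems"):
        raise ValueError("parent must be 'Chassis' or 'Systems'")
    return f"{REDFISH_ROOT}/{parent}/{parent_id}/PCIeDevices/{pcie_device_id}"


def cxl_logical_devices_uri(parent: str, parent_id: str, pcie_device_id: str) -> str:
    """Path of the CXL logical device collection below a PCIe device."""
    return pcie_device_uri(parent, parent_id, pcie_device_id) + "/CXLLogicalDevices"


def firmware_inventory_uri(component_id: str = "") -> str:
    """Path of the firmware inventory collection, or one member if an ID is given."""
    base = f"{REDFISH_ROOT}/UpdateService/FirmwareInventory"
    return f"{base}/{component_id}" if component_id else base


# Hypothetical IDs for a single CXL add-in card.
print(cxl_logical_devices_uri("Chassis", "Baseboard", "CXL1"))
print(cxl_logical_devices_uri("Systems", "system", "CXL1"))
print(firmware_inventory_uri("CXL1_FW"))
```

An orchestration client would issue HTTP GET requests against these paths; the sketch stops at path construction so it stays runnable without a BMC.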
And security. I'm not deep diving into this. SPDM is what will be used on top of MCTP for add-in card security, so we should be able to use SPDM on top of it.
So, the call to action. These are open standards we are using. I talked about multiple standards here: multiple DMTF standards, multiple CXL standards, and multiple PMCI specifications. Please do contribute to them in whatever capacity you can. It could be in terms of development, it could be in terms of giving feedback, or it could be in terms of enabling the industry, right? OpenBMC is another one, which we mostly work on for the disaggregated platforms, so I'm also calling for contributions to OpenBMC. With that, I'll leave it for questions. I think I am already running one minute over. So, are there any questions?
One last question. You mentioned all these events and event commands. Is there a provision for error injection?
Error injection, yes. Basically, there are maintenance commands in CXL. Under those, we should be able to inject errors, and then we should be able to monitor them. Okay. Nothing else. Thank you all. Thanks for attending.