

Good afternoon, all. Thanks for joining this session. Today I'm going to talk about out-of-band manageability of PCIe add-in cards. I'll cover the interfaces it uses, with some examples from the CXL side. I'm Apparao. I work at Intel as a firmware development architect, and along with me there are two more people on this paper, Arun PM and Hema and Misha. They weren't able to make it to this meeting, so I'll be presenting on behalf of everybody.

Okay, so I have a big agenda here. I'll touch on the basics of OpenBMC and why we need something like OpenBMC, and then on the ingredients we need for add-in card manageability, whether that is the transport or other information. Then we'll talk about how, in a system, we have multiple components: multiple manageable devices and multiple manageable controllers. I'll cover how we handle all of them and why manageability matters there. Then I'll touch on PLDM and related things, and on some of the CXL pieces. And then I'll talk about the Redfish interface for those add-in cards, basically how you can model them in Redfish, okay.

Coming to the overview of OpenBMC. I think most people here have seen what OpenBMC is and what a BMC is. It's a baseboard management controller, which is used for managing servers, whether remotely or automatically, and for monitoring and controlling some of the items in the server. All of that is done by the BMC. I'm not going to deep dive into those things. Another important aspect in terms of OCP is why we need an open source solution. There are multiple reasons. One is shared development effort, which we get through OCP. With OpenBMC we also get early adoption of components. If you look at the topmost figure on the right, we are talking about disaggregated platform manageability: how different components together can build a system without cross dependencies between them. OpenBMC is the one that creates the framework, so that any vendor who is part of this disaggregated platform manageability, whether for NICs, PSUs, other components, or different controllers, can contribute to it. So you have an infrastructure ready for those things. There are other advantages as well, which are listed there. Also, recently we moved on to DC-SCM based OpenBMC, which was also talked about in one of the sessions.
So, I'm from Intel, and for Intel, OpenBMC is the infrastructure used for multiple components. You can see here the Intel fabric, the Intel telemetry architecture, Intel SmartNICs, Intel accelerators and CXL cards, Intel 3D XPoint, all of them. We will be using OpenBMC so that we can reach the industry early and get early adoption on those things, okay?

So, jumping into the next thing, the platform architecture, mainly for the add-in cards. As I said, there are multiple components in the platform. With disaggregated platform management, you will have multiple controllers connecting to it, whether they are NICs, CXL cards, or any other add-in cards in the PCIe segment. Some of them will be managed devices and some of them will be managed controllers. Then there is how we communicate with them and get that information. There are multiple protocols used internally. I'll talk about these on the next slide, but basically we have IPMI, which has been there for quite some time, 20-plus years now, and the industry is moving towards MCTP, PLDM, and other things, okay? The other part is the external interfaces. Redfish is the external interface used for server manageability, whether you want a UI-based tool or a CLI-based approach. You can use the REST APIs to manage the servers, including the internal components in the system, which includes the add-in cards, right?

Moving on, this is the add-in card manageability itself. Take any of the NICs, for example. They are connected to the BMC over whatever transport you use, whether it is SMBus, PCIe, I3C, or any other interface. MCTP is the protocol that runs on top of whichever transport we are using. Using that, you can control the add-in cards and discover them. Discovery here means finding out what kind of add-in card it is and its FRU information, meaning field replaceable unit information: what version it is and what information it reports. You should be able to get all of that. All of this is part of the PLDM specs, so what we see here is control and discovery, FRU, and then monitoring and control of the add-in cards. For example, if errors are happening in the add-in card, how can you recover the system automatically? Or, if you cannot recover it, how does RAS handle that, for example by collecting the information? We should be able to do all of that using the PLDM protocol. So for add-in card manageability, we use MCTP as well as PLDM.

Okay, so these are some of the links I put together for add-in card manageability, and especially for the pieces we use here. As I said, MCTP is the base specification, which is DSP0236, and on top of that you have PLDM and other things. In the figure here, you can see that the physical layer, the transport binding, could be anything: SMBus, I3C, or PCIe. On top of that you have the transport layer, and that is where MCTP sits. MCTP means Management Component Transport Protocol. MCTP has a base specification, which is a DMTF specification, as you can see here. Underneath that, different specifications are called out for the different bindings: the PCIe binding, the SMBus binding, I3C, and so on. On top of that you have PLDM. We use PLDM for add-in card manageability: for monitoring and control of the add-in cards, and also for FRU and firmware updates of the add-in cards, okay. On the other side, using MCTP, we should be able to carry the CXL Type 3 CCI commands, so CXL cards can be handled as well.
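
To make the layering concrete, here is a minimal Python sketch of how a PLDM request might be wrapped in an MCTP packet before it goes onto SMBus, I3C, or PCIe. The field layouts follow my reading of the MCTP base specification (DSP0236) and the PLDM base specification (DSP0240). The endpoint IDs and helper names are made up for illustration, and the physical binding header is left out, so treat this as a sketch rather than a reference encoder.

```python
# Sketch only: EIDs and helper names are hypothetical, and the layouts
# should be checked against DSP0236 / DSP0240 before any real use.

MCTP_HEADER_VERSION = 0x01
MCTP_MSG_TYPE_PLDM = 0x01          # message type for PLDM over MCTP
PLDM_TYPE_BASE = 0x00              # messaging control and discovery
PLDM_CMD_GET_PLDM_TYPES = 0x04


def mctp_transport_header(dest_eid, src_eid, som=True, eom=True,
                          seq=0, tag_owner=True, tag=0):
    """Build the 4-byte MCTP transport header."""
    flags = ((int(som) << 7) | (int(eom) << 6) | ((seq & 0x3) << 4)
             | (int(tag_owner) << 3) | (tag & 0x7))
    return bytes([MCTP_HEADER_VERSION & 0x0F, dest_eid, src_eid, flags])


def pldm_request_header(instance_id, pldm_type, command):
    """Build the 3-byte PLDM request header (Rq=1, D=0, header version 0)."""
    return bytes([0x80 | (instance_id & 0x1F), pldm_type & 0x3F, command])


def build_get_pldm_types(dest_eid, src_eid, instance_id):
    """One single-packet MCTP message carrying a GetPLDMTypes request."""
    return (mctp_transport_header(dest_eid, src_eid)
            + bytes([MCTP_MSG_TYPE_PLDM])   # IC bit clear, type = PLDM
            + pldm_request_header(instance_id, PLDM_TYPE_BASE,
                                  PLDM_CMD_GET_PLDM_TYPES))


packet = build_get_pldm_types(dest_eid=0x09, src_eid=0x08, instance_id=1)
print(packet.hex())  # 010908c801810004
```

The same transport header would sit in front of an SPDM or CXL CCI message; only the message type byte and the payload behind it change.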

I have a few more slides on the CXL-related things, so I'm just running towards that, okay. Most of this I already talked about. For the add-in cards we use PLDM for inventory and telemetry-related information. The card is connected to the BMC via PLDM, and there you should be able to monitor the different types of sensors. They could be numeric sensors, state sensors, or numeric and state effecters. Using the PLDM protocol, we should be able to get that information and give those metrics to the orchestration firmware, so that, depending on the configuration, we can take some action on it, okay. PLDM also supports platform events: if something happens asynchronously, you should be able to log it and take some action depending on the event, right? And then in OpenBMC, the BMC has a Redfish interface, with which the orchestration firmware can access these things externally and pull out all of this add-in card information.
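
As a rough illustration of the numeric sensor path, the Python sketch below converts a raw reading into engineering units and then picks an action from it. The conversion formula reflects my understanding of the numeric sensor PDR in the PLDM platform monitoring and control specification (DSP0248). The threshold values, the function names, and the three-way "normal / warning / critical" result are assumptions made for the example, not something taken from the talk.

```python
# Sketch only: thresholds and names are hypothetical; the conversion
# should be checked against the numeric sensor PDR in DSP0248.

def numeric_sensor_value(raw, resolution=1.0, offset=0.0, unit_modifier=0):
    """Convert a raw reading: (raw * resolution + offset) * 10^unit_modifier."""
    return (raw * resolution + offset) * (10.0 ** unit_modifier)


def classify(value, warning_high, critical_high):
    """Map a converted reading onto a simple health state."""
    if value >= critical_high:
        return "critical"
    if value >= warning_high:
        return "warning"
    return "normal"


# A temperature sensor that reports tenths of a degree.
temperature = numeric_sensor_value(raw=853, unit_modifier=-1)
state = classify(temperature, warning_high=80.0, critical_high=95.0)
print(round(temperature, 1), state)
```

In a real BMC the resolution, offset, unit modifier, and thresholds would come from the PDRs the card reports, and the resulting state is what would be handed on to the orchestration side.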

Similarly, there is PLDM firmware update. All the add-in cards on the system are discovered by the BMC using PLDM, FRU, and so on. Once you discover them, you should be able to get the version number and other information. Inventory is one item: you should be able to get the complete add-in card inventory information. There is the hardware inventory, which is hardware information such as the model, serial number, and so on. And similarly there is the software inventory, which tells you which version of the software is running on those cards, right? Using this, you can get the complete PLDM inventory.

Moving on to CXL. CXL is similar to what you saw for PLDM add-in card manageability. MCTP over PCIe is what is used for CXL manageability. With CXL manageability you have different kinds of things, like status sensors and the information and status-related commands. Then there are events, logs, maintenance, and so on, all of which you should be able to get from the CXL devices. As with the other one, we should be able to expose that over Redfish. And whenever there are RAS-related operations, for example PPR maintenance, sPPR or hPPR, we should be able to get that and configure it using the BMC, through out-of-band management.

These are the CXL component commands. The information and status commands are used for identifying what kind of CXL card it is and what its capabilities are. Depending on the transport we are using, SMBus, PCIe, or I3C, you should also be able to set some limits there. Events means that whenever something happens on a CXL card, you should be able to get those events and log them in the BMC. Firmware update is another one that will be supported. With timestamp, you can get and set the timestamp of the add-in card, the CXL card I mean. Then there are the logs. And features tells you what kind of features the card supports. We know that some new features were added in CXL 3.1, related to PBR, that is port-based routing, and hierarchy-based routing, right? There is one more thing that I didn't add here, which is the device management and control APIs, and fabric management is another one that will be part of this command set. We should be able to get all of this to the BMC using the internal interfaces from the previous slides.
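
One way a BMC might organize these commands is by command set, since the upper byte of a CXL command opcode selects the set and the lower byte selects the command. The Python sketch below shows that grouping for the sets named on the slide. The opcode values are as I recall them from the CXL specification's command opcode table and should be checked against the revision being implemented; the function and table names are made up for the example.

```python
# Sketch only: opcode values are from memory of the CXL opcode table and
# need verifying; the names of the helpers here are hypothetical.

COMMAND_SETS = {
    0x00: "Information and Status",
    0x01: "Events",
    0x02: "Firmware Update",
    0x03: "Timestamp",
    0x04: "Logs",
    0x05: "Features",
}

# A few representative opcodes (command set in the high byte).
IDENTIFY = 0x0001
GET_EVENT_RECORDS = 0x0100
GET_FW_INFO = 0x0200
GET_TIMESTAMP = 0x0300
SET_TIMESTAMP = 0x0301
GET_SUPPORTED_LOGS = 0x0400
GET_SUPPORTED_FEATURES = 0x0500


def command_set(opcode):
    """Return the command-set name for a 16-bit CCI opcode."""
    return COMMAND_SETS.get((opcode >> 8) & 0xFF, "Unknown")


for op in (IDENTIFY, GET_EVENT_RECORDS, SET_TIMESTAMP, 0x7F00):
    print(f"{op:04x}h -> {command_set(op)}")
```

A dispatcher built this way can reject or log anything that falls into "Unknown" instead of forwarding it to the card.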

Okay, so the final slide. This is the Redfish modeling we can do, depending on the relationship we have with the CXL card. The CXL card could be under Chassis or under Systems. The PCIe devices are already part of the model. Under the PCIe devices you will see the CXL logical device collection, which could be under Systems or Chassis. Under that you have the CXL device information, and under that everything I showed on the previous slide, such as logs, collected logs, or events. You should be able to map all of them using this tree structure, okay. Similarly, CXL card firmware updates can be done, and we should be able to get the firmware inventory through that interface as well. And if there are add-in card sensors, you will be able to map them accordingly.
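
The tree described here can be sketched as a small Python helper that builds the resource paths a client would walk. The resource names follow my understanding of the DMTF Redfish schema (PCIeDevices, CXLLogicalDevices, and the UpdateService firmware inventory). The chassis, system, and device IDs are hypothetical, and a real client should discover the actual links from the service root rather than assume these paths.

```python
# Sketch only: IDs are hypothetical and the exact paths depend on what a
# given BMC exposes; discover links from /redfish/v1 in real code.

FIRMWARE_INVENTORY_URI = "/redfish/v1/UpdateService/FirmwareInventory"


def pcie_device_uri(parent, parent_id, device_id):
    """Path to a PCIe device modeled under either Chassis or Systems."""
    if parent not in ("Chassis", "Systems"):
        raise ValueError("parent must be 'Chassis' or 'Systems'")
    return f"/redfish/v1/{parent}/{parent_id}/PCIeDevices/{device_id}"


def cxl_logical_devices_uri(parent, parent_id, device_id):
    """Path to the CXL logical device collection of a PCIe device."""
    return pcie_device_uri(parent, parent_id, device_id) + "/CXLLogicalDevices"


print(cxl_logical_devices_uri("Chassis", "1", "CXL1"))
print(FIRMWARE_INVENTORY_URI)
```

Whether the card hangs under Chassis or Systems only changes the first part of the path; the collection underneath it keeps the same shape.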

And security. I'm not deep diving into this. SPDM is what is used on top of MCTP for add-in card security. So we should be able to use SPDM on top of it.

So, the call to action. These are open source standards, whatever we are using. I talked about multiple standards here: multiple DMTF standards, multiple CXL standards, and multiple PMCI specifications. So please do contribute to them, to your capacity. It could be in terms of development, in terms of giving feedback, or in terms of enabling the industry, right? OpenBMC is another one, which is where we mostly work, for the disaggregated platforms, so I'm also calling for OpenBMC contributions. And with that, I'll leave it for questions. I think I am already running one minute over. So, are there any questions?

One last question. You mentioned all these events and event commands. Is there a provision  for error injection? 

Error injection, yes. Basically, there are maintenance commands, and in CXL, yes. So under that, we should be able to inject the errors and then monitor those particular things. Okay. Nothing. Thank you all. Thanks for attending.