I’ve been doing management standards for about 20 years. Before that, I did RDMA and InfiniBand, and VIA before that, then got tired of doing high-speed networking; now I’m back helping that out again, and I do this security standard called SPDM as a side job. So, I get to have a lot of fun here in the industry. Just a quick disclaimer. Did it not go?
There we go. Everything is subject to change, and you know, that’s just kind of normal.
So, a quick background on DMTF; I’ll roll through that pretty quickly. Most of it is there for anyone who isn’t going to watch the recording and is just going to pull down the slides. Then we’ll get into the local storage model real briefly, because Rochelle touched on that, but where that comes in is the fabrics model, okay? Any of that can fit into the fabrics model. I’ll do a light dive on that; hopefully those bubble diagrams won’t be too much, because they can be a lot to take in all at once. Then there are some changes coming to the log model for new technologies and new functionality we’re trying to do, and changes coming along for storage metrics and how you gather them. For the gentleman who asked the question about Prometheus: the changes coming will actually impact any of the telemetry streaming services, pretty quickly here.
Just a quick background on DMTF: we’ve been around for a very long time. Those letters used to stand for something, that changed several times, and now we’re just known as DMTF. Basically, what we work on is manageability and some inside-the-box-ish stuff, and that box can be kind of anything, because now the rack or the data center is becoming the box. But when it comes to security and interoperability, especially around manageability, that’s what we specialize in at DMTF. Lately, we’ve changed our open source, open standards business model. At the last DMTF board face-to-face, we said, ‘You know, we used to limit ourselves in what open source we could work on.’ We did this thing called libspdm, and there are 2,000 downloads of it a month, with somewhere around another 10,000 refreshes a month, and we figured, ‘Well, we’re really doing production-level code; why don’t we just admit it?’ So, we now have the ability to do production-level code for any of the standards as well.
A quick view of the board companies; we really appreciate everything they do for us.
As well as our alliance partners: we’ve got a pretty rich alliance partner program centered around a couple of different areas, Redfish and Swordfish being one of them, of course, plus SPDM and the inside-the-box stuff, MCTP and PLDM.
So, we do a lot more than Redfish, although Redfish is why you’re here today. SMBIOS is everywhere; it’s in every machine shipped in the last, I don’t know, 15 years or so, so that’s billions of machines out there. PMCI is that standard inside the box that you’ve never heard of. If you’ve ever wondered how that fan information gets to whatever is monitoring it: all that sensor data has got to come from somewhere, and Redfish has got to get that information out of the box; that’s where it comes from. SPDM is extending the root of trust beyond just the BMC, or whatever your silicon root of trust is, and that’s how we manage that. There’ll be a session tomorrow afternoon on SPDM and how it relates to storage and some of the new developments going on in storage. And then CIM, the Common Information Model, has been around forever, if you remember SMI before that. We still maintain CIM and do about one release a year on it.
So, this is the Redfish model for storage; you’ve seen it, so I’m not going to spend any time on it. What I really want you to get out of this is the sandbox we created, because that was one of the successes of Redfish. If you look at those higher-level collections, ‘systems’ and ‘chassis,’ we basically made it so anybody can create their own sandbox. If you want to do fabrics, ‘fabrics’ is a bolt-on to that. If you want to do storage as a root-level thing, we can give you a collection for that. When it comes to sustainability, or cooling units, or whatever, we’ve had the ability to give everybody their own sandbox while leveraging the common services: accounting, logs, auditing, eventing, all that kind of stuff. The common-service stuff is the stuff people don’t want to solve again when they create a new manageability platform. It’s ‘Let me pay attention to my widgets, don’t make me go off and do everything else,’ and you certainly don’t want to revalidate everything. If you can find those implementations out there in the industry and just leverage them, you win, and we’ve had a huge amount of success because of that. Like OPAF, the Open Process Automation Foundation: these are people that do fat pipes and meters; they’re redoing their factories around flow because they want to buy industry-standard components too, so they’re leveraging Redfish. Same with PICMG, where they’re doing data-center-style modeling of your factory floor: ‘I want my actuator to go to 50%.’ They don’t want to reinvent everything from scratch, either. And that was kind of where Swordfish started. They were the first ones that really looked at what we had and said, ‘Gee, we’re buying a computing platform; all we want to do is change what, to us Redfish people, would be 10%.’ Of course, to them, it looks like 90%, but really, it was the strength of that initial data model that they were able to leverage.
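To make that sandbox idea concrete, here’s a minimal sketch (not from the talk; the host is hypothetical and authentication is omitted) of how those root-level collections show up to a Redfish client:

```python
# Sketch: discover the top-level "sandboxes" -- sibling collections under
# the Redfish service root that all share the common services.
import requests

BASE = "https://bmc.example.com"  # hypothetical BMC endpoint

root = requests.get(f"{BASE}/redfish/v1", verify=False).json()

# "Systems" and "Chassis" were the originals; "Fabrics", "Storage", and
# others were bolted on later as their own sandboxes.
for sandbox in ("Systems", "Chassis", "Fabrics", "Storage"):
    if sandbox in root:
        print(sandbox, "->", root[sandbox]["@odata.id"])
```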
So, get ready for a couple of bubble diagrams. I don’t know if you care for them, but we use them a lot. The common fabrics model is something we’ve really been working on extending, for CXL especially; we’re working right now on CXL 3.1, because we’ve already got 2.0 and 1.0 published. When we initially designed the fabrics model, we thought, ‘You know what? All of these fabrics are kind of the same. I’ve got a fabric, and I’ve got switches in it, and those switches talk to endpoints, and the sets of endpoints I’m allowed to talk to, we call those zones.’ Don’t think Fibre Channel zones; think more of a subnet kind of thing: ‘What am I allowed to talk to?’ And then individual connections sit on top of that. We tried to make it generally applicable, and we’ve been able to map it successfully to SAS and PCIe and CXL, some of the new Ethernet-based technologies, some of the old Ethernet-based technologies. The telco guys took a swag at mapping it to what they do. We think it can be mapped to InfiniBand and Fibre Channel; I don’t know if anybody’s actually tried, but anything that has these fundamental concepts of endpoints and switches trying to talk to each other, with roughly the same kind of telemetry model, it’s all broadly applicable. That’s why we have the three-way effort going on between the OpenFabrics Alliance, SNIA, and DMTF. If you’re going to Supercomputing, you’ll see demos around that. If you haven’t taken a look at what the OpenFabrics Alliance is doing with Swordfish, feel free to take a look; it’s completely open, and as an organization you can go dig through what they’re doing, because they’re trying to create the universal fabric management framework. The thought was... and I know some Swordfish guys are talking later, so I’ll shut up. I forgot about that.
So anyway, that was the general fabric model. If you want to map it to a Type 3 local device in CXL, this is what that looks like. CXL leverages the PCIe model, so you’ve got your PCIe devices with their PCIe functions, and those map to the logical functions in CXL, because CXL is going over a PCIe lane somewhere, over on the far right-hand side. That memory domain is basically your memory controller. We couldn’t call it ‘memory controller’ because we wanted it to have a bigger scope, but really, that’s what it is: I’ve got a bunch of DIMMs behind a memory controller, and I can chunk them up any way I want, and we call those ‘memory chunks.’ That’s a nice name because it didn’t already have a lot of overloaded uses; we were trying to invent a bit of a new term. On a local processor, you can think of a chunk as an interleave set, in CXL terms: a logical memory region in a logical zone, which gets mapped into your address space locally. That’s what the memory domain and memory chunks are, and ‘memory’ over here is where your remote memory chunk becomes local memory. It looks like a DIMM to the local processor; it can’t tell the difference. And the processor can then create its own interleave sets or chunks or blocks or whatever out of those CXL memory regions. So, that was the thought behind all of this.
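As a rough illustration of that Type 3 mapping (values and paths are hypothetical; the property names follow the published MemoryDomain and MemoryChunks schemas), a mockup might look something like this:

```python
# Illustrative Redfish-style payloads for a CXL Type 3 device: a memory
# domain ("the memory controller") carved into memory chunks.
memory_domain = {
    "@odata.id": "/redfish/v1/Chassis/CXL1/MemoryDomains/MD1",
    "Id": "MD1",
    # The pool of capacity that can be chunked up any way you want.
    "MemoryChunks": {
        "@odata.id": "/redfish/v1/Chassis/CXL1/MemoryDomains/MD1/MemoryChunks"
    },
}

memory_chunk = {
    "@odata.id": "/redfish/v1/Chassis/CXL1/MemoryDomains/MD1/MemoryChunks/Chunk1",
    "Id": "Chunk1",
    "MemoryChunkSizeMiB": 65536,     # 64 GiB carved out of the domain
    "AddressRangeType": "Volatile",  # ends up mapped into the host address space
}
```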
And that’s how we mapped it. It gets a little more complex when you throw networking in the middle and start doing fabric-attached memory. If you look on the right, all of that stuff is really the same: I’ve still got my memory chunk, my memory, my memory domain. Over here on the left, I still have my memory domain and my memory chunk, but my processor now has ports, and on the other side, my memory has a fabric adapter. Then you’ve got all that networking stuff in the middle, which you’ll recognize from the first slide: zones, switches, ports, endpoints, and connections. What this allows you to do is a couple of things. You can use Redfish to control your CXL fabric manager. Your local BMC on this side can pick up everything it sees, and it’s going to report just what it sees over here on the left; it’s not necessarily going to be able to crawl out into the fabric, because that’s the fabric manager’s job. If you’ve got a fabric-attached memory controller over on the right, the part of the elephant it sees is the memory domains and the fabric adapter. Only the fabric manager is going to have anything like universal access, and even then, it doesn’t have access to the memory address space, right? It doesn’t have the page tables and page table entries. So you kind of have to have all the parts of the elephant to figure out exactly what maps to what. And from a fabric manager’s viewpoint, do you necessarily care about the left side and how it mapped things? Maybe not. But something has to be able to see a global view of it, and we’re hoping that’s where Swordfish comes in as well: it’ll be the fabric manager, but it’ll also be an aggregator of all the Redfish resources. So that’s the CXL memory model. I’m doing pretty well on time.
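For the fabric-attached case, a hypothetical mockup (paths invented for illustration) ties those same bubbles together: switches, endpoints, zones, and a connection that grants an initiator access to a remote memory chunk.

```python
# Sketch of the common fabric model applied to CXL fabric-attached memory.
fabric = {
    "@odata.id": "/redfish/v1/Fabrics/CXL",
    "FabricType": "CXL",
    "Switches":    {"@odata.id": "/redfish/v1/Fabrics/CXL/Switches"},
    "Endpoints":   {"@odata.id": "/redfish/v1/Fabrics/CXL/Endpoints"},
    "Zones":       {"@odata.id": "/redfish/v1/Fabrics/CXL/Zones"},
    "Connections": {"@odata.id": "/redfish/v1/Fabrics/CXL/Connections"},
}

# A connection is the grant itself: this initiator endpoint may use that
# memory chunk, which the host then sees as if it were a local DIMM.
connection = {
    "@odata.id": "/redfish/v1/Fabrics/CXL/Connections/1",
    "ConnectionType": "Memory",
    "MemoryChunkInfo": [{
        "MemoryChunk": {"@odata.id":
            "/redfish/v1/Chassis/FAM1/MemoryDomains/MD1/MemoryChunks/Chunk1"}
    }],
}
```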
If you want more educational resources, there’s a whole lot more on this. Because if I go back: those memory regions and memory chunks and everything over on the right-hand model are a Type 2 device, and nobody’s building them, but with PCIe unordered I/O, I can actually stick a network adapter or a storage adapter out on the right-hand side of that CXL interface, and all of this works whether you’re doing a PCIe switch or a CXL switch, because a CXL Type 1 device will negotiate back to unordered I/O if you support that. So this model should work for PCIe devices as well, once your fabric manager supports CXL. We have a couple of educational resources. You can go out to the Redfish Developer Hub, redfish.dmtf.org, and find all this stuff there. There’s the data model spec that’s programmatically generated from the CSDL files that define the Redfish schema, and the cool thing is the profiles that Rochelle discussed: the same document generator can take a Redfish profile, walk the schema, and do a line-item veto to create a document out of your profile that you can hand to a vendor and say, ‘This is what I want.’ All of that’s done with the document generator. We’ve got a mapping guide that we call the Rosetta Stone, and it’s something we’re going to try with everything from now on. There are some other fabrics out there, and you can expect people to work on Redfish mappings for those, so we’re going to take that Rosetta Stone concept and apply it to everything going forward. What that involves is people with feet in both standards bodies sitting down and painfully crawling through the spec line-by-line, table-by-table, row-by-row, and saying, ‘How does this map to Redfish?’ or ‘How does this map to Swordfish?’ or whatever. Because if you don’t do that, everybody grabs data from different places and puts it in different parts of the model, everybody ends up different, and that doesn’t do anybody any good. So we’ve got the Rosetta Stone out there for CXL. And there are a bunch of Redfish School videos out there; if you haven’t subscribed to the YouTube channel, there are three of them on fabrics, configurations, and routing, and there’s a ‘support for Compute Express Link’ version specifically. So there’s a bunch of stuff out there, and then there’s the fabrics white paper as well.
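To give a feel for what the document generator consumes, here’s a minimal interoperability profile sketch (the names and requirements are hypothetical; the real profile format is documented on the Developer Hub):

```python
# Sketch of a Redfish interoperability profile: the "line-item veto" input
# that the document generator and the Interop Validator both consume.
profile = {
    "SchemaDefinition": "RedfishInteroperabilityProfile.v1_6_0",
    "ProfileName": "ExampleStorageProfile",
    "ProfileVersion": "0.1.0",
    "Purpose": "Example: the subset of the model a buyer requires.",
    "Resources": {
        "Drive": {
            "PropertyRequirements": {
                "SerialNumber":     {"ReadRequirement": "Mandatory"},
                "FailurePredicted": {"ReadRequirement": "Recommended"},
            }
        }
    },
}
```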
Changes to the log service are coming along.
So, we’re doing three real big things to the log service. Number one: if you’ve ever crawled through a log, we’ve got ‘auto-clear resolved entries’ now. You can apply this, and every time there’s an assert (for those of you around for IPMI, there are asserts and de-asserts, right? ‘My fan went over critical; my fan went back under critical’), that’s no longer a problem: ‘Would you auto-clear all those things, please?’ Maybe I do care and I want to see those, and that’s fine, but if you’re the kind of person that says, ‘Only tell me the criticals, don’t tell me the warnings; only tell me the conditions I want to see right now. What are the problems in my data center?’, you can auto-clear resolved entries. That’s coming from the cloud providers that are downloading huge amounts of log files. They don’t need all the extra bandwidth, because they don’t care that it went critical; they just care about what’s going on now: ‘Do I shoot this thing in the head or not?’ Number two, we added a CXL log entry type. Originally, when we did log entry, we tried to get everybody to converge; we also knew that really wasn’t going to happen, so we created this generic log entry. So you’ll find we’ve got CPER objects now, for notification type and section type, and we’ve got CXL log records, because if we were to take a CXL log entry, break it down, and put it into Redfish, we would create a bunch of CXL-specific properties. It’s like, ‘Why bother doing that? Why don’t you just take the raw thing and stick it in there?’ Somebody who knows how to decode a CXL log entry can go ahead and pull that apart. And number three, we added diagnostic data. There’s now a way to create a log entry when your device does a diagnostic dump; you can then pull that through a Redfish log. That was designed around a new feature in PMCI called ‘File I/O,’ so you can trigger a data dump from a device in your live system, pull it through your BMC, and retrieve it through a log entry: a log entry of diagnostic data. That can be old diagnostic data or live diagnostic data, depending on what your device supports.
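A hedged sketch of how those features surface to a client (the host and entry IDs are hypothetical; the property names follow recent LogService and LogEntry schema versions):

```python
# Sketch: opt a log service into auto-clearing resolved entries, then pull
# a generic entry that carries a raw CPER record as diagnostic data.
import requests

BASE = "https://bmc.example.com"  # hypothetical endpoint
svc = f"{BASE}/redfish/v1/Systems/1/LogServices/EventLog"

# Ask the service to drop assert/de-assert pairs once the cause resolves.
requests.patch(svc, json={"AutoClearResolvedEntries": "ClearEventGroup"},
               verify=False)

entry = requests.get(f"{svc}/Entries/42", verify=False).json()
if entry.get("DiagnosticDataType") == "CPER":
    # The raw record is left intact; fetch it and decode it off-box with a
    # CPER-aware tool rather than exploding it into Redfish properties.
    dump = requests.get(BASE + entry["AdditionalDataURI"], verify=False)
```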
Metrics abound in Redfish; they’re all over the place. If you’re doing storage, obviously drive metrics and storage controller metrics matter to you, but port metrics are there as well, through the whole port model, so you can see what’s actually going in and out. If you’re connected through a network device function, there are metrics on that; there are memory metrics for what’s going on in CXL as well. And we’re looking at redoing the telemetry service. What we found out is the old-style telemetry service doesn’t match Prometheus and things like that when you’re looking for a whole lot of data, and some of these people are looking at filling a 100-gig pipe with a constant feed of telemetry data coming out of their systems on the management network. We figured out a way to shorten that, and there’s a presentation on that at OCP on day one, one of the very first discussions. Sorry, day one, which is Tuesday. Once that one gets released, it’ll be viewable in public in about a month.
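For the port metrics mentioned above, polling looks roughly like this (the paths are hypothetical, and the counters you actually get depend on the port and fabric type):

```python
# Sketch: read a switch port's metrics resource to see what's going in/out.
import requests

url = ("https://bmc.example.com/redfish/v1/Fabrics/CXL"
       "/Switches/1/Ports/1/Metrics")
metrics = requests.get(url, verify=False).json()

net = metrics.get("Networking", {})
print("RX frames:", net.get("RXFrames"), " TX frames:", net.get("TXFrames"))
```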
Environmental metrics: we basically redid the environmental metrics model. If you remember the old ‘fan’ and ‘cooling’ resources, they were all arrays, and that really didn’t work. We created this concept called an ‘excerpt,’ so everything now is a sensor, and you can still get use-case telemetry (‘What’s my power situation? What’s my cooling situation?’), but instead, you’re getting it as an excerpt within a series of collections. It’s a much faster interface and much easier to digest.
So, we’ve got an ‘environmental metrics’ resource, and that’s kind of a summary. You can still go out there, get the ‘fans’ collection, and go to an individual fan if you’ve got one that’s problematic; there’s always a way to drill down into the model based on whatever kind of information you’re getting, whether it’s a power supply that’s acting up or whatever. But with this general ‘environmental metrics’ resource, you can look at everything that’s going on in the system, get a broad dashboard kind of view, and get it all in one I/O instead of a bunch of little I/Os. Again, you still have the ability to drill down, but it’s basically a single schema definition, so you write the code once, and then we leverage it everywhere: in processors and memory and drives and adapters and controllers and heat exchangers and cooling units and CDUs and all of that. We try to leverage the model as much as we can.
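As a sketch of that one-I/O dashboard read (hypothetical host; the published schema name is EnvironmentMetrics, and the values come back as sensor excerpts):

```python
# Sketch: one GET returns power, temperature, and fan excerpts for a chassis.
import requests

env = requests.get(
    "https://bmc.example.com/redfish/v1/Chassis/1/EnvironmentMetrics",
    verify=False).json()

print("Power (W):", env["PowerWatts"]["Reading"])
print("Temp (C): ", env["TemperatureCelsius"]["Reading"])
for fan in env.get("FanSpeedsPercent", []):
    # Each excerpt carries a DataSourceUri you can follow to drill down
    # into the full Sensor resource when something looks problematic.
    print(fan.get("DeviceName"), fan.get("Reading"), fan.get("DataSourceUri"))
```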
We’ve got a work-in-progress out there for data center infrastructure management. If you’re into power and CDUs and that kind of cooling equipment, you can go out on the DMTF website, to the Redfish interface, and pull down the work-in-progress; that’s all out there on DCIM. We’ve got a task force spun up on it, with people from the companies that produce these kinds of equipment in there helping drive the data model, so you can take a look at that.
So, in summary: Redfish, along with the rest of the working groups, is really trying to complete interoperability down to almost the molecular level inside the infrastructure, and to get that out of the platform and out of the data center in a form you can digest in a modern toolchain. We’re also solving the security, attestation, and root-of-trust kinds of issues that enable a zero-trust platform; come see that session tomorrow.
As always, if you need any information, we’ve got the Redfish Developer Hub out there, with mockups and toolsets and toolchains you can play with. When I develop profiles, I go to a live system; I use the profile gen or (I forget the name of the download tool) to download a live system, and then I’ll go in there and edit it. I’ll use the mockup server to stand up a mockup on my local machine based on real data, then start running a profile against it with the profile interop validator, and then use the document generator to generate a document out of it. So, there’s a plethora of tools out there to assist you, whether you’re a client or someone building these kinds of infrastructures with Redfish. That’s all I have. Are there any questions?
Can you talk a little bit more about what’s going on with Redfish?
The Redfish session at OCP?
So, DMTF, there are a couple of things. If y’all are aware, on that first Tuesday afternoon at OCP, all the keynotes kick off in the morning, and then there are no other sessions going on; but there are three sessions this year, and one of those is strictly on manageability, hosted by the DMTF. What we’ve done is this: there were a lot of papers that didn’t make it into OCP. The manageability track got shrunk down to something like eight 20-minute sessions, so we bought an afternoon, and we’re contacting the authors that didn’t get accepted to do presentations. Because in most of OCP, when those sessions are short, they turn into workstream updates: all you’re finding out is what the workstreams are doing; you’re not finding out what the industry is doing, or ‘Hey, I did this cool thing,’ or ‘I’d like to change that.’ So we said, ‘We’d like to hold a session where we can tell you what’s going on and what things are coming out, but also hear stories from people that are actually implementing it, like OpenBMC and things like that.’ We thought we’d invite them in and have an afternoon conversation. I think SONiC is doing one in parallel, and then CXL is doing theirs; well, it isn’t actually the CXL consortium, it’s one of the vendors supporting CXL that’s bringing in a bunch of people, but we thought we’d do one as well. It’s at the same time as the expo, so if you’re looking for something to do and manageability is near and dear to your heart: like I said, the telemetry update will be one of the first sessions right after lunch, and we’ve got some PQC content. You can find the event calendar; it’s already uploaded on the OCP website as well as the DMTF website. So, if you’re going to the OCP Global Summit, you can pick up those sessions, and all of that will become public later; they record everything.
Can I just make one editorial comment about the metrics you presented, circling back to my presentation, and then I can go to this question? Those first two items on there were the joint metrics I was talking about earlier in my presentation; those were Swordfish, SNIA-maintained. So many metrics! What are the metrics that you’ve been adding to that?
Well, in fairness, if we’re going back to that: this whole connections part of the model that was brought in down here on the left was strictly invented because of NVMe, and we didn’t do ‘endpoint groups’ in Redfish either.
Endpoint groups?
Yeah, because you don’t need them inside of a single server; it’s just not a thing.
Zones of zones for storage?
Yeah, well, ‘zones of zones’ was for storage and for telco.
Yeah, all right, yeah, go ahead.
Just a question: is there a protocol or a mechanism to add extensions to the metrics? Just to give you the context: I work on DNA data storage. These are devices based on DNA data storage, so we have a need to monitor metrics that are slightly different from what existing monitoring uses, but we would like to have an expansion or extension of Redfish.
Sure. If you’re looking for something OEM-specific, there’s an OEM object inside of every resource in Redfish, and you can do your own thing there; a lot of people do. None of that is interoperable, but at least it’s called out: ‘Hey, this is OEM, this name, this vendor.’ We’ve had some people use that for standards-body material and throw it in there, and we’re like, ‘No, no, no, no, don’t do that.’ You see that red box down there? We’ve got a user forum that you can get to through the website; submit a question, say, ‘Hey, these are the things I’m looking at adding,’ because if you want it there, there’s a good chance somebody else out there wants it too.
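As an illustration of that escape hatch (the vendor section and properties below are entirely made up), OEM metrics hang off the standard resource like this:

```python
# Sketch: vendor-specific metrics live under "Oem", keyed by organization,
# so they're clearly called out even though they aren't interoperable.
drive_metrics = {
    "@odata.id": "/redfish/v1/Systems/1/Storage/1/Drives/1/Metrics",
    "Id": "Metrics",
    "Oem": {
        "Contoso": {  # hypothetical vendor section
            "@odata.type": "#ContosoDriveMetrics.v1_0_0.ContosoDriveMetrics",
            "SynthesisCycles": 1024,      # made-up DNA-storage-ish metrics
            "ReadRetentionErrors": 3,
        }
    },
}
```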
Yeah, that would be part of the DNA Data Storage Alliance under SNIA, but the… so, just so I understand correctly: as a pioneer of this, we could implement the OEM extension initially.
Yes.
And I hope to get that through the DNA Data Storage Alliance.
Yep, and I’m out of time, but my company used to do that, and what we figured out is it’s just not worth it. Take it to the standard, because otherwise what you end up with is technical debt.