Hi everybody, my name is Trond Myklebust. I'm the current CTO of Hammerspace, but today I'll mostly be speaking in my capacity as the Linux NFS client maintainer. What I'd like to talk about is some of the recent changes that have been made to the kernel NFS client, and a little bit about how we see the roadmap moving forward and what's driving it.

So let's start with the first question: what is actually driving NFS in the enterprise today? NFS has been declared dead so many times at this point, and yet it keeps coming back. In this case it's mainly been the AI revolution: huge data centers filled to the brim with servers, each of them running several tens, or in some cases even hundreds, of GPUs, and the vast amounts of data those consume in order to train AI models, large language models, generative AI, and that sort of thing.

You probably saw this slide earlier today when David was presenting, but it's important to note what kind of requirements these large language models impose on us. It is extremely hard to serve up data at a fast enough rate without the help of a parallel file system. This particular customer has a performance requirement of roughly 12.5 terabytes per second with today's models, and the next generation is probably going to be an order of magnitude beyond that. There are typically several tens of petabytes of data needed to feed these clusters; the GPUs number in the thousands or tens of thousands, and to feed them you need data servers in the thousands of nodes. So, in order to benefit as much as possible from multiple sources, standardization is very much a requirement. There are plenty of proprietary parallel file systems, but they all tend to require special drivers, special this and that, in order to run. By using a standards-based model, clients run out of the box as delivered by your software vendor; you do not need to change your infrastructure to adapt to the requirements of the storage you are using.

This has very much driven a lot of the conversations in the IETF over the last few years. One of the requirements we have in the Linux NFS community in general (and Chuck is here to keep me honest about that) is that we like to make sure everything is standards-based. There are good reasons for that beyond just customer requirements. As a maintainer, I need to understand the code I'm developing: whether some behavior I find is a bug, something the protocol requires me to do, or something I can optimize or even remove. With Linux being largely documented by code rather than by writing papers up front, the protocol basically keeps us honest and ensures we know what the contents of the kernel are and what we can do with them. The other thing standards do, obviously, is keep vendors honest. It's a community-driven process: everybody who wants a voice is able to participate, and technically the IETF doesn't recognize companies, so everybody who participates is representing themselves rather than a company interest. That, too, helps keep things honest and makes sure that not just the Linux community but also other communities, BSD and so on, and even the larger vendors, can be represented through that process.

So let's talk a little bit about pNFS and standards-based parallel file systems. pNFS is basically a way of splitting the architecture of NFS into two separate parts: a metadata part and a data part. The client talks state with the metadata server, and once it has data to read or write, it talks directly to the data servers. This model goes back to the early 2000s and was itself the result of a standardization process involving several vendors. It was initially not well adopted, for a couple of reasons. One was the lack of a proper market for it: HPC was never really a big thing until the AI workloads appeared. The other was that the vendors found themselves competing with other scale-out architectures. As David mentioned earlier today, Isilon turned up and basically killed the enterprise pNFS market, at least for a while, and made things much simpler because you didn't need complicated clients that understood the advanced versions of NFSv4; clients could continue using NFSv3 and hobble along until the bigger data requirements came along in the last few years.

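To make that split concrete, here is a minimal sketch of the control flow a pNFS client follows; it is not the actual kernel code. The client asks the metadata server for a layout, then performs the I/O directly against the data servers named in that layout. All of the names here (the layoutget() helper, the server hostnames, the stripe size) are invented for illustration.

```c
/* Toy illustration of the pNFS read path: layout from the metadata
 * server (MDS), data directly from the data servers (DS).
 * All names are hypothetical. */
#include <stddef.h>
#include <stdio.h>

struct layout {
    const char *data_server;   /* where the bytes actually live        */
    size_t      stripe_size;   /* striping unit handed out by the MDS  */
};

/* Control path: LAYOUTGET to the metadata server (stateful). */
static struct layout layoutget(const char *mds, const char *path)
{
    printf("MDS %s: LAYOUTGET for %s\n", mds, path);
    return (struct layout){ .data_server = "ds1.example.com",
                            .stripe_size = 1 << 20 };
}

/* Data path: READ goes straight to the data server, bypassing the MDS. */
static void read_stripe(const struct layout *lo, size_t offset)
{
    printf("DS %s: READ offset %zu, length %zu\n",
           lo->data_server, offset, lo->stripe_size);
}

int main(void)
{
    struct layout lo = layoutget("mds.example.com", "/data/model.ckpt");
    for (size_t off = 0; off < 4 * lo.stripe_size; off += lo.stripe_size)
        read_stripe(&lo, off);
    return 0;
}
```

Even in this toy the point of the design is visible: the metadata server sits only on the control path, so adding data servers scales the data path independently.
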
So let's talk a little bit about what we have done in recent years in NFSv4.2. There has been a constant optimization of the various operations that the NFS client sends to servers. Compound operations, which were introduced in NFSv4 back in the early 2000s, never really delivered the optimization they promised, largely because of the POSIX API that clients are constantly required to follow. However, the introduction of new modes of operation, in particular the ability to use delegations (which those of you more familiar with SMB will know as leases), meant that clients were able to cache not just data and attributes, but also state. That led to a requirement for more complex compounds, because you have to recover these things, return them efficiently, and so on.

What's more, we were able to add a few new additions to the protocol itself. NFSv4.2, unlike previous generations of the NFSv4 protocol, allows for extensions beyond the basic set of operations that were written down in the RFC back in the day. This process has been used on several occasions in the last ten years to add more complex operations as we've learned how to use the protocol better, and learned what the holes in it were that prevented us from making full use of its potential.

One of the main additions, in my view, came very recently: a delegation that not only allows the client to be the source of truth for the data and the simpler metadata, but also lets it act as the source of truth for the timestamps. This basically eliminated 80% of the GETATTR traffic, because pretty much every stat call that a client doing I/O puts on the wire is just asking what the mtime and atime of the file are so that they can be reported to the application; and, as most of us know, in most cases the application doesn't even use that information. So the ability to cache those timestamps and synchronize at the end of the I/O is extremely valuable and eliminates most of the attribute traffic in our setups.

The other optimization is around file opens and closes. In the original version of NFSv4.2 it was typical to need several round trips just to create a file, because not only do you need to open the file, but to get proper exclusive-create (exactly-once) semantics there was an extra step: a verifier cookie was saved in the file's attributes, the client would use it to confirm that yes, it did create this file, and then it would have to overwrite those attributes. These days we are seeing file systems that allow these cookies to be stored as extended attributes or elsewhere in the metadata, which cuts your create operations down by 50 percent. And having attribute delegations in the first place means you can cache open state and make the close operation asynchronous. So, really, to open a file, write some data, and close it, the number of synchronous operations gets reduced to two: you have the open and the write, and you can cache the close. Yes? Question?

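As a rough illustration of why timestamp (attribute) delegations remove so much GETATTR traffic, here is a hypothetical sketch of the decision the client gets to make when an application calls stat(): if it holds a delegation covering the attributes, it can answer from its own cache and only synchronize with the server when the delegation is returned. The structures and helper names below are made up for this example; they are not the real Linux client code.

```c
/* Hypothetical sketch: serving stat() from a cached attribute delegation. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct nfs_inode_cache {
    bool   holds_attr_delegation;  /* client is the source of truth */
    time_t mtime, atime;           /* timestamps maintained locally  */
};

/* Placeholder for an over-the-wire GETATTR round trip. */
static void getattr_over_the_wire(struct nfs_inode_cache *c)
{
    printf("GETATTR round trip to the server\n");
    c->mtime = c->atime = time(NULL);
}

static void client_stat(struct nfs_inode_cache *c)
{
    if (c->holds_attr_delegation) {
        /* No RPC: the delegation lets us report our own timestamps. */
        printf("stat from cache: mtime=%ld atime=%ld\n",
               (long)c->mtime, (long)c->atime);
    } else {
        getattr_over_the_wire(c);
    }
}

int main(void)
{
    struct nfs_inode_cache ino = { .holds_attr_delegation = true,
                                   .mtime = 1000, .atime = 1001 };
    client_stat(&ino);              /* served locally, no GETATTR */
    ino.holds_attr_delegation = false;
    client_stat(&ino);              /* falls back to a round trip  */
    return 0;
}
```

Multiply that saved round trip by every stat() issued during a large job and the 80% figure quoted above becomes plausible.
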
I assume the changes are in both the server and the client?

Yes.

In your previous slide, you said that you have the open, you have the write, and then you can cache the close. In the three boxes you have here, the green box and the last one, where is the actual NFS server, and what version does it have to be?

So, in order to use attribute delegations, your server has to support NFSv4.2, and it has to implement the extension that is in the process of being finalized right now in the IETF. At this point the wire protocol and so on is fully determined, which is why we have taken it into the NFS client. I believe Jeff Layton has been working on a Linux server implementation of it. But yes, you need NFSv4.2 in order to use this.

Is that on the metadata server, or even on the...

That's on the metadata server. Your storage can be any version of NFS: there are different flavors of pNFS, so you can be using flexfiles, which uses NFSv3; you can use the files layout, which is NFSv4.1; or there are various block-based protocols.

So, in addition to the access-protocol chatter, there has also been a lot of work on other performance enhancements. Not too long ago we introduced the ability to set up multiple connections. This basically allows the TCP stack to perform better, typically by allowing the use of bonded networks and of the accelerations within standard networking cards. The other set of improvements tends to be around durability: the ability to write mirrored data, and with it the ability to build highly available systems across the data nodes. Okay, we can skip this.

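The multiple-connections feature is conceptually simple: spread the RPC traffic for one mount across several TCP connections so that a single connection does not become the bottleneck. The toy sketch below only illustrates that round-robin idea; the type and function names are invented, and the real client's transport selection is more involved.

```c
/* Illustrative round-robin spread of RPCs over several connections. */
#include <stdio.h>

#define NCONNECT 4

struct transport { int id; };

static struct transport xprts[NCONNECT] = { {0}, {1}, {2}, {3} };

/* Pick the next transport in round-robin order. */
static struct transport *pick_transport(void)
{
    static unsigned next;
    return &xprts[next++ % NCONNECT];
}

int main(void)
{
    for (int rpc = 0; rpc < 8; rpc++) {
        struct transport *t = pick_transport();
        printf("RPC %d -> connection %d\n", rpc, t->id);
    }
    return 0;
}
```

In practice this is what the Linux client's nconnect mount option controls.
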
So let's talk a little bit about the general roadmap, at least as I see it, for the evolution of NFS, driven by high-performance storage and the AI workloads. We've already seen parallel NFS; we've seen how flexfiles lets you aggregate NFSv3 data servers to act as a single storage system with linear scalability; we've seen the ability to do multiple connections; and the attribute delegations I mentioned earlier are just going in.

In addition, we just merged some code to allow faster failover when you're using pNFS with flexfiles, basically by allowing the client to notify the metadata server of what happened while the metadata server was rebooting. If a client was writing during that period, the metadata server would otherwise have to scan all the open files it knew of before the reboot to see whether they changed in any way; alternatively, you can have a notification mechanism that lets the client tell it what actually changed. Doing the latter allows the server to come back up much faster and avoid all that unnecessary scanning.

Another thing being pushed is for the case where the client and the NFS server are collocated, which is often the case in containerized environments. If you're running in a container and another container on the same node is acting as an NFS server, there are optimizations that can be done to avoid the network hop between the two. Since they're running on the same hardware, on top of the same kernel, one of the things that can be done is to allow the client, with certain restrictions, to open the file directly on the server and do I/O directly through the kernel without going over the network. This has been shown to give significant performance improvements in certain areas, and it is one of the things we're hoping to get in relatively soon.

The other thing being driven by these AI workloads is the need for data durability and high availability. If you want to store the data in a capacity-efficient way, rather than just doing pure mirroring, then you really need something like erasure coding. So one of the things we are trying to organize within the IETF at this point is a drive towards adopting erasure-coding standards for the client. The reason for doing this on the client is that it's the only way to scale: when you have several tens of thousands of clients writing several tens of terabytes per second, feeding all of that through a server that then has to erasure-code the data and write it out to separate servers introduces a bottleneck into the system.

Lastly, the final thing being driven by these AI workloads is improved security and privacy for your data. NFS has long supported the RPCSEC_GSS standard for protecting data. However, much of the work on accelerating privacy has tended to center around improving the performance of TLS rather than the older RPCSEC_GSS, despite the fact that the latter tends to be more prevalent in enterprise storage settings; TLS is obviously being driven by the fact that you have web servers. We have already standardized TLS for use with RPC and NFS. What remains to be done is really to extend it into the parallel file systems and allow its use within pNFS.

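Coming back to the client-side erasure-coding point above, here is a minimal sketch of why pushing the coding work to the client scales. It uses simple XOR parity rather than a real Reed-Solomon code, and all of the helpers and sizes are invented for illustration: each client computes its own parity block and writes the data and parity stripes to different data servers, so no single server has to re-encode the aggregate stream.

```c
/* Minimal client-side "erasure coding" sketch using XOR parity.
 * Real deployments would use a Reed-Solomon style code; this only
 * shows the data flow (compute parity locally, write stripes out). */
#include <stdio.h>
#include <string.h>

#define STRIPES     3      /* data blocks per stripe group */
#define BLOCK_SIZE  8      /* bytes, tiny for illustration */

/* Placeholder for a WRITE to a particular data server. */
static void write_block(int server, const unsigned char *block)
{
    printf("DS %d: WRITE %zu bytes (first byte 0x%02x)\n",
           server, (size_t)BLOCK_SIZE, block[0]);
}

int main(void)
{
    unsigned char data[STRIPES][BLOCK_SIZE];
    unsigned char parity[BLOCK_SIZE] = { 0 };

    memset(data, 0xAB, sizeof(data));           /* stand-in payload */

    /* The client computes the parity block itself... */
    for (int s = 0; s < STRIPES; s++)
        for (int i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[s][i];

    /* ...then writes data and parity to separate data servers,
     * so no central server has to re-encode the whole stream. */
    for (int s = 0; s < STRIPES; s++)
        write_block(s, data[s]);
    write_block(STRIPES, parity);
    return 0;
}
```

With a real erasure code you would typically tolerate more than one lost stripe, but the data flow, and the reason it avoids a central bottleneck, is the same.
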
Any questions? Yes?

In the roadmap slide, do you also have information on which versions support each of these features, and which NFS server is needed for them? Do you have those details?

So, most of the attribute delegation stuff just went in, so it will be in 6.11, which just came out this weekend; you'll find it there. A lot of the more futuristic things, we'll see when they land.

The existing ones?

Sorry, can you repeat that?

You're describing client-side erasure coding as a capability. I think it's being added to flexfiles, or is it a new layout?

No, as I said, we're trying to motivate it. There's no working-group draft yet. I believe it was presented at the July IETF meeting, but it's literally just a talking point; it's a personal draft at this point by Tom Haynes.

I see. A second question: what is the barrier to already using TLS?

Right now, it's just the protocol itself. There is no way for the server to tell the client that it should be using TLS, or for them to negotiate what kind of certificates need to be presented, that sort of thing. So it's literally just a question of updating the protocol.

Any other questions? All right. Thank you everyone for your time.