Commit 4075b10

sean-jc authored and yamahata committed
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based memory that is tied to a specific KVM virtual machine and whose primary purpose is to serve guest memory.

A guest-first memory subsystem allows for optimizations and enhancements that are kludgy or outright infeasible to implement/support in a generic memory subsystem. With guest_memfd, guest protections and mapping sizes are fully decoupled from host userspace mappings. E.g. KVM currently doesn't support mapping memory as writable in the guest without it also being writable in host userspace, as KVM's ABI uses VMA protections to define the allowed guest protection. Userspace can fudge this by establishing two mappings, a writable mapping for the guest and a readable one for itself, but that's suboptimal on multiple fronts.

Similarly, KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size, e.g. KVM doesn't support creating a 1GiB guest mapping unless userspace also has a 1GiB mapping. Decoupling the mapping sizes would allow userspace to precisely map only what is needed without impacting guest performance, e.g. to harden against unintentional accesses to guest memory.

Decoupling guest and userspace mappings may also allow for a cleaner alternative to high-granularity mappings for HugeTLB, which has reached a bit of an impasse and is unlikely to ever be merged.

A guest-first memory subsystem also provides a clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to mmap() guest memory).

More immediately, being able to map memory into KVM guests without mapping said memory into the host is critical for Confidential VMs (CoCo VMs), the initial use case for guest_memfd.
While AMD's SEV and Intel's TDX prevent untrusted software from reading guest private data by encrypting guest memory with a key that isn't usable by the untrusted host, projects such as Protected KVM (pKVM) provide confidentiality and integrity *without* relying on memory encryption. And with SEV-SNP and TDX, accessing guest private memory can be fatal to the host, i.e. KVM must prevent host userspace from accessing guest memory irrespective of hardware behavior.

Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as being mappable only by KVM (or a similarly enlightened kernel subsystem). That approach was abandoned largely due to it needing to play games with PROT_NONE to prevent userspace from accessing guest memory.

Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping guest private memory into userspace, but that approach failed to meet several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel wouldn't easily be able to enforce a 1:1 page:guest association, let alone a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory that isn't backed by 'struct page', e.g. if devices gain support for exposing encrypted memory regions to guests.

Attempt #3 was to extend the memfd() syscall and wrap shmem to provide dedicated file-based guest memory. That approach made it as far as v10 before feedback from Hugh Dickins and Christian Brauner (and others) led to its demise.

Hugh's objection was that piggybacking shmem made no sense for KVM's use case as KVM didn't actually *want* the features provided by shmem. I.e. KVM was using memfd() and shmem to avoid having to manage memory directly, not because memfd() and shmem were the optimal solution, e.g. things like read/write/mmap in shmem were dead weight.

Christian pointed out flaws with implementing a partial overlay (wrapping only _some_ of shmem), e.g. poking at inode_operations or super_operations would show shmem stuff, but address_space_operations and file_operations would show KVM's overlay. Paraphrasing heavily, Christian suggested KVM stop being lazy and create a proper API.

Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/[email protected]
Cc: Fuad Tabba <[email protected]>
Cc: Vishal Annapurve <[email protected]>
Cc: Ackerley Tng <[email protected]>
Cc: Jarkko Sakkinen <[email protected]>
Cc: Maciej Szmigiero <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Michael Roth <[email protected]>
Cc: Wang <[email protected]>
Cc: Liam Merwick <[email protected]>
Cc: Isaku Yamahata <[email protected]>
Co-developed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Co-developed-by: Yu Zhang <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Co-developed-by: Chao Peng <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
Co-developed-by: Ackerley Tng <[email protected]>
Signed-off-by: Ackerley Tng <[email protected]>
Co-developed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Co-developed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
1 parent 70c537b commit 4075b10
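
For orientation, a userspace call to the new ioctl might look roughly like the sketch below. It is not taken from the commit: the struct and request number are local mirrors of the uAPI definitions in this patch (so the snippet builds against older headers), and the names gmem_create_args, GMEM_CREATE_GUEST_MEMFD and create_guest_memfd() are hypothetical. Real code would include <linux/kvm.h> and use KVM_CREATE_GUEST_MEMFD directly. The sketch assumes the ioctl returns the new file descriptor on success, as the documentation text in this patch describes.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Local mirror of struct kvm_create_guest_memfd from this commit
 * (hypothetical name; real code should use the <linux/kvm.h> definition).
 */
struct gmem_create_args {
	uint64_t size;        /* size of the guest_memfd file, in bytes */
	uint64_t flags;       /* no flags are defined by this commit; pass 0 */
	uint64_t reserved[6]; /* left zeroed */
};

static_assert(sizeof(struct gmem_create_args) == 64, "expected 8 x u64");

/* KVMIO is 0xAE in <linux/kvm.h>; 0xd4 is the number used in this commit. */
#define GMEM_KVMIO 0xAE
#define GMEM_CREATE_GUEST_MEMFD \
	_IOWR(GMEM_KVMIO, 0xd4, struct gmem_create_args)

/*
 * Ask KVM for a guest_memfd of `size` bytes on the VM referred to by vm_fd.
 * Returns the new file descriptor, or -1 with errno set by ioctl().
 */
static int create_guest_memfd(int vm_fd, uint64_t size)
{
	struct gmem_create_args args = { .size = size, .flags = 0 };

	return ioctl(vm_fd, GMEM_CREATE_GUEST_MEMFD, &args);
}
```

A caller would pass a VM file descriptor obtained from KVM_CREATE_VM and should check KVM_CAP_GUEST_MEMFD first; on an invalid descriptor the helper simply propagates the ioctl() failure.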

8 files changed: +764 -15 lines changed

Documentation/virt/kvm/api.rst

Lines changed: 68 additions & 1 deletion
@@ -6201,6 +6201,15 @@ superset of the features supported by the system.
 :Parameters: struct kvm_userspace_memory_region2 (in)
 :Returns: 0 on success, -1 on error
 
+KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
+allows mapping guest_memfd memory into a guest. All fields shared with
+KVM_SET_USER_MEMORY_REGION identically. Userspace can set KVM_MEM_PRIVATE in
+flags to have KVM bind the memory region to a given guest_memfd range of
+[guest_memfd_offset, guest_memfd_offset + memory_size]. The target guest_memfd
+must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
+the target range must not be bound to any other memory region. All standard
+bounds checks apply (use common sense).
+
 ::
 
   struct kvm_userspace_memory_region2 {
@@ -6209,9 +6218,24 @@ superset of the features supported by the system.
 	__u64 guest_phys_addr;
 	__u64 memory_size; /* bytes */
 	__u64 userspace_addr; /* start of the userspace allocated memory */
+	__u64 guest_memfd_offset;
+	__u32 guest_memfd;
+	__u32 pad1;
+	__u64 pad2[14];
   };
 
-See KVM_SET_USER_MEMORY_REGION.
+A KVM_MEM_PRIVATE region _must_ have a valid guest_memfd (private memory) and
+userspace_addr (shared memory). However, "valid" for userspace_addr simply
+means that the address itself must be a legal userspace address. The backing
+mapping for userspace_addr is not required to be valid/populated at the time of
+KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated
+on-demand.
+
+When mapping a gfn into the guest, KVM selects shared vs. private, i.e consumes
+userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE
+state. At VM creation time, all memory is shared, i.e. the PRIVATE attribute
+is '0' for all gfns. Userspace can control whether memory is shared/private by
+toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
 
 4.140 KVM_SET_MEMORY_ATTRIBUTES
 -------------------------------
@@ -6249,6 +6273,49 @@ the state of a gfn/page as needed.
 
 The "flags" field is reserved for future extensions and must be '0'.
 
+4.141 KVM_CREATE_GUEST_MEMFD
+----------------------------
+
+:Capability: KVM_CAP_GUEST_MEMFD
+:Architectures: none
+:Type: vm ioctl
+:Parameters: struct struct kvm_create_guest_memfd(in)
+:Returns: 0 on success, <0 on error
+
+KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor
+that refers to it. guest_memfd files are roughly analogous to files created
+via memfd_create(), e.g. guest_memfd files live in RAM, have volatile storage,
+and are automatically released when the last reference is dropped. Unlike
+"regular" memfd_create() files, guest_memfd files are bound to their owning
+virtual machine (see below), cannot be mapped, read, or written by userspace,
+and cannot be resized (guest_memfd files do however support PUNCH_HOLE).
+
+::
+
+  struct kvm_create_guest_memfd {
+	__u64 size;
+	__u64 flags;
+	__u64 reserved[6];
+  };
+
+Conceptually, the inode backing a guest_memfd file represents physical memory,
+i.e. is coupled to the virtual machine as a thing, not to a "struct kvm". The
+file itself, which is bound to a "struct kvm", is that instance's view of the
+underlying memory, e.g. effectively provides the translation of guest addresses
+to host memory. This allows for use cases where multiple KVM structures are
+used to manage a single virtual machine, e.g. when performing intrahost
+migration of a virtual machine.
+
+KVM currently only supports mapping guest_memfd via KVM_SET_USER_MEMORY_REGION2,
+and more specifically via the guest_memfd and guest_memfd_offset fields in
+"struct kvm_userspace_memory_region2", where guest_memfd_offset is the offset
+into the guest_memfd instance. For a given guest_memfd file, there can be at
+most one mapping per page, i.e. binding multiple memory regions to a single
+guest_memfd range is not allowed (any number of memory regions can be bound to
+a single guest_memfd file, but the bound ranges must not overlap).
+
+See KVM_SET_USER_MEMORY_REGION2 for additional details.
+
 5. The kvm_run structure
 ========================
 
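
The binding rule documented above (any number of memory regions per guest_memfd file, but no overlapping ranges) can be illustrated with a small userspace model. This is only a sketch, not KVM's implementation: struct gmem_binding and both helpers are hypothetical names, ranges are treated as half-open byte ranges (an assumption, since the documentation writes the range with a closed bracket), and existing bindings are assumed to be valid already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one memory region bound into a guest_memfd file. */
struct gmem_binding {
	uint64_t offset; /* guest_memfd_offset, in bytes */
	uint64_t size;   /* memory_size, in bytes */
};

/* True if the half-open byte ranges [offset, offset + size) intersect. */
static bool gmem_bindings_overlap(const struct gmem_binding *a,
				  const struct gmem_binding *b)
{
	return a->offset < b->offset + b->size &&
	       b->offset < a->offset + a->size;
}

/*
 * Returns true if `candidate` may be bound given the `n` existing bindings:
 * it must be non-empty, must not wrap, must fit inside the file, and must
 * not overlap any range that is already bound.
 */
static bool gmem_can_bind(const struct gmem_binding *existing, size_t n,
			  uint64_t file_size,
			  const struct gmem_binding *candidate)
{
	uint64_t end = candidate->offset + candidate->size;

	if (candidate->size == 0 || end < candidate->offset || end > file_size)
		return false;

	for (size_t i = 0; i < n; i++) {
		if (gmem_bindings_overlap(&existing[i], candidate))
			return false;
	}
	return true;
}
```

With two existing 2MiB bindings at offsets 0 and 4MiB in an 8MiB file, a new 2MiB binding at offset 2MiB is accepted by this model, while one at offset 1MiB is rejected because it overlaps the first range.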
include/linux/kvm_host.h

Lines changed: 48 additions & 0 deletions
@@ -591,8 +591,20 @@ struct kvm_memory_slot {
 	u32 flags;
 	short id;
 	u16 as_id;
+
+#ifdef CONFIG_KVM_PRIVATE_MEM
+	struct {
+		struct file __rcu *file;
+		pgoff_t pgoff;
+	} gmem;
+#endif
 };
 
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
+{
+	return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
 {
 	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
@@ -687,6 +699,17 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 }
 #endif
 
+/*
+ * Arch code must define kvm_arch_has_private_mem if support for private memory
+ * is enabled.
+ */
+#if !defined(kvm_arch_has_private_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM)
+static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
+{
+	return false;
+}
+#endif
+
 struct kvm_memslots {
 	u64 generation;
 	atomic_long_t last_used_slot;
@@ -1401,6 +1424,7 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_invalidate_begin(struct kvm *kvm);
 void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
 void kvm_mmu_invalidate_end(struct kvm *kvm);
+bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -2356,6 +2380,30 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range);
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range);
+
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
+	       kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+#else
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+	return false;
+}
 #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+#else
+static inline int kvm_gmem_get_pfn(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, gfn_t gfn,
+				   kvm_pfn_t *pfn, int *max_order)
+{
+	KVM_BUG_ON(1, kvm);
+	return -EIO;
+}
+#endif /* CONFIG_KVM_PRIVATE_MEM */
+
 #endif
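
The role of kvm_mem_is_private() in choosing between the shared (userspace_addr) and private (guest_memfd) backing can be modelled in plain userspace C. The sketch below is a toy model under stated assumptions, not kernel code: struct toy_vm, toy_set_attributes() and toy_backing_for() are hypothetical, and a flat array stands in for KVM's per-gfn attribute storage. Only the attribute bit value (1ULL << 3) comes from this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit as KVM_MEMORY_ATTRIBUTE_PRIVATE in the uAPI header. */
#define ATTR_PRIVATE (1ULL << 3)

enum backing { BACKING_SHARED, BACKING_PRIVATE };

/* Toy stand-in for KVM's per-gfn attribute storage (hypothetical). */
struct toy_vm {
	uint64_t *attrs; /* one attribute word per gfn, all zero at creation */
	size_t nr_gfns;
};

/* Models KVM_SET_MEMORY_ATTRIBUTES over the gfn range [start, start + nr). */
static int toy_set_attributes(struct toy_vm *vm, size_t start, size_t nr,
			      uint64_t attributes)
{
	if (start > vm->nr_gfns || nr > vm->nr_gfns - start)
		return -1;

	for (size_t i = 0; i < nr; i++)
		vm->attrs[start + i] = attributes;
	return 0;
}

/* Which backing store a guest fault on `gfn` would consume in this model. */
static enum backing toy_backing_for(const struct toy_vm *vm, size_t gfn)
{
	assert(gfn < vm->nr_gfns);
	return (vm->attrs[gfn] & ATTR_PRIVATE) ? BACKING_PRIVATE
					       : BACKING_SHARED;
}
```

Starting from an all-zero attribute array, every gfn resolves to the shared backing, which matches the documented state at VM creation. Setting ATTR_PRIVATE on a range flips only those gfns to the private backing.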

include/uapi/linux/kvm.h

Lines changed: 14 additions & 1 deletion
@@ -102,7 +102,10 @@ struct kvm_userspace_memory_region2 {
 	__u64 guest_phys_addr;
 	__u64 memory_size;
 	__u64 userspace_addr;
-	__u64 pad[16];
+	__u64 guest_memfd_offset;
+	__u32 guest_memfd;
+	__u32 pad1;
+	__u64 pad2[14];
 };
 
 /*
@@ -112,6 +115,7 @@ struct kvm_userspace_memory_region2 {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
 #define KVM_MEM_READONLY (1UL << 1)
+#define KVM_MEM_PRIVATE (1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1221,6 +1225,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_USER_MEMORY2 231
 #define KVM_CAP_MEMORY_FAULT_INFO 232
 #define KVM_CAP_MEMORY_ATTRIBUTES 233
+#define KVM_CAP_GUEST_MEMFD 234
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2301,4 +2306,12 @@ struct kvm_memory_attributes {
 
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
 
+#define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
+
+struct kvm_create_guest_memfd {
+	__u64 size;
+	__u64 flags;
+	__u64 reserved[6];
+};
+
 #endif /* __LINUX_KVM_H */
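
The first hunk above carves the new fields out of the old padding, so the size of struct kvm_userspace_memory_region2 should not change. The sketch below checks that arithmetic with local mirror structs. The names region2_before, region2_after and region2_claimed_bytes() are hypothetical, and the leading slot and flags members are not visible in this hunk; they are assumed here from the existing struct definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the layout before this commit (slot/flags assumed, see above). */
struct region2_before {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;
	uint64_t pad[16];
};

/* Mirror of the layout after this commit. */
struct region2_after {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;
	uint64_t guest_memfd_offset;
	uint32_t guest_memfd;
	uint32_t pad1;
	uint64_t pad2[14];
};

/* The new fields must reuse the old padding without growing the struct. */
static_assert(sizeof(struct region2_before) == sizeof(struct region2_after),
	      "struct size must not change");
static_assert(offsetof(struct region2_after, guest_memfd_offset) ==
	      offsetof(struct region2_before, pad),
	      "new fields must start where the old padding started");

/* Number of former padding bytes now claimed by the new fields. */
static size_t region2_claimed_bytes(void)
{
	return offsetof(struct region2_after, pad2) -
	       offsetof(struct region2_before, pad);
}
```

The two new u64-sized slots (one __u64, plus a __u32 and its __u32 pad) account for the drop from pad[16] to pad2[14].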

virt/kvm/Kconfig

Lines changed: 4 additions & 0 deletions
@@ -100,3 +100,7 @@ config KVM_GENERIC_MMU_NOTIFIER
 config KVM_GENERIC_MEMORY_ATTRIBUTES
        select KVM_GENERIC_MMU_NOTIFIER
        bool
+
+config KVM_PRIVATE_MEM
+       select XARRAY_MULTI
+       bool

virt/kvm/Makefile.kvm

Lines changed: 1 addition & 0 deletions
@@ -12,3 +12,4 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
 kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
 kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
 kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
+kvm-$(CONFIG_KVM_PRIVATE_MEM) += $(KVM)/guest_memfd.o
