OSDN Git Service

tomoyo/tomoyo-test1.git
20 months agoKVM: selftests: Rename perf_test_util.[ch] to memstress.[ch]
David Matlack [Wed, 12 Oct 2022 16:57:27 +0000 (09:57 -0700)]
KVM: selftests: Rename perf_test_util.[ch] to memstress.[ch]

Rename the perf_test_util.[ch] files to memstress.[ch]. Symbols are
renamed in the following commit to reduce the amount of churn here in
hopes of playiing nice with git's file rename detection.

The name "memstress" was chosen to better describe the functionality
proveded by this library, which is to create and run a VM that
reads/writes to guest memory on all vCPUs in parallel.

"memstress" also contains the same number of chracters as "perf_test",
making it a drop-in replacement in symbols, e.g. function names, without
impacting line lengths. Also the lack of underscore between "mem" and
"stress" makes it clear "memstress" is a noun.

Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20221012165729.3505266-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
20 months agoKVM: selftests: randomize page access order
Colton Lewis [Mon, 7 Nov 2022 18:22:08 +0000 (18:22 +0000)]
KVM: selftests: randomize page access order

Create the ability to randomize page access order with the -a
argument. This includes the possibility that the same pages may be hit
multiple times during an iteration or not at all.

Population has random access as false to ensure all pages will be
touched by population and avoid page faults in late dirty memory that
would pollute the test results.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20221107182208.479157-5-coltonlewis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
20 months agoKVM: selftests: randomize which pages are written vs read
Colton Lewis [Mon, 7 Nov 2022 18:22:07 +0000 (18:22 +0000)]
KVM: selftests: randomize which pages are written vs read

Randomize which pages are written vs read using the random number
generator.

Change the variable wr_fract and associated function calls to
write_percent that now operates as a percentage from 0 to 100 where X
means each page has an X% chance of being written. Change the -f
argument to -w to reflect the new variable semantics. Keep the same
default of 100% writes.

Population always uses 100% writes to ensure all memory is actually
populated and not just mapped to the zero page. The prevents expensive
copy-on-write faults from occurring during the dirty memory iterations
below, which would pollute the performance results.

Each vCPU calculates its own random seed by adding its index to the
seed provided.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20221107182208.479157-4-coltonlewis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
20 months agoKVM: selftests: create -r argument to specify random seed
Colton Lewis [Mon, 7 Nov 2022 18:22:06 +0000 (18:22 +0000)]
KVM: selftests: create -r argument to specify random seed

Create a -r argument to specify a random seed. If no argument is
provided, the seed defaults to 1. The random seed is set with
perf_test_set_random_seed() and must be set before guest_code runs to
apply.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20221107182208.479157-3-coltonlewis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
20 months agoKVM: selftests: implement random number generator for guest code
Colton Lewis [Mon, 7 Nov 2022 18:22:05 +0000 (18:22 +0000)]
KVM: selftests: implement random number generator for guest code

Implement random number generator for guest code to randomize parts
of the test, making it less predictable and a more accurate reflection
of reality.

The random number generator chosen is the Park-Miller Linear
Congruential Generator, a fancy name for a basic and well-understood
random number generator entirely sufficient for this purpose.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20221107182208.479157-2-coltonlewis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
20 months agoKVM: selftests: Allowing running dirty_log_perf_test on specific CPUs
Vipin Sharma [Thu, 3 Nov 2022 19:17:19 +0000 (12:17 -0700)]
KVM: selftests: Allowing running dirty_log_perf_test on specific CPUs

Add a command line option, -c, to pin vCPUs to physical CPUs (pCPUs),
i.e.  to force vCPUs to run on specific pCPUs.

Requirement to implement this feature came in discussion on the patch
"Make page tables for eager page splitting NUMA aware"
https://lore.kernel.org/lkml/YuhPT2drgqL+osLl@google.com/

This feature is useful as it provides a way to analyze performance based
on the vCPUs and dirty log worker locations, like on the different NUMA
nodes or on the same NUMA nodes.

To keep things simple, implementation is intentionally very limited,
either all of the vCPUs will be pinned followed by an optional main
thread or nothing will be pinned.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Suggested-by: David Matlack <dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20221103191719.1559407-8-vipinsh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
20 months agoKVM: selftests: Add atoi_positive() and atoi_non_negative() for input validation
Vipin Sharma [Thu, 3 Nov 2022 19:17:18 +0000 (12:17 -0700)]
KVM: selftests: Add atoi_positive() and atoi_non_negative() for input validation

Many KVM selftests take command line arguments which are supposed to be
positive (>0) or non-negative (>=0). Some tests do these validation and
some missed adding the check.

Add atoi_positive() and atoi_non_negative() to validate inputs in
selftests before proceeding to use those values.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20221103191719.1559407-7-vipinsh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
20 months agoKVM: selftests: Shorten the test args in memslot_modification_stress_test.c
Vipin Sharma [Thu, 3 Nov 2022 19:17:17 +0000 (12:17 -0700)]
KVM: selftests: Shorten the test args in memslot_modification_stress_test.c

Change test args memslot_modification_delay and nr_memslot_modifications
to delay and nr_iterations for simplicity.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20221103191719.1559407-6-vipinsh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
20 months agoKVM: selftests: Use SZ_* macros from sizes.h in max_guest_memory_test.c
Vipin Sharma [Thu, 3 Nov 2022 19:17:16 +0000 (12:17 -0700)]
KVM: selftests: Use SZ_* macros from sizes.h in max_guest_memory_test.c

Replace size_1gb defined in max_guest_memory_test.c with the SZ_1G,
SZ_2G and SZ_4G from linux/sizes.h header file.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20221103191719.1559407-5-vipinsh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
20 months agoKVM: selftests: Add atoi_paranoid() to catch errors missed by atoi()
Vipin Sharma [Thu, 3 Nov 2022 19:17:15 +0000 (12:17 -0700)]
KVM: selftests: Add atoi_paranoid() to catch errors missed by atoi()

atoi() doesn't detect errors. There is no way to know that a 0 return
is correct conversion or due to an error.

Introduce atoi_paranoid() to detect errors and provide correct
conversion. Replace all atoi() calls with atoi_paranoid().

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Suggested-by: David Matlack <dmatlack@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20221103191719.1559407-4-vipinsh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
20 months agoKVM: selftests: Put command line options in alphabetical order in dirty_log_perf_test
Vipin Sharma [Thu, 3 Nov 2022 19:17:14 +0000 (12:17 -0700)]
KVM: selftests: Put command line options in alphabetical order in dirty_log_perf_test

There are 13 command line options and they are not in any order. Put
them in alphabetical order to make it easy to add new options.

No functional change intended.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20221103191719.1559407-3-vipinsh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
20 months agoKVM: selftests: Add missing break between -e and -g option in dirty_log_perf_test
Vipin Sharma [Thu, 3 Nov 2022 19:17:13 +0000 (12:17 -0700)]
KVM: selftests: Add missing break between -e and -g option in dirty_log_perf_test

Passing -e option (Run VCPUs while dirty logging is being disabled) in
dirty_log_perf_test also unintentionally enables -g (Do not enable
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2). Add break between two switch case
logic.

Fixes: cfe12e64b065 ("KVM: selftests: Add an option to run vCPUs while disabling dirty logging")
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20221103191719.1559407-2-vipinsh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
20 months agoKVM: replace direct irq.h inclusion
Paolo Bonzini [Thu, 3 Nov 2022 14:44:10 +0000 (10:44 -0400)]
KVM: replace direct irq.h inclusion

virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source
directory (i.e. not from arch/*/include) for the sole purpose of retrieving
irqchip_in_kernel.

Making the function inline in a header that is already included,
such as asm/kvm_host.h, is not possible because it needs to look at
struct kvm which is defined after asm/kvm_host.h is included.  So add a
kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is
only performance critical on arm64 and x86, and the non-inline function
is enough on all other architectures.

irq.h can then be deleted from all architectures except x86.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter
Like Xu [Fri, 23 Sep 2022 00:13:55 +0000 (00:13 +0000)]
KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter

Defer reprogramming counters and handling overflow via KVM_REQ_PMU
when incrementing counters.  KVM skips emulated WRMSR in the VM-Exit
fastpath, the fastpath runs with IRQs disabled, skipping instructions
can increment and reprogram counters, reprogramming counters can
sleep, and sleeping is disallowed while IRQs are disabled.

 [*] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
 [*] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2981888, name: CPU 15/KVM
 [*] preempt_count: 1, expected: 0
 [*] RCU nest depth: 0, expected: 0
 [*] INFO: lockdep is turned off.
 [*] irq event stamp: 0
 [*] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 [*] hardirqs last disabled at (0): [<ffffffff8121222a>] copy_process+0x146a/0x62d0
 [*] softirqs last  enabled at (0): [<ffffffff81212269>] copy_process+0x14a9/0x62d0
 [*] softirqs last disabled at (0): [<0000000000000000>] 0x0
 [*] Preemption disabled at:
 [*] [<ffffffffc2063fc1>] vcpu_enter_guest+0x1001/0x3dc0 [kvm]
 [*] CPU: 17 PID: 2981888 Comm: CPU 15/KVM Kdump: 5.19.0-rc1-g239111db364c-dirty #2
 [*] Call Trace:
 [*]  <TASK>
 [*]  dump_stack_lvl+0x6c/0x9b
 [*]  __might_resched.cold+0x22e/0x297
 [*]  __mutex_lock+0xc0/0x23b0
 [*]  perf_event_ctx_lock_nested+0x18f/0x340
 [*]  perf_event_pause+0x1a/0x110
 [*]  reprogram_counter+0x2af/0x1490 [kvm]
 [*]  kvm_pmu_trigger_event+0x429/0x950 [kvm]
 [*]  kvm_skip_emulated_instruction+0x48/0x90 [kvm]
 [*]  handle_fastpath_set_msr_irqoff+0x349/0x3b0 [kvm]
 [*]  vmx_vcpu_run+0x268e/0x3b80 [kvm_intel]
 [*]  vcpu_enter_guest+0x1d22/0x3dc0 [kvm]

Add a field to kvm_pmc to track the previous counter value in order
to defer overflow detection to kvm_pmu_handle_event() (the counter must
be paused before handling overflow, and that may increment the counter).

Opportunistically shrink sizeof(struct kvm_pmc) a bit.

Suggested-by: Wanpeng Li <wanpengli@tencent.com>
Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20220831085328.45489-6-likexu@tencent.com
[sean: avoid re-triggering KVM_REQ_PMU on overflow, tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220923001355.3741194-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()
Like Xu [Fri, 23 Sep 2022 00:13:54 +0000 (00:13 +0000)]
KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()

Batch reprogramming PMU counters by setting KVM_REQ_PMU and thus
deferring reprogramming kvm_pmu_handle_event() to avoid reprogramming
a counter multiple times during a single VM-Exit.

Deferring programming will also allow KVM to fix a bug where immediately
reprogramming a counter can result in sleeping (taking a mutex) while
interrupts are disabled in the VM-Exit fastpath.

Introduce kvm_pmu_request_counter_reprogam() to make it obvious that
KVM is _requesting_ a reprogram and not actually doing the reprogram.

Opportunistically refine related comments to avoid misunderstandings.

Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20220831085328.45489-5-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220923001355.3741194-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/pmu: Clear "reprogram" bit if counter is disabled or disallowed
Sean Christopherson [Fri, 23 Sep 2022 00:13:53 +0000 (00:13 +0000)]
KVM: x86/pmu: Clear "reprogram" bit if counter is disabled or disallowed

When reprogramming a counter, clear the counter's "reprogram pending" bit
if the counter is disabled (by the guest) or is disallowed (by the
userspace filter).  In both cases, there's no need to re-attempt
programming on the next coincident KVM_REQ_PMU as enabling the counter by
either method will trigger reprogramming.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220923001355.3741194-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/pmu: Force reprogramming of all counters on PMU filter change
Sean Christopherson [Fri, 23 Sep 2022 00:13:52 +0000 (00:13 +0000)]
KVM: x86/pmu: Force reprogramming of all counters on PMU filter change

Force vCPUs to reprogram all counters on a PMU filter change to provide
a sane ABI for userspace.  Use the existing KVM_REQ_PMU to do the
programming, and take advantage of the fact that the reprogram_pmi bitmap
fits in a u64 to set all bits in a single atomic update.  Note, setting
the bitmap and making the request needs to be done _after_ the SRCU
synchronization to ensure that vCPUs will reprogram using the new filter.

KVM's current "lazy" approach is confusing and non-deterministic.  It's
confusing because, from a developer perspective, the code is buggy as it
makes zero sense to let userspace modify the filter but then not actually
enforce the new filter.  The lazy approach is non-deterministic because
KVM enforces the filter whenever a counter is reprogrammed, not just on
guest WRMSRs, i.e. a guest might gain/lose access to an event at random
times depending on what is going on in the host.

Note, the resulting behavior is still non-determinstic while the filter
is in flux.  If userspace wants to guarantee deterministic behavior, all
vCPUs should be paused during the filter update.

Jim Mattson <jmattson@google.com>

Fixes: 66bb8a065f5a ("KVM: x86: PMU Event Filter")
Cc: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220923001355.3741194-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/mmu: WARN if TDP MMU SP disallows hugepage after being zapped
Sean Christopherson [Wed, 19 Oct 2022 16:56:18 +0000 (16:56 +0000)]
KVM: x86/mmu: WARN if TDP MMU SP disallows hugepage after being zapped

Extend the accounting sanity check in kvm_recover_nx_huge_pages() to the
TDP MMU, i.e. verify that zapping a shadow page unaccounts the disallowed
NX huge page regardless of the MMU type.  Recovery runs while holding
mmu_lock for write and so it should be impossible to get false positives
on the WARN.

Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221019165618.927057-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()
Mingwei Zhang [Wed, 19 Oct 2022 16:56:17 +0000 (16:56 +0000)]
KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()

Explicitly check if a NX huge page is disallowed when determining if a
page fault needs to be forced to use a smaller sized page.  KVM currently
assumes that the NX huge page mitigation is the only scenario where KVM
will force a shadow page instead of a huge page, and so unnecessarily
keeps an existing shadow page instead of replacing it with a huge page.

Any scenario that causes KVM to zap leaf SPTEs may result in having a SP
that can be made huge without violating the NX huge page mitigation.
E.g. prior to commit 5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when
disabling dirty logging"), KVM would keep shadow pages after disabling
dirty logging due to a live migration being canceled, resulting in
degraded performance due to running with 4kb pages instead of huge pages.

Although the dirty logging case is "fixed", that fix is coincidental,
i.e. is an implementation detail, and there are other scenarios where KVM
will zap leaf SPTEs.  E.g. zapping leaf SPTEs in response to a host page
migration (mmu_notifier invalidation) to create a huge page would yield a
similar result; KVM would see the shadow-present non-leaf SPTE and assume
a huge page is disallowed.

Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
[sean: use spte_to_child_sp(), massage changelog, fold into if-statement]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Message-Id: <20221019165618.927057-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/mmu: Add helper to convert SPTE value to its shadow page
Sean Christopherson [Wed, 19 Oct 2022 16:56:16 +0000 (16:56 +0000)]
KVM: x86/mmu: Add helper to convert SPTE value to its shadow page

Add a helper to convert a SPTE to its shadow page to deduplicate a
variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to
get the shadow page for a SPTE without dropping high bits.

Opportunistically add a comment in mmu_free_root_page() documenting why
it treats the root HPA as a SPTE.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221019165618.927057-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages
Sean Christopherson [Wed, 19 Oct 2022 16:56:15 +0000 (16:56 +0000)]
KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages

Track the number of TDP MMU "shadow" pages instead of tracking the pages
themselves. With the NX huge page list manipulation moved out of the common
linking flow, elminating the list-based tracking means the happy path of
adding a shadow page doesn't need to acquire a spinlock and can instead
inc/dec an atomic.

Keep the tracking as the WARN during TDP MMU teardown on leaked shadow
pages is very, very useful for detecting KVM bugs.

Tracking the number of pages will also make it trivial to expose the
counter to userspace as a stat in the future, which may or may not be
desirable.

Note, the TDP MMU needs to use a separate counter (and stat if that ever
comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother
supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the
TDP MMU consumes so few pages relative to shadow paging), and including TDP
MMU pages in that counter would break both the shrinker and shadow MMUs,
e.g. if a VM is using nested TDP.

Cc: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Message-Id: <20221019165618.927057-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE
Sean Christopherson [Wed, 19 Oct 2022 16:56:14 +0000 (16:56 +0000)]
KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE

Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP
visible to other readers, i.e. before setting its SPTE.  This will allow
KVM to query the flag when determining if a shadow page can be replaced
by a NX huge page without violating the rules of the mitigation.

Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible
for another CPU to see a shadow page without an up-to-date
nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated
dance.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Message-Id: <20221019165618.927057-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs
Sean Christopherson [Wed, 19 Oct 2022 16:56:13 +0000 (16:56 +0000)]
KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs

Account and track NX huge pages for nonpaging MMUs so that a future
enhancement to precisely check if a shadow page can't be replaced by a NX
huge page doesn't get false positives.  Without correct tracking, KVM can
get stuck in a loop if an instruction is fetching and writing data on the
same huge page, e.g. KVM installs a small executable page on the fetch
fault, replaces it with an NX huge page on the write fault, and faults
again on the fetch.

Alternatively, and perhaps ideally, KVM would simply not enforce the
workaround for nonpaging MMUs.  The guest has no page tables to abuse
and KVM is guaranteed to switch to a different MMU on CR0.PG being
toggled so there's no security or performance concerns.  However, getting
make_spte() to play nice now and in the future is unnecessarily complex.

In the current code base, make_spte() can enforce the mitigation if TDP
is enabled or the MMU is indirect, but make_spte() may not always have a
vCPU/MMU to work with, e.g. if KVM were to support in-line huge page
promotion when disabling dirty logging.

Without a vCPU/MMU, KVM could either pass in the correct information
and/or derive it from the shadow page, but the former is ugly and the
latter subtly non-trivial due to the possibility of direct shadow pages
in indirect MMUs.  Given that using shadow paging with an unpaged guest
is far from top priority _and_ has been subjected to the workaround since
its inception, keep it simple and just fix the accounting glitch.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221019165618.927057-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/mmu: Rename NX huge pages fields/functions for consistency
Sean Christopherson [Wed, 19 Oct 2022 16:56:12 +0000 (16:56 +0000)]
KVM: x86/mmu: Rename NX huge pages fields/functions for consistency

Rename most of the variables/functions involved in the NX huge page
mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
applies only to the NX huge page mitigation, not to any condition that
prevents creating a huge page.

Add a comment explaining what the newly named "possible_nx_huge_pages"
tracks.

Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
stone.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221019165618.927057-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
Sean Christopherson [Wed, 19 Oct 2022 16:56:11 +0000 (16:56 +0000)]
KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked

Tag shadow pages that cannot be replaced with an NX huge page regardless
of whether or not zapping the page would allow KVM to immediately create
a huge page, e.g. because something else prevents creating a huge page.

I.e. track pages that are disallowed from being NX huge pages regardless
of whether or not the page could have been huge at the time of fault.
KVM currently tracks pages that were disallowed from being huge due to
the NX workaround if and only if the page could otherwise be huge.  But
that fails to handled the scenario where whatever restriction prevented
KVM from installing a huge page goes away, e.g. if dirty logging is
disabled, the host mapping level changes, etc...

Failure to tag shadow pages appropriately could theoretically lead to
false negatives, e.g. if a fetch fault requests a small page and thus
isn't tracked, and a read/write fault later requests a huge page, KVM
will not reject the huge page as it should.

To avoid yet another flag, initialize the list_head and use list_empty()
to determine whether or not a page is on the list of NX huge pages that
should be recovered.

Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is
more involved due to mmu_lock being held for read.  This will be
addressed in a future commit.

Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221019165618.927057-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoselftests: kvm/x86: Test the flags in MSR filtering and MSR exiting
Aaron Lewis [Wed, 21 Sep 2022 15:15:25 +0000 (15:15 +0000)]
selftests: kvm/x86: Test the flags in MSR filtering and MSR exiting

When using the flags in KVM_X86_SET_MSR_FILTER and
KVM_CAP_X86_USER_SPACE_MSR it is expected that an attempt to write to
any of the unused bits will fail.  Add testing to walk over every bit
in each of the flag fields in MSR filtering and MSR exiting to verify
that unused bits return and error and used bits, i.e. valid bits,
succeed.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Message-Id: <20220921151525.904162-6-aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: Add a VALID_MASK for the flags in kvm_msr_filter_range
Aaron Lewis [Wed, 21 Sep 2022 15:15:24 +0000 (15:15 +0000)]
KVM: x86: Add a VALID_MASK for the flags in kvm_msr_filter_range

Add the mask KVM_MSR_FILTER_RANGE_VALID_MASK for the flags in the
struct kvm_msr_filter_range.  This simplifies checks that validate
these flags, and makes it easier to introduce new flags in the future.

No functional change intended.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Message-Id: <20220921151525.904162-5-aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: Add a VALID_MASK for the flag in kvm_msr_filter
Aaron Lewis [Wed, 21 Sep 2022 15:15:23 +0000 (15:15 +0000)]
KVM: x86: Add a VALID_MASK for the flag in kvm_msr_filter

Add the mask KVM_MSR_FILTER_VALID_MASK for the flag in the struct
kvm_msr_filter.  This makes it easier to introduce new flags in the
future.

No functional change intended.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Message-Id: <20220921151525.904162-4-aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: Add a VALID_MASK for the MSR exit reason flags
Aaron Lewis [Wed, 21 Sep 2022 15:15:22 +0000 (15:15 +0000)]
KVM: x86: Add a VALID_MASK for the MSR exit reason flags

Add the mask KVM_MSR_EXIT_REASON_VALID_MASK for the MSR exit reason
flags.  This simplifies checks that validate these flags, and makes it
easier to introduce new flags in the future.

No functional change intended.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Message-Id: <20220921151525.904162-3-aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: Disallow the use of KVM_MSR_FILTER_DEFAULT_ALLOW in the kernel
Aaron Lewis [Wed, 21 Sep 2022 15:15:21 +0000 (15:15 +0000)]
KVM: x86: Disallow the use of KVM_MSR_FILTER_DEFAULT_ALLOW in the kernel

Protect the kernel from using the flag KVM_MSR_FILTER_DEFAULT_ALLOW.
Its value is 0, and using it incorrectly could have unintended
consequences. E.g. prevent someone in the kernel from writing something
like this.

if (filter.flags & KVM_MSR_FILTER_DEFAULT_ALLOW)
        <allow the MSR>

and getting confused when it doesn't work.

It would be more ideal to remove this flag altogether, but userspace
may already be using it, so protecting the kernel is all that can
reasonably be done at this point.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220921151525.904162-2-aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agokvm: x86: Allow to respond to generic signals during slow PF
Peter Xu [Tue, 11 Oct 2022 19:59:47 +0000 (15:59 -0400)]
kvm: x86: Allow to respond to generic signals during slow PF

Enable x86 slow page faults to be able to respond to non-fatal signals,
returning -EINTR properly when it happens.

Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221011195947.557281-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agokvm: Add interruptible flag to __gfn_to_pfn_memslot()
Peter Xu [Tue, 11 Oct 2022 19:58:08 +0000 (15:58 -0400)]
kvm: Add interruptible flag to __gfn_to_pfn_memslot()

Add a new "interruptible" flag showing that the caller is willing to be
interrupted by signals during the __gfn_to_pfn_memslot() request.  Wire it
up with a FOLL_INTERRUPTIBLE flag that we've just introduced.

This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the
SIGIPI) even during e.g. handling an userfaultfd page fault.

No functional change intended.

Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221011195809.557016-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agokvm: Add KVM_PFN_ERR_SIGPENDING
Peter Xu [Tue, 11 Oct 2022 19:58:07 +0000 (15:58 -0400)]
kvm: Add KVM_PFN_ERR_SIGPENDING

Add a new pfn error to show that we've got a pending signal to handle
during hva_to_pfn_slow() procedure (of -EINTR retval).

Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221011195809.557016-3-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agomm/gup: Add FOLL_INTERRUPTIBLE
Peter Xu [Tue, 11 Oct 2022 19:58:06 +0000 (15:58 -0400)]
mm/gup: Add FOLL_INTERRUPTIBLE

We have had FAULT_FLAG_INTERRUPTIBLE but it was never applied to GUPs.  One
issue with it is that not all GUP paths are able to handle signal delivers
besides SIGKILL.

That's not ideal for the GUP users who are actually able to handle these
cases, like KVM.

KVM uses GUP extensively on faulting guest pages, during which we've got
existing infrastructures to retry a page fault at a later time.  Allowing
the GUP to be interrupted by generic signals can make KVM related threads
to be more responsive.  For examples:

  (1) SIGUSR1: which QEMU/KVM uses to deliver an inter-process IPI,
      e.g. when the admin issues a vm_stop QMP command, SIGUSR1 can be
      generated to kick the vcpus out of kernel context immediately,

  (2) SIGINT: which can be used with interactive hypervisor users to stop a
      virtual machine with Ctrl-C without any delays/hangs,

  (3) SIGTRAP: which grants GDB capability even during page faults that are
      stuck for a long time.

Normally hypervisor will be able to receive these signals properly, but not
if we're stuck in a GUP for a long time for whatever reason.  It happens
easily with a stucked postcopy migration when e.g. a network temp failure
happens, then some vcpu threads can hang death waiting for the pages.  With
the new FOLL_INTERRUPTIBLE, we can allow GUP users like KVM to selectively
enable the ability to trap these signals.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20221011195809.557016-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: smm: preserve interrupt shadow in SMRAM
Maxim Levitsky [Tue, 25 Oct 2022 12:47:41 +0000 (15:47 +0300)]
KVM: x86: smm: preserve interrupt shadow in SMRAM

When #SMI is asserted, the CPU can be in interrupt shadow due to sti or
mov ss.

It is not mandatory in  Intel/AMD prm to have the #SMI blocked during the
shadow, and on top of that, since neither SVM nor VMX has true support
for SMI window, waiting for one instruction would mean single stepping
the guest.

Instead, allow #SMI in this case, but both reset the interrupt window and
stash its value in SMRAM to restore it on exit from SMM.

This fixes rare failures seen mostly on windows guests on VMX, when #SMI
falls on the sti instruction which mainfest in VM entry failure due
to EFLAGS.IF not being set, but STI interrupt window still being set
in the VMCS.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221025124741.228045-24-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: SVM: don't save SVM state to SMRAM when VM is not long mode capable
Maxim Levitsky [Tue, 25 Oct 2022 12:47:40 +0000 (15:47 +0300)]
KVM: x86: SVM: don't save SVM state to SMRAM when VM is not long mode capable

When the guest CPUID doesn't have support for long mode, 32 bit SMRAM
layout is used and it has no support for preserving EFER and/or SVM
state.

Note that this isn't relevant to running 32 bit guests on VM which is
long mode capable - such VM can still run 32 bit guests in compatibility
mode.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221025124741.228045-23-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: SVM: use smram structs
Maxim Levitsky [Tue, 25 Oct 2022 12:47:39 +0000 (15:47 +0300)]
KVM: x86: SVM: use smram structs

Use SMM structs in the SVM code as well, which removes the last user of
put_smstate/GET_SMSTATE so remove these macros as well.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221025124741.228045-22-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: svm: drop explicit return value of kvm_vcpu_map
Maxim Levitsky [Tue, 25 Oct 2022 12:47:38 +0000 (15:47 +0300)]
KVM: svm: drop explicit return value of kvm_vcpu_map

if kvm_vcpu_map returns non zero value, error path should be triggered
regardless of the exact returned error value.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221025124741.228045-21-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: smm: use smram struct for 64 bit smram load/restore
Maxim Levitsky [Tue, 25 Oct 2022 12:47:37 +0000 (15:47 +0300)]
KVM: x86: smm: use smram struct for 64 bit smram load/restore

Use kvm_smram_state_64 struct to save/restore the 64 bit SMM state
(used when X86_FEATURE_LM is present in the guest CPUID,
regardless of 32-bitness of the guest).

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221025124741.228045-20-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: smm: use smram struct for 32 bit smram load/restore
Maxim Levitsky [Tue, 25 Oct 2022 12:47:36 +0000 (15:47 +0300)]
KVM: x86: smm: use smram struct for 32 bit smram load/restore

Use kvm_smram_state_32 struct to save/restore 32 bit SMM state
(used when X86_FEATURE_LM is not present in the guest CPUID).

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221025124741.228045-19-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: smm: use smram structs in the common code
Maxim Levitsky [Tue, 25 Oct 2022 12:47:35 +0000 (15:47 +0300)]
KVM: x86: smm: use smram structs in the common code

Use kvm_smram union instad of raw arrays in the common smm code.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221025124741.228045-18-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: smm: add structs for KVM's smram layout
Maxim Levitsky [Tue, 25 Oct 2022 12:47:34 +0000 (15:47 +0300)]
KVM: x86: smm: add structs for KVM's smram layout

Add structs that will be used to define and read/write the KVM's
SMRAM layout, instead of reading/writing to raw offsets.

Also document the differences between KVM's SMRAM layout and SMRAM
layout that is used by real Intel/AMD cpus.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221025124741.228045-17-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: smm: check for failures on smm entry
Maxim Levitsky [Tue, 25 Oct 2022 12:47:33 +0000 (15:47 +0300)]
KVM: x86: smm: check for failures on smm entry

In the rare case of the failure on SMM entry, the KVM should at
least terminate the VM instead of going south.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221025124741.228045-16-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: do not define SMM-related constants if SMM disabled
Paolo Bonzini [Mon, 7 Nov 2022 15:11:42 +0000 (10:11 -0500)]
KVM: x86: do not define SMM-related constants if SMM disabled

The hidden processor flags HF_SMM_MASK and HF_SMM_INSIDE_NMI_MASK
are not needed if CONFIG_KVM_SMM is turned off.  Remove the
definitions altogether and the code that uses them.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: zero output of KVM_GET_VCPU_EVENTS before filling in the struct
Paolo Bonzini [Thu, 27 Oct 2022 16:44:28 +0000 (12:44 -0400)]
KVM: zero output of KVM_GET_VCPU_EVENTS before filling in the struct

This allows making some fields optional, as will be the case soon
for SMM-related data.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: do not define KVM_REQ_SMI if SMM disabled
Paolo Bonzini [Thu, 29 Sep 2022 17:20:16 +0000 (13:20 -0400)]
KVM: x86: do not define KVM_REQ_SMI if SMM disabled

This ensures that all the relevant code is compiled out, in fact
the process_smi stub can be removed too.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220929172016.319443-9-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: remove SMRAM address space if SMM is not supported
Paolo Bonzini [Thu, 29 Sep 2022 17:20:15 +0000 (13:20 -0400)]
KVM: x86: remove SMRAM address space if SMM is not supported

If CONFIG_KVM_SMM is not defined HF_SMM_MASK will always be zero, and
we can spare userspace the hassle of setting up the SMRAM address space
simply by reporting that only one address space is supported.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220929172016.319443-8-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: compile out vendor-specific code if SMM is disabled
Paolo Bonzini [Thu, 29 Sep 2022 17:20:14 +0000 (13:20 -0400)]
KVM: x86: compile out vendor-specific code if SMM is disabled

Vendor-specific code that deals with SMI injection and saving/restoring
SMM state is not needed if CONFIG_KVM_SMM is disabled, so remove the
four callbacks smi_allowed, enter_smm, leave_smm and enable_smi_window.
The users in svm/nested.c and x86.c also have to be compiled out; the
amount of #ifdef'ed code is small and it's not worth moving it to
smm.c.

enter_smm is now used only within #ifdef CONFIG_KVM_SMM, and the stub
can therefore be removed.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220929172016.319443-7-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: allow compiling out SMM support
Paolo Bonzini [Thu, 29 Sep 2022 17:20:13 +0000 (13:20 -0400)]
KVM: allow compiling out SMM support

Some users of KVM implement the UEFI variable store through a paravirtual device
that does not require the "SMM lockbox" component of edk2; allow them to
compile out system management mode, which is not a full implementation
especially in how it interacts with nested virtualization.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220929172016.319443-6-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: do not go through ctxt->ops when emulating rsm
Paolo Bonzini [Thu, 29 Sep 2022 17:20:12 +0000 (13:20 -0400)]
KVM: x86: do not go through ctxt->ops when emulating rsm

Now that RSM is implemented in a single emulator callback, there is no
point in going through other callbacks for the sake of modifying
processor state.  Just invoke KVM's own internal functions directly,
and remove the callbacks that were only used by em_rsm; the only
substantial difference is in the handling of the segment registers
and descriptor cache, which have to be parsed into a struct kvm_segment
instead of a struct desc_struct.

This also fixes a bug where emulator_set_segment was shifting the
limit left by 12 if the G bit is set, but the limit had not been
shifted right upon entry to SMM.

The emulator context is still used to restore EIP and the general
purpose registers.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220929172016.319443-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: move SMM exit to a new file
Paolo Bonzini [Fri, 28 Oct 2022 10:01:26 +0000 (06:01 -0400)]
KVM: x86: move SMM exit to a new file

Some users of KVM implement the UEFI variable store through a paravirtual
device that does not require the "SMM lockbox" component of edk2, and
would like to compile out system management mode.  In preparation for
that, move the SMM exit code out of emulate.c and into a new file.

The code is still written as a series of invocations of the emulator
callbacks, but the two exiting_smm and leave_smm callbacks are merged
into one, and all the code from em_rsm is now part of the callback.
This removes all knowledge of the format of the SMM save state area
from the emulator.  Further patches will clean up the code and
invoke KVM's own functions to access control registers, descriptor
caches, etc.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220929172016.319443-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: move SMM entry to a new file
Paolo Bonzini [Thu, 29 Sep 2022 17:20:10 +0000 (13:20 -0400)]
KVM: x86: move SMM entry to a new file

Some users of KVM implement the UEFI variable store through a paravirtual
device that does not require the "SMM lockbox" component of edk2, and
would like to compile out system management mode.  In preparation for
that, move the SMM entry code out of x86.c and into a new file.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220929172016.319443-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: start moving SMM-related functions to new files
Paolo Bonzini [Thu, 29 Sep 2022 17:20:09 +0000 (13:20 -0400)]
KVM: x86: start moving SMM-related functions to new files

Create a new header and source with code related to system management
mode emulation.  Entry and exit will move there too; for now,
opportunistically rename put_smstate to PUT_SMSTATE while moving
it to smm.h, and adjust the SMM state saving code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220929172016.319443-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: SVM: Name and check reserved fields with structs offset
Carlos Bilbao [Mon, 24 Oct 2022 16:44:48 +0000 (11:44 -0500)]
KVM: SVM: Name and check reserved fields with structs offset

Rename reserved fields on all structs in arch/x86/include/asm/svm.h
following their offset within the structs. Include compile time checks for
this in the same place where other BUILD_BUG_ON for the structs are.

This also solves that fields of struct sev_es_save_area are named by their
order of appearance, but right now they jump from reserved_5 to reserved_7.

Link: https://lkml.org/lkml/2022/10/22/376
Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
Message-Id: <20221024164448.203351-1-carlos.bilbao@amd.com>
[Use ASSERT_STRUCT_OFFSET + fix a couple wrong offsets. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agobug: introduce ASSERT_STRUCT_OFFSET
Maxim Levitsky [Tue, 25 Oct 2022 12:47:27 +0000 (15:47 +0300)]
bug: introduce ASSERT_STRUCT_OFFSET

ASSERT_STRUCT_OFFSET allows to assert during the build of
the kernel that a field in a struct have an expected offset.

KVM used to have such macro, but there is almost nothing KVM specific
in it so move it to build_bug.h, so that it can be used in other
places in KVM.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221025124741.228045-10-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agox86/kvm: Remove unused virt to phys translation in kvm_guest_cpu_init()
Rafael Mendonca [Fri, 21 Oct 2022 02:01:13 +0000 (23:01 -0300)]
x86/kvm: Remove unused virt to phys translation in kvm_guest_cpu_init()

Presumably, this was introduced due to a conflict resolution with
commit ef68017eb570 ("x86/kvm: Handle async page faults directly through
do_page_fault()"), given that the last posted version [1] of the blamed
commit was not based on the aforementioned commit.

[1] https://lore.kernel.org/kvm/20200525144125.143875-9-vkuznets@redhat.com/

Fixes: b1d405751cd5 ("KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery")
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Message-Id: <20221021020113.922027-1-rafaelmendsr@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: Insert "AMD" in KVM_X86_FEATURE_PSFD
Jim Mattson [Tue, 30 Aug 2022 22:52:09 +0000 (15:52 -0700)]
KVM: x86: Insert "AMD" in KVM_X86_FEATURE_PSFD

Intel and AMD have separate CPUID bits for each SPEC_CTRL bit. In the
case of every bit other than PFSD, the Intel CPUID bit has no vendor
name qualifier, but the AMD CPUID bit does. For consistency, rename
KVM_X86_FEATURE_PSFD to KVM_X86_FEATURE_AMD_PSFD.

No functional change intended.

Signed-off-by: Jim Mattson <jmattson@google.com>
Cc: Babu Moger <Babu.Moger@amd.com>
Message-Id: <20220830225210.2381310-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/mmu: use helper macro SPTE_ENT_PER_PAGE
Miaohe Lin [Tue, 13 Sep 2022 08:54:52 +0000 (16:54 +0800)]
KVM: x86/mmu: use helper macro SPTE_ENT_PER_PAGE

Use helper macro SPTE_ENT_PER_PAGE to get the number of spte entries
per page. Minor readability improvement.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220913085452.25561-1-linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/mmu: fix some comment typos
Miaohe Lin [Tue, 13 Sep 2022 09:17:25 +0000 (17:17 +0800)]
KVM: x86/mmu: fix some comment typos

Fix some typos in comments.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220913091725.35953-1-linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: remove obsolete kvm_mmu_gva_to_gpa_fetch()
Miaohe Lin [Tue, 13 Sep 2022 09:05:37 +0000 (17:05 +0800)]
KVM: x86: remove obsolete kvm_mmu_gva_to_gpa_fetch()

There's no caller. Remove it.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220913090537.25195-1-linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: Directly query supported PERF_CAPABILITIES for WRMSR checks
Sean Christopherson [Thu, 6 Oct 2022 00:03:14 +0000 (00:03 +0000)]
KVM: x86: Directly query supported PERF_CAPABILITIES for WRMSR checks

Use kvm_caps.supported_perf_cap directly instead of bouncing through
kvm_get_msr_feature() when checking the incoming value for writes to
PERF_CAPABILITIES.

Note, kvm_get_msr_feature() is guaranteed to succeed when getting
PERF_CAPABILITIES, i.e. dropping that check is a nop.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221006000314.73240-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: Handle PERF_CAPABILITIES in common x86's kvm_get_msr_feature()
Sean Christopherson [Thu, 6 Oct 2022 00:03:13 +0000 (00:03 +0000)]
KVM: x86: Handle PERF_CAPABILITIES in common x86's kvm_get_msr_feature()

Handle PERF_CAPABILITIES directly in kvm_get_msr_feature() now that the
supported value is available in kvm_caps.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221006000314.73240-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: Init vcpu->arch.perf_capabilities in common x86 code
Sean Christopherson [Thu, 6 Oct 2022 00:03:12 +0000 (00:03 +0000)]
KVM: x86: Init vcpu->arch.perf_capabilities in common x86 code

Initialize vcpu->arch.perf_capabilities in x86's kvm_arch_vcpu_create()
instead of deferring initialization to vendor code.  For better or worse,
common x86 handles reads and writes to the MSR, and so common x86 should
also handle initializing the MSR.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221006000314.73240-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: Track supported PERF_CAPABILITIES in kvm_caps
Sean Christopherson [Thu, 6 Oct 2022 00:03:11 +0000 (00:03 +0000)]
KVM: x86: Track supported PERF_CAPABILITIES in kvm_caps

Track KVM's supported PERF_CAPABILITIES in kvm_caps instead of computing
the supported capabilities on the fly every time.  Using kvm_caps will
also allow for future cleanups as the kvm_caps values can be used
directly in common x86 code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Like Xu <likexu@tencent.com>
Message-Id: <20221006000314.73240-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoperf/x86/core: Zero @lbr instead of returning -1 in x86_perf_get_lbr() stub
Sean Christopherson [Thu, 6 Oct 2022 00:03:07 +0000 (00:03 +0000)]
perf/x86/core: Zero @lbr instead of returning -1 in x86_perf_get_lbr() stub

Drop the return value from x86_perf_get_lbr() and have the stub zero out
the @lbr structure instead of returning -1 to indicate "no LBR support".
KVM doesn't actually check the return value, and instead subtly relies on
zeroing the number of LBRs in intel_pmu_init().

Formalize "nr=0 means unsupported" so that KVM doesn't need to add a
pointless check on the return value to fix KVM's benign bug.

Note, the stub is necessary even though KVM x86 selects PERF_EVENTS and
the caller exists only when CONFIG_KVM_INTEL=y.  Despite the name,
KVM_INTEL doesn't strictly require CPU_SUP_INTEL, it can be built with
any of INTEL || CENTAUR || ZHAOXIN CPUs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221006000314.73240-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoMerge tag 'kvm-s390-master-6.1-1' of https://git.kernel.org/pub/scm/linux/kernel...
Paolo Bonzini [Wed, 9 Nov 2022 17:28:15 +0000 (12:28 -0500)]
Merge tag 'kvm-s390-master-6.1-1' of https://git./linux/kernel/git/kvms390/linux into HEAD

A PCI allocation fix and a PV clock fix.

20 months agoKVM: x86/pmu: Limit the maximum number of supported AMD GP counters
Like Xu [Mon, 19 Sep 2022 09:10:08 +0000 (17:10 +0800)]
KVM: x86/pmu: Limit the maximum number of supported AMD GP counters

The AMD PerfMonV2 specification allows for a maximum of 16 GP counters,
but currently only 6 pairs of MSRs are accepted by KVM.

While AMD64_NUM_COUNTERS_CORE is already equal to 6, increasing without
adjusting msrs_to_save_all[] could result in out-of-bounds accesses.
Therefore introduce a macro (named KVM_AMD_PMC_MAX_GENERIC) to
refer to the number of counters supported by KVM.

Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220919091008.60695-3-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/pmu: Limit the maximum number of supported Intel GP counters
Like Xu [Mon, 19 Sep 2022 09:10:07 +0000 (17:10 +0800)]
KVM: x86/pmu: Limit the maximum number of supported Intel GP counters

The Intel Architectural IA32_PMCx MSRs addresses range allows for a
maximum of 8 GP counters, and KVM cannot address any more.  Introduce a
local macro (named KVM_INTEL_PMC_MAX_GENERIC) and use it consistently to
refer to the number of counters supported by KVM, thus avoiding possible
out-of-bound accesses.

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220919091008.60695-2-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86/pmu: Do not speculatively query Intel GP PMCs that don't exist yet
Like Xu [Mon, 19 Sep 2022 09:10:06 +0000 (17:10 +0800)]
KVM: x86/pmu: Do not speculatively query Intel GP PMCs that don't exist yet

The SDM lists an architectural MSR IA32_CORE_CAPABILITIES (0xCF)
that limits the theoretical maximum value of the Intel GP PMC MSRs
allocated at 0xC1 to 14; likewise the Intel April 2022 SDM adds
IA32_OVERCLOCKING_STATUS at 0x195 which limits the number of event
selection MSRs to 15 (0x186-0x194).

Limiting the maximum number of counters to 14 or 18 based on the currently
allocated MSRs is clearly fragile, and it seems likely that Intel will
even place PMCs 8-15 at a completely different range of MSR indices.
So stop at the maximum number of GP PMCs supported today on Intel
processors.

There are some machines, like Intel P4 with non Architectural PMU, that
may indeed have 18 counters, but those counters are in a completely
different MSR address range and are not supported by KVM.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Fixes: cf05a67b68b8 ("KVM: x86: omit "impossible" pmu MSRs from MSR list")
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220919091008.60695-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: SVM: Only dump VMSA to klog at KERN_DEBUG level
Peter Gonda [Fri, 4 Nov 2022 14:22:20 +0000 (07:22 -0700)]
KVM: SVM: Only dump VMSA to klog at KERN_DEBUG level

Explicitly print the VMSA dump at KERN_DEBUG log level, KERN_CONT uses
KERNEL_DEFAULT if the previous log line has a newline, i.e. if there's
nothing to continuing, and as a result the VMSA gets dumped when it
shouldn't.

The KERN_CONT documentation says it defaults back to KERNL_DEFAULT if the
previous log line has a newline. So switch from KERN_CONT to
print_hex_dump_debug().

Jarkko pointed this out in reference to the original patch. See:
https://lore.kernel.org/all/YuPMeWX4uuR1Tz3M@kernel.org/
print_hex_dump(KERN_DEBUG, ...) was pointed out there, but
print_hex_dump_debug() should similar.

Fixes: 6fac42f127b8 ("KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog")
Signed-off-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Harald Hoyer <harald@profian.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Message-Id: <20221104142220.469452-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agotools/kvm_stat: update exit reasons for vmx/svm/aarch64/userspace
Rong Tao [Mon, 7 Nov 2022 14:52:49 +0000 (22:52 +0800)]
tools/kvm_stat: update exit reasons for vmx/svm/aarch64/userspace

Update EXIT_REASONS from source, including VMX_EXIT_REASONS,
SVM_EXIT_REASONS, AARCH64_EXIT_REASONS, USERSPACE_EXIT_REASONS.

Signed-off-by: Rong Tao <rongtao@cestc.cn>
Message-Id: <tencent_00082C8BFA925A65E11570F417F1CD404505@qq.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agotools/kvm_stat: fix incorrect detection of debugfs
Matthias Gerstner [Thu, 3 Nov 2022 13:59:27 +0000 (14:59 +0100)]
tools/kvm_stat: fix incorrect detection of debugfs

The first field in /proc/mounts can be influenced by unprivileged users
through the widespread `fusermount` setuid-root program. Example:

```
user$ mkdir ~/mydebugfs
user$ export _FUSE_COMMFD=0
user$ fusermount ~/mydebugfs -ononempty,fsname=debugfs
user$ grep debugfs /proc/mounts
debugfs /home/user/mydebugfs fuse rw,nosuid,nodev,relatime,user_id=1000,group_id=100 0 0
```

If there is no debugfs already mounted in the system then this can be
used by unprivileged users to trick kvm_stat into using a user
controlled file system location for obtaining KVM statistics.
Even though the root user is not allowed to access non-root FUSE mounts
for security reasons, the unprivileged user can unmount the FUSE mount
before kvm_stat uses the mounted path.  If it wins the race, kvm_stat
will read from the location where the FUSE mount resided.

Note that the files in debugfs are only opened for reading, so the
attacker can cause very large data to be read in by kvm_stat, or fake
data to be processed, but there should be no viable way to turn this
into a privilege escalation.

The fix is simply to use the file system type field instead. Whitespace
in the mount path is escaped in /proc/mounts thus no further safety
measures in the parsing should be necessary to make this correct.

Message-Id: <20221103135927.13656-1-matthias.gerstner@suse.de>
Signed-off-by: Matthias Gerstner <matthias.gerstner@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agox86, KVM: remove unnecessary argument to x86_virt_spec_ctrl and callers
Paolo Bonzini [Fri, 30 Sep 2022 18:48:24 +0000 (14:48 -0400)]
x86, KVM: remove unnecessary argument to x86_virt_spec_ctrl and callers

x86_virt_spec_ctrl only deals with the paravirtualized
MSR_IA32_VIRT_SPEC_CTRL now and does not handle MSR_IA32_SPEC_CTRL
anymore; remove the corresponding, unused argument.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: SVM: move MSR_IA32_SPEC_CTRL save/restore to assembly
Paolo Bonzini [Fri, 30 Sep 2022 18:24:40 +0000 (14:24 -0400)]
KVM: SVM: move MSR_IA32_SPEC_CTRL save/restore to assembly

Restoration of the host IA32_SPEC_CTRL value is probably too late
with respect to the return thunk training sequence.

With respect to the user/kernel boundary, AMD says, "If software chooses
to toggle STIBP (e.g., set STIBP on kernel entry, and clear it on kernel
exit), software should set STIBP to 1 before executing the return thunk
training sequence." I assume the same requirements apply to the guest/host
boundary. The return thunk training sequence is in vmenter.S, quite close
to the VM-exit. On hosts without V_SPEC_CTRL, however, the host's
IA32_SPEC_CTRL value is not restored until much later.

To avoid this, move the restoration of host SPEC_CTRL to assembly and,
for consistency, move the restoration of the guest SPEC_CTRL as well.
This is not particularly difficult, apart from some care to cover both
32- and 64-bit, and to share code between SEV-ES and normal vmentry.

Cc: stable@vger.kernel.org
Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk")
Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: SVM: restore host save area from assembly
Paolo Bonzini [Mon, 7 Nov 2022 08:49:59 +0000 (03:49 -0500)]
KVM: SVM: restore host save area from assembly

Allow access to the percpu area via the GS segment base, which is
needed in order to access the saved host spec_ctrl value.  In linux-next
FILL_RETURN_BUFFER also needs to access percpu data.

For simplicity, the physical address of the save area is added to struct
svm_cpu_data.

Cc: stable@vger.kernel.org
Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Analyzed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: SVM: move guest vmsave/vmload back to assembly
Paolo Bonzini [Mon, 7 Nov 2022 10:14:27 +0000 (05:14 -0500)]
KVM: SVM: move guest vmsave/vmload back to assembly

It is error-prone that code after vmexit cannot access percpu data
because GSBASE has not been restored yet.  It forces MSR_IA32_SPEC_CTRL
save/restore to happen very late, after the predictor untraining
sequence, and it gets in the way of return stack depth tracking
(a retbleed mitigation that is in linux-next as of 2022-11-09).

As a first step towards fixing that, move the VMCB VMSAVE/VMLOAD to
assembly, essentially undoing commit fb0c4a4fee5a ("KVM: SVM: move
VMLOAD/VMSAVE to C code", 2021-03-15).  The reason for that commit was
that it made it simpler to use a different VMCB for VMLOAD/VMSAVE versus
VMRUN; but that is not a big hassle anymore thanks to the kvm-asm-offsets
machinery and other related cleanups.

The idea on how to number the exception tables is stolen from
a prototype patch by Peter Zijlstra.

Cc: stable@vger.kernel.org
Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk")
Link: <https://lore.kernel.org/all/f571e404-e625-bae1-10e9-449b2eb4cbd8@citrix.com/>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: SVM: do not allocate struct svm_cpu_data dynamically
Paolo Bonzini [Wed, 9 Nov 2022 14:07:55 +0000 (09:07 -0500)]
KVM: SVM: do not allocate struct svm_cpu_data dynamically

The svm_data percpu variable is a pointer, but it is allocated via
svm_hardware_setup() when KVM is loaded.  Unlike hardware_enable()
this means that it is never NULL for the whole lifetime of KVM, and
static allocation does not waste any memory compared to the status quo.
It is also more efficient and more easily handled from assembly code,
so do it and don't look back.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: SVM: remove dead field from struct svm_cpu_data
Paolo Bonzini [Wed, 9 Nov 2022 13:54:20 +0000 (08:54 -0500)]
KVM: SVM: remove dead field from struct svm_cpu_data

The "cpu" field of struct svm_cpu_data has been write-only since commit
4b656b120249 ("KVM: SVM: force new asid on vcpu migration", 2009-08-05).
Remove it.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: SVM: remove unused field from struct vcpu_svm
Paolo Bonzini [Wed, 9 Nov 2022 09:54:40 +0000 (04:54 -0500)]
KVM: SVM: remove unused field from struct vcpu_svm

The pointer to svm_cpu_data in struct vcpu_svm looks interesting from
the point of view of accessing it after vmexit, when the GSBASE is still
containing the guest value.  However, despite existing since the very
first commit of drivers/kvm/svm.c (commit 6aa8b732ca01, "[PATCH] kvm:
userspace interface", 2006-12-10), it was never set to anything.

Ignore the opportunity to fix a 16 year old "bug" and delete it; doing
things the "harder" way makes it possible to remove more old cruft.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: SVM: retrieve VMCB from assembly
Paolo Bonzini [Mon, 7 Nov 2022 09:17:29 +0000 (04:17 -0500)]
KVM: SVM: retrieve VMCB from assembly

Continue moving accesses to struct vcpu_svm to vmenter.S.  Reducing the
number of arguments limits the chance of mistakes due to different
registers used for argument passing in 32- and 64-bit ABIs; pushing the
VMCB argument and almost immediately popping it into a different
register looks pretty weird.

32-bit ABI is not a concern for __svm_sev_es_vcpu_run() which is 64-bit
only; however, it will soon need @svm to save/restore SPEC_CTRL so stay
consistent with __svm_vcpu_run() and let them share the same prototype.

No functional change intended.

Cc: stable@vger.kernel.org
Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: SVM: adjust register allocation for __svm_vcpu_run()
Paolo Bonzini [Fri, 28 Oct 2022 21:30:07 +0000 (17:30 -0400)]
KVM: SVM: adjust register allocation for __svm_vcpu_run()

32-bit ABI uses RAX/RCX/RDX as its argument registers, so they are in
the way of instructions that hardcode their operands such as RDMSR/WRMSR
or VMLOAD/VMRUN/VMSAVE.

In preparation for moving vmload/vmsave to __svm_vcpu_run(), keep
the pointer to the struct vcpu_svm in %rdi.  In particular, it is now
possible to load svm->vmcb01.pa in %rax without clobbering the struct
vcpu_svm pointer.

No functional change intended.

Cc: stable@vger.kernel.org
Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: SVM: replace regs argument of __svm_vcpu_run() with vcpu_svm
Paolo Bonzini [Fri, 30 Sep 2022 18:14:44 +0000 (14:14 -0400)]
KVM: SVM: replace regs argument of __svm_vcpu_run() with vcpu_svm

Since registers are reachable through vcpu_svm, and we will
need to access more fields of that struct, pass it instead
of the regs[] array.

No functional change intended.

Cc: stable@vger.kernel.org
Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: x86: use a separate asm-offsets.c file
Paolo Bonzini [Tue, 8 Nov 2022 08:44:53 +0000 (09:44 +0100)]
KVM: x86: use a separate asm-offsets.c file

This already removes an ugly #include "" from asm-offsets.c, but
especially it avoids a future error when trying to define asm-offsets
for KVM's svm/svm.h header.

This would not work for kernel/asm-offsets.c, because svm/svm.h
includes kvm_cache_regs.h which is not in the include path when
compiling asm-offsets.c.  The problem is not there if the .c file is
in arch/x86/kvm.

Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
20 months agoKVM: s390: pci: Fix allocation size of aift kzdev elements
Rafael Mendonca [Wed, 26 Oct 2022 01:32:33 +0000 (22:32 -0300)]
KVM: s390: pci: Fix allocation size of aift kzdev elements

The 'kzdev' field of struct 'zpci_aift' is an array of pointers to
'kvm_zdev' structs. Allocate the proper size accordingly.

Reported by Coccinelle:
  WARNING: Use correct pointer type argument for sizeof

Fixes: 98b1d33dac5f ("KVM: s390: pci: do initial setup for AEN interpretation")
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/20221026013234.960859-1-rafaelmendsr@gmail.com
Message-Id: <20221026013234.960859-1-rafaelmendsr@gmail.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
20 months agoKVM: s390: pv: don't allow userspace to set the clock under PV
Nico Boehr [Tue, 11 Oct 2022 16:07:12 +0000 (18:07 +0200)]
KVM: s390: pv: don't allow userspace to set the clock under PV

When running under PV, the guest's TOD clock is under control of the
ultravisor and the hypervisor isn't allowed to change it. Hence, don't
allow userspace to change the guest's TOD clock by returning
-EOPNOTSUPP.

When userspace changes the guest's TOD clock, KVM updates its
kvm.arch.epoch field and, in addition, the epoch field in all state
descriptions of all VCPUs.

But, under PV, the ultravisor will ignore the epoch field in the state
description and simply overwrite it on next SIE exit with the actual
guest epoch. This leads to KVM having an incorrect view of the guest's
TOD clock: it has updated its internal kvm.arch.epoch field, but the
ultravisor ignores the field in the state description.

Whenever a guest is now waiting for a clock comparator, KVM will
incorrectly calculate the time when the guest should wake up, possibly
causing the guest to sleep for much longer than expected.

With this change, kvm_s390_set_tod() will now take the kvm->lock to be
able to call kvm_s390_pv_is_protected(). Since kvm_s390_set_tod_clock()
also takes kvm->lock, use __kvm_s390_set_tod_clock() instead.

The function kvm_s390_set_tod_clock is now unused, hence remove it.
Update the documentation to indicate the TOD clock attr calls can now
return -EOPNOTSUPP.

Fixes: 0f3035047140 ("KVM: s390: protvirt: Do only reset registers that are accessible")
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20221011160712.928239-2-nrb@linux.ibm.com
Message-Id: <20221011160712.928239-2-nrb@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
20 months agoLinux 6.1-rc4 v6.1-rc4
Linus Torvalds [Sun, 6 Nov 2022 23:07:11 +0000 (15:07 -0800)]
Linux 6.1-rc4

20 months agoMerge tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 6 Nov 2022 21:09:52 +0000 (13:09 -0800)]
Merge tag 'cxl-fixes-for-6.1-rc4' of git://git./linux/kernel/git/cxl/cxl

Pull cxl fixes from Dan Williams:
 "Several fixes for CXL region creation crashes, leaks and failures.

  This is mainly fallout from the original implementation of dynamic CXL
  region creation (instantiate new physical memory pools) that arrived
  in v6.0-rc1.

  Given the theme of "failures in the presence of pass-through decoders"
  this also includes new regression test infrastructure for that case.

  Summary:

   - Fix region creation crash with pass-through decoders

   - Fix region creation crash when no decoder allocation fails

   - Fix region creation crash when scanning regions to enforce the
     increasing physical address order constraint that CXL mandates

   - Fix a memory leak for cxl_pmem_region objects, track 1:N instead of
     1:1 memory-device-to-region associations.

   - Fix a memory leak for cxl_region objects when regions with active
     targets are deleted

   - Fix assignment of NUMA nodes to CXL regions by CFMWS (CXL Window)
     emulated proximity domains.

   - Fix region creation failure for switch attached devices downstream
     of a single-port host-bridge

   - Fix false positive memory leak of cxl_region objects by recycling
     recently used region ids rather than freeing them

   - Add regression test infrastructure for a pass-through decoder
     configuration

   - Fix some mailbox payload handling corner cases"

* tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/region: Recycle region ids
  cxl/region: Fix 'distance' calculation with passthrough ports
  tools/testing/cxl: Add a single-port host-bridge regression config
  tools/testing/cxl: Fix some error exits
  cxl/pmem: Fix cxl_pmem_region and cxl_memdev leak
  cxl/region: Fix cxl_region leak, cleanup targets at region delete
  cxl/region: Fix region HPA ordering validation
  cxl/pmem: Use size_add() against integer overflow
  cxl/region: Fix decoder allocation crash
  ACPI: NUMA: Add CXL CFMWS 'nodes' to the possible nodes set
  cxl/pmem: Fix failure to account for 8 byte header for writes to the device LSA.
  cxl/region: Fix null pointer dereference due to pass through decoder commit
  cxl/mbox: Add a check on input payload size

20 months agoMerge tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groec...
Linus Torvalds [Sun, 6 Nov 2022 20:59:12 +0000 (12:59 -0800)]
Merge tag 'hwmon-for-v6.1-rc4' of git://git./linux/kernel/git/groeck/linux-staging

Pull hwmon fixes from Guenter Roeck:
 "Fix two regressions:

   - Commit 54cc3dbfc10d ("hwmon: (pmbus) Add regulator supply into
     macro") resulted in regulator undercount when disabling regulators.
     Revert it.

   - The thermal subsystem rework caused the scmi driver to no longer
     register with the thermal subsystem because index values no longer
     match. To fix the problem, the scmi driver now directly registers
     with the thermal subsystem, no longer through the hwmon core"

* tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  Revert "hwmon: (pmbus) Add regulator supply into macro"
  hwmon: (scmi) Register explicitly with Thermal Framework

20 months agoMerge tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 6 Nov 2022 20:41:32 +0000 (12:41 -0800)]
Merge tag 'perf_urgent_for_v6.1_rc4' of git://git./linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Add Cooper Lake's stepping to the PEBS guest/host events isolation
   fixed microcode revisions checking quirk

 - Update Icelake and Sapphire Rapids events constraints

 - Use the standard energy unit for Sapphire Rapids in RAPL

 - Fix the hw_breakpoint test to fail more graciously on !SMP configs

* tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[]
  perf/x86/intel: Fix pebs event constraints for SPR
  perf/x86/intel: Fix pebs event constraints for ICL
  perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain
  perf/hw_breakpoint: test: Skip the test if dependencies unmet

20 months agoMerge tag 'x86_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 6 Nov 2022 20:36:47 +0000 (12:36 -0800)]
Merge tag 'x86_urgent_for_v6.1_rc4' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Add new Intel CPU models

 - Enforce that TDX guests are successfully loaded only on TDX hardware
   where virtualization exception (#VE) delivery on kernel memory is
   disabled because handling those in all possible cases is "essentially
   impossible"

 - Add the proper include to the syscall wrappers so that BTF can see
   the real pt_regs definition and not only the forward declaration

* tag 'x86_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Add several Intel server CPU model numbers
  x86/tdx: Panic on bad configs that #VE on "private" memory access
  x86/tdx: Prepare for using "INFO" call for a second purpose
  x86/syscall: Include asm/ptrace.h in syscall_wrapper header

20 months agoMerge tag 'kbuild-fixes-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 6 Nov 2022 20:23:10 +0000 (12:23 -0800)]
Merge tag 'kbuild-fixes-v6.1-2' of git://git./linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Use POSIX-compatible grep options

 - Document git-related tips for reproducible builds

 - Fix a typo in the modpost rule

 - Suppress SIGPIPE error message from gcc-ar and llvm-ar

 - Fix segmentation fault in the menuconfig search

* tag 'kbuild-fixes-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: fix segmentation fault in menuconfig search
  kbuild: fix SIGPIPE error message for AR=gcc-ar and AR=llvm-ar
  kbuild: fix typo in modpost
  Documentation: kbuild: Add description of git for reproducible builds
  kbuild: use POSIX-compatible grep option

20 months agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Sun, 6 Nov 2022 18:46:59 +0000 (10:46 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"ARM:

   - Fix the pKVM stage-1 walker erronously using the stage-2 accessor

   - Correctly convert vcpu->kvm to a hyp pointer when generating an
     exception in a nVHE+MTE configuration

   - Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them

   - Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE

   - Document the boot requirements for FGT when entering the kernel at
     EL1

  x86:

   - Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()

   - Make argument order consistent for kvcalloc()

   - Userspace API fixes for DEBUGCTL and LBRs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: Fix a typo about the usage of kvcalloc()
  KVM: x86: Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()
  KVM: VMX: Ignore guest CPUID for host userspace writes to DEBUGCTL
  KVM: VMX: Fold vmx_supported_debugctl() into vcpu_supported_debugctl()
  KVM: VMX: Advertise PMU LBRs if and only if perf supports LBRs
  arm64: booting: Document our requirements for fine grained traps with SME
  KVM: arm64: Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE
  KVM: Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling them
  KVM: arm64: Fix bad dereference on MTE-enabled systems
  KVM: arm64: Use correct accessor to parse stage-1 PTEs

20 months agoMerge tag 'for-linus-6.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 6 Nov 2022 18:42:29 +0000 (10:42 -0800)]
Merge tag 'for-linus-6.1-rc4-tag' of git://git./linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:
 "One fix for silencing a smatch warning, and a small cleanup patch"

* tag 'for-linus-6.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: simplify sysenter and syscall setup
  x86/xen: silence smatch warning in pmu_msr_chk_emulated()

20 months agoMerge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 6 Nov 2022 18:30:29 +0000 (10:30 -0800)]
Merge tag 'ext4_for_linus_stable' of git://git./linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a number of bugs, including some regressions, the most serious of
  which was one which would cause online resizes to fail with file
  systems with metadata checksums enabled.

  Also fix a warning caused by the newly added fortify string checker,
  plus some bugs that were found using fuzzed file systems"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix fortify warning in fs/ext4/fast_commit.c:1551
  ext4: fix wrong return err in ext4_load_and_init_journal()
  ext4: fix warning in 'ext4_da_release_space'
  ext4: fix BUG_ON() when directory entry has invalid rec_len
  ext4: update the backup superblock's at the end of the online resize

20 months agoMerge tag '6.1-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sun, 6 Nov 2022 18:19:39 +0000 (10:19 -0800)]
Merge tag '6.1-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "One symlink handling fix and two fixes foir multichannel issues with
  iterating channels, including for oplock breaks when leases are
  disabled"

* tag '6.1-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix use-after-free on the link name
  cifs: avoid unnecessary iteration of tcp sessions
  cifs: always iterate smb sessions using primary channel

20 months agoMerge tag 'trace-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace...
Linus Torvalds [Sun, 6 Nov 2022 17:57:38 +0000 (09:57 -0800)]
Merge tag 'trace-v6.1-rc3' of git://git./linux/kernel/git/trace/linux-trace

Pull `lTracing fixes for 6.1-rc3:

 - Fixed NULL pointer dereference in the ring buffer wait-waiters code
   for machines that have less CPUs than what nr_cpu_ids returns.

   The buffer array is of size nr_cpu_ids, but only the online CPUs get
   initialized.

 - Fixed use after free call in ftrace_shutdown.

 - Fix accounting of if a kprobe is enabled

 - Fix NULL pointer dereference on error path of fprobe rethook_alloc().

 - Fix unregistering of fprobe_kprobe_handler

 - Fix memory leak in kprobe test module

* tag 'trace-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: kprobe: Fix memory leak in test_gen_kprobe/kretprobe_cmd()
  tracing/fprobe: Fix to check whether fprobe is registered correctly
  fprobe: Check rethook_alloc() return in rethook initialization
  kprobe: reverse kp->flags when arm_kprobe failed
  ftrace: Fix use-after-free for dynamic ftrace_ops
  ring-buffer: Check for NULL cpu_buffer in ring_buffer_wake_waiters()

20 months agoMerge tag 'kvmarm-fixes-6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmar...
Paolo Bonzini [Sun, 6 Nov 2022 08:25:59 +0000 (03:25 -0500)]
Merge tag 'kvmarm-fixes-6.1-3' of git://git./linux/kernel/git/kvmarm/kvmarm into HEAD

* Fix the pKVM stage-1 walker erronously using the stage-2 accessor

* Correctly convert vcpu->kvm to a hyp pointer when generating
  an exception in a nVHE+MTE configuration

* Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them

* Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE

* Document the boot requirements for FGT when entering the kernel
  at EL1

20 months agoMerge branch 'kvm-master' into HEAD
Paolo Bonzini [Sun, 6 Nov 2022 08:22:56 +0000 (03:22 -0500)]
Merge branch 'kvm-master' into HEAD

x86:
* Use SRCU to protect zap in __kvm_set_or_clear_apicv_inhibit()

* Make argument order consistent for kvcalloc()

* Userspace API fixes for DEBUGCTL and LBRs

20 months agoext4: fix fortify warning in fs/ext4/fast_commit.c:1551
Theodore Ts'o [Sun, 6 Nov 2022 03:42:36 +0000 (23:42 -0400)]
ext4: fix fortify warning in fs/ext4/fast_commit.c:1551

With the new fortify string system, rework the memcpy to avoid this
warning:

memcpy: detected field-spanning write (size 60) of single field "&raw_inode->i_generation" at fs/ext4/fast_commit.c:1551 (size 4)

Cc: stable@kernel.org
Fixes: 54d9469bc515 ("fortify: Add run-time WARN for cross-field memcpy()")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
20 months agoext4: fix wrong return err in ext4_load_and_init_journal()
Jason Yan [Tue, 25 Oct 2022 04:02:06 +0000 (12:02 +0800)]
ext4: fix wrong return err in ext4_load_and_init_journal()

The return value is wrong in ext4_load_and_init_journal(). The local
variable 'err' need to be initialized before goto out. The original code
in __ext4_fill_super() is fine because it has two return values 'ret'
and 'err' and 'ret' is initialized as -EINVAL. After we factor out
ext4_load_and_init_journal(), this code is broken. So fix it by directly
returning -EINVAL in the error handler path.

Cc: stable@kernel.org
Fixes: 9c1dd22d7422 ("ext4: factor out ext4_load_and_init_journal()")
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221025040206.3134773-1-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>