Age | Commit message (Collapse) | Author | Files | Lines |
|
Currently, the #ifdef CONFIG_XIP_KERNEL usage can be divided into the
following three types:
The first one is for functions/declarations only used in XIP case.
The second one is for XIP_FIXUP case. Something as below:
|foo_type foo;
|#ifdef CONFIG_XIP_KERNEL
|#define foo (*(foo_type *)XIP_FIXUP(&foo))
|#endif
Usually, it's better to let the foo macro sit with the foo var
together. But if various foos are defined adjacently, we can
save some #ifdef CONFIG_XIP_KERNEL usage by grouping them together.
The third one is for different implementations for XIP, usually, this
is a #ifdef...#else...#endif case.
This patch moves the pt_ops macro to adjacent #ifdef CONFIG_XIP_KERNEL
and group first type usage cases into one.
Signed-off-by: Jisheng Zhang <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>
|
|
Try our best to replace the conditional compilation using
"#ifdef CONFIG_XIP_KERNEL" with "IS_ENABLED(CONFIG_XIP_KERNEL)", to
simplify the code and to increase compile coverage.
Signed-off-by: Jisheng Zhang <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>
|
|
Except "pt_ops", other global vars when CONFIG_XIP_KERNEL=y is defined
as below:
|foo_type foo;
|#ifdef CONFIG_XIP_KERNEL
|#define foo (*(foo_type *)XIP_FIXUP(&foo))
|#endif
Follow the same way for pt_ops to unify the style and to simplify code.
Signed-off-by: Jisheng Zhang <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>
|
|
Try our best to replace the conditional compilation using
"#ifdef CONFIG_64BIT" by a check for "IS_ENABLED(CONFIG_64BIT)", to
simplify the code and to increase compile coverage.
Now we can also remove the __maybe_unused used in max_mapped_addr
declaration.
We also remove the BUG_ON check of mapping the last 4K bytes of the
addressable memory since this is always true for every kernel actually.
Signed-off-by: Jisheng Zhang <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>
|
|
The is_kdump_kernel() returns false for !CRASH_DUMP case, so we don't
need the #ifdef CONFIG_CRASH_DUMP for is_kdump_kernel() checking.
Signed-off-by: Jisheng Zhang <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>
|
|
To pick the changes in this cset:
980fe2fddcff2193 ("x86/fpu: Extend fpu_xstate_prctl() with guest permissions")
This picks these new prctls:
$ tools/perf/trace/beauty/x86_arch_prctl.sh > /tmp/before
$ cp arch/x86/include/uapi/asm/prctl.h tools/arch/x86/include/uapi/asm/prctl.h
$ tools/perf/trace/beauty/x86_arch_prctl.sh > /tmp/after
$ diff -u /tmp/before /tmp/after
--- /tmp/before 2022-01-19 14:40:05.049394977 -0300
+++ /tmp/after 2022-01-19 14:40:35.628154565 -0300
@@ -9,6 +9,8 @@
[0x1021 - 0x1001]= "GET_XCOMP_SUPP",
[0x1022 - 0x1001]= "GET_XCOMP_PERM",
[0x1023 - 0x1001]= "REQ_XCOMP_PERM",
+ [0x1024 - 0x1001]= "GET_XCOMP_GUEST_PERM",
+ [0x1025 - 0x1001]= "REQ_XCOMP_GUEST_PERM",
};
#define x86_arch_prctl_codes_2_offset 0x2001
$
With this 'perf trace' can translate those numbers into strings and use
the strings in filter expressions:
# perf trace -e prctl
0.000 ( 0.011 ms): DOM Worker/3722622 prctl(option: SET_NAME, arg2: 0x7f9c014b7df5) = 0
0.032 ( 0.002 ms): DOM Worker/3722622 prctl(option: SET_NAME, arg2: 0x7f9bb6b51580) = 0
5.452 ( 0.003 ms): StreamT~ns #30/3722623 prctl(option: SET_NAME, arg2: 0x7f9bdbdfeb70) = 0
5.468 ( 0.002 ms): StreamT~ns #30/3722623 prctl(option: SET_NAME, arg2: 0x7f9bdbdfea70) = 0
24.494 ( 0.009 ms): IndexedDB #556/3722624 prctl(option: SET_NAME, arg2: 0x7f562a32ae28) = 0
24.540 ( 0.002 ms): IndexedDB #556/3722624 prctl(option: SET_NAME, arg2: 0x7f563c6d4b30) = 0
670.281 ( 0.008 ms): systemd-userwo/3722339 prctl(option: SET_NAME, arg2: 0x564be30805c8) = 0
670.293 ( 0.002 ms): systemd-userwo/3722339 prctl(option: SET_NAME, arg2: 0x564be30800f0) = 0
^C#
This addresses these perf build warnings:
Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/prctl.h' differs from latest version at 'arch/x86/include/uapi/asm/prctl.h'
diff -u tools/arch/x86/include/uapi/asm/prctl.h arch/x86/include/uapi/asm/prctl.h
Cc: Paolo Bonzini <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
[BUG]
After commit 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to
implement btrfs_defrag_file()") autodefrag no longer properly re-defrag
the file from previously finished location.
[CAUSE]
The recent refactoring of defrag only focuses on defrag ioctl subpage
support, doesn't take autodefrag into consideration.
There are two problems involved which prevents autodefrag to restart its
scan:
- No range.start update
Previously when one defrag target is found, range->start will be
updated to indicate where next search should start from.
But now btrfs_defrag_file() doesn't update it anymore, making all
autodefrag to rescan from file offset 0.
This would also make autodefrag to mark the same range dirty again and
again, causing extra IO.
- No proper quick exit for defrag_one_cluster()
Currently if we reached or exceed @max_sectors limit, we just exit
defrag_one_cluster(), and let next defrag_one_cluster() call to do a
quick exit.
This makes @cur increase, thus no way to properly know which range is
defragged and which range is skipped.
[FIX]
The fix involves two modifications:
- Update range->start to next cluster start
This is a little different from the old behavior.
Previously range->start is updated to the next defrag target.
But in the end, the behavior should still be pretty much the same,
as now we skip to next defrag target inside btrfs_defrag_file().
Thus if auto-defrag determines to re-scan, then we still do the skip,
just at a different timing.
- Make defrag_one_cluster() to return >0 to indicate a quick exit
So that btrfs_defrag_file() can also do a quick exit, without
increasing @cur to the range end, and re-use @cur to update
@range->start.
- Add comment for btrfs_defrag_file() to mention the range->start update
Currently only autodefrag utilize this behavior, as defrag ioctl won't
set @max_to_defrag parameter, thus unless interrupted it will always
try to defrag the whole range.
Reported-by: Filipe Manana <[email protected]>
Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
CC: [email protected] # 5.16
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
[BUG]
There are users using autodefrag mount option reporting obvious increase
in IO:
> If I compare the write average (in total, I don't have it per process)
> when taking idle periods on the same machine:
> Linux 5.16:
> without autodefrag: ~ 10KiB/s
> with autodefrag: between 1 and 2MiB/s.
>
> Linux 5.15:
> with autodefrag:~ 10KiB/s (around the same as without
> autodefrag on 5.16)
[CAUSE]
When autodefrag mount option is enabled, btrfs_defrag_file() will be
called with @max_sectors = BTRFS_DEFRAG_BATCH (1024) to limit how many
sectors we can defrag in one try.
And then use the number of sectors defragged to determine if we need to
re-defrag.
But commit b18c3ab2343d ("btrfs: defrag: introduce helper to defrag one
cluster") uses wrong unit to increase @sectors_defragged, which should
be in unit of sector, not byte.
This means, if we have defragged any sector, then @sectors_defragged
will be >= sectorsize (normally 4096), which is larger than
BTRFS_DEFRAG_BATCH.
This makes the @max_sectors check in defrag_one_cluster() to underflow,
rendering the whole @max_sectors check useless.
Thus causing way more IO for autodefrag mount options, as now there is
no limit on how many sectors can really be defragged.
[FIX]
Fix the problems by:
- Use sector as unit when increasing @sectors_defragged
- Include @sectors_defragged > @max_sectors case to break the loop
- Add extra comment on the return value of btrfs_defrag_file()
Reported-by: Anthony Ruhier <[email protected]>
Fixes: b18c3ab2343d ("btrfs: defrag: introduce helper to defrag one cluster")
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
CC: [email protected] # 5.16
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Change the cifs filesystem to take account of the changes to fscache's
indexing rewrite and reenable caching in cifs.
The following changes have been made:
(1) The fscache_netfs struct is no more, and there's no need to register
the filesystem as a whole.
(2) The session cookie is now an fscache_volume cookie, allocated with
fscache_acquire_volume(). That takes three parameters: a string
representing the "volume" in the index, a string naming the cache to
use (or NULL) and a u64 that conveys coherency metadata for the
volume.
For cifs, I've made it render the volume name string as:
"cifs,<ipaddress>,<sharename>"
where the sharename has '/' characters replaced with ';'.
This probably needs rethinking a bit as the total name could exceed
the maximum filename component length.
Further, the coherency data is currently just set to 0. It needs
something else doing with it - I wonder if it would suffice simply to
sum the resource_id, vol_create_time and vol_serial_number or maybe
hash them.
(3) The fscache_cookie_def is no more and needed information is passed
directly to fscache_acquire_cookie(). The cache no longer calls back
into the filesystem, but rather metadata changes are indicated at
other times.
fscache_acquire_cookie() is passed the same keying and coherency
information as before.
(4) The functions to set/reset cookies are removed and
fscache_use_cookie() and fscache_unuse_cookie() are used instead.
fscache_use_cookie() is passed a flag to indicate if the cookie is
opened for writing. fscache_unuse_cookie() is passed updates for the
metadata if we changed it (ie. if the file was opened for writing).
These are called when the file is opened or closed.
(5) cifs_setattr_*() are made to call fscache_resize() to change the size
of the cache object.
(6) The functions to read and write data are stubbed out pending a
conversion to use netfslib.
Changes
=======
ver #8:
- Abstract cache invalidation into a helper function.
- Fix some checkpatch warnings[3].
ver #7:
- Removed the accidentally added-back call to get the super cookie in
cifs_root_iget().
- Fixed the right call to cifs_fscache_get_super_cookie() to take account
of the "-o fsc" mount flag.
ver #6:
- Moved the change of gfpflags_allow_blocking() to current_is_kswapd() for
cifs here.
- Fixed one of the error paths in cifs_atomic_open() to jump around the
call to use the cookie.
- Fixed an additional successful return in the middle of cifs_open() to
use the cookie on the way out.
- Only get a volume cookie (and thus inode cookies) when "-o fsc" is
supplied to mount.
ver #5:
- Fixed a couple of bits of cookie handling[2]:
- The cookie should be released in cifs_evict_inode(), not
cifsFileInfo_put_final(). The cookie needs to persist beyond file
closure so that writepages will be able to write to it.
- fscache_use_cookie() needs to be called in cifs_atomic_open() as it is
for cifs_open().
ver #4:
- Fixed the use of sizeof with memset.
- tcon->vol_create_time is __le64 so doesn't need cpu_to_le64().
ver #3:
- Canonicalise the cifs coherency data to make the cache portable.
- Set volume coherency data.
ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- Upgraded to -rc4 to allow for upstream changes[1].
- fscache_acquire_volume() now returns errors.
Signed-off-by: David Howells <[email protected]>
Acked-by: Jeff Layton <[email protected]>
cc: Steve French <[email protected]>
cc: Shyam Prasad N <[email protected]>
cc: [email protected]
cc: [email protected]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23b55d673d7527b093cd97b7c217c82e70cd1af0 [1]
Link: https://lore.kernel.org/r/[email protected]/ [2]
Link: https://lore.kernel.org/r/CAH2r5muTanw9pJqzAHd01d9A8keeChkzGsCEH6=0rHutVLAF-A@mail.gmail.com/ [3]
Link: https://lore.kernel.org/r/163819671009.215744.11230627184193298714.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906982979.143852.10672081929614953210.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967187187.1823006.247415138444991444.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021579335.640689.2681324337038770579.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/[email protected]/ # v5
Link: https://lore.kernel.org/r/[email protected]/ # v6
Signed-off-by: Steve French <[email protected]>
|
|
During defrag, at btrfs_defrag_file(), we have this loop that iterates
over a file range in steps no larger than 256K subranges. If the range
is too long, there's no way to interrupt it. So make the loop check in
each iteration if there's signal pending, and if there is, break and
return -AGAIN to userspace.
Before kernel 5.16, we used to allow defrag to be cancelled through a
signal, but that was lost with commit 7b508037d4cac3 ("btrfs: defrag:
use defrag_one_cluster() to implement btrfs_defrag_file()").
This change adds back the possibility to cancel a defrag with a signal
and keeps the same semantics, returning -EAGAIN to user space (and not
the usually more expected -EINTR).
This is also motivated by a recent bug on 5.16 where defragging a 1 byte
file resulted in iterating from file range 0 to (u64)-1, as hitting the
bug triggered a too long loop, basically requiring one to reboot the
machine, as it was not possible to cancel defrag.
Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
CC: [email protected] # 5.16
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
When attempting to defrag a file with a single byte, we can end up in a
too long loop, which is nearly infinite because at btrfs_defrag_file()
we end up with the variable last_byte assigned with a value of
18446744073709551615 (which is (u64)-1). The problem comes from the fact
we end up doing:
last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
So if last_byte was assigned 0, which is i_size - 1, we underflow and
end up with the value 18446744073709551615.
This is trivial to reproduce and the following script triggers it:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
echo -n "X" > $MNT/foobar
btrfs filesystem defragment $MNT/foobar
umount $MNT
So fix this by not decrementing last_byte by 1 before doing the sector
size round up. Also, to make it easier to follow, make the round up right
after computing last_byte.
Reported-by: Anthony Ruhier <[email protected]>
Fixes: 7b508037d4cac3 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
CC: [email protected] # 5.16
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
amx_test's binary should be present in the .gitignore file for the git
to ignore it.
Fixes: bf70636d9443 ("selftest: kvm: Add amx selftest")
Signed-off-by: Muhammad Usama Anjum <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Nullify svm_x86_ops.vcpu_(un)blocking if AVIC/APICv is disabled as the
hooks are necessary only to clear the vCPU's IsRunning entry in the
Physical APIC and to update IRTE entries if the VM has a pass-through
device attached.
Opportunistically rename the helpers to clarify their AVIC relationship.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Move svm_hardware_setup() below svm_x86_ops so that KVM can modify ops
during setup, e.g. the vcpu_(un)blocking hooks can be nullified if AVIC
is disabled or unsupported.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Drop avic_set_running() in favor of calling avic_vcpu_{load,put}()
directly, and modify the block+put path to use preempt_disable/enable()
instead of get/put_cpu(), as it doesn't actually care about the current
pCPU associated with the vCPU. Opportunistically add lockdep assertions
as being preempted in avic_vcpu_put() would lead to consuming stale data,
even though doing so _in the current code base_ would not be fatal.
Add a much needed comment explaining why svm_vcpu_blocking() needs to
unload the AVIC and update the IRTE _before_ the vCPU starts blocking.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When waking vCPUs in the posted interrupt wakeup handling, do exactly
that and no more. There is no need to kick the vCPU as the wakeup
handler just needs to get the vCPU task running, and if it's in the guest
then it's definitely running.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Move the fallback "wake_up" path into the helper to trigger posted
interrupt helper now that the nested and non-nested paths are identical.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Refactor the posted interrupt helper to take the desired notification
vector instead of a bool so that the callers are self-documenting.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Replace the full "kick" with just the "wake" in the fallback path when
triggering a virtual interrupt via a posted interrupt fails because the
guest is not IN_GUEST_MODE. If the guest transitions into guest mode
between the check and the kick, then it's guaranteed to see the pending
interrupt as KVM syncs the PIR to IRR (and onto GUEST_RVI) after setting
IN_GUEST_MODE. Kicking the guest in this case is nothing more than an
unnecessary VM-Exit (and host IRQ).
Opportunistically update comments to explain the various ordering rules
and barriers at play.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Don't bother updating the Physical APIC table or IRTE when loading a vCPU
that is blocking, i.e. won't be marked IsRun{ning}=1, as the pCPU is
queried if and only if IsRunning is '1'. If the vCPU was migrated, the
new pCPU will be picked up when avic_vcpu_load() is called by
svm_vcpu_unblocking().
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Use kvm_vcpu_is_blocking() to determine whether or not the vCPU should be
marked running during avic_vcpu_load(). Drop avic_is_running, which
really should have been named "vcpu_is_not_blocking", as it tracked if
the vCPU was blocking, not if it was actually running, e.g. it was set
during svm_create_vcpu() when the vCPU was obviously not running.
This is technically a teeny tiny functional change, as the vCPU will be
marked IsRunning=1 on being reloaded if the vCPU is preempted between
svm_vcpu_blocking() and prepare_to_rcuwait(). But that's a benign change
as the vCPU will be marked IsRunning=0 when KVM voluntarily schedules out
the vCPU.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Remove handling of KVM_REQ_APICV_UPDATE from svm_vcpu_unblocking(), it's
no longer needed as it was made obsolete by commit df7e4827c549 ("KVM:
SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC").
Prior to that commit, the manual check was necessary to ensure the AVIC
stuff was updated by avic_set_running() when a request to enable APICv
became pending while the vCPU was blocking, as the request handling
itself would not do the update. But, as evidenced by the commit, that
logic was flawed and subject to various races.
Now that svm_refresh_apicv_exec_ctrl() does avic_vcpu_load/put() in
response to an APICv status change, drop the manual check in the
unblocking path.
Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Drop the avic_vcpu_is_running() check when waking vCPUs in response to a
VM-Exit due to incomplete IPI delivery. The check isn't wrong per se, but
it's not 100% accurate in the sense that it doesn't guarantee that the vCPU
was one of the vCPUs that didn't receive the IPI.
The check isn't required for correctness as blocking == !running in this
context.
From a performance perspective, waking a live task is not expensive as the
only moderately costly operation is a locked operation to temporarily
disable preemption. And if that is indeed a performance issue,
kvm_vcpu_is_blocking() would be a better check than poking into the AVIC.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Signal the AVIC doorbell iff the vCPU is running in the guest. If the vCPU
is not IN_GUEST_MODE, it's guaranteed to pick up any pending IRQs on the
next VMRUN, which unconditionally processes the vIRR.
Add comments to document the logic.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Unexport switch_to_{hv,sw}_timer() now that common x86 handles the
transitions.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Handle the switch to/from the hypervisor/software timer when a vCPU is
blocking in common x86 instead of in VMX. Even though VMX is the only
user of a hypervisor timer, the logic and all functions involved are
generic x86 (unless future CPUs do something completely different and
implement a hypervisor timer that runs regardless of mode).
Handling the switch in common x86 will allow for the elimination of the
pre/post_blocks hooks, and also lets KVM switch back to the hypervisor
timer if and only if it was in use (without additional params). Add a
comment explaining why the switch cannot be deferred to kvm_sched_out()
or kvm_vcpu_block().
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and
rename the list and all associated variables to clarify that it tracks
the set of vCPU that need to be poked on a posted interrupt to the wakeup
vector. The list is not used to track _all_ vCPUs that are blocking, and
the term "blocked" can be misleading as it may refer to a blocking
condition in the host or the guest, where as the PI wakeup case is
specifically for the vCPUs that are actively blocking from within the
guest.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Remove kvm_vcpu.pre_pcpu as it no longer has any users. No functional
change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Move the posted interrupt pre/post_block logic into vcpu_put/load
respectively, using the kvm_vcpu_is_blocking() to determining whether or
not the wakeup handler needs to be set (and unset). This avoids updating
the PI descriptor if halt-polling is successful, reduces the number of
touchpoints for updating the descriptor, and eliminates the confusing
behavior of intentionally leaving a "stale" PI.NDST when a blocking vCPU
is scheduled back in after preemption.
The downside is that KVM will do the PID update twice if the vCPU is
preempted after prepare_to_rcuwait() but before schedule(), but that's a
rare case (and non-existent on !PREEMPT kernels).
The notable wart is the need to send a self-IPI on the wakeup vector if
an outstanding notification is pending after configuring the wakeup
vector. Ideally, KVM would just do a kvm_vcpu_wake_up() in this case,
but the scheduler doesn't support waking a task from its preemption
notifier callback, i.e. while the task is right in the middle of
being scheduled out.
Note, setting the wakeup vector before halt-polling is not necessary:
once the pending IRQ will be recorded in the PIR, kvm_vcpu_has_events()
will detect this (via kvm_cpu_get_interrupt(), kvm_apic_get_interrupt(),
apic_has_interrupt_for_ppr() and finally vmx_sync_pir_to_irr()) and
terminate the polling.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Bring in fix for VT-d posted interrupts before further changing the code in 5.17.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Avoid warnings on s390 like
[ 1801.980931] CPU: 12 PID: 117600 Comm: kworker/12:0 Tainted: G E 5.17.0-20220113.rc0.git0.32ce2abb03cf.300.fc35.s390x+next #1
[ 1801.980938] Workqueue: events irqfd_inject [kvm]
[...]
[ 1801.981057] Call Trace:
[ 1801.981060] [<000003ff805f0f5c>] mark_page_dirty_in_slot+0xa4/0xb0 [kvm]
[ 1801.981083] [<000003ff8060e9fe>] adapter_indicators_set+0xde/0x268 [kvm]
[ 1801.981104] [<000003ff80613c24>] set_adapter_int+0x64/0xd8 [kvm]
[ 1801.981124] [<000003ff805fb9aa>] kvm_set_irq+0xc2/0x130 [kvm]
[ 1801.981144] [<000003ff805f8d86>] irqfd_inject+0x76/0xa0 [kvm]
[ 1801.981164] [<0000000175e56906>] process_one_work+0x1fe/0x470
[ 1801.981173] [<0000000175e570a4>] worker_thread+0x64/0x498
[ 1801.981176] [<0000000175e5ef2c>] kthread+0x10c/0x110
[ 1801.981180] [<0000000175de73c8>] __ret_from_fork+0x40/0x58
[ 1801.981185] [<000000017698440a>] ret_from_fork+0xa/0x40
when writing to a guest from an irqfd worker as long as we do not have
the dirty ring.
Signed-off-by: Christian Borntraeger <[email protected]>
Reluctantly-acked-by: David Woodhouse <[email protected]>
Message-Id: <[email protected]>
Fixes: 2efd61a608b0 ("KVM: Warn if mark_page_dirty() is called without an active vCPU")
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add a VMX specific test to verify that KVM doesn't explode if userspace
attempts KVM_RUN when emulation is required with a pending exception.
KVM VMX's emulation support for !unrestricted_guest punts exceptions to
userspace instead of attempting to synthesize the exception with all the
correct state (and stack switching, etc...).
Punting is acceptable as there's never been a request to support
injecting exceptions when emulating due to invalid state, but KVM has
historically assumed that userspace will do the right thing and either
clear the exception or kill the guest. Deliberately do the opposite and
attempt to re-enter the guest with a pending exception and emulation
required to verify KVM continues to punt the combination to userspace,
e.g. doesn't explode, WARN, etc...
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Reject KVM_RUN if emulation is required (because VMX is running without
unrestricted guest) and an exception is pending, as KVM doesn't support
emulating exceptions except when emulating real mode via vm86. The vCPU
is hosed either way, but letting KVM_RUN proceed triggers a WARN due to
the impossible condition. Alternatively, the WARN could be removed, but
then userspace and/or KVM bugs would result in the vCPU silently running
in a bad state, which isn't very friendly to users.
Originally, the bug was hit by syzkaller with a nested guest as that
doesn't require kvm_intel.unrestricted_guest=0. That particular flavor
is likely fixed by commit cd0e615c49e5 ("KVM: nVMX: Synthesize
TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to
trigger the WARN with a non-nested guest, and userspace can likely force
bad state via ioctls() for a nested guest as well.
Checking for the impossible condition needs to be deferred until KVM_RUN
because KVM can't force specific ordering between ioctls. E.g. clearing
exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting
it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with
emulation_required would prevent userspace from queuing an exception and
then stuffing sregs. Note, if KVM were to try and detect/prevent the
condition prior to KVM_RUN, handle_invalid_guest_state() and/or
handle_emulation_failure() would need to be modified to clear the pending
exception prior to exiting to userspace.
------------[ cut here ]------------
WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel]
CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279
Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel]
Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202
RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000
RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006
R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000
FS: 00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0
Call Trace:
kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm]
kvm_vcpu_ioctl+0x279/0x690 [kvm]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Reported-by: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Verify that the PMU event filter works as expected.
Note that the virtual PMU doesn't work as expected on AMD Zen CPUs (an
intercepted rdmsr is counted as a retired branch instruction), but the
PMU event filter does work.
Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Extract the x86 model number from CPUID.01H:EAX.
Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Move this static inline function to processor.h, so that it can be
used in individual tests, as needed.
Opportunistically replace the bare 'unsigned' with 'unsigned int.'
Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Replace the one ad hoc "AuthenticAMD" CPUID vendor string comparison
with a new function, is_amd_cpu().
Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Refactor is_intel_cpu() to make it easier to reuse the bulk of the
code for other vendors in the future.
Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The PMU event filter may contain up to 300 events. Replace the linear
search in reprogram_gp_counter() with a binary search.
Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Recent restructuring of cifs_reconnect introduced a helper func
named cifs_ses_mark_for_reconnect, which updates the state of tcp
session for all the channels of a session for reconnect.
However, this does not update the session state and chans_need_reconnect
bitmask. This change fixes that.
Also, cifs_mark_tcp_sess_for_reconnect should mark set the bitmask
for all channels when the whole session is marked for reconnect.
Fixed that here too.
Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
Till the end of SMB session setup, update tcpStatus and
avoid updating session status field. There was a typo in
cifs_setup_session, which caused ses->status to be updated
instead. This was causing issues during reconnect.
Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
The status of tcp session, smb session and tcon have the
same flow, irrespective of the SMB version used. Hence
these status checks and updates should happen in the
version independent callers of these commands.
Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
cifs_tree_connect checks and sets the tidStatus for the tcon.
cifs_tree_connect also calls a dfs specific tree connect
function, which also does similar checks. This should
not happen. Removing it with this change.
Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
Recently, the cifs_reconnect code was refactored into
two branches for regular vs dfs codepath. Some of my
recent changes were missing in the dfs path, namely the
code to enable periodic DNS query, and a missing lock.
Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
ses_selected is being declared and set at several places. It is not
being used. Remove it.
Signed-off-by: Muhammad Usama Anjum <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
A spin lock called chan_lock was introduced recently.
But not all accesses were protected. Doing that with
this change.
To make sure that a channel is not freed when in use,
we need to introduce a ref count. But today, we don't
ever free channels.
Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
Recent changes to multichannel required some adjustments in
the way connection states transitioned during/after reconnect.
Also some minor fixes:
1. A pending switch of GlobalMid_Lock to cifs_tcp_ses_lock
2. Relocations of the code that logs reconnect
3. Changed some code in allocate_mid to suit the new scheme
Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
With the new multichannel logic, when a channel needs reconnection,
the tree connect and other channels can still be active.
This fix will handle cases of checking for channel reconnect,
when the tcon does not need reconnect.
Signed-off-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
vm_xsave_req_perm() is currently defined and used by x86_64 only.
Make it compiled into vm_create_with_vcpus() only when on x86_64
machines. Otherwise, it would cause linkage errors, e.g. on s390x.
Fixes: 415a3c33e8 ("kvm: selftests: Add support for KVM_CAP_XSAVE2")
Reported-by: Janis Schoetterl-Glausch <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Tested-by: Janis Schoetterl-Glausch <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|