Age | Commit message (Collapse) | Author | Files | Lines |
|
Two additional tests are added:
- SMM triggered from L2 does not currupt L1 host state.
- Save/restore during SMM triggered from L2 does not corrupt guest/host
state.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If the VM was migrated while in SMM, no nested state was saved/restored,
and therefore svm_leave_smm has to load both save and control area
of the vmcb12. Save area is already loaded from HSAVE area,
so now load the control area as well from the vmcb12.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
VMCB split commit 4995a3685f1b ("KVM: SVM: Use a separate vmcb for the
nested L2 guest") broke return from SMM when we entered there from guest
(L2) mode. Gen2 WS2016/Hyper-V is known to do this on boot. The problem
manifests itself like this:
kvm_exit: reason EXIT_RSM rip 0x7ffbb280 info 0 0
kvm_emulate_insn: 0:7ffbb280: 0f aa
kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x7ffb3000
kvm_nested_vmrun: rip: 0x000000007ffbb280 vmcb: 0x0000000008224000
nrip: 0xffffffffffbbe119 int_ctl: 0x01020000 event_inj: 0x00000000
npt: on
kvm_nested_intercepts: cr_read: 0000 cr_write: 0010 excp: 40060002
intercepts: fd44bfeb 0000217f 00000000
kvm_entry: vcpu 0, rip 0xffffffffffbbe119
kvm_exit: reason EXIT_NPF rip 0xffffffffffbbe119 info
200000006 1ab000
kvm_nested_vmexit: vcpu 0 reason npf rip 0xffffffffffbbe119 info1
0x0000000200000006 info2 0x00000000001ab000 intr_info 0x00000000
error_code 0x00000000
kvm_page_fault: address 1ab000 error_code 6
kvm_nested_vmexit_inject: reason EXIT_NPF info1 200000006 info2 1ab000
int_info 0 int_info_err 0
kvm_entry: vcpu 0, rip 0x7ffbb280
kvm_exit: reason EXIT_EXCP_GP rip 0x7ffbb280 info 0 0
kvm_emulate_insn: 0:7ffbb280: 0f aa
kvm_inj_exception: #GP (0x0)
Note: return to L2 succeeded but upon first exit to L1 its RIP points to
'RSM' instruction but we're not in SMM.
The problem appears to be that VMCB01 gets irreversibly destroyed during
SMM execution. Previously, we used to have 'hsave' VMCB where regular
(pre-SMM) L1's state was saved upon nested_svm_vmexit() but now we just
switch to VMCB01 from VMCB02.
Pre-split (working) flow looked like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() restores L1's state from 'hsave'
- SMM -> RSM
- enter_svm_guest_mode() switches to L2 but keeps 'hsave' intact so we have
pre-SMM (and pre L2 VMRUN) L1's state there
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from L1.
This was always broken with regards to svm_get_nested_state()/
svm_set_nested_state(): 'hsave' was never a part of what's being
save and restored so migration happening during SMM triggered from L2 would
never restore L1's state correctly.
Post-split flow (broken) looks like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() switches to VMCB01 from VMCB02
- SMM -> RSM
- enter_svm_guest_mode() switches from VMCB01 to VMCB02 but pre-SMM VMCB01
is already lost.
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from VMCB01 but it is corrupted
(reflects the state during 'RSM' execution).
VMX doesn't have this problem because unlike VMCB, VMCS keeps both guest
and host state so when we switch back to VMCS02 L1's state is intact there.
To resolve the issue we need to save L1's state somewhere. We could've
created a third VMCB for SMM but that would require us to modify saved
state format. L1's architectural HSAVE area (pointed by MSR_VM_HSAVE_PA)
seems appropriate: L0 is free to save any (or none) of L1's state there.
Currently, KVM does 'none'.
Note, for nested state migration to succeed, both source and destination
hypervisors must have the fix. We, however, don't need to create a new
flag indicating the fact that HSAVE area is now populated as migration
during SMM triggered from L2 was always broken.
Fixes: 4995a3685f1b ("KVM: SVM: Use a separate vmcb for the nested L2 guest")
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Separate the code setting non-VMLOAD-VMSAVE state from
svm_set_nested_state() into its own function. This is going to be
re-used from svm_enter_smm()/svm_leave_smm().
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
APM states that "The address written to the VM_HSAVE_PA MSR, which holds
the address of the page used to save the host state on a VMRUN, must point
to a hypervisor-owned page. If this check fails, the WRMSR will fail with
a #GP(0) exception. Note that a value of 0 is not considered valid for the
VM_HSAVE_PA MSR and a VMRUN that is attempted while the HSAVE_PA is 0 will
fail with a #GP(0) exception."
svm_set_msr() already checks that the supplied address is valid, so only
check for '0' is missing. Add it to nested_svm_vmrun().
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
APM states that #GP is raised upon write to MSR_VM_HSAVE_PA when
the supplied address is not page-aligned or is outside of "maximum
supported physical address for this implementation".
page_address_valid() check seems suitable. Also, forcefully page-align
the address when it's written from VMM.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
Reviewed-by: Maxim Levitsky <[email protected]>
[Add comment about behavior for host-provided values. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Use IS_ERR() instead of checking for a NULL pointer when querying for
sev_pin_memory() failures. sev_pin_memory() always returns an error code
cast to a pointer, or a valid pointer; it never returns NULL.
Reported-by: Dan Carpenter <[email protected]>
Cc: Steve Rutherford <[email protected]>
Cc: Brijesh Singh <[email protected]>
Cc: Ashish Kalra <[email protected]>
Fixes: d3d1af85e2c7 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Fixes: 15fb7de1a7f5 ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Return -EFAULT if copy_to_user() fails; if accessing user memory faults,
copy_to_user() returns the number of bytes remaining, not an error code.
Reported-by: Dan Carpenter <[email protected]>
Cc: Steve Rutherford <[email protected]>
Cc: Brijesh Singh <[email protected]>
Cc: Ashish Kalra <[email protected]>
Fixes: d3d1af85e2c7 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
In theory there are no side effects of not intercepting #SMI,
because then #SMI becomes transparent to the OS and the KVM.
Plus an observation on recent Zen2 CPUs reveals that these
CPUs ignore #SMI interception and never deliver #SMI VMexits.
This is also useful to test nested KVM to see that L1
handles #SMIs correctly in case when L1 doesn't intercept #SMI.
Finally the default remains the same, the SMI are intercepted
by default thus this patch doesn't have any effect unless
non default module param value is used.
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Kernel never sends real INIT even to CPUs, other than on boot.
Thus INIT interception is an error which should be caught
by a check for an unknown VMexit reason.
On top of that, the current INIT VM exit handler skips
the current instruction which is wrong.
That was added in commit 5ff3a351f687 ("KVM: x86: Move trivial
instruction-based exit handlers to common code").
Fixes: 5ff3a351f687 ("KVM: x86: Move trivial instruction-based exit handlers to common code")
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Commit 5ff3a351f687 ("KVM: x86: Move trivial instruction-based
exit handlers to common code"), unfortunately made a mistake of
treating nop_on_interception and nop_interception in the same way.
Former does truly nothing while the latter skips the instruction.
SMI VM exit handler should do nothing.
(SMI itself is handled by the host when we do STGI)
Fixes: 5ff3a351f687 ("KVM: x86: Move trivial instruction-based exit handlers to common code")
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
vmx_msr_index was used to record the list of MSRs which can be lazily
restored when kvm returns to userspace. It is now reimplemented as
kvm_uret_msrs_list, a common x86 list which is only used inside x86.c.
So just remove the obsolete declaration in vmx.h.
Signed-off-by: Yu Zhang <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When the host is using debug registers but the guest is not using them
nor is the guest in guest-debug state, the kvm code does not reset
the host debug registers before kvm_x86->run(). Rather, it relies on
the hardware vmentry instruction to automatically reset the dr7 registers
which ensures that the host breakpoints do not affect the guest.
This however violates the non-instrumentable nature around VM entry
and exit; for example, when a host breakpoint is set on vcpu->arch.cr2,
Another issue is consistency. When the guest debug registers are active,
the host breakpoints are reset before kvm_x86->run(). But when the
guest debug registers are inactive, the host breakpoints are delayed to
be disabled. The host tracing tools may see different results depending
on what the guest is doing.
To fix the problems, we clear %db7 unconditionally before kvm_x86->run()
if the host has set any breakpoints, no matter if the guest is using
them or not.
Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
[Only clear %db7 instead of reloading all debug registers. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Commit a75a895e6457 ("KVM: selftests: Unconditionally use memslot 0 for
vaddr allocations") removed the memslot parameters from vm_vaddr_alloc.
It addressed all callers except one under lib/aarch64/, due to a race
with commit e3db7579ef35 ("KVM: selftests: Add exception handling
support for aarch64")
Fix the vm_vaddr_alloc call in lib/aarch64/processor.c.
Reported-by: Zenghui Yu <[email protected]>
Signed-off-by: Ricardo Koller <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
In commit bc9e9e672df9 ("KVM: debugfs: Reuse binary stats descriptors")
loop for filling debugfs_stat_data was copy-pasted 2 times, but
in the second loop pointers are saved over pointers allocated
in the first loop. All this causes is a memory leak, fix it.
Fixes: bc9e9e672df9 ("KVM: debugfs: Reuse binary stats descriptors")
Signed-off-by: Pavel Skripkin <[email protected]>
Reviewed-by: Jing Zhang <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Jing Zhang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Some of the lines aren't properly indented, causing yamllint to warn
about them:
.../nxp,sja1105.yaml:70:17: [warning] wrong indentation: expected 18 but found 16 (indentation)
Use the proper indentation to fix those warnings.
Signed-off-by: Thierry Reding <[email protected]>
Fixes: 070f5b701d559ae1 ("dt-bindings: net: dsa: sja1105: add SJA1110 bindings")
Tested-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Rob Herring <[email protected]>
|
|
"LinusTorvalds" is not pretty. Replace it with "Linus Torvalds".
Signed-off-by: Hu Haowen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
|
|
Risc-V gained support recently.
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
|
|
A couple of exotic quote characters came in with this license text; they
can confuse software that is not expecting non-ASCII text. Switch to
normal quotes here, with no changes to the actual license text.
Reported-by: Rahul T R <[email protected]>
Signed-off-by: Nishanth Menon <[email protected]>
CC: Greg Kroah-Hartman <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Thorsten Leemhuis <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
|
|
s390 enforces DYNAMIC_FTRACE if FUNCTION_TRACER is selected.
At the same time implementation of ftrace_caller is not compliant with
HAVE_DYNAMIC_FTRACE since it doesn't provide implementation of
ftrace_update_ftrace_func() and calls ftrace_trace_function() directly.
The subtle difference is that during ftrace code patching ftrace
replaces function tracer via ftrace_update_ftrace_func() and activates
it back afterwards. Unexpected direct calls to ftrace_trace_function()
during ftrace code patching leads to nullptr-dereferences when tracing
is activated for one of functions which are used during code patching.
Those function currently are:
copy_from_kernel_nofault()
copy_from_kernel_nofault_allowed()
preempt_count_sub() [with debug_defconfig]
preempt_count_add() [with debug_defconfig]
Corresponding KASAN report:
BUG: KASAN: nullptr-dereference in function_trace_call+0x316/0x3b0
Read of size 4 at addr 0000000000001e08 by task migration/0/15
CPU: 0 PID: 15 Comm: migration/0 Tainted: G B 5.13.0-41423-g08316af3644d
Hardware name: IBM 3906 M04 704 (LPAR)
Stopper: multi_cpu_stop+0x0/0x3e0 <- stop_machine_cpuslocked+0x1e4/0x218
Call Trace:
[<0000000001f77caa>] show_stack+0x16a/0x1d0
[<0000000001f8de42>] dump_stack+0x15a/0x1b0
[<0000000001f81d56>] print_address_description.constprop.0+0x66/0x2e0
[<000000000082b0ca>] kasan_report+0x152/0x1c0
[<00000000004cfd8e>] function_trace_call+0x316/0x3b0
[<0000000001fb7082>] ftrace_caller+0x7a/0x7e
[<00000000006bb3e6>] copy_from_kernel_nofault_allowed+0x6/0x10
[<00000000006bb42e>] copy_from_kernel_nofault+0x3e/0xd0
[<000000000014605c>] ftrace_make_call+0xb4/0x1f8
[<000000000047a1b4>] ftrace_replace_code+0x134/0x1d8
[<000000000047a6e0>] ftrace_modify_all_code+0x120/0x1d0
[<000000000047a7ec>] __ftrace_modify_code+0x5c/0x78
[<000000000042395c>] multi_cpu_stop+0x224/0x3e0
[<0000000000423212>] cpu_stopper_thread+0x33a/0x5a0
[<0000000000243ff2>] smpboot_thread_fn+0x302/0x708
[<00000000002329ea>] kthread+0x342/0x408
[<00000000001066b2>] __ret_from_fork+0x92/0xf0
[<0000000001fb57fa>] ret_from_fork+0xa/0x30
The buggy address belongs to the page:
page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1
flags: 0x1ffff00000001000(reserved|node=0|zone=0|lastcpupid=0x1ffff)
raw: 1ffff00000001000 0000040000000048 0000040000000048 0000000000000000
raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
0000000000001d00: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7
0000000000001d80: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7
>0000000000001e00: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7
^
0000000000001e80: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7
0000000000001f00: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7
==================================================================
To fix that introduce ftrace_func callback to be called from
ftrace_caller and update it in ftrace_update_ftrace_func().
Fixes: 4cc9bed034d1 ("[S390] cleanup ftrace backend functions")
Cc: [email protected]
Reviewed-by: Heiko Carstens <[email protected]>
Signed-off-by: Vasily Gorbik <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
|
|
To help review changes related to AMD IOMMU.
Signed-off-by: Suravee Suthikulpanit <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>
|
|
Takashi Iwai says:
====================
r8152: Fix a couple of PM problems
it seems that r8152 driver suffers from the deadlock at both runtime
and system PM. Formerly, it was seen more often at hibernation
resume, but now it's triggered more frequently, as reported in SUSE
Bugzilla:
https://bugzilla.suse.com/show_bug.cgi?id=1186194
While debugging the problem, I stumbled on a few obvious bugs and here
is the results with two patches for addressing the resume problem.
***
However, the story doesn't end here, unfortunately, and those patches
don't seem sufficing. The rest major problem is that the driver calls
napi_disable() and napi_enable() in the PM suspend callbacks. This
makes the system stalling at (runtime-)suspend. If we drop
napi_disable() and napi_enable() calls in the PM suspend callbacks, it
starts working (that was the result in Bugzilla comment 13):
https://bugzilla.suse.com/show_bug.cgi?id=1186194#c13
So, my patches aren't enough and we still need to investigate
further. It'd be appreciated if anyone can give a fix or a hint for
more debugging. The usage of napi_disable() at PM callbacks is unique
in this driver and looks rather suspicious to me; but I'm no expert in
this area so I might be wrong...
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
r8152 driver sets up the MAC address at reset-resume, while
rtl8152_set_mac_address() has the temporary autopm get/put. This may
lead to a deadlock as the PM lock has been already taken for the
execution of the runtime PM callback.
This patch adds the workaround to avoid the superfluous autpm when
called from rtl8152_reset_resume().
Link: https://bugzilla.suse.com/show_bug.cgi?id=1186194
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
rtl8152_close() takes the refcount via usb_autopm_get_interface() but
it doesn't release when RTL8152_UNPLUG test hits. This may lead to
the imbalance of PM refcount. This patch addresses it.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1186194
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add new PCI device id.
Signed-off-by: Jinzhou Su <[email protected]>
Reviewed-by: Huang Rui <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected] # 5.11.x
|
|
Populate the auxtrace queues using AUX records rather than whole
auxtrace buffers so that the decoder is reset between each aux record.
This is similar to the auxtrace_queues__process_index() ->
auxtrace_queues__add_indexed_event() flow where
perf_session__peek_event() is used to read AUXTRACE events out of random
positions in the file based on the auxtrace index.
But now we loop over all PERF_RECORD_AUX events instead of AUXTRACE
buffers. For each PERF_RECORD_AUX event, we find the corresponding
AUXTRACE buffer using the index, and add a fragment of that buffer to
the auxtrace queues.
No other changes to decoding were made, apart from populating the
auxtrace queues. The result of decoding is identical to before, except
in cases where decoding failed completely, due to not resetting the
decoder.
The reason for this change is because AUX records are emitted any time
tracing is disabled, for example when the process is scheduled out.
Because ETM was disabled and enabled again, the decoder also needs to be
reset to force the search for a sync packet. Otherwise there would be
fatal decoding errors.
Testing
=======
Testing was done with the following script, to diff the decoding results
between the patched and un-patched versions of perf:
#!/bin/bash
set -ex
$1 script -i $3 $4 > split.script
$2 script -i $3 $4 > default.script
diff split.script default.script | head -n 20
And it was run like this, with various itrace options depending on the
quantity of synthesised events:
compare.sh ./perf-patched ./perf-default perf-per-cpu-2-threads.data --itrace=i100000ns
No changes in output were observed in the following scenarios:
* Simple per-cpu
perf record -e cs_etm/@tmc_etr0/u top
* Per-thread, single thread
perf record -e cs_etm/@tmc_etr0/u --per-thread ./threads_C
* Per-thread multiple threads (but only one thread collected data):
perf record -e cs_etm/@tmc_etr0/u --per-thread --pid 4596,4597
* Per-thread multiple threads (both threads collected data):
perf record -e cs_etm/@tmc_etr0/u --per-thread --pid 4596,4597
* Per-cpu explicit threads:
perf record -e cs_etm/@tmc_etr0/u --pid 853,854
* System-wide (per-cpu):
perf record -e cs_etm/@tmc_etr0/u -a
* No data collected (no aux buffers)
Can happen with any command when run for a short period
* Containing truncated records
Can happen with any command
* Containing aux records with 0 size
Can happen with any command
* Snapshot mode (various files with and without buffer wrap)
perf record -e cs_etm/@tmc_etr0/u -a --snapshot
Some differences were observed in the following scenario:
* Snapshot mode (with duplicate buffers)
perf record -e cs_etm/@tmc_etr0/u -a --snapshot
Fewer samples are generated in snapshot mode if duplicate buffers
were gathered because buffers with the same offset are now only added
once. This gives different, but more correct results and no duplicate
data is decoded any more.
Signed-off-by: James Clark <[email protected]>
Reviewed-by: Mathieu Poirier <[email protected]>
Tested-by: Leo Yan <[email protected]>
Cc: Al Grant <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Branislav Rankov <[email protected]>
Cc: Denis Nikitin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Garry <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Suzuki Poulouse <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
The linux/kconfig.h file was copied from the kernel but the line where
with the generated/autoconf.h include from where the CONFIG_ entries
would come from was deleted, as tools/ build system don't create that
file, so we ended up always defining just __LITTLE_ENDIAN as
CONFIG_CPU_BIG_ENDIAN was nowhere to be found.
This in turn ended up breaking the build in some systems where
__LITTLE_ENDIAN was already defined, such as the androind NDK.
So just ditch that block that depends on the CONFIG_CPU_BIG_ENDIAN
define.
The kconfig.h file was copied just to get IS_ENABLED() and a
'make -C tools/all' doesn't breaks with this removal.
Fixes: 93281c4a96572a34 ("x86/insn: Add an insn_decode() API")
Cc: Adrian Hunter <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
There's a chance that the IDA allocated in mmc_alloc_host() is not freed
for some time because it's freed as part of a class' release function
(see mmc_host_classdev_release() where the IDA is freed). If another
thread is holding a reference to the class, then only once all balancing
device_put() calls (in turn calling kobject_put()) have been made will
the IDA be released and usable again.
Normally this isn't a problem because the kobject is released before
anything else that may want to use the same number tries to again, but
with CONFIG_DEBUG_KOBJECT_RELEASE=y and OF aliases it becomes pretty
easy to try to allocate an alias from the IDA twice while the first time
it was allocated is still pending a call to ida_simple_remove(). It's
also possible to trigger it by using CONFIG_DEBUG_KOBJECT_RELEASE and
probe defering a driver at boot that calls mmc_alloc_host() before
trying to get resources that may defer likes clks or regulators.
Instead of allocating from the IDA in this scenario, let's just skip it
if we know this is an OF alias. The number is already "claimed" and
devices that aren't using OF aliases won't try to use the claimed
numbers anyway (see mmc_first_nonreserved_index()). This should avoid
any issues with mmc_alloc_host() returning failures from the
ida_simple_get() in the case that we're using an OF alias.
Cc: Matthias Schiffer <[email protected]>
Cc: Sujit Kautkar <[email protected]>
Reported-by: Zubin Mithra <[email protected]>
Fixes: fa2d0aa96941 ("mmc: core: Allow setting slot index via device tree alias")
Signed-off-by: Stephen Boyd <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Ulf Hansson <[email protected]>
|
|
Ulf reported the following KASAN splat after adding some manual hacks
into mmc-utils[1].
DEBUG: mmc_blk_open: Let's sleep for 10s..
mmc1: card 0007 removed
BUG: KASAN: use-after-free in mmc_blk_get+0x58/0xb8
Read of size 4 at addr ffff00000a394a28 by task mmc/180
CPU: 2 PID: 180 Comm: mmc Not tainted 5.10.0-rc4-00069-gcc758c8c7127-dirty #5
Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
Call trace:
dump_backtrace+0x0/0x2b4
show_stack+0x18/0x6c
dump_stack+0xfc/0x168
print_address_description.constprop.0+0x6c/0x488
kasan_report+0x118/0x210
__asan_load4+0x94/0xd0
mmc_blk_get+0x58/0xb8
mmc_blk_open+0x7c/0xdc
__blkdev_get+0x3b4/0x964
blkdev_get+0x64/0x100
blkdev_open+0xe8/0x104
do_dentry_open+0x234/0x61c
vfs_open+0x54/0x64
path_openat+0xe04/0x1584
do_filp_open+0xe8/0x1e4
do_sys_openat2+0x120/0x230
__arm64_sys_openat+0xf0/0x15c
el0_svc_common.constprop.0+0xac/0x234
do_el0_svc+0x84/0xa0
el0_sync_handler+0x264/0x270
el0_sync+0x174/0x180
Allocated by task 33:
stack_trace_save+0x9c/0xdc
kasan_save_stack+0x28/0x60
__kasan_kmalloc.constprop.0+0xc8/0xf0
kasan_kmalloc+0x10/0x20
mmc_blk_alloc_req+0x94/0x4b0
mmc_blk_probe+0x2d4/0xaa4
mmc_bus_probe+0x34/0x4c
really_probe+0x148/0x6e0
driver_probe_device+0x78/0xec
__device_attach_driver+0x108/0x16c
bus_for_each_drv+0xf4/0x15c
__device_attach+0x168/0x240
device_initial_probe+0x14/0x20
bus_probe_device+0xec/0x100
device_add+0x55c/0xaf0
mmc_add_card+0x288/0x380
mmc_attach_sd+0x18c/0x22c
mmc_rescan+0x444/0x4f0
process_one_work+0x3b8/0x650
worker_thread+0xa0/0x724
kthread+0x218/0x220
ret_from_fork+0x10/0x38
Freed by task 33:
stack_trace_save+0x9c/0xdc
kasan_save_stack+0x28/0x60
kasan_set_track+0x28/0x40
kasan_set_free_info+0x24/0x4c
__kasan_slab_free+0x100/0x180
kasan_slab_free+0x14/0x20
kfree+0xb8/0x46c
mmc_blk_put+0xe4/0x11c
mmc_blk_remove_req.part.0+0x6c/0xe4
mmc_blk_remove+0x368/0x370
mmc_bus_remove+0x34/0x50
__device_release_driver+0x228/0x31c
device_release_driver+0x2c/0x44
bus_remove_device+0x1e4/0x200
device_del+0x2b0/0x770
mmc_remove_card+0xf0/0x150
mmc_sd_detect+0x9c/0x150
mmc_rescan+0x110/0x4f0
process_one_work+0x3b8/0x650
worker_thread+0xa0/0x724
kthread+0x218/0x220
ret_from_fork+0x10/0x38
The buggy address belongs to the object at ffff00000a394800
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 552 bytes inside of
1024-byte region [ffff00000a394800, ffff00000a394c00)
The buggy address belongs to the page:
page:00000000ff84ed53 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8a390
head:00000000ff84ed53 order:3 compound_mapcount:0 compound_pincount:0
flags: 0x3fffc0000010200(slab|head)
raw: 03fffc0000010200 dead000000000100 dead000000000122 ffff000009f03800
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff00000a394900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff00000a394980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff00000a394a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff00000a394a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff00000a394b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Looking closer at the problem, it looks like a classic dangling pointer
bug. The 'struct mmc_blk_data' that is used after being freed in
mmc_blk_put() is stashed away in 'md->disk->private_data' via
mmc_blk_alloc_req() but used in mmc_blk_get() because the 'usage' count
isn't properly aligned with the lifetime of the pointer. You'd expect
the 'usage' member to be in sync with the kfree(), and it mostly is,
except that mmc_blk_get() needs to dereference the potentially freed
memory storage for the 'struct mmc_blk_data' stashed away in the
private_data member to look at 'usage' before it actually figures out if
it wants to consider it a valid pointer or not. That's not going to work
if the freed memory has been overwritten by something else after the
free, and KASAN rightly complains here.
To fix the immediate problem, let's set the private_data member to NULL
in mmc_blk_put() so that mmc_blk_get() can consider the object "on the
way out" if the pointer is NULL and not even try to look at 'usage' if
the object isn't going to be around much longer. With that set to NULL
on the last mmc_blk_put(), optimize the get path further and use a kref
underneath the 'open_lock' mutex to only up the reference count if it's
non-zero, i.e. alive, and otherwise make mmc_blk_get() return NULL,
without actually testing the reference count if we're in the process of
removing the object from the system.
Finally, tighten the locking region on the put side to only be around
the parts that are removing the 'mmc_blk_data' from the system and
publishing that fact to the gendisk and then drop the lock as soon as we
can to avoid holding the lock around code that doesn't need it. This
fixes the KASAN issue.
Cc: Matthias Schiffer <[email protected]>
Cc: Sujit Kautkar <[email protected]>
Cc: Zubin Mithra <[email protected]>
Reported-by: Ulf Hansson <[email protected]>
Link: https://lore.kernel.org/linux-mmc/CAPDyKFryT63Jc7+DXWSpAC19qpZRqFr1orxwYGMuSqx247O8cQ@mail.gmail.com/ [1]
Signed-off-by: Stephen Boyd <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Ulf Hansson <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski.
"Including fixes from bpf and netfilter.
Current release - regressions:
- sock: fix parameter order in sock_setsockopt()
Current release - new code bugs:
- netfilter: nft_last:
- fix incorrect arithmetic when restoring last used
- honor NFTA_LAST_SET on restoration
Previous releases - regressions:
- udp: properly flush normal packet at GRO time
- sfc: ensure correct number of XDP queues; don't allow enabling the
feature if there isn't sufficient resources to Tx from any CPU
- dsa: sja1105: fix address learning getting disabled on the CPU port
- mptcp: addresses a rmem accounting issue that could keep packets in
subflow receive buffers longer than necessary, delaying MPTCP-level
ACKs
- ip_tunnel: fix mtu calculation for ETHER tunnel devices
- do not reuse skbs allocated from skbuff_fclone_cache in the napi
skb cache, we'd try to return them to the wrong slab cache
- tcp: consistently disable header prediction for mptcp
Previous releases - always broken:
- bpf: fix subprog poke descriptor tracking use-after-free
- ipv6:
- allocate enough headroom in ip6_finish_output2() in case
iptables TEE is used
- tcp: drop silly ICMPv6 packet too big messages to avoid
expensive and pointless lookups (which may serve as a DDOS
vector)
- make sure fwmark is copied in SYNACK packets
- fix 'disable_policy' for forwarded packets (align with IPv4)
- netfilter: conntrack:
- do not renew entry stuck in tcp SYN_SENT state
- do not mark RST in the reply direction coming after SYN packet
for an out-of-sync entry
- mptcp: cleanly handle error conditions with MP_JOIN and syncookies
- mptcp: fix double free when rejecting a join due to port mismatch
- validate lwtstate->data before returning from skb_tunnel_info()
- tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path
- mt76: mt7921: continue to probe driver when fw already downloaded
- bonding: fix multiple issues with offloading IPsec to (thru?) bond
- stmmac: ptp: fix issues around Qbv support and setting time back
- bcmgenet: always clear wake-up based on energy detection
Misc:
- sctp: move 198 addresses from unusable to private scope
- ptp: support virtual clocks and timestamping
- openvswitch: optimize operation for key comparison"
* tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (158 commits)
net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave()
sfc: add logs explaining XDP_TX/REDIRECT is not available
sfc: ensure correct number of XDP queues
sfc: fix lack of XDP TX queues - error XDP TX failed (-22)
net: fddi: fix UAF in fza_probe
net: dsa: sja1105: fix address learning getting disabled on the CPU port
net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload
net: Use nlmsg_unicast() instead of netlink_unicast()
octeontx2-pf: Fix uninitialized boolean variable pps
ipv6: allocate enough headroom in ip6_finish_output2()
net: hdlc: rename 'mod_init' & 'mod_exit' functions to be module-specific
net: bridge: multicast: fix MRD advertisement router port marking race
net: bridge: multicast: fix PIM hello router port marking race
net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340
dsa: fix for_each_child.cocci warnings
virtio_net: check virtqueue_add_sgs() return value
mptcp: properly account bulk freed memory
selftests: mptcp: fix case multiple subflows limited by server
mptcp: avoid processing packet if a subflow reset
mptcp: fix syncookie process if mptcp can not_accept new subflow
...
|
|
Add a simple helper that filesystems can use in their parameter parser
to parse the "source" parameter. A few places open-coded this function
and that already caused a bug in the cgroup v1 parser that we fixed.
Let's make it harder to get this wrong by introducing a helper which
performs all necessary checks.
Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12
Cc: Christoph Hellwig <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The following sequence can be used to trigger a UAF:
int fscontext_fd = fsopen("cgroup");
int fd_null = open("/dev/null, O_RDONLY);
int fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", fd_null);
close_range(3, ~0U, 0);
The cgroup v1 specific fs parser expects a string for the "source"
parameter. However, it is perfectly legitimate to e.g. specify a file
descriptor for the "source" parameter. The fs parser doesn't know what
a filesystem allows there. So it's a bug to assume that "source" is
always of type fs_value_is_string when it can reasonably also be
fs_value_is_file.
This assumption in the cgroup code causes a UAF because struct
fs_parameter uses a union for the actual value. Access to that union is
guarded by the param->type member. Since the cgroup paramter parser
didn't check param->type but unconditionally moved param->string into
fc->source a close on the fscontext_fd would trigger a UAF during
put_fs_context() which frees fc->source thereby freeing the file stashed
in param->file causing a UAF during a close of the fd_null.
Fix this by verifying that param->type is actually a string and report
an error if not.
In follow up patches I'll add a new generic helper that can be used here
and by other filesystems instead of this error-prone copy-pasta fix.
But fixing it in here first makes backporting a it to stable a lot
easier.
Fixes: 8d2451f4994f ("cgroup1: switch to option-by-option parsing")
Reported-by: [email protected]
Cc: Christoph Hellwig <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: <[email protected]>
Cc: syzkaller-bugs <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
the SVM
The AMD platform does not support the functions Ah CPUID leaf. The returned
results for this entry should all remain zero just like the native does:
AMD host:
0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
(uncanny) AMD guest:
0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00008000
Fixes: cadbaa039b99 ("perf/x86/intel: Make anythread filter support conditional")
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
BUG: KASAN: use-after-free in kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183
Read of size 8 at addr ffff0000c03a2500 by task syz-executor083/4269
CPU: 5 PID: 4269 Comm: syz-executor083 Not tainted 5.10.0 #7
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x2d0 arch/arm64/kernel/stacktrace.c:132
show_stack+0x28/0x34 arch/arm64/kernel/stacktrace.c:196
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x110/0x164 lib/dump_stack.c:118
print_address_description+0x78/0x5c8 mm/kasan/report.c:385
__kasan_report mm/kasan/report.c:545 [inline]
kasan_report+0x148/0x1e4 mm/kasan/report.c:562
check_memory_region_inline mm/kasan/generic.c:183 [inline]
__asan_load8+0xb4/0xbc mm/kasan/generic.c:252
kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183
kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
__invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
Allocated by task 4269:
stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc+0xdc/0x120 mm/kasan/common.c:461
kasan_kmalloc+0xc/0x14 mm/kasan/common.c:475
kmem_cache_alloc_trace include/linux/slab.h:450 [inline]
kmalloc include/linux/slab.h:552 [inline]
kzalloc include/linux/slab.h:664 [inline]
kvm_vm_ioctl_register_coalesced_mmio+0x78/0x1cc arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:146
kvm_vm_ioctl+0x7e8/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3746
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
__invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
Freed by task 4269:
stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_set_track+0x38/0x6c mm/kasan/common.c:56
kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:355
__kasan_slab_free+0x124/0x150 mm/kasan/common.c:422
kasan_slab_free+0x10/0x1c mm/kasan/common.c:431
slab_free_hook mm/slub.c:1544 [inline]
slab_free_freelist_hook mm/slub.c:1577 [inline]
slab_free mm/slub.c:3142 [inline]
kfree+0x104/0x38c mm/slub.c:4124
coalesced_mmio_destructor+0x94/0xa4 arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:102
kvm_iodevice_destructor include/kvm/iodev.h:61 [inline]
kvm_io_bus_unregister_dev+0x248/0x280 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:4374
kvm_vm_ioctl_unregister_coalesced_mmio+0x158/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:186
kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
__invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
If kvm_io_bus_unregister_dev() return -ENOMEM, we already call kvm_iodevice_destructor()
inside this function to delete 'struct kvm_coalesced_mmio_dev *dev' from list
and free the dev, but kvm_iodevice_destructor() is called again, it will lead
the above issue.
Let's check the the return value of kvm_io_bus_unregister_dev(), only call
kvm_iodevice_destructor() if the return value is 0.
Cc: Paolo Bonzini <[email protected]>
Cc: [email protected]
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Kefeng Wang <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
Fixes: 5d3c4c79384a ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed", 2021-04-20)
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Don't clear the C-bit in the #NPF handler, as it is a legal GPA bit for
non-SEV guests, and for SEV guests the C-bit is dropped before the GPA
hits the NPT in hardware. Clearing the bit for non-SEV guests causes KVM
to mishandle #NPFs with that collide with the host's C-bit.
Although the APM doesn't explicitly state that the C-bit is not reserved
for non-SEV, Tom Lendacky confirmed that the following snippet about the
effective reduction due to the C-bit does indeed apply only to SEV guests.
Note that because guest physical addresses are always translated
through the nested page tables, the size of the guest physical address
space is not impacted by any physical address space reduction indicated
in CPUID 8000_001F[EBX]. If the C-bit is a physical address bit however,
the guest physical address space is effectively reduced by 1 bit.
And for SEV guests, the APM clearly states that the bit is dropped before
walking the nested page tables.
If the C-bit is an address bit, this bit is masked from the guest
physical address when it is translated through the nested page tables.
Consequently, the hypervisor does not need to be aware of which pages
the guest has chosen to mark private.
Note, the bogus C-bit clearing was removed from legacy #PF handler in
commit 6d1b867d0456 ("KVM: SVM: Don't strip the C-bit from CR2 on #PF
interception").
Fixes: 0ede79e13224 ("KVM: SVM: Clear C-bit from the page fault address")
Cc: Peter Gonda <[email protected]>
Cc: Brijesh Singh <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Ignore "dynamic" host adjustments to the physical address mask when
generating the masks for guest PTEs, i.e. the guest PA masks. The host
physical address space and guest physical address space are two different
beasts, e.g. even though SEV's C-bit is the same bit location for both
host and guest, disabling SME in the host (which clears shadow_me_mask)
does not affect the guest PTE->GPA "translation".
For non-SEV guests, not dropping bits is the correct behavior. Assuming
KVM and userspace correctly enumerate/configure guest MAXPHYADDR, bits
that are lost as collateral damage from memory encryption are treated as
reserved bits, i.e. KVM will never get to the point where it attempts to
generate a gfn using the affected bits. And if userspace wants to create
a bogus vCPU, then userspace gets to deal with the fallout of hardware
doing odd things with bad GPAs.
For SEV guests, not dropping the C-bit is technically wrong, but it's a
moot point because KVM can't read SEV guest's page tables in any case
since they're always encrypted. Not to mention that the current KVM code
is also broken since sme_me_mask does not have to be non-zero for SEV to
be supported by KVM. The proper fix would be to teach all of KVM to
correctly handle guest private memory, but that's a task for the future.
Fixes: d0ec49d4de90 ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
Cc: [email protected]
Cc: Brijesh Singh <[email protected]>
Cc: Tom Lendacky <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
[Use a new header instead of adding header guards to paging_tmpl.h. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Use boot_cpu_data.x86_phys_bits instead of the raw CPUID information to
enumerate the MAXPHYADDR for KVM guests when TDP is disabled (the guest
version is only relevant to NPT/TDP).
When using shadow paging, any reductions to the host's MAXPHYADDR apply
to KVM and its guests as well, i.e. using the raw CPUID info will cause
KVM to misreport the number of PA bits available to the guest.
Unconditionally zero out the "Physical Address bit reduction" entry.
For !TDP, the adjustment is already done, and for TDP enumerating the
host's reduction is wrong as the reduction does not apply to GPAs.
Fixes: 9af9b94068fb ("x86/cpu/AMD: Handle SME reduction in physical address size")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Ignore the guest MAXPHYADDR reported by CPUID.0x8000_0008 if TDP, i.e.
NPT, is disabled, and instead use the host's MAXPHYADDR. Per AMD'S APM:
Maximum guest physical address size in bits. This number applies only
to guests using nested paging. When this field is zero, refer to the
PhysAddrSize field for the maximum guest physical address size.
Fixes: 24c82e576b78 ("KVM: Sanitize cpuid")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
enabled"
Let KVM load if EFER.NX=0 even if NX is supported, the analysis and
testing (or lack thereof) for the non-PAE host case was garbage.
If the kernel won't be using PAE paging, .Ldefault_entry in head_32.S
skips over the entire EFER sequence. Hopefully that can be changed in
the future to allow KVM to require EFER.NX, but the motivation behind
KVM's requirement isn't yet merged. Reverting and revisiting the mess
at a later date is by far the safest approach.
This reverts commit 8bbed95d2cb6e5de8a342d761a89b0a04faed7be.
Fixes: 8bbed95d2cb6 ("KVM: x86: WARN and reject loading KVM if NX is supported but not enabled")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Commit b78f4a59669 ("KVM: selftests: Rename vm_handle_exception")
raced with a couple of new x86 tests, missing two vm_handle_exception
to vm_install_exception_handler conversions.
Help the two broken tests to catch up with the new world.
Cc: Andrew Jones <[email protected]>
CC: Ricardo Koller <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Ricardo Koller <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: selftests: Fixes
- provide memory model for IBM z196 and zEC12
- do not require 64GB of memory
|
|
With the recent fixes for fallthrough warnings, it is now possible to
enable -Wimplicit-fallthrough for Clang.
It's important to mention that since we have adopted the use of the
pseudo-keyword macro fallthrough; we also want to avoid having more
/* fall through */ comments being introduced. Notice that contrary
to GCC, Clang doesn't recognize any comments as implicit fall-through
markings when the -Wimplicit-fallthrough option is enabled. So, in
order to avoid having more comments being introduced, we have to use
the option -Wimplicit-fallthrough=5 for GCC, which similar to Clang,
will cause a warning in case a code comment is intended to be used
as a fall-through marking.
Co-developed-by: Kees Cook <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Gustavo A. R. Silva <[email protected]>
|
|
Fix the following fallthrough warning:
arch/powerpc/platforms/powermac/smp.c:149:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: kernel test robot <[email protected]>
Link: https://lore.kernel.org/lkml/60ef0750.I8J+C6KAtb0xVOAa%[email protected]/
Signed-off-by: Gustavo A. R. Silva <[email protected]>
|
|
sysconf(__SC_THREAD_STACK_MIN_VALUE)
In fedora rawhide the PTHREAD_STACK_MIN define may end up expanded to a
sysconf() call, and that will return 'long int', breaking the build:
45 fedora:rawhide : FAIL gcc version 11.1.1 20210623 (Red Hat 11.1.1-6) (GCC)
builtin-sched.c: In function 'create_tasks':
/git/perf-5.14.0-rc1/tools/include/linux/kernel.h:43:24: error: comparison of distinct pointer types lacks a cast [-Werror]
43 | (void) (&_max1 == &_max2); \
| ^~
builtin-sched.c:673:34: note: in expansion of macro 'max'
673 | (size_t) max(16 * 1024, PTHREAD_STACK_MIN));
| ^~~
cc1: all warnings being treated as errors
$ grep __sysconf /usr/include/*/*.h
/usr/include/bits/pthread_stack_min-dynamic.h:extern long int __sysconf (int __name) __THROW;
/usr/include/bits/pthread_stack_min-dynamic.h:# define PTHREAD_STACK_MIN __sysconf (__SC_THREAD_STACK_MIN_VALUE)
/usr/include/bits/time.h:extern long int __sysconf (int);
/usr/include/bits/time.h:# define CLK_TCK ((__clock_t) __sysconf (2)) /* 2 is _SC_CLK_TCK */
$
So cast it to int to cope with that.
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Fix the following fallthrough warning (powerpc-randconfig):
drivers/dma/mpc512x_dma.c:816:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: kernel test robot <[email protected]>
Link: https://lore.kernel.org/lkml/60ef0750.I8J+C6KAtb0xVOAa%[email protected]/
Signed-off-by: Gustavo A. R. Silva <[email protected]>
|
|
Fix the following fallthrough warning (powerpc-randconfig):
drivers/usb/gadget/udc/fsl_qe_udc.c:589:4: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Reported-by: kernel test robot <[email protected]>
Link: https://lore.kernel.org/lkml/60ef0750.I8J+C6KAtb0xVOAa%[email protected]/
Signed-off-by: Gustavo A. R. Silva <[email protected]>
|
|
When calling ttm_range_man_fini(), 'man' may be uninitialized, which may
cause a null pointer dereference bug.
Fix this by checking if it is a null pointer.
This log reveals it:
[ 7.902580 ] BUG: kernel NULL pointer dereference, address: 0000000000000058
[ 7.905721 ] RIP: 0010:ttm_range_man_fini+0x40/0x160
[ 7.911826 ] Call Trace:
[ 7.911826 ] radeon_ttm_fini+0x167/0x210
[ 7.911826 ] radeon_bo_fini+0x15/0x40
[ 7.913767 ] rs400_fini+0x55/0x80
[ 7.914358 ] radeon_device_fini+0x3c/0x140
[ 7.914358 ] radeon_driver_unload_kms+0x5c/0xe0
[ 7.914358 ] radeon_driver_load_kms+0x13a/0x200
[ 7.914358 ] ? radeon_driver_unload_kms+0xe0/0xe0
[ 7.914358 ] drm_dev_register+0x1db/0x290
[ 7.914358 ] radeon_pci_probe+0x16a/0x230
[ 7.914358 ] local_pci_probe+0x4a/0xb0
Signed-off-by: Zheyu Ma <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Christian König <[email protected]>
|
|
Because the out of range assignment to bit fields
are compiler-dependant, the fields could have wrong
value.
Signed-off-by: Hyunchul Lee <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
mounts
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=213565
cruid should only be used for the initial mount and after this we should use the current
users credentials.
Ignore the original cruid mount argument when creating a new context for a multiuser mount
following a DFS link.
Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api")
Cc: [email protected] # 5.11+
Reported-by: Xiaoli Feng <[email protected]>
Signed-off-by: Ronnie Sahlberg <[email protected]>
Reviewed-by: Paulo Alcantara (SUSE) <[email protected]>
Signed-off-by: Steve French <[email protected]>
|