Age | Commit message | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Will Deacon:
"Ensure PAN is re-enabled following user fault in uaccess routines.
After I thought we were done for 5.4, we had a report this week of a
nasty issue that has been shown to leak data between different user
address spaces thanks to corruption of entries in the TLB. In
hindsight, we should have spotted this in review when the PAN code was
merged back in v4.3, but hindsight is 20/20 and I'm trying not to beat
myself up too much about it despite being fairly miserable.
Anyway, the fix is "obvious" but the actual failure is more
subtle, and is described in the commit message. I've included a fairly
mechanical follow-up patch here as well, which moves this checking out
into the C wrappers which is what we do for {get,put}_user() already
and allows us to remove these bloody assembly macros entirely. The
patches have passed kernelci [1] [2] [3] and CKI [4] tests overnight,
as well as some targeted testing [5] for this particular issue.
The first patch is tagged for stable and should be applied to 4.14,
4.19 and 5.3. I have separate backports for 4.4 and 4.9, which I'll
send out once this has landed in your tree (although the original
patch applies cleanly, it won't build for those two trees).
Thanks to Pavel Tatashin for reporting this and Mark Rutland for
helping to diagnose the issue and review/test the solution"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: uaccess: Remove uaccess_*_not_uao asm macros
arm64: uaccess: Ensure PAN is re-enabled after unhandled uaccess fault
|
|
The 0-DAY bot found that audit_log_task is not declared under
CONFIG_AUDITSYSCALL, which causes a compilation error when
that option is not defined:
kernel/bpf/syscall.o: In function `bpf_audit_prog.isra.30':
>> syscall.c:(.text+0x860): undefined reference to `audit_log_task'
Add the audit_log_task declaration and stub within the
CONFIG_AUDITSYSCALL ifdef.
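The declaration-plus-stub pattern described above can be sketched in
plain C as follows. This is only an illustration under assumptions:
CONFIG_DEMO_AUDIT, demo_audit_log_task() and demo_prog_load() are
hypothetical names, not the kernel's, and the option is deliberately
left undefined so that the stub path is the one being compiled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for CONFIG_AUDITSYSCALL; deliberately left undefined. */
/* #define CONFIG_DEMO_AUDIT 1 */

struct demo_audit_buffer {
	int records;	/* number of task records emitted so far */
};

#ifdef CONFIG_DEMO_AUDIT
/* With the option on, the real implementation lives in another object file. */
void demo_audit_log_task(struct demo_audit_buffer *ab);
#else
/* With the option off, an empty stub keeps every caller compiling and linking. */
static inline void demo_audit_log_task(struct demo_audit_buffer *ab)
{
	(void)ab;
}
#endif

/* A caller that needs no #ifdef of its own. */
static int demo_prog_load(struct demo_audit_buffer *ab)
{
	assert(ab != NULL);
	demo_audit_log_task(ab);
	return ab->records;
}
```

With the option undefined, demo_prog_load() links without any
"undefined reference" and the stub leaves the buffer untouched.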
Fixes: 91e6015b082b ("bpf: Emit audit messages upon successful prog load and unload")
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The workqueue only exists for the primary PF. For other functions
we hit a WARN_ON in kernel/workqueue.c.
Fixes: 7c236c43b838 ("sfc: Add support for IEEE-1588 PTP")
Signed-off-by: Martin Habets <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Pull block fix from Jens Axboe:
"Just a single fix for an issue in nbd introduced in this cycle"
* tag 'for-linus-20191121' of git://git.kernel.dk/linux-block:
nbd:fix memory leak in nbd_get_socket()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"A last set of small fixes for GPIO, this cycle was quite busy.
- Fix debounce delays on the MAX77620 GPIO expander
- Use the correct unit for debounce times on the BD70528 GPIO expander
- Get proper deps for parallel builds of the GPIO tools
- Add a specific ACPI quirk for the Terra Pad 1061"
* tag 'gpio-v5.4-5' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpiolib: acpi: Add Terra Pad 1061 to the run_edge_events_on_boot_blacklist
tools: gpio: Correctly add make dependencies for gpio_utils
gpio: bd70528: Use correct unit for debounce times
gpio: max77620: Fixup debounce delays
|
|
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style. This fixes various indentation mixups (seven spaces,
tab+one space, etc).
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style. This fixes various indentation mixups (seven spaces,
tab+one space, etc).
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd fixlet from Christian Brauner:
"This contains a simple fix for the pidfd poll method. In the original
patchset pidfd_poll() was made to return an unsigned int. However, the
poll method is defined to return a __poll_t. While the unsigned int is
not a huge deal it's just nicer to return a __poll_t.
I've decided to send it right before the 5.4 release mainly so that
stable doesn't need to backport it to both 5.4 and 5.3"
* tag 'for-linus-2019-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fork: fix pidfd_poll()'s return type
|
|
If starting the transfer of a command succeeds but the transfer for the
reply fails, it is not enough to initiate killing the transfer for the
command, because it may still be running. You need to wait for the
killing to finish before you can reuse the URB and buffer.
Reported-and-tested-by: [email protected]
Signed-off-by: Oliver Neukum <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
As Jakub suggested on another patch, it's better to do the check
on erspan options before allocating memory.
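A minimal sketch of the "validate first, allocate second" ordering,
assuming a made-up option struct. demo_erspan_opts, demo_set_opts() and
demo_alloc_count are hypothetical and exist only to make the ordering
observable; the accepted versions (1 and 2) are an assumption of this
sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical option layout, for illustration only. */
struct demo_erspan_opts {
	unsigned int ver;
	unsigned int index;
};

static int demo_alloc_count;	/* counts successful allocations */

static int demo_set_opts(const struct demo_erspan_opts *in,
			 struct demo_erspan_opts **out)
{
	struct demo_erspan_opts *copy;

	assert(in != NULL && out != NULL);

	/* Validate first: a bad option never reaches the allocator. */
	if (in->ver != 1 && in->ver != 2)
		return -EINVAL;

	copy = malloc(sizeof(*copy));
	if (!copy)
		return -ENOMEM;
	demo_alloc_count++;

	*copy = *in;
	*out = copy;
	return 0;
}
```

Because the check runs before malloc(), the error path has nothing to
free.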
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
LWTUNNEL_IP(6)_OPTS are the new items in ip(6)_tun_policy, which
are parsed by nla_parse_nested_deprecated(). We should check it
strictly by setting .strict_start_type = LWTUNNEL_IP(6)_OPTS.
This patch also adds missing LWTUNNEL_IP6_OPTS in ip6_tun_policy.
Fixes: 4ece47787077 ("lwtunnel: add options setting and dumping for geneve")
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
ct_policy and mpls_policy are parsed with nla_parse_nested(), which
already does NL_VALIDATE_STRICT validation. Setting strict_start_type
is therefore unnecessary, since its only purpose is to get some
attributes parsed with NL_VALIDATE_STRICT.
This patch removes it, and does the same for rtm_nh_policy, which
is parsed by nlmsg_parse().
Suggested-by: Jakub Kicinski <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Xin Long says:
====================
net: sched: support vxlan and erspan options
This patchset is to add vxlan and erspan options support in
cls_flower and act_tunnel_key. The form is pretty much like
geneve_opts in:
https://patchwork.ozlabs.org/patch/935272/
https://patchwork.ozlabs.org/patch/954564/
but only one option is allowed for vxlan and erspan.
v1->v2:
- see each patch changelog.
====================
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch is to allow matching options in erspan.
The options can be described in the form:
VER:INDEX:DIR:HWID/VER:INDEX_MASK:DIR_MASK:HWID_MASK.
When ver is set to 1, index will be applied while dir
and hwid will be ignored, and when ver is set to 2,
dir and hwid will be used while index will be ignored.
Different from geneve, only one option can be set. And
also, geneve options, vxlan options or erspan options
can't be set at the same time.
# ip link add name erspan1 type erspan external
# tc qdisc add dev erspan1 ingress
# tc filter add dev erspan1 protocol ip parent ffff: \
flower \
enc_src_ip 10.0.99.192 \
enc_dst_ip 10.0.99.193 \
enc_key_id 11 \
erspan_opts 1:12:0:0/1:ffff:0:0 \
ip_proto udp \
action mirred egress redirect dev eth0
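The version-dependent field rule described above (ver 1 uses index,
ver 2 uses dir and hwid) can be sketched as a small normalization
helper. This is not the cls_flower code; struct demo_erspan_md and
demo_erspan_normalize() are hypothetical names, and rejecting other
versions is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory form of VER:INDEX:DIR:HWID. */
struct demo_erspan_md {
	unsigned int ver, index, dir, hwid;
};

/* Clear the fields that are ignored for the given version.
 * Returns 0 on success, -1 for a version this sketch does not know. */
static int demo_erspan_normalize(struct demo_erspan_md *md)
{
	assert(md != NULL);

	switch (md->ver) {
	case 1:		/* index applies; dir and hwid are ignored */
		md->dir = 0;
		md->hwid = 0;
		return 0;
	case 2:		/* dir and hwid apply; index is ignored */
		md->index = 0;
		return 0;
	default:
		return -1;
	}
}
```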
v1->v2:
- improve some err msgs of extack.
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch is to allow matching gbp option in vxlan.
The options can be described in the form GBP/GBP_MASK,
where GBP is represented as a 32bit hexadecimal value.
Different from geneve, only one option can be set. And
also, geneve options and vxlan options can't be set at
the same time.
# ip link add name vxlan0 type vxlan dstport 0 external
# tc qdisc add dev vxlan0 ingress
# tc filter add dev vxlan0 protocol ip parent ffff: \
flower \
enc_src_ip 10.0.99.192 \
enc_dst_ip 10.0.99.193 \
enc_key_id 11 \
vxlan_opts 01020304/ffffffff \
ip_proto udp \
action mirred egress redirect dev eth0
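As an illustration of the GBP/GBP_MASK form, the following sketch
parses the two 32-bit hexadecimal values and applies a masked compare.
demo_parse_gbp() and demo_gbp_matches() are hypothetical helpers, and
the masked-equality semantics are an assumption about how key and mask
are combined, not a copy of the flower classifier.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_gbp_match {
	uint32_t key;
	uint32_t mask;
};

/* Parse "KEY/MASK", both hexadecimal and at most 32 bits wide. */
static int demo_parse_gbp(const char *s, struct demo_gbp_match *m)
{
	unsigned long key, mask;
	char *end;

	assert(s != NULL && m != NULL);

	errno = 0;
	key = strtoul(s, &end, 16);
	if (errno || end == s || *end != '/' || key > UINT32_MAX)
		return -EINVAL;

	s = end + 1;
	mask = strtoul(s, &end, 16);
	if (errno || end == s || *end != '\0' || mask > UINT32_MAX)
		return -EINVAL;

	m->key = (uint32_t)key;
	m->mask = (uint32_t)mask;
	return 0;
}

/* Assumed semantics: only the bits selected by the mask are compared. */
static int demo_gbp_matches(const struct demo_gbp_match *m, uint32_t pkt_gbp)
{
	return (pkt_gbp & m->mask) == (m->key & m->mask);
}
```

With a mask of ffffffff, as in the command above, every bit of the
packet's GBP value has to equal the key.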
v1->v2:
- add .strict_start_type for enc_opts_policy as Jakub noticed.
- use Duplicate instead of Wrong in err msg for extack as Jakub
suggested.
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch is to allow setting erspan options using the
act_tunnel_key action. Different from geneve options,
only one option can be set. And also, geneve options,
vxlan options or erspan options can't be set at the
same time.
Options are expressed as ver:index:dir:hwid. When ver
is set to 1, index will be applied while dir and hwid
will be ignored, and when ver is set to 2, dir and
hwid will be used while index will be ignored.
# ip link add name erspan1 type erspan external
# tc qdisc add dev eth0 ingress
# tc filter add dev eth0 protocol ip parent ffff: \
flower indev eth0 \
ip_proto udp \
action tunnel_key \
set src_ip 10.0.99.192 \
dst_ip 10.0.99.193 \
dst_port 6081 \
id 11 \
erspan_opts 1:2:0:0 \
action mirred egress redirect dev erspan1
v1->v2:
- do the validation when dst is not yet allocated as Jakub suggested.
- use Duplicate instead of Wrong in err msg for extack.
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch is to allow setting vxlan options using the
act_tunnel_key action. Different from geneve options,
only one option can be set. And also, geneve options
and vxlan options can't be set at the same time.
gbp is the only param for vxlan options:
# ip link add name vxlan0 type vxlan dstport 0 external
# tc qdisc add dev eth0 ingress
# tc filter add dev eth0 protocol ip parent ffff: \
flower indev eth0 \
ip_proto udp \
action tunnel_key \
set src_ip 10.0.99.192 \
dst_ip 10.0.99.193 \
dst_port 6081 \
id 11 \
vxlan_opts 01020304 \
action mirred egress redirect dev vxlan0
v1->v2:
- add .strict_start_type for enc_opts_policy as Jakub noticed.
- use Duplicate instead of Wrong in err msg for extack as Jakub
suggested.
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
If rvu_get_blkaddr() fails, then this rvu_cgx_nix_cuml_stats() returns
zero and we write some uninitialized data into the debugfs output.
On the error paths, the use of the uninitialized "*stat" is harmless,
but it will lead to a Smatch warning (static analysis) and a UBSan
warning (runtime analysis) so we should prevent that as well.
Fixes: f967488d095e ("octeontx2-af: Add per CGX port level NIX Rx/Tx counters")
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
If transport->init() fails, we can't assign the transport to the
socket, because it's not initialized correctly, and any future
calls to the transport callbacks would have unexpected behavior.
Fixes: c0cfa2d8a788 ("vsock: add multi-transports support")
Reported-and-tested-by: [email protected]
Signed-off-by: Stefano Garzarella <[email protected]>
Reviewed-by: Jorgen Hansen <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The current ampdu locking code does not unlock its mutex in the early
return case. This patch fixes it.
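The usual shape of such a fix is a single unlock label shared by all
exit paths. The sketch below uses a toy lock so that it stays
self-contained; struct demo_mutex, struct demo_sta and
demo_ampdu_start() are hypothetical and are not the driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy lock that only tracks whether it is held. */
struct demo_mutex {
	int held;
};

static void demo_lock(struct demo_mutex *m)
{
	assert(!m->held);
	m->held = 1;
}

static void demo_unlock(struct demo_mutex *m)
{
	assert(m->held);
	m->held = 0;
}

struct demo_sta {
	struct demo_mutex ampdu_mutex;
	int session_active;
	int started;
};

static int demo_ampdu_start(struct demo_sta *sta)
{
	int ret = 0;

	assert(sta != NULL);
	demo_lock(&sta->ampdu_mutex);

	if (sta->session_active) {
		ret = -EBUSY;
		goto out;	/* early return still goes through the unlock */
	}

	sta->session_active = 1;
	sta->started++;
out:
	demo_unlock(&sta->ampdu_mutex);
	return ret;
}
```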
Signed-off-by: Markus Theil <[email protected]>
Acked-by: Felix Fietkau <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
|
|
When the NMI lands on an ESPFIX_SS, we are on the entry stack and must
swizzle, otherwise we'll run do_nmi() on the entry stack, which is
BAD.
Also, similar to the normal exception path, we need to correct the
ESPFIX magic before leaving the entry stack, otherwise pt_regs will
present a non-flat stack pointer.
Tested by running sigreturn_32 concurrent with perf-record.
Fixes: e5862d0515ad ("x86/entry/32: Leave the kernel via trampoline stack")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Cc: [email protected]
|
|
Right now, we do some fancy parts of the exception entry path while SS
might have a nonzero base: we fill in regs->ss and regs->sp, and we
consider switching to the kernel stack. This results in regs->ss and
regs->sp referring to a non-flat stack and it may result in
overflowing the entry stack. The former issue means that we can try to
call iret_exc on a non-flat stack, which doesn't work.
Tested with selftests/x86/sigreturn_32.
Fixes: 45d7b255747c ("x86/entry/32: Enter the kernel via trampoline stack")
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
|
|
This will allow us to get percpu access working before FIXUP_FRAME,
which will allow us to unwind ESPFIX earlier.
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
|
|
When re-building the IRET frame we use %eax as a destination %esp;
make sure to then also match the segment for when there is a nonzero
SS base (ESPFIX).
[peterz: Changelog and minor edits]
Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs")
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
|
|
As reported by Lai, the commit 3c88c692c287 ("x86/stackframe/32:
Provide consistent pt_regs") wrecked the IRET EXTABLE entry by making
.Lirq_return not point at IRET.
Fix this by placing IRET_FRAME in RESTORE_REGS, to mirror how
FIXUP_FRAME is part of SAVE_ALL.
Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs")
Reported-by: Lai Jiangshan <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Cc: [email protected]
|
|
The entry stack in the cpu entry area is protected against overflow by the
readonly GDT on 64-bit, but on 32-bit the GDT needs to be writeable and
therefore does not trigger a fault on stack overflow.
Add a guard page.
Fixes: c482feefe1ae ("x86/entry/64: Make cpu_entry_area.tss read-only")
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
|
|
Commit 945fd17ab6ba ("x86/cpu_entry_area: Sync cpu_entry_area to
initial_page_table") introduced the sync for the initial page table for
32bit.
sync_initial_page_table() uses clone_pgd_range() which does the update for
the kernel page table. If PTI is enabled it also updates the user space
page table counterpart, which is assumed to be in the next page after the
target PGD.
At that point in time 32-bit did not have PTI support, so the user space
page table update was not taking place.
The support for PTI on 32-bit, which was introduced later on, did not take
that into account and failed to add the user space counterpart for the
initial page table.
As a consequence sync_initial_page_table() overwrites any data which is
located in the page behind initial_page_table, causing random failures,
e.g. by corrupting doublefault_tss and wrecking the doublefault handler
on 32bit.
Fix it by adding a "user" page table right after initial_page_table.
Fixes: 7757d607c6b3 ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32")
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Joerg Roedel <[email protected]>
Cc: [email protected]
|
|
The double fault TSS was missing GS setup, which is needed for stack
canaries to work.
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
|
|
In nbd_add_socket, when krealloc succeeds but the allocation of nsock
fails, the reallocated memory is leaked. The correct behaviour is to
assign the reallocated memory to config->socks right after success.
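A userspace sketch of the same ordering, using realloc() in place of
krealloc(). struct demo_config and demo_add_socket() are hypothetical,
and the fail_nsock flag exists only to simulate the second allocation
failing.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct demo_config {
	int **socks;	/* growable array of per-socket objects */
	int num;
};

static int demo_add_socket(struct demo_config *c, int value, int fail_nsock)
{
	int **socks;
	int *nsock;

	assert(c != NULL && c->num >= 0);

	socks = realloc(c->socks, (size_t)(c->num + 1) * sizeof(*socks));
	if (!socks)
		return -ENOMEM;	/* old array is untouched and still owned by c */

	/* Publish the new array immediately so no later error path can leak it. */
	c->socks = socks;

	nsock = fail_nsock ? NULL : malloc(sizeof(*nsock));
	if (!nsock)
		return -ENOMEM;	/* c->socks is already valid, nothing is lost */

	*nsock = value;
	c->socks[c->num++] = nsock;
	return 0;
}

static void demo_free_config(struct demo_config *c)
{
	int i;

	for (i = 0; i < c->num; i++)
		free(c->socks[i]);
	free(c->socks);
	c->socks = NULL;
	c->num = 0;
}
```

After a failed second allocation the config still owns the enlarged
array, so a later add or the final cleanup frees it normally.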
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Navid Emamdoost <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
for-5.5/drivers-post
Pull NVMe changes from Keith:
"- The only new feature is the optional hwmon support for nvme (Guenter
and Akinobu)
- A universal work-around for controllers reading discard payloads
beyond the range boundary (Eduard)
- Chaitanya graciously agreed to share the target driver maintenance"
* 'nvme-5.5' of git://git.infradead.org/nvme:
nvme: hwmon: add quirk to avoid changing temperature threshold
nvme: hwmon: provide temperature min and max values for each sensor
nvmet: add another maintainer
nvme: Discard workaround for non-conformant devices
nvme: Add hardware monitoring support
|
|
Considering the previous changes, this is a more proper name.
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Drop the rbt_ prefix from the rbt_memtype_*() calls, as we no longer
use an rbtree directly.
Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Get rid of passing the rb_root down the helper calls; there
is only one: &memtype_rbroot.
No change in functionality.
[ mingo: Fixed the changelog which described a different version of the patch. ]
Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
With some considerations, the custom pat_rbtree implementation can be
simplified to use most of the generic interval_tree machinery:
- The tree inorder traversal can slightly differ when there are key
('start') collisions in the tree due to one going left and another right.
This, however, only affects the output of debugfs' pat_memtype_list file.
- Generic interval trees are now fully closed [a, b], for which we need
to adjust the last endpoint (ie: end - 1).
- Erasing logic must remain untouched as well.
- In order for the types to remain u64, the 'memtype_interval' calls are
introduced, as opposed to simply using struct interval_tree.
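The endpoint adjustment mentioned in the list above can be sketched as
follows: a half-open range [start, end) becomes the fully closed
interval [start, end - 1], and two closed intervals overlap when each
starts no later than the other one ends. struct demo_interval and both
helpers are hypothetical names, not the memtype_interval calls.

```c
#include <assert.h>
#include <stdint.h>

/* Fully closed interval [start, last], with u64-sized endpoints. */
struct demo_interval {
	uint64_t start;
	uint64_t last;
};

/* Convert a half-open range [start, end) to its closed form. */
static struct demo_interval demo_from_half_open(uint64_t start, uint64_t end)
{
	struct demo_interval iv;

	assert(start < end);
	iv.start = start;
	iv.last = end - 1;
	return iv;
}

/* Overlap test for two closed intervals. */
static int demo_overlaps(struct demo_interval a, struct demo_interval b)
{
	return a.start <= b.last && b.start <= a.last;
}
```

Without the "end - 1" adjustment, two adjacent ranges such as
[0x0, 0x1000) and [0x1000, 0x2000) would wrongly be reported as
overlapping.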
In addition, the PAT tree might also benefit from the fast overlap
detection for the insertion case when looking up the first overlapping node
in the tree.
No change in behavior is intended.
Finally, I've tested this on various servers, via sanity warnings, running
side by side with the current version and so far see no differences in the
returned pointer node when doing memtype_rb_lowest_match() lookups.
Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This adds a new quirk NVME_QUIRK_NO_TEMP_THRESH_CHANGE to avoid changing
the value of the temperature threshold feature for specific devices that
show undesirable behavior.
Guenter reported:
"On my Intel NVME drive (SSDPEKKW512G7), writing any minimum limit on the
Composite temperature sensor results in a temperature warning, and that
warning is sticky until I reset the controller.
It doesn't seem to matter which temperature I write; writing -273000 has
the same result."
The Intel NVMe has the latest firmware version installed, so this isn't
a problem that was ever fixed.
Reported-by: Guenter Roeck <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Jean Delvare <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Signed-off-by: Akinobu Mita <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
According to the NVMe specification, the over temperature threshold and
under temperature threshold features shall be implemented for Composite
Temperature if a non-zero WCTEMP field value is reported in the Identify
Controller data structure. The features are also implemented for all
implemented temperature sensors (i.e., all Temperature Sensor fields that
report a non-zero value).
This provides the over temperature threshold and under temperature
threshold for each sensor as temperature min and max values of hwmon
sysfs attributes.
The WCTEMP is already provided as a temperature max value for Composite
Temperature, but this change is not incompatible, because the default
value of the over temperature threshold for Composite Temperature is
the WCTEMP.
Now the alarm attribute for Composite Temperature indicates that one of the
temperatures is outside of a temperature threshold, because there is only
a single bit in the Critical Warning field that indicates a temperature is
outside of a threshold.
Example output from the "sensors" command:
nvme-pci-0100
Adapter: PCI adapter
Composite: +33.9°C (low = -273.1°C, high = +69.8°C)
(crit = +79.8°C)
Sensor 1: +34.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +31.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 5: +47.9°C (low = -273.1°C, high = +65261.8°C)
This also adds helper macros for kelvin from/to milli Celsius conversion,
and replaces the repeated code in hwmon.c.
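For illustration, the conversion can be sketched with two small
functions. These are not the macros added by the patch; the demo_*
names and the round-to-nearest choice are assumptions. The only fact
relied on is that 0 degrees Celsius equals 273.15 kelvin.

```c
#include <assert.h>

/* 273.15 K expressed in milli-degrees. */
#define DEMO_KELVIN_OFFSET_MC 273150L

/* Whole kelvin to milli-degrees Celsius. */
static long demo_kelvin_to_millicelsius(long kelvin)
{
	assert(kelvin >= 0);
	return kelvin * 1000L - DEMO_KELVIN_OFFSET_MC;
}

/* Milli-degrees Celsius to whole kelvin, rounded to nearest. */
static long demo_millicelsius_to_kelvin(long mc)
{
	assert(mc >= -DEMO_KELVIN_OFFSET_MC);
	return (mc + DEMO_KELVIN_OFFSET_MC + 500) / 1000;
}
```

Under this sketch, 0 K maps to -273150 and 65535 K maps to 65261850
milli-degrees Celsius, which appears consistent with the low and high
limits in the sample "sensors" output above.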
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Jean Delvare <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Signed-off-by: Akinobu Mita <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Sagi and I have been pretty busy lately, and Chaitanya has been
helping a lot with target work and agreed to share the load.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
We really don't need this, as the slow path will do the right thing
anyway.
This reverts commit 6952a7f8446ee85ea9d10ab87b64797a031eaae3.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Requests that trigger flushing of the volatile writeback cache to disk
(barriers) have a significant effect on overall performance.
The block layer has a sophisticated engine for combining several flush
requests into one, but there are no statistics for the flushes actually
executed by the disk.
Requests which trigger flushes usually are barriers - zero-size writes.
This patch adds two iostat counters into /sys/class/block/$dev/stat and
/proc/diskstats - count of completed flush requests and their total time.
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add kernel AUX area sampling definitions, which brings perf_event.h into
line with the kernel version.
New sample type PERF_SAMPLE_AUX requests a sample of the AUX area
buffer. New perf_event_attr member 'aux_sample_size' specifies the
desired size of the sample.
Also add support for parsing samples containing AUX area data i.e.
PERF_SAMPLE_AUX.
Committer notes:
I squashed the first two patches in this series to avoid breaking
automatic bisection, i.e. after applying only the original first patch
in this series we would have:
# perf test -v parsing
26: Sample parsing :
--- start ---
test child forked, pid 17018
sample format has changed, some new PERF_SAMPLE_ bit was introduced - test needs updating
test child finished with -1
---- end ----
Sample parsing: FAILED!
#
With the two patches combined:
# perf test parsing
26: Sample parsing : Ok
#
Signed-off-by: Adrian Hunter <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Preparatory work for shattering mmu.c into multiple files. Besides making it easier
to follow, this will also make it possible to write unit tests for various parts.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
apic-access-page
According to Intel SDM section 28.3.3.3/28.3.3.4 Guidelines for Use
of the INVVPID/INVEPT Instruction, the hypervisor needs to execute
INVVPID/INVEPT X in case CPU executes VMEntry with VPID/EPTP X and
either: "Virtualize APIC accesses" VM-execution control was changed
from 0 to 1, OR the value of apic_access_page was changed.
In the nested case, the burden falls on L1, unless L0 enables EPT in
vmcs02 but L1 enables neither EPT nor VPID in vmcs12. For this reason
prepare_vmcs02() and load_vmcs12_host_state() have special code to
request a TLB flush in case L1 does not use EPT but it uses
"virtualize APIC accesses".
This special case however is not necessary. On a nested vmentry the
physical TLB will already be flushed except if all the following apply:
* L0 uses VPID
* L1 uses VPID
* L0 can guarantee TLB entries populated while running L1 are tagged
differently than TLB entries populated while running L2.
If the first condition is false, the processor will flush the TLB
on vmentry to L2. If the second or third condition is false,
prepare_vmcs02() will request KVM_REQ_TLB_FLUSH. However, even
if both are true, no extra TLB flush is needed to handle the APIC
access page:
* if L1 doesn't use VPID, the second condition doesn't hold and the
TLB will be flushed anyway.
* if L1 uses VPID, it has to flush the TLB itself with INVVPID and
section 28.3.3.3 doesn't apply to L0.
* even INVEPT is not needed because, if L0 uses EPT, it uses different
EPTP when running L2 than L1 (because guest_mode is part of mmu-role).
In this case SDM section 28.3.3.4 doesn't apply.
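The three conditions above reduce to a small predicate. The sketch
below only restates the reasoning of this changelog in C; struct
demo_nested_state and demo_tlb_flushed_on_nested_vmentry() are
hypothetical and do not correspond to KVM data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the three conditions listed above. */
struct demo_nested_state {
	bool l0_uses_vpid;
	bool l1_uses_vpid;
	bool distinct_l1_l2_tags;	/* L1 and L2 TLB entries are tagged differently */
};

/* True if the physical TLB is flushed anyway on a nested vmentry. */
static bool demo_tlb_flushed_on_nested_vmentry(const struct demo_nested_state *s)
{
	assert(s != NULL);

	if (!s->l0_uses_vpid)
		return true;	/* the processor flushes on vmentry to L2 */
	if (!s->l1_uses_vpid || !s->distinct_l1_l2_tags)
		return true;	/* a flush is requested via KVM_REQ_TLB_FLUSH */
	return false;		/* L1 is responsible for its own INVVPID */
}
```

Only the case where all three conditions hold skips the flush, and
that is the case in which L1 has to issue INVVPID itself.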
Similarly, examining nested_vmx_vmexit()->load_vmcs12_host_state(),
one could note that L0 won't flush TLB only in cases where SDM sections
28.3.3.3 and 28.3.3.4 don't apply. In particular, if L0 uses different
VPIDs for L1 and L2 (i.e. vmx->vpid != vmx->nested.vpid02), section
28.3.3.3 doesn't apply.
Thus, remove this flush from prepare_vmcs02() and nested_vmx_vmexit().
Side-note: This patch can be viewed as removing the parts of commit
fb6c81984313 ("kvm: vmx: Flush TLB when the APIC-access address changes")
that are not relevant anymore since commit
1313cc2bd8f6 ("kvm: mmu: Add guest_mode to kvm_mmu_page_role").
That is, the first commit assumes that if L0 uses EPT and L1 doesn't use
EPT, then L0 will use the same EPTP for both L0 and L1, which indeed
required L0 to execute INVEPT before entering the L2 guest. This
assumption has not been true since guest_mode was added to mmu-role.
Reviewed-by: Joao Martins <[email protected]>
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Fixes gcc '-Wunused-but-set-variable' warning:
arch/x86/kvm/x86.c: In function kvm_make_scan_ioapic_request_mask:
arch/x86/kvm/x86.c:7911:7: warning: variable called set but not
used [-Wunused-but-set-variable]
It is not used since commit 7ee30bc132c6 ("KVM: x86: deliver KVM
IOAPIC scan request to target vCPUs")
Signed-off-by: Mao Wenan <[email protected]>
Fixes: 7ee30bc132c6 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
vmcs->apic_access_page is simply a token that the hypervisor puts into
the PFN of a 4KB EPTE (or PTE if using shadow-paging) that triggers
APIC-access VMExit or APIC virtualization logic whenever a CPU running
in VMX non-root mode reads from or writes to this PFN.
As every write either triggers an APIC-access VMExit or is performed
on vmcs->virtual_apic_page, the PFN pointed to by
vmcs->apic_access_page should never actually be touched by the CPU.
Therefore, there is no need to mark vmcs02->apic_access_page as dirty
after unpinning it on L2->L1 emulated VMExit or when L1 exits VMX operation.
Reviewed-by: Krish Sadhukhan <[email protected]>
Reviewed-by: Joao Martins <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Conflicts:
arch/x86/kvm/vmx/vmx.c
|
|
If X86_FEATURE_RTM is disabled, the guest should not be able to access
MSR_IA32_TSX_CTRL. We can therefore use it in KVM to force all
transactions from the guest to abort.
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The current guest mitigation of TAA is both too heavy and not really
sufficient. It is too heavy because it will cause some affected CPUs
(those that have MDS_NO but lack TAA_NO) to fall back to VERW and
get the corresponding slowdown. It is not really sufficient because
it will cause the MDS_NO bit to disappear upon microcode update, so
that VMs started before the microcode update will not be runnable
anymore afterwards, even with tsx=on.
Instead, if tsx=on on the host, we can emulate MSR_IA32_TSX_CTRL for
the guest and let it run without the VERW mitigation. Even though
MSR_IA32_TSX_CTRL is quite heavyweight, and we do not want to write
it on every vmentry, we can use the shared MSR functionality because
the host kernel need not protect itself from TSX-based side-channels.
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Because KVM always emulates CPUID, the CPUID clear bit
(bit 1) of MSR_IA32_TSX_CTRL must be emulated "manually"
by the hypervisor when performing said emulation.
Right now neither kvm-intel.ko nor kvm-amd.ko implement
MSR_IA32_TSX_CTRL but this will change in the next patch.
Reviewed-by: Jim Mattson <[email protected]>
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
"Shared MSRs" are guest MSRs that are written to the host MSRs but
keep their value until the next return to userspace. They support
a mask, so that some bits keep the host value, but this mask is
only used to skip an unnecessary MSR write and the value written
to the MSR is always the guest MSR.
Fix this and, while at it, do not update smsr->values[slot].curr if
for whatever reason the wrmsr fails. This should only happen due to
reserved bits, so the value written to smsr->values[slot].curr
will not match when the user-return notifier runs and the host value
will always be restored. However, it is untidy and in rare cases this
can actually avoid spurious WRMSRs on return to userspace.
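A self-contained sketch of the intended behavior: bits selected by the
mask come from the guest value, the remaining bits keep the host value,
and the cached current value is only updated when the write succeeds.
Everything here is hypothetical (demo_wrmsr_safe() is a fake register
write that rejects a made-up set of reserved bits); it is not the KVM
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_shared_msr {
	uint64_t host;	/* value the host wants restored */
	uint64_t curr;	/* value currently loaded in the MSR */
};

#define DEMO_RESERVED_BITS 0xff00000000000000ULL

static uint64_t demo_hw_msr;	/* stands in for the hardware register */
static int demo_wrmsr_calls;	/* counts attempted writes */

/* Fake wrmsr: returns nonzero and leaves the register alone on reserved bits. */
static int demo_wrmsr_safe(uint64_t value)
{
	demo_wrmsr_calls++;
	if (value & DEMO_RESERVED_BITS)
		return 1;
	demo_hw_msr = value;
	return 0;
}

static int demo_set_shared_msr(struct demo_shared_msr *m, uint64_t guest,
			       uint64_t mask)
{
	uint64_t value;

	assert(m != NULL);

	/* Masked bits come from the guest; all other bits keep the host value. */
	value = (guest & mask) | (m->host & ~mask);
	if (value == m->curr)
		return 0;	/* nothing to write */

	if (demo_wrmsr_safe(value))
		return 1;	/* write failed: leave curr untouched */

	m->curr = value;
	return 0;
}
```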
Cc: [email protected]
Reviewed-by: Jim Mattson <[email protected]>
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
KVM does not implement MSR_IA32_TSX_CTRL, so it must not be presented
to the guests. It is also confusing to have !ARCH_CAP_TSX_CTRL_MSR &&
!RTM && ARCH_CAP_TAA_NO: lack of MSR_IA32_TSX_CTRL suggests TSX was not
hidden (it actually was), yet the value says that TSX is not vulnerable
to microarchitectural data sampling. Fix both.
Cc: [email protected]
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|