Age | Commit message (Collapse) | Author | Files | Lines |
|
Conflicts:
arch/x86/kvm/vmx/vmx.c
|
|
If X86_FEATURE_RTM is disabled, the guest should not be able to access
MSR_IA32_TSX_CTRL. We can therefore use it in KVM to force all
transactions from the guest to abort.
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The current guest mitigation of TAA is both too heavy and not really
sufficient. It is too heavy because it will cause some affected CPUs
(those that have MDS_NO but lack TAA_NO) to fall back to VERW and
get the corresponding slowdown. It is not really sufficient because
it will cause the MDS_NO bit to disappear upon microcode update, so
that VMs started before the microcode update will not be runnable
anymore afterwards, even with tsx=on.
Instead, if tsx=on on the host, we can emulate MSR_IA32_TSX_CTRL for
the guest and let it run without the VERW mitigation. Even though
MSR_IA32_TSX_CTRL is quite heavyweight, and we do not want to write
it on every vmentry, we can use the shared MSR functionality because
the host kernel need not protect itself from TSX-based side-channels.
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Because KVM always emulates CPUID, the CPUID clear bit
(bit 1) of MSR_IA32_TSX_CTRL must be emulated "manually"
by the hypervisor when performing said emulation.
Right now neither kvm-intel.ko nor kvm-amd.ko implement
MSR_IA32_TSX_CTRL but this will change in the next patch.
Reviewed-by: Jim Mattson <[email protected]>
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
"Shared MSRs" are guest MSRs that are written to the host MSRs but
keep their value until the next return to userspace. They support
a mask, so that some bits keep the host value, but this mask is
only used to skip an unnecessary MSR write and the value written
to the MSR is always the guest MSR.
Fix this and, while at it, do not update smsr->values[slot].curr if
for whatever reason the wrmsr fails. This should only happen due to
reserved bits, so the value written to smsr->values[slot].curr
will not match when the user-return notifier and the host value will
always be restored. However, it is untidy and in rare cases this
can actually avoid spurious WRMSRs on return to userspace.
Cc: [email protected]
Reviewed-by: Jim Mattson <[email protected]>
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
KVM does not implement MSR_IA32_TSX_CTRL, so it must not be presented
to the guests. It is also confusing to have !ARCH_CAP_TSX_CTRL_MSR &&
!RTM && ARCH_CAP_TAA_NO: lack of MSR_IA32_TSX_CTRL suggests TSX was not
hidden (it actually was), yet the value says that TSX is not vulnerable
to microarchitectural data sampling. Fix both.
Cc: [email protected]
Tested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm updates for Linux 5.5:
- Allow non-ISV data aborts to be reported to userspace
- Allow injection of data aborts from userspace
- Expose stolen time to guests
- GICv4 performance improvements
- vgic ITS emulation fixes
- Simplify FWB handling
- Enable halt pool counters
- Make the emulated timer PREEMPT_RT compliant
Conflicts:
include/uapi/linux/kvm.h
|
|
fbdev uses the physical address of our framebuffer for its fb_mmap()
routine. While we need to adapt this address for the new io BAR, we have
to fix v5.4 first! The simplest fix is to restore the smem back to v5.3
and we will then probably have to implement our fbops->fb_mmap() callback
to handle local memory.
Reported-by: Neil MacLeod <[email protected]>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=112256
Fixes: 5f889b9a61dd ("drm/i915: Disregard drm_mode_config.fb_base")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Tested-by: Neil MacLeod <[email protected]>
Reviewed-by: Ville Syrjälä <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit abc5520704ab438099fe352636b30b05c1253bea)
Signed-off-by: Joonas Lahtinen <[email protected]>
(cherry picked from commit 9faf5fa4d3dad3b0c0fa6e67689c144981a11c27)
Signed-off-by: Rodrigo Vivi <[email protected]>
|
|
kobject_put() should only be called in error path.
Fixes: b8eb718348b8 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Jouni Hogander <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
We need to check the host page size is big enough to accomodate the
EQ. Let's do this before taking a reference on the EQ page to avoid
a potential leak if the check fails.
Cc: [email protected] # v5.2
Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
Signed-off-by: Greg Kurz <[email protected]>
Reviewed-by: Cédric Le Goater <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
The EQ page is allocated by the guest and then passed to the hypervisor
with the H_INT_SET_QUEUE_CONFIG hcall. A reference is taken on the page
before handing it over to the HW. This reference is dropped either when
the guest issues the H_INT_RESET hcall or when the KVM device is released.
But, the guest can legitimately call H_INT_SET_QUEUE_CONFIG several times,
either to reset the EQ (vCPU hot unplug) or to set a new EQ (guest reboot).
In both cases the existing EQ page reference is leaked because we simply
overwrite it in the XIVE queue structure without calling put_page().
This is especially visible when the guest memory is backed with huge pages:
start a VM up to the guest userspace, either reboot it or unplug a vCPU,
quit QEMU. The leak is observed by comparing the value of HugePages_Free in
/proc/meminfo before and after the VM is run.
Ideally we'd want the XIVE code to handle the EQ page de-allocation at the
platform level. This isn't the case right now because the various XIVE
drivers have different allocation needs. It could maybe worth introducing
hooks for this purpose instead of exposing XIVE internals to the drivers,
but this is certainly a huge work to be done later.
In the meantime, for easier backport, fix both vCPU unplug and guest reboot
leaks by introducing a wrapper around xive_native_configure_queue() that
does the necessary cleanup.
Reported-by: Satheesh Rajendran <[email protected]>
Cc: [email protected] # v5.2
Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
Signed-off-by: Cédric Le Goater <[email protected]>
Signed-off-by: Greg Kurz <[email protected]>
Tested-by: Lijun Pan <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
git://people.freedesktop.org/~agd5f/linux into drm-fixes
drm-fixes-5.4-2019-11-20:
amdgpu:
- Remove experimental flag for navi14
- Fix confusing power message failures on older VI parts
- Hang fix for gfxoff when using the read register interface
- Two stability regression fixes for Raven
Signed-off-by: Dave Airlie <[email protected]>
From: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
This reverts commit 1c4259159132ae4ceaf7c6db37a6cf76417f73d9.
S/G display is not stable with the IOMMU enabled on some
platforms.
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=205523
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
|
|
There are still combinations of sbios and firmware that
are not stable.
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=204689
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
|
|
When gfxoff is enabled, accessing gfx registers via MMIO
can lead to a hang.
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=205497
Acked-by: Xiaojie Yuan <[email protected]>
Reviewed-by: Evan Quan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
|
|
For fine grained dpm, there is only two levels supported. However
to reflect correctly the current clock frequency, there is an
intermediate level faked. Thus on forcing level setting, we
need to treat level 2 correctly as level 1.
Signed-off-by: Evan Quan <[email protected]>
Reviewed-by: Kevin Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Otherwise, the error message prompted will confuse user.
Signed-off-by: Evan Quan <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected]
|
|
5.4 and newer works fine with navi14.
Reviewed-by: Xiaojie Yuan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
In most cases blk_tracing is not active, but bfq_log_bfqq macro
generate pid_str unconditionally, which result in significant overhead.
## Test
modprobe null_blk
echo bfq > /sys/block/nullb0/queue/scheduler
fio --name=t --ioengine=libaio --direct=1 --filename=/dev/nullb0 \
--runtime=30 --time_based=1 --rw=write --iodepth=128 --bs=4k
# Results
| | baseline | w/ patch | gain |
| iops | 113.19K | 126.42K | +11% |
Acked-by: Paolo Valente <[email protected]>
Signed-off-by: Dmitry Monakhov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This reverts commit a1b89132dc4f61071bdeaab92ea958e0953380a1.
Revert required hand-patching due to subsequent changes that were
applied since commit a1b89132dc4f61071bdeaab92ea958e0953380a1.
Requires: ed0302e83098d ("dm crypt: make workqueue names device-specific")
Cc: [email protected]
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=199857
Reported-by: Vito Caputo <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2019-11-20
This series introduces some fixes to mlx5 driver.
Please pull and let me know if there is any problem.
For -stable v4.9:
('net/mlx5e: Fix set vf link state error flow')
For -stable v4.14
('net/mlxfw: Verify FSM error code translation doesn't exceed array size')
For -stable v4.19
('net/mlx5: Fix auto group size calculation')
For -stable v5.3
('net/mlx5e: Fix error flow cleanup in mlx5e_tc_tun_create_header_ipv4/6')
('net/mlx5e: Do not use non-EXT link modes in EXT mode')
('net/mlx5: Update the list of the PCI supported devices')
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
Both rtl_work_func_t() and rtl8152_close() call napi_disable().
Since the two calls aren't protected by a lock, if the close
function starts executing before the work function, we can get into a
situation where the napi_disable() function is called twice in
succession (first by rtl8152_close(), then by set_carrier()).
In such a situation, the second call would loop indefinitely, since
rtl8152_close() doesn't call napi_enable() to clear the NAPI_STATE_SCHED
bit.
The rtl8152_close() function in turn issues a
cancel_delayed_work_sync(), and so it would wait indefinitely for the
rtl_work_func_t() to complete. Since rtl8152_close() is called by a
process holding rtnl_lock() which is requested by other processes, this
eventually leads to a system deadlock and crash.
Re-order the napi_disable() call to occur after the work function
disabling and urb cancellation calls are issued.
Change-Id: I6ef0b703fc214998a037a68f722f784e1d07815e
Reported-by: http://crbug.com/1017928
Signed-off-by: Prashant Malani <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Stefan Wahren says:
====================
net: qca_spi: Fix receive and reset issues
This small patch series fixes two major issues in the SPI driver for the
QCA700x.
It has been tested on a Charge Control C 300 (NXP i.MX6ULL +
2x QCA7000).
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
The reset counter is specific for every QCA700x chip. So move this
into the private driver struct. Otherwise we get unpredictable reset
behavior in setups with multiple QCA700x chips.
Fixes: 291ab06ecf67 (net: qualcomm: new Ethernet over SPI driver for QCA7000)
Signed-off-by: Stefan Wahren <[email protected]>
Signed-off-by: Stefan Wahren <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
When receiving many or larger packets, e.g. when doing a file download,
it was observed that the read buffer size register reports up to 4 bytes
more than the current define allows in the check.
If this is the case, then no data transfer is initiated to receive the
packets (and thus to empty the buffer) which results in a stall of the
interface.
These 4 bytes are a hardware generated frame length which is prepended
to the actual frame, thus we have to respect it during our check.
Fixes: 026b907d58c4 ("net: qca_spi: Add available buffer space verification")
Signed-off-by: Michael Heimpold <[email protected]>
Signed-off-by: Stefan Wahren <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Juliet Kim says:
====================
Support both XIVE and XICS modes in ibmvnic
This series aims to support both XICS and XIVE with avoiding
a regression in behavior when a system runs in XICS mode.
Patch 1 reverts commit 11d49ce9f7946dfed4dcf5dbde865c78058b50ab
(“net/ibmvnic: Fix EOI when running in XIVE mode.”)
Patch 2 Ignore H_FUNCTION return from H_EOI to tolerate XIVE mode
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
Reversion of commit 11d49ce9f7946dfed4dcf5dbde865c78058b50ab
(“net/ibmvnic: Fix EOI when running in XIVE mode.”) leaves us
calling H_EOI even in XIVE mode. That will fail with H_FUNCTION
because H_EOI is not supported in that mode. That failure is
harmless. Ignore it so we can use common code for both XICS and
XIVE.
Signed-off-by: Juliet Kim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This reverts commit 11d49ce9f7946dfed4dcf5dbde865c78058b50ab
(“net/ibmvnic: Fix EOI when running in XIVE mode.”) since that
has the unintended effect of changing the interrupt priority
and emits warning when running in legacy XICS mode.
Signed-off-by: Juliet Kim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Array mlxfw_fsm_state_err_str contains value to string translation, when
values are provided by mlxfw_dev. If value is larger than
MLXFW_FSM_STATE_ERR_MAX, return "unknown error" as expected instead of
reading an address than exceed array size.
Fixes: 410ed13cae39 ("Add the mlxfw module for Mellanox firmware flash process")
Signed-off-by: Eran Ben Elisha <[email protected]>
Acked-by: Jiri Pirko <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Add the upcoming ConnectX-6 LX device ID.
Fixes: 85327a9c4150 ("net/mlx5: Update the list of the PCI supported devices")
Signed-off-by: Shani Shapp <[email protected]>
Reviewed-by: Eran Ben Elisha <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Once all the large flow groups (defined by the user when the flow table
is created - max_num_groups) were created, then all the following new
flow groups will have only one flow table entry, even though the flow table
has place to larger groups.
Fix the condition to prefer large flow group.
Fixes: f0d22d187473 ("net/mlx5_core: Introduce flow steering autogrouped flow table")
Signed-off-by: Maor Gottlieb <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Device that doesn't support IP-in-IP offloads has to filter csum and gso
offload support, otherwise kernel will conclude that device is capable of
offloading csum and gso for IP-in-IP tunnels and that might result in
IP-in-IP tunnel not functioning.
Fixes: 25948b87dda2 ("net/mlx5e: Support TSO and TX checksum offloads for IP-in-IP")
Signed-off-by: Marina Varshaver <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
On some old Firmwares, connector type value was not supported, and value
read from FW was 0. For those, driver used link mode in order to set
connector type in link_ksetting.
After FW exposed the connector type, driver translated the value to ethtool
definitions. However, as 0 is a valid value, before returning PORT_OTHER,
driver run the check of link mode in order to maintain backward
compatibility.
Cited patch added support to EXT mode. With both features (connector type
and EXT link modes) ,if connector_type read from FW is 0 and EXT mode is
set, driver mistakenly compare EXT link modes to non-EXT link mode.
Fixed that by skipping this comparison if we are in EXT mode, as connector
type value is valid in this scenario.
Fixes: 6a897372417e ("net/mlx5: ethtool, Add ethtool support for 50Gbps per lane link modes")
Signed-off-by: Eran Ben Elisha <[email protected]>
Reviewed-by: Aya Levin <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Before this commit the ndo always returned success.
Fix that.
Fixes: 1ab2068a4c66 ("net/mlx5: Implement vports admin state backup/restore")
Signed-off-by: Roi Dayan <[email protected]>
Reviewed-by: Vlad Buslov <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
When an ste hash table has too many collision we enlarge it
to a bigger hash table (rehash). Rehashing collision improvement
depends on the bytemask value. The more 1 bits we have in bytemask
means better spreading in the table.
Without this fix tables can grow in size without providing any
improvement which can lead to memory depletion and failures.
This patch will limit table rehash to reduce memory and improve
the performance.
Fixes: 41d07074154c ("net/mlx5: DR, Expose steering rule functionality")
Signed-off-by: Alex Vesker <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
The byte mask fields affect on the hash index distribution,
when the byte mask is zero, the hash calculation will always
be equal to the same index.
To avoid unneeded rehash of hash tables mark the table to skip
rehash.
This is needed by the next patch which will limit table rehash
to reduce memory consumption.
Fixes: 41d07074154c ("net/mlx5: DR, Expose steering rule functionality")
Signed-off-by: Alex Vesker <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
When creating a CQ, the CPU id is used for the vector value.
This would fail in-case the CPU id was higher than the maximum
vector value.
Fixes: 297cccebdc5a ("net/mlx5: DR, Expose an internal API to issue RDMA operations")
Signed-off-by: Alex Vesker <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Reviewed-by: Erez Shitrit <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Mirred action parsing code in parse_tc_fdb_actions() first checks if
out_dev has same parent id, and only verifies that there is a pending encap
action that was parsed before. Recent change in vxlan module made function
netdev_port_same_parent_id() to return true when called for mlx5 eswitch
representor and vxlan device created explicitly on mlx5 representor
device (vxlan devices created with "external" flag without explicitly
specifying parent interface are not affected). With call to
netdev_port_same_parent_id() returning true, incorrect code path is chosen
and encap rules fail to offload because vxlan dev is not a valid eswitch
forwarding dev. Dmesg log of error:
[ 1784.389797] devices ens1f0_0 vxlan1 not on same switch HW, can't offload forwarding
In order to fix the issue, rearrange conditional in parse_tc_fdb_actions()
to check for pending encap action before checking if out_dev has the same
parent id.
Fixes: 0ce1822c2a08 ("vxlan: add adjacent link to limit depth level")
Signed-off-by: Vlad Buslov <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Current code uses the old method of prio encoding in
flow_cls_common_offload. Fix to follow the changes introduced in
commit ef01adae0e43 ("net: sched: use major priority number as hardware priority").
Fixes: fcb64c0f5640 ("net/mlx5: E-Switch, add ingress rate support")
Signed-off-by: Eli Cohen <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Be sure to release the neighbour in case of failures after successful
route lookup.
Fixes: 101f4de9dd52 ("net/mlx5e: Move TC tunnel offloading code to separate source file")
Signed-off-by: Eli Cohen <[email protected]>
Reviewed-by: Roi Dayan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Julian Wiedmann says:
====================
s390/qeth: fixes 2019-11-20
please apply two late qeth fixes to your net tree.
The first fixes a deadlock that can occur if a qeth device is set
offline while in the middle of processing deferred HW events.
The second patch converts the return value of an error path to
use -EIO, so that it can be passed back to userspace.
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
When propagating IO errors back to userspace, one error path in
qeth_irq() currently returns '1' instead of a proper errno.
Fixes: 54daaca7024d ("s390/qeth: cancel cmd on early error")
Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The L2 bridgeport code uses the coarse 'conf_mutex' for guarding access
to its configuration state.
This can result in a deadlock when qeth_l2_stop_card() - called under the
conf_mutex - blocks on flush_workqueue() to wait for the completion of
pending bridgeport workers. Such workers would also need to aquire
the conf_mutex, stalling indefinitely.
Introduce a lock that specifically guards the bridgeport configuration,
so that the workers no longer need the conf_mutex.
Wrapping qeth_l2_promisc_to_bridge() in this fine-grained lock then also
fixes a theoretical race against a concurrent qeth_bridge_port_role_store()
operation.
Fixes: c0a2e4d10d93 ("s390/qeth: conclude all event processing before offlining a card")
Signed-off-by: Julian Wiedmann <[email protected]>
Reviewed-by: Alexandra Winter <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Previously we will return directly if (!rt || !rt->fib6_nh.fib_nh_gw_family)
in function rt6_probe(), but after commit cc3a86c802f0
("ipv6: Change rt6_probe to take a fib6_nh"), the logic changed to
return if there is fib_nh_gw_family.
Fixes: cc3a86c802f0 ("ipv6: Change rt6_probe to take a fib6_nh")
Signed-off-by: Hangbin Liu <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
kobject_init_and_add takes reference even when it fails. This has
to be given up by the caller in error handling. Otherwise memory
allocated by kobject_init_and_add is never freed. Originally found
by Syzkaller:
BUG: memory leak
unreferenced object 0xffff8880679f8b08 (size 8):
comm "netdev_register", pid 269, jiffies 4294693094 (age 12.132s)
hex dump (first 8 bytes):
72 78 2d 30 00 36 20 d4 rx-0.6 .
backtrace:
[<000000008c93818e>] __kmalloc_track_caller+0x16e/0x290
[<000000001f2e4e49>] kvasprintf+0xb1/0x140
[<000000007f313394>] kvasprintf_const+0x56/0x160
[<00000000aeca11c8>] kobject_set_name_vargs+0x5b/0x140
[<0000000073a0367c>] kobject_init_and_add+0xd8/0x170
[<0000000088838e4b>] net_rx_queue_update_kobjects+0x152/0x560
[<000000006be5f104>] netdev_register_kobject+0x210/0x380
[<00000000e31dab9d>] register_netdevice+0xa1b/0xf00
[<00000000f68b2465>] __tun_chr_ioctl+0x20d5/0x3dd0
[<000000004c50599f>] tun_chr_ioctl+0x2f/0x40
[<00000000bbd4c317>] do_vfs_ioctl+0x1c7/0x1510
[<00000000d4c59e8f>] ksys_ioctl+0x99/0xb0
[<00000000946aea81>] __x64_sys_ioctl+0x78/0xb0
[<0000000038d946e5>] do_syscall_64+0x16f/0x580
[<00000000e0aa5d8f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<00000000285b3d1a>] 0xffffffffffffffff
Cc: David Miller <[email protected]>
Cc: Lukas Bulwahn <[email protected]>
Signed-off-by: Jouni Hogander <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
It is safer and simpler to drop the uaccess assembly macros in favour of
inline C functions. Although this bloats the Image size slightly, it
aligns our user copy routines with '{get,put}_user()' and generally
makes the code a lot easier to reason about.
Cc: Catalin Marinas <[email protected]>
Reviewed-by: Mark Rutland <[email protected]>
Tested-by: Mark Rutland <[email protected]>
Signed-off-by: Pavel Tatashin <[email protected]>
[will: tweaked commit message and changed temporary variable names]
Signed-off-by: Will Deacon <[email protected]>
|
|
A number of our uaccess routines ('__arch_clear_user()' and
'__arch_copy_{in,from,to}_user()') fail to re-enable PAN if they
encounter an unhandled fault whilst accessing userspace.
For CPUs implementing both hardware PAN and UAO, this bug has no effect
when both extensions are in use by the kernel.
For CPUs implementing hardware PAN but not UAO, this means that a kernel
using hardware PAN may execute portions of code with PAN inadvertently
disabled, opening us up to potential security vulnerabilities that rely
on userspace access from within the kernel which would usually be
prevented by this mechanism. In other words, parts of the kernel run the
same way as they would on a CPU without PAN implemented/emulated at all.
For CPUs not implementing hardware PAN and instead relying on software
emulation via 'CONFIG_ARM64_SW_TTBR0_PAN=y', the impact is unfortunately
much worse. Calling 'schedule()' with software PAN disabled means that
the next task will execute in the kernel using the page-table and ASID
of the previous process even after 'switch_mm()', since the actual
hardware switch is deferred until return to userspace. At this point, or
if there is a intermediate call to 'uaccess_enable()', the page-table
and ASID of the new process are installed. Sadly, due to the changes
introduced by KPTI, this is not an atomic operation and there is a very
small window (two instructions) where the CPU is configured with the
page-table of the old task and the ASID of the new task; a speculative
access in this state is disastrous because it would corrupt the TLB
entries for the new task with mappings from the previous address space.
As Pavel explains:
| I was able to reproduce memory corruption problem on Broadcom's SoC
| ARMv8-A like this:
|
| Enable software perf-events with PERF_SAMPLE_CALLCHAIN so userland's
| stack is accessed and copied.
|
| The test program performed the following on every CPU and forking
| many processes:
|
| unsigned long *map = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
| MAP_SHARED | MAP_ANONYMOUS, -1, 0);
| map[0] = getpid();
| sched_yield();
| if (map[0] != getpid()) {
| fprintf(stderr, "Corruption detected!");
| }
| munmap(map, PAGE_SIZE);
|
| From time to time I was getting map[0] to contain pid for a
| different process.
Ensure that PAN is re-enabled when returning after an unhandled user
fault from our uaccess routines.
Cc: Catalin Marinas <[email protected]>
Reviewed-by: Mark Rutland <[email protected]>
Tested-by: Mark Rutland <[email protected]>
Cc: <[email protected]>
Fixes: 338d4f49d6f7 ("arm64: kernel: Add support for Privileged Access Never")
Signed-off-by: Pavel Tatashin <[email protected]>
[will: rewrote commit message]
Signed-off-by: Will Deacon <[email protected]>
|
|
Linux-next commit titled "perf/core: Optimize perf_init_event()"
changed the semantics of PMU device driver registration.
It was done to speed up the lookup/handling of PMU device driver
specific events. It also enforces that only one PMU device
driver will be registered of type PERF_EVENT_RAW.
This change added these line in function perf_pmu_register():
...
+ ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
+ if (ret < 0)
goto free_pdc;
+
+ WARN_ON(type >= 0 && ret != type);
The warn_on generates a message. We have 3 PMU device drivers,
each registered as type PERF_TYPE_RAW.
The cf_diag device driver (arch/s390/kernel/perf_cpumf_cf_diag.c)
always hits the WARN_ON because it is the second PMU device driver
(after sampling device driver arch/s390/kernel/perf_cpumf_sf.c)
which is registered as type 4 (PERF_TYPE_RAW).
So when the sampling device driver is registered, ret has value 4.
When cf_diag device driver is registered with type 4,
ret has value of 5 and WARN_ON fires.
Adjust the PMU device drivers for s390 to support the new
semantics required by perf_pmu_register().
Signed-off-by: Thomas Richter <[email protected]>
Signed-off-by: Vasily Gorbik <[email protected]>
|
|
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ /\t/' -i */Kconfig
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Since commit 1313cc2bd8f6 ("kvm: mmu: Add guest_mode to kvm_mmu_page_role"),
guest_mode was added to mmu-role and therefore if L0 use EPT, it will
always run L1 and L2 with different EPTP. i.e. EPTP01!=EPTP02.
Because TLB entries are tagged with EP4TA, KVM can assume
TLB entries populated while running L2 are tagged differently
than TLB entries populated while running L1.
Therefore, update nested_has_guest_tlb_tag() to consider if
L0 use EPT instead of if L1 use EPT.
Reviewed-by: Joao Martins <[email protected]>
Reviewed-by: Krish Sadhukhan <[email protected]>
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|