aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2019-11-21Merge tag 'arm64-fixes' of ↵Linus Torvalds7-31/+27
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fix from Will Deacon: "Ensure PAN is re-enabled following user fault in uaccess routines. After I thought we were done for 5.4, we had a report this week of a nasty issue that has been shown to leak data between different user address spaces thanks to corruption of entries in the TLB. In hindsight, we should have spotted this in review when the PAN code was merged back in v4.3, but hindsight is 20/20 and I'm trying not to beat myself up too much about it despite being fairly miserable. Anyway, the fix is "obvious" but the actual failure is more more subtle, and is described in the commit message. I've included a fairly mechanical follow-up patch here as well, which moves this checking out into the C wrappers which is what we do for {get,put}_user() already and allows us to remove these bloody assembly macros entirely. The patches have passed kernelci [1] [2] [3] and CKI [4] tests over night, as well as some targetted testing [5] for this particular issue. The first patch is tagged for stable and should be applied to 4.14, 4.19 and 5.3. I have separate backports for 4.4 and 4.9, which I'll send out once this has landed in your tree (although the original patch applies cleanly, it won't build for those two trees). Thanks to Pavel Tatashin for reporting this and Mark Rutland for helping to diagnose the issue and review/test the solution" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: uaccess: Remove uaccess_*_not_uao asm macros arm64: uaccess: Ensure PAN is re-enabled after unhandled uaccess fault
2019-11-21audit: Move audit_log_task declaration under CONFIG_AUDITSYSCALLJiri Olsa1-3/+5
The 0-DAY found that audit_log_task is not declared under CONFIG_AUDITSYSCALL which causes compilation error when it is not defined: kernel/bpf/syscall.o: In function `bpf_audit_prog.isra.30': >> syscall.c:(.text+0x860): undefined reference to `audit_log_task' Adding the audit_log_task declaration and stub within CONFIG_AUDITSYSCALL ifdef. Fixes: 91e6015b082b ("bpf: Emit audit messages upon successful prog load and unload") Reported-by: kbuild test robot <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21sfc: Only cancel the PPS workqueue if it existsMartin Habets1-1/+2
The workqueue only exists for the primary PF. For other functions we hit a WARN_ON in kernel/workqueue.c. Fixes: 7c236c43b838 ("sfc: Add support for IEEE-1588 PTP") Signed-off-by: Martin Habets <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21Merge tag 'for-linus-20191121' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+1
Pull block fix from Jens Axboe: "Just a single fix for an issue in nbd introduced in this cycle" * tag 'for-linus-20191121' of git://git.kernel.dk/linux-block: nbd:fix memory leak in nbd_get_socket()
2019-11-21Merge tag 'gpio-v5.4-5' of ↵Linus Torvalds5-9/+31
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO fixes from Linus Walleij: "A last set of small fixes for GPIO, this cycle was quite busy. - Fix debounce delays on the MAX77620 GPIO expander - Use the correct unit for debounce times on the BD70528 GPIO expander - Get proper deps for parallel builds of the GPIO tools - Add a specific ACPI quirk for the Terra Pad 1061" * tag 'gpio-v5.4-5' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: gpiolib: acpi: Add Terra Pad 1061 to the run_edge_events_on_boot_blacklist tools: gpio: Correctly add make dependencies for gpio_utils gpio: bd70528: Use correct unit for debounce times gpio: max77620: Fixup debounce delays
2019-11-21net: Fix Kconfig indentation, continuedKrzysztof Kozlowski5-148/+148
Adjust indentation from spaces to tab (+optional two spaces) as in coding style. This fixes various indentation mixups (seven spaces, tab+one space, etc). Signed-off-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21drivers: net: Fix Kconfig indentation, continuedKrzysztof Kozlowski10-149/+149
Adjust indentation from spaces to tab (+optional two spaces) as in coding style. This fixes various indentation mixups (seven spaces, tab+one space, etc). Signed-off-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21Merge tag 'for-linus-2019-11-21' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd fixlet from Christian Brauner: "This contains a simple fix for the pidfd poll method. In the original patchset pidfd_poll() was made to return an unsigned int. However, the poll method is defined to return a __poll_t. While the unsigned int is not a huge deal it's just nicer to return a __poll_t. I've decided to send it right before the 5.4 release mainly so that stable doesn't need to backport it to both 5.4 and 5.3" * tag 'for-linus-2019-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fork: fix pidfd_poll()'s return type
2019-11-21nfc: port100: handle command failure cleanlyOliver Neukum1-1/+1
If starting the transfer of a command suceeds but the transfer for the reply fails, it is not enough to initiate killing the transfer for the command may still be running. You need to wait for the killing to finish before you can reuse URB and buffer. Reported-and-tested-by: [email protected] Signed-off-by: Oliver Neukum <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21lwtunnel: check erspan options before allocating tun_infoXin Long1-8/+16
As Jakub suggested on another patch, it's better to do the check on erspan options before allocating memory. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21lwtunnel: be STRICT to validate the new LWTUNNEL_IP(6)_OPTSXin Long1-0/+3
LWTUNNEL_IP(6)_OPTS are the new items in ip(6)_tun_policy, which are parsed by nla_parse_nested_deprecated(). We should check it strictly by setting .strict_start_type = LWTUNNEL_IP(6)_OPTS. This patch also adds missing LWTUNNEL_IP6_OPTS in ip6_tun_policy. Fixes: 4ece47787077 ("lwtunnel: add options setting and dumping for geneve") Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21net: remove the unnecessary strict_start_type in some policiesXin Long3-3/+0
ct_policy and mpls_policy are parsed with nla_parse_nested(), which does NL_VALIDATE_STRICT validation, strict_start_type is not needed to set as it is actually trying to make some attributes parsed with NL_VALIDATE_STRICT. This patch is to remove it, and do the same on rtm_nh_policy which is parsed by nlmsg_parse(). Suggested-by: Jakub Kicinski <[email protected]> Signed-off-by: Xin Long <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21Merge branch 'net-sched-support-vxlan-and-erspan-options'David S. Miller4-1/+514
Xin Long says: ==================== net: sched: support vxlan and erspan options This patchset is to add vxlan and erspan options support in cls_flower and act_tunnel_key. The form is pretty much like geneve_opts in: https://patchwork.ozlabs.org/patch/935272/ https://patchwork.ozlabs.org/patch/954564/ but only one option is allowed for vxlan and erspan. v1->v2: - see each patch changelog. ==================== Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21net: sched: allow flower to match erspan optionsXin Long2-0/+161
This patch is to allow matching options in erspan. The options can be described in the form: VER:INDEX:DIR:HWID/VER:INDEX_MASK:DIR_MASK:HWID_MASK. When ver is set to 1, index will be applied while dir and hwid will be ignored, and when ver is set to 2, dir and hwid will be used while index will be ignored. Different from geneve, only one option can be set. And also, geneve options, vxlan options or erspan options can't be set at the same time. # ip link add name erspan1 type erspan external # tc qdisc add dev erspan1 ingress # tc filter add dev erspan1 protocol ip parent ffff: \ flower \ enc_src_ip 10.0.99.192 \ enc_dst_ip 10.0.99.193 \ enc_key_id 11 \ erspan_opts 1:12:0:0/1:ffff:0:0 \ ip_proto udp \ action mirred egress redirect dev eth0 v1->v2: - improve some err msgs of extack. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21net: sched: allow flower to match vxlan optionsXin Long2-0/+122
This patch is to allow matching gbp option in vxlan. The options can be described in the form GBP/GBP_MASK, where GBP is represented as a 32bit hexadecimal value. Different from geneve, only one option can be set. And also, geneve options and vxlan options can't be set at the same time. # ip link add name vxlan0 type vxlan dstport 0 external # tc qdisc add dev vxlan0 ingress # tc filter add dev vxlan0 protocol ip parent ffff: \ flower \ enc_src_ip 10.0.99.192 \ enc_dst_ip 10.0.99.193 \ enc_key_id 11 \ vxlan_opts 01020304/ffffffff \ ip_proto udp \ action mirred egress redirect dev eth0 v1->v2: - add .strict_start_type for enc_opts_policy as Jakub noticed. - use Duplicate instead of Wrong in err msg for extack as Jakub suggested. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21net: sched: add erspan option support to act_tunnel_keyXin Long2-0/+134
This patch is to allow setting erspan options using the act_tunnel_key action. Different from geneve options, only one option can be set. And also, geneve options, vxlan options or erspan options can't be set at the same time. Options are expressed as ver:index:dir:hwid, when ver is set to 1, index will be applied while dir and hwid will be ignored, and when ver is set to 2, dir and hwid will be used while index will be ignored. # ip link add name erspan1 type erspan external # tc qdisc add dev eth0 ingress # tc filter add dev eth0 protocol ip parent ffff: \ flower indev eth0 \ ip_proto udp \ action tunnel_key \ set src_ip 10.0.99.192 \ dst_ip 10.0.99.193 \ dst_port 6081 \ id 11 \ erspan_opts 1:2:0:0 \ action mirred egress redirect dev erspan1 v1->v2: - do the validation when dst is not yet allocated as Jakub suggested. - use Duplicate instead of Wrong in err msg for extack. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21net: sched: add vxlan option support to act_tunnel_keyXin Long2-1/+97
This patch is to allow setting vxlan options using the act_tunnel_key action. Different from geneve options, only one option can be set. And also, geneve options and vxlan options can't be set at the same time. gbp is the only param for vxlan options: # ip link add name vxlan0 type vxlan dstport 0 external # tc qdisc add dev eth0 ingress # tc filter add dev eth0 protocol ip parent ffff: \ flower indev eth0 \ ip_proto udp \ action tunnel_key \ set src_ip 10.0.99.192 \ dst_ip 10.0.99.193 \ dst_port 6081 \ id 11 \ vxlan_opts 01020304 \ action mirred egress redirect dev vxlan0 v1->v2: - add .strict_start_type for enc_opts_policy as Jakub noticed. - use Duplicate instead of Wrong in err msg for extack as Jakub suggested. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21octeontx2-af: Fix uninitialized variable in debugfsDan Carpenter1-1/+2
If rvu_get_blkaddr() fails, then this rvu_cgx_nix_cuml_stats() returns zero and we write some uninitialized data into the debugfs output. On the error paths, the use of the uninitialized "*stat" is harmless, but it will lead to a Smatch warning (static analysis) and a UBSan warning (runtime analysis) so we should prevent that as well. Fixes: f967488d095e ("octeontx2-af: Add per CGX port level NIX Rx/Tx counters") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21vsock: avoid to assign transport if its initialization failsStefano Garzarella1-1/+8
If transport->init() fails, we can't assign the transport to the socket, because it's not initialized correctly, and any future calls to the transport callbacks would have an unexpected behavior. Fixes: c0cfa2d8a788 ("vsock: add multi-transports support") Reported-and-tested-by: [email protected] Signed-off-by: Stefano Garzarella <[email protected]> Reviewed-by: Jorgen Hansen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-11-21mt76: fix fix ampdu lockingMarkus Theil3-6/+12
The current ampdu locking code does not unlock its mutex in the early return case. This patch fixes it. Signed-off-by: Markus Theil <[email protected]> Acked-by: Felix Fietkau <[email protected]> Signed-off-by: Kalle Valo <[email protected]>
2019-11-21x86/entry/32: Fix NMI vs ESPFIXPeter Zijlstra1-12/+41
When the NMI lands on an ESPFIX_SS, we are on the entry stack and must swizzle, otherwise we'll run do_nmi() on the entry stack, which is BAD. Also, similar to the normal exception path, we need to correct the ESPFIX magic before leaving the entry stack, otherwise pt_regs will present a non-flat stack pointer. Tested by running sigreturn_32 concurrent with perf-record. Fixes: e5862d0515ad ("x86/entry/32: Leave the kernel via trampoline stack") Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Cc: [email protected]
2019-11-21x86/entry/32: Unwind the ESPFIX stack earlier on exception entryAndy Lutomirski1-14/+16
Right now, we do some fancy parts of the exception entry path while SS might have a nonzero base: we fill in regs->ss and regs->sp, and we consider switching to the kernel stack. This results in regs->ss and regs->sp referring to a non-flat stack and it may result in overflowing the entry stack. The former issue means that we can try to call iret_exc on a non-flat stack, which doesn't work. Tested with selftests/x86/sigreturn_32. Fixes: 45d7b255747c ("x86/entry/32: Enter the kernel via trampoline stack") Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected]
2019-11-21x86/entry/32: Move FIXUP_FRAME after pushing %fs in SAVE_ALLAndy Lutomirski1-31/+35
This will allow us to get percpu access working before FIXUP_FRAME, which will allow us to unwind ESPFIX earlier. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected]
2019-11-21x86/entry/32: Use %ss segment where requiredAndy Lutomirski1-5/+14
When re-building the IRET frame we use %eax as an destination %esp, make sure to then also match the segment for when there is a nonzero SS base (ESPFIX). [peterz: Changelog and minor edits] Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected]
2019-11-21x86/entry/32: Fix IRET exceptionPeter Zijlstra1-1/+1
As reported by Lai, the commit 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") wrecked the IRET EXTABLE entry by making .Lirq_return not point at IRET. Fix this by placing IRET_FRAME in RESTORE_REGS, to mirror how FIXUP_FRAME is part of SAVE_ALL. Fixes: 3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs") Reported-by: Lai Jiangshan <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Cc: [email protected]
2019-11-21x86/cpu_entry_area: Add guard page for entry stack on 32bitThomas Gleixner1-1/+5
The entry stack in the cpu entry area is protected against overflow by the readonly GDT on 64-bit, but on 32-bit the GDT needs to be writeable and therefore does not trigger a fault on stack overflow. Add a guard page. Fixes: c482feefe1ae ("x86/entry/64: Make cpu_entry_area.tss read-only") Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected]
2019-11-21x86/pti/32: Size initial_page_table correctlyThomas Gleixner1-0/+10
Commit 945fd17ab6ba ("x86/cpu_entry_area: Sync cpu_entry_area to initial_page_table") introduced the sync for the initial page table for 32bit. sync_initial_page_table() uses clone_pgd_range() which does the update for the kernel page table. If PTI is enabled it also updates the user space page table counterpart, which is assumed to be in the next page after the target PGD. At this point in time 32-bit did not have PTI support, so the user space page table update was not taking place. The support for PTI on 32-bit which was introduced later on, did not take that into account and missed to add the user space counter part for the initial page table. As a consequence sync_initial_page_table() overwrites any data which is located in the page behing initial_page_table causing random failures, e.g. by corrupting doublefault_tss and wreckaging the doublefault handler on 32bit. Fix it by adding a "user" page table right after initial_page_table. Fixes: 7757d607c6b3 ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32") Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Joerg Roedel <[email protected]> Cc: [email protected]
2019-11-21x86/doublefault/32: Fix stack canaries in the double fault handlerAndy Lutomirski1-0/+3
The double fault TSS was missing GS setup, which is needed for stack canaries to work. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected]
2019-11-21nbd: prevent memory leakNavid Emamdoost1-2/+3
In nbd_add_socket when krealloc succeeds, if nsock's allocation fail the reallocted memory is leak. The correct behaviour should be assigning the reallocted memory to config->socks right after success. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Navid Emamdoost <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-11-21Merge branch 'nvme-5.5' of git://git.infradead.org/nvme into ↵Jens Axboe8-4/+307
for-5.5/drivers-post Pull NVMe changes from Keith: "- The only new feature is the optional hwmon support for nvme (Guenter and Akinobu) - A universal work-around for controllers reading discard payloads beyond the range boundary (Eduard) - Chaitanya graciously agreed to share the target driver maintenance" * 'nvme-5.5' of git://git.infradead.org/nvme: nvme: hwmon: add quirk to avoid changing temperature threshold nvme: hwmon: provide temperature min and max values for each sensor nvmet: add another maintainer nvme: Discard workaround for non-conformant devices nvme: Add hardware monitoring support
2019-11-21x86/mm/pat: Rename pat_rbtree.c to pat_interval.cDavidlohr Bueso2-1/+1
Considering the previous changes, this is a more proper name. Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-11-21x86/mm/pat: Drop the rbt_ prefix from external memtype callsDavidlohr Bueso3-20/+20
Drop the rbt_memtype_*() call rbt_ prefix, as we no longer use an rbtree directly. Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-11-21x86/mm/pat: Do not pass 'rb_root' down the memtype tree helper functionsDavidlohr Bueso1-13/+8
Get rid of the passing the rb_root down the helper calls; there is only one: &memtype_rbroot. No change in functionality. [ mingo: Fixed the changelog which described a different version of the patch. ] Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-11-21x86/mm/pat: Convert the PAT tree to a generic interval treeDavidlohr Bueso1-120/+42
With some considerations, the custom pat_rbtree implementation can be simplified to use most of the generic interval_tree machinery: - The tree inorder traversal can slightly differ when there are key ('start') collisions in the tree due to one going left and another right. This, however, only affects the output of debugfs' pat_memtype_list file. - Generic interval trees are now fully closed [a, b], for which we need to adjust the last endpoint (ie: end - 1). - Erasing logic must remain untouched as well. - In order for the types to remain u64, the 'memtype_interval' calls are introduced, as opposed to simply using struct interval_tree. In addition, the PAT tree might potentially also benefit by the fast overlap detection for the insertion case when looking up the first overlapping node in the tree. No change in behavior is intended. Finally, I've tested this on various servers, via sanity warnings, running side by side with the current version and so far see no differences in the returned pointer node when doing memtype_rb_lowest_match() lookups. Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-11-22nvme: hwmon: add quirk to avoid changing temperature thresholdAkinobu Mita3-2/+12
This adds a new quirk NVME_QUIRK_NO_TEMP_THRESH_CHANGE to avoid changing the value of the temperature threshold feature for specific devices that show undesirable behavior. Guenter reported: "On my Intel NVME drive (SSDPEKKW512G7), writing any minimum limit on the Composite temperature sensor results in a temperature warning, and that warning is sticky until I reset the controller. It doesn't seem to matter which temperature I write; writing -273000 has the same result." The Intel NVMe has the latest firmware version installed, so this isn't a problem that was ever fixed. Reported-by: Guenter Roeck <[email protected]> Cc: Keith Busch <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Sagi Grimberg <[email protected]> Cc: Jean Delvare <[email protected]> Reviewed-by: Guenter Roeck <[email protected]> Tested-by: Guenter Roeck <[email protected]> Signed-off-by: Akinobu Mita <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2019-11-22nvme: hwmon: provide temperature min and max values for each sensorAkinobu Mita2-16/+96
According to the NVMe specification, the over temperature threshold and under temperature threshold features shall be implemented for Composite Temperature if a non-zero WCTEMP field value is reported in the Identify Controller data structure. The features are also implemented for all implemented temperature sensors (i.e., all Temperature Sensor fields that report a non-zero value). This provides the over temperature threshold and under temperature threshold for each sensor as temperature min and max values of hwmon sysfs attributes. The WCTEMP is already provided as a temperature max value for Composite Temperature, but this change isn't incompatible. Because the default value of the over temperature threshold for Composite Temperature is the WCTEMP. Now the alarm attribute for Composite Temperature indicates one of the temperature is outside of a temperature threshold. Because there is only a single bit in Critical Warning field that indicates a temperature is outside of a threshold. Example output from the "sensors" command: nvme-pci-0100 Adapter: PCI adapter Composite: +33.9°C (low = -273.1°C, high = +69.8°C) (crit = +79.8°C) Sensor 1: +34.9°C (low = -273.1°C, high = +65261.8°C) Sensor 2: +31.9°C (low = -273.1°C, high = +65261.8°C) Sensor 5: +47.9°C (low = -273.1°C, high = +65261.8°C) This also adds helper macros for kelvin from/to milli Celsius conversion, and replaces the repeated code in hwmon.c. Cc: Keith Busch <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Sagi Grimberg <[email protected]> Cc: Jean Delvare <[email protected]> Reviewed-by: Guenter Roeck <[email protected]> Tested-by: Guenter Roeck <[email protected]> Signed-off-by: Akinobu Mita <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2019-11-22nvmet: add another maintainerChristoph Hellwig1-0/+1
Sagi and I have been pretty busy lately, and Chaitanya has been helping a lot with target work and agreed to share the load. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2019-11-21Revert "block: split bio if the only bvec's length is > SZ_4K"Jens Axboe1-1/+1
We really don't need this, as the slow path will do the right thing anyway. This reverts commit 6952a7f8446ee85ea9d10ab87b64797a031eaae3. Signed-off-by: Jens Axboe <[email protected]>
2019-11-21block: add iostat counters for flush requestsKonstantin Khlebnikov8-7/+58
Requests that triggers flushing volatile writeback cache to disk (barriers) have significant effect to overall performance. Block layer has sophisticated engine for combining several flush requests into one. But there is no statistics for actual flushes executed by disk. Requests which trigger flushes usually are barriers - zero-size writes. This patch adds two iostat counters into /sys/class/block/$dev/stat and /proc/diskstats - count of completed flush requests and their total time. Signed-off-by: Konstantin Khlebnikov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-11-21perf tools: Add kernel AUX area sampling definitionsAdrian Hunter9-6/+59
Add kernel AUX area sampling definitions, which brings perf_event.h into line with the kernel version. New sample type PERF_SAMPLE_AUX requests a sample of the AUX area buffer. New perf_event_attr member 'aux_sample_size' specifies the desired size of the sample. Also add support for parsing samples containing AUX area data i.e. PERF_SAMPLE_AUX. Committer notes: I squashed the first two patches in this series to avoid breaking automatic bisection, i.e. after applying only the original first patch in this series we would have: # perf test -v parsing 26: Sample parsing : --- start --- test child forked, pid 17018 sample format has changed, some new PERF_SAMPLE_ bit was introduced - test needs updating test child finished with -1 ---- end ---- Sample parsing: FAILED! # With the two paches combined: # perf test parsing 26: Sample parsing : Ok # Signed-off-by: Adrian Hunter <[email protected]> Tested-by: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Link: http://lore.kernel.org/lkml/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2019-11-21KVM: x86: create mmu/ subdirectoryPaolo Bonzini4-2/+2
Preparatory work for shattering mmu.c into multiple files. Besides making it easier to follow, this will also make it possible to write unit tests for various parts. Signed-off-by: Paolo Bonzini <[email protected]>
2019-11-21KVM: nVMX: Remove unnecessary TLB flushes on L1<->L2 switches when L1 use ↵Liran Alon1-7/+0
apic-access-page According to Intel SDM section 28.3.3.3/28.3.3.4 Guidelines for Use of the INVVPID/INVEPT Instruction, the hypervisor needs to execute INVVPID/INVEPT X in case CPU executes VMEntry with VPID/EPTP X and either: "Virtualize APIC accesses" VM-execution control was changed from 0 to 1, OR the value of apic_access_page was changed. In the nested case, the burden falls on L1, unless L0 enables EPT in vmcs02 but L1 enables neither EPT nor VPID in vmcs12. For this reason prepare_vmcs02() and load_vmcs12_host_state() have special code to request a TLB flush in case L1 does not use EPT but it uses "virtualize APIC accesses". This special case however is not necessary. On a nested vmentry the physical TLB will already be flushed except if all the following apply: * L0 uses VPID * L1 uses VPID * L0 can guarantee TLB entries populated while running L1 are tagged differently than TLB entries populated while running L2. If the first condition is false, the processor will flush the TLB on vmentry to L2. If the second or third condition are false, prepare_vmcs02() will request KVM_REQ_TLB_FLUSH. However, even if both are true, no extra TLB flush is needed to handle the APIC access page: * if L1 doesn't use VPID, the second condition doesn't hold and the TLB will be flushed anyway. * if L1 uses VPID, it has to flush the TLB itself with INVVPID and section 28.3.3.3 doesn't apply to L0. * even INVEPT is not needed because, if L0 uses EPT, it uses different EPTP when running L2 than L1 (because guest_mode is part of mmu-role). In this case SDM section 28.3.3.4 doesn't apply. Similarly, examining nested_vmx_vmexit()->load_vmcs12_host_state(), one could note that L0 won't flush TLB only in cases where SDM sections 28.3.3.3 and 28.3.3.4 don't apply. In particular, if L0 uses different VPIDs for L1 and L2 (i.e. vmx->vpid != vmx->nested.vpid02), section 28.3.3.3 doesn't apply. Thus, remove this flush from prepare_vmcs02() and nested_vmx_vmexit(). Side-note: This patch can be viewed as removing parts of commit fb6c81984313 ("kvm: vmx: Flush TLB when the APIC-access address changes”) that is not relevant anymore since commit 1313cc2bd8f6 ("kvm: mmu: Add guest_mode to kvm_mmu_page_role”). i.e. The first commit assumes that if L0 use EPT and L1 doesn’t use EPT, then L0 will use same EPTP for both L0 and L1. Which indeed required L0 to execute INVEPT before entering L2 guest. This assumption is not true anymore since when guest_mode was added to mmu-role. Reviewed-by: Joao Martins <[email protected]> Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2019-11-21KVM: x86: remove set but not used variable 'called'Mao Wenan1-3/+2
Fixes gcc '-Wunused-but-set-variable' warning: arch/x86/kvm/x86.c: In function kvm_make_scan_ioapic_request_mask: arch/x86/kvm/x86.c:7911:7: warning: variable called set but not used [-Wunused-but-set-variable] It is not used since commit 7ee30bc132c6 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs") Signed-off-by: Mao Wenan <[email protected]> Fixes: 7ee30bc132c6 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs") Signed-off-by: Paolo Bonzini <[email protected]>
2019-11-21KVM: nVMX: Do not mark vmcs02->apic_access_page as dirty when unpinningLiran Alon1-3/+3
vmcs->apic_access_page is simply a token that the hypervisor puts into the PFN of a 4KB EPTE (or PTE if using shadow-paging) that triggers APIC-access VMExit or APIC virtualization logic whenever a CPU running in VMX non-root mode read/write from/to this PFN. As every write either triggers an APIC-access VMExit or write is performed on vmcs->virtual_apic_page, the PFN pointed to by vmcs->apic_access_page should never actually be touched by CPU. Therefore, there is no need to mark vmcs02->apic_access_page as dirty after unpin it on L2->L1 emulated VMExit or when L1 exit VMX operation. Reviewed-by: Krish Sadhukhan <[email protected]> Reviewed-by: Joao Martins <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Liran Alon <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2019-11-21Merge branch 'kvm-tsx-ctrl' into HEADPaolo Bonzini928-4763/+13793
Conflicts: arch/x86/kvm/vmx/vmx.c
2019-11-21KVM: vmx: use MSR_IA32_TSX_CTRL to hard-disable TSX on guest that lack itPaolo Bonzini1-14/+30
If X86_FEATURE_RTM is disabled, the guest should not be able to access MSR_IA32_TSX_CTRL. We can therefore use it in KVM to force all transactions from the guest to abort. Tested-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2019-11-21KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionalityPaolo Bonzini2-21/+40
The current guest mitigation of TAA is both too heavy and not really sufficient. It is too heavy because it will cause some affected CPUs (those that have MDS_NO but lack TAA_NO) to fall back to VERW and get the corresponding slowdown. It is not really sufficient because it will cause the MDS_NO bit to disappear upon microcode update, so that VMs started before the microcode update will not be runnable anymore afterwards, even with tsx=on. Instead, if tsx=on on the host, we can emulate MSR_IA32_TSX_CTRL for the guest and let it run without the VERW mitigation. Even though MSR_IA32_TSX_CTRL is quite heavyweight, and we do not want to write it on every vmentry, we can use the shared MSR functionality because the host kernel need not protect itself from TSX-based side-channels. Tested-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2019-11-21KVM: x86: implement MSR_IA32_TSX_CTRL effect on CPUIDPaolo Bonzini3-4/+9
Because KVM always emulates CPUID, the CPUID clear bit (bit 1) of MSR_IA32_TSX_CTRL must be emulated "manually" by the hypervisor when performing said emulation. Right now neither kvm-intel.ko nor kvm-amd.ko implement MSR_IA32_TSX_CTRL but this will change in the next patch. Reviewed-by: Jim Mattson <[email protected]> Tested-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2019-11-21KVM: x86: do not modify masked bits of shared MSRsPaolo Bonzini1-2/+3
"Shared MSRs" are guest MSRs that are written to the host MSRs but keep their value until the next return to userspace. They support a mask, so that some bits keep the host value, but this mask is only used to skip an unnecessary MSR write and the value written to the MSR is always the guest MSR. Fix this and, while at it, do not update smsr->values[slot].curr if for whatever reason the wrmsr fails. This should only happen due to reserved bits, so the value written to smsr->values[slot].curr will not match when the user-return notifier and the host value will always be restored. However, it is untidy and in rare cases this can actually avoid spurious WRMSRs on return to userspace. Cc: [email protected] Reviewed-by: Jim Mattson <[email protected]> Tested-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2019-11-21KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIESPaolo Bonzini1-2/+8
KVM does not implement MSR_IA32_TSX_CTRL, so it must not be presented to the guests. It is also confusing to have !ARCH_CAP_TSX_CTRL_MSR && !RTM && ARCH_CAP_TAA_NO: lack of MSR_IA32_TSX_CTRL suggests TSX was not hidden (it actually was), yet the value says that TSX is not vulnerable to microarchitectural data sampling. Fix both. Cc: [email protected] Tested-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>