aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2019-03-05x86: Deprecate a.out supportBorislav Petkov2-2/+1
Linux supports ELF binaries for ~25 years now. a.out coredumping has bitrotten quite significantly and would need some fixing to get it into shape again but considering how even the toolchains cannot create a.out executables in its default configuration, let's deprecate a.out support and remove it a couple of releases later, instead. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Richard Weinberger <richard@nod.at> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Jann Horn <jannh@google.com> Cc: <linux-api@vger.kernel.org> Cc: <linux-fsdevel@vger.kernel.org> Cc: lkml <linux-kernel@vger.kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <x86@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05a.out: remove core dumping supportLinus Torvalds2-226/+0
We're (finally) phasing out a.out support for good. As Borislav Petkov points out, we've supported ELF binaries for about 25 years by now, and coredumping in particular has bitrotted over the years. None of the tool chains even support generating a.out binaries any more, and the plan is to deprecate a.out support entirely for the kernel. But I want to start with just removing the core dumping code, because I can still imagine that somebody actually might want to support a.out as a simpler biinary format. Particularly if you generate some random binaries on the fly, ELF is a much more complicated format (admittedly ELF also does have a lot of toolchain support, mitigating that complexity a lot and you really should have moved over in the last 25 years). So it's at least somewhat possible that somebody out there has some workflow that still involves generating and running a.out executables. In contrast, it's very unlikely that anybody depends on debugging any legacy a.out core files. But regardless, I want this phase-out to be done in two steps, so that we can resurrect a.out support (if needed) without having to resurrect the core file dumping that is almost certainly not needed. Jann Horn pointed to the <asm/a.out-core.h> file that my first trivial cut at this had missed. And Alan Cox points out that the a.out binary loader _could_ be done in user space if somebody wants to, but we might keep just the loader in the kernel if somebody really wants it, since the loader isn't that big and has no really odd special cases like the core dumping does. Acked-by: Borislav Petkov <bp@alien8.de> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Jann Horn <jannh@google.com> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05Merge branch 'linus' of ↵Linus Torvalds9-702/+336
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu: "API: - Add helper for simple skcipher modes. - Add helper to register multiple templates. - Set CRYPTO_TFM_NEED_KEY when setkey fails. - Require neither or both of export/import in shash. - AEAD decryption test vectors are now generated from encryption ones. - New option CONFIG_CRYPTO_MANAGER_EXTRA_TESTS that includes random fuzzing. Algorithms: - Conversions to skcipher and helper for many templates. - Add more test vectors for nhpoly1305 and adiantum. Drivers: - Add crypto4xx prng support. - Add xcbc/cmac/ecb support in caam. - Add AES support for Exynos5433 in s5p. - Remove sha384/sha512 from artpec7 as hardware cannot do partial hash" [ There is a merge of the Freescale SoC tree in order to pull in changes required by patches to the caam/qi2 driver. ] * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (174 commits) crypto: s5p - add AES support for Exynos5433 dt-bindings: crypto: document Exynos5433 SlimSSS crypto: crypto4xx - add missing of_node_put after of_device_is_available crypto: cavium/zip - fix collision with generic cra_driver_name crypto: af_alg - use struct_size() in sock_kfree_s() crypto: caam - remove redundant likely/unlikely annotation crypto: s5p - update iv after AES-CBC op end crypto: x86/poly1305 - Clear key material from stack in SSE2 variant crypto: caam - generate hash keys in-place crypto: caam - fix DMA mapping xcbc key twice crypto: caam - fix hash context DMA unmap size hwrng: bcm2835 - fix probe as platform device crypto: s5p-sss - Use AES_BLOCK_SIZE define instead of number crypto: stm32 - drop pointless static qualifier in stm32_hash_remove() crypto: chelsio - Fixed Traffic Stall crypto: marvell - Remove set but not used variable 'ivsize' crypto: ccp - Update driver messages to remove some confusion crypto: adiantum - add 1536 and 4096-byte test vectors crypto: nhpoly1305 - add a test vector with len % 16 != 0 crypto: arm/aes-ce - update IV after partial final CTR block ...
2019-03-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds4-43/+126
Pull networking updates from David Miller: "Here we go, another merge window full of networking and #ebpf changes: 1) Snoop DHCPACKS in batman-adv to learn MAC/IP pairs in the DHCP range without dealing with floods of ARP traffic, from Linus Lüssing. 2) Throttle buffered multicast packet transmission in mt76, from Felix Fietkau. 3) Support adaptive interrupt moderation in ice, from Brett Creeley. 4) A lot of struct_size conversions, from Gustavo A. R. Silva. 5) Add peek/push/pop commands to bpftool, as well as bash completion, from Stanislav Fomichev. 6) Optimize sk_msg_clone(), from Vakul Garg. 7) Add SO_BINDTOIFINDEX, from David Herrmann. 8) Be more conservative with local resends due to local congestion, from Yuchung Cheng. 9) Allow vetoing of unsupported VXLAN FDBs, from Petr Machata. 10) Add health buffer support to devlink, from Eran Ben Elisha. 11) Add TXQ scheduling API to mac80211, from Toke Høiland-Jørgensen. 12) Add statistics to basic packet scheduler filter, from Cong Wang. 13) Add GRE tunnel support for mlxsw Spectrum-2, from Nir Dotan. 14) Lots of new IP tunneling forwarding tests, also from Nir Dotan. 15) Add 3ad stats to bonding, from Nikolay Aleksandrov. 16) Lots of probing improvements for bpftool, from Quentin Monnet. 17) Various nfp drive #ebpf JIT improvements from Jakub Kicinski. 18) Allow #ebpf programs to access gso_segs from skb shared info, from Eric Dumazet. 19) Add sock_diag support for AF_XDP sockets, from Björn Töpel. 20) Support 22260 iwlwifi devices, from Luca Coelho. 21) Use rbtree for ipv6 defragmentation, from Peter Oskolkov. 22) Add JMP32 instruction class support to #ebpf, from Jiong Wang. 23) Add spinlock support to #ebpf, from Alexei Starovoitov. 24) Support 256-bit keys and TLS 1.3 in ktls, from Dave Watson. 25) Add device infomation API to devlink, from Jakub Kicinski. 26) Add new timestamping socket options which are y2038 safe, from Deepa Dinamani. 27) Add RX checksum offloading for various sh_eth chips, from Sergei Shtylyov. 28) Flow offload infrastructure, from Pablo Neira Ayuso. 29) Numerous cleanups, improvements, and bug fixes to the PHY layer and many drivers from Heiner Kallweit. 30) Lots of changes to try and make packet scheduler classifiers run lockless as much as possible, from Vlad Buslov. 31) Support BCM957504 chip in bnxt_en driver, from Erik Burrows. 32) Add concurrency tests to tc-tests infrastructure, from Vlad Buslov. 33) Add hwmon support to aquantia, from Heiner Kallweit. 34) Allow 64-bit values for SO_MAX_PACING_RATE, from Eric Dumazet. And I would be remiss if I didn't thank the various major networking subsystem maintainers for integrating much of this work before I even saw it. Alexei Starovoitov, Daniel Borkmann, Pablo Neira Ayuso, Johannes Berg, Kalle Valo, and many others. Thank you!" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2207 commits) net/sched: avoid unused-label warning net: ignore sysctl_devconf_inherit_init_net without SYSCTL phy: mdio-mux: fix Kconfig dependencies net: phy: use phy_modify_mmd_changed in genphy_c45_an_config_aneg net: dsa: mv88e6xxx: add call to mv88e6xxx_ports_cmode_init to probe for new DSA framework selftest/net: Remove duplicate header sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79 net/mlx5e: Update tx reporter status in case channels were successfully opened devlink: Add support for direct reporter health state update devlink: Update reporter state to error even if recover aborted sctp: call iov_iter_revert() after sending ABORT team: Free BPF filter when unregistering netdev ip6mr: Do not call __IP6_INC_STATS() from preemptible context isdn: mISDN: Fix potential NULL pointer dereference of kzalloc net: dsa: mv88e6xxx: support in-band signalling on SGMII ports with external PHYs cxgb4/chtls: Prefix adapter flags with CXGB4 net-sysfs: Switch to bitmap_zalloc() mellanox: Switch to bitmap_zalloc() bpf: add test cases for non-pointer sanitiation logic mlxsw: i2c: Extend initialization by querying resources data ...
2019-03-05signal: add pidfd_send_signal() syscallChristian Brauner2-0/+2
The kill() syscall operates on process identifiers (pid). After a process has exited its pid can be reused by another process. If a caller sends a signal to a reused pid it will end up signaling the wrong process. This issue has often surfaced and there has been a push to address this problem [1]. This patch uses file descriptors (fd) from proc/<pid> as stable handles on struct pid. Even if a pid is recycled the handle will not change. The fd can be used to send signals to the process it refers to. Thus, the new syscall pidfd_send_signal() is introduced to solve this problem. Instead of pids it operates on process fds (pidfd). /* prototype and argument /* long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags); /* syscall number 424 */ The syscall number was chosen to be 424 to align with Arnd's rework in his y2038 to minimize merge conflicts (cf. [25]). In addition to the pidfd and signal argument it takes an additional siginfo_t and flags argument. If the siginfo_t argument is NULL then pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo(). The flags argument is added to allow for future extensions of this syscall. It currently needs to be passed as 0. Failing to do so will cause EINVAL. /* pidfd_send_signal() replaces multiple pid-based syscalls */ The pidfd_send_signal() syscall currently takes on the job of rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a positive pid is passed to kill(2). It will however be possible to also replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended. /* sending signals to threads (tid) and process groups (pgid) */ Specifically, the pidfd_send_signal() syscall does currently not operate on process groups or threads. This is left for future extensions. In order to extend the syscall to allow sending signal to threads and process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and PIDFD_TYPE_TID) should be added. This implies that the flags argument will determine what is signaled and not the file descriptor itself. Put in other words, grouping in this api is a property of the flags argument not a property of the file descriptor (cf. [13]). Clarification for this has been requested by Eric (cf. [19]). When appropriate extensions through the flags argument are added then pidfd_send_signal() can additionally replace the part of kill(2) which operates on process groups as well as the tgkill(2) and rt_tgsigqueueinfo(2) syscalls. How such an extension could be implemented has been very roughly sketched in [14], [15], and [16]. However, this should not be taken as a commitment to a particular implementation. There might be better ways to do it. Right now this is intentionally left out to keep this patchset as simple as possible (cf. [4]). /* naming */ The syscall had various names throughout iterations of this patchset: - procfd_signal() - procfd_send_signal() - taskfd_send_signal() In the last round of reviews it was pointed out that given that if the flags argument decides the scope of the signal instead of different types of fds it might make sense to either settle for "procfd_" or "pidfd_" as prefix. The community was willing to accept either (cf. [17] and [18]). Given that one developer expressed strong preference for the "pidfd_" prefix (cf. [13]) and with other developers less opinionated about the name we should settle for "pidfd_" to avoid further bikeshedding. The "_send_signal" suffix was chosen to reflect the fact that the syscall takes on the job of multiple syscalls. It is therefore intentional that the name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the fomer because it might imply that pidfd_send_signal() is a replacement for kill(2), and not the latter because it is a hassle to remember the correct spelling - especially for non-native speakers - and because it is not descriptive enough of what the syscall actually does. The name "pidfd_send_signal" makes it very clear that its job is to send signals. /* zombies */ Zombies can be signaled just as any other process. No special error will be reported since a zombie state is an unreliable state (cf. [3]). However, this can be added as an extension through the @flags argument if the need ever arises. /* cross-namespace signals */ The patch currently enforces that the signaler and signalee either are in the same pid namespace or that the signaler's pid namespace is an ancestor of the signalee's pid namespace. This is done for the sake of simplicity and because it is unclear to what values certain members of struct siginfo_t would need to be set to (cf. [5], [6]). /* compat syscalls */ It became clear that we would like to avoid adding compat syscalls (cf. [7]). The compat syscall handling is now done in kernel/signal.c itself by adding __copy_siginfo_from_user_generic() which lets us avoid compat syscalls (cf. [8]). It should be noted that the addition of __copy_siginfo_from_user_any() is caused by a bug in the original implementation of rt_sigqueueinfo(2) (cf. 12). With upcoming rework for syscall handling things might improve significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain any additional callers. /* testing */ This patch was tested on x64 and x86. /* userspace usage */ An asciinema recording for the basic functionality can be found under [9]. With this patch a process can be killed via: #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/stat.h> #include <sys/syscall.h> #include <sys/types.h> #include <unistd.h> static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags) { #ifdef __NR_pidfd_send_signal return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags); #else return -ENOSYS; #endif } int main(int argc, char *argv[]) { int fd, ret, saved_errno, sig; if (argc < 3) exit(EXIT_FAILURE); fd = open(argv[1], O_DIRECTORY | O_CLOEXEC); if (fd < 0) { printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]); exit(EXIT_FAILURE); } sig = atoi(argv[2]); printf("Sending signal %d to process %s\n", sig, argv[1]); ret = do_pidfd_send_signal(fd, sig, NULL, 0); saved_errno = errno; close(fd); errno = saved_errno; if (ret < 0) { printf("%s - Failed to send signal %d to process %s\n", strerror(errno), sig, argv[1]); exit(EXIT_FAILURE); } exit(EXIT_SUCCESS); } /* Q&A * Given that it seems the same questions get asked again by people who are * late to the party it makes sense to add a Q&A section to the commit * message so it's hopefully easier to avoid duplicate threads. * * For the sake of progress please consider these arguments settled unless * there is a new point that desperately needs to be addressed. Please make * sure to check the links to the threads in this commit message whether * this has not already been covered. */ Q-01: (Florian Weimer [20], Andrew Morton [21]) What happens when the target process has exited? A-01: Sending the signal will fail with ESRCH (cf. [22]). Q-02: (Andrew Morton [21]) Is the task_struct pinned by the fd? A-02: No. A reference to struct pid is kept. struct pid - as far as I understand - was created exactly for the reason to not require to pin struct task_struct (cf. [22]). Q-03: (Andrew Morton [21]) Does the entire procfs directory remain visible? Just one entry within it? A-03: The same thing that happens right now when you hold a file descriptor to /proc/<pid> open (cf. [22]). Q-04: (Andrew Morton [21]) Does the pid remain reserved? A-04: No. This patchset guarantees a stable handle not that pids are not recycled (cf. [22]). Q-05: (Andrew Morton [21]) Do attempts to signal that fd return errors? A-05: See {Q,A}-01. Q-06: (Andrew Morton [22]) Is there a cleaner way of obtaining the fd? Another syscall perhaps. A-06: Userspace can already trivially retrieve file descriptors from procfs so this is something that we will need to support anyway. Hence, there's no immediate need to add another syscalls just to make pidfd_send_signal() not dependent on the presence of procfs. However, adding a syscalls to get such file descriptors is planned for a future patchset (cf. [22]). Q-07: (Andrew Morton [21] and others) This fd-for-a-process sounds like a handy thing and people may well think up other uses for it in the future, probably unrelated to signals. Are the code and the interface designed to permit such future applications? A-07: Yes (cf. [22]). Q-08: (Andrew Morton [21] and others) Now I think about it, why a new syscall? This thing is looking rather like an ioctl? A-08: This has been extensively discussed. It was agreed that a syscall is preferred for a variety or reasons. Here are just a few taken from prior threads. Syscalls are safer than ioctl()s especially when signaling to fds. Processes are a core kernel concept so a syscall seems more appropriate. The layout of the syscall with its four arguments would require the addition of a custom struct for the ioctl() thereby causing at least the same amount or even more complexity for userspace than a simple syscall. The new syscall will replace multiple other pid-based syscalls (see description above). The file-descriptors-for-processes concept introduced with this syscall will be extended with other syscalls in the future. See also [22], [23] and various other threads already linked in here. Q-09: (Florian Weimer [24]) What happens if you use the new interface with an O_PATH descriptor? A-09: pidfds opened as O_PATH fds cannot be used to send signals to a process (cf. [2]). Signaling processes through pidfds is the equivalent of writing to a file. Thus, this is not an operation that operates "purely at the file descriptor level" as required by the open(2) manpage. See also [4]. /* References */ [1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/ [2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/ [3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/ [4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/ [5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/ [6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/ [7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/ [8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/ [9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/ [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/ [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/ [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/ [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/ [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/ [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/ [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/ [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/ [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/ [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/ [23]: https://lwn.net/Articles/773459/ [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/ Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Christian Brauner <christian@brauner.io> Reviewed-by: Tycho Andersen <tycho@tycho.ws> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Aleksa Sarai <cyphar@cyphar.com>
2019-03-05x86/ftrace: Fix warning and considate ftrace_jmp_replace() and ↵Steven Rostedt (VMware)1-25/+17
ftrace_call_replace() Arnd reported the following compiler warning: arch/x86/kernel/ftrace.c:669:23: error: 'ftrace_jmp_replace' defined but not used [-Werror=unused-function] The ftrace_jmp_replace() function now only has a single user and should be simply moved by that user. But looking at the code, it shows that ftrace_jmp_replace() is similar to ftrace_call_replace() except that instead of using the opcode of 0xe8 it uses 0xe9. It makes more sense to consolidate that function into one implementation that both ftrace_jmp_replace() and ftrace_call_replace() use by passing in the op code separate. The structure in ftrace_code_union is also modified to replace the "e8" field with the more appropriate name "op". Cc: stable@vger.kernel.org Reported-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Link: http://lkml.kernel.org/r/20190304200748.1418790-1-arnd@arndb.de Fixes: d2a68c4effd8 ("x86/ftrace: Do not call function graph from dynamic trampolines") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-03-05xen: remove pre-xen3 fallback handlersArnd Bergmann1-11/+2
The legacy hypercall handlers were originally added with a comment explaining that "copying the argument structures in HYPERVISOR_event_channel_op() and HYPERVISOR_physdev_op() into the local variable is sufficiently safe" and only made sure to not write past the end of the argument structure, the checks in linux/string.h disagree with that, when link-time optimizations are used: In function 'memcpy', inlined from 'pirq_query_unmask' at drivers/xen/fallback.c:53:2, inlined from '__startup_pirq' at drivers/xen/events/events_base.c:529:2, inlined from 'restore_pirqs' at drivers/xen/events/events_base.c:1439:3, inlined from 'xen_irq_resume' at drivers/xen/events/events_base.c:1581:2: include/linux/string.h:350:3: error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter __read_overflow2(); ^ Further research turned out that only Xen 3.0.2 or earlier required the fallback at all, while all versions in use today don't need it. As far as I can tell, it is not even possible to run a mainline kernel on those old Xen releases, at the time when they were in use, only a patched kernel was supported anyway. Fixes: cf47a83fb06e ("xen/hypercall: fix hypercall fallback code for very old hypervisors") Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Jan Beulich <JBeulich@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Juergen Gross <jgross@suse.com>
2019-03-04Merge tag 'regulator-v5.1' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator updates from Mark Brown: "The bulk of the standout changes in this release are cleanups, with the core work being a combination of factoring out common code into helpers and the completion of the conversion of the core to use GPIO descriptors. Summary: - Addition of helper functions for current limits and conversion of drivers to use them by Axel Lin. - Lots and lots of cleanups from Axel Lin. - Conversion of the core to use GPIO descriptors rather than numbers by Linus Walleij. - New drivers for Maxim MAX77650 and ROHM BD70528" * tag 'regulator-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (131 commits) regulator: mc13xxx: Constify regulator_ops variables regulator: palmas: Constify palmas_smps_ramp_delay array regulator: wm831x-dcdc: Convert to use regulator_set/get_current_limit_regmap regulator: pv88090: Convert to use regulator_set/get_current_limit_regmap regulator: pv88080: Convert to use regulator_set/get_current_limit_regmap regulator: pv88060: Convert to use regulator_set/get_current_limit_regmap regulator: max77650: Convert to use regulator_set/get_current_limit_regmap regulator: lp873x: Convert to use regulator_set/get_current_limit_regmap regulator: lp872x: Convert to use regulator_set/get_current_limit_regmap regulator: da9210: Convert to use regulator_set/get_current_limit_regmap regulator: da9055: Convert to use regulator_set/get_current_limit_regmap regulator: core: Add set/get_current_limit helpers for regmap users regulator: Fix comment for csel_reg and csel_mask regulator: stm32-vrefbuf: add power management support regulator: 88pm8607: Remove unused fields from struct pm8607_regulator_info regulator: 88pm8607: Simplify pm8607_list_voltage implementation regulator: cpcap: Constify omap4_regulators and xoom_regulators regulator: cpcap: Remove unused vsel_shift from struct cpcap_regulator dt-bindings: regulator: tps65218: rectify units of LS3 dt-bindings: regulator: add LS2 load switch documentation ...
2019-03-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-61/+5
2019-03-04get rid of legacy 'get_ds()' functionLinus Torvalds1-1/+0
Every in-kernel use of this function defined it to KERNEL_DS (either as an actual define, or as an inline function). It's an entirely historical artifact, and long long long ago used to actually read the segment selector valueof '%ds' on x86. Which in the kernel is always KERNEL_DS. Inspired by a patch from Jann Horn that just did this for a very small subset of users (the ones in fs/), along with Al who suggested a script. I then just took it to the logical extreme and removed all the remaining gunk. Roughly scripted with git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/' git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d' plus manual fixups to remove a few unusual usage patterns, the couple of inline function cases and to fix up a comment that had become stale. The 'get_ds()' function remains in an x86 kvm selftest, since in user space it actually does something relevant. Inspired-by: Jann Horn <jannh@google.com> Inspired-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-04x86-64: add warning for non-canonical user access address dereferencesLinus Torvalds1-0/+1
This adds a warning (once) for any kernel dereference that has a user exception handler, but accesses a non-canonical address. It basically is a simpler - and more limited - version of commit 9da3f2b74054 ("x86/fault: BUG() when uaccess helpers fault on kernel addresses") that got reverted. Note that unlike that original commit, this only causes a warning, because there are real situations where we currently can do this (notably speculative argument fetching for uprobes etc). Also, unlike that original commit, this _only_ triggers for #GP accesses, so the cases of valid kernel pointers that cross into a non-mapped page aren't affected. The intent of this is two-fold: - the uprobe/tracing accesses really do need to be more careful. In particular, from a portability standpoint it's just wrong to think that "a pointer is a pointer", and use the same logic for any random pointer value you find on the stack. It may _work_ on x86-64, but it doesn't necessarily work on other architectures (where the same pointer value can be either a kernel pointer _or_ a user pointer, and you really need to be much more careful in how you try to access it) The warning can hopefully end up being a reminder that just any random pointer access won't do. - Kees in particular wanted a way to actually report invalid uses of wild pointers to user space accessors, instead of just silently failing them. Automated fuzzers want a way to get reports if the kernel ever uses invalid values that the fuzzer fed it. The non-canonical address range is a fair chunk of the address space, and with this you can teach syzkaller to feed in invalid pointer values and find cases where we do not properly validate user addresses (possibly due to bad uses of "set_fs()"). Acked-by: Kees Cook <keescook@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-04Merge branch 'regulator-5.1' into regulator-nextMark Brown1-1/+0
2019-03-02Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2-3/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "Two last minute fixes: - Prevent value evaluation via functions happening in the user access enabled region of __put_user() (put another way: make sure to evaluate the value to be stored in user space _before_ enabling user space accesses) - Correct the definition of a Hyper-V hypercall constant" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/hyper-v: Fix definition of HV_MAX_FLUSH_REP_COUNT x86/uaccess: Don't leak the AC flag into __put_user() value evaluation
2019-03-01PCI: hv: Refactor hv_irq_unmask() to use cpumask_to_vpset()Maya Nakamura1-0/+1
Remove the duplicate implementation of cpumask_to_vpset() and use the shared implementation. Export hv_max_vp_index, which is required by cpumask_to_vpset(). Signed-off-by: Maya Nakamura <m.maya.nakamura@gmail.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
2019-02-28io_uring: add support for pre-mapped user IO buffersJens Axboe2-0/+2
If we have fixed user buffers, we can map them into the kernel when we setup the io_uring. That avoids the need to do get_user_pages() for each and every IO. To utilize this feature, the application must call io_uring_register() after having setup an io_uring instance, passing in IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to an iovec array, and the nr_args should contain how many iovecs the application wishes to map. If successful, these buffers are now mapped into the kernel, eligible for IO. To use these fixed buffers, the application must use the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len must point to somewhere inside the indexed buffer. The application may register buffers throughout the lifetime of the io_uring instance. It can call io_uring_register() with IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of buffers, and then register a new set. The application need not unregister buffers explicitly before shutting down the io_uring instance. It's perfectly valid to setup a larger buffer, and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine. For now, buffers must not be file backed. If file backed buffers are passed in, the registration will fail with -1/EOPNOTSUPP. This restriction may be relaxed in the future. RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat arbitrary 1G per buffer size is also imposed. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28Add io_uring IO interfaceJens Axboe2-0/+4
The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an index into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered, and hence any io_uring_cqe entry can point back to an arbitrary submission. Two new system calls are added for this: io_uring_setup(entries, params) Sets up an io_uring instance for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes. io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. It's valid to set IORING_ENTER_GETEVENTS and 'min_complete' == 0 at the same time, this allows the kernel to return already completed events without waiting for them. This is useful only for polling, as for IRQ driven IO, the application can just check the CQ ring without entering the kernel. With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur. Each io_uring is backed by a workqueue, to support buffered async IO as well. We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28x86/boot/compressed/64: Do not read legacy ROM on EFI systemKirill A. Shutemov1-3/+16
EFI systems do not necessarily provide a legacy ROM. If the ROM is missing the memory is not mapped at all. Trying to dereference values in the legacy ROM area leads to a crash on Macbook Pro. Only look for values in the legacy ROM area for non-EFI system. Fixes: 3548e131ec6a ("x86/boot/compressed/64: Find a place for 32-bit trampoline") Reported-by: Pitam Mitra <pitamm@gmail.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Bockjoo Kim <bockjoo@phys.ufl.edu> Cc: bp@alien8.de Cc: hpa@zytor.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190219075224.35058-1-kirill.shutemov@linux.intel.com Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202351
2019-02-28x86, retpolines: Raise limit for generating indirect calls from switch-caseDaniel Borkmann1-0/+5
From networking side, there are numerous attempts to get rid of indirect calls in fast-path wherever feasible in order to avoid the cost of retpolines, for example, just to name a few: * 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin") * aaa5d90b395a ("net: use indirect call wrappers at GRO network layer") * 028e0a476684 ("net: use indirect call wrappers at GRO transport layer") * 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct") * 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps") * 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions") [...] Recent work on XDP from Björn and Magnus additionally found that manually transforming the XDP return code switch statement with more than 5 cases into if-else combination would result in a considerable speedup in XDP layer due to avoidance of indirect calls in CONFIG_RETPOLINE enabled builds. On i40e driver with XDP prog attached, a 20-26% speedup has been observed [0]. Aside from XDP, there are many other places later in the networking stack's critical path with similar switch-case processing. Rather than fixing every XDP-enabled driver and locations in stack by hand, it would be good to instead raise the limit where gcc would emit expensive indirect calls from the switch under retpolines and stick with the default as-is in case of !retpoline configured kernels. This would also have the advantage that for archs where this is not necessary, we let compiler select the underlying target optimization for these constructs and avoid potential slow-downs by if-else hand-rewrite. In case of gcc, this setting is controlled by case-values-threshold which has an architecture global default that selects 4 or 5 (latter if target does not have a case insn that compares the bounds) where some arch back ends like arm64 or s390 override it with their own target hooks, for example, in gcc commit db7a90aa0de5 ("S/390: Disable prediction of indirect branches") the threshold pretty much disables jump tables by limit of 20 under retpoline builds. Comparing gcc's and clang's default code generation on x86-64 under O2 level with retpoline build results in the following outcome for 5 switch cases: * gcc with -mindirect-branch=thunk-inline -mindirect-branch-register: # gdb -batch -ex 'disassemble dispatch' ./c-switch Dump of assembler code for function dispatch: 0x0000000000400be0 <+0>: cmp $0x4,%edi 0x0000000000400be3 <+3>: ja 0x400c35 <dispatch+85> 0x0000000000400be5 <+5>: lea 0x915f8(%rip),%rdx # 0x4921e4 0x0000000000400bec <+12>: mov %edi,%edi 0x0000000000400bee <+14>: movslq (%rdx,%rdi,4),%rax 0x0000000000400bf2 <+18>: add %rdx,%rax 0x0000000000400bf5 <+21>: callq 0x400c01 <dispatch+33> 0x0000000000400bfa <+26>: pause 0x0000000000400bfc <+28>: lfence 0x0000000000400bff <+31>: jmp 0x400bfa <dispatch+26> 0x0000000000400c01 <+33>: mov %rax,(%rsp) 0x0000000000400c05 <+37>: retq 0x0000000000400c06 <+38>: nopw %cs:0x0(%rax,%rax,1) 0x0000000000400c10 <+48>: jmpq 0x400c90 <fn_3> 0x0000000000400c15 <+53>: nopl (%rax) 0x0000000000400c18 <+56>: jmpq 0x400c70 <fn_2> 0x0000000000400c1d <+61>: nopl (%rax) 0x0000000000400c20 <+64>: jmpq 0x400c50 <fn_1> 0x0000000000400c25 <+69>: nopl (%rax) 0x0000000000400c28 <+72>: jmpq 0x400c40 <fn_0> 0x0000000000400c2d <+77>: nopl (%rax) 0x0000000000400c30 <+80>: jmpq 0x400cb0 <fn_4> 0x0000000000400c35 <+85>: push %rax 0x0000000000400c36 <+86>: callq 0x40dd80 <abort> End of assembler dump. * clang with -mretpoline emitting search tree: # gdb -batch -ex 'disassemble dispatch' ./c-switch Dump of assembler code for function dispatch: 0x0000000000400b30 <+0>: cmp $0x1,%edi 0x0000000000400b33 <+3>: jle 0x400b44 <dispatch+20> 0x0000000000400b35 <+5>: cmp $0x2,%edi 0x0000000000400b38 <+8>: je 0x400b4d <dispatch+29> 0x0000000000400b3a <+10>: cmp $0x3,%edi 0x0000000000400b3d <+13>: jne 0x400b52 <dispatch+34> 0x0000000000400b3f <+15>: jmpq 0x400c50 <fn_3> 0x0000000000400b44 <+20>: test %edi,%edi 0x0000000000400b46 <+22>: jne 0x400b5c <dispatch+44> 0x0000000000400b48 <+24>: jmpq 0x400c20 <fn_0> 0x0000000000400b4d <+29>: jmpq 0x400c40 <fn_2> 0x0000000000400b52 <+34>: cmp $0x4,%edi 0x0000000000400b55 <+37>: jne 0x400b66 <dispatch+54> 0x0000000000400b57 <+39>: jmpq 0x400c60 <fn_4> 0x0000000000400b5c <+44>: cmp $0x1,%edi 0x0000000000400b5f <+47>: jne 0x400b66 <dispatch+54> 0x0000000000400b61 <+49>: jmpq 0x400c30 <fn_1> 0x0000000000400b66 <+54>: push %rax 0x0000000000400b67 <+55>: callq 0x40dd20 <abort> End of assembler dump. For sake of comparison, clang without -mretpoline: # gdb -batch -ex 'disassemble dispatch' ./c-switch Dump of assembler code for function dispatch: 0x0000000000400b30 <+0>: cmp $0x4,%edi 0x0000000000400b33 <+3>: ja 0x400b57 <dispatch+39> 0x0000000000400b35 <+5>: mov %edi,%eax 0x0000000000400b37 <+7>: jmpq *0x492148(,%rax,8) 0x0000000000400b3e <+14>: jmpq 0x400bf0 <fn_0> 0x0000000000400b43 <+19>: jmpq 0x400c30 <fn_4> 0x0000000000400b48 <+24>: jmpq 0x400c10 <fn_2> 0x0000000000400b4d <+29>: jmpq 0x400c20 <fn_3> 0x0000000000400b52 <+34>: jmpq 0x400c00 <fn_1> 0x0000000000400b57 <+39>: push %rax 0x0000000000400b58 <+40>: callq 0x40dcf0 <abort> End of assembler dump. Raising the cases to a high number (e.g. 100) will still result in similar code generation pattern with clang and gcc as above, in other words clang generally turns off jump table emission by having an extra expansion pass under retpoline build to turn indirectbr instructions from their IR into switch instructions as a built-in -mno-jump-table lowering of a switch (in this case, even if IR input already contained an indirect branch). For gcc, adding --param=case-values-threshold=20 as in similar fashion as s390 in order to raise the limit for x86 retpoline enabled builds results in a small vmlinux size increase of only 0.13% (before=18,027,528 after=18,051,192). For clang this option is ignored due to i) not being needed as mentioned and ii) not having above cmdline parameter. Non-retpoline-enabled builds with gcc continue to use the default case-values-threshold setting, so nothing changes here. [0] https://lore.kernel.org/netdev/20190129095754.9390-1-bjorn.topel@gmail.com/ and "The Path to DPDK Speeds for AF_XDP", LPC 2018, networking track: - http://vger.kernel.org/lpc_net2018_talks/lpc18_pres_af_xdp_perf-v3.pdf - http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Björn Töpel <bjorn.topel@intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: netdev@vger.kernel.org Cc: David S. Miller <davem@davemloft.net> Cc: Magnus Karlsson <magnus.karlsson@intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/20190221221941.29358-1-daniel@iogearbox.net
2019-02-28x86/hyper-v: Fix definition of HV_MAX_FLUSH_REP_COUNTLan Tianyu1-1/+1
The max flush rep count of HvFlushGuestPhysicalAddressList hypercall is equal with how many entries of union hv_gpa_page_range can be populated into the input parameter page. The code lacks parenthesis around PAGE_SIZE - 2 * sizeof(u64) which results in bogus computations. Add them. Fixes: cc4edae4b924 ("x86/hyper-v: Add HvFlushGuestAddressList hypercall support") Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: kys@microsoft.com Cc: haiyangz@microsoft.com Cc: sthemmin@microsoft.com Cc: sashal@kernel.org Cc: bp@alien8.de Cc: hpa@zytor.com Cc: gregkh@linuxfoundation.org Cc: devel@linuxdriverproject.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190225143114.5149-1-Tianyu.Lan@microsoft.com
2019-02-28x86/Hyper-V: Set x2apic destination mode to physical when x2apic is availableLan Tianyu1-0/+12
Hyper-V doesn't provide irq remapping for IO-APIC. To enable x2apic, set x2apic destination mode to physcial mode when x2apic is available and Hyper-V IOMMU driver makes sure cpus assigned with IO-APIC irqs have 8-bit APIC id. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2019-02-28kernfs, sysfs, cgroup, intel_rdt: Support fs_contextDavid Howells2-69/+132
Make kernfs support superblock creation/mount/remount with fs_context. This requires that sysfs, cgroup and intel_rdt, which are built on kernfs, be made to support fs_context also. Notes: (1) A kernfs_fs_context struct is created to wrap fs_context and the kernfs mount parameters are moved in here (or are in fs_context). (2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra namespace tag parameter is passed in the context if desired (3) kernfs_free_fs_context() is provided as a destructor for the kernfs_fs_context struct, but for the moment it does nothing except get called in the right places. (4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to pass, but possibly this should be done anyway in case someone wants to add a parameter in future. (5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and the cgroup v1 and v2 mount parameters are all moved there. (6) cgroup1 parameter parsing error messages are now handled by invalf(), which allows userspace to collect them directly. (7) cgroup1 parameter cleanup is now done in the context destructor rather than in the mount/get_tree and remount functions. Weirdies: (*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held, but then uses the resulting pointer after dropping the locks. I'm told this is okay and needs commenting. (*) The cgroup refcount web. This really needs documenting. (*) cgroup2 only has one root? Add a suggestion from Thomas Gleixner in which the RDT enablement code is placed into its own function. [folded a leak fix from Andrey Vagin] Signed-off-by: David Howells <dhowells@redhat.com> cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> cc: Tejun Heo <tj@kernel.org> cc: Li Zefan <lizefan@huawei.com> cc: Johannes Weiner <hannes@cmpxchg.org> cc: cgroups@vger.kernel.org cc: fenghua.yu@intel.com Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28Merge tag 'perf-core-for-mingo-5.1-20190225' of ↵Ingo Molnar1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo: perf annotate: Wei Li: - Fix getting source line failure perf script: Andi Kleen: - Handle missing fields with -F +... perf data: Jiri Olsa: - Prep work to support per-cpu files in a directory. Intel PT: Adrian Hunter: - Improve thread_stack__no_call_return() - Hide x86 retpolines in thread stacks. - exported SQL viewer refactorings, new 'top calls' report.. Alexander Shishkin: - Copy parent's address filter offsets on clone - Fix address filters for vmas with non-zero offset. Applies to ARM's CoreSight as well. python scripts: Tony Jones: - Python3 support for several 'perf script' python scripts. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-28Merge branch 'linus' into perf/core, to pick up fixesIngo Molnar19-120/+156
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-28Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar20-126/+170
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-28crypto: x86/poly1305 - Clear key material from stack in SSE2 variantTommi Hirvola1-0/+4
1-block SSE2 variant of poly1305 stores variables s1..s4 containing key material on the stack. This commit adds missing zeroing of the stack memory. Benchmarks show negligible performance hit (tested on i7-3770). Signed-off-by: Tommi Hirvola <tommi@hirvola.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-02-27Merge tag 'y2038-syscall-abi' of ↵Thomas Gleixner2-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground into timers/2038 Pull additional syscall ABI cleanup for y2038 from Arnd Bergmann: This is a follow-up to the y2038 syscall patches already merged in the tip tree. As the final 32-bit RISC-V syscall ABI is still being decided on, this is the last chance to make a few corrections to leave out interfaces based on 32-bit time_t along with the old off_t and rlimit types. The series achieves this in a few steps: - A couple of bug fixes for minor regressions I introduced in the original series - A couple of older patches from Yury Norov that I had never merged in the past, these fix up the openat/open_by_handle_at and getrlimit/setrlimit syscalls to disallow the old versions of off_t and rlimit. - Hiding the deprecated system calls behind an #ifdef in include/uapi/asm-generic/unistd.h - Change arch/riscv to drop all these ABIs. Originally, the plan was to also leave these out on C-Sky, but that now has a glibc port that uses the older interfaces, so we need to leave them in place.
2019-02-25x86/mce: Improve error message when kernel cannot recover, p2Tony Luck1-0/+5
In c7d606f560e4 ("x86/mce: Improve error message when kernel cannot recover") a case was added for a machine check caused by a DATA access to poison memory from the kernel. A case should have been added also for an uncorrectable error during an instruction fetch in the kernel. Add that extra case so the error message now reads: mce: [Hardware Error]: Machine check: Instruction fetch error in kernel Fixes: c7d606f560e4 ("x86/mce: Improve error message when kernel cannot recover") Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Pu Wen <puwen@hygon.cn> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190225205940.15226-1-tony.luck@intel.com
2019-02-25x86/uaccess: Remove unused __addr_ok() macroBorislav Petkov1-3/+0
This was caught while staring at the whole {set,get}_fs() machinery. It's last user, the 32-bit version of strnlen_user() went away with 5723aa993d83 ("x86: use the new generic strnlen_user() function") so drop it. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: the arch/x86 maintainers <x86@kernel.org> Cc: "Tobin C. Harding" <tobin@kernel.org> Link: https://lkml.kernel.org/r/20190225191109.7671-1-bp@alien8.de
2019-02-25x86/uaccess: Don't leak the AC flag into __put_user() value evaluationAndy Lutomirski1-2/+4
When calling __put_user(foo(), ptr), the __put_user() macro would call foo() in between __uaccess_begin() and __uaccess_end(). If that code were buggy, then those bugs would be run without SMAP protection. Fortunately, there seem to be few instances of the problem in the kernel. Nevertheless, __put_user() should be fixed to avoid doing this. Therefore, evaluate __put_user()'s argument before setting AC. This issue was noticed when an objtool hack by Peter Zijlstra complained about genregs_get() and I compared the assembly output to the C source. [ bp: Massage commit message and fixed up whitespace. ] Fixes: 11f1a4b9755f ("x86: reorganize SMAP handling in user space accesses") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20190225125231.845656645@infradead.org
2019-02-25Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses"Linus Torvalds1-58/+0
This reverts commit 9da3f2b74054406f87dff7101a569217ffceb29b. It was well-intentioned, but wrong. Overriding the exception tables for instructions for random reasons is just wrong, and that is what the new code did. It caused problems for tracing, and it caused problems for strncpy_from_user(), because the new checks made perfectly valid use cases break, rather than catch things that did bad things. Unchecked user space accesses are a problem, but that's not a reason to add invalid checks that then people have to work around with silly flags (in this case, that 'kernel_uaccess_faults_ok' flag, which is just an odd way to say "this commit was wrong" and was sprinked into random places to hide the wrongness). The real fix to unchecked user space accesses is to get rid of the special "let's not check __get_user() and __put_user() at all" logic. Make __{get|put}_user() be just aliases to the regular {get|put}_user() functions, and make it impossible to access user space without having the proper checks in places. The raison d'être of the special double-underscore versions used to be that the range check was expensive, and if you did multiple user accesses, you'd do the range check up front (like the signal frame handling code, for example). But SMAP (on x86) and PAN (on ARM) have made that optimization pointless, because the _real_ expense is the "set CPU flag to allow user space access". Do let's not break the valid cases to catch invalid cases that shouldn't even exist. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Kees Cook <keescook@chromium.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-4/+20
Three conflicts, one of which, for marvell10g.c is non-trivial and requires some follow-up from Heiner or someone else. The issue is that Heiner converted the marvell10g driver over to use the generic c45 code as much as possible. However, in 'net' a bug fix appeared which makes sure that a new local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0 is cleared. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-22perf, pt, coresight: Fix address filters for vmas with non-zero offsetAlexander Shishkin1-4/+5
Currently, the address range calculation for file-based filters works as long as the vma that maps the matching part of the object file starts from offset zero into the file (vm_pgoff==0). Otherwise, the resulting filter range would be off by vm_pgoff pages. Another related problem is that in case of a partially matching vma, that is, a vma that matches part of a filter region, the filter range size wouldn't be adjusted. Fix the arithmetics around address filter range calculations, taking into account vma offset, so that the entire calculation is done before the filter configuration is passed to the PMU drivers instead of having those drivers do the final bit of arithmetics. Based on the patch by Adrian Hunter <adrian.hunter.intel.com>. Reported-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Olsa <jolsa@redhat.com> Fixes: 375637bc5249 ("perf/core: Introduce address range filtering") Link: http://lkml.kernel.org/r/20190215115655.63469-3-alexander.shishkin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-02-22KVM: MMU: record maximum physical address width in kvm_mmu_extended_roleYu Zhang2-0/+2
Previously, commit 7dcd57552008 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed") offered some optimization to avoid the unnecessary reconfiguration. Yet one scenario is broken - when cpuid changes VM's maximum physical address width, reconfiguration is needed to reset the reserved bits. Also, the TDP may need to reset its shadow_root_level when this value is changed. To fix this, a new field, maxphyaddr, is introduced in the extended role structure to keep track of the configured guest physical address width. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-22kvm: x86: Return LA57 feature based on hardware capabilityYu Zhang1-0/+4
Previously, 'commit 372fddf70904 ("x86/mm: Introduce the 'no5lvl' kernel parameter")' cleared X86_FEATURE_LA57 in boot_cpu_data, if Linux chooses to not run in 5-level paging mode. Yet boot_cpu_data is queried by do_cpuid_ent() as the host capability later when creating vcpus, and Qemu will not be able to detect this feature and create VMs with LA57 feature. As discussed earlier, VMs can still benefit from extended linear address width, e.g. to enhance features like ASLR. So we would like to fix this, by return the true hardware capability when Qemu queries. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-22x86/kvm/mmu: fix switch between root and guest MMUsVitaly Kuznetsov2-4/+14
Commit 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu") brought one subtle change: previously, when switching back from L2 to L1, we were resetting MMU hooks (like mmu->get_cr3()) in kvm_init_mmu() called from nested_vmx_load_cr3() and now we do that in nested_ept_uninit_mmu_context() when we re-target vcpu->arch.mmu pointer. The change itself looks logical: if nested_ept_init_mmu_context() changes something than nested_ept_uninit_mmu_context() restores it back. There is, however, one thing: the following call chain: nested_vmx_load_cr3() kvm_mmu_new_cr3() __kvm_mmu_new_cr3() fast_cr3_switch() cached_root_available() now happens with MMU hooks pointing to the new MMU (root MMU in our case) while previously it was happening with the old one. cached_root_available() tries to stash current root but it is incorrect to read current CR3 with mmu->get_cr3(), we need to use old_mmu->get_cr3() which in case we're switching from L2 to L1 is guest_mmu. (BTW, in shadow page tables case this is a non-issue because we don't switch MMU). While we could've tried to guess that we're switching between MMUs and call the right ->get_cr3() from cached_root_available() this seems to be overly complicated. Instead, just stash the corresponding CR3 when setting root_hpa and make cached_root_available() use the stashed value. Fixes: 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20x86: kvmguest: use TSC clocksource if invariant TSC is exposedMarcelo Tosatti1-0/+14
The invariant TSC bit has the following meaning: "The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource." IOW, TSC does not change frequency. In such case, and with TSC scaling hardware available to handle migration, it is possible to use the TSC clocksource directly, whose system calls are faster. Reduce the rating of kvmclock clocksource to allow TSC clocksource to be the default if invariant TSC is exposed. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> v2: Use feature bits and tsc_unstable() check (Sean Christopherson) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()Sean Christopherson1-23/+10
...via a new helper, __kvm_mmu_zap_all(). An alternative to passing a 'bool mmio_only' would be to pass a callback function to filter the shadow page, i.e. to make __kvm_mmu_zap_all() generic and reusable, but zapping all shadow pages is a last resort, i.e. making the helper less extensible is a feature of sorts. And the explicit MMIO parameter makes it easy to preserve the WARN_ON_ONCE() if a restart is triggered when zapping MMIO sptes. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping childrenSean Christopherson1-2/+5
Paolo expressed a concern that kvm_mmu_zap_mmio_sptes() could have a quadratic runtime[1], i.e. restarting the spte walk while zapping only MMIO sptes could result in re-walking large portions of the list over and over due to the non-MMIO sptes encountered before the restart not being removed. At the time, the concern was legitimate as the walk was restarted when any spte was zapped. But that is no longer the case as the walk is now restarted iff one or more children have been zapped, which is necessary because zapping children makes the active_mmu_pages list unstable. Furthermore, it should be impossible for an MMIO spte to have children, i.e. zapping an MMIO spte should never result in zapping children. In other words, kvm_mmu_zap_mmio_sptes() should never restart its walk, and so should always execute in linear time. WARN if this assertion fails. Although it should never be needed, leave the restart logic in place. In normal operation, the cost is at worst an extra CMP+Jcc, and if for some reason the list does become unstable, not restarting would likely crash KVM, or worse, the kernel. [1] https://patchwork.kernel.org/patch/10756589/#22452085 Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: x86/mmu: Differentiate between nr zapped and list unstableSean Christopherson1-10/+26
The return value of kvm_mmu_prepare_zap_page() has evolved to become overloaded to convey two separate pieces of information. 1) was at least one page zapped and 2) has the list of MMU pages become unstable. In it's original incarnation (as kvm_mmu_zap_page()), there was no return value at all. Commit 0738541396be ("KVM: MMU: awareness of new kvm_mmu_zap_page behaviour") added a return value in preparation for commit 4731d4c7a077 ("KVM: MMU: out of sync shadow core"). Although the return value was of type 'int', it was actually used as a boolean to indicate whether or not active_mmu_pages may have become unstable due to zapping children. Walking a list with list_for_each_entry_safe() only protects against deleting/moving the current entry, i.e. zapping a child page would break iteration due to modifying any number of entries. Later, commit 60c8aec6e2c9 ("KVM: MMU: use page array in unsync walk") modified mmu_zap_unsync_children() to return an approximation of the number of children zapped. This was not intentional, it was simply a side effect of how the code was written. The unintented side affect was then morphed into an actual feature by commit 77662e0028c7 ("KVM: MMU: fix kvm_mmu_zap_page() and its calling path"), which modified kvm_mmu_change_mmu_pages() to use the number of zapped pages when determining the number of MMU pages in use by the VM. Finally, commit 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed") added the initial page to the return value to make its behavior more consistent with what most users would expect. Incorporating the initial parent page in the return value of kvm_mmu_zap_page() breaks the original usage of restarting a list walk on a non-zero return value to handle a potentially unstable list, i.e. walks will unnecessarily restart when any page is zapped. Fix this by restoring the original behavior of kvm_mmu_zap_page(), i.e. return a boolean to indicate that the list may be unstable and move the number of zapped children to a dedicated parameter. Since the majority of callers to kvm_mmu_prepare_zap_page() don't care about either return value, preserve the current definition of kvm_mmu_prepare_zap_page() by making it a wrapper of a new helper, __kvm_mmu_prepare_zap_page(). This avoids having to update every call site and also provides cleaner code for functions that only care about the number of pages zapped. Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20Revert "KVM: MMU: fast invalidate all pages"Sean Christopherson3-100/+1
Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches from the original series[1], now that all users of the fast invalidate mechanism are gone. This reverts commit 5304b8d37c2a5ebca48330f5e7868d240eafbed1. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: x86/mmu: Voluntarily reschedule as needed when zapping all sptesSean Christopherson1-1/+2
Call cond_resched_lock() when zapping all sptes to reschedule if needed or to release and reacquire mmu_lock in case of contention. There is no need to flush or zap when temporarily dropping mmu_lock as zapping all sptes is done only when the owning userspace VMM has exited or when the VM is being destroyed, i.e. there is no interplay with memslots or MMIO generations to worry about. Be paranoid and restart the walk if mmu_lock is dropped to avoid any potential issues with consuming a stale iterator. The overhead in doing so is negligible as at worst there will be a few root shadow pages at the head of the list, i.e. the iterator is essentially the head of the list already. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: x86/mmu: skip over invalid root pages when zapping all sptesSean Christopherson1-1/+4
...to guarantee forward progress. When zapped, root pages are marked invalid and moved to the head of the active pages list until they are explicitly freed. Theoretically, having unzappable root pages at the head of the list could prevent kvm_mmu_zap_all() from making forward progress were a future patch to add a loop restart after processing a page, e.g. to drop mmu_lock on contention. Although kvm_mmu_prepare_zap_page() can theoretically take action on invalid pages, e.g. to zap unsync children, functionally it's not necessary (root pages will be re-zapped when freed) and practically speaking the odds of e.g. @unsync or @unsync_children becoming %true while zapping all pages is basically nil. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20Revert "KVM: x86: use the fast way to invalidate all pages"Sean Christopherson2-1/+16
Revert to a slow kvm_mmu_zap_all() for kvm_arch_flush_shadow_all(). Flushing all shadow entries is only done during VM teardown, i.e. kvm_arch_flush_shadow_all() is only called when the associated MM struct is being released or when the VM instance is being freed. Although the performance of teardown itself isn't critical, KVM should still voluntarily schedule to play nice with the rest of the kernel; but that can be done without the fast invalidate mechanism in a future patch. This reverts commit 6ca18b6950f8dee29361722f28f69847724b276f. Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints"Sean Christopherson1-12/+9
...as part of removing x86 KVM's fast invalidate mechanism, i.e. this is one part of a revert all patches from the series that introduced the mechanism[1]. This reverts commit 2248b023219251908aedda0621251cffc548f258. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages"Sean Christopherson2-22/+0
...as part of removing x86 KVM's fast invalidate mechanism, i.e. this is one part of a revert all patches from the series that introduced the mechanism[1]. This reverts commit 35006126f024f68727c67001b9cb703c38f69268. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20Revert "KVM: MMU: zap pages in batch"Sean Christopherson1-11/+24
Unwinding optimizations related to obsolete pages is a step towards removing x86 KVM's fast invalidate mechanism, i.e. this is one part of a revert all patches from the series that introduced the mechanism[1]. This reverts commit e7d11c7a894986a13817c1c001e1e7668c5c4eb4. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20Revert "KVM: MMU: collapse TLB flushes when zap all pages"Sean Christopherson1-28/+3
Unwinding optimizations related to obsolete pages is a step towards removing x86 KVM's fast invalidate mechanism, i.e. this is one part of a revert all patches from the series that introduced the mechanism[1]. This reverts commit f34d251d66ba263c077ed9d2bbd1874339a4c887. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20Revert "KVM: MMU: reclaim the zapped-obsolete page first"Sean Christopherson3-19/+4
Unwinding optimizations related to obsolete pages is a step towards removing x86 KVM's fast invalidate mechanism, i.e. this is one part of a revert all patches from the series that introduced the mechanism[1]. This reverts commit 365c886860c4ba670d245e762b23987c912c129a. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: x86/mmu: Remove is_obsolete() callSean Christopherson1-5/+1
Unwinding usage of is_obsolete() is a step towards removing x86's fast invalidate mechanism, i.e. this is one part of a revert all patches from the series that introduced the mechanism[1]. This is a partial revert of commit 05988d728dcd ("KVM: MMU: reduce KVM_REQ_MMU_RELOAD when root page is zapped"). [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptesSean Christopherson1-1/+2
Call cond_resched_lock() when zapping MMIO to reschedule if needed or to release and reacquire mmu_lock in case of contention. There is no need to flush or zap when temporarily dropping mmu_lock as zapping MMIO sptes is done when holding the memslots lock and with the "update in-progress" bit set in the memslots generation, which disables MMIO spte caching. The walk does need to be restarted if mmu_lock is dropped as the active pages list may be modified. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>