aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-05-17tls: don't use stack memory in a scatterlistMatt Mullins2-5/+7
scatterlist code expects virt_to_page() to work, which fails with CONFIG_VMAP_STACK=y. Fixes: c46234ebb4d1e ("tls: RX path for ktls") Signed-off-by: Matt Mullins <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds18-87/+157
Pull kvm fixes from Paolo Bonzini: - ARM/ARM64 locking fixes - x86 fixes: PCID, UMIP, locking - improved support for recent Windows version that have a 2048 Hz APIC timer - rename KVM_HINTS_DEDICATED CPUID bit to KVM_HINTS_REALTIME - better behaved selftests * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIME KVM: arm/arm64: VGIC/ITS save/restore: protect kvm_read_guest() calls KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock KVM: arm/arm64: VGIC/ITS: Promote irq_lock() in update_affinity KVM: arm/arm64: Properly protect VGIC locks from IRQs KVM: X86: Lower the default timer frequency limit to 200us KVM: vmx: update sec exec controls for UMIP iff emulating UMIP kvm: x86: Suppress CR3_PCID_INVD bit only when PCIDs are enabled KVM: selftests: exit with 0 status code when tests cannot be run KVM: hyperv: idr_find needs RCU protection x86: Delay skip of emulated hypercall instruction KVM: Extend MAX_IRQ_ROUTES to 4096 for all archs
2018-05-17Merge tag 'kvm-s390-master-4.17-1' of ↵Paolo Bonzini1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-master KVM: s390: Fix vsie handling for transactional diagnostic block vsie (nested KVM) might reject a valid input. Fix it.
2018-05-17Merge tag 'sound-4.17-rc6' of ↵Linus Torvalds5-3/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "We have a core fix in the compat code for covering a potential race (double references), but it's a very minor change. The rest are all small device-specific quirks, as well as a correction of the new UAC3 support code" * tag 'sound-4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: usb-audio: Use Class Specific EP for UAC3 devices. ALSA: hda/realtek - Clevo P950ER ALC1220 Fixup ALSA: usb: mixer: volume quirk for CM102-A+/102S+ ALSA: hda: Add Lenovo C50 All in one to the power_save blacklist ALSA: control: fix a redundant-copy issue
2018-05-17kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIMEMichael S. Tsirkin3-8/+8
KVM_HINTS_DEDICATED seems to be somewhat confusing: Guest doesn't really care whether it's the only task running on a host CPU as long as it's not preempted. And there are more reasons for Guest to be preempted than host CPU sharing, for example, with memory overcommit it can get preempted on a memory access, post copy migration can cause preemption, etc. Let's call it KVM_HINTS_REALTIME which seems to better match what guests expect. Also, the flag most be set on all vCPUs - current guests assume this. Note so in the documentation. Signed-off-by: Michael S. Tsirkin <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-05-17Merge branch 'for-linus' of ↵Linus Torvalds23-167/+422
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Martin Schwidefsky: - a fix for the vfio ccw translation code - update an incorrect email address in the MAINTAINERS file - fix a division by zero oops in the cpum_sf code found by trinity - two fixes for the error handling of the qdio code - several spectre related patches to convert all left-over indirect branches in the kernel to expoline branches - update defconfigs to avoid warnings due to the netfilter Kconfig changes - avoid several compiler warnings in the kexec_file code for s390 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/qdio: don't release memory in qdio_setup_irq() s390/qdio: fix access to uninitialized qdio_q fields s390/cpum_sf: ensure sample frequency of perf event attributes is non-zero s390: use expoline thunks in the BPF JIT s390: extend expoline to BC instructions s390: remove indirect branch from do_softirq_own_stack s390: move spectre sysfs attribute code s390/kernel: use expoline for indirect branches s390/ftrace: use expoline for indirect branches s390/lib: use expoline for indirect branches s390/crc32-vx: use expoline for indirect branches s390: move expoline assembler macros to a header vfio: ccw: fix cleanup if cp_prefetch fails s390/kexec_file: add declaration of purgatory related globals s390: update defconfigs MAINTAINERS: update s390 zcrypt maintainers email address
2018-05-17Merge tag 'selinux-pr-20180516' of ↵Linus Torvalds1-22/+28
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull SELinux fixes from Paul Moore: "A small pull request to fix a few regressions in the SELinux/SCTP code with applications that call bind() with AF_UNSPEC/INADDR_ANY. The individual commit descriptions have more information, but the commits themselves should be self explanatory" * tag 'selinux-pr-20180516' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: selinux: correctly handle sa_family cases in selinux_sctp_bind_connect() selinux: fix address family in bind() and connect() to match address/port selinux: add AF_UNSPEC and INADDR_ANY checks to selinux_socket_bind()
2018-05-17proc: do not access cmdline nor environ from file-backed areasWilly Tarreau3-4/+8
proc_pid_cmdline_read() and environ_read() directly access the target process' VM to retrieve the command line and environment. If this process remaps these areas onto a file via mmap(), the requesting process may experience various issues such as extra delays if the underlying device is slow to respond. Let's simply refuse to access file-backed areas in these functions. For this we add a new FOLL_ANON gup flag that is passed to all calls to access_remote_vm(). The code already takes care of such failures (including unmapped areas). Accesses via /proc/pid/mem were not changed though. This was assigned CVE-2018-1120. Note for stable backports: the patch may apply to kernels prior to 4.11 but silently miss one location; it must be checked that no call to access_remote_vm() keeps zero as the last argument. Reported-by: Qualys Security Advisory <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: [email protected] Signed-off-by: Willy Tarreau <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-05-17bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=nColy Li1-1/+3
Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()") returns the return value of debugfs_create_dir() to bcache_init(). When CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes bcache_init() failedi. This patch makes bch_debug_init() always returns 0 if CONFIG_DEBUG_FS=n, so bcache can continue to work for the kernels which don't have debugfs enanbled. Changelog: v4: Add Acked-by from Kent Overstreet. v3: Use IS_ENABLED(CONFIG_DEBUG_FS) to replace #ifdef DEBUG_FS. v2: Remove a warning information v1: Initial version. Fixes: Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()") Cc: [email protected] Signed-off-by: Coly Li <[email protected]> Reported-by: Massimo B. <[email protected]> Reported-by: Kai Krakow <[email protected]> Tested-by: Kai Krakow <[email protected]> Acked-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-17KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBDTom Lendacky6-18/+50
Expose the new virtualized architectural mechanism, VIRT_SSBD, for using speculative store bypass disable (SSBD) under SVM. This will allow guests to use SSBD on hardware that uses non-architectural mechanisms for enabling SSBD. [ tglx: Folded the migration fixup from Paolo Bonzini ] Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2018-05-17x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFGThomas Gleixner2-0/+36
Add the necessary logic for supporting the emulated VIRT_SPEC_CTRL MSR to x86_virt_spec_ctrl(). If either X86_FEATURE_LS_CFG_SSBD or X86_FEATURE_VIRT_SPEC_CTRL is set then use the new guest_virt_spec_ctrl argument to check whether the state must be modified on the host. The update reuses speculative_store_bypass_update() so the ZEN-specific sibling coordination can be reused. Signed-off-by: Thomas Gleixner <[email protected]>
2018-05-17x86/bugs: Rework spec_ctrl base and mask logicThomas Gleixner1-7/+19
x86_spec_ctrL_mask is intended to mask out bits from a MSR_SPEC_CTRL value which are not to be modified. However the implementation is not really used and the bitmask was inverted to make a check easier, which was removed in "x86/bugs: Remove x86_spec_ctrl_set()" Aside of that it is missing the STIBP bit if it is supported by the platform, so if the mask would be used in x86_virt_spec_ctrl() then it would prevent a guest from setting STIBP. Add the STIBP bit if supported and use the mask in x86_virt_spec_ctrl() to sanitize the value which is supplied by the guest. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]>
2018-05-17x86/bugs: Remove x86_spec_ctrl_set()Thomas Gleixner2-13/+2
x86_spec_ctrl_set() is only used in bugs.c and the extra mask checks there provide no real value as both call sites can just write x86_spec_ctrl_base to MSR_SPEC_CTRL. x86_spec_ctrl_base is valid and does not need any extra masking or checking. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/bugs: Expose x86_spec_ctrl_base directlyThomas Gleixner3-24/+6
x86_spec_ctrl_base is the system wide default value for the SPEC_CTRL MSR. x86_spec_ctrl_get_default() returns x86_spec_ctrl_base and was intended to prevent modification to that variable. Though the variable is read only after init and globaly visible already. Remove the function and export the variable instead. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}Borislav Petkov2-49/+44
Function bodies are very similar and are going to grow more almost identical code. Add a bool arg to determine whether SPEC_CTRL is being set for the guest or restored to the host. No functional changes. Signed-off-by: Borislav Petkov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/speculation: Rework speculative_store_bypass_update()Thomas Gleixner3-4/+9
The upcoming support for the virtual SPEC_CTRL MSR on AMD needs to reuse speculative_store_bypass_update() to avoid code duplication. Add an argument for supplying a thread info (TIF) value and create a wrapper speculative_store_bypass_update_current() which is used at the existing call site. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/speculation: Add virtualized speculative store bypass disable supportTom Lendacky4-2/+18
Some AMD processors only support a non-architectural means of enabling speculative store bypass disable (SSBD). To allow a simplified view of this to a guest, an architectural definition has been created through a new CPUID bit, 0x80000008_EBX[25], and a new MSR, 0xc001011f. With this, a hypervisor can virtualize the existence of this definition and provide an architectural method for using SSBD to a guest. Add the new CPUID feature, the new MSR and update the existing SSBD support to use this MSR when present. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]>
2018-05-17x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRLThomas Gleixner4-9/+35
AMD is proposing a VIRT_SPEC_CTRL MSR to handle the Speculative Store Bypass Disable via MSR_AMD64_LS_CFG so that guests do not have to care about the bit position of the SSBD bit and thus facilitate migration. Also, the sibling coordination on Family 17H CPUs can only be done on the host. Extend x86_spec_ctrl_set_guest() and x86_spec_ctrl_restore_host() with an extra argument for the VIRT_SPEC_CTRL MSR. Hand in 0 from VMX and in SVM add a new virt_spec_ctrl member to the CPU data structure which is going to be used in later patches for the actual implementation. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/speculation: Handle HT correctly on AMDThomas Gleixner3-6/+130
The AMD64_LS_CFG MSR is a per core MSR on Family 17H CPUs. That means when hyperthreading is enabled the SSBD bit toggle needs to take both cores into account. Otherwise the following situation can happen: CPU0 CPU1 disable SSB disable SSB enable SSB <- Enables it for the Core, i.e. for CPU0 as well So after the SSB enable on CPU1 the task on CPU0 runs with SSB enabled again. On Intel the SSBD control is per core as well, but the synchronization logic is implemented behind the per thread SPEC_CTRL MSR. It works like this: CORE_SPEC_CTRL = THREAD0_SPEC_CTRL | THREAD1_SPEC_CTRL i.e. if one of the threads enables a mitigation then this affects both and the mitigation is only disabled in the core when both threads disabled it. Add the necessary synchronization logic for AMD family 17H. Unfortunately that requires a spinlock to serialize the access to the MSR, but the locks are only shared between siblings. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/cpufeatures: Add FEATURE_ZENThomas Gleixner2-0/+2
Add a ZEN feature bit so family-dependent static_cpu_has() optimizations can be built for ZEN. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/cpufeatures: Disentangle SSBD enumerationThomas Gleixner6-16/+14
The SSBD enumeration is similarly to the other bits magically shared between Intel and AMD though the mechanisms are different. Make X86_FEATURE_SSBD synthetic and set it depending on the vendor specific features or family dependent setup. Change the Intel bit to X86_FEATURE_SPEC_CTRL_SSBD to denote that SSBD is controlled via MSR_SPEC_CTRL and fix up the usage sites. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRSThomas Gleixner4-9/+20
The availability of the SPEC_CTRL MSR is enumerated by a CPUID bit on Intel and implied by IBRS or STIBP support on AMD. That's just confusing and in case an AMD CPU has IBRS not supported because the underlying problem has been fixed but has another bit valid in the SPEC_CTRL MSR, the thing falls apart. Add a synthetic feature bit X86_FEATURE_MSR_SPEC_CTRL to denote the availability on both Intel and AMD. While at it replace the boot_cpu_has() checks with static_cpu_has() where possible. This prevents late microcode loading from exposing SPEC_CTRL, but late loading is already very limited as it does not reevaluate the mitigation options and other bits and pieces. Having static_cpu_has() is the simplest and least fragile solution. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
2018-05-17x86/speculation: Use synthetic bits for IBRS/IBPB/STIBPBorislav Petkov5-23/+26
Intel and AMD have different CPUID bits hence for those use synthetic bits which get set on the respective vendor's in init_speculation_control(). So that debacles like what the commit message of c65732e4f721 ("x86/cpu: Restore CPUID_8000_0008_EBX reload") talks about don't happen anymore. Signed-off-by: Borislav Petkov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Tested-by: Jörg Otte <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-05-17KVM: SVM: Move spec control call after restore of GSThomas Gleixner1-12/+12
svm_vcpu_run() invokes x86_spec_ctrl_restore_host() after VMEXIT, but before the host GS is restored. x86_spec_ctrl_restore_host() uses 'current' to determine the host SSBD state of the thread. 'current' is GS based, but host GS is not yet restored and the access causes a triple fault. Move the call after the host GS restore. Fixes: 885f82bfbc6f x86/process: Allow runtime control of Speculative Store Bypass Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Acked-by: Paolo Bonzini <[email protected]>
2018-05-18powerpc/powernv: Fix NVRAM sleep in invalid context when crashingNicholas Piggin1-2/+12
Similarly to opal_event_shutdown, opal_nvram_write can be called in the crash path with irqs disabled. Special case the delay to avoid sleeping in invalid context. Fixes: 3b8070335f75 ("powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops") Cc: [email protected] # v3.2 Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2018-05-17MAINTAINERS: add entry for STM32 I2C driverPierre-Yves MORDRET1-0/+6
Add I2C/SMBUS Driver entry for STM32 family from ST Microelectronics. Signed-off-by: Pierre-Yves MORDRET <[email protected]> Signed-off-by: Wolfram Sang <[email protected]>
2018-05-17Makefile: disable PIE before testing asm gotoMichal Kubecek1-2/+3
Since commit e501ce957a78 ("x86: Force asm-goto"), aarch64 build on distributions which enable PIE by default (e.g. openSUSE Tumbleweed) does not detect support for asm goto correctly. The problem is that ARM specific part of scripts/gcc-goto.sh fails with PIE even with recent gcc versions. Moving the asm goto detection up in Makefile put it before the place where we disable PIE. As a result, kernel is built without jump label support. Move the lines disabling PIE before the asm goto test to make it work. Fixes: e501ce957a78 ("x86: Force asm-goto") Reported-by: Andreas Faerber <[email protected]> Signed-off-by: Michal Kubecek <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2018-05-17kbuild: gcov: enable -fno-tree-loop-im if supportedNick Desaulniers1-1/+3
Clang does not recognize this compiler option. Reported-by: Prasad Sodagudi <[email protected]> Signed-off-by: Nick Desaulniers <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2018-05-17btrfs: fix crash when trying to resume balance without the resume flagAnand Jain1-0/+9
We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance() only, which isn't called during the remount. So when resuming from the paused balance we hit the bug: kernel: kernel BUG at fs/btrfs/volumes.c:3890! :: kernel: balance_kthread+0x51/0x60 [btrfs] kernel: kthread+0x111/0x130 :: kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8 Reproducer: On a mounted filesystem: btrfs balance start --full-balance /btrfs btrfs balance pause /btrfs mount -o remount,ro /dev/sdb /btrfs mount -o remount,rw /dev/sdb /btrfs To fix this set the BTRFS_BALANCE_RESUME flag in btrfs_resume_balance_async(). CC: [email protected] # 4.4+ Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-17btrfs: Fix delalloc inodes invalidation during transaction abortNikolay Borisov1-11/+15
When a transaction is aborted btrfs_cleanup_transaction is called to cleanup all the various in-flight bits and pieces which migth be active. One of those is delalloc inodes - inodes which have dirty pages which haven't been persisted yet. Currently the process of freeing such delalloc inodes in exceptional circumstances such as transaction abort boiled down to calling btrfs_invalidate_inodes whose sole job is to invalidate the dentries for all inodes related to a root. This is in fact wrong and insufficient since such delalloc inodes will likely have pending pages or ordered-extents and will be linked to the sb->s_inode_list. This means that unmounting a btrfs instance with an aborted transaction could potentially lead inodes/their pages visible to the system long after their superblock has been freed. This in turn leads to a "use-after-free" situation once page shrink is triggered. This situation could be simulated by running generic/019 which would cause such inodes to be left hanging, followed by generic/176 which causes memory pressure and page eviction which lead to touching the freed super block instance. This situation is additionally detected by the unmount code of VFS with the following message: "VFS: Busy inodes after unmount of Self-destruct in 5 seconds. Have a nice day..." Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree)); in free_fs_root for the same reason. This patch aims to rectify the sitaution by doing the following: 1. Change btrfs_destroy_delalloc_inodes so that it calls invalidate_inode_pages2 for every inode on the delalloc list, this ensures that all the pages of the inode are released. This function boils down to calling btrfs_releasepage. During test I observed cases where inodes on the delalloc list were having an i_count of 0, so this necessitates using igrab to be sure we are working on a non-freed inode. 2. Since calling btrfs_releasepage might queue delayed iputs move the call out to btrfs_cleanup_transaction in btrfs_error_commit_super before calling run_delayed_iputs for the last time. This is necessary to ensure that delayed iputs are run. Note: this patch is tagged for 4.14 stable but the fix applies to older versions too but needs to be backported manually due to conflicts. CC: [email protected] # 4.14.x: 2b8773313494: btrfs: Split btrfs_del_delalloc_inode into 2 functions CC: [email protected] # 4.14.x Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> [ add comment to igrab ] Signed-off-by: David Sterba <[email protected]>
2018-05-17btrfs: Split btrfs_del_delalloc_inode into 2 functionsNikolay Borisov2-3/+12
This is in preparation of fixing delalloc inodes leakage on transaction abort. Also export the new function. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-17btrfs: fix reading stale metadata blocks after degraded raid1 mountsLiu Bo1-3/+3
If a btree block, aka. extent buffer, is not available in the extent buffer cache, it'll be read out from the disk instead, i.e. btrfs_search_slot() read_block_for_search() # hold parent and its lock, go to read child btrfs_release_path() read_tree_block() # read child Unfortunately, the parent lock got released before reading child, so commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had used 0 as parent transid to read the child block. It forces read_tree_block() not to check if parent transid is different with the generation id of the child that it reads out from disk. A simple PoC is included in btrfs/124, 0. A two-disk raid1 btrfs, 1. Right after mkfs.btrfs, block A is allocated to be device tree's root. 2. Mount this filesystem and put it in use, after a while, device tree's root got COW but block A hasn't been allocated/overwritten yet. 3. Umount it and reload the btrfs module to remove both disks from the global @fs_devices list. 4. mount -odegraded dev1 and write some data, so now block A is allocated to be a leaf in checksum tree. Note that only dev1 has the latest metadata of this filesystem. 5. Umount it and mount it again normally (with both disks), since raid1 can pick up one disk by the writer task's pid, if btrfs_search_slot() needs to read block A, dev2 which does NOT have the latest metadata might be read for block A, then we got a stale block A. 6. As parent transid is not checked, block A is marked as uptodate and put into the extent buffer cache, so the future search won't bother to read disk again, which means it'll make changes on this stale one and make it dirty and flush it onto disk. To avoid the problem, parent transid needs to be passed to read_tree_block(). In order to get a valid parent transid, we need to hold the parent's lock until finishing reading child. This patch needs to be slightly adapted for stable kernels, the &first_key parameter added to read_tree_block() is from 4.16+ (581c1760415c4). The fix is to replace 0 by 'gen'. Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race") CC: [email protected] # 4.4+ Signed-off-by: Liu Bo <[email protected]> Reviewed-by: Filipe Manana <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> [ update changelog ] Signed-off-by: David Sterba <[email protected]>
2018-05-17btrfs: property: Set incompat flag if lzo/zstd compression is setMisono Tomohiro1-4/+8
Incompat flag of LZO/ZSTD compression should be set at: 1. mount time (-o compress/compress-force) 2. when defrag is done 3. when property is set Currently 3. is missing and this commit adds this. This could lead to a filesystem that uses ZSTD but is not marked as such. If a kernel without a ZSTD support encounteres a ZSTD compressed extent, it will handle that but this could be confusing to the user. Typically the filesystem is mounted with the ZSTD option, but the discrepancy can arise when a filesystem is never mounted with ZSTD and then the property on some file is set (and some new extents are written). A simple mount with -o compress=zstd will fix that up on an unpatched kernel. Same goes for LZO, but this has been around for a very long time (2.6.37) so it's unlikely that a pre-LZO kernel would be used. Fixes: 5c1aab1dd544 ("btrfs: Add zstd support") CC: [email protected] # 4.14+ Signed-off-by: Tomohiro Misono <[email protected]> Reviewed-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> [ add user visible impact ] Signed-off-by: David Sterba <[email protected]>
2018-05-17Btrfs: fix duplicate extents after fsync of file with prealloc extentsFilipe Manana1-25/+112
In commit 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay"), on fsync, we started to always log all prealloc extents beyond an inode's i_size in order to avoid losing them after a power failure. However under some cases this can lead to the log replay code to create duplicate extent items, with different lengths, in the extent tree. That happens because, as of that commit, we can now log extent items based on extent maps that are not on the "modified" list of extent maps of the inode's extent map tree. Logging extent items based on extent maps is used during the fast fsync path to save time and for this to work reliably it requires that the extent maps are not merged with other adjacent extent maps - having the extent maps in the list of modified extents gives such guarantee. Consider the following example, captured during a long run of fsstress, which illustrates this problem. We have inode 271, in the filesystem tree (root 5), for which all of the following operations and discussion apply to. A buffered write starts at offset 312391 with a length of 933471 bytes (end offset at 1245862). At this point we have, for this inode, the following extent maps with the their field values: em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0 em B, start 40960, orig_start 40960, len 376832, block_start 1106399232, block_len 376832, orig_block_len 376832 em C, start 417792, orig_start 417792, len 782336, block_start 18446744073709551613, block_len 0, orig_block_len 0 em D, start 1200128, orig_start 1200128, len 835584, block_start 1106776064, block_len 835584, orig_block_len 835584 em E, start 2035712, orig_start 2035712, len 245760, block_start 1107611648, block_len 245760, orig_block_len 245760 Extent map A corresponds to a hole and extent maps D and E correspond to preallocated extents. Extent map D ends where extent map E begins (1106776064 + 835584 = 1107611648), but these extent maps were not merged because they are in the inode's list of modified extent maps. An fsync against this inode is made, which triggers the fast path (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback of the data previously written using buffered IO, and when the respective ordered extent finishes, btrfs_drop_extents() is called against the (aligned) range 311296..1249279. This causes a split of extent map D at btrfs_drop_extent_cache(), replacing extent map D with a new extent map D', also added to the list of modified extents, with the following values: em D', start 1249280, orig_start of 1200128, block_start 1106825216 (= 1106776064 + 1249280 - 1200128), orig_block_len 835584, block_len 786432 (835584 - (1249280 - 1200128)) Then, during the fast fsync, btrfs_log_changed_extents() is called and extent maps D' and E are removed from the list of modified extents. The flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged clear_em_logging() is called on each of them, and that makes extent map E to be merged with extent map D' (try_merge_map()), resulting in D' being deleted and E adjusted to: em E, start 1249280, orig_start 1200128, len 1032192, block_start 1106825216, block_len 1032192, orig_block_len 245760 A direct IO write at offset 1847296 and length of 360448 bytes (end offset at 2207744) starts, and at that moment the following extent maps exist for our inode: em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0 em B, start 40960, orig_start 40960, len 270336, block_start 1106399232, block_len 270336, orig_block_len 376832 em C, start 311296, orig_start 311296, len 937984, block_start 1112842240, block_len 937984, orig_block_len 937984 em E (prealloc), start 1249280, orig_start 1200128, len 1032192, block_start 1106825216, block_len 1032192, orig_block_len 245760 The dio write results in drop_extent_cache() being called twice. The first time for a range that starts at offset 1847296 and ends at offset 2035711 (length of 188416), which results in a double split of extent map E, replacing it with two new extent maps: em F, start 1249280, orig_start 1200128, block_start 1106825216, block_len 598016, orig_block_len 598016 em G, start 2035712, orig_start 1200128, block_start 1107611648, block_len 245760, orig_block_len 1032192 It also creates a new extent map that represents a part of the requested IO (through create_io_em()): em H, start 1847296, len 188416, block_start 1107423232, block_len 188416 The second call to drop_extent_cache() has a range with a start offset of 2035712 and end offset of 2207743 (length of 172032). This leads to replacing extent map G with a new extent map I with the following values: em I, start 2207744, orig_start 1200128, block_start 1107783680, block_len 73728, orig_block_len 1032192 It also creates a new extent map that represents the second part of the requested IO (through create_io_em()): em J, start 2035712, len 172032, block_start 1107611648, block_len 172032 The dio write set the inode's i_size to 2207744 bytes. After the dio write the inode has the following extent maps: em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0 em B, start 40960, orig_start 40960, len 270336, block_start 1106399232, block_len 270336, orig_block_len 376832 em C, start 311296, orig_start 311296, len 937984, block_start 1112842240, block_len 937984, orig_block_len 937984 em F, start 1249280, orig_start 1200128, len 598016, block_start 1106825216, block_len 598016, orig_block_len 598016 em H, start 1847296, orig_start 1200128, len 188416, block_start 1107423232, block_len 188416, orig_block_len 835584 em J, start 2035712, orig_start 2035712, len 172032, block_start 1107611648, block_len 172032, orig_block_len 245760 em I, start 2207744, orig_start 1200128, len 73728, block_start 1107783680, block_len 73728, orig_block_len 1032192 Now do some change to the file, like adding a xattr for example and then fsync it again. This triggers a fast fsync path, and as of commit 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay"), we use the extent map I to log a file extent item because it's a prealloc extent and it starts at an offset matching the inode's i_size. However when we log it, we create a file extent item with a value for the disk byte location that is wrong, as can be seen from the following output of "btrfs inspect-internal dump-tree": item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53 generation 22 type 2 (prealloc) prealloc data disk byte 1106776064 nr 1032192 prealloc data offset 1007616 nr 73728 Here the disk byte value corresponds to calculation based on some fields from the extent map I: 1106776064 = block_start (1107783680) - 1007616 (extent_offset) extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616 The disk byte value of 1106776064 clashes with disk byte values of the file extent items at offsets 1249280 and 1847296 in the fs tree: item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53 generation 20 type 2 (prealloc) prealloc data disk byte 1106776064 nr 835584 prealloc data offset 49152 nr 598016 item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53 generation 20 type 1 (regular) extent data disk byte 1106776064 nr 835584 extent data offset 647168 nr 188416 ram 835584 extent compression 0 (none) item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53 generation 20 type 1 (regular) extent data disk byte 1107611648 nr 245760 extent data offset 0 nr 172032 ram 245760 extent compression 0 (none) item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53 generation 20 type 2 (prealloc) prealloc data disk byte 1107611648 nr 245760 prealloc data offset 172032 nr 73728 Instead of the disk byte value of 1106776064, the value of 1107611648 should have been logged. Also the data offset value should have been 172032 and not 1007616. After a log replay we end up getting two extent items in the extent tree with different lengths, one of 835584, which is correct and existed before the log replay, and another one of 1032192 which is wrong and is based on the logged file extent item: item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53 refs 2 gen 15 flags DATA extent data backref root 5 objectid 271 offset 1200128 count 2 item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53 refs 1 gen 22 flags DATA extent data backref root 5 objectid 271 offset 1200128 count 1 Obviously this leads to many problems and a filesystem check reports many errors: (...) checking extents Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1 extent item 1106776064 has multiple extent items ref mismatch on [1106776064 835584] extent item 2, found 3 Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680 Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70 Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192 backpointer mismatch on [1106776064 835584] checking free space cache block group 1103101952 has wrong amount of free space failed to load free space cache for block group 1103101952 checking fs roots (...) So fix this by logging the prealloc extents beyond the inode's i_size based on searches in the subvolume tree instead of the extent maps. Fixes: 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay") CC: [email protected] # 4.14+ Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-05-17netfilter: ebtables: handle string from userspace with carePaolo Abeni1-1/+2
strlcpy() can't be safely used on a user-space provided string, as it can try to read beyond the buffer's end, if the latter is not NULL terminated. Leveraging the above, syzbot has been able to trigger the following splat: BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300 [inline] BUG: KASAN: stack-out-of-bounds in compat_mtw_from_user net/bridge/netfilter/ebtables.c:1957 [inline] BUG: KASAN: stack-out-of-bounds in ebt_size_mwt net/bridge/netfilter/ebtables.c:2059 [inline] BUG: KASAN: stack-out-of-bounds in size_entry_mwt net/bridge/netfilter/ebtables.c:2155 [inline] BUG: KASAN: stack-out-of-bounds in compat_copy_entries+0x96c/0x14a0 net/bridge/netfilter/ebtables.c:2194 Write of size 33 at addr ffff8801b0abf888 by task syz-executor0/4504 CPU: 0 PID: 4504 Comm: syz-executor0 Not tainted 4.17.0-rc2+ #40 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412 check_memory_region_inline mm/kasan/kasan.c:260 [inline] check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267 memcpy+0x37/0x50 mm/kasan/kasan.c:303 strlcpy include/linux/string.h:300 [inline] compat_mtw_from_user net/bridge/netfilter/ebtables.c:1957 [inline] ebt_size_mwt net/bridge/netfilter/ebtables.c:2059 [inline] size_entry_mwt net/bridge/netfilter/ebtables.c:2155 [inline] compat_copy_entries+0x96c/0x14a0 net/bridge/netfilter/ebtables.c:2194 compat_do_replace+0x483/0x900 net/bridge/netfilter/ebtables.c:2285 compat_do_ebt_set_ctl+0x2ac/0x324 net/bridge/netfilter/ebtables.c:2367 compat_nf_sockopt net/netfilter/nf_sockopt.c:144 [inline] compat_nf_setsockopt+0x9b/0x140 net/netfilter/nf_sockopt.c:156 compat_ip_setsockopt+0xff/0x140 net/ipv4/ip_sockglue.c:1279 inet_csk_compat_setsockopt+0x97/0x120 net/ipv4/inet_connection_sock.c:1041 compat_tcp_setsockopt+0x49/0x80 net/ipv4/tcp.c:2901 compat_sock_common_setsockopt+0xb4/0x150 net/core/sock.c:3050 __compat_sys_setsockopt+0x1ab/0x7c0 net/compat.c:403 __do_compat_sys_setsockopt net/compat.c:416 [inline] __se_compat_sys_setsockopt net/compat.c:413 [inline] __ia32_compat_sys_setsockopt+0xbd/0x150 net/compat.c:413 do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline] do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394 entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139 RIP: 0023:0xf7fb3cb9 RSP: 002b:00000000fff0c26c EFLAGS: 00000282 ORIG_RAX: 000000000000016e RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000000000 RDX: 0000000000000080 RSI: 0000000020000300 RDI: 00000000000005f4 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 The buggy address belongs to the page: page:ffffea0006c2afc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x2fffc0000000000() raw: 02fffc0000000000 0000000000000000 0000000000000000 00000000ffffffff raw: 0000000000000000 ffffea0006c20101 0000000000000000 0000000000000000 page dumped because: kasan: bad access detected Fix the issue replacing the unsafe function with strscpy() and taking care of possible errors. Fixes: 81e675c227ec ("netfilter: ebtables: add CONFIG_COMPAT support") Reported-and-tested-by: [email protected] Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-05-17netfilter: nf_tables: fix NULL pointer dereference on nft_ct_helper_obj_dump()Taehee Yoo1-8/+12
In the nft_ct_helper_obj_dump(), always priv->helper4 is dereferenced. But if family is ipv6, priv->helper6 should be dereferenced. Steps to reproduces: #test.nft table ip6 filter { ct helper ftp { type "ftp" protocol tcp } chain input { type filter hook input priority 4; ct helper set "ftp" } } %nft -f test.nft %nft list ruleset we can see the below messages: [ 916.286233] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 916.294777] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 916.302613] Modules linked in: nft_objref nf_conntrack_sip nf_conntrack_snmp nf_conntrack_broadcast nf_conntrack_ftp nft_ct nf_conntrack nf_tables nfnetlink [last unloaded: nfnetlink] [ 916.318758] CPU: 1 PID: 2093 Comm: nft Not tainted 4.17.0-rc4+ #181 [ 916.326772] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015 [ 916.338773] RIP: 0010:strlen+0x1a/0x90 [ 916.342781] RSP: 0018:ffff88010ff0f2f8 EFLAGS: 00010292 [ 916.346773] RAX: dffffc0000000000 RBX: ffff880119b26ee8 RCX: ffff88010c150038 [ 916.354777] RDX: 0000000000000002 RSI: ffff880119b26ee8 RDI: 0000000000000010 [ 916.362773] RBP: 0000000000000010 R08: 0000000000007e88 R09: ffff88010c15003c [ 916.370773] R10: ffff88010c150037 R11: ffffed002182a007 R12: ffff88010ff04040 [ 916.378779] R13: 0000000000000010 R14: ffff880119b26f30 R15: ffff88010ff04110 [ 916.387265] FS: 00007f57a1997700(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000 [ 916.394785] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 916.402778] CR2: 00007f57a0ac80f0 CR3: 000000010ff02000 CR4: 00000000001006e0 [ 916.410772] Call Trace: [ 916.414787] nft_ct_helper_obj_dump+0x94/0x200 [nft_ct] [ 916.418779] ? nft_ct_set_eval+0x560/0x560 [nft_ct] [ 916.426771] ? memset+0x1f/0x40 [ 916.426771] ? __nla_reserve+0x92/0xb0 [ 916.434774] ? memcpy+0x34/0x50 [ 916.434774] nf_tables_fill_obj_info+0x484/0x860 [nf_tables] [ 916.442773] ? __nft_release_basechain+0x600/0x600 [nf_tables] [ 916.450779] ? lock_acquire+0x193/0x380 [ 916.454771] ? lock_acquire+0x193/0x380 [ 916.458789] ? nf_tables_dump_obj+0x148/0xcb0 [nf_tables] [ 916.462777] nf_tables_dump_obj+0x5f0/0xcb0 [nf_tables] [ 916.470769] ? __alloc_skb+0x30b/0x500 [ 916.474779] netlink_dump+0x752/0xb50 [ 916.478775] __netlink_dump_start+0x4d3/0x750 [ 916.482784] nf_tables_getobj+0x27a/0x930 [nf_tables] [ 916.490774] ? nft_obj_notify+0x100/0x100 [nf_tables] [ 916.494772] ? nf_tables_getobj+0x930/0x930 [nf_tables] [ 916.502579] ? nf_tables_dump_flowtable_done+0x70/0x70 [nf_tables] [ 916.506774] ? nft_obj_notify+0x100/0x100 [nf_tables] [ 916.514808] nfnetlink_rcv_msg+0x8ab/0xa86 [nfnetlink] [ 916.518771] ? nfnetlink_rcv_msg+0x550/0xa86 [nfnetlink] [ 916.526782] netlink_rcv_skb+0x23e/0x360 [ 916.530773] ? nfnetlink_bind+0x200/0x200 [nfnetlink] [ 916.534778] ? debug_check_no_locks_freed+0x280/0x280 [ 916.542770] ? netlink_ack+0x870/0x870 [ 916.546786] ? ns_capable_common+0xf4/0x130 [ 916.550765] nfnetlink_rcv+0x172/0x16c0 [nfnetlink] [ 916.554771] ? sched_clock_local+0xe2/0x150 [ 916.558774] ? sched_clock_cpu+0x144/0x180 [ 916.566575] ? lock_acquire+0x380/0x380 [ 916.570775] ? sched_clock_local+0xe2/0x150 [ 916.574765] ? nfnetlink_net_init+0x130/0x130 [nfnetlink] [ 916.578763] ? sched_clock_cpu+0x144/0x180 [ 916.582770] ? lock_acquire+0x193/0x380 [ 916.590771] ? lock_acquire+0x193/0x380 [ 916.594766] ? lock_acquire+0x380/0x380 [ 916.598760] ? netlink_deliver_tap+0x262/0xa60 [ 916.602766] ? lock_acquire+0x193/0x380 [ 916.606766] netlink_unicast+0x3ef/0x5a0 [ 916.610771] ? netlink_attachskb+0x630/0x630 [ 916.614763] netlink_sendmsg+0x72a/0xb00 [ 916.618769] ? netlink_unicast+0x5a0/0x5a0 [ 916.626766] ? _copy_from_user+0x92/0xc0 [ 916.630773] __sys_sendto+0x202/0x300 [ 916.634772] ? __ia32_sys_getpeername+0xb0/0xb0 [ 916.638759] ? lock_acquire+0x380/0x380 [ 916.642769] ? lock_acquire+0x193/0x380 [ 916.646761] ? finish_task_switch+0xf4/0x560 [ 916.650763] ? __schedule+0x582/0x19a0 [ 916.655301] ? __sched_text_start+0x8/0x8 [ 916.655301] ? up_read+0x1c/0x110 [ 916.655301] ? __do_page_fault+0x48b/0xaa0 [ 916.655301] ? entry_SYSCALL_64_after_hwframe+0x59/0xbe [ 916.655301] __x64_sys_sendto+0xdd/0x1b0 [ 916.655301] do_syscall_64+0x96/0x3d0 [ 916.655301] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 916.655301] RIP: 0033:0x7f57a0ff5e03 [ 916.655301] RSP: 002b:00007fff6367e0a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 916.655301] RAX: ffffffffffffffda RBX: 00007fff6367f1e0 RCX: 00007f57a0ff5e03 [ 916.655301] RDX: 0000000000000020 RSI: 00007fff6367e110 RDI: 0000000000000003 [ 916.655301] RBP: 00007fff6367e100 R08: 00007f57a0ce9160 R09: 000000000000000c [ 916.655301] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff6367e110 [ 916.655301] R13: 0000000000000020 R14: 00007f57a153c610 R15: 0000562417258de0 [ 916.655301] Code: ff ff ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 fa 53 48 c1 ea 03 48 b8 00 00 00 00 00 fc ff df 48 89 fd 48 83 ec 08 <0f> b6 04 02 48 89 fa 83 e2 07 38 d0 7f [ 916.655301] RIP: strlen+0x1a/0x90 RSP: ffff88010ff0f2f8 [ 916.771929] ---[ end trace 1065e048e72479fe ]--- [ 916.777204] Kernel panic - not syncing: Fatal exception [ 916.778158] Kernel Offset: 0x14000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Signed-off-by: Taehee Yoo <[email protected]> Acked-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-05-17dmaengine: qcom: bam_dma: check if the runtime pm enabledSrinivas Kandagatla1-5/+13
Disabling pm runtime at probe is not sufficient to get BAM working on remotely controller instances. pm_runtime_get_sync() would return -EACCES in such cases. So check if runtime pm is enabled before returning error from bam functions. Fixes: 5b4a68952a89 ("dmaengine: qcom: bam_dma: disable runtime pm on remote controlled") Signed-off-by: Srinivas Kandagatla <[email protected]> Signed-off-by: Vinod Koul <[email protected]>
2018-05-17KVM: s390: vsie: fix < 8k check for the itdbaDavid Hildenbrand1-1/+1
By missing an "L", we might detect some addresses to be <8k, although they are not. e.g. for itdba = 100001fff !(gpa & ~0x1fffU) -> 1 !(gpa & ~0x1fffUL) -> 0 So we would report a SIE validity intercept although everything is fine. Fixes: 166ecb3 ("KVM: s390: vsie: support transactional execution") Reported-by: Dan Carpenter <[email protected]> Reviewed-by: Christian Borntraeger <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Reviewed-by: Cornelia Huck <[email protected]> Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Janosch Frank <[email protected]> Cc: [email protected] # v4.8+ Signed-off-by: Christian Borntraeger <[email protected]>
2018-05-17KVM: PPC: Book 3S HV: Do ptesync in radix guest exit pathPaul Mackerras1-0/+8
A radix guest can execute tlbie instructions to invalidate TLB entries. After a tlbie or a group of tlbies, it must then do the architected sequence eieio; tlbsync; ptesync to ensure that the TLB invalidation has been processed by all CPUs in the system before it can rely on no CPU using any translation that it just invalidated. In fact it is the ptesync which does the actual synchronization in this sequence, and hardware has a requirement that the ptesync must be executed on the same CPU thread as the tlbies which it is expected to order. Thus, if a vCPU gets moved from one physical CPU to another after it has done some tlbies but before it can get to do the ptesync, the ptesync will not have the desired effect when it is executed on the second physical CPU. To fix this, we do a ptesync in the exit path for radix guests. If there are any pending tlbies, this will wait for them to complete. If there aren't, then ptesync will just do the same as sync. Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S HV: XIVE: Resend re-routed interrupts on CPU priority changeBenjamin Herrenschmidt1-7/+101
When a vcpu priority (CPPR) is set to a lower value (masking more interrupts), we stop processing interrupts already in the queue for the priorities that have now been masked. If those interrupts were previously re-routed to a different CPU, they might still be stuck until the older one that has them in its queue processes them. In the case of guest CPU unplug, that can be never. To address that without creating additional overhead for the normal interrupt processing path, this changes H_CPPR handling so that when such a priority change occurs, we scan the interrupt queue for that vCPU, and for any interrupt in there that has been re-routed, we replace it with a dummy and force a re-trigger. Signed-off-by: Benjamin Herrenschmidt <[email protected]> Tested-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S HV: Make radix clear pte when unmappingNicholas Piggin1-1/+1
The current partition table unmap code clears the _PAGE_PRESENT bit out of the pte, which leaves pud_huge/pmd_huge true and does not clear pud_present/pmd_present. This can confuse subsequent page faults and possibly lead to the guest looping doing continual hypervisor page faults. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S HV: Make radix use correct tlbie sequence in ↵Nicholas Piggin1-2/+2
kvmppc_radix_tlbie_page The standard eieio ; tlbsync ; ptesync must follow tlbie to ensure it is ordered with respect to subsequent operations. Signed-off-by: Nicholas Piggin <[email protected]> Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17KVM: PPC: Book3S HV: Snapshot timebase offset on guest entryPaul Mackerras4-45/+47
Currently, the HV KVM guest entry/exit code adds the timebase offset from the vcore struct to the timebase on guest entry, and subtracts it on guest exit. Which is fine, except that it is possible for userspace to change the offset using the SET_ONE_REG interface while the vcore is running, as there is only one timebase offset per vcore but potentially multiple VCPUs in the vcore. If that were to happen, KVM would subtract a different offset on guest exit from that which it had added on guest entry, leading to the timebase being out of sync between cores in the host, which then leads to bad things happening such as hangs and spurious watchdog timeouts. To fix this, we add a new field 'tb_offset_applied' to the vcore struct which stores the offset that is currently applied to the timebase. This value is set from the vcore tb_offset field on guest entry, and is what is subtracted from the timebase on guest exit. Since it is zero when the timebase offset is not applied, we can simplify the logic in kvmhv_start_timing and kvmhv_accumulate_time. In addition, we had secondary threads reading the timebase while running concurrently with code on the primary thread which would eventually add or subtract the timebase offset from the timebase. This occurred while saving or restoring the DEC register value on the secondary threads. Although no specific incorrect behaviour has been observed, this is a race which should be fixed. To fix it, we move the DEC saving code to just before we call kvmhv_commence_exit, and the DEC restoring code to after the point where we have waited for the primary thread to switch the MMU context and add the timebase offset. That way we are sure that the timebase contains the guest timebase value in both cases. Signed-off-by: Paul Mackerras <[email protected]>
2018-05-17Merge branch 'vmwgfx-fixes-4.17' of ↵Dave Airlie1-0/+2
git://people.freedesktop.org/~thomash/linux into drm-fixes A single fix for a recent regression. * 'vmwgfx-fixes-4.17' of git://people.freedesktop.org/~thomash/linux: drm/vmwgfx: Set dmabuf_size when vmw_dmabuf_init is successful
2018-05-17Merge tag 'drm-misc-fixes-2018-05-16' of ↵Dave Airlie3-4/+6
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes - core: Fix regression in dev node offsets (Haneen) - vc4: Fix memory leak on driver close (Eric) - dumb-buffers: Prevent overflow in DIV_ROUND_UP() (Dan) Cc: Haneen Mohammed <[email protected]> Cc: Eric Anholt <[email protected]> Cc: Dan Carpenter <[email protected]> * tag 'drm-misc-fixes-2018-05-16' of git://anongit.freedesktop.org/drm/drm-misc: drm/dumb-buffers: Integer overflow in drm_mode_create_ioctl() drm/vc4: Fix leak of the file_priv that stored the perfmon. drm: Match sysfs name in link removal to link creation
2018-05-16Merge tag 'trace-v4.17-rc4-2' of ↵Linus Torvalds3-22/+2
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Some of the ftrace internal events use a zero for a data size of a field event. This is increasingly important for the histogram trigger work that is being extended. While auditing trace events, I found that a couple of the xen events were used as just marking that a function was called, by creating a static array of size zero. This can play havoc with the tracing features if these events are used, because a zero size of a static array is denoted as a special nul terminated dynamic array (this is what the trace_marker code uses). But since the xen events have no size, they are not nul terminated, and unexpected results may occur. As trace events were never intended on being a marker to denote that a function was hit or not, especially since function tracing and kprobes can trivially do the same, the best course of action is to simply remove these events" * tag 'trace-v4.17-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/x86/xen: Remove zero data size trace events trace_xen_mmu_flush_tlb{_all}
2018-05-16afs: Fix mounting of backup volumesMarc Dionne1-9/+10
In theory the AFS_VLSF_BACKVOL flag for a server in a vldb entry would indicate the presence of a backup volume on that server. In practice however, this flag is never set, and the presence of a backup volume is implied by the entry having AFS_VLF_BACKEXISTS set, for the server that hosts the read-write volume (has AFS_VLSF_RWVOL). Signed-off-by: Marc Dionne <[email protected]> Signed-off-by: David Howells <[email protected]>
2018-05-16afs: Fix directory permissions checkDavid Howells1-7/+3
Doing faccessat("/afs/some/directory", 0) triggers a BUG in the permissions check code. Fix this by just removing the BUG section. If no permissions are asked for, just return okay if the file exists. Also: (1) Split up the directory check so that it has separate if-statements rather than if-else-if (e.g. checking for MAY_EXEC shouldn't skip the check for MAY_READ and MAY_WRITE). (2) Check for MAY_CHDIR as MAY_EXEC. Without the main fix, the following BUG may occur: kernel BUG at fs/afs/security.c:386! invalid opcode: 0000 [#1] SMP PTI ... RIP: 0010:afs_permission+0x19d/0x1a0 [kafs] ... Call Trace: ? inode_permission+0xbe/0x180 ? do_faccessat+0xdc/0x270 ? do_syscall_64+0x60/0x1f0 ? entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 00d3b7a4533e ("[AFS]: Add security support.") Reported-by: Jonathan Billings <[email protected]> Signed-off-by: David Howells <[email protected]>
2018-05-16tuntap: fix use after free during releaseJason Wang1-1/+1
After commit b196d88aba8a ("tun: fix use after free for ptr_ring") we need clean up tx ring during release(). But unfortunately, it tries to do the cleanup blindly after socket were destroyed which will lead another use-after-free. Fix this by doing the cleanup before dropping the last reference of the socket in __tun_detach(). Reported-by: Andrei Vagin <[email protected]> Acked-by: Andrei Vagin <[email protected]> Fixes: b196d88aba8a ("tun: fix use after free for ptr_ring") Signed-off-by: Jason Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-16Merge branch 'qed-LL2-fixes'David S. Miller1-11/+50
Michal Kalderon says: ==================== qed: LL2 fixes This series fixes some issues in ll2 related to synchronization and resource freeing ==================== Signed-off-by: Ariel Elior <[email protected]> Signed-off-by: Michal Kalderon <[email protected]> Signed-off-by: David S. Miller <[email protected]>