aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2016-04-28mm/hwpoison: fix wrong num_poisoned_pages accountingMinchan Kim1-1/+7
Currently, migration code increses num_poisoned_pages on *failed* migration page as well as successfully migrated one at the trial of memory-failure. It will make the stat wrong. As well, it marks the page as PG_HWPoison even if the migration trial failed. It would mean we cannot recover the corrupted page using memory-failure facility. This patches fixes it. Signed-off-by: Minchan Kim <[email protected]> Reported-by: Vlastimil Babka <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-04-28mm: call swap_slot_free_notify() with page lock heldMinchan Kim1-1/+5
Kyeongdon reported below error which is BUG_ON(!PageSwapCache(page)) in page_swap_info. The reason is that page_endio in rw_page unlocks the page if read I/O is completed so we need to hold a PG_lock again to check PageSwapCache. Otherwise, the page can be removed from swapcache. Kernel BUG at c00f9040 [verbose debug info unavailable] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM Modules linked in: CPU: 4 PID: 13446 Comm: RenderThread Tainted: G W 3.10.84-g9f14aec-dirty #73 task: c3b73200 ti: dd192000 task.ti: dd192000 PC is at page_swap_info+0x10/0x2c LR is at swap_slot_free_notify+0x18/0x6c pc : [<c00f9040>] lr : [<c00f5560>] psr: 400f0113 sp : dd193d78 ip : c2deb1e4 fp : da015180 r10: 00000000 r9 : 000200da r8 : c120fe08 r7 : 00000000 r6 : 00000000 r5 : c249a6c0 r4 : = c249a6c0 r3 : 00000000 r2 : 40080009 r1 : 200f0113 r0 : = c249a6c0 ..<snip> .. Call Trace: page_swap_info+0x10/0x2c swap_slot_free_notify+0x18/0x6c swap_readpage+0x90/0x11c read_swap_cache_async+0x134/0x1ac swapin_readahead+0x70/0xb0 handle_pte_fault+0x320/0x6fc handle_mm_fault+0xc0/0xf0 do_page_fault+0x11c/0x36c do_DataAbort+0x34/0x118 Fixes: 3f2b1a04f44933f2 ("zram: revive swap_slot_free_notify") Signed-off-by: Minchan Kim <[email protected]> Tested-by: Kyeongdon Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-04-28mm: vmscan: reclaim highmem zone if buffer_heads is over limitMinchan Kim1-1/+1
We have been reclaimed highmem zone if buffer_heads is over limit but commit 6b4f7799c6a5 ("mm: vmscan: invoke slab shrinkers from shrink_zone()") changed the behavior so it doesn't reclaim highmem zone although buffer_heads is over the limit. This patch restores the logic. Fixes: 6b4f7799c6a5 ("mm: vmscan: invoke slab shrinkers from shrink_zone()") Signed-off-by: Minchan Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-04-28numa: fix /proc/<pid>/numa_maps for THPGerald Schaefer3-3/+72
In gather_pte_stats() a THP pmd is cast into a pte, which is wrong because the layouts may differ depending on the architecture. On s390 this will lead to inaccurate numa_maps accounting in /proc because of misguided pte_present() and pte_dirty() checks on the fake pte. On other architectures pte_present() and pte_dirty() may work by chance, but there may be an issue with direct-access (dax) mappings w/o underlying struct pages when HAVE_PTE_SPECIAL is set and THP is available. In vm_normal_page() the fake pte will be checked with pte_special() and because there is no "special" bit in a pmd, this will always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be skipped. On dax mappings w/o struct pages, an invalid struct page pointer would then be returned that can crash the kernel. This patch fixes the numa_maps THP handling by introducing new "_pmd" variants of the can_gather_numa_stats() and vm_normal_page() functions. Signed-off-by: Gerald Schaefer <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jerome Marchand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Dan Williams <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Michael Holzheu <[email protected]> Cc: <[email protected]> [4.3+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-04-28mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA checkKonstantin Khlebnikov1-4/+2
Khugepaged detects own VMAs by checking vm_file and vm_ops but this way it cannot distinguish private /dev/zero mappings from other special mappings like /dev/hpet which has no vm_ops and popultes PTEs in mmap. This fixes false-positive VM_BUG_ON and prevents installing THP where they are not expected. Link: http://lkml.kernel.org/r/CACT4Y+ZmuZMV5CjSFOeXviwQdABAgT7T+StKfTqan9YDtgEi5g@mail.gmail.com Fixes: 78f11a255749 ("mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups") Signed-off-by: Konstantin Khlebnikov <[email protected]> Reported-by: Dmitry Vyukov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: stable <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-04-28mailmap: fix Krzysztof Kozlowski's misspelled nameKrzysztof Kozlowski1-0/+1
Patchwork introduced a garbled Polish character in commit 1e3012d0fdc5 ("crypto: s5p-sss - Use memcpy_toio for iomem annotated memory") so fix the mail mapping. Additionally prefer to use kernel.org account for personal work, instead of my gmail address. Signed-off-by: Krzysztof Kozlowski <[email protected]> Cc: Herbert Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-04-28thp: keep huge zero page pinned until tlb flushKirill A. Shutemov3-3/+13
Andrea has found[1] a race condition on MMU-gather based TLB flush vs split_huge_page() or shrinker which frees huge zero under us (patch 1/2 and 2/2 respectively). With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the page pinned until flush is complete and the pin prevents the page from being split under us. We still need patch 2/2. This is simplified version of Andrea's patch. We don't need fancy encoding. [1] http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Reported-by: Andrea Arcangeli <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-04-28mm: exclude HugeTLB pages from THP page_mapped() logicSteve Capper1-0/+2
HugeTLB pages cannot be split, so we use the compound_mapcount to track rmaps. Currently page_mapped() will check the compound_mapcount, but will also go through the constituent pages of a THP compound page and query the individual _mapcount's too. Unfortunately, page_mapped() does not distinguish between HugeTLB and THP compound pages and assumes that a compound page always needs to have HPAGE_PMD_NR pages querying. For most cases when dealing with HugeTLB this is just inefficient, but for scenarios where the HugeTLB page size is less than the pmd block size (e.g. when using contiguous bit on ARM) this can lead to crashes. This patch adjusts the page_mapped function such that we skip the unnecessary THP reference checks for HugeTLB pages. Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() for compound pages") Signed-off-by: Steve Capper <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Will Deacon <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-04-28kexec: export OFFSET(page.compound_head) to find out compound tail pageAtsushi Kumagai1-0/+1
PageAnon() always look at head page to check PAGE_MAPPING_ANON and tail page's page->mapping has just a poisoned data since commit 1c290f642101 ("mm: sanitize page->mapping for tail pages"). If makedumpfile checks page->mapping of a compound tail page to distinguish anonymous page as usual, it must fail in newer kernel. So it's necessary to export OFFSET(page.compound_head) to avoid checking compound tail pages. The problem is that unnecessary hugepages won't be removed from a dump file in kernels 4.5.x and later. This means that extra disk space would be consumed. It's a problem, but not critical. Signed-off-by: Atsushi Kumagai <[email protected]> Acked-by: Dave Young <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Vivek Goyal <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-04-28kexec: update VMCOREINFO for compound_order/dtorAtsushi Kumagai1-2/+4
makedumpfile refers page.lru.next to get the order of compound pages for page filtering. However, now the order is stored in page.compound_order, hence VMCOREINFO should be updated to export the offset of page.compound_order. The fact is, page.compound_order was introduced already in kernel 4.0, but the offset of it was the same as page.lru.next until kernel 4.3, so this was not actual problem. The above can be said also for page.lru.prev and page.compound_dtor, it's necessary to detect hugetlbfs pages. Further, the content was changed from direct address to the ID which means dtor. The problem is that unnecessary hugepages won't be removed from a dump file in kernels 4.4.x and later. This means that extra disk space would be consumed. It's a problem, but not critical. Signed-off-by: Atsushi Kumagai <[email protected]> Acked-by: Dave Young <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Vivek Goyal <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-04-28Merge branch 'for-linus' of ↵Linus Torvalds10-92/+87
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fixes from Sage Weil: "There is a lifecycle fix in the auth code, a fix for a narrow race condition on map, and a helpful message in the log when there is a feature mismatch (which happens frequently now that the default server-side options have changed)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: report unsupported features to syslog rbd: fix rbd map vs notify races libceph: make authorizer destruction independent of ceph_auth_client
2016-04-28Merge branch 'for-linus' of ↵Linus Torvalds10-88/+79
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Martin Schwidefsky: "Three more bug fixes for 4.6 - Due to a race in the dynamic page table code a multi-threaded program can cause a translation specification exception. With panic_on_oops a user space program can crash the system. - An information leak with the /dev/sclp device. - A use after free in the s390 PCI code" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/sclp_ctl: fix potential information leak with /dev/sclp s390/mm: fix asce_bits handling with dynamic pagetable levels s390/pci: fix use after free in dma_init
2016-04-28RDMA/nes: don't leak skb if carrier downFlorian Westphal1-3/+0
Alternatively one could free the skb, OTOH I don't think this test is useful so just remove it. Cc: <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2016-04-28drm/vmwgfx: Fix order of operationSinclair Yeh1-3/+3
mode->hdisplay * (var->bits_per_pixel + 7) gets evaluated before the division, potentially making the pitch larger than it should be. Since the original intention is to do a div-round-up, just use the macro instead. Signed-off-by: Sinclair Yeh <[email protected]> Reviewed-by: Thomas Hellstrom <[email protected]>
2016-04-28drm/vmwgfx: use vmw_cmd_dx_cid_check for query commands.Charmaine Lee1-4/+4
Instead of calling vmw_cmd_ok, call vmw_cmd_dx_cid_check to validate the context id for query commands. Signed-off-by: Charmaine Lee <[email protected]> Reviewed-by: Sinclair Yeh <[email protected]>
2016-04-28drm/vmwgfx: Enable SVGA_3D_CMD_DX_SET_PREDICATIONCharmaine Lee1-1/+1
Fixes piglit tests nv_conditional_render-* crashes. Signed-off-by: Charmaine Lee <[email protected]> Reviewed-by: Brian Paul <[email protected]> Reviewed-by: Sinclair Yeh <[email protected]>
2016-04-28IB/security: Restrict use of the write() interfaceJason Gunthorpe7-1/+40
The drivers/infiniband stack uses write() as a replacement for bi-directional ioctl(). This is not safe. There are ways to trigger write calls that result in the return structure that is normally written to user space being shunted off to user specified kernel memory instead. For the immediate repair, detect and deny suspicious accesses to the write API. For long term, update the user space libraries and the kernel API to something that doesn't present the same security vulnerabilities (likely a structured ioctl() interface). The impacted uAPI interfaces are generally only available if hardware from drivers/infiniband is installed in the system. Reported-by: Jann Horn <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> [ Expanded check to all known write() entry points ] Cc: [email protected] Signed-off-by: Doug Ledford <[email protected]>
2016-04-28IB/hfi1: Use kernel default llseek for ui deviceDean Luick1-23/+2
The ui device llseek had a mistake with SEEK_END and did not fully follow seek semantics. Correct all this by using a kernel supplied function for fixed size devices. Cc: Al Viro <[email protected]> Reviewed-by: Dennis Dalessandro <[email protected]> Signed-off-by: Dean Luick <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2016-04-28IB/hfi1: Don't attempt to free resources if initialization failedMitko Haralanov2-33/+29
Attempting to free resources which have not been allocated and initialized properly led to the following kernel backtrace: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffffa09658fe>] unlock_exp_tids.isra.8+0x2e/0x120 [hfi1] PGD 852a43067 PUD 85d4a6067 PMD 0 Oops: 0000 [#1] SMP CPU: 0 PID: 2831 Comm: osu_bw Tainted: G IO 3.12.18-wfr+ #1 task: ffff88085b15b540 ti: ffff8808588fe000 task.ti: ffff8808588fe000 RIP: 0010:[<ffffffffa09658fe>] [<ffffffffa09658fe>] unlock_exp_tids.isra.8+0x2e/0x120 [hfi1] RSP: 0018:ffff8808588ffde0 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff880858a31800 RCX: 0000000000000000 RDX: ffff88085d971bc0 RSI: ffff880858a318f8 RDI: ffff880858a318c0 RBP: ffff8808588ffe20 R08: 0000000000000000 R09: 0000000000000000 R10: ffff88087ffd6f40 R11: 0000000001100348 R12: ffff880852900000 R13: ffff880858a318c0 R14: 0000000000000000 R15: ffff88085d971be8 FS: 00007f4674e83740(0000) GS:ffff88087f400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000085c377000 CR4: 00000000001407f0 Stack: ffffffffa0941a71 ffff880858a318f8 ffff88085d971bc0 ffff880858a31800 ffff880852900000 ffff880858a31800 00000000003ffff7 ffff88085d971bc0 ffff8808588ffe60 ffffffffa09663fc ffff8808588ffe60 ffff880858a31800 Call Trace: [<ffffffffa0941a71>] ? find_mmu_handler+0x51/0x70 [hfi1] [<ffffffffa09663fc>] hfi1_user_exp_rcv_free+0x6c/0x120 [hfi1] [<ffffffffa0932809>] hfi1_file_close+0x1a9/0x340 [hfi1] [<ffffffff8116c189>] __fput+0xe9/0x270 [<ffffffff8116c35e>] ____fput+0xe/0x10 [<ffffffff81065707>] task_work_run+0xa7/0xe0 [<ffffffff81002969>] do_notify_resume+0x59/0x80 [<ffffffff814ffc1a>] int_signal+0x12/0x17 This commit re-arranges the context initialization code in a way that would allow for context event flags to be used to determine whether the context has been successfully initialized. In turn, this can be used to skip the resource de-allocation if they were never allocated in the first place. Fixes: 3abb33ac6521 ("staging/hfi1: Add TID cache receive init and free funcs") Reviewed-by: Dennis Dalessandro <[email protected]> Signed-off-by: Mitko Haralanov <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]. Signed-off-by: Doug Ledford <[email protected]>
2016-04-28IB/hfi1: Fix missing lock/unlock in verbs drain callbackMike Marciniszyn1-0/+2
The iowait_sdma_drained() callback lacked locking to protect the qp s_flags field. This causes the s_flags to be out of sync on multiple CPUs, potentially corrupting the s_flags. Fixes: a545f5308b6c ("staging/rdma/hfi: fix CQ completion order issue") Reviewed-by: Sebastian Sanchez <[email protected]> Signed-off-by: Mike Marciniszyn <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2016-04-28IB/rdmavt: Fix send schedulingJubin John1-2/+2
call_send is used to determine whether to send immediately or schedule a send for later. The current logic in rdmavt is inverted and has a negative impact on the latency of the hfi1 and qib drivers. Fix this regression by correctly calling send immediately when call_send is set. Reviewed-by: Dennis Dalessandro <[email protected]> Reviewed-by: Mike Marciniszyn <[email protected]> Signed-off-by: Jubin John <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2016-04-28IB/hfi1: Prevent unpinning of wrong pagesMitko Haralanov1-5/+8
The routine used by the SDMA cache to handle already cached nodes can extend an already existing node. In its error handling code, the routine will unpin pages when not all pages of the buffer extension were pinned. There was a bug in that part of the routine, which would mistakenly unpin pages from the original set rather than the newly pinned pages. This commit fixes that bug by offsetting the page array to the proper place pointing at the beginning of the newly pinned pages. Reviewed-by: Dean Luick <[email protected]> Signed-off-by: Mitko Haralanov <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2016-04-28IB/hfi1: Fix deadlock caused by locking with wrong scopeMitko Haralanov1-5/+11
The locking around the interval RB tree is designed to prevent access to the tree while it's being modified. The locking in its current form is too overzealous, which is causing a deadlock in certain cases with the following backtrace: Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0 CPU: 0 PID: 5836 Comm: IMB-MPI1 Tainted: G O 3.12.18-wfr+ #1 0000000000000000 ffff88087f206c50 ffffffff814f1caa ffffffff817b53f0 ffff88087f206cc8 ffffffff814ecd56 0000000000000010 ffff88087f206cd8 ffff88087f206c78 0000000000000000 0000000000000000 0000000000001662 Call Trace: <NMI> [<ffffffff814f1caa>] dump_stack+0x45/0x56 [<ffffffff814ecd56>] panic+0xc2/0x1cb [<ffffffff810d4370>] ? restart_watchdog_hrtimer+0x50/0x50 [<ffffffff810d4432>] watchdog_overflow_callback+0xc2/0xd0 [<ffffffff81109b4e>] __perf_event_overflow+0x8e/0x2b0 [<ffffffff8110a714>] perf_event_overflow+0x14/0x20 [<ffffffff8101c906>] intel_pmu_handle_irq+0x1b6/0x390 [<ffffffff814f927b>] perf_event_nmi_handler+0x2b/0x50 [<ffffffff814f8ad8>] nmi_handle.isra.3+0x88/0x180 [<ffffffff814f8d39>] do_nmi+0x169/0x310 [<ffffffff814f8177>] end_repeat_nmi+0x1e/0x2e [<ffffffff81272600>] ? unmap_single+0x30/0x30 [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40 [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40 [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40 <<EOE>> <IRQ> [<ffffffffa056c4a8>] hfi1_mmu_rb_search+0x38/0x70 [hfi1] [<ffffffffa05919cb>] user_sdma_free_request+0xcb/0x120 [hfi1] [<ffffffffa0593393>] user_sdma_txreq_cb+0x263/0x350 [hfi1] [<ffffffffa057fad7>] ? sdma_txclean+0x27/0x1c0 [hfi1] [<ffffffffa0593130>] ? user_sdma_send_pkts+0x1710/0x1710 [hfi1] [<ffffffffa057fdd6>] sdma_make_progress+0x166/0x480 [hfi1] [<ffffffff810762c9>] ? ttwu_do_wakeup+0x19/0xd0 [<ffffffffa0581c7e>] sdma_engine_interrupt+0x8e/0x100 [hfi1] [<ffffffffa0546bdd>] sdma_interrupt+0x5d/0xa0 [hfi1] [<ffffffff81097e57>] handle_irq_event_percpu+0x47/0x1d0 [<ffffffff81098017>] handle_irq_event+0x37/0x60 [<ffffffff8109aa5f>] handle_edge_irq+0x6f/0x120 [<ffffffff810044af>] handle_irq+0xbf/0x150 [<ffffffff8104c9b7>] ? irq_enter+0x17/0x80 [<ffffffff8150168d>] do_IRQ+0x4d/0xc0 [<ffffffff814f7c6a>] common_interrupt+0x6a/0x6a <EOI> [<ffffffff81073524>] ? finish_task_switch+0x54/0xe0 [<ffffffff814f56c6>] __schedule+0x3b6/0x7e0 [<ffffffff810763a6>] __cond_resched+0x26/0x30 [<ffffffff814f5eda>] _cond_resched+0x3a/0x50 [<ffffffff814f4f82>] down_write+0x12/0x30 [<ffffffffa0591619>] hfi1_release_user_pages+0x69/0x90 [hfi1] [<ffffffffa059173a>] sdma_rb_remove+0x9a/0xc0 [hfi1] [<ffffffffa056c00d>] __mmu_rb_remove.isra.5+0x5d/0x70 [hfi1] [<ffffffffa056c536>] hfi1_mmu_rb_remove+0x56/0x70 [hfi1] [<ffffffffa059427b>] hfi1_user_sdma_process_request+0x74b/0x1160 [hfi1] [<ffffffffa055c763>] hfi1_aio_write+0xc3/0x100 [hfi1] [<ffffffff8116a14c>] do_sync_readv_writev+0x4c/0x80 [<ffffffff8116b58b>] do_readv_writev+0xbb/0x230 [<ffffffff811a9da1>] ? fsnotify+0x241/0x320 [<ffffffff81073524>] ? finish_task_switch+0x54/0xe0 [<ffffffff8116b795>] vfs_writev+0x35/0x60 [<ffffffff8116b8c9>] SyS_writev+0x49/0xc0 [<ffffffff810cd876>] ? __audit_syscall_exit+0x1f6/0x2a0 [<ffffffff814ff992>] system_call_fastpath+0x16/0x1b As evident from the backtrace above, the process was being put to sleep while holding the lock. Limiting the scope of the lock only to the RB tree operation fixes the above error allowing for proper locking and the process being put to sleep when needed. Reviewed-by: Dennis Dalessandro <[email protected]> Reviewed-by: Dean Luick <[email protected]> Signed-off-by: Mitko Haralanov <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2016-04-28IB/hfi1: Prevent NULL pointer deferences in caching codeMitko Haralanov4-23/+37
There is a potential kernel crash when the MMU notifier calls the invalidation routines in the hfi1 pinned page caching code for sdma. The invalidation routine could call the remove callback for the node, which in turn ends up dereferencing the current task_struct to get a pointer to the mm_struct. However, the mm_struct pointer could be NULL resulting in the following backtrace: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 IP: [<ffffffffa041f75a>] sdma_rb_remove+0xaa/0x100 [hfi1] 15 task: ffff88085e66e080 ti: ffff88085c244000 task.ti: ffff88085c244000 RIP: 0010:[<ffffffffa041f75a>] [<ffffffffa041f75a>] sdma_rb_remove+0xaa/0x100 [hfi1] RSP: 0000:ffff88085c245878 EFLAGS: 00010002 RAX: 0000000000000000 RBX: ffff88105b9bbd40 RCX: ffffea003931a830 RDX: 0000000000000004 RSI: ffff88105754a9c0 RDI: ffff88105754a9c0 RBP: ffff88085c245890 R08: ffff88105b9bbd70 R09: 00000000fffffffb R10: ffff88105b9bbd58 R11: 0000000000000013 R12: ffff88105754a9c0 R13: 0000000000000001 R14: 0000000000000001 R15: ffff88105b9bbd40 FS: 0000000000000000(0000) GS:ffff88107ef40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000a8 CR3: 0000000001a0b000 CR4: 00000000001407e0 Stack: ffff88105b9bbd40 ffff88080ec481a8 ffff88080ec481b8 ffff88085c2458c0 ffffffffa03fa00e ffff88080ec48190 ffff88080ed9cd00 0000000001024000 0000000000000000 ffff88085c245920 ffffffffa03fa0e7 0000000000000282 Call Trace: [<ffffffffa03fa00e>] __mmu_rb_remove.isra.5+0x5e/0x70 [hfi1] [<ffffffffa03fa0e7>] mmu_notifier_mem_invalidate+0xc7/0xf0 [hfi1] [<ffffffffa03fa143>] mmu_notifier_page+0x13/0x20 [hfi1] [<ffffffff81156dd0>] __mmu_notifier_invalidate_page+0x50/0x70 [<ffffffff81140bbb>] try_to_unmap_one+0x20b/0x470 [<ffffffff81141ee7>] try_to_unmap_anon+0xa7/0x120 [<ffffffff81141fad>] try_to_unmap+0x4d/0x60 [<ffffffff8111fd7b>] shrink_page_list+0x2eb/0x9d0 [<ffffffff81120ab3>] shrink_inactive_list+0x243/0x490 [<ffffffff81121491>] shrink_lruvec+0x4c1/0x640 [<ffffffff81121641>] shrink_zone+0x31/0x100 [<ffffffff81121b0f>] kswapd_shrink_zone.constprop.62+0xef/0x1c0 [<ffffffff811229e3>] kswapd+0x403/0x7e0 [<ffffffff811225e0>] ? shrink_all_memory+0xf0/0xf0 [<ffffffff81068ac0>] kthread+0xc0/0xd0 [<ffffffff81068a00>] ? insert_kthread_work+0x40/0x40 [<ffffffff814ff8ec>] ret_from_fork+0x7c/0xb0 [<ffffffff81068a00>] ? insert_kthread_work+0x40/0x40 To correct this, the mm_struct passed to us by the MMU notifier is used (which is what should have been done to begin with). This avoids the broken derefences and ensures that the correct mm_struct is used. Reviewed-by: Dennis Dalessandro <[email protected]> Reviewed-by: Dean Luick <[email protected]> Signed-off-by: Mitko Haralanov <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2016-04-28MAINTAINERS: Update iser/isert maintainer contact infoSagi Grimberg1-2/+2
Signed-off-by: Sagi Grimberg <[email protected]> Acked-by: Nicholas Bellinger <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2016-04-28IB/mlx5: Expose correct max_sge_rd limitSagi Grimberg2-1/+12
mlx5 devices (Connect-IB, ConnectX-4, ConnectX-4-LX) has a limitation where rdma read work queue entries cannot exceed 512 bytes. A rdma_read wqe needs to fit in 512 bytes: - wqe control segment (16 bytes) - rdma segment (16 bytes) - scatter elements (16 bytes each) So max_sge_rd should be: (512 - 16 - 16) / 16 = 30. Cc: [email protected] Reported-by: Christoph Hellwig <[email protected]> Tested-by: Christoph Hellwig <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2016-04-28mmc: sunxi: Disable eMMC HS-DDR (MMC_CAP_1_8V_DDR) for Allwinner A80Chen-Yu Tsai1-0/+5
eMMC HS-DDR no longer works on the A80, despite it working when support for this developed. Disable it for now. Signed-off-by: Chen-Yu Tsai <[email protected]> Signed-off-by: Ulf Hansson <[email protected]>
2016-04-28perf/x86/intel: Fix incorrect lbr_sel_mask valueKan Liang1-2/+4
This patch fixes a bug which was introduced by: b16a5b52eb90 ("perf/x86: Add option to disable reading branch flags/cycles") In this patch, lbr_sel_mask is used to mask the lbr_select. But LBR_SEL_MASK doesn't include the bit for LBR_CALL_STACK. So LBR call stack will never be set in lbr_select. This patch corrects the LBR_SEL_MASK by including all valid bits in LBR_SELECT. Also, the LBR_CALL_STACK bit is different as other bit in LBR_SELECT. It does not operate in suppress mode, so it needs to be specially handled in intel_pmu_setup_hw_lbr_filter. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-04-28perf/x86/intel/pt: Don't die on VMXONAlexander Shishkin4-11/+75
Some versions of Intel PT do not support tracing across VMXON, more specifically, VMXON will clear TraceEn control bit and any attempt to set it before VMXOFF will throw a #GP, which in the current state of things will crash the kernel. Namely: $ perf record -e intel_pt// kvm -nographic on such a machine will kill it. To avoid this, notify the intel_pt driver before VMXON and after VMXOFF so that it knows when not to enable itself. Signed-off-by: Alexander Shishkin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-04-28perf/core: Fix perf_event_open() vs. execve() racePeter Zijlstra1-16/+36
Jann reported that the ptrace_may_access() check in find_lively_task_by_vpid() is racy against exec(). Specifically: perf_event_open() execve() ptrace_may_access() commit_creds() ... if (get_dumpable() != SUID_DUMP_USER) perf_event_exit_task(); perf_install_in_context() would result in installing a counter across the creds boundary. Fix this by wrapping lots of perf_event_open() in cred_guard_mutex. This should be fine as perf_event_exit_task() is already called with cred_guard_mutex held, so all perf locks already nest inside it. Reported-by: Jann Horn <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2016-04-28perf/x86/amd: Set the size of event map array to PERF_COUNT_HW_MAXAdam Borowski1-1/+1
The entry for PERF_COUNT_HW_REF_CPU_CYCLES is not used on AMD, but is referenced by filter_events() which expects undefined events to have a value of 0. Found via KASAN: UBSAN: Undefined behaviour in arch/x86/events/amd/core.c:132:30 index 9 is out of range for type 'u64 [9]' UBSAN: Undefined behaviour in arch/x86/events/amd/core.c:132:9 load of address ffffffff81c021c8 with insufficient space for an object of type 'const u64' Signed-off-by: Adam Borowski <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-04-28rbd: report unsupported features to syslogIlya Dryomov1-3/+6
... instead of just returning an error. Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2016-04-28rbd: fix rbd map vs notify racesIlya Dryomov1-24/+19
A while ago, commit 9875201e1049 ("rbd: fix use-after free of rbd_dev->disk") fixed rbd unmap vs notify race by introducing an exported wrapper for flushing notifies and sticking it into do_rbd_remove(). A similar problem exists on the rbd map path, though: the watch is registered in rbd_dev_image_probe(), while the disk is set up quite a few steps later, in rbd_dev_device_setup(). Nothing prevents a notify from coming in and crashing on a NULL rbd_dev->disk: BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 Call Trace: [<ffffffffa0508344>] rbd_watch_cb+0x34/0x180 [rbd] [<ffffffffa04bd290>] do_event_work+0x40/0xb0 [libceph] [<ffffffff8109d5db>] process_one_work+0x17b/0x470 [<ffffffff8109e3ab>] worker_thread+0x11b/0x400 [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400 [<ffffffff810a5acf>] kthread+0xcf/0xe0 [<ffffffff810b41b3>] ? finish_task_switch+0x53/0x170 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 [<ffffffff81645dd8>] ret_from_fork+0x58/0x90 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 RIP [<ffffffffa050828a>] rbd_dev_refresh+0xfa/0x180 [rbd] If an error occurs during rbd map, we have to error out, potentially tearing down a watch. Just like on rbd unmap, notifies have to be flushed, otherwise rbd_watch_cb() may end up trying to read in the image header after rbd_dev_image_release() has run: Assertion failure in rbd_dev_header_info() at line 4722: rbd_assert(rbd_image_format_valid(rbd_dev->image_format)); Call Trace: [<ffffffff81cccee0>] ? rbd_parent_request_create+0x150/0x150 [<ffffffff81cd4e59>] rbd_dev_refresh+0x59/0x390 [<ffffffff81cd5229>] rbd_watch_cb+0x69/0x290 [<ffffffff81fde9bf>] do_event_work+0x10f/0x1c0 [<ffffffff81107799>] process_one_work+0x689/0x1a80 [<ffffffff811076f7>] ? process_one_work+0x5e7/0x1a80 [<ffffffff81132065>] ? finish_task_switch+0x225/0x640 [<ffffffff81107110>] ? pwq_dec_nr_in_flight+0x2b0/0x2b0 [<ffffffff81108c69>] worker_thread+0xd9/0x1320 [<ffffffff81108b90>] ? process_one_work+0x1a80/0x1a80 [<ffffffff8111b02d>] kthread+0x21d/0x2e0 [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550 [<ffffffff82022802>] ret_from_fork+0x22/0x40 [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550 RIP [<ffffffff81ccd8f9>] rbd_dev_header_info+0xa19/0x1e30 To fix this, a) check if RBD_DEV_FLAG_EXISTS is set before calling revalidate_disk(), b) move ceph_osdc_flush_notifies() call into rbd_dev_header_unwatch_sync() to cover rbd map error paths and c) turn header read-in into a critical section. The latter also happens to take care of rbd map foo@bar vs rbd snap rm foo@bar race. Fixes: http://tracker.ceph.com/issues/15490 Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2016-04-28x86/apic: Handle zero vector gracefully in clear_vector_irq()Keith Busch1-1/+2
If x86_vector_alloc_irq() fails x86_vector_free_irqs() is invoked to cleanup the already allocated vectors. This subsequently calls clear_vector_irq(). The failed irq has no vector assigned, which triggers the BUG_ON(!vector) in clear_vector_irq(). We cannot suppress the call to x86_vector_free_irqs() for the failed interrupt, because the other data related to this irq must be cleaned up as well. So calling clear_vector_irq() with vector == 0 is legitimate. Remove the BUG_ON and return if vector is zero, [ tglx: Massaged changelog ] Fixes: b5dc8e6c21e7 "x86/irq: Use hierarchical irqdomain to manage CPU interrupt vectors" Signed-off-by: Keith Busch <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-04-27thermal: use %d to print S32 parametersLeo Yan1-1/+1
Power allocator's parameters are S32 type, so use %d to print them. Acked-by: Javi Merino <[email protected]> Signed-off-by: Leo Yan <[email protected]> Signed-off-by: Eduardo Valentin <[email protected]>
2016-04-27thermal: hisilicon: increase temperature resolutionLeo Yan1-2/+2
When calculate temperature, old code firstly do division and then convert to "millicelsius" unit. This will lose resolution and only can read back temperature with "Celsius" unit. So firstly scale step value to "millicelsius" and then do division, so finally we can increase resolution for temperature value. Also refine the calculation from temperature value to step value. Signed-off-by: Leo Yan <[email protected]> Signed-off-by: Eduardo Valentin <[email protected]>
2016-04-27Merge branch 'for-4.6-fixes' of ↵Linus Torvalds1-0/+29
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "So, it turns out we had a silly bug in the most fundamental part of workqueue for a very long time. AFAICS, this dates back to pre-git era and has quite likely been there from the time workqueue was first introduced. A work item uses its PENDING bit to synchronize multiple queuers. Anyone who wins the PENDING bit owns the pending state of the work item. Whether a queuer wins or loses the race, one thing should be guaranteed - there will soon be at least one execution of the work item - where "after" means that the execution instance would be able to see all the changes that the queuer has made prior to the queueing attempt. Unfortunately, we were missing a smp_mb() after clearing PENDING for execution, so nothing guaranteed visibility of the changes that a queueing loser has made, which manifested as a reproducible blk-mq stall. Lots of kudos to Roman for debugging the problem. The patch for -stable is the minimal one. For v3.7, Peter is working on a patch to make the code path slightly more efficient and less fragile" * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix ghost PENDING flag while doing MQ IO
2016-04-27Merge branch 'for-4.6-fixes' of ↵Linus Torvalds5-28/+27
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Two patches to fix a deadlock which can be easily triggered if memcg charge moving is used. This bug was introduced while converting threadgroup locking to a global percpu_rwsem and is caused by cgroup controller task migration path depending on the ability to create new kthreads. cpuset had a similar issue which was fixed by performing heavy-lifting operations asynchronous to task migration. The two patches fix the same issue in memcg in a similar way. The first patch makes the mechanism generic and the second relocates memcg charge moving outside the migration path. Given that we don't want to perform heavy operations while writelocking threadgroup lock anyway, moving them out of the way is a desirable solution. One thing to note is that the problem was difficult to debug because lockdep couldn't figure out the deadlock condition. Looking into how to improve that" * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: memcg: relocate charge moving from ->attach to ->post_attach cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback
2016-04-27Merge branch 'i2c/for-current' of ↵Linus Torvalds6-11/+28
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "I2C has one buildfix, one ABBA deadlock fix, and three simple 'add ID' patches" * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: exynos5: Fix possible ABBA deadlock by keeping I2C clock prepared i2c: cpm: Fix build break due to incompatible pointer types i2c: ismt: Add Intel DNV PCI ID i2c: xlp9xx: add support for Broadcom Vulcan i2c: rk3x: add support for rk3228
2016-04-27Merge tag 'arc-4.6-rc6-fixes' of ↵Linus Torvalds7-4/+55
git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC fixes from Vineet Gupta: - lockdep now works for ARCv2 builds - enable DT reserved-memory binding (for forthcoming HDMI driver) * tag 'arc-4.6-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: ARC: add support for reserved memory defined by device tree ARC: support generic per-device coherent dma mem Documentation: dt: arc: fix spelling mistakes ARCv2: Enable LOCKDEP
2016-04-27Merge tag 'nios2-v4.6-fix' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2 Pull arch/nios2 fix from Ley Foon Tan: "memset: use the right constraint modifier for the %4 output operand" * tag 'nios2-v4.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2: nios2: memset: use the right constraint modifier for the %4 output operand
2016-04-27drm/amdgpu: disable vm interrupts with vm_fault_stop=2Flora Cui2-2/+8
V2: disable all vm interrupts in late_init() Signed-off-by: Flora Cui <[email protected]> Reviewed-by: Ken Wang <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2016-04-27drm/amdgpu: print a message if ATPX dGPU power control is missingAlex Deucher1-1/+4
It will help identify problematic boards. Signed-off-by: Alex Deucher <[email protected]>
2016-04-27Revert "drm/amdgpu: disable runtime pm on PX laptops without dGPU power control"Alex Deucher2-11/+5
This reverts commit bedf2a65c1aa8fb29ba8527fd00c0f68ec1f55f1. See the radeon revert for an extended description. Cc: [email protected]
2016-04-27drm/radeon: fix vertical bars appear on monitor (v2)Vitaly Prosyak2-1/+199
When crtc/timing is disabled on boot the dig block should be stopped in order ignore timing from crtc, reset the steering fifo otherwise we get display corruption or hung in dp sst mode. v2: agd: fix coding style Signed-off-by: Vitaly Prosyak <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected] Signed-off-by: Alex Deucher <[email protected]>
2016-04-27drm/ttm: fix kref count mess in ttm_bo_move_to_lru_tailFlora Cui1-13/+4
Fixes the following scenario: 1. Page table bo allocated in vram and linked to man->lru. tbo->list_kref.refcount=2 2. Page table bo is swapped out and removed from man->lru. tbo->list_kref.refcount=1 3. Command submission from userspace. Page table bo is moved to vram. ttm_bo_move_to_lru_tail() link it to man->lru and don't increase the kref count. Reviewed-by: Thomas Hellstrom <[email protected]> Signed-off-by: Flora Cui <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2016-04-27Merge tag 'platform-drivers-x86-v4.6-3' of ↵Linus Torvalds1-1/+1
git://git.infradead.org/users/dvhart/linux-platform-drivers-x86 Pull x86 platform driver fix from Darren Hart: "Fix regression caused by hotkey enabling value in toshiba_acpi" * tag 'platform-drivers-x86-v4.6-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86: toshiba_acpi: Fix regression caused by hotkey enabling value
2016-04-27Merge tag 'asoc-fix-v4.6-rc5' of ↵Takashi Iwai1039-6774/+10806
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus ASoC: Fixes for v4.6 This is a fairly large collection of fixes but almost all driver specific ones, especially to the new Intel drivers which have had a lot of recent development. The one core fix is a change to the debugfs code to avoid crashes in some relatively unusual configurations.
2016-04-27ARC: add support for reserved memory defined by device treeAlexey Brodkin2-0/+5
Enable reserved memory initialization from device tree. Signed-off-by: Alexey Brodkin <[email protected]> Cc: Grant Likely <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: [email protected] Signed-off-by: Vineet Gupta <[email protected]>
2016-04-27ARC: support generic per-device coherent dma memAlexey Brodkin1-0/+1
Signed-off-by: Alexey Brodkin <[email protected]> Cc: [email protected] Signed-off-by: Vineet Gupta <[email protected]>