aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2018-01-15usercopy: Prepare for usercopy whitelistingDavid Windsor4-12/+52
This patch prepares the slab allocator to handle caches having annotations (useroffset and usersize) defining usercopy regions. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Currently, hardened usercopy performs dynamic bounds checking on slab cache objects. This is good, but still leaves a lot of kernel memory available to be copied to/from userspace in the face of bugs. To further restrict what memory is available for copying, this creates a way to whitelist specific areas of a given slab cache object for copying to/from userspace, allowing much finer granularity of access control. Slab caches that are never exposed to userspace can declare no whitelist for their objects, thereby keeping them unavailable to userspace via dynamic copy operations. (Note, an implicit form of whitelisting is the use of constant sizes in usercopy operations and get_user()/put_user(); these bypass hardened usercopy checks since these sizes cannot change at runtime.) To support this whitelist annotation, usercopy region offset and size members are added to struct kmem_cache. The slab allocator receives a new function, kmem_cache_create_usercopy(), that creates a new cache with a usercopy region defined, suitable for declaring spans of fields within the objects that get copied to/from userspace. In this patch, the default kmem_cache_create() marks the entire allocation as whitelisted, leaving it semantically unchanged. Once all fine-grained whitelists have been added (in subsequent patches), this will be changed to a usersize of 0, making caches created with kmem_cache_create() not copyable to/from userspace. After the entire usercopy whitelist series is applied, less than 15% of the slab cache memory remains exposed to potential usercopy bugs after a fresh boot: Total Slab Memory: 48074720 Usercopyable Memory: 6367532 13.2% task_struct 0.2% 4480/1630720 RAW 0.3% 300/96000 RAWv6 2.1% 1408/64768 ext4_inode_cache 3.0% 269760/8740224 dentry 11.1% 585984/5273856 mm_struct 29.1% 54912/188448 kmalloc-8 100.0% 24576/24576 kmalloc-16 100.0% 28672/28672 kmalloc-32 100.0% 81920/81920 kmalloc-192 100.0% 96768/96768 kmalloc-128 100.0% 143360/143360 names_cache 100.0% 163840/163840 kmalloc-64 100.0% 167936/167936 kmalloc-256 100.0% 339968/339968 kmalloc-512 100.0% 350720/350720 kmalloc-96 100.0% 455616/455616 kmalloc-8192 100.0% 655360/655360 kmalloc-1024 100.0% 812032/812032 kmalloc-4096 100.0% 819200/819200 kmalloc-2048 100.0% 1310720/1310720 After some kernel build workloads, the percentage (mainly driven by dentry and inode caches expanding) drops under 10%: Total Slab Memory: 95516184 Usercopyable Memory: 8497452 8.8% task_struct 0.2% 4000/1456000 RAW 0.3% 300/96000 RAWv6 2.1% 1408/64768 ext4_inode_cache 3.0% 1217280/39439872 dentry 11.1% 1623200/14608800 mm_struct 29.1% 73216/251264 kmalloc-8 100.0% 24576/24576 kmalloc-16 100.0% 28672/28672 kmalloc-32 100.0% 94208/94208 kmalloc-192 100.0% 96768/96768 kmalloc-128 100.0% 143360/143360 names_cache 100.0% 163840/163840 kmalloc-64 100.0% 245760/245760 kmalloc-256 100.0% 339968/339968 kmalloc-512 100.0% 350720/350720 kmalloc-96 100.0% 563520/563520 kmalloc-8192 100.0% 655360/655360 kmalloc-1024 100.0% 794624/794624 kmalloc-4096 100.0% 819200/819200 kmalloc-2048 100.0% 1257472/1257472 Signed-off-by: David Windsor <[email protected]> [kees: adjust commit log, split out a few extra kmalloc hunks] [kees: add field names to function declarations] [kees: convert BUGs to WARNs and fail closed] [kees: add attack surface reduction analysis to commit log] Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Andrew Morton <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Kees Cook <[email protected]> Acked-by: Christoph Lameter <[email protected]>
2018-01-15usercopy: Include offset in hardened usercopy reportKees Cook3-64/+53
This refactors the hardened usercopy code so that failure reporting can happen within the checking functions instead of at the top level. This simplifies the return value handling and allows more details and offsets to be included in the report. Having the offset can be much more helpful in understanding hardened usercopy bugs. Signed-off-by: Kees Cook <[email protected]>
2018-01-15usercopy: Enhance and rename report_usercopy()Kees Cook1-5/+19
In preparation for refactoring the usercopy checks to pass offset to the hardened usercopy report, this renames report_usercopy() to the more accurate usercopy_abort(), marks it as noreturn because it is, adds a hopefully helpful comment for anyone investigating such reports, makes the function available to the slab allocators, and adds new "detail" and "offset" arguments. Signed-off-by: Kees Cook <[email protected]>
2018-01-15usercopy: Remove pointer from overflow reportKees Cook1-5/+4
Using %p was already mostly useless in the usercopy overflow reports, so this removes it entirely to avoid confusion now that %p-hashing is enabled. Fixes: ad67b74d2469d9b8 ("printk: hash addresses printed with %p") Signed-off-by: Kees Cook <[email protected]>
2018-01-13kmemleak: allow to coexist with fault injectionDmitry Vyukov1-1/+1
kmemleak does one slab allocation per user allocation. So if slab fault injection is enabled to any degree, kmemleak instantly fails to allocate and turns itself off. However, it's useful to use kmemleak with fault injection to find leaks on error paths. On the other hand, checking kmemleak itself is not so useful because (1) it's a debugging tool and (2) it has a very regular allocation pattern (basically a single allocation site, so it either works or not). Turn off fault injection for kmemleak allocations. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dmitry Vyukov <[email protected]> Cc: Catalin Marinas <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-08memremap: drop private struct page_mapLogan Gunthorpe1-1/+1
'struct page_map' is a private structure of 'struct dev_pagemap' but the latter replicates all the same fields as the former so there isn't much value in it. Thus drop it in favour of a completely public struct. This is a clean up in preperation for a more generally useful 'devm_memeremap_pages' interface. Signed-off-by: Logan Gunthorpe <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08mm: optimize dev_pagemap reference counting around get_dev_pagemapChristoph Hellwig1-2/+5
Change the calling convention so that get_dev_pagemap always consumes the previous reference instead of doing this using an explicit earlier call to put_dev_pagemap in the callers. The callers will still need to put the final reference after finishing the loop over the pages. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08mm: merge vmem_altmap_alloc into altmap_alloc_block_bufChristoph Hellwig1-29/+16
There is no clear separation between the two, so merge them. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08mm: split altmap memory map allocation from normal caseChristoph Hellwig1-12/+3
No functional changes, just untangling the call chain and document why the altmap is passed around the hotplug code. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08mm: pass the vmem_altmap to memmap_init_zoneChristoph Hellwig3-8/+9
Pass the vmem_altmap two levels down instead of needing a lookup. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08mm: pass the vmem_altmap to vmemmap_freeChristoph Hellwig2-13/+17
We can just pass this on instead of having to do a radix tree lookup without proper locking a few levels into the callchain. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08mm: pass the vmem_altmap to arch_remove_memory and __remove_pagesChristoph Hellwig2-8/+4
We can just pass this on instead of having to do a radix tree lookup without proper locking 2 levels into the callchain. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08mm: pass the vmem_altmap to vmemmap_populateChristoph Hellwig3-14/+20
We can just pass this on instead of having to do a radix tree lookup without proper locking a few levels into the callchain. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08mm: pass the vmem_altmap to arch_add_memory and __add_pagesChristoph Hellwig2-6/+6
We can just pass this on instead of having to do a radix tree lookup without proper locking 2 levels into the callchain. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-08mm: don't export __add_pagesChristoph Hellwig1-1/+0
This function isn't used by any modules, and is only to be called from core MM code. This includes the calls for the add_pages wrapper that might be inlined. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2018-01-07new primitive: vmemdup_user()Al Viro1-1/+28
similar to memdup_user(), but does *not* guarantee that result will be physically contiguous; use only in cases where that's not a requirement and free it with kvfree(). Signed-off-by: Al Viro <[email protected]>
2018-01-07memdup_user(): switch to GFP_USERAl Viro1-6/+1
Signed-off-by: Al Viro <[email protected]>
2018-01-06Merge branch 'for-linus' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: - untangle sys_close() abuses in xt_bpf - deal with register_shrinker() failures in sget() * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'" sget(): handle failures of register_shrinker() mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.
2018-01-06block: convert to bio_first_bvec_all & bio_first_page_allMing Lei1-2/+2
This patch converts to bio_first_bvec_all() & bio_first_page_all() for retrieving the 1st bvec/page, and prepares for supporting multipage bvec. Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-01-04mm/sparse.c: wrong allocation for mem_sectionBaoquan He1-1/+1
In commit 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y") mem_section is allocated at runtime to save memory. It allocates the first dimension of array with sizeof(struct mem_section). It costs extra memory, should be sizeof(struct mem_section *). Fix it. Link: http://lkml.kernel.org/r/[email protected] Fixes: 83e3c48729 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y") Signed-off-by: Baoquan He <[email protected]> Tested-by: Dave Young <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Atsushi Kumagai <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-04mm/zsmalloc.c: include fs.hSergey Senozhatsky1-0/+1
`struct file_system_type' and alloc_anon_inode() function are defined in fs.h, include it directly. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-04mm/debug.c: provide useful debugging information for VM_BUGMatthew Wilcox1-14/+14
With the recent addition of hashed kernel pointers, places which need to produce useful debug output have to specify %px, not %p. This patch fixes all the VM debug to use %px. This is appropriate because it's debug output that the user should never be able to trigger, and kernel developers need to see the actual pointers. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: "Tobin C. Harding" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-04mm/mprotect: add a cond_resched() inside change_pmd_range()Anshuman Khandual1-2/+4
While testing on a large CPU system, detected the following RCU stall many times over the span of the workload. This problem is solved by adding a cond_resched() in the change_pmd_range() function. INFO: rcu_sched detected stalls on CPUs/tasks: 154-....: (670 ticks this GP) idle=022/140000000000000/0 softirq=2825/2825 fqs=612 (detected by 955, t=6002 jiffies, g=4486, c=4485, q=90864) Sending NMI from CPU 955 to CPUs 154: NMI backtrace for cpu 154 CPU: 154 PID: 147071 Comm: workload Not tainted 4.15.0-rc3+ #3 NIP: c0000000000b3f64 LR: c0000000000b33d4 CTR: 000000000000aa18 REGS: 00000000a4b0fb44 TRAP: 0501 Not tainted (4.15.0-rc3+) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22422082 XER: 00000000 CFAR: 00000000006cf8f0 SOFTE: 1 GPR00: 0010000000000000 c00003ef9b1cb8c0 c0000000010cc600 0000000000000000 GPR04: 8e0000018c32b200 40017b3858fd6e00 8e0000018c32b208 40017b3858fd6e00 GPR08: 8e0000018c32b210 40017b3858fd6e00 8e0000018c32b218 40017b3858fd6e00 GPR12: ffffffffffffffff c00000000fb25100 NIP [c0000000000b3f64] plpar_hcall9+0x44/0x7c LR [c0000000000b33d4] pSeries_lpar_flush_hash_range+0x384/0x420 Call Trace: flush_hash_range+0x48/0x100 __flush_tlb_pending+0x44/0xd0 hpte_need_flush+0x408/0x470 change_protection_range+0xaac/0xf10 change_prot_numa+0x30/0xb0 task_numa_work+0x2d0/0x3e0 task_work_run+0x130/0x190 do_notify_resume+0x118/0x120 ret_from_except_lite+0x70/0x74 Instruction dump: 60000000 f8810028 7ca42b78 7cc53378 7ce63b78 7d074378 7d284b78 7d495378 e9410060 e9610068 e9810070 44000022 <7d806378> e9810028 f88c0000 f8ac0008 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Suggested-by: Nicholas Piggin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-04mm: check pfn_valid first in zero_resv_unavailDave Young1-0/+2
With latest kernel I get below bug while testing kdump: BUG: unable to handle kernel paging request at ffffea00034b1040 IP: zero_resv_unavail+0xbd/0x126 PGD 37b98067 P4D 37b98067 PUD 37b97067 PMD 0 Oops: 0002 [#1] SMP Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 4.15.0-rc1+ #316 Hardware name: LENOVO 20ARS1BJ02/20ARS1BJ02, BIOS GJET92WW (2.42 ) 03/03/2017 task: ffffffff81a0e4c0 task.stack: ffffffff81a00000 RIP: 0010:zero_resv_unavail+0xbd/0x126 RSP: 0000:ffffffff81a03d88 EFLAGS: 00010006 RAX: 0000000000000000 RBX: ffffea00034b1040 RCX: 0000000000000010 RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffffea00034b1040 RBP: 00000000000d2c41 R08: 00000000000000c0 R09: 0000000000000a0d R10: 0000000000000002 R11: 0000000000007f01 R12: ffffffff81a03d90 R13: ffffea0000000000 R14: 0000000000000063 R15: 0000000000000062 FS: 0000000000000000(0000) GS:ffffffff81c73000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffea00034b1040 CR3: 0000000037609000 CR4: 00000000000606b0 Call Trace: ? free_area_init_nodes+0x640/0x664 ? zone_sizes_init+0x58/0x72 ? setup_arch+0xb50/0xc6c ? start_kernel+0x64/0x43d ? secondary_startup_64+0xa5/0xb0 Code: c1 e8 0c 48 39 d8 76 27 48 89 de 48 c1 e3 06 48 c7 c7 7a 87 79 81 e8 b0 c0 3e ff 4c 01 eb b9 10 00 00 00 31 c0 48 89 df 49 ff c6 <f3> ab eb bc 6a 00 49 c7 c0 f0 93 d1 81 31 d2 83 ce ff 41 54 49 RIP: zero_resv_unavail+0xbd/0x126 RSP: ffffffff81a03d88 CR2: ffffea00034b1040 ---[ end trace f5ba9e8f73c7ee26 ]--- This is introduced by commit a4a3ede2132a ("mm: zero reserved and unavailable struct pages"). The reason is some efi reserved boot ranges is not reported in E820 ram. In my case it is a bgrt buffer: efi: mem00: [Boot Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x00000000d2c41000-0x00000000d2c85fff] (0MB) Use "add_efi_memmap" can workaround the problem with another fix: http://lkml.kernel.org/r/[email protected] In zero_resv_unavail it would be better to check pfn_valid first before zero the page struct. This fixes the problem and potential other similar problems. Also as Pavel Tatashin suggested checks pfn_valid at the beginning of the section. The range is backed by real memory. The memory range is efi "Boot Service Data", that means after ExitBootServices() these ranges can be used as system ram. But some of them need to be reserved, for example the bgrt image address in an acpi table, if the image memory is freed then kexec reboot will fail because kexec inherit same acpi table to initialize the driver. Link: http://lkml.kernel.org/r/[email protected] Fixes: a4a3ede2132a ("mm: zero reserved and unavailable struct pages") Signed-off-by: Dave Young <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pavel Tatashin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-03Merge branch 'for-mingo' of ↵Ingo Molnar2-9/+2
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Updates to use cond_resched() instead of cond_resched_rcu_qs() where feasible (currently everywhere except in kernel/rcu and in kernel/torture.c). Also a couple of fixes to avoid sending IPIs to offline CPUs. - Updates to simplify RCU's dyntick-idle handling. - Updates to remove almost all uses of smp_read_barrier_depends() and read_barrier_depends(). - Miscellaneous fixes. - Torture-test updates. Signed-off-by: Ingo Molnar <[email protected]>
2017-12-21Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-4/+1
Pull block fixes from Jens Axboe: "It's been a few weeks, so here's a small collection of fixes that should go into the current series. This contains: - NVMe pull request from Christoph, with a few important fixes. - kyber hang fix from Omar. - A blk-throttl fix from Shaohua, fixing a case where we double charge a bio. - Two call_single_data alignment fixes from me, fixing up some unfortunate changes that went into 4.14 without being properly reviewed on the block side (since nobody was CC'ed on the patch...). - A bounce buffer fix in two parts, one from me and one from Ming. - Revert bdi debug error handling patch. It's causing boot issues for some folks, and a week down the line, we're still no closer to a fix. Revert this patch for now until it's figured out, then we can retry for 4.16" * 'for-linus' of git://git.kernel.dk/linux-block: Revert "bdi: add error handle for bdi_debug_register" null_blk: unalign call_single_data block: unalign call_single_data in struct request block-throttle: avoid double charge block: fix blk_rq_append_bio block: don't let passthrough IO go into .make_request_fn() nvme: setup streams after initializing namespace head nvme: check hw sectors before setting chunk sectors nvme: call blk_integrity_unregister after queue is cleaned up nvme-fc: remove double put reference if admin connect fails nvme: set discard_alignment to zero kyber: fix another domain token wait queue hang
2017-12-21Revert "bdi: add error handle for bdi_debug_register"Jens Axboe1-4/+1
This reverts commit a0747a859ef6d3cc5b6cd50eb694499b78dd0025. It breaks some booting for some users, and more than a week into this, there's still no good fix. Revert this commit for now until a solution has been found. Reported-by: Laura Abbott <[email protected]> Reported-by: Bruno Wolff III <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-12-18mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.Tetsuo Handa1-0/+3
Syzbot caught an oops at unregister_shrinker() because combination of commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault injection made register_shrinker() fail and the caller of register_shrinker() did not check for failure. ---------- [ 554.881422] FAULT_INJECTION: forcing a failure. [ 554.881422] name failslab, interval 1, probability 0, space 0, times 0 [ 554.881438] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82 [ 554.881443] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [ 554.881445] Call Trace: [ 554.881459] dump_stack+0x194/0x257 [ 554.881474] ? arch_local_irq_restore+0x53/0x53 [ 554.881486] ? find_held_lock+0x35/0x1d0 [ 554.881507] should_fail+0x8c0/0xa40 [ 554.881522] ? fault_create_debugfs_attr+0x1f0/0x1f0 [ 554.881537] ? check_noncircular+0x20/0x20 [ 554.881546] ? find_next_zero_bit+0x2c/0x40 [ 554.881560] ? ida_get_new_above+0x421/0x9d0 [ 554.881577] ? find_held_lock+0x35/0x1d0 [ 554.881594] ? __lock_is_held+0xb6/0x140 [ 554.881628] ? check_same_owner+0x320/0x320 [ 554.881634] ? lock_downgrade+0x990/0x990 [ 554.881649] ? find_held_lock+0x35/0x1d0 [ 554.881672] should_failslab+0xec/0x120 [ 554.881684] __kmalloc+0x63/0x760 [ 554.881692] ? lock_downgrade+0x990/0x990 [ 554.881712] ? register_shrinker+0x10e/0x2d0 [ 554.881721] ? trace_event_raw_event_module_request+0x320/0x320 [ 554.881737] register_shrinker+0x10e/0x2d0 [ 554.881747] ? prepare_kswapd_sleep+0x1f0/0x1f0 [ 554.881755] ? _down_write_nest_lock+0x120/0x120 [ 554.881765] ? memcpy+0x45/0x50 [ 554.881785] sget_userns+0xbcd/0xe20 (...snipped...) [ 554.898693] kasan: CONFIG_KASAN_INLINE enabled [ 554.898724] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 554.898732] general protection fault: 0000 [#1] SMP KASAN [ 554.898737] Dumping ftrace buffer: [ 554.898741] (ftrace buffer empty) [ 554.898743] Modules linked in: [ 554.898752] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82 [ 554.898755] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [ 554.898760] task: ffff8801d1dbe5c0 task.stack: ffff8801c9e38000 [ 554.898772] RIP: 0010:__list_del_entry_valid+0x7e/0x150 [ 554.898775] RSP: 0018:ffff8801c9e3f108 EFLAGS: 00010246 [ 554.898780] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 554.898784] RDX: 0000000000000000 RSI: ffff8801c53c6f98 RDI: ffff8801c53c6fa0 [ 554.898788] RBP: ffff8801c9e3f120 R08: 1ffff100393c7d55 R09: 0000000000000004 [ 554.898791] R10: ffff8801c9e3ef70 R11: 0000000000000000 R12: 0000000000000000 [ 554.898795] R13: dffffc0000000000 R14: 1ffff100393c7e45 R15: ffff8801c53c6f98 [ 554.898800] FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000 [ 554.898804] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 [ 554.898807] CR2: 00000000dbc23000 CR3: 00000001c7269000 CR4: 00000000001406e0 [ 554.898813] DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000 [ 554.898816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 [ 554.898818] Call Trace: [ 554.898828] unregister_shrinker+0x79/0x300 [ 554.898837] ? perf_trace_mm_vmscan_writepage+0x750/0x750 [ 554.898844] ? down_write+0x87/0x120 [ 554.898851] ? deactivate_super+0x139/0x1b0 [ 554.898857] ? down_read+0x150/0x150 [ 554.898864] ? check_same_owner+0x320/0x320 [ 554.898875] deactivate_locked_super+0x64/0xd0 [ 554.898883] deactivate_super+0x141/0x1b0 ---------- Since allowing register_shrinker() callers to call unregister_shrinker() when register_shrinker() failed can simplify error recovery path, this patch makes unregister_shrinker() no-op when register_shrinker() failed. Also, reset shrinker->nr_deferred in case unregister_shrinker() was by error called twice. Signed-off-by: Tetsuo Handa <[email protected]> Signed-off-by: Aliaksei Karaliou <[email protected]> Reported-by: syzbot <[email protected]> Cc: Glauber Costa <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Al Viro <[email protected]>
2017-12-15Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths"Linus Torvalds4-12/+12
This reverts commits 5c9d2d5c269c, c7da82b894e9, and e7fe7b5cae90. We'll probably need to revisit this, but basically we should not complicate the get_user_pages_fast() case, and checking the actual page table protection key bits will require more care anyway, since the protection keys depend on the exact state of the VM in question. Particularly when doing a "remote" page lookup (ie in somebody elses VM, not your own), you need to be much more careful than this was. Dave Hansen says: "So, the underlying bug here is that we now a get_user_pages_remote() and then go ahead and do the p*_access_permitted() checks against the current PKRU. This was introduced recently with the addition of the new p??_access_permitted() calls. We have checks in the VMA path for the "remote" gups and we avoid consulting PKRU for them. This got missed in the pkeys selftests because I did a ptrace read, but not a *write*. I also didn't explicitly test it against something where a COW needed to be done" It's also not entirely clear that it makes sense to check the protection key bits at this level at all. But one possible eventual solution is to make the get_user_pages_fast() case just abort if it sees protection key bits set, which makes us fall back to the regular get_user_pages() case, which then has a vma and can do the check there if we want to. We'll see. Somewhat related to this all: what we _do_ want to do some day is to check the PAGE_USER bit - it should obviously always be set for user pages, but it would be a good check to have back. Because we have no generic way to test for it, we lost it as part of moving over from the architecture-specific x86 GUP implementation to the generic one in commit e585513b76f7 ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"). Cc: Peter Zijlstra <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: "Jérôme Glisse" <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-12-15Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull early_ioremap fix from Ingo Molnar: "A boot hang fix when the EFI earlyprintk driver is enabled" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mm/early_ioremap: Fix boot hang with earlyprintk=efi,keep
2017-12-14mm, oom_reaper: fix memory corruptionMichal Hocko2-6/+8
David Rientjes has reported the following memory corruption while the oom reaper tries to unmap the victims address space BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 2 PID: 1001 Comm: oom_reaper Call Trace: unmap_page_range+0x1068/0x1130 __oom_reap_task_mm+0xd5/0x16b oom_reaper+0xff/0x14c kthread+0xc1/0xe0 Tetsuo Handa has noticed that the synchronization inside exit_mmap is insufficient. We only synchronize with the oom reaper if tsk_is_oom_victim which is not true if the final __mmput is called from a different context than the oom victim exit path. This can trivially happen from context of any task which has grabbed mm reference (e.g. to read /proc/<pid>/ file which requires mm etc.). The race would look like this oom_reaper oom_victim task mmget_not_zero do_exit mmput __oom_reap_task_mm mmput __mmput exit_mmap remove_vma unmap_page_range Fix this issue by providing a new mm_is_oom_victim() helper which operates on the mm struct rather than a task. Any context which operates on a remote mm struct should use this helper in place of tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared so it is stable in the exit_mmap path. Debugged by Tetsuo Handa. Link: http://lkml.kernel.org/r/[email protected] Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Signed-off-by: Michal Hocko <[email protected]> Reported-by: David Rientjes <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Andrea Argangeli <[email protected]> Cc: <[email protected]> [4.14] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-12-14mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()'Christophe JAILLET1-2/+4
A semaphore is acquired before this check, so we must release it before leaving. Link: http://lkml.kernel.org/r/[email protected] Fixes: b7f0554a56f2 ("mm: fail get_vaddr_frames() for filesystem-dax mappings") Signed-off-by: Christophe JAILLET <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: David Sterba <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-12-14mm/slab.c: do not hash pointers when debugging slabGeert Uytterhoeven1-13/+10
If CONFIG_DEBUG_SLAB/CONFIG_DEBUG_SLAB_LEAK are enabled, the slab code prints extra debug information when e.g. corruption is detected. This includes pointers, which are not very useful when hashed. Fix this by using %px to print unhashed pointers instead where it makes sense, and by removing the printing of a last user pointer referring to code. [[email protected]: v2] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Fixes: ad67b74d2469d9b8 ("printk: hash addresses printed with %p") Signed-off-by: Geert Uytterhoeven <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: Linus Torvalds <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: "Tobin C . Harding" <[email protected]> Cc: Kees Cook <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-12-14mm/page_alloc.c: avoid excessive IRQ disabled times in free_unref_page_list()Lucas Stach1-0/+11
Since commit 9cca35d42eb6 ("mm, page_alloc: enable/disable IRQs once when freeing a list of pages") we see excessive IRQ disabled times of up to 25ms on an embedded ARM system (tracing overhead included). This is due to graphics buffers being freed back to the system via release_pages(). Graphics buffers can be huge, so it's not hard to hit cases where the list of pages to free has 2048 entries. Disabling IRQs while freeing all those pages is clearly not a good idea. Introduce a batch limit, which allows IRQ servicing once every few pages. The batch count is the same as used in other parts of the MM subsystem when dealing with IRQ disabled regions. Link: http://lkml.kernel.org/r/[email protected] Fixes: 9cca35d42eb6 ("mm, page_alloc: enable/disable IRQs once when freeing a list of pages") Signed-off-by: Lucas Stach <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-12-14mm/memory.c: mark wp_huge_pmd() inline to prevent build failureGeert Uytterhoeven1-1/+2
With gcc 4.1.2: mm/memory.o: In function `wp_huge_pmd': memory.c:(.text+0x9b4): undefined reference to `do_huge_pmd_wp_page' Interestingly, wp_huge_pmd() is emitted in the assembler output, but never called. Apparently replacing the call to pmd_write() in __handle_mm_fault() by a call to the more complex pmd_access_permitted() reduced the ability of the compiler to remove unused code. Fix this by marking wp_huge_pmd() inline, like was done in commit 91a90140f998 ("mm/memory.c: mark create_huge_pmd() inline to prevent build failure") for a similar problem. [[email protected]: add comment] Link: http://lkml.kernel.org/r/[email protected] Fixes: c7da82b894e9eef6 ("mm: replace pmd_write with pmd_access_permitted in fault + gup paths") Signed-off-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-12-14mm/kmemleak.c: make cond_resched() rate-limiting more efficientAndrew Morton1-1/+1
Commit bde5f6bc68db ("kmemleak: add scheduling point to kmemleak_scan()") tries to rate-limit the frequency of cond_resched() calls, but does it in a way which might incur an expensive division operation in the inner loop. Simplify this. Fixes: bde5f6bc68db5 ("kmemleak: add scheduling point to kmemleak_scan()") Suggested-by: Linus Torvalds <[email protected]> Cc: Yisheng Xie <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-12-11Merge branch 'for-4.15-fixes' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu fix from Tejun Heo: "Just one patch to work around CRIS boot problem caused by a recent change which freed a temporary boot data structure. The root cause is on CRIS side but it doesn't seem trivial to fix. For now, work around by skipping freeing on CRIS" * 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: hack to let the CRIS architecture to boot until they clean up
2017-12-11Merge branches 'cond_resched.2017.12.04a', 'dyntick.2017.11.28a', ↵Paul E. McKenney1-8/+1
'fixes.2017.12.11a', 'srbd.2017.12.05a' and 'torture.2017.12.11a' into HEAD cond_resched.2017.12.04a: Convert cond_resched_rcu_qs() to cond_resched() dyntick.2017.11.28a: Make RCU dynticks handle interrupts from NMI fixes.2017.12.11a: Miscellaneous fixes srbd.2017.12.05a: Remove now-redundant smp_read_barrier_depends() torture.2017.12.11a: Torture-testing update
2017-12-11mm/early_ioremap: Fix boot hang with earlyprintk=efi,keepDave Young1-1/+1
earlyprintk=efi,keep does not work any more with a warning in mm/early_ioremap.c: WARN_ON(system_state != SYSTEM_BOOTING): Boot just hangs because of the earlyprintk within the earlyprintk implementation code itself. This is caused by a new introduced middle state in: 69a78ff226fe ("init: Introduce SYSTEM_SCHEDULING state") early_ioremap() is fine in both SYSTEM_BOOTING and SYSTEM_SCHEDULING states, original condition should be updated accordingly. Signed-off-by: Dave Young <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-12-08kmemcheck: rip it out for realMichal Hocko1-1/+0
Commit 4675ff05de2d ("kmemcheck: rip it out") has removed the code but for some reason SPDX header stayed in place. This looks like a rebase mistake in the mmotm tree or the merge mistake. Let's drop those leftovers as well. Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-12-04mm/ksm: Remove now-redundant smp_read_barrier_depends()Paul E. McKenney1-8/+1
Because READ_ONCE() now implies smp_read_barrier_depends(), the smp_read_barrier_depends() in get_ksm_page() is now redundant. This commit removes it and updates the comments. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: <[email protected]>
2017-12-02__get_user_pages_locked(): get rid of notify_drop argumentAl Viro1-6/+6
The only caller that doesn't pass true in it is get_user_pages() and it passes NULL in locked. The only place where we check it is if (notify_locked && lock_dropped && *locked) and lock_dropped can become true only if we have locked != NULL. In other words, the second part of condition will be false when called by get_user_pages(). Just get rid of the argument and turn the condition into if (lock_dropped && *locked) Signed-off-by: Al Viro <[email protected]>
2017-12-02get_user_pages_unlocked(): pass true to __get_user_pages_locked() notify_dropAl Viro1-1/+1
Equivalent transformation - the only place in __get_user_pages_locked() where we look at notify_drop argument is if (notify_drop && lock_dropped && *locked) { up_read(&mm->mmap_sem); *locked = 0; } in the very end. Changing notify_drop from false to true won't change behaviour unless *locked is non-zero. The caller is ret = __get_user_pages_locked(current, mm, start, nr_pages, pages, NULL, &locked, false, gup_flags | FOLL_TOUCH); if (locked) up_read(&mm->mmap_sem); so in that case the original kernel would have done up_read() right after return from __get_user_pages_locked(), while the modified one would've done it right before the return. Signed-off-by: Al Viro <[email protected]>
2017-12-02fold __get_user_pages_unlocked() into its sole remaining callerAl Viro1-26/+10
Signed-off-by: Al Viro <[email protected]>
2017-12-01Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-3/+19
Pull block fixes from Jens Axboe: "A selection of fixes/changes that should make it into this series. This contains: - NVMe, two merges, containing: - pci-e, rdma, and fc fixes - Device quirks - Fix for a badblocks leak in null_blk - bcache fix from Rui Hua for a race condition regression where -EINTR was returned to upper layers that didn't expect it. - Regression fix for blktrace for a bug introduced in this series. - blktrace cleanup for cgroup id. - bdi registration error handling. - Small series with cleanups for blk-wbt. - Various little fixes for typos and the like. Nothing earth shattering, most important are the NVMe and bcache fixes" * 'for-linus' of git://git.kernel.dk/linux-block: (34 commits) nvme-pci: fix NULL pointer dereference in nvme_free_host_mem() nvme-rdma: fix memory leak during queue allocation blktrace: fix trace mutex deadlock nvme-rdma: Use mr pool nvme-rdma: Check remotely invalidated rkey matches our expected rkey nvme-rdma: wait for local invalidation before completing a request nvme-rdma: don't complete requests before a send work request has completed nvme-rdma: don't suppress send completions bcache: check return value of register_shrinker bcache: recover data from backing when data is clean bcache: Fix building error on MIPS bcache: add a comment in journal bucket reading nvme-fc: don't use bit masks for set/test_bit() numbers blk-wbt: fix comments typo blk-wbt: move wbt_clear_stat to common place in wbt_done blk-sysfs: remove NULL pointer checking in queue_wb_lat_store blk-wbt: remove duplicated setting in wbt_init nvme-pci: add quirk for delay before CHK RDY for WDC SN200 block: remove useless assignment in bio_split null_blk: fix dev->badblocks leak ...
2017-11-29Merge branch 'akpm' (patches from Andrew)Linus Torvalds13-32/+121
Mergr misc fixes from Andrew Morton: "28 fixes" * emailed patches from Andrew Morton <[email protected]>: (28 commits) fs/hugetlbfs/inode.c: change put_page/unlock_page order in hugetlbfs_fallocate() mm/hugetlb: fix NULL-pointer dereference on 5-level paging machine autofs: revert "autofs: fix AT_NO_AUTOMOUNT not being honored" autofs: revert "autofs: take more care to not update last_used on path walk" fs/fat/inode.c: fix sb_rdonly() change mm, memcg: fix mem_cgroup_swapout() for THPs mm: migrate: fix an incorrect call of prep_transhuge_page() kmemleak: add scheduling point to kmemleak_scan() scripts/bloat-o-meter: don't fail with division by 0 fs/mbcache.c: make count_objects() more robust Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical" mm/madvise.c: fix madvise() infinite loop under special circumstances exec: avoid RLIMIT_STACK races with prlimit() IB/core: disable memory registration of filesystem-dax vmas v4l2: disable filesystem-dax mapping support mm: fail get_vaddr_frames() for filesystem-dax mappings mm: introduce get_user_pages_longterm device-dax: implement ->split() to catch invalid munmap attempts mm, hugetlbfs: introduce ->split() to vm_operations_struct scripts/faddr2line: extend usage on generic arch ...
2017-11-29mm/hugetlb: fix NULL-pointer dereference on 5-level paging machineKirill A. Shutemov1-1/+3
I made a mistake during converting hugetlb code to 5-level paging: in huge_pte_alloc() we have to use p4d_alloc(), not p4d_offset(). Otherwise it leads to crash -- NULL-pointer dereference in pud_alloc() if p4d table is not yet allocated. It only can happen in 5-level paging mode. In 4-level paging mode p4d_offset() always returns pgd, so we are fine. Link: http://lkml.kernel.org/r/[email protected] Fixes: c2febafc6773 ("mm: convert generic code to 5-level paging") Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: <[email protected]> [4.11+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-29mm, memcg: fix mem_cgroup_swapout() for THPsShakeel Butt1-1/+1
Commit d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout() support THP") changed mem_cgroup_swapout() to support transparent huge page (THP). However the patch missed one location which should be changed for correctly handling THPs. The resulting bug will cause the memory cgroups whose THPs were swapped out to become zombies on deletion. Link: http://lkml.kernel.org/r/[email protected] Fixes: d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout() support THP") Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Huang Ying <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Greg Thelen <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-29kmemleak: add scheduling point to kmemleak_scan()Yisheng Xie1-0/+2
kmemleak_scan() will scan struct page for each node and it can be really large and resulting in a soft lockup. We have seen a soft lockup when do scan while compile kernel: watchdog: BUG: soft lockup - CPU#53 stuck for 22s! [bash:10287] [...] Call Trace: kmemleak_scan+0x21a/0x4c0 kmemleak_write+0x312/0x350 full_proxy_write+0x5a/0xa0 __vfs_write+0x33/0x150 vfs_write+0xad/0x1a0 SyS_write+0x52/0xc0 do_syscall_64+0x61/0x1a0 entry_SYSCALL64_slow_path+0x25/0x25 Fix this by adding cond_resched every MAX_SCAN_SIZE. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yisheng Xie <[email protected]> Suggested-by: Catalin Marinas <[email protected]> Acked-by: Catalin Marinas <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-29Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings ↵Michal Hocko1-4/+1
are illogical" This reverts commit 0f6d24f87856 ("mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical") because it causes false positive warnings during OOM situations as noticed by Tetsuo Handa: Node 0 active_anon:3525940kB inactive_anon:8372kB active_file:216kB inactive_file:1872kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2504kB dirty:52kB writeback:0kB shmem:8660kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 636928kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes Node 0 DMA free:14848kB min:284kB low:352kB high:420kB active_anon:992kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 2687 3645 3645 Node 0 DMA32 free:53004kB min:49608kB low:62008kB high:74408kB active_anon:2712648kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2773132kB mlocked:0kB kernel_stack:96kB pagetables:5096kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 0 958 958 Node 0 Normal free:17140kB min:17684kB low:22104kB high:26524kB active_anon:812300kB inactive_anon:8372kB active_file:1228kB inactive_file:1868kB unevictable:0kB writepending:52kB present:1048576kB managed:981224kB mlocked:0kB kernel_stack:3520kB pagetables:8552kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 [...] Out of memory: Kill process 8459 (a.out) score 999 or sacrifice child Killed process 8459 (a.out) total-vm:4180kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 8459 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB vm direct limit must be set greater than background limit. The problem is that both thresh and bg_thresh will be 0 if available_memory is less than 4 pages when evaluating global_dirtyable_memory. While this might be worked around the whole point of the warning is dubious at best. We do rely on admins to do sensible things when changing tunable knobs. Dirty memory writeback knobs are not any special in that regards so revert the warning rather than adding more hacks to work this around. Debugged by Yafang Shao. Link: http://lkml.kernel.org/r/[email protected] Fixes: 0f6d24f87856 ("mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical") Signed-off-by: Michal Hocko <[email protected]> Reported-by: Tetsuo Handa <[email protected]> Cc: Yafang Shao <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>