path: root/mm
Age | Commit message | Author | Files | Lines
2017-11-15mm/page-writeback.c: print a warning if the vm dirtiness settings are illogicalYafang Shao1-1/+4
The vm direct (foreground) dirty limit must be set greater than the vm background limit; otherwise, print a warning to help the operator figure out that the vm dirtiness settings are in an illogical state. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yafang Shao <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
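A minimal sketch of the added sanity check, assuming it lives in domain_dirty_limits() in mm/page-writeback.c (placement and exact message wording are assumptions from the changelog):

	/* Warn if the configured background limit is not below the
	 * foreground (direct) limit; clamp it so writeback still works. */
	if (bg_thresh >= thresh) {
		bg_thresh = thresh / 2;
		pr_warn("vm direct limit must be set greater than background limit.\n");
	}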
2017-11-15mm/memblock.c: make the index explicit argument of for_each_memblock_typeGioh Kim1-4/+4
The for_each_memblock_type macro relies on an idx variable defined in the caller's context. Silent macro arguments are almost always the wrong thing to do: they make code harder to read and easier to get wrong. Let's use an explicit iterator parameter for for_each_memblock_type and make the code more obvious. This patch is a mere cleanup and shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Gioh Kim <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
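Roughly, the change turns a macro that silently captures idx into one that takes the iterator explicitly; a sketch based on the description (field names are assumptions):

	/* Before: silently relies on an 'idx' variable in the caller. */
	#define for_each_memblock_type(memblock_type, rsv)		\
		for (idx = 0, rsv = &memblock_type->regions[idx];	\
		     idx < memblock_type->cnt;				\
		     idx++, rsv = &memblock_type->regions[idx])

	/* After: the iterator is an explicit macro argument. */
	#define for_each_memblock_type(i, memblock_type, rsv)		\
		for (i = 0, rsv = &memblock_type->regions[i];		\
		     i < memblock_type->cnt;				\
		     i++, rsv = &memblock_type->regions[i])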
2017-11-15mm, memory_hotplug: remove timeout from __offline_memoryMichal Hocko1-7/+3
We have had a hardcoded 120s timeout after which the memory offline fails, essentially since hot remove was introduced. This is a policy implemented in the kernel. Moreover, there is no way to adjust the timeout, so we sometimes face memory offline failures if the system is under heavy memory pressure or a very intensive CPU workload on large machines. It is not very clear what purpose the timeout actually serves. The offline operation is interruptible by a signal, so if userspace wants timeout-based termination this can be done trivially by sending a signal. If there is a strong use case for doing this from the kernel then we should do it properly and have it tunable from userspace, with the timeout disabled by default, along with an explanation of who uses it and for what purpose. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Michael Ellerman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-15mm, memory_hotplug: do not fail offlining too earlyMichal Hocko1-30/+10
Patch series "mm, memory_hotplug: redefine memory offline retry logic", v2. While testing memory hotplug on a large 4TB machine we have noticed that memory offlining is just too eager to fail. The primary reason is that the retry logic is just too easy to give up. We have 4 ways out of the offline - we have a permanent failure (isolation or memory notifiers fail, or hugetlb pages cannot be dropped) - userspace sends a signal - a hardcoded 120s timeout expires - page migration fails 5 times This is way too convoluted and it doesn't scale very well. We have seen both temporary migration failures as well as 120s being triggered. After removing those restrictions we were able to pass stress testing during memory hot remove without any other negative side effects observed. Therefore I suggest dropping both hard coded policies. I couldn't have found any specific reason for them in the changelog. I neither didn't get any response [1] from Kamezawa. If we need some upper bound - e.g. timeout based - then we should have a proper and user defined policy for that. In any case there should be a clear use case when introducing it. This patch (of 2): Memory offlining can fail too eagerly under heavy memory pressure. page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3 flags: 0x9855fe40010048(uptodate|active|mappedtodisk) page dumped because: isolation failed page->mem_cgroup:ffff8801cd662000 memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed Isolation has failed here because the page is not on LRU. Most probably because it was on the pcp LRU cache or it has been removed from the LRU already but it hasn't been freed yet. In both cases the page doesn't look non-migrable so retrying more makes sense. __offline_pages seems rather cluttered when it comes to the retry logic. We have 5 retries at maximum and a timeout. We could argue whether the timeout makes sense but failing just because of a race when somebody isoltes a page from LRU or puts it on a pcp LRU lists is just wrong. It only takes it to race with a process which unmaps some pages and remove them from the LRU list and we can fail the whole offline because of something that is a temporary condition and actually not harmful for the offline. Please note that unmovable pages should be already excluded during start_isolate_page_range. We could argue that has_unmovable_pages is racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable pages in most cases and movable zone shouldn't contain unmovable pages at all. Some of those pages might be pinned but not for ever because that would be a bug on its own. In any case the context is still interruptible and so the userspace can easily bail out when the operation takes too long. This is certainly better behavior than a hardcoded retry loop which is racy. Fix this by removing the max retry count and only rely on the timeout resp. interruption by a signal from the userspace. Also retry rather than fail when check_pages_isolated sees some !free pages because those could be a result of the race as well. 
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Michael Ellerman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
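A sketch of the retry logic in __offline_pages() after this entry and the one above are applied: no retry cap and no timeout, only signal-based interruption (simplified; helper names follow the surrounding code and are assumptions):

	repeat:
		/* Interruptible from userspace instead of a kernel policy. */
		ret = -EINTR;
		if (signal_pending(current))
			goto failed_removal;

		cond_resched();
		lru_add_drain_all();
		drain_all_pages(zone);

		pfn = scan_movable_pages(start_pfn, end_pfn);
		if (pfn) {	/* movable pages left: migrate and retry */
			do_migrate_range(pfn, end_pfn);
			goto repeat;
		}

		/* !free pages may also be a temporary race: retry, don't fail. */
		offlined_pages = check_pages_isolated(start_pfn, end_pfn);
		if (offlined_pages < 0)
			goto repeat;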
2017-11-15mm, page_alloc: fail has_unmovable_pages when seeing reserved pagesMichal Hocko1-0/+3
Reserved pages should be completely ignored by the core mm because they have a special meaning for their owners. has_unmovable_pages doesn't check for those, so we rely on other tests (reference count, or PageLRU) to fail on such pages. Although this happens to work, it is safer to simply check for them explicitly and not rely on the owner of the page abusing those fields for special purposes. Please note that this is more a further fortification of the code than a fix of an existing issue. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
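The added check is tiny; a sketch of the fortification inside the has_unmovable_pages() pfn loop (exact placement is an assumption):

	page = pfn_to_page(check);
	/* Reserved pages belong to their owners; never treat them as
	 * migratable. */
	if (PageReserved(page))
		return true;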
2017-11-15mm: distinguish CMA and MOVABLE isolation in has_unmovable_pages()Michal Hocko2-6/+16
Joonsoo has noticed that "mm: drop migrate type checks from has_unmovable_pages" would break the CMA allocator because it relies on has_unmovable_pages returning false even for CMA pageblocks, which in fact don't have to be movable:
alloc_contig_range
  start_isolate_page_range
    set_migratetype_isolate
      has_unmovable_pages
This is a result of the code sharing between CMA and memory hotplug, while each one has a different idea of what has_unmovable_pages should return. This is unfortunate, but fixing it properly would require a lot of code duplication. Fix the issue by introducing the requested migrate type argument and special-casing MIGRATE_CMA, where CMA page blocks are handled properly. This will work for memory hotplug because it requires MIGRATE_MOVABLE. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Joonsoo Kim <[email protected]> Tested-by: Stefan Wahren <[email protected]> Tested-by: Ran Wang <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
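A sketch of the interface change, assuming the signature implied by the changelog (the skip_hwpoisoned_pages parameter predates this patch):

	bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
				 int migratetype, bool skip_hwpoisoned_pages)
	{
		/*
		 * CMA allocations (MIGRATE_CMA) handle their own pageblocks;
		 * memory hotplug passes MIGRATE_MOVABLE and keeps the strict
		 * per-page checks below.
		 */
		if (is_migrate_cma(migratetype) &&
		    is_migrate_cma(get_pageblock_migratetype(page)))
			return false;
		/* ... per-page movability checks follow ... */
	}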
2017-11-15mm: drop migrate type checks from has_unmovable_pagesMichal Hocko1-4/+0
Michael has noticed that the memory offline path tries to migrate kernel code pages when doing
echo 0 > /sys/devices/system/memory/memory0/online
The current implementation will fail the operation after several failed page migration attempts, but we shouldn't even attempt to migrate that memory; we should fail right away because this memory is clearly not migratable. This will become a real problem when we drop the retry loop counter and timeout. The real problem is in has_unmovable_pages, in fact. We should fail if there are any non-migratable pages in the area. In order to guarantee that, remove the migrate type checks, because MIGRATE_MOVABLE is not guaranteed to contain only migratable pages; it is merely a heuristic. Similarly, MIGRATE_CMA does guarantee that the page allocator doesn't allocate any non-migratable pages from the block, but CMA allocations themselves are unlikely to be migratable. Therefore remove both checks. [[email protected]: remove unused local `mt'] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Michael Ellerman <[email protected]> Tested-by: Michael Ellerman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Tested-by: Tony Lindgren <[email protected]> Tested-by: Ran Wang <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
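For reference, the removed heuristic looked roughly like this (a sketch; variable names are assumptions):

	mt = get_pageblock_migratetype(page);
	/* Dropped: neither MIGRATE_MOVABLE nor MIGRATE_CMA guarantees that
	 * every page in the block is actually migratable. */
	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
		return false;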
2017-11-15mm/page-writeback.c: remove unused parameter from balance_dirty_pages()Tahsin Erdogan1-3/+2
"mapping" parameter to balance_dirty_pages() is not used anymore. Fixes: dfb8ae567835 ("writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tahsin Erdogan <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-15mm, swap: fix false error message in __swp_swapcount()Huang Ying1-0/+3
When a page fault occurs for a swap entry, the physical swap readahead (not the VMA based swap readahead) may read ahead several swap entries after the faulting swap entry. The readahead algorithm calculates some of the swap entries to read ahead by increasing the offset of the faulting swap entry without checking whether they are beyond the end of the swap device, and it relies on __swp_swapcount() and swapcache_prepare() to check that. Although __swp_swapcount() checks the swap entry passed in, it complains with the following error message even for these expected invalid swap entries, which may confuse end users:
swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, swap entry checking is added to swapin_readahead() to avoid passing out-of-bound swap entries, and the swap entry reserved for the swap header, to __swp_swapcount() and swapcache_prepare(). Link: http://lkml.kernel.org/r/[email protected] Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots") Signed-off-by: "Huang, Ying" <[email protected]> Reported-by: Christian Kujau <[email protected]> Acked-by: Minchan Kim <[email protected]> Suggested-by: Minchan Kim <[email protected]> Cc: Tim Chen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: <[email protected]> [4.11+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
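A sketch of the clamping added in swapin_readahead(), assuming the readahead window is computed from an offset/mask pair as the changelog implies:

	start_offset = offset & ~mask;
	end_offset = offset | mask;
	if (!start_offset)		/* first page is the swap header */
		start_offset++;
	if (end_offset >= si->max)	/* don't run past the device end */
		end_offset = si->max - 1;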
2017-11-15mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other referenceMinchan Kim2-9/+17
When SWP_SYNCHRONOUS_IO swapped-in pages are shared by several processes, skipping the swap cache can cause unnecessary memory wastage: with a swapin fault by read, the processes could share a page if the page were in the swap cache, thus avoiding the allocation of new pages with the same content. This patch makes the swapcache skipping work only if the swap pte is non-shareable. [[email protected]: coding-style fixes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Huang Ying <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
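Taken together with the two entries below (which introduce SWP_SYNCHRONOUS_IO and the swapcache skip), the fast path in do_swap_page() looks roughly like this sketch (helper names follow the changelogs; details are assumptions):

	struct swap_info_struct *si = swp_swap_info(entry);

	/* Skip the swapcache only for synchronous devices AND when no
	 * other process references this swap entry. */
	if ((si->flags & SWP_SYNCHRONOUS_IO) && __swp_swapcount(entry) == 1) {
		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
		if (page) {
			__SetPageLocked(page);
			__SetPageSwapBacked(page);
			set_page_private(page, entry.val);
			lru_cache_add_anon(page);
			swap_readpage(page, true);	/* synchronous read */
		}
	} else {
		/* shared or asynchronous: go through the swapcache as before */
	}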
2017-11-15mm, swap: skip swapcache for swapin of synchronous deviceMinchan Kim3-23/+46
With fast swap storage, platforms want to use swap more aggressively, and swap-in is crucial to application latency. The rw_page() based synchronous devices like zram, pmem and btt are such fast storage. When I profile swapin performance with a zram lz4 decompress test, software overhead is more than 70%. Maybe it would be bigger on nvdimm. This patch aims to reduce swap-in latency by skipping the swapcache if the swap device is a synchronous device like an rw_page based device. It improves my swapin test (5G sequential swapin, no readahead) by 45%: from 2.41 sec to 1.64 sec. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Huang Ying <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-15mm, swap: introduce SWP_SYNCHRONOUS_IOMinchan Kim1-0/+3
If rw_page based fast storage is used for swap devices, we need to detect it to enhance swap IO operations. This patch is preparation for optimizing the swap-in operation in the next patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Huang Ying <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
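The detection itself is essentially a one-liner at swapon time; a sketch (assuming the bdi capability test described above):

	/* Mark the swap device if its backing store does synchronous
	 * rw_page I/O (zram, pmem, btt). */
	if (bdi_cap_synchronous_io(inode_to_bdi(inode)))
		p->flags |= SWP_SYNCHRONOUS_IO;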
2017-11-15mm/mempool.c: use kmalloc_array_node()Johannes Thumshirn1-1/+1
Now that we have a NUMA-aware version of kmalloc_array() we can use it instead of kmalloc_node() without an overflow check in the size calculation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Damien Le Moal <[email protected]> Cc: David Rientjes <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Doug Ledford <[email protected]> Cc: Hal Rosenstock <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mike Marciniszyn <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Santosh Shilimkar <[email protected]> Cc: Sean Hefty <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
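The call site in mempool_resize() roughly becomes (a sketch; variable names are assumptions):

	/* Before: open-coded multiplication, no overflow check. */
	new_elements = kmalloc_node(new_min_nr * sizeof(*new_elements),
				    GFP_KERNEL, pool->node_id);
	/* After: overflow-checked and NUMA-aware in one helper. */
	new_elements = kmalloc_array_node(new_min_nr, sizeof(*new_elements),
					  GFP_KERNEL, pool->node_id);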
2017-11-15slub: fix sysfs duplicate filename creation when slub_debug=OMiles Chen1-0/+4
When slub_debug=O is set, it is possible for the debug flags of an "unmergeable" slab cache to be cleared in kmem_cache_open(), which makes the "unmergeable" cache become "mergeable" in sysfs_slab_add(). These caches will generate their "unique IDs" via create_unique_id(), but it is possible to create identical unique IDs. In my experiment, sgpool-128, names_cache and biovec-256 generate the same ID ":Ft-0004096" and the kernel reports "sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096'". To repeat my experiment, set disable_higher_order_debug=1 and CONFIG_SLUB_DEBUG_ON=y in kernel 4.14. Fix this issue by setting unmergeable=1 if slub_debug=O and the default slub_debug contains any no-merge flags.
call path:
kmem_cache_create()
  __kmem_cache_alias()	-> we set SLAB_NEVER_MERGE flags here
  create_cache()
    __kmem_cache_create()
      kmem_cache_open()	-> clear DEBUG_METADATA_FLAGS
      sysfs_slab_add()	-> the slab cache is mergeable now
sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096'
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x60/0x7c
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 4.14.0-rc7ajb-00131-gd4c2e9f-dirty #123
Hardware name: linux,dummy-virt (DT)
task: ffffffc07d4e0080 task.stack: ffffff8008008000
PC is at sysfs_warn_dup+0x60/0x7c
LR is at sysfs_warn_dup+0x60/0x7c
pc : lr : pstate: 60000145
Call trace:
 sysfs_warn_dup+0x60/0x7c
 sysfs_create_dir_ns+0x98/0xa0
 kobject_add_internal+0xa0/0x294
 kobject_init_and_add+0x90/0xb4
 sysfs_slab_add+0x90/0x200
 __kmem_cache_create+0x26c/0x438
 kmem_cache_create+0x164/0x1f4
 sg_pool_init+0x60/0x100
 do_one_initcall+0x38/0x12c
 kernel_init_freeable+0x138/0x1d4
 kernel_init+0x10/0xfc
 ret_from_fork+0x10/0x18
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Miles Chen <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
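A sketch of the fix in sysfs_slab_add(), per the description above (the exact condition is an assumption):

	/* If slub_debug=O stripped no-merge debug flags, keep treating the
	 * cache as unmergeable so it gets a unique sysfs name. */
	if (!unmergeable && disable_higher_order_debug &&
	    (slub_debug & DEBUG_METADATA_FLAGS))
		unmergeable = 1;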
2017-11-15slab, slub, slob: convert slab_flags_t to 32-bitAlexey Dobriyan2-5/+5
struct kmem_cache::flags is "unsigned long", which is unnecessary on 64-bit as no flags are defined in the higher bits. Switch the field to 32-bit and save some space on x86_64 until such flags appear:
add/remove: 0/0 grow/shrink: 0/107 up/down: 0/-657 (-657)
function           old     new   delta
sysfs_slab_add     720     719      -1
...
check_object       699     676     -23
[[email protected]: fix printk warning] Link: http://lkml.kernel.org/r/20171021100635.GA8287@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Acked-by: Pekka Enberg <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-15slab, slub, slob: add slab_flags_tAlexey Dobriyan6-47/+48
Add a sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON, etc). SLAB is temporarily bloated by switching to "unsigned long", but only until the field is narrowed to 32-bit by the entry above. Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Acked-by: Pekka Enberg <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
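A sketch of the sparse-checked type, shown here in its final 32-bit form from the entry above (the flag values are illustrative):

	typedef unsigned __bitwise slab_flags_t;

	#define SLAB_RED_ZONE	((slab_flags_t __force)0x00000400U)
	#define SLAB_POISON	((slab_flags_t __force)0x00000800U)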
2017-11-15mm/slab.c: only set __GFP_RECLAIMABLE onceDavid Rientjes1-2/+2
SLAB_RECLAIM_ACCOUNT is a permanent attribute of a slab cache. Set __GFP_RECLAIMABLE as part of its ->allocflags rather than check the cachep flag on every page allocation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
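A sketch of the idea: compute allocflags once at cache creation instead of rechecking cachep->flags on every page allocation (placement in __kmem_cache_create() is an assumption):

	cachep->allocflags = __GFP_COMP;
	if (flags & SLAB_RECLAIM_ACCOUNT)
		cachep->allocflags |= __GFP_RECLAIMABLE;
	/* kmem_getpages() then just uses cachep->allocflags as-is. */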
2017-11-15mm/slob.c: remove an unnecessary check for __GFP_ZEROMiles Chen1-1/+1
The current flow guarantees a valid pointer when handling the __GFP_ZERO case, so remove the unnecessary NULL pointer check. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Miles Chen <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
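Roughly (a sketch; the exact predicate in slob may differ):

	/* Before: redundant NULL check -- b is known valid here. */
	if (b && (gfp & __GFP_ZERO))
		memset(b, 0, size);
	/* After */
	if (gfp & __GFP_ZERO)
		memset(b, 0, size);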
2017-11-15mm: oom: show unreclaimable slab info when unreclaimable slabs > user memoryYang Shi3-2/+67
The kernel may panic when an OOM happens without a killable process. Sometimes this is caused by huge unreclaimable slabs used by the kernel. Although kdump could help debug such a problem, kdump is not available on all architectures and it might malfunction sometimes. And since the kernel has already panicked, it is worth capturing such information in dmesg to aid troubleshooting. Print out unreclaimable slab info (used size and total size) for caches whose actual memory usage is not zero (num_objs * size != 0) when the amount of unreclaimable slabs is greater than the total user memory (LRU pages). The output looks like:
Unreclaimable slab info:
Name                 Used      Total
rpc_buffers          31KB       31KB
rpc_tasks             7KB        7KB
ebitmap_node       1964KB     1964KB
avtab_node         5024KB     5024KB
xfs_buf            1402KB     1402KB
xfs_ili             134KB      134KB
xfs_efi_item        115KB      115KB
xfs_efd_item        115KB      115KB
xfs_buf_item        134KB      134KB
xfs_log_item_desc   342KB      342KB
xfs_trans          1412KB     1412KB
xfs_ifork           212KB      212KB
[[email protected]: v11] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
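A sketch of the gating helper on the OOM panic path (the helper name is illustrative; the exact set of LRU counters summed is an assumption):

	static bool is_dump_unreclaim_slabs(void)
	{
		unsigned long nr_lru;

		nr_lru = global_node_page_state(NR_ACTIVE_ANON) +
			 global_node_page_state(NR_INACTIVE_ANON) +
			 global_node_page_state(NR_ACTIVE_FILE) +
			 global_node_page_state(NR_INACTIVE_FILE) +
			 global_node_page_state(NR_ISOLATED_ANON) +
			 global_node_page_state(NR_ISOLATED_FILE) +
			 global_node_page_state(NR_UNEVICTABLE);

		/* dump only when unreclaimable slab dwarfs user memory */
		return global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru;
	}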
2017-11-15mm: slabinfo: remove CONFIG_SLABINFOYang Shi4-9/+6
According to discussion with Christoph (https://marc.info/?l=linux-kernel&m=150695909709711&w=2), it sounds like it is pointless to keep CONFIG_SLABINFO around. This patch removes the CONFIG_SLABINFO config option, but /proc/slabinfo is still available. [[email protected]: v11] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pekka Enberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-15Merge branch 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpuLinus Torvalds1-1/+2
Pull percpu update from Tejun Heo: "Another minor pull request. It only contains one commit which can reclaim a bit of memory wasted during boot on UP"
* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: don't forget to free the temporary struct pcpu_alloc_info
2017-11-15mm/pagewalk.c: report holes in hugetlb rangesJann Horn1-1/+5
This matters at least for the mincore syscall, which will otherwise copy uninitialized memory from the page allocator to userspace. It is probably also a correctness error for /proc/$pid/pagemap, but I haven't tested that. Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has no effect because the caller already checks for that. This only reports holes in hugetlb ranges to callers who have specified a hugetlb_entry callback. This issue was found using an AFL-based fuzzer.
v2:
 - don't crash on ->pte_hole==NULL (Andrew Morton)
 - add Cc stable (Andrew Morton)
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()") Signed-off-by: Jann Horn <[email protected]> Cc: <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
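A sketch of the fixed loop body in walk_hugetlb_range(): holes are now reported via ->pte_hole instead of being silently skipped (callback signatures follow mm/pagewalk.c of that era; details are assumptions):

	pte = huge_pte_offset(walk->mm, addr & hmask, sz);
	if (pte)
		err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
	else if (walk->pte_hole)
		err = walk->pte_hole(addr, next, walk);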
2017-11-15Merge branch 'for-linus' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/trivialLinus Torvalds1-4/+4
Pull trivial tree updates from Jiri Kosina: "The usual rocket-science from trivial tree for 4.15"
* 'for-linus' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  MAINTAINERS: relinquish kconfig
  MAINTAINERS: Update my email address
  treewide: Fix typos in Kconfig
  kfifo: Fix comments
  init/Kconfig: Fix module signing document location
  misc: ibmasm: Return error on error path
  HID: logitech-hidpp: fix mistake in printk, "feeback" -> "feedback"
  MAINTAINERS: Correct path to uDraw PS3 driver
  tracing: Fix doc mistakes in trace sample
  tracing: Kconfig text fixes for CONFIG_HWLAT_TRACER
  MIPS: Alchemy: Remove reverted CONFIG_NETLINK_MMAP from db1xxx_defconfig
  mm/huge_memory.c: fixup grammar in comment
  lib/xz: Add fall-through comments to a switch statement
2017-11-14Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-blockLinus Torvalds4-40/+20
Pull core block layer updates from Jens Axboe: "This is the main pull request for block storage for 4.15-rc1. Nothing out of the ordinary in here, and no API changes or anything like that. Just various new features for drivers, core changes, etc. In particular, this pull request contains:
 - A patch series from Bart, closing the hole on blk/scsi-mq queue quiescing.
 - A series from Christoph, building towards hidden gendisks (for multipath) and the ability to move bio chains around.
 - NVMe
   - Support for native multipath for NVMe (Christoph).
   - Userspace notifications for AENs (Keith).
   - Command side-effects support (Keith).
   - SGL support (Chaitanya Kulkarni)
   - FC fixes and improvements (James Smart)
   - Lots of fixes and tweaks (Various)
 - bcache
   - New maintainer (Michael Lyle)
   - Writeback control improvements (Michael)
   - Various fixes (Coly, Elena, Eric, Liang, et al)
 - lightnvm updates, mostly centered around the pblk interface (Javier, Hans, and Rakesh).
 - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)
 - Writeback series that fix the much discussed hundreds of millions of sync-all units. This goes all the way, as discussed previously (me).
 - Fix for missing wakeup on writeback timer adjustments (Yafang Shao).
 - Fix laptop mode on blk-mq (me).
 - {mq,name} tuple lookup for IO schedulers, allowing us to have alias names. This means you can use 'deadline' on both !mq and on mq (where it's called mq-deadline). (me).
 - blktrace race fix, oopsing on sg load (me).
 - blk-mq optimizations (me).
 - Obscure waitqueue race fix for kyber (Omar).
 - NBD fixes (Josef).
 - Disable writeback throttling by default on bfq, like we do on cfq (Luca Miccio).
 - Series from Ming that enables us to treat flush requests on blk-mq like any other request. This is a really nice cleanup.
 - Series from Ming that improves merging on blk-mq with schedulers, getting us closer to flipping the switch on scsi-mq again.
 - BFQ updates (Paolo).
 - blk-mq atomic flags memory ordering fixes (Peter Z).
 - Loop cgroup support (Shaohua).
 - Lots of minor fixes from lots of different folks, both for core and driver code"
* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
  nvme: fix visibility of "uuid" ns attribute
  blk-mq: fixup some comment typos and lengths
  ide: ide-atapi: fix compile error with defining macro DEBUG
  blk-mq: improve tag waiting setup for non-shared tags
  brd: remove unused brd_mutex
  blk-mq: only run the hardware queue if IO is pending
  block: avoid null pointer dereference on null disk
  fs: guard_bio_eod() needs to consider partitions
  xtensa/simdisk: fix compile error
  nvme: expose subsys attribute to sysfs
  nvme: create 'slaves' and 'holders' entries for hidden controllers
  block: create 'slaves' and 'holders' entries for hidden gendisks
  nvme: also expose the namespace identification sysfs files for mpath nodes
  nvme: implement multipath access to nvme subsystems
  nvme: track shared namespaces
  nvme: introduce a nvme_ns_ids structure
  nvme: track subsystems
  block, nvme: Introduce blk_mq_req_flags_t
  block, scsi: Make SCSI quiesce and resume work reliably
  block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
  ...
2017-11-14virtio_balloon: fix deadlock on OOMMichael S. Tsirkin1-7/+21
fill_balloon doing memory allocations under balloon_lock can cause a deadlock when leak_balloon is called from virtballoon_oom_notify and tries to take the same lock. To fix this, split page allocation and enqueueing, and do the allocations outside the lock. Here's a detailed analysis of the deadlock by Tetsuo Handa: In leak_balloon(), mutex_lock(&vb->balloon_lock) is called in order to serialize against fill_balloon(). But in fill_balloon(), alloc_page(GFP_HIGHUSER[_MOVABLE] | __GFP_NOMEMALLOC | __GFP_NORETRY) is called with the vb->balloon_lock mutex held. Since GFP_HIGHUSER[_MOVABLE] implies __GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS, despite __GFP_NORETRY being specified, this allocation attempt might indirectly depend on somebody else's __GFP_DIRECT_RECLAIM memory allocation. And such an indirect __GFP_DIRECT_RECLAIM memory allocation might call leak_balloon() via virtballoon_oom_notify() via blocking_notifier_call_chain() callback via out_of_memory() when it reaches __alloc_pages_may_oom() and holds the oom_lock mutex. Since the vb->balloon_lock mutex is already held by fill_balloon(), it will cause an OOM lockup:
Thread1:
  fill_balloon()
    takes balloon_lock
    balloon_page_enqueue()
      alloc_page(GFP_HIGHUSER_MOVABLE)
        direct reclaim (__GFP_FS context)
          waits for the fs lock held by Thread2
Thread2:
  takes a fs lock
  alloc_page(GFP_NOFS)
    __alloc_pages_may_oom()
      takes the oom_lock
      out_of_memory()
        blocking_notifier_call_chain()
          leak_balloon()
            tries to take that balloon_lock and deadlocks
Reported-by: Tetsuo Handa <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Wei Wang <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>
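A sketch of the fix in fill_balloon(): allocate with no lock held, then enqueue under balloon_lock (helper names as introduced by this fix; details are assumptions):

	LIST_HEAD(pages);

	/* Allocation phase: balloon_lock NOT held, reclaim can't deadlock. */
	for (num_pfns = 0; num_pfns < num;
	     num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
		struct page *page = balloon_page_alloc();

		if (!page)
			break;
		balloon_page_push(&pages, page);
	}

	/* Enqueue phase: short critical section, no allocations inside. */
	mutex_lock(&vb->balloon_lock);
	while ((page = balloon_page_pop(&pages))) {
		balloon_page_enqueue(&vb->vb_dev_info, page);
		/* ... record pfns, update totals ... */
	}
	mutex_unlock(&vb->balloon_lock);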
2017-11-13Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-45/+79
Pull x86 core updates from Ingo Molnar: "Note that in this cycle most of the x86 topics interacted at a level that caused them to be merged into tip:x86/asm - but this should be a temporary phenomenon, hopefully we'll get back to the usual patterns in the next merge window. The main changes in this cycle were:
Hardware enablement:
 - Add support for the Intel UMIP (User Mode Instruction Prevention) CPU feature. This is a security feature that disables certain instructions such as SGDT, SLDT, SIDT, SMSW and STR. (Ricardo Neri) [ Note that this is disabled by default for now, there are some smaller enhancements in the pipeline that I'll follow up with in the next 1-2 days, which allows this to be enabled by default.]
 - Add support for the AMD SEV (Secure Encrypted Virtualization) CPU feature, on top of SME (Secure Memory Encryption) support that was added in v4.14. (Tom Lendacky, Brijesh Singh)
 - Enable new SSE/AVX/AVX512 CPU features: AVX512_VBMI2, GFNI, VAES, VPCLMULQDQ, AVX512_VNNI, AVX512_BITALG. (Gayatri Kammela)
Other changes:
 - A big series of entry code simplifications and enhancements (Andy Lutomirski)
 - Make the ORC unwinder default on x86 and various objtool enhancements. (Josh Poimboeuf)
 - 5-level paging enhancements (Kirill A. Shutemov)
 - Micro-optimize the entry code a bit (Borislav Petkov)
 - Improve the handling of interdependent CPU features in the early FPU init code (Andi Kleen)
 - Build system enhancements (Changbin Du, Masahiro Yamada)
 - ... plus misc enhancements, fixes and cleanups"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (118 commits)
  x86/build: Make the boot image generation less verbose
  selftests/x86: Add tests for the STR and SLDT instructions
  selftests/x86: Add tests for User-Mode Instruction Prevention
  x86/traps: Fix up general protection faults caused by UMIP
  x86/umip: Enable User-Mode Instruction Prevention at runtime
  x86/umip: Force a page fault when unable to copy emulated result to user
  x86/umip: Add emulation code for UMIP instructions
  x86/cpufeature: Add User-Mode Instruction Prevention definitions
  x86/insn-eval: Add support to resolve 16-bit address encodings
  x86/insn-eval: Handle 32-bit address encodings in virtual-8086 mode
  x86/insn-eval: Add wrapper function for 32 and 64-bit addresses
  x86/insn-eval: Add support to resolve 32-bit address encodings
  x86/insn-eval: Compute linear address in several utility functions
  resource: Fix resource_size.cocci warnings
  X86/KVM: Clear encryption attribute when SEV is active
  X86/KVM: Decrypt shared per-cpu variables when SEV is active
  percpu: Introduce DEFINE_PER_CPU_DECRYPTED
  x86: Add support for changing memory encryption attribute in early boot
  x86/io: Unroll string I/O when SEV is active
  x86/boot: Add early boot support when running with SEV active
  ...
2017-11-13afs: Get rid of the afs_writeback recordDavid Howells1-0/+1
Get rid of the afs_writeback record that kAFS is using to match keys with writes made by that key. Instead, keep a list of keys that have a file open for writing and/or sync'ing and iterate through those. Signed-off-by: David Howells <[email protected]>
2017-11-10Merge branch 'x86/mm' into x86/asm, to merge branchesIngo Molnar2-10/+10
Most of x86/mm is already in x86/asm, so merge the rest too. Signed-off-by: Ingo Molnar <[email protected]>
2017-11-07mm/sparsemem: Fix ARM64 boot crash when CONFIG_SPARSEMEM_EXTREME=yKirill A. Shutemov2-10/+10
Since commit: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y") we allocate the mem_section array dynamically in sparse_memory_present_with_active_regions(), but some architectures, like arm64, don't call that routine to initialize sparsemem. Let's move the initialization into memory_present(); it should cover all architectures. Reported-and-tested-by: Sudeep Holla <[email protected]> Tested-by: Bjorn Andersson <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: Will Deacon <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
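A sketch of the moved initialization, allocating the array on first use in memory_present() (allocator call per the original commit; details are assumptions):

	if (unlikely(!mem_section)) {
		unsigned long size, align;

		size = sizeof(struct mem_section *) * NR_SECTION_ROOTS;
		align = 1 << (INTERNODE_CACHE_SHIFT);
		mem_section = memblock_virt_alloc(size, align);
	}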
2017-11-07Merge branch 'linus' into x86/asm, to pick up fixes and resolve conflictsIngo Molnar56-8/+103
Conflicts:
	arch/x86/kernel/cpu/Makefile
Signed-off-by: Ingo Molnar <[email protected]>
2017-11-07Merge branch 'linus' into locking/core, to resolve conflictsIngo Molnar56-8/+103
Conflicts:
	include/linux/compiler-clang.h
	include/linux/compiler-gcc.h
	include/linux/compiler-intel.h
	include/uapi/linux/stddef.h
Signed-off-by: Ingo Molnar <[email protected]>
2017-11-03block: add a poll_fn callback to struct request_queueChristoph Hellwig1-1/+1
That way we can also poll non-blk-mq queues. Mostly needed for the NVMe multipath code, but could also be useful elsewhere. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-11-03mm, swap: fix race between swap count continuation operationsHuang Ying1-6/+17
One page may store a set of entries of the sis->swap_map (swap_info_struct->swap_map) in multiple swap clusters. If some of the entries have sis->swap_map[offset] > SWAP_MAP_MAX, multiple pages will be used to store the set of entries of the sis->swap_map, and the pages are linked with page->lru. This is called swap count continuation. To access the pages which store the set of entries of the sis->swap_map simultaneously, previously, sis->lock was used. But to improve the scalability of __swap_duplicate(), the swap cluster lock may be used in swap_count_continued() now. This may race with add_swap_count_continuation(), which operates on a nearby swap cluster in which the sis->swap_map entries are stored in the same page. The race can cause a wrong swap count in practice, thus causing unfreeable swap entries or software lockup, etc. To fix the race, a new spin lock called cont_lock is added to struct swap_info_struct to protect the swap count continuation page list. This is a lock at the swap device level, so the scalability isn't very good. But it is still much better than the original sis->lock, because it is only acquired/released when swap count continuation is used, which is considered rare in practice. If it turns out that the scalability becomes an issue for some workloads, we can split the lock into finer-grained locks. Link: http://lkml.kernel.org/r/[email protected] Fixes: 235b62176712 ("mm/swap: add cluster lock") Signed-off-by: "Huang, Ying" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Tim Chen <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: <[email protected]> [4.11+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
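A sketch of the fix: one device-level spinlock dedicated to the continuation page list (field placement is an assumption):

	struct swap_info_struct {
		/* ... */
		spinlock_t cont_lock;	/* serializes swap count continuation */
	};

	/* Both add_swap_count_continuation() and swap_count_continued()
	 * then walk the page->lru continuation chain under it: */
	spin_lock(&si->cont_lock);
	/* ... find/extend the continuation page, adjust counts ... */
	spin_unlock(&si->cont_lock);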
2017-11-03mm/huge_memory.c: deposit page table when copying a PMD migration entryZi Yan1-0/+3
We need to deposit a pre-allocated PTE page table when a PMD migration entry is copied in copy_huge_pmd(). Otherwise, we will leak the pre-allocated page and cause a NULL pointer dereference later in zap_huge_pmd(). The counters missing during the PMD migration entry copy process are added as well. The bug report is here: https://lkml.org/lkml/2017/10/29/214 Link: http://lkml.kernel.org/r/[email protected] Fixes: 84c3fc4e9c563 ("mm: thp: check pmd migration entry in common path") Signed-off-by: Zi Yan <[email protected]> Reported-by: Fengguang Wu <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
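A sketch of the fixed migration-entry branch in copy_huge_pmd() (simplified; the conversion to a read migration entry is elided and field names are assumptions):

	if (unlikely(is_swap_pmd(pmd))) {
		/* ... make the destination entry a read migration entry ... */
		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
		atomic_long_inc(&dst_mm->nr_ptes);
		/* Deposit the preallocated PTE table so zap_huge_pmd() can
		 * withdraw it later instead of dereferencing NULL. */
		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
		ret = 0;
		goto out_unlock;
	}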
2017-11-03userfaultfd: hugetlbfs: prevent UFFDIO_COPY to fill beyond the end of i_sizeAndrea Arcangeli1-2/+30
This oops:
kernel BUG at fs/hugetlbfs/inode.c:484!
RIP: remove_inode_hugepages+0x3d0/0x410
Call Trace:
 hugetlbfs_setattr+0xd9/0x130
 notify_change+0x292/0x410
 do_truncate+0x65/0xa0
 do_sys_ftruncate.constprop.3+0x11a/0x180
 SyS_ftruncate+0xe/0x10
 tracesys+0xd9/0xde
was caused by the lack of an i_size check in hugetlb_mcopy_atomic_pte. mmap() can still succeed beyond the end of the i_size after vmtruncate zapped vmas in those ranges, but the faults must not succeed, and that includes UFFDIO_COPY. We could differentiate the retval to userland to represent a SIGBUS like a page fault would do (vs SIGSEGV), but it doesn't seem very useful, and we'd need to pick a random retval as there's no meaningful syscall retval that would differentiate from SIGSEGV and SIGBUS; there's just -EFAULT. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
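A sketch of the check added to hugetlb_mcopy_atomic_pte() (variable names are assumptions):

	/* The fault must not succeed past i_size, same as a page fault. */
	size = i_size_read(mapping->host) >> huge_page_shift(h);
	ret = -EFAULT;
	if (idx >= size)
		goto out_release_unlock;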
2017-11-03mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flagsDan Williams1-0/+15
The mmap(2) syscall suffers from the ABI anti-pattern of not validating unknown flags. However, proposals like MAP_SYNC need a mechanism to define new behavior that is known to fail on older kernels without the support. Define a new MAP_SHARED_VALIDATE flag pattern that is guaranteed to fail on all legacy mmap implementations. It is worth noting that the original proposal was for a standalone MAP_VALIDATE flag. However, when that could not be supported by all archs Linus observed: I see why you *think* you want a bitmap. You think you want a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC etc, so that people can do
	ret = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_SYNC, fd, 0);
and "know" that MAP_SYNC actually takes. And I'm saying that whole wish is bogus. You're fundamentally depending on special semantics, just make it explicit. It's already not portable, so don't try to make it so. Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value of 0x3, and make people do
	ret = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
and then the kernel side is easier too (none of that random garbage playing games with looking at the "MAP_VALIDATE bit", but just another case statement in that map type thing. Boom. Done. Similar to ->fallocate() we also want the ability to validate the support for new flags on a per ->mmap() 'struct file_operations' instance basis. Towards that end arrange for flags to be generically validated against a mmap_supported_flags exported by 'struct file_operations'. By default all existing flags are implicitly supported, but new flags require MAP_SHARED_VALIDATE and per-instance-opt-in. Cc: Jan Kara <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Andrew Morton <[email protected]> Suggested-by: Christoph Hellwig <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Dan Williams <[email protected]>
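A sketch of the kernel side described above (the 0x3 value is per the changelog; the flags-mask plumbing is an assumption):

	/* 0x3 = MAP_SHARED|MAP_PRIVATE: invalid on legacy kernels, so
	 * they reject it instead of silently ignoring new flags. */
	#define MAP_SHARED_VALIDATE 0x03

	/* inside the MAP_TYPE switch in do_mmap(): */
	case MAP_SHARED_VALIDATE:
		/* unknown flags now fail instead of being ignored */
		if (flags & ~(LEGACY_MAP_MASK | file->f_op->mmap_supported_flags))
			return -EOPNOTSUPP;
		/* fall through to the normal MAP_SHARED handling */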
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman53-0/+53
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boilerplate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases:
 - file had no licensing information in it,
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5 lines of source.
 - File already had some variant of a license header in it (even if <5 lines).
All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply.
 - when both scanners couldn't find any license traces, the file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was:
   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139
   and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note", otherwise it was "GPL-2.0". Results of that was:
   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930
   and resulted in the second patch in this series.
 - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point).
Results summary:
 SPDX license identifier                              # files
 ---------------------------------------------------|------
 GPL-2.0 WITH Linux-syscall-note                          270
 GPL-2.0+ WITH Linux-syscall-note                         169
 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
 LGPL-2.1+ WITH Linux-syscall-note                         15
 GPL-1.0+ WITH Linux-syscall-note                          14
 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
 LGPL-2.0+ WITH Linux-syscall-note                          4
 LGPL-2.1 WITH Linux-syscall-note                           3
 ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
 ((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
 - when the two scanners agreed on the detected license(s), that became the concluded license(s).
 - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred.
 - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics).
 - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation.
 - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time.
In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there were new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with the SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In the initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with:
 - a full scancode scan run, collecting the matched texts, detected license ids and scores
 - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches.
Reviewed-by: Kate Stewart <[email protected]> Reviewed-by: Philippe Ombredanne <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
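For reference, the added identifier is a single comment line at the top of each file; the comment style differs between C sources and headers, e.g.:

	// SPDX-License-Identifier: GPL-2.0	(in a .c file)
	/* SPDX-License-Identifier: GPL-2.0 */	(in a header)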
2017-11-02Backmerge tag 'v4.14-rc7' into drm-nextDave Airlie22-126/+154
Linux 4.14-rc7 Requested by Ben Skeggs for nouveau to avoid major conflicts, and things were getting a bit conflicty already, esp around amdgpu reverts.
2017-10-30Merge tag 'v4.14-rc7' into x86/mm, to pick up fixesIngo Molnar2-20/+10
Signed-off-by: Ingo Molnar <[email protected]>
2017-10-25locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()Mark Rutland1-1/+1
Please do not apply this to mainline directly; instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-25locking/atomics, mm: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()Paul E. McKenney1-3/+3
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't currently harmful. However, for some features it is necessary to instrument reads and writes separately, which is not possible with ACCESS_ONCE(). This distinction is critical to correct operation. It's possible to transform the bulk of kernel code using the Coccinelle script below. However, this doesn't handle comments, leaving references to ACCESS_ONCE() instances which have been removed. As a preparatory step, this patch converts the mm code and comments to use {READ,WRITE}_ONCE() consistently.
----
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Will Deacon <[email protected]> Acked-by: Mark Rutland <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-24locking/barriers: Convert users of lockless_dereference() to READ_ONCE()Will Deacon1-1/+1
READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it can be used instead of lockless_dereference() without any change in semantics. Signed-off-by: Will Deacon <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
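The conversion itself is mechanical, e.g. (gp is a hypothetical RCU-protected pointer):

	/* before */ struct foo *p = lockless_dereference(gp);
	/* after  */ struct foo *p = READ_ONCE(gp);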
2017-10-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2-20/+10
Pull networking fixes from David Miller: "A little more than usual this time around. Been travelling, so that is part of it. Anyways, here are the highlights:
 1) Deal with memcontrol races wrt. listener dismantle, from Eric Dumazet.
 2) Handle page allocation failures properly in nfp driver, from Jakub Kicinski.
 3) Fix memory leaks in macsec, from Sabrina Dubroca.
 4) Fix crashes in pppol2tp_session_ioctl(), from Guillaume Nault.
 5) Several fixes in bnxt_en driver, including preventing potential NVRAM parameter corruption from Michael Chan.
 6) Fix for KRACK attacks in wireless, from Johannes Berg.
 7) rtnetlink event generation fixes from Xin Long.
 8) Deadlock in mlxsw driver, from Ido Schimmel.
 9) Disallow arithmetic operations on context pointers in bpf, from Jakub Kicinski.
 10) Missing sock_owned_by_user() check in sctp_icmp_redirect(), from Xin Long.
 11) Only TCP is supported for sockmap, make that explicit with a check, from John Fastabend.
 12) Fix IP options state races in DCCP and TCP, from Eric Dumazet.
 13) Fix panic in packet_getsockopt(), also from Eric Dumazet.
 14) Add missing locking in hv_sock layer, from Dexuan Cui.
 15) Various aquantia bug fixes, including several statistics handling cures. From Igor Russkikh et al.
 16) Fix arithmetic overflow in devmap code, from John Fastabend.
 17) Fix busted socket memory accounting when we get a fault in the tcp zero copy paths. From Willem de Bruijn.
 18) Don't leave opt->tot_len uninitialized in ipv6, from Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
  stmmac: Don't access tx_q->dirty_tx before netif_tx_lock
  ipv6: flowlabel: do not leave opt->tot_len with garbage
  of_mdio: Fix broken PHY IRQ in case of probe deferral
  textsearch: fix typos in library helpers
  rxrpc: Don't release call mutex on error pointer
  net: stmmac: Prevent infinite loop in get_rx_timestamp_status()
  net: stmmac: Fix stmmac_get_rx_hwtstamp()
  net: stmmac: Add missing call to dev_kfree_skb()
  mlxsw: spectrum_router: Configure TIGCR on init
  mlxsw: reg: Add Tunneling IPinIP General Configuration Register
  net: ethtool: remove error check for legacy setting transceiver type
  soreuseport: fix initialization race
  net: bridge: fix returning of vlan range op errors
  sock: correct sk_wmem_queued accounting on efault in tcp zerocopy
  bpf: add test cases to bpf selftests to cover all access tests
  bpf: fix pattern matches for direct packet access
  bpf: fix off by one for range markings with L{T, E} patterns
  bpf: devmap fix arithmetic overflow in bitmap_size calculation
  net: aquantia: Bad udp rate on default interrupt coalescing
  net: aquantia: Enable coalescing management via ethtool interface
  ...
2017-10-20mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=yKirill A. Shutemov2-6/+21
Size of the mem_section[] array depends on the size of the physical address space. In preparation for boot-time switching between paging modes on x86-64 we need to make the allocation of mem_section[] dynamic, because otherwise we waste a lot of RAM: with CONFIG_NODES_SHIFT=10, mem_section[] size is 32kB for 4-level paging and 2MB for 5-level paging mode. The patch allocates the array on the first call to sparse_memory_present_with_active_regions(). Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
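Under CONFIG_SPARSEMEM_EXTREME the static root array becomes a pointer that is populated at boot; a sketch (declaration per the changelog, the non-EXTREME branch unchanged):

	#ifdef CONFIG_SPARSEMEM_EXTREME
	struct mem_section **mem_section;	/* allocated at runtime */
	#else
	struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT]
		____cacheline_internodealigned_in_smp;
	#endif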
2017-10-20Merge branch 'x86/urgent' into x86/mm, to pick up fixesIngo Molnar62-1977/+6526
Signed-off-by: Ingo Molnar <[email protected]>
2017-10-20Merge tag 'drm-intel-next-2017-10-12' of ↵Dave Airlie1-8/+22
git://anongit.freedesktop.org/drm/drm-intel into drm-next Last batch of drm/i915 features for v4.15: - transparent huge pages support (Matthew) - uapi: I915_PARAM_HAS_SCHEDULER into a capability bitmask (Chris) - execlists: preemption (Chris) - scheduler: user defined priorities (Chris) - execlists optimization (Michał) - plenty of display fixes (Imre) - has_ipc fix (Rodrigo) - platform features definition refactoring (Rodrigo) - legacy cursor update fix (Maarten) - fix vblank waits for cursor updates (Maarten) - reprogram dmc firmware on resume, dmc state fix (Imre) - remove use_mmio_flip module parameter (Maarten) - wa fixes (Oscar) - huc/guc firmware refactoring (Sagar, Michal) - push encoder specific code to encoder hooks (Jani) - DP MST fixes (Dhinakaran) - eDP power sequencing fixes (Manasi) - selftest updates (Chris, Matthew) - mmu notifier cpu hotplug deadlock fix (Daniel) - more VBT parser refactoring (Jani) - max pipe refactoring (Mika Kahola) - rc6/rps refactoring and separation (Sagar) - userptr lockdep fix (Chris) - tracepoint fixes and defunct tracepoint removal (Chris) - use rcu instead of abusing stop_machine (Daniel) - plenty of other fixes all around (Everyone) * tag 'drm-intel-next-2017-10-12' of git://anongit.freedesktop.org/drm/drm-intel: (145 commits) drm/i915: Update DRIVER_DATE to 20171012 drm/i915: Simplify intel_sanitize_enable_ppgtt drm/i915/userptr: Drop struct_mutex before cleanup drm/i915/dp: limit sink rates based on rate drm/i915/dp: centralize max source rate conditions more drm/i915: Allow PCH platforms fall back to BIOS LVDS mode drm/i915: Reuse normal state readout for LVDS/DVO fixed mode drm/i915: Use rcu instead of stop_machine in set_wedged drm/i915: Introduce separate status variable for RC6 and LLC ring frequency setup drm/i915: Create generic functions to control RC6, RPS drm/i915: Create generic function to setup LLC ring frequency table drm/i915: Rename intel_enable_rc6 to intel_rc6_enabled drm/i915: Name structure in dev_priv that contains RPS/RC6 state as "gt_pm" drm/i915: Move rps.hw_lock to dev_priv and s/hw_lock/pcu_lock drm/i915: Name i915_runtime_pm structure in dev_priv as "runtime_pm" drm/i915: Separate RPS and RC6 handling for CHV drm/i915: Separate RPS and RC6 handling for VLV drm/i915: Separate RPS and RC6 handling for BDW drm/i915: Remove superfluous IS_BDW checks and non-BDW changes from gen8_enable_rps drm/i915: Separate RPS and RC6 handling for gen6+ ...
2017-10-19mm, percpu: add support for __GFP_NOWARN flagDaniel Borkmann1-5/+10
Add an option for pcpu_alloc() to support the __GFP_NOWARN flag. Currently, we always throw a warning when size or alignment is unsupported (and also dump a stack trace on failed allocation requests). The warning itself is harmless since we return NULL anyway for any failed request, which callers are required to handle. However, it becomes harmful when panic_on_warn is set. The rationale for the WARN() in pcpu_alloc() is to make it possible to track when larger-than-supported allocation requests are made, such that allocation limits can be tweaked if warranted. This makes sense for in-kernel users; however, there are users of the pcpu allocator where the allocation size is derived from user space requests, e.g. when creating BPF maps. In these cases, the requests should fail gracefully without throwing a splat. The current work-around was to check the allocation size against the upper limit of PCPU_MIN_UNIT_SIZE at call-sites and bail out prior to calling pcpu_alloc(), in order to avoid the WARN(). This is bad in multiple ways: PCPU_MIN_UNIT_SIZE is an implementation detail, and having the checks at call-sites only complicates the code for no good reason. Thus, let's fix it generically by supporting the __GFP_NOWARN flag, which users can then pass when calling the __alloc_percpu_gfp() helper instead. Signed-off-by: Daniel Borkmann <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Mark Rutland <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
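A minimal sketch of the resulting call pattern; __alloc_percpu_gfp() is the real helper, but the exact WARN() condition inside pcpu_alloc() shown here is an illustration, not the verbatim code:

    /* In pcpu_alloc(): respect the caller's opt-out. */
    bool do_warn = !(gfp & __GFP_NOWARN);

    if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE ||
                 align > PAGE_SIZE || !is_power_of_2(align))) {
            WARN(do_warn, "illegal size (%zu) or align (%zu) for percpu allocation\n",
                 size, align);
            return NULL;
    }

    /* A caller whose size is derived from user space, e.g. a BPF map: */
    void __percpu *pptr;

    pptr = __alloc_percpu_gfp(size, 8, GFP_USER | __GFP_NOWARN);
    if (!pptr)
            return -ENOMEM; /* fail gracefully, no splat */

With this in place, the PCPU_MIN_UNIT_SIZE checks at the call-sites can be dropped entirely.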
2017-10-14mm/page-writeback.c: make changes of dirty_writeback_centisecs take effect ↵Yafang Shao1-1/+10
immediately This patch is the follow-up to the previous patch: [writeback: schedule periodic writeback with sysctl]. There is another issue to fix. For example: - When the tunable was set to one hour and is then reset to one second, the new setting will not take effect for up to one hour. Kicking the flusher threads immediately fixes it. Cc: Jens Axboe <[email protected]> Cc: Jan Kara <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Yafang Shao <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
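A hedged sketch of the fix, assuming the v4.15-era wakeup_flusher_threads(enum wb_reason) signature; the handler body is condensed for illustration, not the verbatim patch:

    int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
                                          void __user *buffer, size_t *length,
                                          loff_t *ppos)
    {
            unsigned int old_interval = dirty_writeback_interval;
            int ret;

            ret = proc_dointvec(table, write, buffer, length, ppos);

            /*
             * A flusher thread sleeping on the old (possibly very long)
             * interval will not notice the new value until it wakes up,
             * so kick the threads to make the change effective now.
             */
            if (ret == 0 && write && dirty_writeback_interval &&
                dirty_writeback_interval != old_interval)
                    wakeup_flusher_threads(WB_REASON_PERIODIC);

            return ret;
    }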
2017-10-13mm, swap: use page-cluster as max window of VMA based swap readaheadHuang Ying1-34/+7
When the VMA based swap readahead was introduced, a new knob /sys/kernel/mm/swap/vma_ra_max_order was added as the max window of VMA swap readahead, to make it possible to use a different max window for VMA based readahead than for the original physical readahead. But Minchan Kim pointed out that this causes a regression: with the change, setting the page-cluster sysctl to zero can no longer disable swap readahead. To fix the regression, the page-cluster sysctl is used as the max window of both the VMA based swap readahead and the original physical swap readahead. If more fine-grained control is needed in the future, more knobs can be added as subordinates of the page-cluster sysctl. The vma_ra_max_order knob is deleted. Because the knob was introduced in v4.14-rc1 and this patch is targeted to be merged before the v4.14 release, there should be no existing users of this newly added ABI. Link: http://lkml.kernel.org/r/[email protected] Fixes: ec560175c0b6fce ("mm, swap: VMA based swap readahead") Signed-off-by: "Huang, Ying" <[email protected]> Reported-by: Minchan Kim <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Fengguang Wu <[email protected]> Cc: Tim Chen <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
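A hedged sketch of how the VMA readahead window can then be derived from page-cluster alone; SWAP_RA_ORDER_CEILING stands in for an internal cap in swap_state.c, and its exact name and value here are assumptions:

    unsigned int max_win;

    /*
     * page-cluster is the single tuning knob: the window is
     * 2^page_cluster pages, clamped by an internal ceiling.
     * A value of 0 gives a window of one page, i.e. swap
     * readahead is disabled, restoring the old semantics.
     */
    max_win = 1 << min_t(unsigned int, READ_ONCE(page_cluster),
                         SWAP_RA_ORDER_CEILING);
    if (max_win == 1)
            return 1;       /* readahead disabled */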
2017-10-13mm: page_vma_mapped: ensure pmd is loaded with READ_ONCE outside of lockWill Deacon1-15/+10
Loading the pmd without holding the pmd_lock exposes us to races with concurrent updaters of the page tables but, worse still, it also allows the compiler to cache the pmd value in a register and reuse it later on, even if we've performed a READ_ONCE in between and seen a more recent value. In the case of page_vma_mapped_walk, this leads to the following crash when the pmd loaded for the initial pmd_trans_huge check is all zeroes and a subsequent valid table entry is loaded by check_pmd. We then proceed into map_pte, but the compiler re-uses the zero entry inside pte_offset_map, resulting in a junk pointer being installed in pvmw->pte: PC is at check_pte+0x20/0x170 LR is at page_vma_mapped_walk+0x2e0/0x540 [...] Process doio (pid: 2463, stack limit = 0xffff00000f2e8000) Call trace: check_pte+0x20/0x170 page_vma_mapped_walk+0x2e0/0x540 page_mkclean_one+0xac/0x278 rmap_walk_file+0xf0/0x238 rmap_walk+0x64/0xa0 page_mkclean+0x90/0xa8 clear_page_dirty_for_io+0x84/0x2a8 mpage_submit_page+0x34/0x98 mpage_process_page_bufs+0x164/0x170 mpage_prepare_extent_to_map+0x134/0x2b8 ext4_writepages+0x484/0xe30 do_writepages+0x44/0xe8 __filemap_fdatawrite_range+0xbc/0x110 file_write_and_wait_range+0x48/0xd8 ext4_sync_file+0x80/0x4b8 vfs_fsync_range+0x64/0xc0 SyS_msync+0x194/0x1e8 This patch fixes the problem by ensuring that READ_ONCE is used before the initial checks on the pmd, and this value is subsequently used when checking whether or not the pmd is present. pmd_check is removed and the pmd_present check is inlined directly. Link: http://lkml.kernel.org/r/[email protected] Fixes: f27176cfc363 ("mm: convert page_mkclean_one() to use page_vma_mapped_walk()") Signed-off-by: Will Deacon <[email protected]> Tested-by: Yury Norov <[email protected]> Tested-by: Richard Ruigrok <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
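A hedged sketch of the fixed pattern; the surrounding control flow of page_vma_mapped_walk() is condensed, and field names follow struct page_vma_mapped_walk:

    pmd_t pmde;

    pvmw->pmd = pmd_offset(pud, pvmw->address);
    /*
     * Read the pmd exactly once. A plain dereference would let the
     * compiler cache the (possibly zero) value in a register and
     * reuse the stale entry after the page table has changed.
     */
    pmde = READ_ONCE(*pvmw->pmd);
    if (pmd_trans_huge(pmde)) {
            pvmw->ptl = pmd_lock(pvmw->vma->vm_mm, pvmw->pmd);
            /* ... re-check the pmd under the lock ... */
    } else if (!pmd_present(pmde)) {
            /* The same pmde value gates the present check; no re-read. */
            return false;
    }
    /* map_pte() and later steps proceed from a consistent snapshot. */

The key point is that every subsequent check is made against the single pmde snapshot, so the compiler has no opportunity to substitute a stale re-read of *pvmw->pmd.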