aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2019-10-14mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped ↵Jane Chu1-9/+13
more than once Mmap /dev/dax more than once, then read the poison location using address from one of the mappings. The other mappings due to not having the page mapped in will cause SIGKILLs delivered to the process. SIGKILL succeeds over SIGBUS, so user process loses the opportunity to handle the UE. Although one may add MAP_POPULATE to mmap(2) to work around the issue, MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so isn't always an option. Details - ndctl inject-error --block=10 --count=1 namespace6.0 ./read_poison -x dax6.0 -o 5120 -m 2 mmaped address 0x7f5bb6600000 mmaped address 0x7f3cf3600000 doing local read at address 0x7f3cf3601400 Killed Console messages in instrumented kernel - mce: Uncorrected hardware memory error in user-access at edbe201400 Memory failure: tk->addr = 7f5bb6601000 Memory failure: address edbe201: call dev_pagemap_mapping_shift dev_pagemap_mapping_shift: page edbe201: no PUD Memory failure: tk->size_shift == 0 Memory failure: Unable to find user space address edbe201 in read_poison Memory failure: tk->addr = 7f3cf3601000 Memory failure: address edbe201: call dev_pagemap_mapping_shift Memory failure: tk->size_shift = 21 Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page => to deliver SIGKILL Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption => to deliver SIGBUS Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jane Chu <[email protected]> Suggested-by: Naoya Horiguchi <[email protected]> Reviewed-by: Dan Williams <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14mm/slab.c: fix kernel-doc warning for __ksize()Randy Dunlap1-0/+3
Fix kernel-doc warning in mm/slab.c: mm/slab.c:4215: warning: Function parameter or member 'objp' not described in '__ksize' Also add Return: documentation section for this function. Link: http://lkml.kernel.org/r/[email protected] Fixes: 10d1f8cb3965 ("mm/slab: refactor common ksize KASAN logic into slab_common.c") Signed-off-by: Randy Dunlap <[email protected]> Acked-by: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14xarray.h: fix kernel-doc warningRandy Dunlap1-2/+2
Fix (Sphinx) kernel-doc warning in <linux/xarray.h>: include/linux/xarray.h:232: WARNING: Unexpected indentation. Link: http://lkml.kernel.org/r/[email protected] Fixes: a3e4d3f97ec8 ("XArray: Redesign xa_alloc API") Signed-off-by: Randy Dunlap <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14bitmap.h: fix kernel-doc warning and typoRandy Dunlap1-1/+2
Fix kernel-doc warning in <linux/bitmap.h>: include/linux/bitmap.h:341: warning: Function parameter or member 'nbits' not described in 'bitmap_or_equal' Also fix small typo (bitnaps). Link: http://lkml.kernel.org/r/[email protected] Fixes: b9fa6442f704 ("cpumask: Implement cpumask_or_equal()") Signed-off-by: Randy Dunlap <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14fs/fs-writeback.c: fix kernel-doc warningRandy Dunlap1-1/+1
Fix kernel-doc warning in fs/fs-writeback.c: fs/fs-writeback.c:913: warning: Excess function parameter 'nr_pages' description in 'cgroup_writeback_by_id' Link: http://lkml.kernel.org/r/[email protected] Fixes: d62241c7a406 ("writeback, memcg: Implement cgroup_writeback_by_id()") Signed-off-by: Randy Dunlap <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14fs/libfs.c: fix kernel-doc warningRandy Dunlap1-2/+1
Fix kernel-doc warning in fs/libfs.c: fs/libfs.c:496: warning: Excess function parameter 'available' description in 'simple_write_end' Link: http://lkml.kernel.org/r/[email protected] Fixes: ad2a722f196d ("libfs: Open code simple_commit_write into only user") Signed-off-by: Randy Dunlap <[email protected]> Cc: Boaz Harrosh <[email protected]> Cc: Alexander Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14fs/direct-io.c: fix kernel-doc warningRandy Dunlap1-2/+1
Fix kernel-doc warning in fs/direct-io.c: fs/direct-io.c:258: warning: Excess function parameter 'offset' description in 'dio_complete' Also, don't mark this function as having kernel-doc notation since it is not exported. Link: http://lkml.kernel.org/r/[email protected] Fixes: 6d544bb4d901 ("dio: centralize completion in dio_complete()") Signed-off-by: Randy Dunlap <[email protected]> Cc: Zach Brown <[email protected]> Cc: Alexander Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14mm, compaction: fix wrong pfn handling in __reset_isolation_pfn()Vlastimil Babka1-3/+4
Florian and Dave reported [1] a NULL pointer dereference in __reset_isolation_pfn(). While the exact cause is unclear, staring at the code revealed two bugs, which might be related. One bug is that if zone starts in the middle of pageblock, block_page might correspond to different pfn than block_pfn, and then the pfn_valid_within() checks will check different pfn's than those accessed via struct page. This might result in acessing an unitialized page in CONFIG_HOLES_IN_ZONE configs. The other bug is that end_page refers to the first page of next pageblock and not last page of current pageblock. The online and valid check is then wrong and with sections, the while (page < end_page) loop might wander off actual struct page arrays. [1] https://lore.kernel.org/linux-xfs/[email protected]/ Link: http://lkml.kernel.org/r/[email protected] Fixes: 6b0868c820ff ("mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints") Signed-off-by: Vlastimil Babka <[email protected]> Reported-by: Florian Weimer <[email protected]> Reported-by: Dave Chinner <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14mm, hugetlb: allow hugepage allocations to reclaim as neededDavid Rientjes1-2/+4
Commit b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed") has chnaged the allocator to bail out from the allocator early to prevent from a potentially excessive memory reclaim. __GFP_RETRY_MAYFAIL is designed to retry the allocation, reclaim and compaction loop as long as there is a reasonable chance to make forward progress. Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at the INIT_COMPACT_PRIORITY compaction attempt gives this feedback. The most obvious affected subsystem is hugetlbfs which allocates huge pages based on an admin request (or via admin configured overcommit). I have done a simple test which tries to allocate half of the memory for hugetlb pages while the memory is full of a clean page cache. This is not an unusual situation because we try to cache as much of the memory as possible and sysctl/sysfs interface to allocate huge pages is there for flexibility to allocate hugetlb pages at any time. System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages after the memory is prefilled by a clean page cache: root@test1:~# cat hugetlb_test.sh set -x echo 0 > /proc/sys/vm/nr_hugepages echo 3 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/compact_memory dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10)) TS=$(date +%s) echo 256 > /proc/sys/vm/nr_hugepages cat /proc/sys/vm/nr_hugepages The results for 2 consecutive runs on clean 5.3 root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s + date +%s + TS=1569905284 + echo 256 + cat /proc/sys/vm/nr_hugepages 256 root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s + date +%s + TS=1569905311 + echo 256 + cat /proc/sys/vm/nr_hugepages 256 Now with b39d0ee2632d applied root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s + date +%s + TS=1569905516 + echo 256 + cat /proc/sys/vm/nr_hugepages 11 root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s + date +%s + TS=1569905541 + echo 256 + cat /proc/sys/vm/nr_hugepages 12 The success rate went down by factor of 20! Although hugetlb allocation requests might fail and it is reasonable to expect them to under extremely fragmented memory or when the memory is under a heavy pressure but the above situation is not that case. Fix the regression by reverting back to the previous behavior for __GFP_RETRY_MAYFAIL requests and disable the beail out heuristic for those requests. Mike said: : hugetlbfs allocations are commonly done via sysctl/sysfs shortly after : boot where this may not be as much of an issue. However, I am aware of at : least three use cases where allocations are made after the system has been : up and running for quite some time: : : - DB reconfiguration. If sysctl/sysfs fails to get required number of : huge pages, system is rebooted to perform allocation after boot. : : - VM provisioning. If unable get required number of huge pages, fall : back to base pages. : : - An application that does not preallocate pool, but rather allocates : pages at fault time for optimal NUMA locality. : : In all cases, I would expect b39d0ee2632d to cause regressions and : noticable behavior changes. : : My quick/limited testing in : https://lkml.kernel.org/r/[email protected] : was insufficient. It was also mentioned that if something like : b39d0ee2632d went forward, I would like exemptions for __GFP_RETRY_MAYFAIL : requests as in this patch. [[email protected]: reworded changelog] Link: http://lkml.kernel.org/r/[email protected] Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed") Signed-off-by: David Rientjes <[email protected]> Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14lib/test_meminit: add a kmem_cache_alloc_bulk() testAlexander Potapenko1-0/+27
Make sure allocations from kmem_cache_alloc_bulk() and kmem_cache_free_bulk() are properly initialized. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Potapenko <[email protected]> Cc: Kees Cook <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Thibaut Sautereau <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14mm/slub.c: init_on_free=1 should wipe freelist ptr for bulk allocationsAlexander Potapenko1-6/+16
slab_alloc_node() already zeroed out the freelist pointer if init_on_free was on. Thibaut Sautereau noticed that the same needs to be done for kmem_cache_alloc_bulk(), which performs the allocations separately. kmem_cache_alloc_bulk() is currently used in two places in the kernel, so this change is unlikely to have a major performance impact. SLAB doesn't require a similar change, as auto-initialization makes the allocator store the freelist pointers off-slab. Link: http://lkml.kernel.org/r/[email protected] Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") Signed-off-by: Alexander Potapenko <[email protected]> Reported-by: Thibaut Sautereau <[email protected]> Reported-by: Kees Cook <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Laura Abbott <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14lib/generic-radix-tree.c: add kmemleak annotationsEric Biggers1-6/+26
Kmemleak is falsely reporting a leak of the slab allocation in sctp_stream_init_ext(): BUG: memory leak unreferenced object 0xffff8881114f5d80 (size 96): comm "syz-executor934", pid 7160, jiffies 4294993058 (age 31.950s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000ce7a1326>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline] [<00000000ce7a1326>] slab_post_alloc_hook mm/slab.h:439 [inline] [<00000000ce7a1326>] slab_alloc mm/slab.c:3326 [inline] [<00000000ce7a1326>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553 [<000000007abb7ac9>] kmalloc include/linux/slab.h:547 [inline] [<000000007abb7ac9>] kzalloc include/linux/slab.h:742 [inline] [<000000007abb7ac9>] sctp_stream_init_ext+0x2b/0xa0 net/sctp/stream.c:157 [<0000000048ecb9c1>] sctp_sendmsg_to_asoc+0x946/0xa00 net/sctp/socket.c:1882 [<000000004483ca2b>] sctp_sendmsg+0x2a8/0x990 net/sctp/socket.c:2102 [...] But it's freed later. Kmemleak misses the allocation because its pointer is stored in the generic radix tree sctp_stream::out, and the generic radix tree uses raw pages which aren't tracked by kmemleak. Fix this by adding the kmemleak hooks to the generic radix tree code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Eric Biggers <[email protected]> Reported-by: <[email protected]> Reviewed-by: Marcelo Ricardo Leitner <[email protected]> Acked-by: Neil Horman <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Vlad Yasevich <[email protected]> Cc: Xin Long <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14mm/slub: fix a deadlock in show_slab_objects()Qian Cai1-2/+11
A long time ago we fixed a similar deadlock in show_slab_objects() [1]. However, it is apparently due to the commits like 01fb58bcba63 ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path") and 03afc0e25f7f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}"), this kind of deadlock is back by just reading files in /sys/kernel/slab which will generate a lockdep splat below. Since the "mem_hotplug_lock" here is only to obtain a stable online node mask while racing with NUMA node hotplug, in the worst case, the results may me miscalculated while doing NUMA node hotplug, but they shall be corrected by later reads of the same files. WARNING: possible circular locking dependency detected ------------------------------------------------------ cat/5224 is trying to acquire lock: ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at: show_slab_objects+0x94/0x3a8 but task is already holding lock: b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (kn->count#45){++++}: lock_acquire+0x31c/0x360 __kernfs_remove+0x290/0x490 kernfs_remove+0x30/0x44 sysfs_remove_dir+0x70/0x88 kobject_del+0x50/0xb0 sysfs_slab_unlink+0x2c/0x38 shutdown_cache+0xa0/0xf0 kmemcg_cache_shutdown_fn+0x1c/0x34 kmemcg_workfn+0x44/0x64 process_one_work+0x4f4/0x950 worker_thread+0x390/0x4bc kthread+0x1cc/0x1e8 ret_from_fork+0x10/0x18 -> #1 (slab_mutex){+.+.}: lock_acquire+0x31c/0x360 __mutex_lock_common+0x16c/0xf78 mutex_lock_nested+0x40/0x50 memcg_create_kmem_cache+0x38/0x16c memcg_kmem_cache_create_func+0x3c/0x70 process_one_work+0x4f4/0x950 worker_thread+0x390/0x4bc kthread+0x1cc/0x1e8 ret_from_fork+0x10/0x18 -> #0 (mem_hotplug_lock.rw_sem){++++}: validate_chain+0xd10/0x2bcc __lock_acquire+0x7f4/0xb8c lock_acquire+0x31c/0x360 get_online_mems+0x54/0x150 show_slab_objects+0x94/0x3a8 total_objects_show+0x28/0x34 slab_attr_show+0x38/0x54 sysfs_kf_seq_show+0x198/0x2d4 kernfs_seq_show+0xa4/0xcc seq_read+0x30c/0x8a8 kernfs_fop_read+0xa8/0x314 __vfs_read+0x88/0x20c vfs_read+0xd8/0x10c ksys_read+0xb0/0x120 __arm64_sys_read+0x54/0x88 el0_svc_handler+0x170/0x240 el0_svc+0x8/0xc other info that might help us debug this: Chain exists of: mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(kn->count#45); lock(slab_mutex); lock(kn->count#45); lock(mem_hotplug_lock.rw_sem); *** DEADLOCK *** 3 locks held by cat/5224: #0: 9eff00095b14b2a0 (&p->lock){+.+.}, at: seq_read+0x4c/0x8a8 #1: 0eff008997041480 (&of->mutex){+.+.}, at: kernfs_seq_start+0x34/0xf0 #2: b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0 stack backtrace: Call trace: dump_backtrace+0x0/0x248 show_stack+0x20/0x2c dump_stack+0xd0/0x140 print_circular_bug+0x368/0x380 check_noncircular+0x248/0x250 validate_chain+0xd10/0x2bcc __lock_acquire+0x7f4/0xb8c lock_acquire+0x31c/0x360 get_online_mems+0x54/0x150 show_slab_objects+0x94/0x3a8 total_objects_show+0x28/0x34 slab_attr_show+0x38/0x54 sysfs_kf_seq_show+0x198/0x2d4 kernfs_seq_show+0xa4/0xcc seq_read+0x30c/0x8a8 kernfs_fop_read+0xa8/0x314 __vfs_read+0x88/0x20c vfs_read+0xd8/0x10c ksys_read+0xb0/0x120 __arm64_sys_read+0x54/0x88 el0_svc_handler+0x170/0x240 el0_svc+0x8/0xc I think it is important to mention that this doesn't expose the show_slab_objects to use-after-free. There is only a single path that might really race here and that is the slab hotplug notifier callback __kmem_cache_shrink (via slab_mem_going_offline_callback) but that path doesn't really destroy kmem_cache_node data structures. [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.0/02850.html [[email protected]: add comment explaining why we don't need mem_hotplug_lock] Link: http://lkml.kernel.org/r/[email protected] Fixes: 01fb58bcba63 ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path") Fixes: 03afc0e25f7f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}") Signed-off-by: Qian Cai <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14mm, page_owner: rename flag indicating that page is allocatedVlastimil Babka2-7/+7
Commit 37389167a281 ("mm, page_owner: keep owner info when freeing the page") has introduced a flag PAGE_EXT_OWNER_ACTIVE to indicate that page is tracked as being allocated. Kirril suggested naming it PAGE_EXT_OWNER_ALLOCATED to make it more clear, as "active is somewhat loaded term for a page". Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Suggested-by: Kirill A. Shutemov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Walter Wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14mm, page_owner: decouple freeing stack trace from debug_pageallocVlastimil Babka2-21/+10
Commit 8974558f49a6 ("mm, page_owner, debug_pagealloc: save and dump freeing stack trace") enhanced page_owner to also store freeing stack trace, when debug_pagealloc is also enabled. KASAN would also like to do this [1] to improve error reports to debug e.g. UAF issues. Kirill has suggested that the freeing stack trace saving should be also possible to be enabled separately from KASAN or debug_pagealloc, i.e. with an extra boot option. Qian argued that we have enough options already, and avoiding the extra overhead is not worth the complications in the case of a debugging option. Kirill noted that the extra stack handle in struct page_owner requires 0.1% of memory. This patch therefore enables free stack saving whenever page_owner is enabled, regardless of whether debug_pagealloc or KASAN is also enabled. KASAN kernels booted with page_owner=on will thus benefit from the improved error reports. [1] https://bugzilla.kernel.org/show_bug.cgi?id=203967 [[email protected]: v3] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Qian Cai <[email protected]> Suggested-by: Dmitry Vyukov <[email protected]> Suggested-by: Walter Wu <[email protected]> Suggested-by: Andrey Ryabinin <[email protected]> Suggested-by: Kirill A. Shutemov <[email protected]> Suggested-by: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14mm, page_owner: fix off-by-one error in __set_page_owner_handle()Vlastimil Babka3-22/+24
Patch series "followups to debug_pagealloc improvements through page_owner", v3. These are followups to [1] which made it to Linus meanwhile. Patches 1 and 3 are based on Kirill's review, patch 2 on KASAN request [2]. It would be nice if all of this made it to 5.4 with [1] already there (or at least Patch 1). This patch (of 3): As noted by Kirill, commit 7e2f2a0cd17c ("mm, page_owner: record page owner for each subpage") has introduced an off-by-one error in __set_page_owner_handle() when looking up page_ext for subpages. As a result, the head page page_owner info is set twice, while for the last tail page, it's not set at all. Fix this and also make the code more efficient by advancing the page_ext pointer we already have, instead of calling lookup_page_ext() for each subpage. Since the full size of struct page_ext is not known at compile time, we can't use a simple page_ext++ statement, so introduce a page_ext_next() inline function for that. Link: http://lkml.kernel.org/r/[email protected] Fixes: 7e2f2a0cd17c ("mm, page_owner: record page owner for each subpage") Signed-off-by: Vlastimil Babka <[email protected]> Reported-by: Kirill A. Shutemov <[email protected]> Reported-by: Miles Chen <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Walter Wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14xtensa: fix type conversion in __get_user_[no]checkMax Filippov1-26/+29
__get_user_[no]check uses temporary buffer of type long to store result of __get_user_size and do sign extension on it when necessary. This doesn't work correctly for 64-bit data. Fix it by moving temporary buffer/sign extension logic to __get_user_asm. Don't do assignment of __get_user_bad result to (x) as it may not always be integer-compatible now and issue warning even when it's going to be optimized. Instead do (x) = 0; and call __get_user_bad separately. Zero initialize __x in __get_user_asm and use '+' constraint for its assembly argument, so that its value is preserved in error cases. This may add at most 1 cycle to the fast path, but saves an instruction and two padding bytes in the fixup section for each use of this macro and works for both misaligned store and store exception. Signed-off-by: Max Filippov <[email protected]>
2019-10-14xtensa: clean up assembly arguments in uaccess macrosMax Filippov1-21/+21
Numeric assembly arguments are hard to understand and assembly code that uses them is hard to modify. Use named arguments in __check_align_*, __get_user_asm and __put_user_asm. Modify macro parameter names so that they don't affect argument names. Use '+' constraint for the [err] argument instead of having it as both input and output. Signed-off-by: Max Filippov <[email protected]>
2019-10-14block: Fix elv_support_iosched()Damien Le Moal1-1/+2
A BIO based request queue does not have a tag_set, which prevent testing for the flag BLK_MQ_F_NO_SCHED indicating that the queue does not require an elevator. This leads to an incorrect initialization of a default elevator in some cases such as BIO based null_blk (queue_mode == BIO) with zoned mode enabled as the default elevator in this case is mq-deadline instead of "none". Fix this by testing for a NULL queue mq_ops field which indicates that the queue is BIO based and should not have an elevator. Reported-by: Shinichiro Kawasaki <[email protected]> Reviewed-by: Bob Liu <[email protected]> Signed-off-by: Damien Le Moal <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-14parisc: Remove 32-bit DMA enforcement from sba_iommuSven Schnelle1-8/+0
This breaks booting from sata_sil24 with the recent DMA change. According to James Bottomley this was in to improve performance by kicking the device into 32 bit descriptors, which are usually more efficient, especially with older dual descriptor format cards like we have on parisc systems. Remove it for now to make DMA working again. Fixes: dcc02c19cc06 ("sata_sil24: use dma_set_mask_and_coherent") Signed-off-by: Sven Schnelle <[email protected]> Signed-off-by: Helge Deller <[email protected]>
2019-10-14parisc: Fix vmap memory leak in ioremap()/iounmap()Helge Deller1-5/+7
Sven noticed that calling ioremap() and iounmap() multiple times leads to a vmap memory leak: vmap allocation for size 4198400 failed: use vmalloc=<size> to increase size It seems we missed calling vunmap() in iounmap(). Signed-off-by: Helge Deller <[email protected]> Noticed-by: Sven Schnelle <[email protected]> Cc: <[email protected]> # v3.16+
2019-10-14parisc: prefer __section from compiler_attributes.hNick Desaulniers2-2/+2
Reported-by: Sedat Dilek <[email protected]> Suggested-by: Josh Poimboeuf <[email protected]> Signed-off-by: Nick Desaulniers <[email protected]> Signed-off-by: Helge Deller <[email protected]>
2019-10-14parisc: sysctl.c: Use CONFIG_PARISC instead of __hppa_ defineHelge Deller1-2/+2
Signed-off-by: Helge Deller <[email protected]>
2019-10-14firmware: dmi: Fix unlikely out-of-bounds read in save_mem_devicesJean Delvare1-1/+1
Before reading the Extended Size field, we should ensure it fits in the DMI record. There is already a record length check but it does not cover that field. It would take a seriously corrupted DMI table to hit that bug, so no need to worry, but we should still fix it. Signed-off-by: Jean Delvare <[email protected]> Fixes: 6deae96b42eb ("firmware, DMI: Add function to look up a handle and return DIMM size") Cc: Tony Luck <[email protected]> Cc: Borislav Petkov <[email protected]>
2019-10-14riscv: tlbflush: remove confusing comment on local_flush_tlb_all()Paul Walmsley1-4/+0
Remove a confusing comment on our local_flush_tlb_all() implementation. Per an internal discussion with Andrew, while it's true that the fence.i is not necessary, it's not the case that an sfence.vma implies a fence.i. We also drop the section about "flush[ing] the entire local TLB" to better align with the language in section 4.2.1 "Supervisor Memory-Management Fence Instruction" of the RISC-V Privileged Specification v20190608. Fixes: c901e45a999a1 ("RISC-V: `sfence.vma` orderes the instruction cache") Reported-by: Alan Kao <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Andrew Waterman <[email protected]> Signed-off-by: Paul Walmsley <[email protected]>
2019-10-14riscv: dts: HiFive Unleashed: add default chosen/stdout-pathPaul Walmsley1-0/+1
Add a default "stdout-path" to the kernel DTS file, as is present in many of the board DTS files elsewhere in the kernel tree. With this line present, earlyconsole can be enabled by simply passing "earlycon" on the kernel command line. No specific device details are necessary, since the kernel will use the stdout-path as the default. Signed-off-by: Paul Walmsley <[email protected]> Reviewed-by: Atish Patra <[email protected]>
2019-10-14riscv: remove the switch statement in do_trap_break()Vincent Chen1-11/+11
To make the code more straightforward, replace the switch statement with an if statement. Suggested-by: Paul Walmsley <[email protected]> Signed-off-by: Vincent Chen <[email protected]> [[email protected]: cleaned up patch description; updated to apply] Link: https://lore.kernel.org/linux-riscv/[email protected]/ Link: https://lore.kernel.org/linux-riscv/CABvJ_xiHJSB7P5QekuLRP=LBPzXXghAfuUpPUYb=a_HbnOQ6BA@mail.gmail.com/ Link: https://lists.01.org/hyperkitty/list/[email protected]/thread/VDCU2WOB6KQISREO4V5DTXEI2M7VOV55/ Cc: Christoph Hellwig <[email protected]> Signed-off-by: Paul Walmsley <[email protected]>
2019-10-14drm/panfrost: Add missing GPU feature registersSteven Price1-0/+3
Three feature registers were declared but never actually read from the GPU. Add THREAD_MAX_THREADS, THREAD_MAX_WORKGROUP_SIZE and THREAD_MAX_BARRIER_SIZE so that the complete set are available. Fixes: 4bced8bea094 ("drm/panfrost: Export all GPU feature registers") Signed-off-by: Steven Price <[email protected]> Signed-off-by: Rob Herring <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2019-10-14xtensa: fix {get,put}_user() for 64bit valuesAl Viro1-2/+11
First of all, on short copies __copy_{to,from}_user() return the amount of bytes left uncopied, *not* -EFAULT. get_user() and put_user() are expected to return -EFAULT on failure. Another problem is get_user(v32, (__u64 __user *)p); that should fetch 64bit value and the assign it to v32, truncating it in process. Current code, OTOH, reads 8 bytes of data and stores them at the address of v32, stomping on the 4 bytes that follow v32 itself. Signed-off-by: Al Viro <[email protected]> Signed-off-by: Max Filippov <[email protected]>
2019-10-14Merge tag 'irqchip-fixes-5.4-1' of ↵Thomas Gleixner5-17/+43
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent Pull irqchip fixes from Marc Zyngier: - Add retrigger support to Amazon's al-fic driver - Add SAM9X60 support to Atmel's AIC5 irqchip - Fix GICv3 maximum interrupt calculation - Convert SiFive's PLIC to the fasteoi IRQ flow
2019-10-14kmemleak: Do not corrupt the object_list during clean-upCatalin Marinas1-9/+21
In case of an error (e.g. memory pool too small), kmemleak disables itself and cleans up the already allocated metadata objects. However, if this happens early before the RCU callback mechanism is available, put_object() skips call_rcu() and frees the object directly. This is not safe with the RCU list traversal in __kmemleak_do_cleanup(). Change the list traversal in __kmemleak_do_cleanup() to list_for_each_entry_safe() and remove the rcu_read_{lock,unlock} since the kmemleak is already disabled at this point. In addition, avoid an unnecessary metadata object rb-tree look-up since it already has the struct kmemleak_object pointer. Fixes: c5665868183f ("mm: kmemleak: use the memory pool for early allocations") Reported-by: Alexey Kardashevskiy <[email protected]> Reported-by: Marc Dionne <[email protected]> Reported-by: Ted Ts'o <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Catalin Marinas <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-14nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLLSebastian Andrzej Siewior1-0/+2
The access to sk->sk_ll_usec should be hidden behind CONFIG_NET_RX_BUSY_POLL like the definition of sk_ll_usec. Put access to ->sk_ll_usec behind CONFIG_NET_RX_BUSY_POLL. Fixes: 1a9460cef5711 ("nvme-tcp: support simple polling") Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2019-10-14nvme: Wait for reset state when requiredKeith Busch3-17/+74
Prevent simultaneous controller disabling/enabling tasks from interfering with each other through a function to wait until the task successfully transitioned the controller to the RESETTING state. This ensures disabling the controller will not be interrupted by another reset path, otherwise a concurrent reset may leave the controller in the wrong state. Tested-by: Edmund Nadolski <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2019-10-14nvme: Prevent resets during paused controller stateKeith Busch1-4/+25
A paused controller is doing critical internal activation work in the background. Prevent subsequent controller resets from occurring during this period by setting the controller state to RESETTING first. A helper function, nvme_try_sched_reset_work(), is introduced for these paths so they may continue with scheduling the reset_work after they've completed their uninterruptible critical section. Tested-by: Edmund Nadolski <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2019-10-14nvme: Restart request timers in resetting stateKeith Busch2-0/+16
A controller in the resetting state has not yet completed its recovery actions. The pci and fc transports were already handling this, so update the remaining transports to not attempt additional recovery in this state. Instead, just restart the request timer. Tested-by: Edmund Nadolski <[email protected]> Reviewed-by: James Smart <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2019-10-14nvme: Remove ADMIN_ONLY stateKeith Busch4-37/+11
The admin only state was intended to fence off actions that don't apply to a non-IO capable controller. The only actual user of this is the scan_work, and pci was the only transport to ever set this state. The consequence of having this state is placing an additional burden on every other action that applies to both live and admin only controllers. Remove the admin only state and place the admin only burden on the only place that actually cares: scan_work. This also prepares to make it easier to temporarily pause a LIVE state so that we don't need to remember which state the controller had been in prior to the pause. Tested-by: Edmund Nadolski <[email protected]> Reviewed-by: James Smart <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2019-10-14nvme-pci: Free tagset if no IO queuesKeith Busch1-2/+9
If a controller becomes degraded after a reset, we will not be able to perform any IO. We currently teardown previously created request queues and namespaces, but we had kept the unusable tagset. Free it after all queues using it have been released. Tested-by: Edmund Nadolski <[email protected]> Reviewed-by: James Smart <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2019-10-14hrtimer: Annotate lockless access to timer->baseEric Dumazet1-4/+4
Followup to commit dd2261ed45aa ("hrtimer: Protect lockless access to timer->base") lock_hrtimer_base() fetches timer->base without lock exclusion. Compiler is allowed to read timer->base twice (even if considered dumb) which could end up trying to lock migration_base and return &migration_base. base = timer->base; if (likely(base != &migration_base)) { /* compiler reads timer->base again, and now (base == &migration_base) raw_spin_lock_irqsave(&base->cpu_base->lock, *flags); if (likely(base == timer->base)) return base; /* == &migration_base ! */ Similarly the write sides must use WRITE_ONCE() to avoid store tearing. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2019-10-14platform/x86: i2c-multi-instantiate: Fail the probe if no IRQ providedAndy Shevchenko1-0/+1
For APIC case of interrupt we don't fail a ->probe() of the driver, which makes kernel to print a lot of warnings from the children. We have two options here: - switch to platform_get_irq_optional(), though it won't stop children to be probed and failed - fail the ->probe() of i2c-multi-instantiate Since the in reality we never had devices in the wild where IRQ resource is optional, the latter solution suits the best. Fixes: 799d3379a672 ("platform/x86: i2c-multi-instantiate: Introduce IOAPIC IRQ support") Reported-by: Ammy Yi <[email protected]> Cc: Heikki Krogerus <[email protected]> Cc: Hans de Goede <[email protected]> Signed-off-by: Andy Shevchenko <[email protected]> Reviewed-by: Heikki Krogerus <[email protected]>
2019-10-14drm/ttm: fix handling in ttm_bo_add_mem_to_lruChristian König1-2/+3
We should not add the BO to the swap LRU when the new mem is fixed and the TTM object about to be destroyed. Signed-off-by: Christian König <[email protected]> Reviewed-by: Kevin Wang <[email protected]> Link: https://patchwork.freedesktop.org/patch/335246/
2019-10-14drm/ttm: Restore ttm prefaultingThomas Hellstrom1-9/+7
Commit 4daa4fba3a38 ("gpu: drm: ttm: Adding new return type vm_fault_t") broke TTM prefaulting. Since vmf_insert_mixed() typically always returns VM_FAULT_NOPAGE, prefaulting stops after the second PTE. Restore (almost) the original behaviour. Unfortunately we can no longer with the new vm_fault_t return type determine whether a prefaulting PTE insertion hit an already populated PTE, and terminate the insertion loop. Instead we continue with the pre-determined number of prefaults. Fixes: 4daa4fba3a38 ("gpu: drm: ttm: Adding new return type vm_fault_t") Cc: Souptick Joarder <[email protected]> Cc: Christian König <[email protected]> Signed-off-by: Thomas Hellstrom <[email protected]> Reviewed-by: Christian König <[email protected]> Cc: [email protected] # v4.19+ Signed-off-by: Christian König <[email protected]> Link: https://patchwork.freedesktop.org/patch/330387/
2019-10-14drm/ttm: fix busy reference in ttm_mem_evict_firstChristian König1-2/+2
The busy BO might actually be already deleted, so grab only a list reference. Signed-off-by: Christian König <[email protected]> Reviewed-by: Thomas Hellström <[email protected]> Link: https://patchwork.freedesktop.org/patch/332877/
2019-10-14ath10k: fix latency issue for QCA988xMiaoqing Pan1-6/+9
(kvalo: cherry picked from commit 1340cc631bd00431e2f174525c971f119df9efa1 in wireless-drivers-next to wireless-drivers as this a frequently reported regression) Bad latency is found on QCA988x, the issue was introduced by commit 4504f0e5b571 ("ath10k: sdio: workaround firmware UART pin configuration bug"). If uart_pin_workaround is false, this change will set uart pin even if uart_print is false. Tested HW: QCA9880 Tested FW: 10.2.4-1.0-00037 Fixes: 4504f0e5b571 ("ath10k: sdio: workaround firmware UART pin configuration bug") Signed-off-by: Miaoqing Pan <[email protected]> Signed-off-by: Kalle Valo <[email protected]>
2019-10-14cpuidle: teo: Fix "early hits" handling for disabled idle statesRafael J. Wysocki1-9/+26
The TEO governor uses idle duration "bins" defined in accordance with the CPU idle states table provided by the driver, so that each "bin" covers the idle duration range between the target residency of the idle state corresponding to it and the target residency of the closest deeper idle state. The governor collects statistics for each bin regardless of whether or not the idle state corresponding to it is currently enabled. In particular, the "early hits" metric measures the likelihood of a situation in which the idle duration measured after wakeup falls into to given bin, but the time till the next timer (sleep length) falls into a bin corresponding to one of the deeper idle states. It is used when the "hits" and "misses" metrics indicate that the state "matching" the sleep length should not be selected, so that the state with the maximum "early hits" value is selected instead of it. If the idle state corresponding to the given bin is disabled, it cannot be selected and if it turns out to be the one that should be selected, a shallower idle state needs to be used instead of it. Nevertheless, the metrics collected for the bin corresponding to it are still valid and need to be taken into account as though that state had not been disabled. As far as the "early hits" metric is concerned, teo_select() tries to take disabled states into account, but the state index corresponding to the maximum "early hits" value computed by it may be incorrect. Namely, it always uses the index of the previous maximum "early hits" state then, but there may be enabled idle states closer to the disabled one in question. In particular, if the current candidate state (whose index is the idx value) is closer to the disabled one and the "early hits" value of the disabled state is greater than the current maximum, the index of the current candidate state (idx) should replace the "maximum early hits state" index. Modify the code to handle that case correctly. Fixes: b26bf6ab716f ("cpuidle: New timer events oriented governor for tickless systems") Reported-by: Doug Smythies <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]> Cc: 5.1+ <[email protected]> # 5.1+
2019-10-14cpuidle: teo: Consider hits and misses metrics of disabled statesRafael J. Wysocki1-4/+21
The TEO governor uses idle duration "bins" defined in accordance with the CPU idle states table provided by the driver, so that each "bin" covers the idle duration range between the target residency of the idle state corresponding to it and the target residency of the closest deeper idle state. The governor collects statistics for each bin regardless of whether or not the idle state corresponding to it is currently enabled. In particular, the "hits" and "misses" metrics measure the likelihood of a situation in which both the time till the next timer (sleep length) and the idle duration measured after wakeup fall into the given bin. Namely, if the "hits" value is greater than the "misses" one, that situation is more likely than the one in which the sleep length falls into the given bin, but the idle duration measured after wakeup falls into a bin corresponding to one of the shallower idle states. If the idle state corresponding to the given bin is disabled, it cannot be selected and if it turns out to be the one that should be selected, a shallower idle state needs to be used instead of it. Nevertheless, the metrics collected for the bin corresponding to it are still valid and need to be taken into account as though that state had not been disabled. For this reason, make teo_select() always use the "hits" and "misses" values of the idle duration range that the sleep length falls into even if the specific idle state corresponding to it is disabled and if the "hits" values is greater than the "misses" one, select the closest enabled shallower idle state in that case. Fixes: b26bf6ab716f ("cpuidle: New timer events oriented governor for tickless systems") Signed-off-by: Rafael J. Wysocki <[email protected]> Cc: 5.1+ <[email protected]> # 5.1+
2019-10-14cpuidle: teo: Rename local variable in teo_select()Rafael J. Wysocki1-10/+9
Rename a local variable in teo_select() in preparation for subsequent code modifications, no intentional impact. Signed-off-by: Rafael J. Wysocki <[email protected]> Cc: 5.1+ <[email protected]> # 5.1+
2019-10-14cpuidle: teo: Ignore disabled idle states that are too deepRafael J. Wysocki1-0/+7
Prevent disabled CPU idle state with target residencies beyond the anticipated idle duration from being taken into account by the TEO governor. Fixes: b26bf6ab716f ("cpuidle: New timer events oriented governor for tickless systems") Signed-off-by: Rafael J. Wysocki <[email protected]> Cc: 5.1+ <[email protected]> # 5.1+
2019-10-13Linux 5.4-rc3Linus Torvalds1-1/+1
2019-10-13Merge tag 'trace-v5.4-rc2' of ↵Linus Torvalds15-132/+223
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "A few tracing fixes: - Remove lockdown from tracefs itself and moved it to the trace directory. Have the open functions there do the lockdown checks. - Fix a few races with opening an instance file and the instance being deleted (Discovered during the lockdown updates). Kept separate from the clean up code such that they can be backported to stable easier. - Clean up and consolidated the checks done when opening a trace file, as there were multiple checks that need to be done, and it did not make sense having them done in each open instance. - Fix a regression in the record mcount code. - Small hw_lat detector tracer fixes. - A trace_pipe read fix due to not initializing trace_seq" * tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Initialize iter->seq after zeroing in tracing_read_pipe() tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency tracing/hwlat: Report total time spent in all NMIs during the sample recordmcount: Fix nop_mcount() function tracing: Do not create tracefs files if tracefs lockdown is in effect tracing: Add locked_down checks to the open calls of files created for tracefs tracing: Add tracing_check_open_get_tr() tracing: Have trace events system open call tracing_open_generic_tr() tracing: Get trace_array reference for available_tracers files ftrace: Get a reference counter for the trace_array on filter files tracefs: Revert ccbd54ff54e8 ("tracefs: Restrict tracefs when the kernel is locked down")
2019-10-13netdevsim: Fix error handling in nsim_fib_init and nsim_fib_exitYueHaibing1-1/+2
In nsim_fib_init(), if register_fib_notifier failed, nsim_fib_net_ops should be unregistered before return. In nsim_fib_exit(), unregister_fib_notifier should be called before nsim_fib_net_ops be unregistered, otherwise may cause use-after-free: BUG: KASAN: use-after-free in nsim_fib_event_nb+0x342/0x570 [netdevsim] Read of size 8 at addr ffff8881daaf4388 by task kworker/0:3/3499 CPU: 0 PID: 3499 Comm: kworker/0:3 Not tainted 5.3.0-rc7+ #30 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Workqueue: ipv6_addrconf addrconf_dad_work [ipv6] Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xa9/0x10e lib/dump_stack.c:113 print_address_description+0x65/0x380 mm/kasan/report.c:351 __kasan_report+0x149/0x18d mm/kasan/report.c:482 kasan_report+0xe/0x20 mm/kasan/common.c:618 nsim_fib_event_nb+0x342/0x570 [netdevsim] notifier_call_chain+0x52/0xf0 kernel/notifier.c:95 __atomic_notifier_call_chain+0x78/0x140 kernel/notifier.c:185 call_fib_notifiers+0x30/0x60 net/core/fib_notifier.c:30 call_fib6_entry_notifiers+0xc1/0x100 [ipv6] fib6_add+0x92e/0x1b10 [ipv6] __ip6_ins_rt+0x40/0x60 [ipv6] ip6_ins_rt+0x84/0xb0 [ipv6] __ipv6_ifa_notify+0x4b6/0x550 [ipv6] ipv6_ifa_notify+0xa5/0x180 [ipv6] addrconf_dad_completed+0xca/0x640 [ipv6] addrconf_dad_work+0x296/0x960 [ipv6] process_one_work+0x5c0/0xc00 kernel/workqueue.c:2269 worker_thread+0x5c/0x670 kernel/workqueue.c:2415 kthread+0x1d7/0x200 kernel/kthread.c:255 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352 Allocated by task 3388: save_stack+0x19/0x80 mm/kasan/common.c:69 set_track mm/kasan/common.c:77 [inline] __kasan_kmalloc.constprop.3+0xa0/0xd0 mm/kasan/common.c:493 kmalloc include/linux/slab.h:557 [inline] kzalloc include/linux/slab.h:748 [inline] ops_init+0xa9/0x220 net/core/net_namespace.c:127 __register_pernet_operations net/core/net_namespace.c:1135 [inline] register_pernet_operations+0x1d4/0x420 net/core/net_namespace.c:1212 register_pernet_subsys+0x24/0x40 net/core/net_namespace.c:1253 nsim_fib_init+0x12/0x70 [netdevsim] veth_get_link_ksettings+0x2b/0x50 [veth] do_one_initcall+0xd4/0x454 init/main.c:939 do_init_module+0xe0/0x330 kernel/module.c:3490 load_module+0x3c2f/0x4620 kernel/module.c:3841 __do_sys_finit_module+0x163/0x190 kernel/module.c:3931 do_syscall_64+0x72/0x2e0 arch/x86/entry/common.c:296 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 3534: save_stack+0x19/0x80 mm/kasan/common.c:69 set_track mm/kasan/common.c:77 [inline] __kasan_slab_free+0x130/0x180 mm/kasan/common.c:455 slab_free_hook mm/slub.c:1423 [inline] slab_free_freelist_hook mm/slub.c:1474 [inline] slab_free mm/slub.c:3016 [inline] kfree+0xe9/0x2d0 mm/slub.c:3957 ops_free net/core/net_namespace.c:151 [inline] ops_free_list.part.7+0x156/0x220 net/core/net_namespace.c:184 ops_free_list net/core/net_namespace.c:182 [inline] __unregister_pernet_operations net/core/net_namespace.c:1165 [inline] unregister_pernet_operations+0x221/0x2a0 net/core/net_namespace.c:1224 unregister_pernet_subsys+0x1d/0x30 net/core/net_namespace.c:1271 nsim_fib_exit+0x11/0x20 [netdevsim] nsim_module_exit+0x16/0x21 [netdevsim] __do_sys_delete_module kernel/module.c:1015 [inline] __se_sys_delete_module kernel/module.c:958 [inline] __x64_sys_delete_module+0x244/0x330 kernel/module.c:958 do_syscall_64+0x72/0x2e0 arch/x86/entry/common.c:296 entry_SYSCALL_64_after_hwframe+0x49/0xbe Reported-by: Hulk Robot <[email protected]> Fixes: 59c84b9fcf42 ("netdevsim: Restore per-network namespace accounting for fib entries") Signed-off-by: YueHaibing <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>