aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-09-09parisc: Add missing FORCE prerequisite in MakefileHelge Deller1-9/+9
Signed-off-by: Helge Deller <[email protected]>
2021-09-08Merge branches 'akpm' and 'akpm-hotfixes' (patches from Andrew)Linus Torvalds47-663/+267
Merge yet more updates and hotfixes from Andrew Morton: "Post-linux-next material, based upon latest upstream to catch the now-merged dependencies: - 10 patches. Subsystems affected by this patch series: mm (vmstat and migration) and compat. And bunch of hotfixes, mostly cc:stable: - 8 patches. Subsystems affected by this patch series: mm (hmm, hugetlb, vmscan, pagealloc, pagemap, kmemleak, mempolicy, and memblock)" * emailed patches from Andrew Morton <[email protected]>: arch: remove compat_alloc_user_space compat: remove some compat entry points mm: simplify compat numa syscalls mm: simplify compat_sys_move_pages kexec: avoid compat_alloc_user_space kexec: move locking into do_kexec_load mm: migrate: change to use bool type for 'page_was_mapped' mm: migrate: fix the incorrect function name in comments mm: migrate: introduce a local variable to get the number of pages mm/vmstat: protect per cpu variables with preempt disable on RT * emailed hotfixes from Andrew Morton <[email protected]>: nds32/setup: remove unused memblock_region variable in setup_memory() mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task mm/kmemleak: allow __GFP_NOLOCKDEP passed to kmemleak's gfp mmap_lock: change trace and locking order mm/page_alloc.c: avoid accessing uninitialized pcp page migratetype mm,vmscan: fix divide by zero in get_scan_count mm/hugetlb: initialize hugetlb_usage in mm_init mm/hmm: bypass devmap pte when all pfn requested flags are fulfilled
2021-09-08nds32/setup: remove unused memblock_region variable in setup_memory()Mike Rapoport1-1/+0
kernel test robot reports unused variable warning: arch/nds32/kernel/setup.c:247:26: warning: Unused variable: region [unusedVariable] struct memblock_region *region; ^ Remove the unused variable. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: Guenter Roeck <[email protected]> Tested-by: Guenter Roeck <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Nick Hu <[email protected]> Cc: Vincent Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/mempolicy: fix a race between offset_il_node and mpol_rebind_taskyanghui1-4/+13
Servers happened below panic: Kernel version:5.4.56 BUG: unable to handle page fault for address: 0000000000002c48 RIP: 0010:__next_zones_zonelist+0x1d/0x40 Call Trace: __alloc_pages_nodemask+0x277/0x310 alloc_page_interleave+0x13/0x70 handle_mm_fault+0xf99/0x1390 __do_page_fault+0x288/0x500 do_page_fault+0x30/0x110 page_fault+0x3e/0x50 The reason for the panic is that MAX_NUMNODES is passed in the third parameter in __alloc_pages_nodemask(preferred_nid). So access to zonelist->zoneref->zone_idx in __next_zones_zonelist will cause a panic. In offset_il_node(), first_node() returns nid from pol->v.nodes, after this other threads may chang pol->v.nodes before next_node(). This race condition will let next_node return MAX_NUMNODES. So put pol->nodes in a local variable. The race condition is between offset_il_node and cpuset_change_task_nodemask: CPU0: CPU1: alloc_pages_vma() interleave_nid(pol,) offset_il_node(pol,) first_node(pol->v.nodes) cpuset_change_task_nodemask //nodes==0xc mpol_rebind_task mpol_rebind_policy mpol_rebind_nodemask(pol,nodes) //nodes==0x3 next_node(nid, pol->v.nodes)//return MAX_NUMNODES Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: yanghui <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/kmemleak: allow __GFP_NOLOCKDEP passed to kmemleak's gfpNaohiro Aota1-1/+2
In a memory pressure situation, I'm seeing the lockdep WARNING below. Actually, this is similar to a known false positive which is already addressed by commit 6dcde60efd94 ("xfs: more lockdep whackamole with kmem_alloc*"). This warning still persists because it's not from kmalloc() itself but from an allocation for kmemleak object. While kmalloc() itself suppress the warning with __GFP_NOLOCKDEP, gfp_kmemleak_mask() is dropping the flag for the kmemleak's allocation. Allow __GFP_NOLOCKDEP to be passed to kmemleak's allocation, so that the warning for it is also suppressed. ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc7-BTRFS-ZNS+ #37 Not tainted ------------------------------------------------------ kswapd0/288 is trying to acquire lock: ffff88825ab45df0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0x8a/0x250 but task is already holding lock: ffffffff848cc1e0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x112/0x160 kmem_cache_alloc+0x48/0x400 create_object.isra.0+0x42/0xb10 kmemleak_alloc+0x48/0x80 __kmalloc+0x228/0x440 kmem_alloc+0xd3/0x2b0 kmem_alloc_large+0x5a/0x1c0 xfs_attr_copy_value+0x112/0x190 xfs_attr_shortform_getvalue+0x1fc/0x300 xfs_attr_get_ilocked+0x125/0x170 xfs_attr_get+0x329/0x450 xfs_get_acl+0x18d/0x430 get_acl.part.0+0xb6/0x1e0 posix_acl_xattr_get+0x13a/0x230 vfs_getxattr+0x21d/0x270 getxattr+0x126/0x310 __x64_sys_fgetxattr+0x1a6/0x2a0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #0 (&xfs_nondir_ilock_class){++++}-{3:3}: __lock_acquire+0x2c0f/0x5a00 lock_acquire+0x1a1/0x4b0 down_read_nested+0x50/0x90 xfs_ilock+0x8a/0x250 xfs_can_free_eofblocks+0x34f/0x570 xfs_inactive+0x411/0x520 xfs_fs_destroy_inode+0x2c8/0x710 destroy_inode+0xc5/0x1a0 evict+0x444/0x620 dispose_list+0xfe/0x1c0 prune_icache_sb+0xdc/0x160 super_cache_scan+0x31e/0x510 do_shrink_slab+0x337/0x8e0 shrink_slab+0x362/0x5c0 shrink_node+0x7a7/0x1a40 balance_pgdat+0x64e/0xfe0 kswapd+0x590/0xa80 kthread+0x38c/0x460 ret_from_fork+0x22/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&xfs_nondir_ilock_class); lock(fs_reclaim); lock(&xfs_nondir_ilock_class); *** DEADLOCK *** 3 locks held by kswapd0/288: #0: ffffffff848cc1e0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff848a08d8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x269/0x5c0 #2: ffff8881a7a820e8 (&type->s_umount_key#60){++++}-{3:3}, at: super_cache_scan+0x5a/0x510 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Naohiro Aota <[email protected]> Acked-by: Catalin Marinas <[email protected]> Cc: "Darrick J . Wong" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mmap_lock: change trace and locking orderLiam Howlett1-4/+4
Print to the trace log before releasing the lock to avoid racing with other trace log printers of the same lock type. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Suggested-by: Steven Rostedt (VMware) <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/page_alloc.c: avoid accessing uninitialized pcp page migratetypeMiaohe Lin1-1/+3
If it's not prepared to free unref page, the pcp page migratetype is unset. Thus we will get rubbish from get_pcppage_migratetype() and might list_del(&page->lru) again after it's already deleted from the list leading to grumble about data corruption. Link: https://lkml.kernel.org/r/[email protected] Fixes: df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock") Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm,vmscan: fix divide by zero in get_scan_countRik van Riel1-1/+1
Commit f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim") introduced a divide by zero corner case when oomd is being used in combination with cgroup memory.low protection. When oomd decides to kill a cgroup, it will force the cgroup memory to be reclaimed after killing the tasks, by writing to the memory.max file for that cgroup, forcing the remaining page cache and reclaimable slab to be reclaimed down to zero. Previously, on cgroups with some memory.low protection that would result in the memory being reclaimed down to the memory.low limit, or likely not at all, having the page cache reclaimed asynchronously later. With f56ce412a59d the oomd write to memory.max tries to reclaim all the way down to zero, which may race with another reclaimer, to the point of ending up with the divide by zero below. This patch implements the obvious fix. Link: https://lkml.kernel.org/r/[email protected] Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim") Signed-off-by: Rik van Riel <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Chris Down <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/hugetlb: initialize hugetlb_usage in mm_initLiu Zixian2-0/+10
After fork, the child process will get incorrect (2x) hugetlb_usage. If a process uses 5 2MB hugetlb pages in an anonymous mapping, HugetlbPages: 10240 kB and then forks, the child will show, HugetlbPages: 20480 kB The reason for double the amount is because hugetlb_usage will be copied from the parent and then increased when we copy page tables from parent to child. Child will have 2x actual usage. Fix this by adding hugetlb_count_init in mm_init. Link: https://lkml.kernel.org/r/[email protected] Fixes: 5d317b2b6536 ("mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status") Signed-off-by: Liu Zixian <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/hmm: bypass devmap pte when all pfn requested flags are fulfilledLi Zhijian1-1/+4
Previously, we noticed the one rpma example was failed[1] since commit 36f30e486dce ("IB/core: Improve ODP to use hmm_range_fault()"), where it will use ODP feature to do RDMA WRITE between fsdax files. After digging into the code, we found hmm_vma_handle_pte() will still return EFAULT even though all the its requesting flags has been fulfilled. That's because a DAX page will be marked as (_PAGE_SPECIAL | PAGE_DEVMAP) by pte_mkdevmap(). Link: https://github.com/pmem/rpma/issues/1142 [1] Link: https://lkml.kernel.org/r/[email protected] Fixes: 405506274922 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling") Signed-off-by: Li Zhijian <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08Merge tag 'tag-chrome-platform-for-v5.15' of ↵Linus Torvalds5-23/+123
git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux Pull chrome platform updates from Benson Leung: "cros_ec_typec: - make the cros_ec_typec driver to use the pre-existing cros_ec_check_features() function sensorhub: - add trace events for sample misc: - cros_ec_proto - re-send commands in the event of a timeout (for the FPMCU) - fix warnings in cros_ec_trace related to format output" * tag 'tag-chrome-platform-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux: platform/chrome: cros_ec_trace: Fix format warnings platform/chrome: cros_ec_typec: Use existing feature check platform/chrome: cros_ec_proto: Send command again when timeout occurs platform/chrome: sensorhub: Add trace events for sample
2021-09-08Merge tag 'pm-5.15-rc1-2' of ↵Linus Torvalds39-767/+1441
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These are mostly ARM cpufreq driver updates, including one new MediaTek driver that has just passed all of the reviews, with the addition of a revert of a recent intel_pstate commit, some core cpufreq changes and a DT-related update of the operating performance points (OPP) support code. Specifics: - Add new cpufreq driver for the MediaTek MT6779 platform called mediatek-hw along with corresponding DT bindings (Hector.Yuan). - Add DCVS interrupt support to the qcom-cpufreq-hw driver (Thara Gopinath). - Make the qcom-cpufreq-hw driver set the dvfs_possible_from_any_cpu policy flag (Taniya Das). - Blocklist more Qualcomm platforms in cpufreq-dt-platdev (Bjorn Andersson). - Make the vexpress cpufreq driver set the CPUFREQ_IS_COOLING_DEV flag (Viresh Kumar). - Add new cpufreq driver callback to allow drivers to register with the Energy Model in a consistent way and make several drivers use it (Viresh Kumar). - Change the remaining users of the .ready() cpufreq driver callback to move the code from it elsewhere and drop it from the cpufreq core (Viresh Kumar). - Revert recent intel_pstate change adding HWP guaranteed performance change notification support to it that led to problems, because the notification in question is triggered prematurely on some systems (Rafael Wysocki). - Convert the OPP DT bindings to DT schema and clean them up while at it (Rob Herring)" * tag 'pm-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits) Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification" cpufreq: mediatek-hw: Add support for CPUFREQ HW cpufreq: Add of_perf_domain_get_sharing_cpumask dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW cpufreq: Remove ready() callback cpufreq: sh: Remove sh_cpufreq_cpu_ready() cpufreq: acpi: Remove acpi_cpufreq_cpu_ready() cpufreq: qcom-hw: Set dvfs_possible_from_any_cpu cpufreq driver flag cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support cpufreq: scmi: Use .register_em() to register with energy model cpufreq: vexpress: Use .register_em() to register with energy model cpufreq: scpi: Use .register_em() to register with energy model dt-bindings: opp: Convert to DT schema dt-bindings: Clean-up OPP binding node names in examples ARM: dts: omap: Drop references to opp.txt cpufreq: qcom-cpufreq-hw: Use .register_em() to register with energy model cpufreq: omap: Use .register_em() to register with energy model cpufreq: mediatek: Use .register_em() to register with energy model cpufreq: imx6q: Use .register_em() to register with energy model ...
2021-09-08Merge tag 'acpi-5.15-rc1-2' of ↵Linus Torvalds6-52/+197
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more ACPI updates from Rafael Wysocki: "These add ACPI support to the PCI VMD driver, improve suspend-to-idle support for AMD platforms and update documentation. Specifics: - Add ACPI support to the PCI VMD driver (Rafael Wysocki) - Rearrange suspend-to-idle support code to reflect the platform firmware expectations on some AMD platforms (Mario Limonciello) - Make SSDT overlays documentation follow the code documented by it more closely (Andy Shevchenko)" * tag 'acpi-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: PM: s2idle: Run both AMD and Microsoft methods if both are supported Documentation: ACPI: Align the SSDT overlays file with the code PCI: VMD: ACPI: Make ACPI companion lookup work for VMD bus
2021-09-08Merge tag 'docs-5.15-2' of git://git.lwn.net/linuxLinus Torvalds73-129/+2686
Pull more documentation updates from Jonathan Corbet: "Another collection of documentation patches, mostly fixes but also includes another set of traditional Chinese translations" * tag 'docs-5.15-2' of git://git.lwn.net/linux: docs: pdfdocs: Fix typo in CJK-language specific font settings docs: kernel-hacking: Remove inappropriate text docs/zh_TW: add translations for zh_TW/filesystems docs/zh_TW: add translations for zh_TW/cpu-freq docs/zh_TW: add translations for zh_TW/arm64 docs/zh_CN: Modify the translator tag and fix the wrong word Documentation/features/vm: correct huge-vmap APIs Documentation: block: blk-mq: Fix small typo in multi-queue docs Documentation: in_irq() cleanup Documentation: arm: marvell: Add 88F6825 model into list Documentation/process/maintainer-pgp-guide: Replace broken link to PGP path finder Documentation: locking: fix references Documentation: Update details of The Linux Kernel Module Programming Guide docs: x86: Remove obsolete information about x86_64 vmalloc() faulting Documentation/process/applying-patches: Activate linux-next man hyperlink
2021-09-08Merge tag 'modules-for-v5.15' of ↵Linus Torvalds2-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull module updates from Jessica Yu: "The only main change I have for this round of updates is the modules MAINTAINERS update. As I find myself with less time to devote to upstream these days, Luis has kindly agreed to help maintain the module loader, to eventually transition to being the primary maintainer. Since Luis is already very involved upstream with experience maintaining various areas of the kernel including the kmod usermode helper, I think he is a great fit for this area of the kernel. Summary: - Add Luis Chamberlain as modules maintainer - Fix for .ctors sections in module linker script" * tag 'modules-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: MAINTAINERS: Add Luis Chamberlain as modules maintainer module: combine constructors in module linker script
2021-09-08Merge tag 'microblaze-v5.15' of git://git.monstr.eu/linux-2.6-microblazeLinus Torvalds2-5/+4
Pull microblaze update from Michal Simek: - Kbuild clean up * tag 'microblaze-v5.15' of git://git.monstr.eu/linux-2.6-microblaze: microblaze: move core-y in arch/microblaze/Makefile to arch/microblaze/Kbuild
2021-09-08Merge tag 'nfsd-5.15-1' of ↵Linus Torvalds3-7/+10
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Restore performance on memory-starved servers * tag 'nfsd-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: SUNRPC: improve error response to over-size gss credential SUNRPC: don't pause on incomplete allocation
2021-09-08Merge tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds14-207/+438
Pull ceph updates from Ilya Dryomov: - a set of patches to address fsync stalls caused by depending on periodic rather than triggered MDS journal flushes in some cases (Xiubo Li) - a fix for mtime effectively not getting updated in case of competing writers (Jeff Layton) - a couple of fixes for inode reference leaks and various WARNs after "umount -f" (Xiubo Li) - a new ceph.auth_mds extended attribute (Jeff Layton) - a smattering of fixups and cleanups from Jeff, Xiubo and Colin. * tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-client: ceph: fix dereference of null pointer cf ceph: drop the mdsc_get_session/put_session dout messages ceph: lockdep annotations for try_nonblocking_invalidate ceph: don't WARN if we're forcibly removing the session caps ceph: don't WARN if we're force umounting ceph: remove the capsnaps when removing caps ceph: request Fw caps before updating the mtime in ceph_write_iter ceph: reconnect to the export targets on new mdsmaps ceph: print more information when we can't find snaprealm ceph: add ceph_change_snap_realm() helper ceph: remove redundant initializations from mdsc and session ceph: cancel delayed work instead of flushing on mdsc teardown ceph: add a new vxattr to return auth mds for an inode ceph: remove some defunct forward declarations ceph: flush the mdlog before waiting on unsafe reqs ceph: flush mdlog before umounting ceph: make iterate_sessions a global symbol ceph: make ceph_create_session_msg a global symbol ceph: fix comment about short copies in ceph_write_end ceph: fix memory leak on decode error in ceph_handle_caps
2021-09-08Merge tag '9p-for-5.15-rc1' of git://github.com/martinetd/linuxLinus Torvalds4-6/+10
Pull 9p updates from Dominique Martinet: "A couple of harmless fixes, increase max tcp msize (64KB -> 1MB), and increase default msize (8KB -> 128KB) The default increase has been discussed with Christian for the qemu side of things but makes sense for all supported transports" * tag '9p-for-5.15-rc1' of git://github.com/martinetd/linux: net/9p: increase default msize to 128k net/9p: use macro to define default msize net/9p: increase tcp max msize to 1MB 9p/xen: Fix end of loop tests for list_for_each_entry 9p/trans_virtio: Remove sysfs file on probe failure
2021-09-08arch: remove compat_alloc_user_spaceArnd Bergmann24-333/+12
All users of compat_alloc_user_space() and copy_in_user() have been removed from the kernel, only a few functions in sparc remain that can be changed to calling arch_copy_in_user() instead. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Eric Biederman <[email protected]> Cc: Feng Tang <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08compat: remove some compat entry pointsArnd Bergmann14-117/+42
These are all handled correctly when calling the native system call entry point, so remove the special cases. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Eric Biederman <[email protected]> Cc: Feng Tang <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: simplify compat numa syscallsArnd Bergmann2-129/+64
The compat implementations for mbind, get_mempolicy, set_mempolicy and migrate_pages are just there to handle the subtly different layout of bitmaps on 32-bit hosts. The compat implementation however lacks some of the checks that are present in the native one, in particular for checking that the extra bits are all zero when user space has a larger mask size than the kernel. Worse, those extra bits do not get cleared when copying in or out of the kernel, which can lead to incorrect data as well. Unify the implementation to handle the compat bitmap layout directly in the get_nodes() and copy_nodes_to_user() helpers. Splitting out the get_bitmap() helper from get_nodes() also helps readability of the native case. On x86, two additional problems are addressed by this: compat tasks can pass a bitmap at the end of a mapping, causing a fault when reading across the page boundary for a 64-bit word. x32 tasks might also run into problems with get_mempolicy corrupting data when an odd number of 32-bit words gets passed. On parisc the migrate_pages() system call apparently had the wrong calling convention, as big-endian architectures expect the words inside of a bitmap to be swapped. This is not a problem though since parisc has no NUMA support. [[email protected]: fix mempolicy crash] Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/lkml/YQPLG20V3dmOfq3a@osiris/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Eric Biederman <[email protected]> Cc: Feng Tang <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: simplify compat_sys_move_pagesArnd Bergmann1-15/+30
The compat move_pages() implementation uses compat_alloc_user_space() for converting the pointer array. Moving the compat handling into the function itself is a bit simpler and lets us avoid the compat_alloc_user_space() call. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Eric Biederman <[email protected]> Cc: Feng Tang <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08kexec: avoid compat_alloc_user_spaceArnd Bergmann1-36/+25
kimage_alloc_init() expects a __user pointer, so compat_sys_kexec_load() uses compat_alloc_user_space() to convert the layout and put it back onto the user space caller stack. Moving the user space access into the syscall handler directly actually makes the code simpler, as the conversion for compat mode can now be done on kernel memory. Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/lkml/YPbtsU4GX6PL7%[email protected]/ Link: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Arnd Bergmann <[email protected]> Co-developed-by: Eric Biederman <[email protected]> Co-developed-by: Christoph Hellwig <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Cc: Al Viro <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Feng Tang <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08kexec: move locking into do_kexec_loadArnd Bergmann1-28/+16
Patch series "compat: remove compat_alloc_user_space", v5. Going through compat_alloc_user_space() to convert indirect system call arguments tends to add complexity compared to handling the native and compat logic in the same code. This patch (of 6): The locking is the same between the native and compat version of sys_kexec_load(), so it can be done in the common implementation to reduce duplication. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Co-developed-by: Eric Biederman <[email protected]> Co-developed-by: Christoph Hellwig <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Helge Deller <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Al Viro <[email protected]> Cc: Feng Tang <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: migrate: change to use bool type for 'page_was_mapped'Baolin Wang1-2/+2
Change to use bool type for 'page_was_mapped' variable making it more readable. Link: https://lkml.kernel.org/r/ce1279df18d2c163998c403e0b5ec6d3f6f90f7a.1629447552.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: migrate: fix the incorrect function name in commentsBaolin Wang1-1/+1
since commit a98a2f0c8ce1 ("mm/rmap: split migration into its own function"), the migration ptes establishment has been split into a separate try_to_migrate() function, thus update the related comments. Link: https://lkml.kernel.org/r/5b824bad6183259c916ae6cf42f81d14c6118b06.1629447552.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Reviewed-by: Yang Shi <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: migrate: introduce a local variable to get the number of pagesBaolin Wang1-2/+3
Use thp_nr_pages() instead of compound_nr() to get the number of pages for THP page, meanwhile introducing a local variable 'nr_pages' to avoid getting the number of pages repeatedly. Link: https://lkml.kernel.org/r/a8e331ac04392ee230c79186330fb05e86a2aa77.1629447552.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Reviewed-by: Yang Shi <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/vmstat: protect per cpu variables with preempt disable on RTIngo Molnar1-0/+48
Disable preemption on -RT for the vmstat code. On vanila the code runs in IRQ-off regions while on -RT it may not when stats are updated under a local_lock. "preempt_disable" ensures that the same resources is not updated in parallel due to preemption. This patch differs from the preempt-rt version where __count_vm_event and __count_vm_events are also protected. The counters are explicitly "allowed to be to be racy" so there is no need to protect them from preemption. Only the accurate page stats that are updated by a read-modify-write need protection. This patch also differs in that a preempt_[en|dis]able_rt helper is not used. As vmstat is the only user of the helper, it was suggested that it be open-coded in vmstat.c instead of risking the helper being used in unnecessary contexts. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08Merge branch 'akpm' (patches from Andrew)Linus Torvalds149-930/+5341
Merge more updates from Andrew Morton: "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f. Subsystems affected by this patch series: mm (memory-hotplug, rmap, ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan), alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib, checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig, selftests, ipc, and scripts" * emailed patches from Andrew Morton <[email protected]>: (94 commits) scripts: check_extable: fix typo in user error message mm/workingset: correct kernel-doc notations ipc: replace costly bailout check in sysvipc_find_ipc() selftests/memfd: remove unused variable Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH configs: remove the obsolete CONFIG_INPUT_POLLDEV prctl: allow to setup brk for et_dyn executables pid: cleanup the stale comment mentioning pidmap_init(). kernel/fork.c: unexport get_{mm,task}_exe_file coredump: fix memleak in dump_vma_snapshot() fs/coredump.c: log if a core dump is aborted due to changed file permissions nilfs2: use refcount_dec_and_lock() to fix potential UAF nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group trap: cleanup trap_init() init: move usermodehelper_enable() to populate_rootfs() ...
2021-09-08Merge tag 'mm-slub-5.15-rc1' of ↵Linus Torvalds4-251/+563
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux Pull SLUB updates from Vlastimil Babka: "SLUB: reduce irq disabled scope and make it RT compatible This series was initially inspired by Mel's pcplist local_lock rewrite, and also interest to better understand SLUB's locking and the new primitives and RT variants and implications. It makes SLUB compatible with PREEMPT_RT and generally more preemption-friendly, apparently without significant regressions, as the fast paths are not affected. The main changes to SLUB by this series: - irq disabling is now only done for minimum amount of time needed to protect the strict kmem_cache_cpu fields, and as part of spin lock, local lock and bit lock operations to make them irq-safe - SLUB is fully PREEMPT_RT compatible The series should now be sufficiently tested in both RT and !RT configs, mainly thanks to Mike. The RFC/v1 version also got basic performance screening by Mel that didn't show major regressions. Mike's testing with hackbench of v2 on !RT reported negligible differences [6]: virgin(ish) tip 5.13.0.g60ab3ed-tip 7,320.67 msec task-clock # 7.792 CPUs utilized ( +- 0.31% ) 221,215 context-switches # 0.030 M/sec ( +- 3.97% ) 16,234 cpu-migrations # 0.002 M/sec ( +- 4.07% ) 13,233 page-faults # 0.002 M/sec ( +- 0.91% ) 27,592,205,252 cycles # 3.769 GHz ( +- 0.32% ) 8,309,495,040 instructions # 0.30 insn per cycle ( +- 0.37% ) 1,555,210,607 branches # 212.441 M/sec ( +- 0.42% ) 5,484,209 branch-misses # 0.35% of all branches ( +- 2.13% ) 0.93949 +- 0.00423 seconds time elapsed ( +- 0.45% ) 0.94608 +- 0.00384 seconds time elapsed ( +- 0.41% ) (repeat) 0.94422 +- 0.00410 seconds time elapsed ( +- 0.43% ) 5.13.0.g60ab3ed-tip +slub-local-lock-v2r3 7,343.57 msec task-clock # 7.776 CPUs utilized ( +- 0.44% ) 223,044 context-switches # 0.030 M/sec ( +- 3.02% ) 16,057 cpu-migrations # 0.002 M/sec ( +- 4.03% ) 13,164 page-faults # 0.002 M/sec ( +- 0.97% ) 27,684,906,017 cycles # 3.770 GHz ( +- 0.45% ) 8,323,273,871 instructions # 0.30 insn per cycle ( +- 0.28% ) 1,556,106,680 branches # 211.901 M/sec ( +- 0.31% ) 5,463,468 branch-misses # 0.35% of all branches ( +- 1.33% ) 0.94440 +- 0.00352 seconds time elapsed ( +- 0.37% ) 0.94830 +- 0.00228 seconds time elapsed ( +- 0.24% ) (repeat) 0.93813 +- 0.00440 seconds time elapsed ( +- 0.47% ) (repeat) RT configs showed some throughput regressions, but that's expected tradeoff for the preemption improvements through the RT mutex. It didn't prevent the v2 to be incorporated to the 5.13 RT tree [7], leading to testing exposure and bugfixes. Before the series, SLUB is lockless in both allocation and free fast paths, but elsewhere, it's disabling irqs for considerable periods of time - especially in allocation slowpath and the bulk allocation, where IRQs are re-enabled only when a new page from the page allocator is needed, and the context allows blocking. The irq disabled sections can then include deactivate_slab() which walks a full freelist and frees the slab back to page allocator or unfreeze_partials() going through a list of percpu partial slabs. The RT tree currently has some patches mitigating these, but we can do much better in mainline too. Patches 1-6 are straightforward improvements or cleanups that could exist outside of this series too, but are prerequsities. Patches 7-9 are also preparatory code changes without functional changes, but not so useful without the rest of the series. Patch 10 simplifies the fast paths on systems with preemption, based on (hopefully correct) observation that the current loops to verify tid are unnecessary. Patches 11-20 focus on reducing irq disabled scope in the allocation slowpath: - patch 11 moves disabling of irqs into ___slab_alloc() from its callers, which are the allocation slowpath, and bulk allocation. Instead these callers only disable preemption to stabilize the cpu. - The following patches then gradually reduce the scope of disabled irqs in ___slab_alloc() and the functions called from there. As of patch 14, the re-enabling of irqs based on gfp flags before calling the page allocator is removed from allocate_slab(). As of patch 17, it's possible to reach the page allocator (in case of existing slabs depleted) without disabling and re-enabling irqs a single time. Pathces 21-26 reduce the scope of disabled irqs in functions related to unfreezing percpu partial slab. Patch 27 is preparatory. Patch 28 is adopted from the RT tree and converts the flushing of percpu slabs on all cpus from using IPI to workqueue, so that the processing isn't happening with irqs disabled in the IPI handler. The flushing is not performance critical so it should be acceptable. Patch 29 also comes from RT tree and makes object_map_lock RT compatible. Patch 30 make slab_lock irq-safe on RT where we cannot rely on having irq disabled from the list_lock spin lock usage. Patch 31 changes kmem_cache_cpu->partial handling in put_cpu_partial() from cmpxchg loop to a short irq disabled section, which is used by all other code modifying the field. This addresses a theoretical race scenario pointed out by Jann, and makes the critical section safe wrt with RT local_lock semantics after the conversion in patch 35. Patch 32 changes preempt disable to migrate disable, so that the nested list_lock spinlock is safe to take on RT. Because migrate_disable() is a function call even on !RT, a small set of private wrappers is introduced to keep using the cheaper preempt_disable() on !PREEMPT_RT configurations. As of this patch, SLUB should be already compatible with RT's lock semantics. Finally, patch 33 changes irq disabled sections that protect kmem_cache_cpu fields in the slow paths, with a local lock. However on PREEMPT_RT it means the lockless fast paths can now preempt slow paths which don't expect that, so the local lock has to be taken also in the fast paths and they are no longer lockless. RT folks seem to not mind this tradeoff. The patch also updates the locking documentation in the file's comment" Mike Galbraith and Mel Gorman verified that their earlier testing observations still hold for the final series: Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lore.kernel.org/lkml/[email protected]/ * tag 'mm-slub-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux: (33 commits) mm, slub: convert kmem_cpu_slab protection to local_lock mm, slub: use migrate_disable() on PREEMPT_RT mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg mm, slub: make slab_lock() disable irqs with PREEMPT_RT mm: slub: make object_map_lock a raw_spinlock_t mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context mm, slab: split out the cpu offline variant of flush_slab() mm, slub: don't disable irqs in slub_cpu_dead() mm, slub: only disable irq with spin_lock in __unfreeze_partials() mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing mm, slub: detach whole partial list at once in unfreeze_partials() mm, slub: discard slabs in unfreeze_partials() without irqs disabled mm, slub: move irq control into unfreeze_partials() mm, slub: call deactivate_slab() without disabling irqs mm, slub: make locking in deactivate_slab() irq-safe mm, slub: move reset of c->page and freelist out of deactivate_slab() mm, slub: stop disabling irqs around get_partial() mm, slub: check new pages with restored irqs mm, slub: validate slab from partial list or page allocator before making it cpu slab mm, slub: restore irqs around calling new_slab() ...
2021-09-08scripts: check_extable: fix typo in user error messageRandy Dunlap1-1/+1
Fix typo ("and" should be "an") in an error message. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Randy Dunlap <[email protected]> Cc: Quentin Casasnovas <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/workingset: correct kernel-doc notationsRandy Dunlap1-1/+1
Use the documented kernel-doc format to prevent kernel-doc warnings. mm/workingset.c:256: warning: No description found for return value of 'workingset_eviction' mm/workingset.c:285: warning: Function parameter or member 'folio' not described in 'workingset_refault' mm/workingset.c:285: warning: Excess function parameter 'page' description in 'workingset_refault' Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Randy Dunlap <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08ipc: replace costly bailout check in sysvipc_find_ipc()Rafael Aquini1-12/+4
sysvipc_find_ipc() was left with a costly way to check if the offset position fed to it is bigger than the total number of IPC IDs in use. So much so that the time it takes to iterate over /proc/sysvipc/* files grows exponentially for a custom benchmark that creates "N" SYSV shm segments and then times the read of /proc/sysvipc/shm (milliseconds): 12 msecs to read 1024 segs from /proc/sysvipc/shm 18 msecs to read 2048 segs from /proc/sysvipc/shm 65 msecs to read 4096 segs from /proc/sysvipc/shm 325 msecs to read 8192 segs from /proc/sysvipc/shm 1303 msecs to read 16384 segs from /proc/sysvipc/shm 5182 msecs to read 32768 segs from /proc/sysvipc/shm The root problem lies with the loop that computes the total amount of ids in use to check if the "pos" feeded to sysvipc_find_ipc() grew bigger than "ids->in_use". That is a quite inneficient way to get to the maximum index in the id lookup table, specially when that value is already provided by struct ipc_ids.max_idx. This patch follows up on the optimization introduced via commit 15df03c879836 ("sysvipc: make get_maxid O(1) again") and gets rid of the aforementioned costly loop replacing it by a simpler checkpoint based on ipc_get_maxidx() returned value, which allows for a smooth linear increase in time complexity for the same custom benchmark: 2 msecs to read 1024 segs from /proc/sysvipc/shm 2 msecs to read 2048 segs from /proc/sysvipc/shm 4 msecs to read 4096 segs from /proc/sysvipc/shm 9 msecs to read 8192 segs from /proc/sysvipc/shm 19 msecs to read 16384 segs from /proc/sysvipc/shm 39 msecs to read 32768 segs from /proc/sysvipc/shm Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rafael Aquini <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Acked-by: Manfred Spraul <[email protected]> Cc: Waiman Long <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08selftests/memfd: remove unused variableGreg Thelen1-1/+1
Commit 544029862cbb ("selftests/memfd: add tests for F_SEAL_FUTURE_WRITE seal") added an unused variable to mfd_assert_reopen_fd(). Delete the unused variable. Link: https://lkml.kernel.org/r/[email protected] Fixes: 544029862cbb ("selftests/memfd: add tests for F_SEAL_FUTURE_WRITE seal") Signed-off-by: Greg Thelen <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Joel Fernandes (Google)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCHLukas Bulwahn1-1/+0
Commit 05a4a9527931 ("kernel/watchdog: split up config options") adds a new config HARDLOCKUP_DETECTOR, which selects the non-existing config HARDLOCKUP_DETECTOR_ARCH. Hence, ./scripts/checkkconfigsymbols.py warns: HARDLOCKUP_DETECTOR_ARCH Referencing files: lib/Kconfig.debug Simply drop selecting the non-existing HARDLOCKUP_DETECTOR_ARCH. Link: https://lkml.kernel.org/r/[email protected] Fixes: 05a4a9527931 ("kernel/watchdog: split up config options") Signed-off-by: Lukas Bulwahn <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Babu Moger <[email protected]> Cc: Don Zickus <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08configs: remove the obsolete CONFIG_INPUT_POLLDEVZenghui Yu9-9/+0
This CONFIG option was removed in commit 278b13ce3a89 ("Input: remove input_polled_dev implementation") so there's no point to keep it in defconfigs any longer. Get rid of the leftover for all arches. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zenghui Yu <[email protected]> Cc: Dmitry Torokhov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08prctl: allow to setup brk for et_dyn executablesCyrill Gorcunov1-7/+0
Keno Fischer reported that when a binray loaded via ld-linux-x the prctl(PR_SET_MM_MAP) doesn't allow to setup brk value because it lays before mm:end_data. For example a test program shows | # ~/t | | start_code 401000 | end_code 401a15 | start_stack 7ffce4577dd0 | start_data 403e10 | end_data 40408c | start_brk b5b000 | sbrk(0) b5b000 and when executed via ld-linux | # /lib64/ld-linux-x86-64.so.2 ~/t | | start_code 7fc25b0a4000 | end_code 7fc25b0c4524 | start_stack 7fffcc6b2400 | start_data 7fc25b0ce4c0 | end_data 7fc25b0cff98 | start_brk 55555710c000 | sbrk(0) 55555710c000 This of course prevent criu from restoring such programs. Looking into how kernel operates with brk/start_brk inside brk() syscall I don't see any problem if we allow to setup brk/start_brk without checking for end_data. Even if someone pass some weird address here on a purpose then the worst possible result will be an unexpected unmapping of existing vma (own vma, since prctl works with the callers memory) but test for RLIMIT_DATA is still valid and a user won't be able to gain more memory in case of expanding VMAs via new values shipped with prctl call. Link: https://lkml.kernel.org/r/20210121221207.GB2174@grain Fixes: bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct loader exec") Signed-off-by: Cyrill Gorcunov <[email protected]> Reported-by: Keno Fischer <[email protected]> Acked-by: Andrey Vagin <[email protected]> Tested-by: Andrey Vagin <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Pavel Tikhomirov <[email protected]> Cc: Alexander Mikhalitsyn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08pid: cleanup the stale comment mentioning pidmap_init().Takahiro Itazuri1-1/+1
pidmap_init() has already been replaced with pid_idr_init() in the commit 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API"). Cleanup the stale comment which still mentions it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Takahiro Itazuri <[email protected]> Cc: Kuniyuki Iwashima <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08kernel/fork.c: unexport get_{mm,task}_exe_fileChristoph Hellwig1-2/+0
Only used by core code and the tomoyo which can't be a module either. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08coredump: fix memleak in dump_vma_snapshot()QiuXi1-1/+3
dump_vma_snapshot() allocs memory for *vma_meta, when dump_vma_snapshot() returns -EFAULT, the memory will be leaked, so we free it correctly. Link: https://lkml.kernel.org/r/[email protected] Fixes: a07279c9a8cd7 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot") Signed-off-by: QiuXi <[email protected]> Cc: Al Viro <[email protected]> Cc: Jann Horn <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08fs/coredump.c: log if a core dump is aborted due to changed file permissionsDavid Oberhollenzer1-2/+9
For obvious security reasons, a core dump is aborted if the filesystem cannot preserve ownership or permissions of the dump file. This affects filesystems like e.g. vfat, but also something like a 9pfs share in a Qemu test setup, running as a regular user, depending on the security model used. In those cases, the result is an empty core file and a confused user. To hopefully save other people a lot of time figuring out the cause, this patch adds a simple log message for those specific cases. [[email protected]: s/|%s/%s/ in printk text] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Oberhollenzer <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08nilfs2: use refcount_dec_and_lock() to fix potential UAFZhen Lei1-5/+4
When the refcount is decreased to 0, the resource reclamation branch is entered. Before CPU0 reaches the race point (1), CPU1 may obtain the spinlock and traverse the rbtree to find 'root', see nilfs_lookup_root(). Although CPU1 will call refcount_inc() to increase the refcount, it is obviously too late. CPU0 will release 'root' directly, CPU1 then accesses 'root' and triggers UAF. Use refcount_dec_and_lock() to ensure that both the operations of decrease refcount to 0 and link deletion are lock protected eliminates this risk. CPU0 CPU1 nilfs_put_root(): <-------- (1) spin_lock(&nilfs->ns_cptree_lock); rb_erase(&root->rb_node, &nilfs->ns_cptree); spin_unlock(&nilfs->ns_cptree_lock); kfree(root); <-------- use-after-free refcount_t: underflow; use-after-free. WARNING: CPU: 2 PID: 9476 at lib/refcount.c:28 \ refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28 Modules linked in: CPU: 2 PID: 9476 Comm: syz-executor.0 Not tainted 5.10.45-rc1+ #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ... RIP: 0010:refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28 ... ... Call Trace: __refcount_sub_and_test include/linux/refcount.h:283 [inline] __refcount_dec_and_test include/linux/refcount.h:315 [inline] refcount_dec_and_test include/linux/refcount.h:333 [inline] nilfs_put_root+0xc1/0xd0 fs/nilfs2/the_nilfs.c:795 nilfs_segctor_destroy fs/nilfs2/segment.c:2749 [inline] nilfs_detach_log_writer+0x3fa/0x570 fs/nilfs2/segment.c:2812 nilfs_put_super+0x2f/0xf0 fs/nilfs2/super.c:467 generic_shutdown_super+0xcd/0x1f0 fs/super.c:464 kill_block_super+0x4a/0x90 fs/super.c:1446 deactivate_locked_super+0x6a/0xb0 fs/super.c:335 deactivate_super+0x85/0x90 fs/super.c:366 cleanup_mnt+0x277/0x2e0 fs/namespace.c:1118 __cleanup_mnt+0x15/0x20 fs/namespace.c:1125 task_work_run+0x8e/0x110 kernel/task_work.c:151 tracehook_notify_resume include/linux/tracehook.h:188 [inline] exit_to_user_mode_loop kernel/entry/common.c:164 [inline] exit_to_user_mode_prepare+0x13c/0x170 kernel/entry/common.c:191 syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266 do_syscall_64+0x45/0x80 arch/x86/entry/common.c:56 entry_SYSCALL_64_after_hwframe+0x44/0xa9 There is no reproduction program, and the above is only theoretical analysis. Link: https://lkml.kernel.org/r/[email protected] Fixes: ba65ae4729bf ("nilfs2: add checkpoint tree to nilfs object") Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhen Lei <[email protected]> Signed-off-by: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_groupNanyong Sun1-1/+1
kobject_put() should be used to cleanup the memory associated with the kobject instead of kobject_del(). See the section "Kobject removal" of "Documentation/core-api/kobject.rst". Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nanyong Sun <[email protected]> Signed-off-by: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_groupNanyong Sun1-2/+2
If kobject_init_and_add returns with error, kobject_put() is needed here to avoid memory leak, because kobject_init_and_add may return error without freeing the memory associated with the kobject it allocated. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nanyong Sun <[email protected]> Signed-off-by: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_groupNanyong Sun1-1/+1
The kobject_put() should be used to cleanup the memory associated with the kobject instead of kobject_del. See the section "Kobject removal" of "Documentation/core-api/kobject.rst". Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nanyong Sun <[email protected]> Signed-off-by: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08nilfs2: fix memory leak in nilfs_sysfs_create_##name##_groupNanyong Sun1-2/+2
If kobject_init_and_add return with error, kobject_put() is needed here to avoid memory leak, because kobject_init_and_add may return error without freeing the memory associated with the kobject it allocated. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nanyong Sun <[email protected]> Signed-off-by: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08nilfs2: fix NULL pointer in nilfs_##name##_attr_releaseNanyong Sun1-5/+3
In nilfs_##name##_attr_release, kobj->parent should not be referenced because it is a NULL pointer. The release() method of kobject is always called in kobject_put(kobj), in the implementation of kobject_put(), the kobj->parent will be assigned as NULL before call the release() method. So just use kobj to get the subgroups, which is more efficient and can fix a NULL pointer reference problem. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nanyong Sun <[email protected]> Signed-off-by: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08nilfs2: fix memory leak in nilfs_sysfs_create_device_groupNanyong Sun1-4/+2
Patch series "nilfs2: fix incorrect usage of kobject". This patchset from Nanyong Sun fixes memory leak issues and a NULL pointer dereference issue caused by incorrect usage of kboject in nilfs2 sysfs implementation. This patch (of 6): Reported by syzkaller: BUG: memory leak unreferenced object 0xffff888100ca8988 (size 8): comm "syz-executor.1", pid 1930, jiffies 4294745569 (age 18.052s) hex dump (first 8 bytes): 6c 6f 6f 70 31 00 ff ff loop1... backtrace: kstrdup+0x36/0x70 mm/util.c:60 kstrdup_const+0x35/0x60 mm/util.c:83 kvasprintf_const+0xf1/0x180 lib/kasprintf.c:48 kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289 kobject_add_varg lib/kobject.c:384 [inline] kobject_init_and_add+0xc9/0x150 lib/kobject.c:473 nilfs_sysfs_create_device_group+0x150/0x7d0 fs/nilfs2/sysfs.c:986 init_nilfs+0xa21/0xea0 fs/nilfs2/the_nilfs.c:637 nilfs_fill_super fs/nilfs2/super.c:1046 [inline] nilfs_mount+0x7b4/0xe80 fs/nilfs2/super.c:1316 legacy_get_tree+0x105/0x210 fs/fs_context.c:592 vfs_get_tree+0x8e/0x2d0 fs/super.c:1498 do_new_mount fs/namespace.c:2905 [inline] path_mount+0xf9b/0x1990 fs/namespace.c:3235 do_mount+0xea/0x100 fs/namespace.c:3248 __do_sys_mount fs/namespace.c:3456 [inline] __se_sys_mount fs/namespace.c:3433 [inline] __x64_sys_mount+0x14b/0x1f0 fs/namespace.c:3433 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae If kobject_init_and_add return with error, then the cleanup of kobject is needed because memory may be allocated in kobject_init_and_add without freeing. And the place of cleanup_dev_kobject should use kobject_put to free the memory associated with the kobject. As the section "Kobject removal" of "Documentation/core-api/kobject.rst" says, kobject_del() just makes the kobject "invisible", but it is not cleaned up. And no more cleanup will do after cleanup_dev_kobject, so kobject_put is needed here. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Reported-by: Hulk Robot <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nanyong Sun <[email protected]> Signed-off-by: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08trap: cleanup trap_init()Kefeng Wang12-51/+2
There are some empty trap_init() definitions in different ARCHs, Introduce a new weak trap_init() function to clean them up. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Acked-by: Russell King (Oracle) <[email protected]> [arm32] Acked-by: Vineet Gupta [arc] Acked-by: Michael Ellerman <[email protected]> [powerpc] Cc: Yoshinori Sato <[email protected]> Cc: Ley Foon Tan <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Stefan Kristiansson <[email protected]> Cc: Stafford Horne <[email protected]> Cc: James E.J. Bottomley <[email protected]> Cc: Helge Deller <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Anton Ivanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>