Age | Commit message (Collapse) | Author | Files | Lines |
|
Signed-off-by: Helge Deller <[email protected]>
|
|
Merge yet more updates and hotfixes from Andrew Morton:
"Post-linux-next material, based upon latest upstream to catch the
now-merged dependencies:
- 10 patches.
Subsystems affected by this patch series: mm (vmstat and migration)
and compat.
And bunch of hotfixes, mostly cc:stable:
- 8 patches.
Subsystems affected by this patch series: mm (hmm, hugetlb, vmscan,
pagealloc, pagemap, kmemleak, mempolicy, and memblock)"
* emailed patches from Andrew Morton <[email protected]>:
arch: remove compat_alloc_user_space
compat: remove some compat entry points
mm: simplify compat numa syscalls
mm: simplify compat_sys_move_pages
kexec: avoid compat_alloc_user_space
kexec: move locking into do_kexec_load
mm: migrate: change to use bool type for 'page_was_mapped'
mm: migrate: fix the incorrect function name in comments
mm: migrate: introduce a local variable to get the number of pages
mm/vmstat: protect per cpu variables with preempt disable on RT
* emailed hotfixes from Andrew Morton <[email protected]>:
nds32/setup: remove unused memblock_region variable in setup_memory()
mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task
mm/kmemleak: allow __GFP_NOLOCKDEP passed to kmemleak's gfp
mmap_lock: change trace and locking order
mm/page_alloc.c: avoid accessing uninitialized pcp page migratetype
mm,vmscan: fix divide by zero in get_scan_count
mm/hugetlb: initialize hugetlb_usage in mm_init
mm/hmm: bypass devmap pte when all pfn requested flags are fulfilled
|
|
kernel test robot reports unused variable warning:
arch/nds32/kernel/setup.c:247:26: warning: Unused variable: region
[unusedVariable]
struct memblock_region *region;
^
Remove the unused variable.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Reported-by: kernel test robot <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Nick Hu <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Servers happened below panic:
Kernel version:5.4.56
BUG: unable to handle page fault for address: 0000000000002c48
RIP: 0010:__next_zones_zonelist+0x1d/0x40
Call Trace:
__alloc_pages_nodemask+0x277/0x310
alloc_page_interleave+0x13/0x70
handle_mm_fault+0xf99/0x1390
__do_page_fault+0x288/0x500
do_page_fault+0x30/0x110
page_fault+0x3e/0x50
The reason for the panic is that MAX_NUMNODES is passed in the third
parameter in __alloc_pages_nodemask(preferred_nid). So access to
zonelist->zoneref->zone_idx in __next_zones_zonelist will cause a panic.
In offset_il_node(), first_node() returns nid from pol->v.nodes, after
this other threads may chang pol->v.nodes before next_node(). This race
condition will let next_node return MAX_NUMNODES. So put pol->nodes in
a local variable.
The race condition is between offset_il_node and cpuset_change_task_nodemask:
CPU0: CPU1:
alloc_pages_vma()
interleave_nid(pol,)
offset_il_node(pol,)
first_node(pol->v.nodes) cpuset_change_task_nodemask
//nodes==0xc mpol_rebind_task
mpol_rebind_policy
mpol_rebind_nodemask(pol,nodes)
//nodes==0x3
next_node(nid, pol->v.nodes)//return MAX_NUMNODES
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: yanghui <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In a memory pressure situation, I'm seeing the lockdep WARNING below.
Actually, this is similar to a known false positive which is already
addressed by commit 6dcde60efd94 ("xfs: more lockdep whackamole with
kmem_alloc*").
This warning still persists because it's not from kmalloc() itself but
from an allocation for kmemleak object. While kmalloc() itself suppress
the warning with __GFP_NOLOCKDEP, gfp_kmemleak_mask() is dropping the
flag for the kmemleak's allocation.
Allow __GFP_NOLOCKDEP to be passed to kmemleak's allocation, so that the
warning for it is also suppressed.
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc7-BTRFS-ZNS+ #37 Not tainted
------------------------------------------------------
kswapd0/288 is trying to acquire lock:
ffff88825ab45df0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0x8a/0x250
but task is already holding lock:
ffffffff848cc1e0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x112/0x160
kmem_cache_alloc+0x48/0x400
create_object.isra.0+0x42/0xb10
kmemleak_alloc+0x48/0x80
__kmalloc+0x228/0x440
kmem_alloc+0xd3/0x2b0
kmem_alloc_large+0x5a/0x1c0
xfs_attr_copy_value+0x112/0x190
xfs_attr_shortform_getvalue+0x1fc/0x300
xfs_attr_get_ilocked+0x125/0x170
xfs_attr_get+0x329/0x450
xfs_get_acl+0x18d/0x430
get_acl.part.0+0xb6/0x1e0
posix_acl_xattr_get+0x13a/0x230
vfs_getxattr+0x21d/0x270
getxattr+0x126/0x310
__x64_sys_fgetxattr+0x1a6/0x2a0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #0 (&xfs_nondir_ilock_class){++++}-{3:3}:
__lock_acquire+0x2c0f/0x5a00
lock_acquire+0x1a1/0x4b0
down_read_nested+0x50/0x90
xfs_ilock+0x8a/0x250
xfs_can_free_eofblocks+0x34f/0x570
xfs_inactive+0x411/0x520
xfs_fs_destroy_inode+0x2c8/0x710
destroy_inode+0xc5/0x1a0
evict+0x444/0x620
dispose_list+0xfe/0x1c0
prune_icache_sb+0xdc/0x160
super_cache_scan+0x31e/0x510
do_shrink_slab+0x337/0x8e0
shrink_slab+0x362/0x5c0
shrink_node+0x7a7/0x1a40
balance_pgdat+0x64e/0xfe0
kswapd+0x590/0xa80
kthread+0x38c/0x460
ret_from_fork+0x22/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(&xfs_nondir_ilock_class);
lock(fs_reclaim);
lock(&xfs_nondir_ilock_class);
*** DEADLOCK ***
3 locks held by kswapd0/288:
#0: ffffffff848cc1e0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffff848a08d8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x269/0x5c0
#2: ffff8881a7a820e8 (&type->s_umount_key#60){++++}-{3:3}, at: super_cache_scan+0x5a/0x510
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Naohiro Aota <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: "Darrick J . Wong" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Print to the trace log before releasing the lock to avoid racing with
other trace log printers of the same lock type.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Suggested-by: Steven Rostedt (VMware) <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If it's not prepared to free unref page, the pcp page migratetype is
unset. Thus we will get rubbish from get_pcppage_migratetype() and
might list_del(&page->lru) again after it's already deleted from the list
leading to grumble about data corruption.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
proportional memory.low reclaim") introduced a divide by zero corner
case when oomd is being used in combination with cgroup memory.low
protection.
When oomd decides to kill a cgroup, it will force the cgroup memory to
be reclaimed after killing the tasks, by writing to the memory.max file
for that cgroup, forcing the remaining page cache and reclaimable slab
to be reclaimed down to zero.
Previously, on cgroups with some memory.low protection that would result
in the memory being reclaimed down to the memory.low limit, or likely
not at all, having the page cache reclaimed asynchronously later.
With f56ce412a59d the oomd write to memory.max tries to reclaim all the
way down to zero, which may race with another reclaimer, to the point of
ending up with the divide by zero below.
This patch implements the obvious fix.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Chris Down <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
After fork, the child process will get incorrect (2x) hugetlb_usage. If
a process uses 5 2MB hugetlb pages in an anonymous mapping,
HugetlbPages: 10240 kB
and then forks, the child will show,
HugetlbPages: 20480 kB
The reason for double the amount is because hugetlb_usage will be copied
from the parent and then increased when we copy page tables from parent
to child. Child will have 2x actual usage.
Fix this by adding hugetlb_count_init in mm_init.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5d317b2b6536 ("mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status")
Signed-off-by: Liu Zixian <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Previously, we noticed the one rpma example was failed[1] since commit
36f30e486dce ("IB/core: Improve ODP to use hmm_range_fault()"), where it
will use ODP feature to do RDMA WRITE between fsdax files.
After digging into the code, we found hmm_vma_handle_pte() will still
return EFAULT even though all the its requesting flags has been
fulfilled. That's because a DAX page will be marked as (_PAGE_SPECIAL |
PAGE_DEVMAP) by pte_mkdevmap().
Link: https://github.com/pmem/rpma/issues/1142 [1]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 405506274922 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling")
Signed-off-by: Li Zhijian <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
Pull chrome platform updates from Benson Leung:
"cros_ec_typec:
- make the cros_ec_typec driver to use the pre-existing
cros_ec_check_features() function
sensorhub:
- add trace events for sample
misc:
- cros_ec_proto - re-send commands in the event of a timeout (for the
FPMCU)
- fix warnings in cros_ec_trace related to format output"
* tag 'tag-chrome-platform-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
platform/chrome: cros_ec_trace: Fix format warnings
platform/chrome: cros_ec_typec: Use existing feature check
platform/chrome: cros_ec_proto: Send command again when timeout occurs
platform/chrome: sensorhub: Add trace events for sample
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are mostly ARM cpufreq driver updates, including one new
MediaTek driver that has just passed all of the reviews, with the
addition of a revert of a recent intel_pstate commit, some core
cpufreq changes and a DT-related update of the operating performance
points (OPP) support code.
Specifics:
- Add new cpufreq driver for the MediaTek MT6779 platform called
mediatek-hw along with corresponding DT bindings (Hector.Yuan).
- Add DCVS interrupt support to the qcom-cpufreq-hw driver (Thara
Gopinath).
- Make the qcom-cpufreq-hw driver set the dvfs_possible_from_any_cpu
policy flag (Taniya Das).
- Blocklist more Qualcomm platforms in cpufreq-dt-platdev (Bjorn
Andersson).
- Make the vexpress cpufreq driver set the CPUFREQ_IS_COOLING_DEV
flag (Viresh Kumar).
- Add new cpufreq driver callback to allow drivers to register with
the Energy Model in a consistent way and make several drivers use
it (Viresh Kumar).
- Change the remaining users of the .ready() cpufreq driver callback
to move the code from it elsewhere and drop it from the cpufreq
core (Viresh Kumar).
- Revert recent intel_pstate change adding HWP guaranteed performance
change notification support to it that led to problems, because the
notification in question is triggered prematurely on some systems
(Rafael Wysocki).
- Convert the OPP DT bindings to DT schema and clean them up while at
it (Rob Herring)"
* tag 'pm-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification"
cpufreq: mediatek-hw: Add support for CPUFREQ HW
cpufreq: Add of_perf_domain_get_sharing_cpumask
dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW
cpufreq: Remove ready() callback
cpufreq: sh: Remove sh_cpufreq_cpu_ready()
cpufreq: acpi: Remove acpi_cpufreq_cpu_ready()
cpufreq: qcom-hw: Set dvfs_possible_from_any_cpu cpufreq driver flag
cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev
cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support
cpufreq: scmi: Use .register_em() to register with energy model
cpufreq: vexpress: Use .register_em() to register with energy model
cpufreq: scpi: Use .register_em() to register with energy model
dt-bindings: opp: Convert to DT schema
dt-bindings: Clean-up OPP binding node names in examples
ARM: dts: omap: Drop references to opp.txt
cpufreq: qcom-cpufreq-hw: Use .register_em() to register with energy model
cpufreq: omap: Use .register_em() to register with energy model
cpufreq: mediatek: Use .register_em() to register with energy model
cpufreq: imx6q: Use .register_em() to register with energy model
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI updates from Rafael Wysocki:
"These add ACPI support to the PCI VMD driver, improve suspend-to-idle
support for AMD platforms and update documentation.
Specifics:
- Add ACPI support to the PCI VMD driver (Rafael Wysocki)
- Rearrange suspend-to-idle support code to reflect the platform
firmware expectations on some AMD platforms (Mario Limonciello)
- Make SSDT overlays documentation follow the code documented by it
more closely (Andy Shevchenko)"
* tag 'acpi-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: PM: s2idle: Run both AMD and Microsoft methods if both are supported
Documentation: ACPI: Align the SSDT overlays file with the code
PCI: VMD: ACPI: Make ACPI companion lookup work for VMD bus
|
|
Pull more documentation updates from Jonathan Corbet:
"Another collection of documentation patches, mostly fixes but also
includes another set of traditional Chinese translations"
* tag 'docs-5.15-2' of git://git.lwn.net/linux:
docs: pdfdocs: Fix typo in CJK-language specific font settings
docs: kernel-hacking: Remove inappropriate text
docs/zh_TW: add translations for zh_TW/filesystems
docs/zh_TW: add translations for zh_TW/cpu-freq
docs/zh_TW: add translations for zh_TW/arm64
docs/zh_CN: Modify the translator tag and fix the wrong word
Documentation/features/vm: correct huge-vmap APIs
Documentation: block: blk-mq: Fix small typo in multi-queue docs
Documentation: in_irq() cleanup
Documentation: arm: marvell: Add 88F6825 model into list
Documentation/process/maintainer-pgp-guide: Replace broken link to PGP path finder
Documentation: locking: fix references
Documentation: Update details of The Linux Kernel Module Programming Guide
docs: x86: Remove obsolete information about x86_64 vmalloc() faulting
Documentation/process/applying-patches: Activate linux-next man hyperlink
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull module updates from Jessica Yu:
"The only main change I have for this round of updates is the modules
MAINTAINERS update.
As I find myself with less time to devote to upstream these days, Luis
has kindly agreed to help maintain the module loader, to eventually
transition to being the primary maintainer. Since Luis is already very
involved upstream with experience maintaining various areas of the
kernel including the kmod usermode helper, I think he is a great fit
for this area of the kernel.
Summary:
- Add Luis Chamberlain as modules maintainer
- Fix for .ctors sections in module linker script"
* tag 'modules-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
MAINTAINERS: Add Luis Chamberlain as modules maintainer
module: combine constructors in module linker script
|
|
Pull microblaze update from Michal Simek:
- Kbuild clean up
* tag 'microblaze-v5.15' of git://git.monstr.eu/linux-2.6-microblaze:
microblaze: move core-y in arch/microblaze/Makefile to arch/microblaze/Kbuild
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Restore performance on memory-starved servers
* tag 'nfsd-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: improve error response to over-size gss credential
SUNRPC: don't pause on incomplete allocation
|
|
Pull ceph updates from Ilya Dryomov:
- a set of patches to address fsync stalls caused by depending on
periodic rather than triggered MDS journal flushes in some cases
(Xiubo Li)
- a fix for mtime effectively not getting updated in case of competing
writers (Jeff Layton)
- a couple of fixes for inode reference leaks and various WARNs after
"umount -f" (Xiubo Li)
- a new ceph.auth_mds extended attribute (Jeff Layton)
- a smattering of fixups and cleanups from Jeff, Xiubo and Colin.
* tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-client:
ceph: fix dereference of null pointer cf
ceph: drop the mdsc_get_session/put_session dout messages
ceph: lockdep annotations for try_nonblocking_invalidate
ceph: don't WARN if we're forcibly removing the session caps
ceph: don't WARN if we're force umounting
ceph: remove the capsnaps when removing caps
ceph: request Fw caps before updating the mtime in ceph_write_iter
ceph: reconnect to the export targets on new mdsmaps
ceph: print more information when we can't find snaprealm
ceph: add ceph_change_snap_realm() helper
ceph: remove redundant initializations from mdsc and session
ceph: cancel delayed work instead of flushing on mdsc teardown
ceph: add a new vxattr to return auth mds for an inode
ceph: remove some defunct forward declarations
ceph: flush the mdlog before waiting on unsafe reqs
ceph: flush mdlog before umounting
ceph: make iterate_sessions a global symbol
ceph: make ceph_create_session_msg a global symbol
ceph: fix comment about short copies in ceph_write_end
ceph: fix memory leak on decode error in ceph_handle_caps
|
|
Pull 9p updates from Dominique Martinet:
"A couple of harmless fixes, increase max tcp msize (64KB -> 1MB), and
increase default msize (8KB -> 128KB)
The default increase has been discussed with Christian for the qemu
side of things but makes sense for all supported transports"
* tag '9p-for-5.15-rc1' of git://github.com/martinetd/linux:
net/9p: increase default msize to 128k
net/9p: use macro to define default msize
net/9p: increase tcp max msize to 1MB
9p/xen: Fix end of loop tests for list_for_each_entry
9p/trans_virtio: Remove sysfs file on probe failure
|
|
All users of compat_alloc_user_space() and copy_in_user() have been
removed from the kernel, only a few functions in sparc remain that can be
changed to calling arch_copy_in_user() instead.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
These are all handled correctly when calling the native system call entry
point, so remove the special cases.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The compat implementations for mbind, get_mempolicy, set_mempolicy and
migrate_pages are just there to handle the subtly different layout of
bitmaps on 32-bit hosts.
The compat implementation however lacks some of the checks that are
present in the native one, in particular for checking that the extra bits
are all zero when user space has a larger mask size than the kernel.
Worse, those extra bits do not get cleared when copying in or out of the
kernel, which can lead to incorrect data as well.
Unify the implementation to handle the compat bitmap layout directly in
the get_nodes() and copy_nodes_to_user() helpers. Splitting out the
get_bitmap() helper from get_nodes() also helps readability of the native
case.
On x86, two additional problems are addressed by this: compat tasks can
pass a bitmap at the end of a mapping, causing a fault when reading across
the page boundary for a 64-bit word. x32 tasks might also run into
problems with get_mempolicy corrupting data when an odd number of 32-bit
words gets passed.
On parisc the migrate_pages() system call apparently had the wrong calling
convention, as big-endian architectures expect the words inside of a
bitmap to be swapped. This is not a problem though since parisc has no
NUMA support.
[[email protected]: fix mempolicy crash]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/YQPLG20V3dmOfq3a@osiris/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The compat move_pages() implementation uses compat_alloc_user_space() for
converting the pointer array. Moving the compat handling into the
function itself is a bit simpler and lets us avoid the
compat_alloc_user_space() call.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kimage_alloc_init() expects a __user pointer, so compat_sys_kexec_load()
uses compat_alloc_user_space() to convert the layout and put it back onto
the user space caller stack.
Moving the user space access into the syscall handler directly actually
makes the code simpler, as the conversion for compat mode can now be done
on kernel memory.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/YPbtsU4GX6PL7%[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Arnd Bergmann <[email protected]>
Co-developed-by: Eric Biederman <[email protected]>
Co-developed-by: Christoph Hellwig <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "compat: remove compat_alloc_user_space", v5.
Going through compat_alloc_user_space() to convert indirect system call
arguments tends to add complexity compared to handling the native and
compat logic in the same code.
This patch (of 6):
The locking is the same between the native and compat version of
sys_kexec_load(), so it can be done in the common implementation to reduce
duplication.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Co-developed-by: Eric Biederman <[email protected]>
Co-developed-by: Christoph Hellwig <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Change to use bool type for 'page_was_mapped' variable making it more
readable.
Link: https://lkml.kernel.org/r/ce1279df18d2c163998c403e0b5ec6d3f6f90f7a.1629447552.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
since commit a98a2f0c8ce1 ("mm/rmap: split migration into its own
function"), the migration ptes establishment has been split into a
separate try_to_migrate() function, thus update the related comments.
Link: https://lkml.kernel.org/r/5b824bad6183259c916ae6cf42f81d14c6118b06.1629447552.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use thp_nr_pages() instead of compound_nr() to get the number of pages for
THP page, meanwhile introducing a local variable 'nr_pages' to avoid
getting the number of pages repeatedly.
Link: https://lkml.kernel.org/r/a8e331ac04392ee230c79186330fb05e86a2aa77.1629447552.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Disable preemption on -RT for the vmstat code. On vanila the code runs in
IRQ-off regions while on -RT it may not when stats are updated under a
local_lock. "preempt_disable" ensures that the same resources is not
updated in parallel due to preemption.
This patch differs from the preempt-rt version where __count_vm_event and
__count_vm_events are also protected. The counters are explicitly
"allowed to be to be racy" so there is no need to protect them from
preemption. Only the accurate page stats that are updated by a
read-modify-write need protection. This patch also differs in that a
preempt_[en|dis]able_rt helper is not used. As vmstat is the only user of
the helper, it was suggested that it be open-coded in vmstat.c instead of
risking the helper being used in unnecessary contexts.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.
Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"
* emailed patches from Andrew Morton <[email protected]>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux
Pull SLUB updates from Vlastimil Babka:
"SLUB: reduce irq disabled scope and make it RT compatible
This series was initially inspired by Mel's pcplist local_lock
rewrite, and also interest to better understand SLUB's locking and the
new primitives and RT variants and implications. It makes SLUB
compatible with PREEMPT_RT and generally more preemption-friendly,
apparently without significant regressions, as the fast paths are not
affected.
The main changes to SLUB by this series:
- irq disabling is now only done for minimum amount of time needed to
protect the strict kmem_cache_cpu fields, and as part of spin lock,
local lock and bit lock operations to make them irq-safe
- SLUB is fully PREEMPT_RT compatible
The series should now be sufficiently tested in both RT and !RT
configs, mainly thanks to Mike.
The RFC/v1 version also got basic performance screening by Mel that
didn't show major regressions. Mike's testing with hackbench of v2 on
!RT reported negligible differences [6]:
virgin(ish) tip
5.13.0.g60ab3ed-tip
7,320.67 msec task-clock # 7.792 CPUs utilized ( +- 0.31% )
221,215 context-switches # 0.030 M/sec ( +- 3.97% )
16,234 cpu-migrations # 0.002 M/sec ( +- 4.07% )
13,233 page-faults # 0.002 M/sec ( +- 0.91% )
27,592,205,252 cycles # 3.769 GHz ( +- 0.32% )
8,309,495,040 instructions # 0.30 insn per cycle ( +- 0.37% )
1,555,210,607 branches # 212.441 M/sec ( +- 0.42% )
5,484,209 branch-misses # 0.35% of all branches ( +- 2.13% )
0.93949 +- 0.00423 seconds time elapsed ( +- 0.45% )
0.94608 +- 0.00384 seconds time elapsed ( +- 0.41% ) (repeat)
0.94422 +- 0.00410 seconds time elapsed ( +- 0.43% )
5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
7,343.57 msec task-clock # 7.776 CPUs utilized ( +- 0.44% )
223,044 context-switches # 0.030 M/sec ( +- 3.02% )
16,057 cpu-migrations # 0.002 M/sec ( +- 4.03% )
13,164 page-faults # 0.002 M/sec ( +- 0.97% )
27,684,906,017 cycles # 3.770 GHz ( +- 0.45% )
8,323,273,871 instructions # 0.30 insn per cycle ( +- 0.28% )
1,556,106,680 branches # 211.901 M/sec ( +- 0.31% )
5,463,468 branch-misses # 0.35% of all branches ( +- 1.33% )
0.94440 +- 0.00352 seconds time elapsed ( +- 0.37% )
0.94830 +- 0.00228 seconds time elapsed ( +- 0.24% ) (repeat)
0.93813 +- 0.00440 seconds time elapsed ( +- 0.47% ) (repeat)
RT configs showed some throughput regressions, but that's expected
tradeoff for the preemption improvements through the RT mutex. It
didn't prevent the v2 to be incorporated to the 5.13 RT tree [7],
leading to testing exposure and bugfixes.
Before the series, SLUB is lockless in both allocation and free fast
paths, but elsewhere, it's disabling irqs for considerable periods of
time - especially in allocation slowpath and the bulk allocation,
where IRQs are re-enabled only when a new page from the page allocator
is needed, and the context allows blocking. The irq disabled sections
can then include deactivate_slab() which walks a full freelist and
frees the slab back to page allocator or unfreeze_partials() going
through a list of percpu partial slabs. The RT tree currently has some
patches mitigating these, but we can do much better in mainline too.
Patches 1-6 are straightforward improvements or cleanups that could
exist outside of this series too, but are prerequsities.
Patches 7-9 are also preparatory code changes without functional
changes, but not so useful without the rest of the series.
Patch 10 simplifies the fast paths on systems with preemption, based
on (hopefully correct) observation that the current loops to verify
tid are unnecessary.
Patches 11-20 focus on reducing irq disabled scope in the allocation
slowpath:
- patch 11 moves disabling of irqs into ___slab_alloc() from its
callers, which are the allocation slowpath, and bulk allocation.
Instead these callers only disable preemption to stabilize the cpu.
- The following patches then gradually reduce the scope of disabled
irqs in ___slab_alloc() and the functions called from there. As of
patch 14, the re-enabling of irqs based on gfp flags before calling
the page allocator is removed from allocate_slab(). As of patch 17,
it's possible to reach the page allocator (in case of existing
slabs depleted) without disabling and re-enabling irqs a single
time.
Pathces 21-26 reduce the scope of disabled irqs in functions related
to unfreezing percpu partial slab.
Patch 27 is preparatory. Patch 28 is adopted from the RT tree and
converts the flushing of percpu slabs on all cpus from using IPI to
workqueue, so that the processing isn't happening with irqs disabled
in the IPI handler. The flushing is not performance critical so it
should be acceptable.
Patch 29 also comes from RT tree and makes object_map_lock RT
compatible.
Patch 30 make slab_lock irq-safe on RT where we cannot rely on having
irq disabled from the list_lock spin lock usage.
Patch 31 changes kmem_cache_cpu->partial handling in put_cpu_partial()
from cmpxchg loop to a short irq disabled section, which is used by
all other code modifying the field. This addresses a theoretical race
scenario pointed out by Jann, and makes the critical section safe wrt
with RT local_lock semantics after the conversion in patch 35.
Patch 32 changes preempt disable to migrate disable, so that the
nested list_lock spinlock is safe to take on RT. Because
migrate_disable() is a function call even on !RT, a small set of
private wrappers is introduced to keep using the cheaper
preempt_disable() on !PREEMPT_RT configurations. As of this patch,
SLUB should be already compatible with RT's lock semantics.
Finally, patch 33 changes irq disabled sections that protect
kmem_cache_cpu fields in the slow paths, with a local lock. However on
PREEMPT_RT it means the lockless fast paths can now preempt slow paths
which don't expect that, so the local lock has to be taken also in the
fast paths and they are no longer lockless. RT folks seem to not mind
this tradeoff. The patch also updates the locking documentation in the
file's comment"
Mike Galbraith and Mel Gorman verified that their earlier testing
observations still hold for the final series:
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
* tag 'mm-slub-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux: (33 commits)
mm, slub: convert kmem_cpu_slab protection to local_lock
mm, slub: use migrate_disable() on PREEMPT_RT
mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
mm, slub: make slab_lock() disable irqs with PREEMPT_RT
mm: slub: make object_map_lock a raw_spinlock_t
mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context
mm, slab: split out the cpu offline variant of flush_slab()
mm, slub: don't disable irqs in slub_cpu_dead()
mm, slub: only disable irq with spin_lock in __unfreeze_partials()
mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing
mm, slub: detach whole partial list at once in unfreeze_partials()
mm, slub: discard slabs in unfreeze_partials() without irqs disabled
mm, slub: move irq control into unfreeze_partials()
mm, slub: call deactivate_slab() without disabling irqs
mm, slub: make locking in deactivate_slab() irq-safe
mm, slub: move reset of c->page and freelist out of deactivate_slab()
mm, slub: stop disabling irqs around get_partial()
mm, slub: check new pages with restored irqs
mm, slub: validate slab from partial list or page allocator before making it cpu slab
mm, slub: restore irqs around calling new_slab()
...
|
|
Fix typo ("and" should be "an") in an error message.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Quentin Casasnovas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the documented kernel-doc format to prevent kernel-doc warnings.
mm/workingset.c:256: warning: No description found for return value of 'workingset_eviction'
mm/workingset.c:285: warning: Function parameter or member 'folio' not described in 'workingset_refault'
mm/workingset.c:285: warning: Excess function parameter 'page' description in 'workingset_refault'
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
sysvipc_find_ipc() was left with a costly way to check if the offset
position fed to it is bigger than the total number of IPC IDs in use. So
much so that the time it takes to iterate over /proc/sysvipc/* files grows
exponentially for a custom benchmark that creates "N" SYSV shm segments
and then times the read of /proc/sysvipc/shm (milliseconds):
12 msecs to read 1024 segs from /proc/sysvipc/shm
18 msecs to read 2048 segs from /proc/sysvipc/shm
65 msecs to read 4096 segs from /proc/sysvipc/shm
325 msecs to read 8192 segs from /proc/sysvipc/shm
1303 msecs to read 16384 segs from /proc/sysvipc/shm
5182 msecs to read 32768 segs from /proc/sysvipc/shm
The root problem lies with the loop that computes the total amount of ids
in use to check if the "pos" feeded to sysvipc_find_ipc() grew bigger than
"ids->in_use". That is a quite inneficient way to get to the maximum
index in the id lookup table, specially when that value is already
provided by struct ipc_ids.max_idx.
This patch follows up on the optimization introduced via commit
15df03c879836 ("sysvipc: make get_maxid O(1) again") and gets rid of the
aforementioned costly loop replacing it by a simpler checkpoint based on
ipc_get_maxidx() returned value, which allows for a smooth linear increase
in time complexity for the same custom benchmark:
2 msecs to read 1024 segs from /proc/sysvipc/shm
2 msecs to read 2048 segs from /proc/sysvipc/shm
4 msecs to read 4096 segs from /proc/sysvipc/shm
9 msecs to read 8192 segs from /proc/sysvipc/shm
19 msecs to read 16384 segs from /proc/sysvipc/shm
39 msecs to read 32768 segs from /proc/sysvipc/shm
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rafael Aquini <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Acked-by: Manfred Spraul <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 544029862cbb ("selftests/memfd: add tests for F_SEAL_FUTURE_WRITE
seal") added an unused variable to mfd_assert_reopen_fd().
Delete the unused variable.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 544029862cbb ("selftests/memfd: add tests for F_SEAL_FUTURE_WRITE seal")
Signed-off-by: Greg Thelen <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: "Joel Fernandes (Google)" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 05a4a9527931 ("kernel/watchdog: split up config options") adds a
new config HARDLOCKUP_DETECTOR, which selects the non-existing config
HARDLOCKUP_DETECTOR_ARCH.
Hence, ./scripts/checkkconfigsymbols.py warns:
HARDLOCKUP_DETECTOR_ARCH Referencing files: lib/Kconfig.debug
Simply drop selecting the non-existing HARDLOCKUP_DETECTOR_ARCH.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 05a4a9527931 ("kernel/watchdog: split up config options")
Signed-off-by: Lukas Bulwahn <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Babu Moger <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This CONFIG option was removed in commit 278b13ce3a89 ("Input: remove
input_polled_dev implementation") so there's no point to keep it in
defconfigs any longer.
Get rid of the leftover for all arches.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zenghui Yu <[email protected]>
Cc: Dmitry Torokhov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Keno Fischer reported that when a binray loaded via ld-linux-x the
prctl(PR_SET_MM_MAP) doesn't allow to setup brk value because it lays
before mm:end_data.
For example a test program shows
| # ~/t
|
| start_code 401000
| end_code 401a15
| start_stack 7ffce4577dd0
| start_data 403e10
| end_data 40408c
| start_brk b5b000
| sbrk(0) b5b000
and when executed via ld-linux
| # /lib64/ld-linux-x86-64.so.2 ~/t
|
| start_code 7fc25b0a4000
| end_code 7fc25b0c4524
| start_stack 7fffcc6b2400
| start_data 7fc25b0ce4c0
| end_data 7fc25b0cff98
| start_brk 55555710c000
| sbrk(0) 55555710c000
This of course prevent criu from restoring such programs. Looking into
how kernel operates with brk/start_brk inside brk() syscall I don't see
any problem if we allow to setup brk/start_brk without checking for
end_data. Even if someone pass some weird address here on a purpose then
the worst possible result will be an unexpected unmapping of existing vma
(own vma, since prctl works with the callers memory) but test for
RLIMIT_DATA is still valid and a user won't be able to gain more memory in
case of expanding VMAs via new values shipped with prctl call.
Link: https://lkml.kernel.org/r/20210121221207.GB2174@grain
Fixes: bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct loader exec")
Signed-off-by: Cyrill Gorcunov <[email protected]>
Reported-by: Keno Fischer <[email protected]>
Acked-by: Andrey Vagin <[email protected]>
Tested-by: Andrey Vagin <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Alexander Mikhalitsyn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
pidmap_init() has already been replaced with pid_idr_init() in the commit
95846ecf9dac ("pid: replace pid bitmap implementation with IDR API").
Cleanup the stale comment which still mentions it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Takahiro Itazuri <[email protected]>
Cc: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Only used by core code and the tomoyo which can't be a module either.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
dump_vma_snapshot() allocs memory for *vma_meta, when dump_vma_snapshot()
returns -EFAULT, the memory will be leaked, so we free it correctly.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: a07279c9a8cd7 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Signed-off-by: QiuXi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For obvious security reasons, a core dump is aborted if the filesystem
cannot preserve ownership or permissions of the dump file.
This affects filesystems like e.g. vfat, but also something like a 9pfs
share in a Qemu test setup, running as a regular user, depending on the
security model used. In those cases, the result is an empty core file and
a confused user.
To hopefully save other people a lot of time figuring out the cause, this
patch adds a simple log message for those specific cases.
[[email protected]: s/|%s/%s/ in printk text]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Oberhollenzer <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When the refcount is decreased to 0, the resource reclamation branch is
entered. Before CPU0 reaches the race point (1), CPU1 may obtain the
spinlock and traverse the rbtree to find 'root', see
nilfs_lookup_root().
Although CPU1 will call refcount_inc() to increase the refcount, it is
obviously too late. CPU0 will release 'root' directly, CPU1 then
accesses 'root' and triggers UAF.
Use refcount_dec_and_lock() to ensure that both the operations of
decrease refcount to 0 and link deletion are lock protected eliminates
this risk.
CPU0 CPU1
nilfs_put_root():
<-------- (1)
spin_lock(&nilfs->ns_cptree_lock);
rb_erase(&root->rb_node, &nilfs->ns_cptree);
spin_unlock(&nilfs->ns_cptree_lock);
kfree(root);
<-------- use-after-free
refcount_t: underflow; use-after-free.
WARNING: CPU: 2 PID: 9476 at lib/refcount.c:28 \
refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28
Modules linked in:
CPU: 2 PID: 9476 Comm: syz-executor.0 Not tainted 5.10.45-rc1+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
RIP: 0010:refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28
... ...
Call Trace:
__refcount_sub_and_test include/linux/refcount.h:283 [inline]
__refcount_dec_and_test include/linux/refcount.h:315 [inline]
refcount_dec_and_test include/linux/refcount.h:333 [inline]
nilfs_put_root+0xc1/0xd0 fs/nilfs2/the_nilfs.c:795
nilfs_segctor_destroy fs/nilfs2/segment.c:2749 [inline]
nilfs_detach_log_writer+0x3fa/0x570 fs/nilfs2/segment.c:2812
nilfs_put_super+0x2f/0xf0 fs/nilfs2/super.c:467
generic_shutdown_super+0xcd/0x1f0 fs/super.c:464
kill_block_super+0x4a/0x90 fs/super.c:1446
deactivate_locked_super+0x6a/0xb0 fs/super.c:335
deactivate_super+0x85/0x90 fs/super.c:366
cleanup_mnt+0x277/0x2e0 fs/namespace.c:1118
__cleanup_mnt+0x15/0x20 fs/namespace.c:1125
task_work_run+0x8e/0x110 kernel/task_work.c:151
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:164 [inline]
exit_to_user_mode_prepare+0x13c/0x170 kernel/entry/common.c:191
syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266
do_syscall_64+0x45/0x80 arch/x86/entry/common.c:56
entry_SYSCALL_64_after_hwframe+0x44/0xa9
There is no reproduction program, and the above is only theoretical
analysis.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: ba65ae4729bf ("nilfs2: add checkpoint tree to nilfs object")
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zhen Lei <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kobject_put() should be used to cleanup the memory associated with the
kobject instead of kobject_del(). See the section "Kobject removal" of
"Documentation/core-api/kobject.rst".
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nanyong Sun <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If kobject_init_and_add returns with error, kobject_put() is needed here
to avoid memory leak, because kobject_init_and_add may return error
without freeing the memory associated with the kobject it allocated.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nanyong Sun <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The kobject_put() should be used to cleanup the memory associated with the
kobject instead of kobject_del. See the section "Kobject removal" of
"Documentation/core-api/kobject.rst".
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nanyong Sun <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If kobject_init_and_add return with error, kobject_put() is needed here to
avoid memory leak, because kobject_init_and_add may return error without
freeing the memory associated with the kobject it allocated.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nanyong Sun <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In nilfs_##name##_attr_release, kobj->parent should not be referenced
because it is a NULL pointer. The release() method of kobject is always
called in kobject_put(kobj), in the implementation of kobject_put(), the
kobj->parent will be assigned as NULL before call the release() method.
So just use kobj to get the subgroups, which is more efficient and can fix
a NULL pointer reference problem.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nanyong Sun <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "nilfs2: fix incorrect usage of kobject".
This patchset from Nanyong Sun fixes memory leak issues and a NULL
pointer dereference issue caused by incorrect usage of kboject in nilfs2
sysfs implementation.
This patch (of 6):
Reported by syzkaller:
BUG: memory leak
unreferenced object 0xffff888100ca8988 (size 8):
comm "syz-executor.1", pid 1930, jiffies 4294745569 (age 18.052s)
hex dump (first 8 bytes):
6c 6f 6f 70 31 00 ff ff loop1...
backtrace:
kstrdup+0x36/0x70 mm/util.c:60
kstrdup_const+0x35/0x60 mm/util.c:83
kvasprintf_const+0xf1/0x180 lib/kasprintf.c:48
kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289
kobject_add_varg lib/kobject.c:384 [inline]
kobject_init_and_add+0xc9/0x150 lib/kobject.c:473
nilfs_sysfs_create_device_group+0x150/0x7d0 fs/nilfs2/sysfs.c:986
init_nilfs+0xa21/0xea0 fs/nilfs2/the_nilfs.c:637
nilfs_fill_super fs/nilfs2/super.c:1046 [inline]
nilfs_mount+0x7b4/0xe80 fs/nilfs2/super.c:1316
legacy_get_tree+0x105/0x210 fs/fs_context.c:592
vfs_get_tree+0x8e/0x2d0 fs/super.c:1498
do_new_mount fs/namespace.c:2905 [inline]
path_mount+0xf9b/0x1990 fs/namespace.c:3235
do_mount+0xea/0x100 fs/namespace.c:3248
__do_sys_mount fs/namespace.c:3456 [inline]
__se_sys_mount fs/namespace.c:3433 [inline]
__x64_sys_mount+0x14b/0x1f0 fs/namespace.c:3433
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
If kobject_init_and_add return with error, then the cleanup of kobject
is needed because memory may be allocated in kobject_init_and_add
without freeing.
And the place of cleanup_dev_kobject should use kobject_put to free the
memory associated with the kobject. As the section "Kobject removal" of
"Documentation/core-api/kobject.rst" says, kobject_del() just makes the
kobject "invisible", but it is not cleaned up. And no more cleanup will
do after cleanup_dev_kobject, so kobject_put is needed here.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Hulk Robot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nanyong Sun <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are some empty trap_init() definitions in different ARCHs, Introduce
a new weak trap_init() function to clean them up.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Acked-by: Russell King (Oracle) <[email protected]> [arm32]
Acked-by: Vineet Gupta [arc]
Acked-by: Michael Ellerman <[email protected]> [powerpc]
Cc: Yoshinori Sato <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Anton Ivanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|