Age | Commit message (Collapse) | Author | Files | Lines |
|
A fix for a regression in slub_debug caches that could cause slab page
leaks and subsequent warnings on cache shutdown, by Feng Tang.
|
|
When enable kasan and kfence's in-kernel kunit test with slub_debug on,
it caught a problem (in linux-next tree):
------------[ cut here ]------------
kmem_cache_destroy test: Slab cache still has objects when called from test_exit+0x1a/0x30
WARNING: CPU: 3 PID: 240 at mm/slab_common.c:492 kmem_cache_destroy+0x16c/0x170
Modules linked in:
CPU: 3 PID: 240 Comm: kunit_try_catch Tainted: G B N 6.0.0-rc7-next-20220929 #52
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:kmem_cache_destroy+0x16c/0x170
Code: 41 5c 41 5d e9 a5 04 0b 00 c3 cc cc cc cc 48 8b 55 60 48 8b 4c 24 20 48 c7 c6 40 37 d2 82 48 c7 c7 e8 a0 33 83 e8 4e d7 14 01 <0f> 0b eb a7 41 56 41 89 d6 41 55 49 89 f5 41 54 49 89 fc 55 48 89
RSP: 0000:ffff88800775fea0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffffff83bdec48 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 1ffff11000eebf9e RDI: ffffed1000eebfc6
RBP: ffff88804362fa00 R08: ffffffff81182e58 R09: ffff88800775fbdf
R10: ffffed1000eebf7b R11: 0000000000000001 R12: 000000008c800d00
R13: ffff888005e78040 R14: 0000000000000000 R15: ffff888005cdfad0
FS: 0000000000000000(0000) GS:ffff88807ed00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000360e001 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
test_exit+0x1a/0x30
kunit_try_run_case+0xad/0xc0
kunit_generic_run_threadfn_adapter+0x26/0x50
kthread+0x17b/0x1b0
It was biscted to commit c7323a5ad078 ("mm/slub: restrict sysfs
validation to debug caches and make it safe")
The problem is inside free_debug_processing(), under certain
circumstances the slab can be removed from the partial list but not
freed by discard_slab() and thus n->nr_slabs is not decreased
accordingly. During shutdown, this non-zero n->nr_slabs is detected and
reported.
Specifically, the problem is that there are two checks for detecting a
full partial list by comparing n->nr_partial >= s->min_partial where the
latter check is affected by remove_partial() decreasing n->nr_partial
between the checks. Reoganize the code so there is a single check
upfront.
Link: https://lore.kernel.org/all/[email protected]/
Fixes: c7323a5ad078 ("mm/slub: restrict sysfs validation to debug caches and make it safe")
Signed-off-by: Feng Tang <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
The first two patches from a series by Kees Cook [1] that introduce
kmalloc_size_roundup(). This will allow merging of per-subsystem patches using
the new function and ultimately stop (ab)using ksize() in a way that causes
ongoing trouble for debugging functionality and static checkers.
[1] https://lore.kernel.org/all/[email protected]/
--
Resolved a conflict of modifying mm/slab.c __ksize() comment with a commit that
unifies __ksize() implementation into mm/slab_common.c
|
|
A patch from Feng Tang that enhances the existing debugfs alloc_traces
file for kmalloc caches with information about how much space is wasted
by allocations that needs less space than the particular kmalloc cache
provides.
|
|
Additional cleanup by Chao Yu removing a BUG_ON() in create_unique_id().
|
|
In the effort to help the compiler reason about buffer sizes, the
__alloc_size attribute was added to allocators. This improves the scope
of the compiler's ability to apply CONFIG_UBSAN_BOUNDS and (in the near
future) CONFIG_FORTIFY_SOURCE. For most allocations, this works well,
as the vast majority of callers are not expecting to use more memory
than what they asked for.
There is, however, one common exception to this: anticipatory resizing
of kmalloc allocations. These cases all use ksize() to determine the
actual bucket size of a given allocation (e.g. 128 when 126 was asked
for). This comes in two styles in the kernel:
1) An allocation has been determined to be too small, and needs to be
resized. Instead of the caller choosing its own next best size, it
wants to minimize the number of calls to krealloc(), so it just uses
ksize() plus some additional bytes, forcing the realloc into the next
bucket size, from which it can learn how large it is now. For example:
data = krealloc(data, ksize(data) + 1, gfp);
data_len = ksize(data);
2) The minimum size of an allocation is calculated, but since it may
grow in the future, just use all the space available in the chosen
bucket immediately, to avoid needing to reallocate later. A good
example of this is skbuff's allocators:
data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
...
/* kmalloc(size) might give us more room than requested.
* Put skb_shared_info exactly at the end of allocated zone,
* to allow max possible filling before reallocation.
*/
osize = ksize(data);
size = SKB_WITH_OVERHEAD(osize);
In both cases, the "how much was actually allocated?" question is answered
_after_ the allocation, where the compiler hinting is not in an easy place
to make the association any more. This mismatch between the compiler's
view of the buffer length and the code's intention about how much it is
going to actually use has already caused problems[1]. It is possible to
fix this by reordering the use of the "actual size" information.
We can serve the needs of users of ksize() and still have accurate buffer
length hinting for the compiler by doing the bucket size calculation
_before_ the allocation. Code can instead ask "how large an allocation
would I get for a given size?".
Introduce kmalloc_size_roundup(), to serve this function so we can start
replacing the "anticipatory resizing" uses of ksize().
[1] https://github.com/ClangBuiltLinux/linux/issues/1599
https://github.com/KSPP/linux/issues/183
[ [email protected]: add SLOB version ]
Cc: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
The __malloc attribute should not be applied to "realloc" functions, as
the returned pointer may alias the storage of the prior pointer. Instead
of splitting __malloc from __alloc_size, which would be a huge amount of
churn, just create __realloc_size for the few cases where it is needed.
Thanks to Geert Uytterhoeven <[email protected]> for reporting build
failures with gcc-8 in earlier version which tried to remove the #ifdef.
While the "alloc_size" attribute is available on all GCC versions, I
forgot that it gets disabled explicitly by the kernel in GCC < 9.1 due
to misbehaviors. Add a note to the compiler_attributes.h entry for it.
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: [email protected]
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
As Christophe JAILLET suggested [1]
In create_unique_id(),
"looks that ID_STR_LENGTH could even be reduced to 32 or 16.
The 2nd BUG_ON at the end of the function could certainly be just
removed as well or remplaced by a:
if (p > name + ID_STR_LENGTH - 1) {
kfree(name);
return -E<something>;
}
"
According to above suggestion, let's do below cleanups:
1. reduce ID_STR_LENGTH to 32, as the buffer size should be enough;
2. use WARN_ON instead of BUG_ON() and return error if check condition
is true;
3. use snprintf instead of sprintf to avoid overflow.
[1] https://lore.kernel.org/linux-mm/[email protected]/
Suggested-by: Christophe JAILLET <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
kmalloc's API family is critical for mm, with one nature that it will
round up the request size to a fixed one (mostly power of 2). Say
when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
could be allocated, so in worst case, there is around 50% memory
space waste.
The wastage is not a big issue for requests that get allocated/freed
quickly, but may cause problems with objects that have longer life
time.
We've met a kernel boot OOM panic (v5.10), and from the dumped slab
info:
[ 26.062145] kmalloc-2k 814056KB 814056KB
From debug we found there are huge number of 'struct iova_magazine',
whose size is 1032 bytes (1024 + 8), so each allocation will waste
1016 bytes. Though the issue was solved by giving the right (bigger)
size of RAM, it is still nice to optimize the size (either use a
kmalloc friendly size or create a dedicated slab for it).
And from lkml archive, there was another crash kernel OOM case [1]
back in 2019, which seems to be related with the similar slab waste
situation, as the log is similar:
[ 4.332648] iommu: Adding device 0000:20:02.0 to group 16
[ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
...
[ 4.857565] kmalloc-2048 59164KB 59164KB
The crash kernel only has 256M memory, and 59M is pretty big here.
(Note: the related code has been changed and optimised in recent
kernel [2], these logs are just picked to demo the problem, also
a patch changing its size to 1024 bytes has been merged)
So add an way to track each kmalloc's memory waste info, and
leverage the existing SLUB debug framework (specifically
SLUB_STORE_USER) to show its call stack of original allocation,
so that user can evaluate the waste situation, identify some hot
spots and optimize accordingly, for a better utilization of memory.
The waste info is integrated into existing interface:
'/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of
'kmalloc-4k' after boot is:
126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1
__kmem_cache_alloc_node+0x11f/0x4e0
__kmalloc_node+0x4e/0x140
ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe]
ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe]
ixgbe_probe+0x165f/0x1d20 [ixgbe]
local_pci_probe+0x78/0xc0
work_for_cpu_fn+0x26/0x40
...
which means in 'kmalloc-4k' slab, there are 126 requests of
2240 bytes which got a 4KB space (wasting 1856 bytes each
and 233856 bytes in total), from ixgbe_alloc_q_vector().
And when system starts some real workload like multiple docker
instances, there could are more severe waste.
[1]. https://lkml.org/lkml/2019/8/12/266
[2]. https://lore.kernel.org/lkml/[email protected]/
[Thanks Hyeonggon for pointing out several bugs about sorting/format]
[Thanks Vlastimil for suggesting way to reduce memory usage of
orig_size and keep it only for kmalloc objects]
Signed-off-by: Feng Tang <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: John Garry <[email protected]>
Cc: Kefeng Wang <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
My series [1] to fix validation races for caches with enabled debugging.
By decoupling the debug cache operation more from non-debug fastpaths,
additional locking simplifications were possible and done afterwards.
Additional cleanup of PREEMPT_RT specific code on top, by Thomas Gleixner.
[1] https://lore.kernel.org/all/[email protected]/
|
|
The "common kmalloc v4" series [1] by Hyeonggon Yoo.
- Improves the mm/slab_common.c wrappers to allow deleting duplicated
code between SLAB and SLUB.
- Large kmalloc() allocations in SLAB are passed to page allocator like
in SLUB, reducing number of kmalloc caches.
- Removes the {kmem_cache_alloc,kmalloc}_node variants of tracepoints,
node id parameter added to non-_node variants.
- 8 files changed, 341 insertions(+), 651 deletions(-)
[1] https://lore.kernel.org/all/[email protected]/
--
Merge resolves trivial conflict in mm/slub.c with commit 5373b8a09d6e
("kasan: call kasan_malloc() from __kmalloc_*track_caller()")
|
|
Trivial fixes and cleanups:
- unneeded variable removals, by ye xingchen
|
|
Commit 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations
__free_slab() invocations out of IRQ context") moved all flush_cpu_slab()
invocations to the global workqueue to avoid a problem related
with deactivate_slab()/__free_slab() being called from an IRQ context
on PREEMPT_RT kernels.
When the flush_all_cpu_locked() function is called from a task context
it may happen that a workqueue with WQ_MEM_RECLAIM bit set ends up
flushing the global workqueue, this will cause a dependency issue.
workqueue: WQ_MEM_RECLAIM nvme-delete-wq:nvme_delete_ctrl_work [nvme_core]
is flushing !WQ_MEM_RECLAIM events:flush_cpu_slab
WARNING: CPU: 37 PID: 410 at kernel/workqueue.c:2637
check_flush_dependency+0x10a/0x120
Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core]
RIP: 0010:check_flush_dependency+0x10a/0x120[ 453.262125] Call Trace:
__flush_work.isra.0+0xbf/0x220
? __queue_work+0x1dc/0x420
flush_all_cpus_locked+0xfb/0x120
__kmem_cache_shutdown+0x2b/0x320
kmem_cache_destroy+0x49/0x100
bioset_exit+0x143/0x190
blk_release_queue+0xb9/0x100
kobject_cleanup+0x37/0x130
nvme_fc_ctrl_free+0xc6/0x150 [nvme_fc]
nvme_free_ctrl+0x1ac/0x2b0 [nvme_core]
Fix this bug by creating a workqueue for the flush operation with
the WQ_MEM_RECLAIM bit set.
Fixes: 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context")
Cc: <[email protected]>
Signed-off-by: Maurizio Lombardi <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
When doing slub_debug test, kfence's 'test_memcache_typesafe_by_rcu'
kunit test case cause a use-after-free error:
BUG: KASAN: use-after-free in kobject_del+0x14/0x30
Read of size 8 at addr ffff888007679090 by task kunit_try_catch/261
CPU: 1 PID: 261 Comm: kunit_try_catch Tainted: G B N 6.0.0-rc5-next-20220916 #17
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x48
print_address_description.constprop.0+0x87/0x2a5
print_report+0x103/0x1ed
kasan_report+0xb7/0x140
kobject_del+0x14/0x30
kmem_cache_destroy+0x130/0x170
test_exit+0x1a/0x30
kunit_try_run_case+0xad/0xc0
kunit_generic_run_threadfn_adapter+0x26/0x50
kthread+0x17b/0x1b0
</TASK>
The cause is inside kmem_cache_destroy():
kmem_cache_destroy
acquire lock/mutex
shutdown_cache
schedule_work(kmem_cache_release) (if RCU flag set)
release lock/mutex
kmem_cache_release (if RCU flag not set)
In some certain timing, the scheduled work could be run before
the next RCU flag checking, which can then get a wrong value
and lead to double kmem_cache_release().
Fix it by caching the RCU flag inside protected area, just like 'refcnt'
Fixes: 0495e337b703 ("mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding slab_mutex/cpu_hotplug_lock")
Signed-off-by: Feng Tang <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
The slub code already has a few helpers depending on PREEMPT_RT. Add a few
more and get rid of the CONFIG_PREEMPT_RT conditionals all over the place.
No functional change.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
The PREEMPT_RT specific disabling of irqs in __cmpxchg_double_slab()
(through slab_[un]lock()) is unnecessary as bit_spin_lock() disables
preemption and that's sufficient on PREEMPT_RT where no allocation/free
operation is performed in hardirq context and so can't interrupt the
current operation.
That means we no longer need the slab_[un]lock() wrappers, so delete
them and rename the current __slab_[un]lock() to slab_[un]lock().
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
|
|
The only remaining user of object_map_lock is list_slab_objects().
Obtaining the lock there used to happen under slab_lock() which implied
disabling irqs on PREEMPT_RT, thus it's a raw_spinlock. With the
slab_lock() removed, we can convert it to a normal spinlock.
Also remove the get_map()/put_map() wrappers as list_slab_objects()
became their only remaining user.
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
|
|
All alloc and free operations on debug caches are now serialized by
n->list_lock, so we can remove slab_lock() usage in validate_slab()
and list_slab_objects() as those also happen under n->list_lock.
Note the usage in list_slab_objects() could happen even on non-debug
caches, but only during cache shutdown time, so there should not be any
parallel freeing activity anymore. Except for buggy slab users, but in
that case the slab_lock() would not help against the common cmpxchg
based fast paths (in non-debug caches) anyway.
Also adjust documentation comments accordingly.
Suggested-by: Hyeonggon Yoo <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Acked-by: David Rientjes <[email protected]>
|
|
Rongwei Wang reports [1] that cache validation triggered by writing to
/sys/kernel/slab/<cache>/validate is racy against normal cache
operations (e.g. freeing) in a way that can cause false positive
inconsistency reports for caches with debugging enabled. The problem is
that debugging actions that mark object free or active and actual
freelist operations are not atomic, and the validation can see an
inconsistent state.
For caches that do or don't have debugging enabled, additional races
involving n->nr_slabs are possible that result in false reports of wrong
slab counts.
This patch attempts to solve these issues while not adding overhead to
normal (especially fastpath) operations for caches that do not have
debugging enabled. Such overhead would not be justified to make possible
userspace-triggered validation safe. Instead, disable the validation for
caches that don't have debugging enabled and make their sysfs validate
handler return -EINVAL.
For caches that do have debugging enabled, we can instead extend the
existing approach of not using percpu freelists to force all alloc/free
operations to the slow paths where debugging flags is checked and acted
upon. There can adjust the debug-specific paths to increase n->list_lock
coverage against concurrent validation as necessary.
The processing on free in free_debug_processing() already happens under
n->list_lock so we can extend it to actually do the freeing as well and
thus make it atomic against concurrent validation. As observed by
Hyeonggon Yoo, we do not really need to take slab_lock() anymore here
because all paths we could race with are protected by n->list_lock under
the new scheme, so drop its usage here.
The processing on alloc in alloc_debug_processing() currently doesn't
take any locks, but we have to first allocate the object from a slab on
the partial list (as debugging caches have no percpu slabs) and thus
take the n->list_lock anyway. Add a function alloc_single_from_partial()
that grabs just the allocated object instead of the whole freelist, and
does the debug processing. The n->list_lock coverage again makes it
atomic against validation and it is also ultimately more efficient than
the current grabbing of freelist immediately followed by slab
deactivation.
To prevent races on n->nr_slabs updates, make sure that for caches with
debugging enabled, inc_slabs_node() or dec_slabs_node() is called under
n->list_lock. When allocating a new slab for a debug cache, handle the
allocation by a new function alloc_single_from_new_slab() instead of the
current forced deactivation path.
Neither of these changes affect the fast paths at all. The changes in
slow paths are negligible for non-debug caches.
[1] https://lore.kernel.org/all/[email protected]/
Reported-by: Rongwei Wang <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
|
|
We were failing to call kasan_malloc() from __kmalloc_*track_caller()
which was causing us to sometimes fail to produce KASAN error reports
for allocations made using e.g. devm_kcalloc(), as the KASAN poison was
not being initialized. Fix it.
Signed-off-by: Peter Collingbourne <[email protected]>
Cc: <[email protected]> # 5.15
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
In create_unique_id(), kmalloc(, GFP_KERNEL) can fail due to
out-of-memory, if it fails, return errno correctly rather than
triggering panic via BUG_ON();
kernel BUG at mm/slub.c:5893!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Call trace:
sysfs_slab_add+0x258/0x260 mm/slub.c:5973
__kmem_cache_create+0x60/0x118 mm/slub.c:4899
create_cache mm/slab_common.c:229 [inline]
kmem_cache_create_usercopy+0x19c/0x31c mm/slab_common.c:335
kmem_cache_create+0x1c/0x28 mm/slab_common.c:390
f2fs_kmem_cache_create fs/f2fs/f2fs.h:2766 [inline]
f2fs_init_xattr_caches+0x78/0xb4 fs/f2fs/xattr.c:808
f2fs_fill_super+0x1050/0x1e0c fs/f2fs/super.c:4149
mount_bdev+0x1b8/0x210 fs/super.c:1400
f2fs_mount+0x44/0x58 fs/f2fs/super.c:4512
legacy_get_tree+0x30/0x74 fs/fs_context.c:610
vfs_get_tree+0x40/0x140 fs/super.c:1530
do_new_mount+0x1dc/0x4e4 fs/namespace.c:3040
path_mount+0x358/0x914 fs/namespace.c:3370
do_mount fs/namespace.c:3383 [inline]
__do_sys_mount fs/namespace.c:3591 [inline]
__se_sys_mount fs/namespace.c:3568 [inline]
__arm64_sys_mount+0x2f8/0x408 fs/namespace.c:3568
Cc: <[email protected]>
Fixes: 81819f0fc8285 ("SLUB core")
Reported-by: [email protected]
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Signed-off-by: Chao Yu <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
slab_mutex/cpu_hotplug_lock
A circular locking problem is reported by lockdep due to the following
circular locking dependency.
+--> cpu_hotplug_lock --> slab_mutex --> kn->active --+
| |
+-----------------------------------------------------+
The forward cpu_hotplug_lock ==> slab_mutex ==> kn->active dependency
happens in
kmem_cache_destroy(): cpus_read_lock(); mutex_lock(&slab_mutex);
==> sysfs_slab_unlink()
==> kobject_del()
==> kernfs_remove()
==> __kernfs_remove()
==> kernfs_drain(): rwsem_acquire(&kn->dep_map, ...);
The backward kn->active ==> cpu_hotplug_lock dependency happens in
kernfs_fop_write_iter(): kernfs_get_active();
==> slab_attr_store()
==> cpu_partial_store()
==> flush_all(): cpus_read_lock()
One way to break this circular locking chain is to avoid holding
cpu_hotplug_lock and slab_mutex while deleting the kobject in
sysfs_slab_unlink() which should be equivalent to doing a write_lock
and write_unlock pair of the kn->active virtual lock.
Since the kobject structures are not protected by slab_mutex or the
cpu_hotplug_lock, we can certainly release those locks before doing
the delete operation.
Move sysfs_slab_unlink() and sysfs_slab_release() to the newly
created kmem_cache_release() and call it outside the slab_mutex &
cpu_hotplug_lock critical sections. There will be a slight delay
in the deletion of sysfs files if kmem_cache_release() is called
indirectly from a work function.
Fixes: 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context")
Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: David Rientjes <[email protected]>
Link: https://lore.kernel.org/all/YwOImVd+nRUsSAga@hyeyoo/
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
If address of large object is not beginning of folio or size of the
folio is too small, it must be invalid. WARN() and return 0 in such
cases.
Cc: Marco Elver <[email protected]>
Suggested-by: Vlastimil Babka <[email protected]>
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
__ksize() is only called by KASAN. Remove export symbol and move
declaration to mm/slab.h as we don't want to grow its callers.
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
Drop kmem_alloc event class, and define kmalloc and kmem_cache_alloc
using TRACE_EVENT() macro.
And then this patch does:
- Do not pass pointer to struct kmem_cache to trace_kmalloc.
gfp flag is enough to know if it's accounted or not.
- Avoid dereferencing s->object_size and s->size when not using kmem_cache_alloc event.
- Avoid dereferencing s->name in when not using kmem_cache_free event.
- Adjust s->size to SLOB_UNITS(s->size) * SLOB_UNIT in SLOB
Cc: Vasily Averin <[email protected]>
Suggested-by: Vlastimil Babka <[email protected]>
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
Drop kmem_alloc event class, rename kmem_alloc_node to kmem_alloc, and
remove _node postfix for NUMA version of tracepoints.
This will break some tools that depend on {kmem_cache_alloc,kmalloc}_node,
but at this point maintaining both kmem_alloc and kmem_alloc_node
event classes does not makes sense at all.
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
Despite its name, kmem_cache_alloc[_node]_trace() is hook for inlined
kmalloc. So rename it to kmalloc[_node]_trace().
Move its implementation to slab_common.c by using
__kmem_cache_alloc_node(), but keep CONFIG_TRACING=n varients to save a
function call when CONFIG_TRACING=n.
Use __assume_kmalloc_alignment for kmalloc[_node]_trace instead of
__assume_slab_alignement. Generally kmalloc has larger alignment
requirements.
Suggested-by: Vlastimil Babka <[email protected]>
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
Now everything in kmalloc subsystem can be generalized.
Let's do it!
Generalize __do_kmalloc_node(), __kmalloc_node_track_caller(),
kfree(), __ksize(), __kmalloc(), __kmalloc_node() and move them
to slab_common.c.
In the meantime, rename kmalloc_large_node_notrace()
to __kmalloc_large_node() and make it static as it's now only called in
slab_common.c.
[ [email protected]: adjust kfence skip list to include
__kmem_cache_free so that kfence kunit tests do not fail ]
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
In the following patch, the function free_debug_processing() will be
calling add_partial(), remove_partial() and discard_slab(), se move it
below their definitions to avoid forward declarations. To make review
easier, separate the move from functional changes.
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Acked-by: David Rientjes <[email protected]>
|
|
To unify kmalloc functions in later patch, introduce common alloc/free
functions that does not have tracepoint.
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
There is not much benefit for serving large objects in kmalloc().
Let's pass large requests to page allocator like SLUB for better
maintenance of common code.
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
Now that kmalloc_large() and kmalloc_large_node() do mostly same job,
make kmalloc_large() wrapper of kmalloc_large_node_notrace().
In the meantime, add missing flag fix code in
kmalloc_large_node_notrace().
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
Now that kmalloc_large_node() is in common code, pass large requests
to page allocator in kmalloc_node() using kmalloc_large_node().
One problem is that currently there is no tracepoint in
kmalloc_large_node(). Instead of simply putting tracepoint in it,
use kmalloc_large_node{,_notrace} depending on its caller to show
useful address for both inlined kmalloc_node() and
__kmalloc_node_track_caller() when large objects are allocated.
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
In later patch SLAB will also pass requests larger than order-1 page
to page allocator. Move kmalloc_large_node() to slab_common.c.
Fold kmalloc_large_node_hook() into kmalloc_large_node() as there is
no other caller.
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
There is no caller of kmalloc_order_trace() except kmalloc_large().
Fold it into kmalloc_large() and remove kmalloc_order{,_trace}().
Also add tracepoint in kmalloc_large() that was previously
in kmalloc_order_trace().
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
__kmalloc(), __kmalloc_node(), __kmalloc_node_track_caller()
mostly do same job. Factor out common code into __do_kmalloc_node().
Note that this patch also fixes missing kasan_kmalloc() in SLUB's
__kmalloc_node_track_caller().
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
Make kmalloc_track_caller() wrapper of kmalloc_node_track_caller().
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
Now that slab_alloc_node() is available for SLAB when CONFIG_NUMA=n,
remove CONFIG_NUMA ifdefs for common kmalloc functions.
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
Make slab_alloc_node() available even when CONFIG_NUMA=n and
make slab_alloc() wrapper of slab_alloc_node().
This is necessary for further cleanup.
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
To implement slab_alloc_node() independent of NUMA configuration,
move NUMA fallback/alternate allocation code into __do_cache_alloc().
One functional change here is not to check availability of node
when allocating from local node.
Signed-off-by: Hyeonggon Yoo <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
Return the value from attribute->store(s, buf, len) and
attribute->show(s, buf) directly instead of storing it in
another redundant variable.
Reported-by: Zeal Robot <[email protected]>
Acked-by: Hyeonggon Yoo <[email protected]>
Signed-off-by: ye xingchen <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
Return the value from __kmem_cache_shrink() directly instead of storing it
in another redundant variable.
Reported-by: Zeal Robot <[email protected]>
Signed-off-by: ye xingchen <[email protected]>
Acked-by: Hyeonggon Yoo <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
"Misc irqchip fixes: LoongArch driver fixes and a Hyper-V IOMMU fix"
* tag 'irq-urgent-2022-08-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/loongson-liointc: Fix an error handling path in liointc_init()
irqchip/loongarch: Fix irq_domain_alloc_fwnode() abuse
irqchip/loongson-pch-pic: Move find_pch_pic() into CONFIG_ACPI
irqchip/loongson-eiointc: Fix a build warning
irqchip/loongson-eiointc: Fix irq affinity setting
iommu/hyper-v: Use helper instead of directly accessing affinity
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 kprobes fix from Ingo Molnar:
"Fix a kprobes bug in JNG/JNLE emulation when a kprobe is installed at
such instructions, possibly resulting in incorrect execution (the
wrong branch taken)"
* tag 'perf-urgent-2022-08-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kprobes: Fix JNG/JNLE emulation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fixes for tracing:
- Fix a return value of traceprobe_parse_event_name()
- Fix NULL pointer dereference from failed ftrace enabling
- Fix NULL pointer dereference when asking for registers from eprobes
- Make eprobes consistent with kprobes/uprobes, filters and
histograms"
* tag 'trace-v6.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have filter accept "common_cpu" to be consistent
tracing/probes: Have kprobes and uprobes use $COMM too
tracing/eprobes: Have event probes be consistent with kprobes and uprobes
tracing/eprobes: Fix reading of string fields
tracing/eprobes: Do not hardcode $comm as a string
tracing/eprobes: Do not allow eprobes to use $stack, or % for regs
ftrace: Fix NULL pointer dereference in is_ftrace_trampoline when ftrace is dead
tracing/perf: Fix double put of trace event when init fails
tracing: React to error return from traceprobe_parse_event_name()
|
|
Make filtering consistent with histograms. As "cpu" can be a field of an
event, allow for "common_cpu" to keep it from being confused with the
"cpu" field of the event.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/
Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tzvetomir Stoyanov <[email protected]>
Cc: Tom Zanussi <[email protected]>
Fixes: 1e3bac71c5053 ("tracing/histogram: Rename "cpu" to "common_cpu"")
Suggested-by: Masami Hiramatsu (Google) <[email protected]>
Acked-by: Masami Hiramatsu (Google) <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Both $comm and $COMM can be used to get current->comm in eprobes and the
filtering and histogram logic. Make kprobes and uprobes consistent in this
regard and allow both $comm and $COMM as well. Currently kprobes and
uprobes only handle $comm, which is inconsistent with the other utilities,
and can be confusing to users.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/
Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tzvetomir Stoyanov <[email protected]>
Cc: Tom Zanussi <[email protected]>
Fixes: 533059281ee5 ("tracing: probeevent: Introduce new argument fetching code")
Suggested-by: Masami Hiramatsu (Google) <[email protected]>
Acked-by: Masami Hiramatsu (Google) <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Currently, if a symbol "@" is attempted to be used with an event probe
(eprobes), it will cause a NULL pointer dereference crash.
Both kprobes and uprobes can reference data other than the main registers.
Such as immediate address, symbols and the current task name. Have eprobes
do the same thing.
For "comm", if "comm" is used and the event being attached to does not
have the "comm" field, then make it the "$comm" that kprobes has. This is
consistent to the way histograms and filters work.
Link: https://lkml.kernel.org/r/[email protected]
Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Tzvetomir Stoyanov <[email protected]>
Cc: Tom Zanussi <[email protected]>
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Currently when an event probe (eprobe) hooks to a string field, it does
not display it as a string, but instead as a number. This makes the field
rather useless. Handle the different kinds of strings, dynamic, static,
relational/dynamic etc.
Now when a string field is used, the ":string" type can be used to display
it:
echo "e:sw sched/sched_switch comm=$next_comm:string" > dynamic_events
Link: https://lkml.kernel.org/r/[email protected]
Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tzvetomir Stoyanov <[email protected]>
Cc: Tom Zanussi <[email protected]>
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|