Age | Commit message (Collapse) | Author | Files | Lines |
|
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Reviewed-by: Mark Rutland <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Hyeonggon Yoo <[email protected]>
Tested-by: Mark Rutland <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
struct 'kasan_cache' has a member 'is_kmalloc' indicating whether its host
kmem_cache is a kmalloc cache. With newly introduced is_kmalloc_cache()
helper, 'is_kmalloc' and its related function can be replaced and removed.
Also 'kasan_cache' is only needed by KASAN generic mode, and not by SW/HW
tag modes, so refine its protection macro accordingly, suggested by Andrey
Konoval.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Feng Tang <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
SLUB gets most of its scalability by percpu slabs. However for
CONFIG_SLUB_TINY the goal is minimal memory overhead, not scalability.
Thus, #ifdef out the whole kmem_cache_cpu percpu structure and
associated code. Additionally to the slab page savings, this reduces
percpu allocator usage, and code size.
This change builds on recent commit c7323a5ad078 ("mm/slub: restrict
sysfs validation to debug caches and make it safe"), as caches with
enabled debugging also avoid percpu slabs and all allocations and
freeing ends up working with the partial list. With a bit more
refactoring by the preceding patches, use the same code paths with
CONFIG_SLUB_TINY.
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
|
|
Currently SLUB enables its sysfs support depending unconditionally on
the general CONFIG_SYSFS setting. To reduce the configuration
combination space, make CONFIG_SLUB_TINY disable SLUB's sysfs support by
reusing the existing SLAB_SUPPORTS_SYSFS define. It is unlikely that
real tiny systems would combine CONFIG_SLUB_TINY with CONFIG_SYSFS, but
a randconfig might.
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
|
|
With CONFIG_HARDENED_USERCOPY not enabled, there are no
__check_heap_object() checks happening that would use the struct
kmem_cache useroffset and usersize fields. Yet the fields are still
initialized, preventing merging of otherwise compatible caches.
Also the fields contribute to struct kmem_cache size unnecessarily when
unused. Thus #ifdef them out completely when CONFIG_HARDENED_USERCOPY is
disabled. In kmem_dump_obj() print object_size instead of usersize, as
that's actually the intention.
In a quick virtme boot test, this has reduced the number of caches in
/proc/slabinfo from 131 to 111.
Cc: Kees Cook <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
|
|
max field holds the largest slab order that was ever used for a slab cache.
But it's unused now. Remove it.
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
KASAN, KFENCE and memcg interact with SLAB or SLUB internals through
functions nearest_obj(), obj_to_index() and objs_per_slab() that use
struct page as parameter. This patch converts it to struct slab
including all callers, through a coccinelle semantic patch.
// Options: --include-headers --no-includes --smpl-spacing include/linux/slab_def.h include/linux/slub_def.h mm/slab.h mm/kasan/*.c mm/kfence/kfence_test.c mm/memcontrol.c mm/slab.c mm/slub.c
// Note: needs coccinelle 1.1.1 to avoid breaking whitespace
@@
@@
-objs_per_slab_page(
+objs_per_slab(
...
)
{ ... }
@@
@@
-objs_per_slab_page(
+objs_per_slab(
...
)
@@
identifier fn =~ "obj_to_index|objs_per_slab";
@@
fn(...,
- const struct page *page
+ const struct slab *slab
,...)
{
<...
(
- page_address(page)
+ slab_address(slab)
|
- page
+ slab
)
...>
}
@@
identifier fn =~ "nearest_obj";
@@
fn(...,
- struct page *page
+ const struct slab *slab
,...)
{
<...
(
- page_address(page)
+ slab_address(slab)
|
- page
+ slab
)
...>
}
@@
identifier fn =~ "nearest_obj|obj_to_index|objs_per_slab";
expression E;
@@
fn(...,
(
- slab_page(E)
+ E
|
- virt_to_page(E)
+ virt_to_slab(E)
|
- virt_to_head_page(E)
+ virt_to_slab(E)
|
- page
+ page_slab(page)
)
,...)
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Julia Lawall <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
|
|
Update comments mentioning pages to mention slabs where appropriate.
Also some goto labels.
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
|
|
The majority of conversion from struct page to struct slab in SLUB
internals can be delegated to a coccinelle semantic patch. This includes
renaming of variables with 'page' in name to 'slab', and similar.
Big thanks to Julia Lawall and Luis Chamberlain for help with
coccinelle.
// Options: --include-headers --no-includes --smpl-spacing include/linux/slub_def.h mm/slub.c
// Note: needs coccinelle 1.1.1 to avoid breaking whitespace, and ocaml for the
// embedded script
// build list of functions to exclude from applying the next rule
@initialize:ocaml@
@@
let ok_function p =
not (List.mem (List.hd p).current_element ["nearest_obj";"obj_to_index";"objs_per_slab_page";"__slab_lock";"__slab_unlock";"free_nonslab_page";"kmalloc_large_node"])
// convert the type from struct page to struct page in all functions except the
// list from previous rule
// this also affects struct kmem_cache_cpu, but that's ok
@@
position p : script:ocaml() { ok_function p };
@@
- struct page@p
+ struct slab
// in struct kmem_cache_cpu, change the name from page to slab
// the type was already converted by the previous rule
@@
@@
struct kmem_cache_cpu {
...
-struct slab *page;
+struct slab *slab;
...
}
// there are many places that use c->page which is now c->slab after the
// previous rule
@@
struct kmem_cache_cpu *c;
@@
-c->page
+c->slab
@@
@@
struct kmem_cache {
...
- unsigned int cpu_partial_pages;
+ unsigned int cpu_partial_slabs;
...
}
@@
struct kmem_cache *s;
@@
- s->cpu_partial_pages
+ s->cpu_partial_slabs
@@
@@
static void
- setup_page_debug(
+ setup_slab_debug(
...)
{...}
@@
@@
- setup_page_debug(
+ setup_slab_debug(
...);
// for all functions (with exceptions), change any "struct slab *page"
// parameter to "struct slab *slab" in the signature, and generally all
// occurences of "page" to "slab" in the body - with some special cases.
@@
identifier fn !~ "free_nonslab_page|obj_to_index|objs_per_slab_page|nearest_obj";
@@
fn(...,
- struct slab *page
+ struct slab *slab
,...)
{
<...
- page
+ slab
...>
}
// similar to previous but the param is called partial_page
@@
identifier fn;
@@
fn(...,
- struct slab *partial_page
+ struct slab *partial_slab
,...)
{
<...
- partial_page
+ partial_slab
...>
}
// similar to previous but for functions that take pointer to struct page ptr
@@
identifier fn;
@@
fn(...,
- struct slab **ret_page
+ struct slab **ret_slab
,...)
{
<...
- ret_page
+ ret_slab
...>
}
// functions converted by previous rules that were temporarily called using
// slab_page(E) so we want to remove the wrapper now that they accept struct
// slab ptr directly
@@
identifier fn =~ "slab_free|do_slab_free";
expression E;
@@
fn(...,
- slab_page(E)
+ E
,...)
// similar to previous but for another pattern
@@
identifier fn =~ "slab_pad_check|check_object";
@@
fn(...,
- folio_page(folio, 0)
+ slab
,...)
// functions that were returning struct page ptr and now will return struct
// slab ptr, including slab_page() wrapper removal
@@
identifier fn =~ "allocate_slab|new_slab";
expression E;
@@
static
-struct slab *
+struct slab *
fn(...)
{
<...
- slab_page(E)
+ E
...>
}
// rename any former struct page * declarations
@@
@@
struct slab *
(
- page
+ slab
|
- partial_page
+ partial_slab
|
- oldpage
+ oldslab
)
;
// this has to be separate from previous rule as page and page2 appear at the
// same line
@@
@@
struct slab *
-page2
+slab2
;
// similar but with initial assignment
@@
expression E;
@@
struct slab *
(
- page
+ slab
|
- flush_page
+ flush_slab
|
- discard_page
+ slab_to_discard
|
- page_to_unfreeze
+ slab_to_unfreeze
)
= E;
// convert most of struct page to struct slab usage inside functions (with
// exceptions), including specific variable renames
@@
identifier fn !~ "nearest_obj|obj_to_index|objs_per_slab_page|__slab_(un)*lock|__free_slab|free_nonslab_page|kmalloc_large_node";
expression E;
@@
fn(...)
{
<...
(
- int pages;
+ int slabs;
|
- int pages = E;
+ int slabs = E;
|
- page
+ slab
|
- flush_page
+ flush_slab
|
- partial_page
+ partial_slab
|
- oldpage->pages
+ oldslab->slabs
|
- oldpage
+ oldslab
|
- unsigned int nr_pages;
+ unsigned int nr_slabs;
|
- nr_pages
+ nr_slabs
|
- unsigned int partial_pages = E;
+ unsigned int partial_slabs = E;
|
- partial_pages
+ partial_slabs
)
...>
}
// this has to be split out from the previous rule so that lines containing
// multiple matching changes will be fully converted
@@
identifier fn !~ "nearest_obj|obj_to_index|objs_per_slab_page|__slab_(un)*lock|__free_slab|free_nonslab_page|kmalloc_large_node";
@@
fn(...)
{
<...
(
- slab->pages
+ slab->slabs
|
- pages
+ slabs
|
- page2
+ slab2
|
- discard_page
+ slab_to_discard
|
- page_to_unfreeze
+ slab_to_unfreeze
)
...>
}
// after we simply changed all occurences of page to slab, some usages need
// adjustment for slab-specific functions, or use slab_page() wrapper
@@
identifier fn !~ "nearest_obj|obj_to_index|objs_per_slab_page|__slab_(un)*lock|__free_slab|free_nonslab_page|kmalloc_large_node";
@@
fn(...)
{
<...
(
- page_slab(slab)
+ slab
|
- kasan_poison_slab(slab)
+ kasan_poison_slab(slab_page(slab))
|
- page_address(slab)
+ slab_address(slab)
|
- page_size(slab)
+ slab_size(slab)
|
- PageSlab(slab)
+ folio_test_slab(slab_folio(slab))
|
- page_to_nid(slab)
+ slab_nid(slab)
|
- compound_order(slab)
+ slab_order(slab)
)
...>
}
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Reviewed-by: Hyeonggon Yoo <[email protected]>
Tested-by: Hyeonggon Yoo <[email protected]>
Cc: Julia Lawall <[email protected]>
Cc: Luis Chamberlain <[email protected]>
|
|
There are no callers outside of mm/slub.c anymore.
Move freelist_corrupted() that calls object_err() to avoid a need for
forward declaration.
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
|
|
With CONFIG_SLUB_CPU_PARTIAL enabled, SLUB keeps a percpu list of
partial slabs that can be promoted to cpu slab when the previous one is
depleted, without accessing the shared partial list. A slab can be
added to this list by 1) refill of an empty list from get_partial_node()
- once we really have to access the shared partial list, we acquire
multiple slabs to amortize the cost of locking, and 2) first free to a
previously full slab - instead of putting the slab on a shared partial
list, we can more cheaply freeze it and put it on the per-cpu list.
To control how large a percpu partial list can grow for a kmem cache,
set_cpu_partial() calculates a target number of free objects on each
cpu's percpu partial list, and this can be also set by the sysfs file
cpu_partial.
However, the tracking of actual number of objects is imprecise, in order
to limit overhead from cpu X freeing an objects to a slab on percpu
partial list of cpu Y. Basically, the percpu partial slabs form a
single linked list, and when we add a new slab to the list with current
head "oldpage", we set in the struct page of the slab we're adding:
page->pages = oldpage->pages + 1; // this is precise
page->pobjects = oldpage->pobjects + (page->objects - page->inuse);
page->next = oldpage;
Thus the real number of free objects in the slab (objects - inuse) is
only determined at the moment of adding the slab to the percpu partial
list, and further freeing doesn't update the pobjects counter nor
propagate it to the current list head. As Jann reports [1], this can
easily lead to large inaccuracies, where the target number of objects
(up to 30 by default) can translate to the same number of (empty) slab
pages on the list. In case 2) above, we put a slab with 1 free object
on the list, thus only increase page->pobjects by 1, even if there are
subsequent frees on the same slab. Jann has noticed this in practice
and so did we [2] when investigating significant increase of kmemcg
usage after switching from SLAB to SLUB.
While this is no longer a problem in kmemcg context thanks to the
accounting rewrite in 5.9, the memory waste is still not ideal and it's
questionable whether it makes sense to perform free object count based
control when object counts can easily become so much inaccurate. So
this patch converts the accounting to be based on number of pages only
(which is precise) and removes the page->pobjects field completely.
This is also ultimately simpler.
To retain the existing set_cpu_partial() heuristic, first calculate the
target number of objects as previously, but then convert it to target
number of pages by assuming the pages will be half-filled on average.
This assumption might obviously also be inaccurate in practice, but
cannot degrade to actual number of pages being equal to the target
number of objects.
We could also skip the intermediate step with target number of objects
and rewrite the heuristic in terms of pages. However we still have the
sysfs file cpu_partial which uses number of objects and could break
existing users if it suddenly becomes number of pages, so this patch
doesn't do that.
In practice, after this patch the heuristics limit the size of percpu
partial list up to 2 pages. In case of a reported regression (which
would mean some workload has benefited from the previous imprecise
object based counting), we can tune the heuristics to get a better
compromise within the new scheme, while still avoid the unexpectedly
long percpu partial lists.
[1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/
[2] https://lore.kernel.org/all/[email protected]/
==========
Evaluation
==========
Mel was kind enough to run v1 through mmtests machinery for netperf
(localhost) and hackbench and, for most significant results see below.
So there are some apparent regressions, especially with hackbench, which
I think ultimately boils down to having shorter percpu partial lists on
average and some benchmarks benefiting from longer ones. Monitoring
slab usage also indicated less memory usage by slab. Based on that, the
following patch will bump the defaults to allow longer percpu partial
lists than after this patch.
However the goal is certainly not such that we would limit the percpu
partial lists to 30 pages just because previously a specific alloc/free
pattern could lead to the limit of 30 objects translate to a limit to 30
pages - that would make little sense. This is a correctness patch, and
if a workload benefits from larger lists, the sysfs tuning knobs are
still there to allow that.
Netperf
2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM
TCP-RR:
hmean before 127045.79 after 121092.94 (-4.69%, worse)
stddev before 2634.37 after 1254.08
UDP-RR:
hmean before 166985.45 after 160668.94 ( -3.78%, worse)
stddev before 4059.69 after 1943.63
2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM
TCP-RR:
hmean before 84173.25 after 76914.72 ( -8.62%, worse)
UDP-RR:
hmean before 93571.12 after 96428.69 ( 3.05%, better)
stddev before 23118.54 after 16828.14
2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM
TCP-RR:
hmean before 49984.92 after 48922.27 ( -2.13%, worse)
stddev before 6248.15 after 4740.51
UDP-RR:
hmean before 61854.31 after 68761.81 ( 11.17%, better)
stddev before 4093.54 after 5898.91
other machines - within 2%
Hackbench
(results before and after the patch, negative % means worse)
2-socket AMD EPYC 7713 (64 cores, 128 threads per core), 256GB RAM
hackbench-process-sockets
Amean 1 0.5380 0.5583 ( -3.78%)
Amean 4 0.7510 0.8150 ( -8.52%)
Amean 7 0.7930 0.9533 ( -20.22%)
Amean 12 0.7853 1.1313 ( -44.06%)
Amean 21 1.1520 1.4993 ( -30.15%)
Amean 30 1.6223 1.9237 ( -18.57%)
Amean 48 2.6767 2.9903 ( -11.72%)
Amean 79 4.0257 5.1150 ( -27.06%)
Amean 110 5.5193 7.4720 ( -35.38%)
Amean 141 7.2207 9.9840 ( -38.27%)
Amean 172 8.4770 12.1963 ( -43.88%)
Amean 203 9.6473 14.3137 ( -48.37%)
Amean 234 11.3960 18.7917 ( -64.90%)
Amean 265 13.9627 22.4607 ( -60.86%)
Amean 296 14.9163 26.0483 ( -74.63%)
hackbench-thread-sockets
Amean 1 0.5597 0.5877 ( -5.00%)
Amean 4 0.7913 0.8960 ( -13.23%)
Amean 7 0.8190 1.0017 ( -22.30%)
Amean 12 0.9560 1.1727 ( -22.66%)
Amean 21 1.7587 1.5660 ( 10.96%)
Amean 30 2.4477 1.9807 ( 19.08%)
Amean 48 3.4573 3.0630 ( 11.41%)
Amean 79 4.7903 5.1733 ( -8.00%)
Amean 110 6.1370 7.4220 ( -20.94%)
Amean 141 7.5777 9.2617 ( -22.22%)
Amean 172 9.2280 11.0907 ( -20.18%)
Amean 203 10.2793 13.3470 ( -29.84%)
Amean 234 11.2410 17.1070 ( -52.18%)
Amean 265 12.5970 23.3323 ( -85.22%)
Amean 296 17.1540 24.2857 ( -41.57%)
2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads
per socket), 384GB RAM
hackbench-process-sockets
Amean 1 0.5760 0.4793 ( 16.78%)
Amean 4 0.9430 0.9707 ( -2.93%)
Amean 7 1.5517 1.8843 ( -21.44%)
Amean 12 2.4903 2.7267 ( -9.49%)
Amean 21 3.9560 4.2877 ( -8.38%)
Amean 30 5.4613 5.8343 ( -6.83%)
Amean 48 8.5337 9.2937 ( -8.91%)
Amean 79 14.0670 15.2630 ( -8.50%)
Amean 110 19.2253 21.2467 ( -10.51%)
Amean 141 23.7557 25.8550 ( -8.84%)
Amean 172 28.4407 29.7603 ( -4.64%)
Amean 203 33.3407 33.9927 ( -1.96%)
Amean 234 38.3633 39.1150 ( -1.96%)
Amean 265 43.4420 43.8470 ( -0.93%)
Amean 296 48.3680 48.9300 ( -1.16%)
hackbench-thread-sockets
Amean 1 0.6080 0.6493 ( -6.80%)
Amean 4 1.0000 1.0513 ( -5.13%)
Amean 7 1.6607 2.0260 ( -22.00%)
Amean 12 2.7637 2.9273 ( -5.92%)
Amean 21 5.0613 4.5153 ( 10.79%)
Amean 30 6.3340 6.1140 ( 3.47%)
Amean 48 9.0567 9.5577 ( -5.53%)
Amean 79 14.5657 15.7983 ( -8.46%)
Amean 110 19.6213 21.6333 ( -10.25%)
Amean 141 24.1563 26.2697 ( -8.75%)
Amean 172 28.9687 30.2187 ( -4.32%)
Amean 203 33.9763 34.6970 ( -2.12%)
Amean 234 38.8647 39.3207 ( -1.17%)
Amean 265 44.0813 44.1507 ( -0.16%)
Amean 296 49.2040 49.4330 ( -0.47%)
2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads
per socket), 512GB RAM
hackbench-process-sockets
Amean 1 0.5027 0.5017 ( 0.20%)
Amean 4 1.1053 1.2033 ( -8.87%)
Amean 7 1.8760 2.1820 ( -16.31%)
Amean 12 2.9053 3.1810 ( -9.49%)
Amean 21 4.6777 4.9920 ( -6.72%)
Amean 30 6.5180 6.7827 ( -4.06%)
Amean 48 10.0710 10.5227 ( -4.48%)
Amean 79 16.4250 17.5053 ( -6.58%)
Amean 110 22.6203 24.4617 ( -8.14%)
Amean 141 28.0967 31.0363 ( -10.46%)
Amean 172 34.4030 36.9233 ( -7.33%)
Amean 203 40.5933 43.0850 ( -6.14%)
Amean 234 46.6477 48.7220 ( -4.45%)
Amean 265 53.0530 53.9597 ( -1.71%)
Amean 296 59.2760 59.9213 ( -1.09%)
hackbench-thread-sockets
Amean 1 0.5363 0.5330 ( 0.62%)
Amean 4 1.1647 1.2157 ( -4.38%)
Amean 7 1.9237 2.2833 ( -18.70%)
Amean 12 2.9943 3.3110 ( -10.58%)
Amean 21 4.9987 5.1880 ( -3.79%)
Amean 30 6.7583 7.0043 ( -3.64%)
Amean 48 10.4547 10.8353 ( -3.64%)
Amean 79 16.6707 17.6790 ( -6.05%)
Amean 110 22.8207 24.4403 ( -7.10%)
Amean 141 28.7090 31.0533 ( -8.17%)
Amean 172 34.9387 36.8260 ( -5.40%)
Amean 203 41.1567 43.0450 ( -4.59%)
Amean 234 47.3790 48.5307 ( -2.43%)
Amean 265 53.9543 54.6987 ( -1.38%)
Amean 296 60.0820 60.2163 ( -0.22%)
1-socket Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz (4 cores, 8 threads),
32 GB RAM
hackbench-process-sockets
Amean 1 1.4760 1.5773 ( -6.87%)
Amean 3 3.9370 4.0910 ( -3.91%)
Amean 5 6.6797 6.9357 ( -3.83%)
Amean 7 9.3367 9.7150 ( -4.05%)
Amean 12 15.7627 16.1400 ( -2.39%)
Amean 18 23.5360 23.6890 ( -0.65%)
Amean 24 31.0663 31.3137 ( -0.80%)
Amean 30 38.7283 39.0037 ( -0.71%)
Amean 32 41.3417 41.6097 ( -0.65%)
hackbench-thread-sockets
Amean 1 1.5250 1.6043 ( -5.20%)
Amean 3 4.0897 4.2603 ( -4.17%)
Amean 5 6.7760 7.0933 ( -4.68%)
Amean 7 9.4817 9.9157 ( -4.58%)
Amean 12 15.9610 16.3937 ( -2.71%)
Amean 18 23.9543 24.3417 ( -1.62%)
Amean 24 31.4400 31.7217 ( -0.90%)
Amean 30 39.2457 39.5467 ( -0.77%)
Amean 32 41.8267 42.1230 ( -0.71%)
2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads
per socket), 64GB RAM
hackbench-process-sockets
Amean 1 1.0347 1.0880 ( -5.15%)
Amean 4 1.7267 1.8527 ( -7.30%)
Amean 7 2.6707 2.8110 ( -5.25%)
Amean 12 4.1617 4.3383 ( -4.25%)
Amean 21 7.0070 7.2600 ( -3.61%)
Amean 30 9.9187 10.2397 ( -3.24%)
Amean 48 15.6710 16.3923 ( -4.60%)
Amean 79 24.7743 26.1247 ( -5.45%)
Amean 110 34.3000 35.9307 ( -4.75%)
Amean 141 44.2043 44.8010 ( -1.35%)
Amean 172 54.2430 54.7260 ( -0.89%)
Amean 192 60.6557 60.9777 ( -0.53%)
hackbench-thread-sockets
Amean 1 1.0610 1.1353 ( -7.01%)
Amean 4 1.7543 1.9140 ( -9.10%)
Amean 7 2.7840 2.9573 ( -6.23%)
Amean 12 4.3813 4.4937 ( -2.56%)
Amean 21 7.3460 7.5350 ( -2.57%)
Amean 30 10.2313 10.5190 ( -2.81%)
Amean 48 15.9700 16.5940 ( -3.91%)
Amean 79 25.3973 26.6637 ( -4.99%)
Amean 110 35.1087 36.4797 ( -3.91%)
Amean 141 45.8220 46.3053 ( -1.05%)
Amean 172 55.4917 55.7320 ( -0.43%)
Amean 192 62.7490 62.5410 ( 0.33%)
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Reported-by: Jann Horn <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions of
local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT that's
equivalent, with better lockdep visibility. On PREEMPT_RT that means better
preemption.
However, the cost on PREEMPT_RT is the loss of lockless fast paths which only
work with cpu freelist. Those are designed to detect and recover from being
preempted by other conflicting operations (both fast or slow path), but the
slow path operations assume they cannot be preempted by a fast path operation,
which is guaranteed naturally with disabled irqs. With local locks on
PREEMPT_RT, the fast paths now also need to take the local lock to avoid races.
In the allocation fastpath slab_alloc_node() we can just defer to the slowpath
__slab_alloc() which also works with cpu freelist, but under the local lock.
In the free fastpath do_slab_free() we have to add a new local lock protected
version of freeing to the cpu freelist, as the existing slowpath only works
with the page freelist.
Also update the comment about locking scheme in SLUB to reflect changes done
by this series.
[ Mike Galbraith <[email protected]>: use local_lock() without irq in PREEMPT_RT
scope; debugging of RT crashes resulting in put_cpu_partial() locking changes ]
Signed-off-by: Vlastimil Babka <[email protected]>
|
|
Inserts KFENCE hooks into the SLUB allocator.
To pass the originally requested size to KFENCE, add an argument
'orig_size' to slab_alloc*(). The additional argument is required to
preserve the requested original size for kmalloc() allocations, which
uses size classes (e.g. an allocation of 272 bytes will return an object
of size 512). Therefore, kmem_cache::size does not represent the
kmalloc-caller's requested size, and we must introduce the argument
'orig_size' to propagate the originally requested size to KFENCE.
Without the originally requested size, we would not be able to detect
out-of-bounds accesses for objects placed at the end of a KFENCE object
page if that object is not equal to the kmalloc-size class it was
bucketed into.
When KFENCE is disabled, there is no additional overhead, since
slab_alloc*() functions are __always_inline.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Co-developed-by: Marco Elver <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joern Engel <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Instead of having two sets of kmem_caches: one for system-wide and
non-accounted allocations and the second one shared by all accounted
allocations, we can use just one.
The idea is simple: space for obj_cgroup metadata can be allocated on
demand and filled only for accounted allocations.
It allows to remove a bunch of code which is required to handle kmem_cache
clones for accounted allocations. There is no more need to create them,
accumulate statistics, propagate attributes, etc. It's a quite
significant simplification.
Also, because the total number of slab_caches is reduced almost twice (not
all kmem_caches have a memcg clone), some additional memory savings are
expected. On my devvm it additionally saves about 3.5% of slab memory.
[[email protected]: fix build on MIPS]
Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Naresh Kamboju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Allocate and release memory to store obj_cgroup pointers for each non-root
slab page. Reuse page->mem_cgroup pointer to store a pointer to the
allocated space.
This commit temporarily increases the memory footprint of the kernel memory
accounting. To store obj_cgroup pointers we'll need a place for an
objcg_pointer for each allocated object. However, the following patches
in the series will enable sharing of slab pages between memory cgroups,
which will dramatically increase the total slab utilization. And the final
memory footprint will be significantly smaller than before.
To distinguish between obj_cgroups and memcg pointers in case when it's
not obvious which one is used (as in page_cgroup_ino()), let's always set
the lowest bit in the obj_cgroup case. The original obj_cgroups
pointer is marked to be ignored by kmemleak, which otherwise would
report a memory leak for each allocated vector.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This commit implements SLUB version of the obj_to_index() function, which
will be required to calculate the offset of obj_cgroup in the obj_cgroups
vector to store/obtain the objcg ownership data.
To make it faster, let's repeat the SLAB's trick introduced by commit
6a2d7a955d8d ("SLAB: use a multiply instead of a divide in
obj_to_index()") and avoid an expensive division.
Vlastimil Babka noticed, that SLUB does have already a similar function
called slab_index(), which is defined only if SLUB_DEBUG is enabled. The
function does a similar math, but with a division, and it also takes a
page address instead of a page pointer.
Let's remove slab_index() and replace it with the new helper
__obj_to_index(), which takes a page address. obj_to_index() will be a
simple wrapper taking a page pointer and passing page_address(page) into
__obj_to_index().
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Capitialize comment string, use C89 comment style, correct
grammar/punctuation in comments.
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tobin C. Harding <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In kernel 4.17 I removed some code from dm-bufio that did slab cache
merging (commit 21bb13276768: "dm bufio: remove code that merges slab
caches") - both slab and slub support merging caches with identical
attributes, so dm-bufio now just calls kmem_cache_create and relies on
implicit merging.
This uncovered a bug in the slub subsystem - if we delete a cache and
immediatelly create another cache with the same attributes, it fails
because of duplicate filename in /sys/kernel/slab/. The slub subsystem
offloads freeing the cache to a workqueue - and if we create the new
cache before the workqueue runs, it complains because of duplicate
filename in sysfs.
This patch fixes the bug by moving the call of kobject_del from
sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called
while we hold slab_mutex - so that the sysfs entry is deleted before a
cache with the same attributes could be created.
Running device-mapper-test-suite with:
dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/
triggered:
Buffer I/O error on dev dm-0, logical block 1572848, async page read
device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
device-mapper: thin: 253:1: aborting current metadata transaction
sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
Workqueue: dm-thin do_worker [dm_thin_pool]
Call Trace:
dump_stack+0x5a/0x73
sysfs_warn_dup+0x58/0x70
sysfs_create_dir_ns+0x77/0x80
kobject_add_internal+0xba/0x2e0
kobject_init_and_add+0x70/0xb0
sysfs_slab_add+0xb1/0x250
__kmem_cache_create+0x116/0x150
create_cache+0xd9/0x1f0
kmem_cache_create_usercopy+0x1c1/0x250
kmem_cache_create+0x18/0x20
dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
__create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
metadata_operation_failed+0x59/0x100 [dm_thin_pool]
alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
process_cell+0x2a3/0x550 [dm_thin_pool]
do_worker+0x28d/0x8f0 [dm_thin_pool]
process_one_work+0x171/0x370
worker_thread+0x49/0x3f0
kthread+0xf8/0x130
ret_from_fork+0x35/0x40
kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
kmem_cache_create(dm_bufio_buffer-16) failed with error -17
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <[email protected]>
Reported-by: Mike Snitzer <[email protected]>
Tested-by: Mike Snitzer <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The reserved field was only used for embedding an rcu_head in the data
structure. With the previous commit, we no longer need it. That lets us
remove the 'reserved' argument to a lot of functions.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
struct kmem_cache_order_objects is for mixing order and number of
objects, and orders aren't big enough to warrant 64-bit width.
Propagate unsignedness down so that everything fits.
!!! Patch assumes that "PAGE_SIZE << order" doesn't overflow. !!!
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If kmem case sizes are 32-bit, then usecopy region should be too.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: David Miller <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Linux doesn't support negative length objects (including meta data).
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Linux doesn't support negative length objects.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
->offset is free pointer offset from the start of the object, can't be
negative.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
/*
* cpu_partial determined the maximum number of objects
* kept in the per cpu partial lists of a processor.
*/
Can't be negative.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
->inuse is "the number of bytes in actual use by the object",
can't be negative.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Kmem cache alignment can't be negative.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
->reserved is either 0 or sizeof(struct rcu_head), can't be negative.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Padding length can't be negative.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
->max_attr_size is maximum length of every SLAB memcg attribute
ever written. VFS limits those to INT_MAX.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
->remote_node_defrag_ratio is in range 0..1000.
This also adds a check and modifies the behavior to return an error
code. Before this patch invalid values were ignored.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <[email protected]>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Kees Cook <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
|
|
Add sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON,
etc).
SLAB is bloated temporarily by switching to "unsigned long", but only
temporarily.
Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Acked-by: Pekka Enberg <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <[email protected]>
Reviewed-by: Philippe Ombredanne <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
This SLUB free list pointer obfuscation code is modified from Brad
Spengler/PaX Team's code in the last public patch of grsecurity/PaX
based on my understanding of the code. Changes or omissions from the
original code are mine and don't reflect the original grsecurity/PaX
code.
This adds a per-cache random value to SLUB caches that is XORed with
their freelist pointer address and value. This adds nearly zero
overhead and frustrates the very common heap overflow exploitation
method of overwriting freelist pointers.
A recent example of the attack is written up here:
http://cyseclabs.com/blog/cve-2016-6187-heap-off-by-one-exploit
and there is a section dedicated to the technique the book "A Guide to
Kernel Exploitation: Attacking the Core".
This is based on patches by Daniel Micay, and refactored to minimize the
use of #ifdef.
With 200-count cycles of "hackbench -g 20 -l 1000" I saw the following
run times:
before:
mean 10.11882499999999999995
variance .03320378329145728642
stdev .18221905304181911048
after:
mean 10.12654000000000000014
variance .04700556623115577889
stdev .21680767106160192064
The difference gets lost in the noise, but if the above is to be taken
literally, using CONFIG_FREELIST_HARDENED is 0.07% slower.
Link: http://lkml.kernel.org/r/20170802180609.GA66807@beast
Signed-off-by: Kees Cook <[email protected]>
Suggested-by: Daniel Micay <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kmem_cache->cpu_partial is just used when CONFIG_SLUB_CPU_PARTIAL is
set, so wrap it with config CONFIG_SLUB_CPU_PARTIAL will save some space
on 32bit arch.
This patch wraps kmem_cache->cpu_partial in config CONFIG_SLUB_CPU_PARTIAL
and wraps its sysfs too.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
cpu_slab's field partial is used when CONFIG_SLUB_CPU_PARTIAL is set,
which means we can save a pointer's space on each cpu for every slub
item.
This patch wraps cpu_slab->partial in CONFIG_SLUB_CPU_PARTIAL and wraps
its sysfs use too.
[[email protected]: avoid strange 80-col tricks]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "try to save some memory for kmem_cache in some cases", v2.
kmem_cache is a frequently used data in kernel. During the code
reading, I found maybe we could save some space in some cases.
1. On 64bit arch, type int will occupy a word if it doesn't sit well.
2. cpu_slab->partial is just used when CONFIG_SLUB_CPU_PARTIAL is set
3. cpu_partial is just used when CONFIG_SLUB_CPU_PARTIAL is set, while
just save some space on 32bit arch.
This patch (of 3):
On 64bit arch, struct is 8-bytes aligned, so int will occupy a word if
it doesn't sit well.
This patch pack red_left_pad with reserved to save 8 bytes for struct
kmem_cache on a 64bit arch.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit bf5eb3de3847 ("slub: separate out sysfs_slab_release() from
sysfs_slab_remove()") made slub sysfs file removals synchronous to
kmem_cache shutdown.
Unfortunately, this created a possible ABBA deadlock between slab_mutex
and sysfs draining mechanism triggering the following lockdep warning.
======================================================
[ INFO: possible circular locking dependency detected ]
4.10.0-test+ #48 Not tainted
-------------------------------------------------------
rmmod/1211 is trying to acquire lock:
(s_active#120){++++.+}, at: [<ffffffff81308073>] kernfs_remove+0x23/0x40
but task is already holding lock:
(slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (slab_mutex){+.+.+.}:
lock_acquire+0xf6/0x1f0
__mutex_lock+0x75/0x950
mutex_lock_nested+0x1b/0x20
slab_attr_store+0x75/0xd0
sysfs_kf_write+0x45/0x60
kernfs_fop_write+0x13c/0x1c0
__vfs_write+0x28/0x120
vfs_write+0xc8/0x1e0
SyS_write+0x49/0xa0
entry_SYSCALL_64_fastpath+0x1f/0xc2
-> #0 (s_active#120){++++.+}:
__lock_acquire+0x10ed/0x1260
lock_acquire+0xf6/0x1f0
__kernfs_remove+0x254/0x320
kernfs_remove+0x23/0x40
sysfs_remove_dir+0x51/0x80
kobject_del+0x18/0x50
__kmem_cache_shutdown+0x3e6/0x460
kmem_cache_destroy+0x1fb/0x2d0
kvm_exit+0x2d/0x80 [kvm]
vmx_exit+0x19/0xa1b [kvm_intel]
SyS_delete_module+0x198/0x1f0
entry_SYSCALL_64_fastpath+0x1f/0xc2
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(slab_mutex);
lock(s_active#120);
lock(slab_mutex);
lock(s_active#120);
*** DEADLOCK ***
2 locks held by rmmod/1211:
#0: (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a7877>] get_online_cpus+0x37/0x80
#1: (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0
stack backtrace:
CPU: 3 PID: 1211 Comm: rmmod Not tainted 4.10.0-test+ #48
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
Call Trace:
print_circular_bug+0x1be/0x210
__lock_acquire+0x10ed/0x1260
lock_acquire+0xf6/0x1f0
__kernfs_remove+0x254/0x320
kernfs_remove+0x23/0x40
sysfs_remove_dir+0x51/0x80
kobject_del+0x18/0x50
__kmem_cache_shutdown+0x3e6/0x460
kmem_cache_destroy+0x1fb/0x2d0
kvm_exit+0x2d/0x80 [kvm]
vmx_exit+0x19/0xa1b [kvm_intel]
SyS_delete_module+0x198/0x1f0
? SyS_delete_module+0x5/0x1f0
entry_SYSCALL_64_fastpath+0x1f/0xc2
It'd be the cleanest to deal with the issue by removing sysfs files
without holding slab_mutex before the rest of shutdown; however, given
the current code structure, it is pretty difficult to do so.
This patch punts sysfs file removal to a work item. Before commit
bf5eb3de3847, the removal was punted to a RCU delayed work item which is
executed after release. Now, we're punting to a different work item on
shutdown which still maintains the goal removing the sysfs files earlier
when destroying kmem_caches.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: bf5eb3de3847 ("slub: separate out sysfs_slab_release() from sysfs_slab_remove()")
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Steven Rostedt (VMware) <[email protected]>
Tested-by: Steven Rostedt (VMware) <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Separate out slub sysfs removal and release, and call the former earlier
from __kmem_cache_shutdown(). There's no reason to defer sysfs removal
through RCU and this will later allow us to remove sysfs files way
earlier during memory cgroup offline instead of release.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For KASAN builds:
- switch SLUB allocator to using stackdepot instead of storing the
allocation/deallocation stacks in the objects;
- change the freelist hook so that parts of the freelist can be put
into the quarantine.
[[email protected]: fixes]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Steven Rostedt (Red Hat) <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kostya Serebryany <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Kuthonuzo Luruo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When looking up the nearest SLUB object for a given address, correctly
calculate its offset if SLAB_RED_ZONE is enabled for that cache.
Previously, when KASAN had detected an error on an object from a cache
with SLAB_RED_ZONE set, the actual start address of the object was
miscalculated, which led to random stacks having been reported.
When looking up the nearest SLUB object for a given address, correctly
calculate its offset if SLAB_RED_ZONE is enabled for that cache.
Fixes: 7ed2f9e663854db ("mm, kasan: SLAB support")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Steven Rostedt (Red Hat) <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Kostya Serebryany <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Kuthonuzo Luruo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Implements freelist randomization for the SLUB allocator. It was
previous implemented for the SLAB allocator. Both use the same
configuration option (CONFIG_SLAB_FREELIST_RANDOM).
The list is randomized during initialization of a new set of pages. The
order on different freelist sizes is pre-computed at boot for
performance. Each kmem_cache has its own randomized freelist.
This security feature reduces the predictability of the kernel SLUB
allocator against heap overflows rendering attacks much less stable.
For example these attacks exploit the predictability of the heap:
- Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
- Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)
Performance results:
slab_test impact is between 3% to 4% on average for 100000 attempts
without smp. It is a very focused testing, kernbench show the overall
impact on the system is way lower.
Before:
Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
2. Kmalloc: alloc/free test
100000 times kmalloc(8)/kfree -> 70 cycles
100000 times kmalloc(16)/kfree -> 70 cycles
100000 times kmalloc(32)/kfree -> 70 cycles
100000 times kmalloc(64)/kfree -> 70 cycles
100000 times kmalloc(128)/kfree -> 70 cycles
100000 times kmalloc(256)/kfree -> 69 cycles
100000 times kmalloc(512)/kfree -> 70 cycles
100000 times kmalloc(1024)/kfree -> 73 cycles
100000 times kmalloc(2048)/kfree -> 72 cycles
100000 times kmalloc(4096)/kfree -> 71 cycles
After:
Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
2. Kmalloc: alloc/free test
100000 times kmalloc(8)/kfree -> 66 cycles
100000 times kmalloc(16)/kfree -> 66 cycles
100000 times kmalloc(32)/kfree -> 66 cycles
100000 times kmalloc(64)/kfree -> 66 cycles
100000 times kmalloc(128)/kfree -> 65 cycles
100000 times kmalloc(256)/kfree -> 67 cycles
100000 times kmalloc(512)/kfree -> 67 cycles
100000 times kmalloc(1024)/kfree -> 64 cycles
100000 times kmalloc(2048)/kfree -> 67 cycles
100000 times kmalloc(4096)/kfree -> 67 cycles
Kernbench, before:
Average Optimal load -j 12 Run (std deviation):
Elapsed Time 101.873 (1.16069)
User Time 1045.22 (1.60447)
System Time 88.969 (0.559195)
Percent CPU 1112.9 (13.8279)
Context Switches 189140 (2282.15)
Sleeps 99008.6 (768.091)
After:
Average Optimal load -j 12 Run (std deviation):
Elapsed Time 102.47 (0.562732)
User Time 1045.3 (1.34263)
System Time 88.311 (0.342554)
Percent CPU 1105.8 (6.49444)
Context Switches 189081 (2355.78)
Sleeps 99231.5 (800.358)
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Garnier <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It's unused since commit 7ed2f9e66385 ("mm, kasan: SLAB support")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Ryabinin <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add KASAN hooks to SLAB allocator.
This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Konstantin Serebryany <[email protected]>
Cc: Dmitry Chernenkov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
SLUB already has a redzone debugging feature. But it is only positioned
at the end of object (aka right redzone) so it cannot catch left oob.
Although current object's right redzone acts as left redzone of next
object, first object in a slab cannot take advantage of this effect.
This patch explicitly adds a left red zone to each object to detect left
oob more precisely.
Background:
Someone complained to me that left OOB doesn't catch even if KASAN is
enabled which does page allocation debugging. That page is out of our
control so it would be allocated when left OOB happens and, in this
case, we can't find OOB. Moreover, SLUB debugging feature can be
enabled without page allocator debugging and, in this case, we will miss
that OOB.
Before trying to implement, I expected that changes would be too
complex, but, it doesn't look that complex to me now. Almost changes
are applied to debug specific functions so I feel okay.
Signed-off-by: Joonsoo Kim <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The cgroup2 memory controller will account important in-kernel memory
consumers per default. Move all necessary components to CONFIG_MEMCG.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove static and add function declarations to linux/slub_def.h so it
could be used by kernel address sanitizer.
Signed-off-by: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Konstantin Serebryany <[email protected]>
Cc: Dmitry Chernenkov <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
Cc: Yuri Gribov <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
virt_to_obj takes kmem_cache address, address of slab page, address x
pointing somewhere inside slab object, and returns address of the
beginning of object.
Signed-off-by: Andrey Ryabinin <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Konstantin Serebryany <[email protected]>
Cc: Dmitry Chernenkov <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
Cc: Yuri Gribov <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, kmem_cache stores a pointer to struct memcg_cache_params
instead of embedding it. The rationale is to save memory when kmem
accounting is disabled. However, the memcg_cache_params has shrivelled
drastically since it was first introduced:
* Initially:
struct memcg_cache_params {
bool is_root_cache;
union {
struct kmem_cache *memcg_caches[0];
struct {
struct mem_cgroup *memcg;
struct list_head list;
struct kmem_cache *root_cache;
bool dead;
atomic_t nr_pages;
struct work_struct destroy;
};
};
};
* Now:
struct memcg_cache_params {
bool is_root_cache;
union {
struct {
struct rcu_head rcu_head;
struct kmem_cache *memcg_caches[0];
};
struct {
struct mem_cgroup *memcg;
struct kmem_cache *root_cache;
};
};
};
So the memory saving does not seem to be a clear win anymore.
OTOH, keeping a pointer to memcg_cache_params struct instead of embedding
it results in touching one more cache line on kmem alloc/free hot paths.
Besides, it makes linking kmem caches in a list chained by a field of
struct memcg_cache_params really painful due to a level of indirection,
while I want to make them linked in the following patch. That said, let
us embed it.
Signed-off-by: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dan Carpenter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|